How do I copy files from S3 to Amazon EMR HDFS? [hadoop]

How do I copy files from S3 to Amazon EMR HDFS?


The best way to do this is to use Hadoop's distcp command. Example (run on one of the cluster nodes):

% ${HADOOP_HOME}/bin/hadoop distcp s3n://mybucket/myfile /root/myfile

This copies a file called myfile from an S3 bucket named mybucket to /root/myfile in HDFS. Note that this example assumes you are using the S3 file system in "native" mode, which means Hadoop sees each object in S3 as a file. If you use S3 in block mode instead, replace s3n with s3 in the example above. For more information about the differences between native and block mode, as well as an elaboration on the example above, see http://wiki.apache.org/hadoop/AmazonS3.
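
For reference, a block-mode version of the same copy would just swap the scheme (a sketch only; mybucket and myfile are the example names from above, and note that the block file system is deprecated on EMR, as mentioned further down):

% ${HADOOP_HOME}/bin/hadoop distcp s3://mybucket/myfile /root/myfile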

I found distcp to be a very powerful tool. In addition to using it to copy large numbers of files in and out of S3, you can also perform fast cluster-to-cluster copies of large data sets. Instead of pushing all the data through a single node, distcp uses multiple nodes in parallel to perform the transfer. This makes distcp considerably faster when transferring large amounts of data, compared to the alternative of copying everything to the local file system as an intermediary. An example of a cluster-to-cluster copy is shown below.
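A cluster-to-cluster copy might look like the following (a sketch; the NameNode hostnames, port, and paths are placeholders you would replace with your own):

% ${HADOOP_HOME}/bin/hadoop distcp hdfs://source-namenode:8020/data/logs hdfs://dest-namenode:8020/data/logs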


Amazon now provides its own wrapper over distcp, namely s3distcp.

S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS), particularly Amazon Simple Storage Service (Amazon S3). You use S3DistCp by adding it as a step in a job flow. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.

Example: Copy log files from Amazon S3 to HDFS

The following example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.

elastic-mapreduce --jobflow j-3GY8JC4179IOJ \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,--dest,hdfs:///output,--srcPattern,.*daemons.*-hadoop-.*'
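
S3DistCp also works in the other direction. A sketch of copying the HDFS output back out to an S3 bucket (the job flow ID is the one from the example above; the destination bucket and prefix are placeholders):

elastic-mapreduce --jobflow j-3GY8JC4179IOJ \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '--src,hdfs:///output,--dest,s3://myawsbucket/results/'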


Note that according to Amazon's "Amazon Elastic MapReduce - File System Configuration" page at http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html, the S3 Block FileSystem is deprecated and its URI prefix is now s3bfs://. Amazon specifically discourages using it because "it can trigger a race condition that might cause your job flow to fail".

According to the same page, HDFS is now a 'first-class' file system alongside S3 on Amazon EMR, although it is ephemeral (it goes away when the job flow ends).