How do I write the output of an EMR streaming job to HDFS?

I am not sure how it can be done using mrjob, but with Hadoop and streaming jobs written in Java, we do it as follows:

  1. Launch the cluster
  2. Copy the data from S3 to the cluster's HDFS using s3distcp
  3. Execute step 1 of our job with HDFS as the input
  4. Execute step 2 of our job with the same input as above...

Using the EMR CLI, we do it as follows:

    export jobflow=$(elastic-mapreduce --create --alive --plain-output \
        --master-instance-type m1.small --slave-instance-type m1.xlarge --num-instances 21 \
        --name "Cluster Name" \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
        --args "--mapred-config-file,s3://myBucket/conf/custom-mapred-config-file.xml")

    elastic-mapreduce -j $jobflow \
        --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
        --arg --src --arg 's3://myBucket/input/' --arg --dest --arg 'hdfs:///input'

    elastic-mapreduce --jobflow $jobflow --jar s3://myBucket/bin/step1.jar \
        --arg hdfs:///input --arg hdfs:///output-step1 --step-name "Step 1"

    elastic-mapreduce --jobflow $jobflow --jar s3://myBucket/bin/step2.jar \
        --arg hdfs:///input,hdfs:///output-step1 --arg s3://myBucket/output/ --step-name "Step 2"
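For a streaming step (which is what the question asks about), the same pattern should apply using the old CLI's streaming step type rather than a custom jar. This is only a sketch: the mapper and reducer scripts and the output path are placeholders, and the exact flags may differ between CLI versions, so check them against your installation.

    # add a streaming step that reads from and writes to the cluster's HDFS
    elastic-mapreduce --jobflow $jobflow --stream \
        --mapper s3://myBucket/bin/mapper.py \
        --reducer s3://myBucket/bin/reducer.py \
        --input hdfs:///input \
        --output hdfs:///output-streaming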


It must be an S3 bucket, because an EMR cluster does not normally persist after the job is done. So the only way to keep the output is to write it somewhere outside the cluster, and the closest such place is S3.
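If the final output does land on HDFS, it can be copied out to S3 with another s3distcp step before the cluster shuts down. The following mirrors the s3distcp call above with the source and destination swapped (the output paths are placeholders):

    # copy the step-1 output from HDFS back to S3 before the cluster terminates
    elastic-mapreduce --jobflow $jobflow \
        --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
        --arg --src --arg 'hdfs:///output-step1' --arg --dest --arg 's3://myBucket/output-step1/'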


Writing the output of an mrjob EMR job to HDFS is currently not possible. There is an open feature request for this at https://github.com/Yelp/mrjob/issues/887 .
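Until that feature lands, the usual workaround is to have mrjob write the job's final output to S3 instead. A minimal sketch, assuming a job script named my_mr_job.py and placeholder bucket paths:

    # run the job on EMR and write results to S3 rather than HDFS
    python my_mr_job.py -r emr \
        --output-dir s3://myBucket/output/ \
        --no-output \
        s3://myBucket/input/

Here --no-output just suppresses streaming the results back to stdout; the data stays in the S3 output directory.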