How do I write the output of an EMR streaming job to HDFS?
I am not sure how it can be done using mrjob, but with Hadoop and streaming jobs written in Java, we do it as follows:
- Launch the cluster
- Copy the data from S3 to the cluster's HDFS using S3DistCp
- Run step 1 of our job with its input on HDFS
- Run step 2 of our job with the same HDFS input plus the output of step 1, writing the final result back to S3
Using the EMR CLI, we do it as follows:
```
export jobflow=$(elastic-mapreduce --create --alive --plain-output \
  --master-instance-type m1.small --slave-instance-type m1.xlarge \
  --num-instances 21 --name "Cluster Name" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--mapred-config-file,s3://myBucket/conf/custom-mapred-config-file.xml")

elastic-mapreduce -j $jobflow \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --arg --src --arg 's3://myBucket/input/' \
  --arg --dest --arg 'hdfs:///input'

elastic-mapreduce --jobflow $jobflow --jar s3://myBucket/bin/step1.jar \
  --arg hdfs:///input --arg hdfs:///output-step1 --step-name "Step 1"

elastic-mapreduce --jobflow $jobflow --jar s3://myBucket/bin/step2.jar \
  --arg hdfs:///input,hdfs:///output-step1 --arg s3://myBucket/output/ \
  --step-name "Step 2"
```
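Since the question is about a streaming job specifically: the same legacy CLI can submit a streaming step whose input and output are HDFS paths. Treat the following as a sketch rather than a drop-in command; `mapper.py`, `reducer.py`, and `hdfs:///output-stream` are hypothetical placeholders, not part of the pipeline above.

```
# Hypothetical streaming step on the same jobflow; mapper.py and
# reducer.py are placeholder scripts, and hdfs:///output-stream is
# an arbitrary output path on the cluster's HDFS.
elastic-mapreduce --jobflow $jobflow --stream \
  --mapper s3://myBucket/bin/mapper.py \
  --reducer s3://myBucket/bin/reducer.py \
  --input hdfs:///input \
  --output hdfs:///output-stream \
  --step-name "Streaming Step"
```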
Writing the output of an mrjob EMR job to HDFS is currently not possible. There is an open feature request for this at https://github.com/Yelp/mrjob/issues/887 .
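Until that is implemented, one possible workaround is to keep the HDFS-bound steps outside of mrjob entirely: submit them as plain streaming or JAR steps as in the first answer, then copy the final HDFS output back to S3 with S3DistCp once the pipeline finishes. A sketch, reusing the hypothetical `hdfs:///output-stream` path from the earlier example:

```
# Copy the final output from the cluster's HDFS back to S3 with s3distcp;
# hdfs:///output-stream is the hypothetical output path used above.
elastic-mapreduce --jobflow $jobflow \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --arg --src --arg 'hdfs:///output-stream' \
  --arg --dest --arg 's3://myBucket/output/'
```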