How do I write the output of an EMR streaming job to HDFS?

I am not sure how it can be done using mrjob, but with Hadoop and streaming jobs written in Java, we do it as follows:

  1. Launch the cluster
  2. Copy the data from S3 to the cluster's HDFS using s3distcp
  3. Execute step 1 of our job with HDFS as the input
  4. Execute step 2 of our job with the same input as above...

Using the EMR CLI, we do it as follows:

    export jobflow=$(elastic-mapreduce --create --alive --plain-output \
        --master-instance-type m1.small --slave-instance-type m1.xlarge --num-instances 21 \
        --name "Cluster Name" \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
        --args "--mapred-config-file,s3://myBucket/conf/custom-mapred-config-file.xml")

    elastic-mapreduce -j $jobflow \
        --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
        --arg --src --arg 's3://myBucket/input/' --arg --dest --arg 'hdfs:///input'

    elastic-mapreduce --jobflow $jobflow --jar s3://myBucket/bin/step1.jar \
        --arg hdfs:///input --arg hdfs:///output-step1 --step-name "Step 1"

    elastic-mapreduce --jobflow $jobflow --jar s3://myBucket/bin/step2.jar \
        --arg hdfs:///input,hdfs:///output-step1 --arg s3://myBucket/output/ --step-name "Step 2"
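For a streaming step (which is what the question asks about), the same pattern should apply using the old CLI's streaming step type rather than a custom jar. This is only a sketch: the mapper and reducer scripts and the output path are placeholders, and the exact flags may differ between CLI versions, so check them against your installation.

    # add a streaming step that reads from and writes to the cluster's HDFS
    elastic-mapreduce --jobflow $jobflow --stream \
        --mapper s3://myBucket/bin/mapper.py \
        --reducer s3://myBucket/bin/reducer.py \
        --input hdfs:///input \
        --output hdfs:///output-streaming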


It must be an S3 bucket, because an EMR cluster does not normally persist after the job is done. So the only way to keep the output is to write it somewhere outside the cluster, and the closest such place is S3.
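If the final output does land on HDFS, it can be copied out to S3 with another s3distcp step before the cluster shuts down. The following mirrors the s3distcp call above with the source and destination swapped (the output paths are placeholders):

    # copy the step-1 output from HDFS back to S3 before the cluster terminates
    elastic-mapreduce --jobflow $jobflow \
        --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
        --arg --src --arg 'hdfs:///output-step1' --arg --dest --arg 's3://myBucket/output-step1/'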


Writing the output of an mrjob EMR job to HDFS is currently not possible. There is an open feature request for this at https://github.com/Yelp/mrjob/issues/887 .
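Until that feature lands, the usual workaround is to have mrjob write the job's final output to S3 instead. A minimal sketch, assuming a job script named my_mr_job.py and placeholder bucket paths:

    # run the job on EMR and write results to S3 rather than HDFS
    python my_mr_job.py -r emr \
        --output-dir s3://myBucket/output/ \
        --no-output \
        s3://myBucket/input/

Here --no-output just suppresses streaming the results back to stdout; the data stays in the S3 output directory.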