How to specify mapred configurations & java options with custom jar in CLI using Amazon's EMR?



In the context of Amazon Elastic MapReduce (Amazon EMR), you are looking for Bootstrap Actions:

Bootstrap actions allow you to pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data. [emphasis mine]

The section "Running Custom Bootstrap Actions from the CLI" provides a generic usage example:

$ ./elastic-mapreduce --create --stream --alive \
    --input s3n://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --output s3n://myawsbucket \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/download.sh

In particular, there are separate bootstrap actions to configure Hadoop and Java:

Hadoop (cluster)

You can specify Hadoop settings via bootstrap action Configure Hadoop, which allows you to set cluster-wide Hadoop settings, for example:

$ ./elastic-mapreduce --create \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.task.timeout=0"
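The command above points --site-config-file at a config.xml without showing its contents. As a rough sketch, assuming the file follows the usual Hadoop site-file layout (a `configuration` element holding `property` entries with `name` and `value`), the plain-Java helper below renders such a file from a map. The class `SiteConfigSketch` and its method are hypothetical names made up for this illustration; they are not part of Hadoop or the EMR CLI.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Hadoop or EMR.
public class SiteConfigSketch {

    /** Renders the given properties in the usual Hadoop site-file XML layout. */
    public static String toSiteXml(Map<String, String> props) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\"?>\n<configuration>\n");
        for (Map.Entry<String, String> e : props.entrySet()) {
            xml.append("  <property>\n")
               .append("    <name>").append(escape(e.getKey())).append("</name>\n")
               .append("    <value>").append(escape(e.getValue())).append("</value>\n")
               .append("  </property>\n");
        }
        xml.append("</configuration>\n");
        return xml.toString();
    }

    // Escapes the characters that are special inside XML text content.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapred.task.timeout", "0"); // same setting as the -s example
        System.out.print(toSiteXml(props));    // save the output as config.xml
    }
}
```

The printed text could then be saved as config.xml and uploaded to the S3 location named in the command.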

Java (JVM)

You can specify custom JVM settings via bootstrap action Configure Daemons:

This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.

The provided example sets the NameNode heap size to 2048 and passes a garbage-collection option (-XX:GCTimeRatio=19) to the NameNode JVM:

$ ./elastic-mapreduce --create --alive \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
    --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19


I believe that if you want to set these on a per-job basis, you need to do one of the following:

A) For custom JARs, pass them to your JAR as arguments and process them yourself. I believe this can be automated as follows:

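import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Apply generic options (e.g. -D) to conf; keep the rest
    args = new GenericOptionsParser(conf, args).getRemainingArgs();
    //....
}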

Then create the job like this (I haven't verified that it works, though):

$ elastic-mapreduce --jar s3://mybucket/mycode.jar \
    --args "-D,mapred.reduce.tasks=0" \
    --arg s3://mybucket/input \
    --arg s3://mybucket/output

The GenericOptionsParser should automatically transfer -D parameters (and a -conf file, if one is given) into Hadoop's job configuration; -jobconf is an option of the streaming jar rather than of GenericOptionsParser. More details: http://hadoop.apache.org/docs/r0.20.0/api/org/apache/hadoop/util/GenericOptionsParser.html
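To make the effect of -D concrete, here is a minimal sketch in plain Java of the kind of splitting described above: `-D key=value` pairs are collected into a property map and everything else is returned as the remaining arguments. It is a simplified stand-in written for illustration, not Hadoop's actual implementation (which handles more options), and the class name `DOptionSketch` is hypothetical. It assumes both the two-token form (`-D key=value`) and the one-token form (`-Dkey=value`) should be accepted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for illustration only; not Hadoop code.
public class DOptionSketch {
    /** Properties collected from -D options, in the order given. */
    public final Map<String, String> properties = new LinkedHashMap<>();
    /** Arguments that were not consumed as -D options. */
    public final List<String> remaining = new ArrayList<>();

    public DOptionSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];          // two tokens: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);   // one token: -Dkey=value
            }
            if (pair == null) {
                remaining.add(arg);        // not a -D option, pass it through
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {                  // pairs without "key=" are ignored here
                properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
    }

    public static void main(String[] args) {
        DOptionSketch parsed = new DOptionSketch(new String[] {
            "-D", "mapred.reduce.tasks=0", "s3://mybucket/input", "s3://mybucket/output"
        });
        System.out.println(parsed.properties);
        System.out.println(parsed.remaining);
    }
}
```

With the arguments from the command above, the property map ends up holding mapred.reduce.tasks=0 and the remaining list holds the input and output paths, which is what the job's own argument handling would then see.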

B) For the Hadoop streaming JAR, just pass the configuration change on the command line:

$ elastic-mapreduce --jobflow j-ABABABABA \
    --stream --jobconf mapred.task.timeout=600000 \
    --mapper s3://mybucket/mymapper.sh \
    --reducer s3://mybucket/myreducer.sh \
    --input s3://mybucket/input \
    --output s3://mybucket/output \
    --jobconf mapred.reduce.tasks=0

More details: https://forums.aws.amazon.com/thread.jspa?threadID=43872 and elastic-mapreduce --help