How do I pass a parameter to a python Hadoop streaming job?
The argument to the command-line option -reducer can be any command, so you can try:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input inputDirs \
    -output outputDir \
    -mapper myMapper.py \
    -reducer 'myReducer.py 1 2 3' \
    -file myMapper.py \
    -file myReducer.py
assuming myReducer.py is made executable. Disclaimer: I have not tried it, but I have passed similarly complex strings to -mapper and -reducer before.
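For concreteness, here is a minimal sketch of what such a reducer could look like. The point is the argument handling: the "1 2 3" arrive as sys.argv[1:]. The summing logic and the use of the first argument as a threshold are my own illustrative assumptions, not something stated in the answer:

#!/usr/bin/env python
# myReducer.py - minimal sketch; '1 2 3' from -reducer 'myReducer.py 1 2 3'
# arrive as sys.argv[1:]
import sys

args = sys.argv[1:]        # ['1', '2', '3']
threshold = int(args[0])   # hypothetical use of the first parameter

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None and total >= threshold:
            print("%s\t%d" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None and total >= threshold:
    print("%s\t%d" % (current_key, total))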
That said, have you tried the -cmdenv name=value option, and just have your Python reducer get its value from the environment? It's just another way to do things.
In your Python code,

import os
(...)
os.environ["PARAM_OPT"]
In your Hadoop command, include:
hadoop jar \
(...)
-cmdenv PARAM_OPT=value \
(...)
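Filled out, a reducer using this approach might look like the sketch below. PARAM_OPT comes straight from os.environ; the fallback default and the pass-through logic are illustrative assumptions on my part:

#!/usr/bin/env python
# Reducer sketch reading its parameter from the environment; PARAM_OPT is
# set by -cmdenv PARAM_OPT=value on the hadoop command line.
import os
import sys

param_opt = os.environ.get("PARAM_OPT", "default")  # assumed fallback

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    # use param_opt however the job needs; here it is simply echoed
    print("%s\t%s\t%s" % (key, value, param_opt))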
You can pass arguments to -reducer as in the command below:
hadoop jar hadoop-streaming.jar \
    -mapper 'count_mapper.py arg1 arg2' -file count_mapper.py \
    -reducer 'count_reducer.py arg3' -file count_reducer.py
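As a sketch of how count_mapper.py might consume arg1 and arg2: the command above does not say what the arguments mean, so here they are assumed, purely for illustration, to be a field index and a delimiter:

#!/usr/bin/env python
# count_mapper.py - hypothetical sketch; arg1 and arg2 are assumed to be
# a field index and a delimiter (the original answer does not specify).
import sys

field = int(sys.argv[1])   # arg1
delim = sys.argv[2]        # arg2

for line in sys.stdin:
    parts = line.rstrip("\n").split(delim)
    if len(parts) > field:
        print("%s\t1" % parts[field])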