Hadoop Streaming: Mapper 'wrapping' a binary executable


After much googling (etc.) I figured out how to make executable binaries/scripts/modules accessible to your mappers/reducers. The trick is to upload all your files to Hadoop (HDFS) first.

$ bin/hadoop dfs -copyFromLocal /local/file/system/module.py module.py

Then you need to format your streaming command like the following template:

$ ./bin/hadoop jar /local/file/system/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
    -file /local/file/system/data/data.txt \
    -file /local/file/system/mapper.py \
    -file /local/file/system/reducer.py \
    -cacheFile hdfs://localhost:9000/user/you/module.py#module.py \
    -input data.txt \
    -output output/ \
    -mapper mapper.py \
    -reducer reducer.py \
    -verbose
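If you launch jobs from a script rather than typing the command by hand, the same template can be assembled as an argument list, which avoids shell quoting problems. This is only a rough sketch: `build_streaming_args` is a made-up helper name, and the option names are just the ones from the command above.

```python
def build_streaming_args(streaming_jar, local_files, cache_files,
                         input_path, output_path, mapper, reducer,
                         verbose=False):
    """Return the streaming command as a list of arguments.

    Hypothetical helper: it only mirrors the options used in the
    template above (-file, -cacheFile, -input, -output, -mapper,
    -reducer, -verbose) and does not validate any of the paths.
    """
    args = ["bin/hadoop", "jar", streaming_jar]
    for path in local_files:
        # Local files shipped with the job
        args += ["-file", path]
    for uri in cache_files:
        # Files already on HDFS, in the form hdfs://...#linkname
        args += ["-cacheFile", uri]
    args += ["-input", input_path,
             "-output", output_path,
             "-mapper", mapper,
             "-reducer", reducer]
    if verbose:
        args.append("-verbose")
    return args
```

The resulting list can be handed to `subprocess.call(args)` from the Hadoop install directory.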

If you're linking a Python module, you'll need to add the following code to your mapper/reducer scripts:

import sys
sys.path.append('.')
import module
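The reason this works, as far as I can tell, is that the name after the `#` in `-cacheFile` shows up as a link in the task's working directory, so putting `.` on `sys.path` makes it importable. A slightly more defensive version of the same idea might look like this (the function name `import_from_task_dir` is my own, not anything Hadoop provides):

```python
import importlib
import sys


def import_from_task_dir(name, task_dir="."):
    """Import module `name` from task_dir (default: the working directory).

    Assumes the file was made available there, e.g. through the
    #module.py link name given to -cacheFile.
    """
    if task_dir not in sys.path:
        sys.path.append(task_dir)
    # Forget stale directory listings in case the file appeared recently
    importlib.invalidate_caches()
    return importlib.import_module(name)
```

In a mapper you would then write `module = import_from_task_dir("module")` instead of the plain `import module`.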

If you're accessing a binary via subprocess, your command should look something like this:

import shlex
from subprocess import Popen, PIPE

cli = "./binary %s" % (argument)
cli_parts = shlex.split(cli)
mp = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
output = mp.communicate()[0]  # stdout of the binary
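To show how that fits into a whole mapper, here is a minimal sketch that feeds each input line to the binary on stdin and emits a tab-separated `line<TAB>result` pair. It assumes Python 3 and a binary that reads one record from stdin and writes its answer to stdout; `run_binary` and `map_lines` are names I made up for the example, and your binary may well want its input as an argument instead.

```python
import sys
from subprocess import Popen, PIPE


def run_binary(cmd, payload):
    """Run cmd (a list of arguments), send payload (bytes) to its stdin
    and return whatever it wrote to stdout (bytes)."""
    proc = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate(payload)
    if proc.returncode != 0:
        # Pass the binary's stderr along so it appears in the task logs
        sys.stderr.write(err.decode("utf-8", "replace"))
        raise RuntimeError("%r exited with status %d" % (cmd, proc.returncode))
    return out


def map_lines(lines, cmd):
    """Yield one 'record<TAB>result' string per non-empty input line."""
    for line in lines:
        record = line.rstrip("\n")
        if not record:
            continue
        result = run_binary(cmd, record.encode("utf-8"))
        yield "%s\t%s" % (record, result.decode("utf-8").strip())
```

At the bottom of `mapper.py` you would then loop with `for pair in map_lines(sys.stdin, ["./binary"]): print(pair)`. Starting one process per line is slow for large inputs, so treat this as a starting point only.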

Hope this helps.


Got it running finally

use IPC::Open2;
$pid = open2 (my $out, my $in, "./binary") or die "could not run open2";