How to use "typedbytes" or "rawbytes" in Hadoop Streaming? How to use "typedbytes" or "rawbytes" in Hadoop Streaming? hadoop hadoop

How to use "typedbytes" or "rawbytes" in Hadoop Streaming?


Okay, I've found a combination that works, but it's weird.

  1. Prepare a valid typedbytes file in your local filesystem, following the documentation or imitating typedbytes.py (a minimal writer sketch follows this list).

  2. Use

    hadoop jar path/to/streaming.jar loadtb path/on/HDFS.sequencefile < local/typedbytes.tb

    to wrap the typedbytes in a SequenceFile and put it in HDFS, in one step.

  3. Use

    hadoop jar path/to/streaming.jar -inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat ...

    to run a map-reduce job in which the mapper gets input from the SequenceFile. Note that -io typedbytes and -D stream.map.input=typedbytes should not be used: explicitly asking for typedbytes leads to the misinterpretation I described in my question. But fear not: Hadoop Streaming splits the input on its binary record boundaries, not on its '\n' characters. The data arrive in the mapper as "rawdata" separated by '\t' and '\n' (a reader sketch follows this list), like this:

    1. 32-bit signed integer, representing length (note: no type character)
    2. block of raw binary with that length: this is the key
    3. '\t' (tab character... why?)
    4. 32-bit signed integer, representing length
    5. block of raw binary with that length: this is the value
    6. '\n' (newline character... ?)
  4. If you want to additionally send raw data from mapper to reducer, add

    -D stream.map.output=typedbytes -D stream.reduce.input=typedbytes

    to your Hadoop command line and format the mapper's output and the reducer's expected input as valid typedbytes (an output sketch follows this list). The key-value pairs again alternate, but this time with type characters and without '\t' and '\n'. Hadoop Streaming correctly splits these pairs on their binary record boundaries and groups by key.
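For step 1, here is a minimal sketch of writing such a file by hand in Python, following the documented typedbytes encoding (a one-byte type code followed by a big-endian payload). Only three of the type codes are shown (0 = bytes, 3 = int, 7 = string), and the file name is just an example:

    # write_typedbytes.py: produce local/typedbytes.tb for step 2's loadtb.
    # Records simply alternate: key, value, key, value, ...
    import struct

    def write_string(f, s):
        data = s.encode("utf-8")
        f.write(struct.pack(">bi", 7, len(data)) + data)  # type 7 = string

    def write_int(f, n):
        f.write(struct.pack(">bi", 3, n))                 # type 3 = int

    def write_bytes(f, b):
        f.write(struct.pack(">bi", 0, len(b)) + b)        # type 0 = bytes

    with open("local/typedbytes.tb", "wb") as f:
        write_string(f, "key1")
        write_bytes(f, b"\x00\x01 binary, '\t' and '\n' are fine here")
        write_string(f, "key2")
        write_int(f, 42)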
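For step 3, a sketch of what the mapper's input loop might look like, assuming Python 3 (sys.stdin.buffer is needed to read bytes rather than text); here the pairs are only counted:

    # mapper.py: consume the "rawdata" framing described in step 3:
    # <4-byte length><key bytes>'\t'<4-byte length><value bytes>'\n'
    import struct
    import sys

    def read_block(stream):
        header = stream.read(4)
        if len(header) < 4:
            return None                          # end of input
        (length,) = struct.unpack(">i", header)  # 32-bit signed length
        return stream.read(length)

    stream = sys.stdin.buffer
    count = 0
    while True:
        key = read_block(stream)
        if key is None:
            break
        stream.read(1)                           # skip the '\t'
        value = read_block(stream)
        stream.read(1)                           # skip the '\n'
        count += 1
    sys.stderr.write("saw %d records\n" % count)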
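For step 4's output side, the mapper would then write valid typedbytes pairs (type character included, no '\t' or '\n') to stdout; the string-key/bytes-value combination here is just one possibility:

    # At the end of mapper.py, with -D stream.map.output=typedbytes set:
    import struct
    import sys

    out = sys.stdout.buffer

    def emit(key_str, value_bytes):
        kdata = key_str.encode("utf-8")
        out.write(struct.pack(">bi", 7, len(kdata)) + kdata)              # string key
        out.write(struct.pack(">bi", 0, len(value_bytes)) + value_bytes)  # bytes value

    emit("some-key", b"\x00\xff raw value, newlines \n are fine")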

The only documentation on stream.map.output and stream.reduce.input that I could find was in the HADOOP-1722 exchange, starting 6 Feb 09. (Earlier discussion considered a different way to parameterize the formats.)

This recipe does not provide strong typing for the input: the type characters are lost somewhere in the process of creating a SequenceFile and interpreting it with the -inputformat. It does, however, provide splitting at binary record boundaries rather than at '\n', which is the really important thing, as well as strong typing between the mapper and the reducer.


We solved the binary-data issue by hex-encoding the data at the split level when streaming it down to the mapper. This preserves the parallel efficiency of the operation, instead of requiring you to transform your data up front before processing on a node.
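For illustration, a minimal sketch of that idea, assuming each record arrives as one hex-encoded key and value per line; hex digits contain no '\t' or '\n', so ordinary text-mode streaming and splitting stay safe, and each mapper decodes only the records in its own split:

    # hex_mapper.py: decode hex-encoded records inside the mapper itself.
    import binascii
    import sys

    for line in sys.stdin:
        key_hex, value_hex = line.rstrip("\n").split("\t")
        key = binascii.unhexlify(key_hex)      # raw binary key
        value = binascii.unhexlify(value_hex)  # raw binary value
        # ... process the raw bytes here ...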


Apparently there is a patch for a JustBytes IO mode for streaming that feeds a whole input file to the mapper command:

https://issues.apache.org/jira/browse/MAPREDUCE-5018