How to use "typedbytes" or "rawbytes" in Hadoop Streaming? How to use "typedbytes" or "rawbytes" in Hadoop Streaming? hadoop hadoop

How to use "typedbytes" or "rawbytes" in Hadoop Streaming?


Okay, I've found a combination that works, but it's weird.

  1. Prepare a valid typedbytes file in your local filesystem, following the documentation or imitating typedbytes.py (a minimal writer sketch follows this list).

  2. Use

    hadoop jar path/to/streaming.jar loadtb path/on/HDFS.sequencefile < local/typedbytes.tb

    to wrap the typedbytes in a SequenceFile and put it in HDFS, in one step.

  3. Use

    hadoop jar path/to/streaming.jar -inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat ...

    to run a map-reduce job in which the mapper gets input from the SequenceFile. Note that -io typedbytes and -D stream.map.input=typedbytes should not be used: explicitly asking for typedbytes leads to the misinterpretation I described in my question. But fear not: Hadoop Streaming splits the input on its binary record boundaries, not on its '\n' characters. The data arrive in the mapper as "rawdata" separated by '\t' and '\n' (a reader sketch follows this list), like this:

    1. 32-bit signed integer, representing length (note: no type character)
    2. block of raw binary with that length: this is the key
    3. '\t' (tab character... why?)
    4. 32-bit signed integer, representing length
    5. block of raw binary with that length: this is the value
    6. '\n' (newline character... ?)
  4. If you want to additionally send raw data from mapper to reducer, add

    -D stream.map.output=typedbytes -D stream.reduce.input=typedbytes

    to your Hadoop command line and format the mapper's output and the reducer's expected input as valid typedbytes (an output sketch follows this list). The key-value pairs again alternate, but this time with type characters and without '\t' and '\n'. Hadoop Streaming correctly splits these pairs on their binary record boundaries and groups by key.
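For step 1, here is a minimal sketch of writing such a file by hand in Python, following the documented typedbytes encoding (a one-byte type code followed by a big-endian payload). Only three of the type codes are shown (0 = bytes, 3 = int, 7 = string), and the file name is just an example:

    # write_typedbytes.py: produce local/typedbytes.tb for step 2's loadtb.
    # Records simply alternate: key, value, key, value, ...
    import struct

    def write_string(f, s):
        data = s.encode("utf-8")
        f.write(struct.pack(">bi", 7, len(data)) + data)  # type 7 = string

    def write_int(f, n):
        f.write(struct.pack(">bi", 3, n))                 # type 3 = int

    def write_bytes(f, b):
        f.write(struct.pack(">bi", 0, len(b)) + b)        # type 0 = bytes

    with open("local/typedbytes.tb", "wb") as f:
        write_string(f, "key1")
        write_bytes(f, b"\x00\x01 binary, '\t' and '\n' are fine here")
        write_string(f, "key2")
        write_int(f, 42)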
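For step 3, a sketch of what the mapper's input loop might look like, assuming Python 3 (sys.stdin.buffer is needed to read bytes rather than text); here the pairs are only counted:

    # mapper.py: consume the "rawdata" framing described in step 3:
    # <4-byte length><key bytes>'\t'<4-byte length><value bytes>'\n'
    import struct
    import sys

    def read_block(stream):
        header = stream.read(4)
        if len(header) < 4:
            return None                          # end of input
        (length,) = struct.unpack(">i", header)  # 32-bit signed length
        return stream.read(length)

    stream = sys.stdin.buffer
    count = 0
    while True:
        key = read_block(stream)
        if key is None:
            break
        stream.read(1)                           # skip the '\t'
        value = read_block(stream)
        stream.read(1)                           # skip the '\n'
        count += 1
    sys.stderr.write("saw %d records\n" % count)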
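For step 4's output side, the mapper would then write valid typedbytes pairs (type character included, no '\t' or '\n') to stdout; the string-key/bytes-value combination here is just one possibility:

    # At the end of mapper.py, with -D stream.map.output=typedbytes set:
    import struct
    import sys

    out = sys.stdout.buffer

    def emit(key_str, value_bytes):
        kdata = key_str.encode("utf-8")
        out.write(struct.pack(">bi", 7, len(kdata)) + kdata)              # string key
        out.write(struct.pack(">bi", 0, len(value_bytes)) + value_bytes)  # bytes value

    emit("some-key", b"\x00\xff raw value, newlines \n are fine")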

The only documentation on stream.map.output and stream.reduce.input that I could find was in the HADOOP-1722 exchange, starting 6 Feb 09. (Earlier discussion considered a different way to parameterize the formats.)

This recipe does not provide strong typing for the input: the type characters are lost somewhere in the process of creating a SequenceFile and interpreting it with the -inputformat. It does, however, provide splitting at binary record boundaries rather than at '\n', which is the really important thing, as well as strong typing between the mapper and the reducer.


We solved the binary-data issue by hex-encoding the data at the split level when streaming it down to the mapper. This preserves the parallel efficiency of the operation, instead of requiring you to transform your data up front before processing on a node.
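For illustration, a minimal sketch of that idea, assuming each record arrives as one hex-encoded key and value per line; hex digits contain no '\t' or '\n', so ordinary text-mode streaming and splitting stay safe, and each mapper decodes only the records in its own split:

    # hex_mapper.py: decode hex-encoded records inside the mapper itself.
    import binascii
    import sys

    for line in sys.stdin:
        key_hex, value_hex = line.rstrip("\n").split("\t")
        key = binascii.unhexlify(key_hex)      # raw binary key
        value = binascii.unhexlify(value_hex)  # raw binary value
        # ... process the raw bytes here ...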


Apparently there is a patch for a JustBytes IO mode for streaming that feeds a whole input file to the mapper command:

https://issues.apache.org/jira/browse/MAPREDUCE-5018