Handling large output values from reduce step in Hadoop


Not that I understand why you would want to have a huge value, but there is a way you can do this.

If you write your own OutputFormat, you can amend the behaviour of the RecordWriter.write(Key, Value) method to concatenate values based upon whether the key is null or not.

This way, in your reducer, you can write your code as follows (the first output for a key uses the actual key, and everything after that uses a null key):

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        boolean firstKey = true;
        while (values.hasNext()) {
            // only the first pair for a key carries the key itself
            output.collect(firstKey ? key : null, values.next());
            firstKey = false;
        }
    }

The actual RecordWriter.write() method then contains the following logic to handle the null key and value concatenation:

    public synchronized void write(K key, V value) throws IOException {
        boolean nullKey = key == null || key instanceof NullWritable;
        boolean nullValue = value == null || value instanceof NullWritable;
        if (nullKey && nullValue) {
            return;
        }
        if (!nullKey) {
            // if we've written data before, append a new line
            if (dataWritten) {
                out.write(newline);
            }
            // write out the key and separator
            writeObject(key);
            out.write(keyValueSeparator);
        } else if (!nullValue) {
            // write out the value delimiter
            out.write(valueDelimiter);
        }
        // write out the value
        writeObject(value);
        // track that we've written some data
        dataWritten = true;
    }

    public synchronized void close(Reporter reporter) throws IOException {
        // if we've written out any data, append a closing newline
        if (dataWritten) {
            out.write(newline);
        }
        out.close();
    }

You'll notice the close method has also been amended to write a trailing newline after the last record written out.
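For reference, the write method above relies on state held by the surrounding RecordWriter class. Here's a rough sketch of how those fields might be declared: the field names match the snippet, but the class name, constructor and delimiter choices are assumptions, and the full listing on pastebin is the authoritative version.

    import java.io.DataOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.RecordWriter;
    import org.apache.hadoop.mapred.Reporter;

    // Assumed reconstruction of the class around write()/close(); only the
    // field names are taken from the snippet above, the rest is a guess.
    public class ConcatRecordWriter<K, V> implements RecordWriter<K, V> {
        private final DataOutputStream out;
        private final byte[] keyValueSeparator = "\t".getBytes(); // assumed
        private final byte[] valueDelimiter = ",".getBytes();     // assumed
        private final byte[] newline = "\n".getBytes();
        private boolean dataWritten = false;

        public ConcatRecordWriter(DataOutputStream out) {
            this.out = out;
        }

        private void writeObject(Object o) throws IOException {
            // mirror TextOutputFormat: Text is written raw, anything else via toString()
            if (o instanceof Text) {
                Text t = (Text) o;
                out.write(t.getBytes(), 0, t.getLength());
            } else {
                out.write(o.toString().getBytes());
            }
        }

        // write(K, V) and close(Reporter) exactly as shown above
    }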

The full code listing can be found on pastebin, and here's the test output:

    key1    value1
    key2    value1,value2,value3
    key3    value1,value2
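For completeness, here's roughly how the custom format would be wired into the job with the old mapred API. ConcatOutputFormat and ConcatReducer are hypothetical names for the classes sketched above:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ConcatJobDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ConcatJobDriver.class);
            conf.setJobName("concat-values");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            // hypothetical OutputFormat returning the RecordWriter shown above
            conf.setOutputFormat(ConcatOutputFormat.class);
            // hypothetical reducer containing the reduce() shown above
            conf.setReducerClass(ConcatReducer.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }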


If a single output key-value pair can be bigger than memory, it means the standard output mechanism is not suited, since by interface design it requires passing a key-value pair rather than a stream.
I think the simplest solution would be to stream the output straight to an HDFS file.
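A minimal sketch of that approach using the old mapred API; the output directory layout (one HDFS file per key) is an assumption:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch: bypass the OutputCollector and stream each key's (potentially
    // huge) value directly to an HDFS file, so it never sits in memory whole.
    public class HdfsStreamingReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        private JobConf conf;

        public void configure(JobConf conf) {
            this.conf = conf;
        }

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            // one file per key; the path is an assumed example
            Path outPath = new Path("/user/hadoop/big-values/" + key.toString());
            FSDataOutputStream stream = fs.create(outPath);
            try {
                while (values.hasNext()) {
                    Text value = values.next();
                    stream.write(value.getBytes(), 0, value.getLength());
                }
            } finally {
                stream.close();
            }
        }
    }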
If you have reasons to pass the data via the output format, I would suggest the following solution (see the sketch after this list):

a) Write the data to a local temporary directory.
b) Pass the name of that file as the value to the output format.
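Here's a rough sketch of (a) and (b) inside a reducer; the temp file naming is an assumption, and the downstream consumer would be responsible for cleaning the files up:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SpillToFileReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // (a) spill the concatenated value to a local temporary file
            File tmp = File.createTempFile("reduce-value-", ".bin");
            FileOutputStream fos = new FileOutputStream(tmp);
            try {
                while (values.hasNext()) {
                    Text value = values.next();
                    fos.write(value.getBytes(), 0, value.getLength());
                }
            } finally {
                fos.close();
            }
            // (b) pass only the file name through the output format
            output.collect(key, new Text(tmp.getAbsolutePath()));
        }
    }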

Probably the most effective, but somewhat more complicated, solution would be to use a memory-mapped file as a buffer. It stays in memory as long as there is enough memory, and when needed the OS takes care of spilling it to disk efficiently.
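In Java that could look something like the following sketch, using FileChannel.map; the fixed maximum buffer size is an assumption (a mapping cannot exceed 2 GB per buffer):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Sketch: a memory-mapped file as the concatenation buffer. The OS keeps
    // the pages in RAM while memory is available and writes them back to disk
    // under pressure.
    public class MappedBuffer {
        private static final long MAX_SIZE = 1L << 30; // assumed 1 GB ceiling

        public static MappedByteBuffer open(String path) throws IOException {
            RandomAccessFile file = new RandomAccessFile(path, "rw");
            try {
                return file.getChannel()
                           .map(FileChannel.MapMode.READ_WRITE, 0, MAX_SIZE);
            } finally {
                file.close(); // the mapping stays valid after the channel closes
            }
        }
    }

    // usage: MappedByteBuffer buf = MappedBuffer.open("/tmp/reduce-buffer");
    //        buf.put(bytes); ... later buf.flip() and buf.get(...) to read back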