Handling large output values from reduce step in Hadoop
Not that I understand why you would want a huge value, but there is a way you can do this.
If you write your own OutputFormat, you can adapt the behaviour of the RecordWriter.write(Key, Value)
method to handle value concatenation based upon whether the key is null or not.
This way, in your reducer, you can write your code as follows (the first output for a key uses the actual key, and everything after that uses a null key):
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        boolean firstKey = true;
        while (values.hasNext()) {
            // only the first value for a key carries the key itself
            output.collect(firstKey ? key : null, values.next());
            firstKey = false;
        }
    }
The actual RecordWriter.write() method then contains the following logic to handle the null key / value concatenation:
    public synchronized void write(K key, V value) throws IOException {
        boolean nullKey = key == null || key instanceof NullWritable;
        boolean nullValue = value == null || value instanceof NullWritable;
        if (nullKey && nullValue) {
            return;
        }
        if (!nullKey) {
            // if we've written data before, append a new line
            if (dataWritten) {
                out.write(newline);
            }
            // write out the key and separator
            writeObject(key);
            out.write(keyValueSeparator);
        } else if (!nullValue) {
            // write out the value delimiter
            out.write(valueDelimiter);
        }
        // write out the value
        writeObject(value);
        // track that we've written some data
        dataWritten = true;
    }

    public synchronized void close(Reporter reporter) throws IOException {
        // if we've written out any data, append a closing newline
        if (dataWritten) {
            out.write(newline);
        }
        out.close();
    }
You'll notice the close method has also been amended to write a trailing newline after the last record.
The full code listing can be found on pastebin, and here's the test output:
    key1	value1
    key2	value1,value2,value3
    key3	value1,value2
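Outside of Hadoop, the concatenation behaviour above can be sketched in plain Java (the StringBuilder stands in for the output stream, and all class and method names here are illustrative, not part of the Hadoop API):

```java
import java.util.Arrays;
import java.util.List;

public class ConcatDemo {
    static final String SEPARATOR = "\t"; // key/value separator
    static final String DELIMITER = ",";  // value delimiter
    static StringBuilder out = new StringBuilder();
    static boolean dataWritten = false;

    // mirrors the RecordWriter.write() logic: a non-null key starts a
    // new record, a null key appends the value to the current record
    static void write(String key, String value) {
        if (key == null && value == null) return;
        if (key != null) {
            if (dataWritten) out.append("\n");
            out.append(key).append(SEPARATOR);
        } else if (value != null) {
            out.append(DELIMITER);
        }
        out.append(value);
        dataWritten = true;
    }

    // mirrors the reducer: the first value carries the key, the rest use null
    static void emit(String key, List<String> values) {
        boolean firstKey = true;
        for (String v : values) {
            write(firstKey ? key : null, v);
            firstKey = false;
        }
    }

    public static void main(String[] args) {
        emit("key1", Arrays.asList("value1"));
        emit("key2", Arrays.asList("value1", "value2", "value3"));
        emit("key3", Arrays.asList("value1", "value2"));
        System.out.println(out);
    }
}
```

Running this reproduces the three-line output shown above.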
If a single output key-value pair can be bigger than memory, then the standard output mechanism is not suitable, since by interface design it requires passing a key-value pair rather than a stream.
I think the simplest solution would be to stream the output directly to an HDFS file.
If you have reasons to pass the data via the output format, I would suggest the following solution:
a) write to a local temporary directory;
b) pass the name of the file as the value to the output format.
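A sketch of that temp-file variant in plain Java (java.nio stands in for the real writes; the method name is illustrative, and in Hadoop the returned path string is what you would collect as the reducer's output value):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TempFileValue {
    // stream the (potentially huge) value into a local temporary file
    // and return its path, which is small enough to emit as the value
    static String spillToTempFile(String key, List<String> parts) throws IOException {
        Path tmp = Files.createTempFile("reduce-" + key + "-", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            for (String part : parts) {
                w.write(part);  // each piece is streamed out; the value is
                w.newLine();    // never materialized in memory as one object
            }
        }
        return tmp.toString();
    }

    public static void main(String[] args) throws IOException {
        String path = spillToTempFile("key1", List.of("value1", "value2"));
        System.out.println(path); // this small string is what you'd collect
    }
}
```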
Probably the most effective, but somewhat more complicated, solution would be to use a memory-mapped file as a buffer. It stays in memory as long as there is enough memory, and when needed the OS takes care of spilling it efficiently to disk.
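A minimal sketch of the memory-mapped buffer idea using java.nio (the class and method names are illustrative; the point is that writes and reads go through a mapping whose pages the OS keeps in memory or spills to disk as it sees fit):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedBuffer {
    // write a value through a memory-mapped file and read it back;
    // the backing file acts as the overflow when memory is pressured
    static String roundTrip(String value) throws IOException {
        Path file = Files.createTempFile("spill-", ".buf");
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // the mapping lives in the page cache; the OS flushes dirty
            // pages to the backing file transparently under memory pressure
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes.length);
            buf.put(bytes);   // write through the mapping
            buf.flip();
            byte[] back = new byte[bytes.length];
            buf.get(back);    // read back through the same mapping
            return new String(back, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("value1,value2,value3"));
    }
}
```

A real buffer would grow the mapping in chunks rather than sizing it to one value, but the spill-to-disk behaviour is the same.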