Handling large output values from reduce step in Hadoop
Not that I understand why you would want a huge value, but there is a way you can do this.
If you write your own OutputFormat, you can adapt the behaviour of the RecordWriter.write(Key, Value)
method to handle value concatenation based upon whether the key is null or not.
This way, in your reducer, you can write your code as follows (the first output for a key uses the actual key, and everything after that uses a null key):
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        boolean firstKey = true;
        while (values.hasNext()) {
            // only the first value for a key carries the key itself
            output.collect(firstKey ? key : null, values.next());
            firstKey = false;
        }
    }
The actual RecordWriter.write() method then contains the following logic to handle the null key / value concatenation:
    public synchronized void write(K key, V value) throws IOException {
        boolean nullKey = key == null || key instanceof NullWritable;
        boolean nullValue = value == null || value instanceof NullWritable;
        if (nullKey && nullValue) {
            return;
        }
        if (!nullKey) {
            // if we've written data before, append a new line
            if (dataWritten) {
                out.write(newline);
            }
            // write out the key and separator
            writeObject(key);
            out.write(keyValueSeparator);
        } else if (!nullValue) {
            // write out the value delimiter
            out.write(valueDelimiter);
        }
        // write out the value
        writeObject(value);
        // track that we've written some data
        dataWritten = true;
    }

    public synchronized void close(Reporter reporter) throws IOException {
        // if we've written out any data, append a closing newline
        if (dataWritten) {
            out.write(newline);
        }
        out.close();
    }
You'll notice the close method has also been amended to write a trailing newline after the last record.
The full code listing can be found on pastebin, and here's the test output:
    key1	value1
    key2	value1,value2,value3
    key3	value1,value2
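Outside of Hadoop, the concatenation behaviour above can be sketched in plain Java (the StringBuilder stands in for the output stream, and all class and method names here are illustrative, not part of the Hadoop API):

```java
import java.util.Arrays;
import java.util.List;

public class ConcatDemo {
    static final String SEPARATOR = "\t"; // key/value separator
    static final String DELIMITER = ",";  // value delimiter
    static StringBuilder out = new StringBuilder();
    static boolean dataWritten = false;

    // mirrors the RecordWriter.write() logic: a non-null key starts a
    // new record, a null key appends the value to the current record
    static void write(String key, String value) {
        if (key == null && value == null) return;
        if (key != null) {
            if (dataWritten) out.append("\n");
            out.append(key).append(SEPARATOR);
        } else if (value != null) {
            out.append(DELIMITER);
        }
        out.append(value);
        dataWritten = true;
    }

    // mirrors the reducer: the first value carries the key, the rest use null
    static void emit(String key, List<String> values) {
        boolean firstKey = true;
        for (String v : values) {
            write(firstKey ? key : null, v);
            firstKey = false;
        }
    }

    public static void main(String[] args) {
        emit("key1", Arrays.asList("value1"));
        emit("key2", Arrays.asList("value1", "value2", "value3"));
        emit("key3", Arrays.asList("value1", "value2"));
        System.out.println(out);
    }
}
```

Running this reproduces the three-line output shown above.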
If a single output key-value pair can be bigger than memory, then the standard output mechanism is not suitable, since by interface design it requires passing a key-value pair rather than a stream.
I think the simplest solution would be to stream the output directly to an HDFS file.
If you have reasons to pass the data via the output format, I would suggest the following solution:
a) write to a local temporary directory;
b) pass the name of the file as the value to the output format.
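A sketch of that temp-file variant in plain Java (java.nio stands in for the real writes; the method name is illustrative, and in Hadoop the returned path string is what you would collect as the reducer's output value):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TempFileValue {
    // stream the (potentially huge) value into a local temporary file
    // and return its path, which is small enough to emit as the value
    static String spillToTempFile(String key, List<String> parts) throws IOException {
        Path tmp = Files.createTempFile("reduce-" + key + "-", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            for (String part : parts) {
                w.write(part);  // each piece is streamed out; the value is
                w.newLine();    // never materialized in memory as one object
            }
        }
        return tmp.toString();
    }

    public static void main(String[] args) throws IOException {
        String path = spillToTempFile("key1", List.of("value1", "value2"));
        System.out.println(path); // this small string is what you'd collect
    }
}
```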
Probably the most effective, but somewhat more complicated, solution would be to use a memory-mapped file as a buffer. It stays in memory as long as there is enough memory, and when needed the OS takes care of spilling it efficiently to disk.
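A minimal sketch of the memory-mapped buffer idea using java.nio (the class and method names are illustrative; the point is that writes and reads go through a mapping whose pages the OS keeps in memory or spills to disk as it sees fit):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedBuffer {
    // write a value through a memory-mapped file and read it back;
    // the backing file acts as the overflow when memory is pressured
    static String roundTrip(String value) throws IOException {
        Path file = Files.createTempFile("spill-", ".buf");
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // the mapping lives in the page cache; the OS flushes dirty
            // pages to the backing file transparently under memory pressure
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes.length);
            buf.put(bytes);   // write through the mapping
            buf.flip();
            byte[] back = new byte[bytes.length];
            buf.get(back);    // read back through the same mapping
            return new String(back, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("value1,value2,value3"));
    }
}
```

A real buffer would grow the mapping in chunks rather than sizing it to one value, but the spill-to-disk behaviour is the same.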