
mapreduce.reduce.shuffle.memory.limit.percent, mapreduce.reduce.shuffle.input.buffer.percent and mapreduce.reduce.shuffle.merge.percent


I would like to share my understanding of these properties; I hope it helps. Please correct me if anything is wrong.

mapreduce.reduce.shuffle.input.buffer.percent specifies the percentage of the reducer's heap memory to allocate for the in-memory buffer that stores the intermediate map outputs copied from the mappers during the shuffle phase.

mapreduce.reduce.shuffle.memory.limit.percent specifies the maximum percentage of the above buffer that a single shuffle (the output copied from one map task) may occupy. Map outputs larger than this limit are not copied into the memory buffer; instead they are written directly to the reducer's disk.

mapreduce.reduce.shuffle.merge.percent specifies the usage threshold of the buffer at which the in-memory merger thread runs: once the buffered shuffle contents reach this percentage, the thread merges them into a single file and immediately spills the merged file to disk.
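
To make the arithmetic concrete, here is a minimal sketch of how the three percentages combine. The 1 GB reducer heap and the values 0.70, 0.25 and 0.66 are assumptions chosen for illustration (they match the usual defaults in mapred-default.xml):

```java
public class ShuffleBufferMath {
    public static void main(String[] args) {
        long reducerHeapBytes = 1024L * 1024 * 1024;  // assumed 1 GB reducer heap

        double inputBufferPercent = 0.70;  // mapreduce.reduce.shuffle.input.buffer.percent
        double memoryLimitPercent = 0.25;  // mapreduce.reduce.shuffle.memory.limit.percent
        double mergePercent       = 0.66;  // mapreduce.reduce.shuffle.merge.percent

        // Total in-memory shuffle buffer carved out of the reducer heap.
        long shuffleBuffer = (long) (reducerHeapBytes * inputBufferPercent);

        // Largest single map output held in memory; bigger ones go straight to disk.
        long singleShuffleLimit = (long) (shuffleBuffer * memoryLimitPercent);

        // Buffer usage at which the in-memory merge is triggered.
        long mergeThreshold = (long) (shuffleBuffer * mergePercent);

        System.out.printf("shuffle buffer:       %d MB%n", shuffleBuffer >> 20);      // ~716 MB
        System.out.printf("single shuffle limit: %d MB%n", singleShuffleLimit >> 20); // ~179 MB
        System.out.printf("merge threshold:      %d MB%n", mergeThreshold >> 20);     // ~473 MB
    }
}
```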

It is obvious that the in-memory merger thread needs at least two shuffle segments in the memory buffer before a merge makes sense. So mapreduce.reduce.shuffle.merge.percent should always be higher than mapreduce.reduce.shuffle.memory.limit.percent, which caps the size of any single in-memory shuffle segment; this guarantees that more than one segment is present in the buffer whenever the merge process starts.
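
If these values need tuning for a particular job, they can be set on the job's Configuration before submission. A minimal sketch, assuming the same illustrative values as above (they are not recommendations):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Fraction of reducer heap reserved for the shuffle buffer.
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);

        // Cap on a single in-memory shuffle segment, as a fraction of that buffer.
        conf.setFloat("mapreduce.reduce.shuffle.memory.limit.percent", 0.25f);

        // Buffer usage that triggers the in-memory merge; kept well above the
        // single-segment cap so a merge always sees more than one segment.
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);

        Job job = Job.getInstance(conf, "shuffle-tuning-example");
        // ... set mapper/reducer classes, input/output paths, and submit as usual.
    }
}
```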