
Hadoop reduce shuffle merge in memory


Well, to part one of your question: it's not enough to simply give Hadoop more memory and expect that it uses it and things get faster automatically (would be very nice though!). You need to adapt the configuration properties to make use of your memory. E.g. io.sort.mb is one setting which can help speed up the merge/shuffle phase.

http://hadoop.apache.org/docs/r0.20.2/mapred-default.html lists most of the configuration properties. http://www.slideshare.net/cloudera/mr-perf gives some explicit advice on speeding up the merge (slide 15).

Enabling compression of intermediate output (mapred.compress.map.output) usually speeds things up as well.
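For example, both settings mentioned above go into mapred-site.xml; this is just a sketch with illustrative values (tune io.sort.mb to your task heap, and check the mapred-default.html page linked above for your version's defaults):

```xml
<!-- mapred-site.xml: illustrative values, adjust to your cluster -->
<property>
  <name>io.sort.mb</name>
  <!-- memory for the map-side sort buffer, in MB; more buffer
       means fewer spills/merges during the shuffle -->
  <value>200</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <!-- compress intermediate map output to reduce shuffle traffic -->
  <value>true</value>
</property>
```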

HTH
Johannes


This is not actually an answer to all your questions, but I can explain why so much data is transferred.

CoGroup marks each key with a tag identifying the originating input. So if your data consists of only 2 keys, it's easy to see that the data can easily double in size (small key + tag of similar size). That gives you 17GB of data.
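As a back-of-the-envelope sketch (the byte counts and the ~8.5GB input size are assumptions for illustration, inferred from the 17GB figure above, not values from your job):

```python
# CoGroup tags every record with its source pipe. If the key is small,
# the tag is of comparable size, so the shuffled data roughly doubles.

key_bytes = 8          # hypothetical small key
tag_bytes = 8          # tag of similar size added per record
overhead = (key_bytes + tag_bytes) / key_bytes
print(overhead)        # 2.0 -> data roughly doubles

input_gb = 8.5         # assumed original data size
shuffled_gb = input_gb * overhead
print(shuffled_gb)     # 17.0 -> matches the observed transfer
```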

Next, 353 mappers each process ~17MB (very small; do you have many small input files?). By default each mapper should receive a block-size worth of data (MapR doesn't expose its block size in job.xml, so I don't know how big your block is), but with 64MB blocks you should be processing this data with a much smaller number of mappers (~100).
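The mapper-count estimate works out like this (the 64MB block size is an assumption, since MapR doesn't report it in job.xml):

```python
# Sanity check on mapper counts, using the numbers from the job above.
mappers = 353
split_mb = 17                      # each mapper processed ~17 MB
total_mb = mappers * split_mb
print(total_mb)                    # 6001 -> roughly 6 GB of input

block_mb = 64                      # assumed default block size
expected_mappers = -(-total_mb // block_mb)   # ceiling division
print(expected_mappers)            # 94 -> on the order of ~100 mappers
```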

I don't actually know how MapR Direct Shuffle (TM) works (I'm investigating it right now), but it looks like the mapper outputs were written/exposed to maprfs, so the shuffle/sort phase in the reducer downloads those parts directly from maprfs. That gives us the 17GB (you can sum all the sizes in your reducer log). Where the additional 6GB comes from, I don't know; maybe that question can be sent to MapR support.