
Hadoop reduce shuffle merge in memory


Well, to part one of your question: it's not enough to simply give Hadoop more memory and expect that it uses it and things get faster automatically (would be very nice though!). You need to adapt the configuration properties to make use of your memory. E.g. io.sort.mb is one setting which can help speed up the merge/shuffle phase.

http://hadoop.apache.org/docs/r0.20.2/mapred-default.html lists most of the configuration properties. http://www.slideshare.net/cloudera/mr-perf gives some explicit advice on speeding up the merge (slide 15).

Enabling compression of intermediate output (mapred.compress.map.output) usually speeds things up as well.
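For example, both settings mentioned above go into mapred-site.xml; this is just a sketch with illustrative values (tune io.sort.mb to your task heap, and check the mapred-default.html page linked above for your version's defaults):

```xml
<!-- mapred-site.xml: illustrative values, adjust to your cluster -->
<property>
  <name>io.sort.mb</name>
  <!-- memory for the map-side sort buffer, in MB; more buffer
       means fewer spills/merges during the shuffle -->
  <value>200</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <!-- compress intermediate map output to reduce shuffle traffic -->
  <value>true</value>
</property>
```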

HTH
Johannes


This is not actually an answer to all your questions, but I can explain why so much data is transferred.

CoGroup marks each key with a tag identifying the originating input. So if your data consists of only 2 keys, it's easy to see that the data can easily double in size (small key + tag of similar size). That gives you 17GB of data.
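As a back-of-the-envelope sketch (the byte counts and the ~8.5GB input size are assumptions for illustration, inferred from the 17GB figure above, not values from your job):

```python
# CoGroup tags every record with its source pipe. If the key is small,
# the tag is of comparable size, so the shuffled data roughly doubles.

key_bytes = 8          # hypothetical small key
tag_bytes = 8          # tag of similar size added per record
overhead = (key_bytes + tag_bytes) / key_bytes
print(overhead)        # 2.0 -> data roughly doubles

input_gb = 8.5         # assumed original data size
shuffled_gb = input_gb * overhead
print(shuffled_gb)     # 17.0 -> matches the observed transfer
```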

Next, 353 mappers each process ~17MB (very small; do you have many small input files?). By default each mapper should receive a block-size worth of data (MapR doesn't expose its block size in job.xml, so I don't know how big your block is), but with 64MB blocks you should be processing this data with a much smaller number of mappers (~100).
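The mapper-count estimate works out like this (the 64MB block size is an assumption, since MapR doesn't report it in job.xml):

```python
# Sanity check on mapper counts, using the numbers from the job above.
mappers = 353
split_mb = 17                      # each mapper processed ~17 MB
total_mb = mappers * split_mb
print(total_mb)                    # 6001 -> roughly 6 GB of input

block_mb = 64                      # assumed default block size
expected_mappers = -(-total_mb // block_mb)   # ceiling division
print(expected_mappers)            # 94 -> on the order of ~100 mappers
```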

I don't actually know how MapR Direct Shuffle (TM) works (I'm investigating it right now), but it looks like the mapper outputs were written/exposed to maprfs, so the shuffle/sort phase in the reducer downloads those parts directly from maprfs. That gives us the 17GB (you can sum all the sizes in your reducer log). Where the additional 6GB comes from, I don't know; maybe that question can be sent to MapR support.