How does mapreduce sort and shuffle work? How does mapreduce sort and shuffle work? hadoop hadoop

How does mapreduce sort and shuffle work?


The only way to 'sort' the values is to use a composite key which contains some information from the value itself. Your key's compareTo method can then ensure that the keys are sorted first by the actual key component, then by the value component. Finally you'll need a group partitioner to ensure that in the reducer all the keys with the same 'key' component (the actual key) are considered equal, and the associated values iterated over in one call to the reduce method.

This is known as a 'secondary sort', a question similar to this one provides some links to examples.


The local MRjob just uses the operating system 'sort' on the mapper output.

The mapper writes out in the format:

key<-tab->value\n

Thus you end up with the keys sorted primarily by key, but secondarily by value.

As noted, this doesn't happen in the real hadoop version, just the 'local' simulation.


The sort & shuffle phase doesn't gaurantee on the order of values that the reducer gets for a given key.