Is there any non-commutative reducer in mapreduce that can be executed in parallel?

I don't know if "commutative" is the right word to use here (the property that lets a reduction be parallelized is usually associativity, together with commutativity), but I understand what you are saying.

In Hadoop, the post-map phase is actually divided into two steps with the same signature: a Combiner and a Reducer. The Combiner runs on the mapper nodes to shrink the output before it gets key-sorted and shipped to the reducers. Often the same class is registered for both roles, but because they are configured separately (setCombinerClass versus setReducerClass) you can split them and do more than you might expect. One caveat: Hadoop is free to run the combiner zero, one, or many times, so the combiner's output must be a valid input to the reducer.
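
As a sketch, the job wiring might look like this in the Java MapReduce API (MyMapper, MyCombiner, and MyReducer are hypothetical placeholders for your own classes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mean-example");
        job.setJarByClass(JobSetup.class);
        job.setMapperClass(MyMapper.class);      // hypothetical mapper class
        job.setCombinerClass(MyCombiner.class);  // map-side pre-aggregation
        job.setReducerClass(MyReducer.class);    // final aggregation after the shuffle
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```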

The simple case of counting uses a summing reducer, which can serve as both the combine step and the reduce step. This avoids sending many records with the same key over the wire when a single partial count will do.
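
For example, a summing reducer along the lines of the standard Hadoop word-count example (sketched here as a static nested class of the job driver) is safe to register as both the combiner and the reducer, because integer addition is associative and commutative:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for a key. Since the operation is associative and
// commutative, the same class works as both combiner and reducer.
public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```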

You can achieve similar efficiency for computing the mean by giving the combiner and the reducer different implementations. Have each mapper output a pair (number, 1): a numerical value together with a count of 1. The combiner folds a collection of such pairs into a single (sum, count) (or (mean, count)) tuple, and the reducer aggregates those tuples, weighting by the counts, to produce the overall average. (As an aside: Kahan summation greatly reduces the floating-point error of adding many numbers.) This lets the mappers do part of the combining locally, just as in the counting example; see the sketch below.
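
Here is a minimal sketch of that combiner/reducer pair. It assumes each mapper emits its value as the Text string "value,1"; a real job would define a custom Writable for the (sum, count) pair, and MeanCombiner/MeanReducer are illustrative names (again written as nested classes of the driver), not anything Hadoop provides:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Folds (value, 1) pairs into a partial (sum, count) per key on the map side.
public static class MeanCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;  // Kahan (compensated) summation would reduce error here
        long count = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(",");
            sum += Double.parseDouble(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        context.write(key, new Text(sum + "," + count));
    }
}

// Merges the partial (sum, count) pairs and emits the weighted mean.
public static class MeanReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(",");
            sum += Double.parseDouble(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        context.write(key, new DoubleWritable(sum / count));
    }
}
```

Because the combiner may run on any subset of a key's values, any number of times, merging (sum, count) pairs has to be associative and commutative, which addition of sums and counts is.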

You can do a lot of clever things in a single map-reduce step. I don't think this is possible for the median, though: to get an exact median, all of the values ultimately have to pass through the state of a single machine.