Sorting in MapReduce Hadoop


1. Assume 100 mappers were executed and zero reducers. Will it generate 100 files?

Yes.

Are all individual files sorted?

No. If no reducers are used, the output of the mappers is not sorted. Sorting only takes place when there is a reduce phase.

Is the output sorted across all mappers?

No, for the same reason as above.
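
As a rough illustration of the map-only case, the following sketch simulates it in plain Python (it does not use Hadoop; the function run_map_only and the sample words are made up for this example). Each simulated mapper writes its records in the order it produced them, so there is one output per mapper and none of them is sorted.

```python
def run_map_only(splits, map_fn):
    # With zero reducers there is no shuffle/sort step: each mapper writes
    # its records to its own output, in the order it emitted them.
    return [[map_fn(record) for record in split] for split in splits]

# Hypothetical input splits, one per mapper.
splits = [["pear", "apple"], ["zebra", "mango", "kiwi"], ["fig"]]
outputs = run_map_only(splits, lambda word: (word, 1))

for i, records in enumerate(outputs):
    print("mapper", i, records)
```

The number of outputs equals the number of mappers, and the first one keeps "pear" before "apple", i.e. input order rather than key order.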

2. Input for the reducer is Key -> Values. For each key, are all values sorted?

No. However, the keys are sorted. After the shuffle phase, in which each reducer fetches its share of the mappers' output, the reducer merge-sorts those already-sorted map outputs by key (since there IS a reduce phase), so the keys arrive in sorted order when reducing starts. The values within a key come in no guaranteed order.
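
The merge step can be sketched in plain Python, assuming two hypothetical mapper outputs that are each already sorted by key. This only simulates the idea with heapq.merge; it is not how Hadoop implements it internally.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical outputs of two mappers: each run is sorted by key only.
mapper_a = [("apple", 3), ("banana", 9), ("cherry", 1)]
mapper_b = [("apple", 7), ("banana", 2), ("banana", 5)]

# Reduce side: k-way merge of the pre-sorted runs, comparing keys only.
merged = heapq.merge(mapper_a, mapper_b, key=itemgetter(0))

# Group consecutive records with equal keys, as a reducer would see them.
reducer_input = [(key, [value for _, value in group])
                 for key, group in groupby(merged, key=itemgetter(0))]

for key, values in reducer_input:
    print(key, values)
```

The keys come out in order (apple, banana, cherry), but the values for "banana" arrive as 9, 2, 5: the merge compares keys only, so nothing orders the values.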

3. Assume 50 reducers were executed. Will it generate 50 files?

Yes (unless you use MultipleOutputs).

Are all individual files sorted?

No. The sorted input does not guarantee a sorted output. The output depends on the algorithm that you use in the reduce method.

Is the output sorted across all reducers?

No, for the same reason as above. However, if you use an Identity Reducer, i.e., you just write the input of the reducer as you get it, the reducer's output will be sorted PER REDUCER, not globally.
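
To see why an identity reducer gives output that is sorted per reducer but not globally, here is a small simulation in plain Python. The integer keys, the reducer count, and the modulo-based partition function are assumptions chosen for the example; they stand in for a hash-style partitioner.

```python
NUM_REDUCERS = 3

def partition(key, num_reducers):
    # Stand-in for a hash partitioner: integer keys modulo the reducer count.
    return key % num_reducers

# Hypothetical mapper output keys (values omitted for brevity).
mapper_keys = [14, 3, 9, 7, 1, 12, 5, 8]

# Shuffle: route every key to the reducer chosen by the partitioner.
reducer_inputs = [[] for _ in range(NUM_REDUCERS)]
for key in mapper_keys:
    reducer_inputs[partition(key, NUM_REDUCERS)].append(key)

# Each reducer sees its keys in sorted order; an identity reducer simply
# writes them out unchanged, so each output file is sorted.
part_files = [sorted(keys) for keys in reducer_inputs]

# Reading the output files one after another is not a global sort.
concatenated = [key for part in part_files for key in part]
print(part_files)
print(concatenated)
```

Every simulated part file is sorted on its own, but concatenating them yields 3, 9, 12, 1, 7, 5, 8, 14, because hash-style partitioning scatters neighbouring keys across reducers.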

Is there any place where guaranteed sorting happens in MapReduce?

Sorting takes place whenever there is a reduce phase; it is applied to the output keys of each mapper and to the input keys of each reducer. If you want the reducers' input to be globally sorted, you can either use a single reducer or a TotalOrderPartitioner, which is a bit tricky...
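
The idea behind total-order partitioning can be sketched in plain Python: instead of hashing, keys are routed by range, so reducer 0 gets the smallest keys, reducer 1 the next range, and so on. The split points below are hard-coded assumptions for the example; in Hadoop they are read from a partition file that is usually built by sampling the input. This is a simulation of the concept, not the TotalOrderPartitioner API.

```python
import bisect

NUM_REDUCERS = 3

# Assumed split points (NUM_REDUCERS - 1 of them), chosen by hand here.
split_points = [5, 10]

def range_partition(key, splits):
    # Keys below splits[0] go to reducer 0, keys in [splits[0], splits[1])
    # go to reducer 1, and so on.
    return bisect.bisect_right(splits, key)

# Hypothetical mapper output keys (values omitted for brevity).
mapper_keys = [14, 3, 9, 7, 1, 12, 5, 8]

reducer_inputs = [[] for _ in range(NUM_REDUCERS)]
for key in mapper_keys:
    reducer_inputs[range_partition(key, split_points)].append(key)

# Each reducer receives its keys sorted; an identity reducer writes them as is.
part_files = [sorted(keys) for keys in reducer_inputs]

# Because the ranges do not overlap, reading the files in order is a global sort.
concatenated = [key for part in part_files for key in part]
print(part_files)
print(concatenated)
```

The tricky part in practice is choosing split points that balance the load: poorly chosen ranges send most keys to a few reducers.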