Copy operations in shuffle and sort phase of MapReduce Copy operations in shuffle and sort phase of MapReduce hadoop hadoop

Copy operations in shuffle and sort phase of MapReduce


Suppose you have 3 mappers and 1 reducer. Each mapper task outputs 1 file (sorted by key) that is written to the local filesystem of where the map function ran from. So, we will have 3 such output files spread around the cluster.

Since reducers do not take advantage of data locality optimisation, and since we have only 1 reducer - it will need to copy the 3 different output files that each mapper task produced across the network.

Hence, there are m x n = 3 x 1 = 3 copy operations involved in this scenario.