Why don't EMR instances have as many reducers as mappers?


Mappers extract data from their input stream (the mapper's STDIN), and what they emit is usually much more compact. That outbound stream (the mapper's STDOUT) is then sorted by key before it reaches the reducers. The reducers therefore receive a smaller, already sorted input stream, which takes less work to process.
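To make the size difference concrete, here is a minimal local sketch of that flow in Python. The log line layout (method, path, status, bytes, agent) and the function names are assumptions made up for illustration, and `run_local` only imitates the sort step that Hadoop performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Keep only the status code from each (hypothetical) log line."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            # Everything except the third field is dropped, so the
            # mapper's output is much smaller than its input.
            yield fields[2], 1


def reducer(sorted_pairs):
    """Sum the counts per key; relies on the pairs arriving sorted by key."""
    for key, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    shuffled = sorted(mapper(lines), key=lambda pair: pair[0])
    return list(reducer(shuffled))
```

For example, `run_local(["GET /a 200 512 agent-x", "GET /b 404 0 agent-y"])` hands the reducer just two short key/count pairs instead of the full log lines.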

That is essentially why the default configuration for any Hadoop MapReduce cluster, not just EMR, provides more mapper slots than reducer slots, with both scaled to the number of cores on the nodes the jobtracker schedules tasks onto.

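You can control the number of mappers and reducers through the jobconf parameter. The configuration variables are mapred.map.tasks and mapred.reduce.tasks. Note that mapred.map.tasks is only a hint: the actual number of map tasks follows from how the input is split, whereas mapred.reduce.tasks is applied as given.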
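As a rough sketch of how those two variables might be passed, the Python helper below assembles a Hadoop streaming command line. It assumes a plain `hadoop jar` streaming invocation that uses the generic `-D` option rather than EMR's own tooling. The helper name `streaming_command` and all paths in the usage note are hypothetical placeholders.

```python
def streaming_command(streaming_jar, input_path, output_path,
                      mapper, reducer, map_tasks=None, reduce_tasks=None):
    """Build the argument list for a Hadoop streaming job (hypothetical helper)."""
    cmd = ["hadoop", "jar", streaming_jar]
    # Generic -D options have to come before the streaming-specific options.
    if map_tasks is not None:
        cmd += ["-D", f"mapred.map.tasks={map_tasks}"]
    if reduce_tasks is not None:
        cmd += ["-D", f"mapred.reduce.tasks={reduce_tasks}"]
    cmd += ["-input", input_path, "-output", output_path,
            "-mapper", mapper, "-reducer", reducer]
    return cmd
```

Calling `streaming_command("streaming.jar", "in/", "out/", "map.py", "reduce.py", reduce_tasks=4)` returns a list that could be handed to `subprocess.run` on a machine with Hadoop installed.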