How many mappers/reducers should be set when configuring Hadoop cluster?

hadoop

How many mappers/reducers should be set when configuring Hadoop cluster?


There is no single formula. It depends on how many cores and how much memory you have. In general, the number of map slots plus the number of reduce slots should not exceed the number of cores, and keep in mind that the machine is also running the TaskTracker and DataNode daemons. One common suggestion is to configure more map slots than reduce slots. If I were you, I would run one of my typical jobs with a reasonable amount of data to try it out.
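
To make that concrete: on an MR1 / Hadoop 1.x cluster the per-node slot counts are set in mapred-site.xml through mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. The values below are only a sketch, assuming a hypothetical 8-core worker node, and should be tuned against your own jobs:

    <!-- mapred-site.xml on each worker node (MR1 / Hadoop 1.x) -->
    <!-- Assumes a hypothetical 8-core node; 5 + 2 = 7 slots leaves
         headroom for the TaskTracker and DataNode daemons. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>5</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>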


Quoting from "Hadoop: The Definitive Guide, 3rd edition", page 306

Because MapReduce jobs are normally I/O-bound, it makes sense to have more tasks than processors to get better utilization.

The amount of oversubscription depends on the CPU utilization of jobs you run, but a good rule of thumb is to have a factor of between one and two more tasks (counting both map and reduce tasks) than processors.

A processor in the quote above is equivalent to one logical core.

But this is just theory, and each use case is most likely different from the next, so some testing needs to be done. Still, this number is a good starting point.
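
As a worked example of that rule of thumb (the node size is my assumption, not from the book): a worker with 8 logical cores and a factor of 1.5 gives about 8 × 1.5 = 12 task slots in total, which might be split as 8 map slots and 4 reduce slots, perhaps trimmed slightly to leave headroom for the TaskTracker and DataNode daemons.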


You should probably also look at reducer lazy loading, which allows reducers to start later, only when they are required, so basically the number of map slots can be increased. I don't have much idea about this, but it seems useful.
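
The answer above does not name a specific setting, but delayed reducer start is usually tuned through the slowstart threshold. A minimal sketch, assuming the MR1 property name (newer releases call it mapreduce.job.reduce.slowstart.completedmaps):

    <!-- mapred-site.xml (MR1 name shown) -->
    <!-- Do not schedule reducers until 80% of map tasks have
         completed, instead of the much earlier default of 0.05. -->
    <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <value>0.80</value>
    </property>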