
Hadoop MapReduce: default number of mappers


Adding more to what Chris added above:

  1. The number of maps is usually driven by the number of DFS blocks in the input files, although that leads people to adjust their DFS block size in order to adjust the number of maps.

  2. The right level of parallelism for maps seems to be around 10-100 maps/node, although this can go up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

  3. You can increase the number of map tasks by calling JobConf's conf.setNumMapTasks(int num) (see the sketch just after this list). Note: this can increase the number of map tasks, but it will not set the number below what Hadoop determines by splitting the input data.
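For illustration, here is a minimal driver sketch using the old mapred API showing where that hint goes; the class name and path arguments are placeholders, not something from the question:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver: shows where the map-count hint is set (old mapred API).
public class MapCountHint {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapCountHint.class);
        conf.setJobName("map-count-hint");

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Only a hint: the InputFormat may still create more splits than this,
        // and Hadoop will not run fewer maps than the number of splits it computes.
        conf.setNumMapTasks(500);

        JobClient.runJob(conf);
    }
}
```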

Finally, controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size.
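As a rough sketch (simplified from the old FileInputFormat.getSplits() logic, not the exact source), the per-file split size comes out of the hint, the minimum split size, and the block size roughly like this:

```java
// Simplified sketch of how the old mapred FileInputFormat sizes splits.
// totalSize : combined length of all input files
// numSplits : the mapred.map.tasks hint handed to getSplits()
// minSize   : mapred.min.split.size (the lower bound)
// blockSize : DFS block size of the file (the effective upper bound)
static long computeSplitSize(long totalSize, int numSplits, long minSize, long blockSize) {
    long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
    return Math.max(minSize, Math.min(goalSize, blockSize));
}
```

So raising mapred.map.tasks shrinks goalSize (more, smaller splits) only down to the block size, while mapred.min.split.size pushes the split size back up.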

Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
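(That figure falls out of the arithmetic: 10 TB is 10 × 1024 × 1024 MB = 10,485,760 MB, and at one split per 128 MB block that is 10,485,760 / 128 = 81,920, i.e. roughly 82k map tasks.)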

Read more: http://wiki.apache.org/hadoop/HowManyMapsAndReduces


It depends on a number of factors:

  • The input format, and the configuration properties specific to that format
  • For file-based input formats (TextInputFormat, SequenceFileInputFormat, etc.):
    • The number of input files / paths
    • Whether the files are splittable (compressed files typically are not; SequenceFiles are an exception)
    • The block size of the files

There are probably more, but hopefully you get the idea.
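If you want to see the resulting number for your own input, a small sketch like the following (old mapred API; the class name and the path argument are placeholders) asks the InputFormat directly:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical utility: prints how many splits (and therefore map tasks)
// TextInputFormat would produce for a given input path.
public class PrintSplitCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PrintSplitCount.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));

        TextInputFormat inputFormat = new TextInputFormat();
        inputFormat.configure(conf);

        // The second argument is only the mapred.map.tasks hint.
        InputSplit[] splits = inputFormat.getSplits(conf, conf.getNumMapTasks());
        System.out.println("Map tasks for " + args[0] + ": " + splits.length);
    }
}
```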