using a reducer slows the mapper

hadoop io mapreduce

When you set the number of reducers to 0, you are running a map-only job. The data is neither sorted nor shuffled, and the output of the mappers is written directly to disk. If you do use reducers, there are two cases: you only need the data sorted, or you also need to aggregate it or perform some other operation on it.

If you only need to sort the data, you can use the identity reducer: the framework sorts the map output and shuffles it to the reducers, which pass it through unchanged and write it to disk. In the second case, the reducers take extra time to perform the operations you want, whether that is aggregation or anything else.
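To make the two paths concrete, here is a minimal plain-Java sketch that does not use Hadoop at all. The class `MapOnlyVsReduceSketch` and its methods are hypothetical names for illustration, and a `TreeMap` stands in for the framework's sort and shuffle. In a real job the map-only path corresponds to calling `job.setNumReduceTasks(0)` in the driver.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapOnlyVsReduceSketch {

    // Hypothetical map function: emit one (token, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Map-only path: records go straight to the output, in the order they were emitted.
    static List<String> runMapOnly(List<String> lines) {
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                output.add(kv.getKey() + "\t" + kv.getValue());
            }
        }
        return output;
    }

    // Path with a reduce phase: all records are grouped and sorted by key first
    // (the TreeMap is a stand-in for sort + shuffle), then an identity reduce
    // writes every value back out unchanged.
    static List<String> runWithIdentityReduce(List<String> lines) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            for (Integer value : group.getValue()) {
                output.add(group.getKey() + "\t" + value);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("b a", "c a");
        System.out.println("map only:        " + runMapOnly(lines));
        System.out.println("identity reduce: " + runWithIdentityReduce(lines));
    }
}
```

Both paths emit the same records. The second one has to hold, group and order all of them before anything is written, and that is the extra work a reduce phase adds even when the reducer itself does nothing.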

So yes, there is a big difference in run time between a map-only job and a job that also has a reduce phase. The following picture shows all the steps you skip when the map output is written directly to disk:

[Figure: map reduce phases]

EDIT: when you add a reduce phase, the mappers reach 100% but do not show as completed right away. For efficiency, part of the sorting is done on the map side: the map output is buffered in memory and presorted before it is handed to the reducers. When you ran the job as map-only, none of this happened, so it completed much faster. Now that you also use a reducer, a mapper that reaches 100% still has to finish this presorting and buffering, and it does not appear as "Completed" until that is done.

[Figure: map side]
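The map-side behaviour described in the EDIT can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not Hadoop's actual implementation: the class `MapSideSpillSketch` is hypothetical, the buffer is sized in records rather than bytes, spills stay in memory instead of going to local disk, and partitioning is left out. In a real cluster the buffer size and spill threshold are governed by properties such as `mapreduce.task.io.sort.mb` and `mapreduce.map.sort.spill.percent`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class MapSideSpillSketch {
    private final int bufferCapacity;                        // hypothetical: counted in records
    private final List<String> buffer = new ArrayList<>();   // in-memory map output buffer
    private final List<List<String>> spills = new ArrayList<>(); // sorted runs produced so far

    public MapSideSpillSketch(int bufferCapacity) {
        this.bufferCapacity = bufferCapacity;
    }

    // Called for every key the map function emits.
    public void collect(String key) {
        buffer.add(key);
        if (buffer.size() >= bufferCapacity) {
            spill();
        }
    }

    // Sort the buffered keys and set them aside as one sorted run.
    private void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> run = new ArrayList<>(buffer);
        Collections.sort(run);
        spills.add(run);
        buffer.clear();
    }

    public int spillCount() {
        return spills.size();
    }

    // Work that remains after the last input record has been read:
    // flush what is still buffered, then k-way merge all sorted runs.
    public List<String> close() {
        spill();
        // Each heap entry is {run index, position in run}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> spills.get(a[0]).get(a[1]).compareTo(spills.get(b[0]).get(b[1])));
        for (int i = 0; i < spills.size(); i++) {
            heap.add(new int[] {i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = spills.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        MapSideSpillSketch task = new MapSideSpillSketch(3);
        for (String key : new String[] {"d", "b", "e", "a", "c"}) {
            task.collect(key);
        }
        // All input is consumed here, which is roughly the point where progress shows 100%.
        System.out.println("spills before close: " + task.spillCount());
        System.out.println("merged output: " + task.close());
    }
}
```

In this model the task has read every record before `close()` runs, yet the final flush and merge are still pending. That gap is one way to picture why a mapper can sit at 100% without being marked as completed.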

Hope it is clearer now!
