In-depth understanding of internal working of map phase in a Map reduce job in hadoop?



Map-only jobs work differently than Map-and-Reduce jobs. It's not inconsistent, just different.

How will I be able to get the intermediate file that has both the partitioned and sorted data from the mapper output?

You can't. There isn't a hook for getting pieces of data out of the intermediate stages of MapReduce. The same is true for getting data after the partitioner, or after a record reader, etc.

It is interesting to note that running the mapper alone does not produce sorted output, which seems to contradict the point that data sent to the reducer is sorted. More details here

It does not contradict. Mappers sort because the reducer needs the input sorted to be able to do a merge. If there are no reducers, there is no reason to sort, so it doesn't. This is the right behavior: I don't want the output sorted in a map-only job, since that would only make my processing slower. I've never had a situation where I wanted my map output to be locally sorted.
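For reference, a map-only job is just a regular job with the reduce count set to zero via the standard Hadoop `Job` API; the mapper/driver class names and paths below are placeholders:

```java
// Driver sketch for a map-only job. With zero reducers, each mapper's
// output goes straight to the OutputFormat -- no partitioning, no sort,
// no combiner. MyDriver/MyMapper and the paths are placeholders.
Job job = Job.getInstance(new Configuration(), "map-only example");
job.setJarByClass(MyDriver.class);
job.setMapperClass(MyMapper.class);
job.setNumReduceTasks(0);           // zero reducers => map-only job
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
```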

No combiner is applied either if only the mapper is run: More details here

Combiners are an optimization. There is no guarantee that they actually run, or over what data. Combiners are mostly there to make the reducers more efficient. So, again, just like the local sorting, combiners do not run if there are no reducers because there is no reason to.

If you want combiner-like behavior, I suggest writing data into a buffer (a hashmap, perhaps) and then writing out the locally-summarized data in the cleanup function that runs when a Mapper finishes. Be careful of memory usage if you want to do this. This is a better approach because combiners are specified as a good-to-have optimization and you can't count on them running... or on what data they run over, even when they do run.
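That buffer-and-cleanup idea is often called "in-mapper combining". Here is a minimal, Hadoop-free sketch of the pattern; the class and method names are illustrative (a real implementation would extend `Mapper` and call `context.write` instead of collecting into a map):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the in-mapper combining pattern: buffer partial sums in a
// HashMap inside the mapper and flush them in cleanup(), instead of
// relying on an optional combiner. Names are illustrative, not Hadoop API.
class InMapperCombiner {
    private final Map<String, Long> buffer = new HashMap<>();
    private final Map<String, Long> emitted = new HashMap<>();
    private static final int MAX_BUFFER = 100_000; // guard against OOM

    // Called once per input record (analogous to Mapper.map()).
    void map(String word) {
        buffer.merge(word, 1L, Long::sum);
        if (buffer.size() >= MAX_BUFFER) {
            flush(); // spill early so the buffer doesn't exhaust the heap
        }
    }

    // Called once when the mapper finishes (analogous to Mapper.cleanup()).
    void cleanup() {
        flush();
    }

    private void flush() {
        // In a real mapper this loop would be context.write(key, sum).
        buffer.forEach((k, v) -> emitted.merge(k, v, Long::sum));
        buffer.clear();
    }

    Map<String, Long> result() {
        return emitted;
    }
}
```

The `MAX_BUFFER` cap is the memory-safety valve mentioned above: spilling partial sums early trades a little summarization quality for a bounded heap, which is exactly the trade-off a real combiner makes anyway.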