
Hadoop MapReduce - How to improve parallelism


To improve the performance and parallelism of your job, I would suggest the following enhancements:

  • Add a combiner that accumulates the delay totals and entry counts per mapper, so less data is shuffled to the reducers.
  • Use a compound key of year and airport to increase the number of distinct keys (and therefore the number of usable reducers) without having to write a custom partitioner.
  • Then pick one of two options: either use Tez for a map -> reduce -> reduce pipeline, or write one file per airport and year and merge them afterwards with the HDFS utilities.
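To make the first two points concrete, here is a minimal plain-Java sketch (outside the Hadoop API, so it runs standalone) of what a combiner does with a compound year/airport key: instead of emitting every raw record, it folds records into a running (total delay, entry count) pair per key. The field names and the `"year,airport"` key layout are illustrative assumptions, not taken from your code.

```java
import java.util.HashMap;
import java.util.Map;

public class DelayAggregation {
    // Compound key "year,airport": each (year, airport) pair becomes
    // its own reduce key, spreading work across more reducers.
    static String compoundKey(int year, String airport) {
        return year + "," + airport;
    }

    // Combiner-style local aggregation: per compound key, keep a running
    // [sumOfDelays, numberOfEntries] pair instead of raw records.
    static void combine(Map<String, long[]> acc,
                        int year, String airport, long delay) {
        long[] pair = acc.computeIfAbsent(compoundKey(year, airport),
                                          k -> new long[2]);
        pair[0] += delay; // total delay accumulated for this key
        pair[1] += 1;     // number of records folded in
    }

    public static void main(String[] args) {
        Map<String, long[]> acc = new HashMap<>();
        combine(acc, 2013, "JFK", 15);
        combine(acc, 2013, "JFK", 5);
        combine(acc, 2014, "LAX", 30);
        long[] jfk = acc.get("2013,JFK");
        System.out.println(jfk[0] + " " + jfk[1]); // 20 2
    }
}
```

In a real job the same fold would live in a `Reducer` subclass registered via `job.setCombinerClass(...)`, with the pair serialized as a custom `Writable`; the point is that only one aggregated value per key leaves each mapper.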


Your approach is good, and it will achieve better parallelism as more data comes into the picture (more years of data).

Still, if you want your approach to utilize the maximum number of reduce slots, make sure you create the necessary number of partitions by adjusting your partitioner accordingly. You will then also need to handle the overhead of merging the files for the same year into a single sorted file.
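If you do adjust the partitioner, the routing arithmetic is simple. Below is a standalone sketch that mirrors the formula Hadoop's default `HashPartitioner` uses, but applied only to the year portion of a compound key so that all airports of one year land in the same partition; the `"year,airport"` key layout is an assumption about how you would build the key, not taken from your code.

```java
public class YearPartitioner {
    // Same modular-hash formula as Hadoop's HashPartitioner
    // (hash masked to non-negative, then mod the reducer count),
    // computed on the year prefix of a "year,airport" compound key.
    static int partitionForYear(String compoundKey, int numReduceTasks) {
        String year = compoundKey.split(",")[0];
        return (year.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Two airports in the same year route to the same reducer,
        // so each reducer can emit one sorted file per year.
        System.out.println(partitionForYear("2013,JFK", 10)
                == partitionForYear("2013,LAX", 10)); // true
    }
}
```

In a real job this would be a subclass of `org.apache.hadoop.mapreduce.Partitioner` registered with `job.setPartitionerClass(...)`; note that partitioning by year alone caps useful parallelism at the number of distinct years, which is exactly the trade-off against the compound-key option above.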