
Hadoop MapReduce - How to improve parallelism


To improve the performance and parallelism of your job, I would suggest the following enhancements:

  • Add a combiner that accumulates the delay totals and entry counts per mapper, so less data is shuffled to the reducers.
  • Use a compound key of year and airport to increase the number of distinct keys (and therefore the number of usable reducers) without having to write a custom partitioner.
  • Then pick one of two options: either use Tez for a map -> reduce -> reduce pipeline, or write one file per airport and year and merge them afterwards with the HDFS utilities.
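To make the first two points concrete, here is a minimal plain-Java sketch (outside the Hadoop API, so it runs standalone) of what a combiner does with a compound year/airport key: instead of emitting every raw record, it folds records into a running (total delay, entry count) pair per key. The field names and the `"year,airport"` key layout are illustrative assumptions, not taken from your code.

```java
import java.util.HashMap;
import java.util.Map;

public class DelayAggregation {
    // Compound key "year,airport": each (year, airport) pair becomes
    // its own reduce key, spreading work across more reducers.
    static String compoundKey(int year, String airport) {
        return year + "," + airport;
    }

    // Combiner-style local aggregation: per compound key, keep a running
    // [sumOfDelays, numberOfEntries] pair instead of raw records.
    static void combine(Map<String, long[]> acc,
                        int year, String airport, long delay) {
        long[] pair = acc.computeIfAbsent(compoundKey(year, airport),
                                          k -> new long[2]);
        pair[0] += delay; // total delay accumulated for this key
        pair[1] += 1;     // number of records folded in
    }

    public static void main(String[] args) {
        Map<String, long[]> acc = new HashMap<>();
        combine(acc, 2013, "JFK", 15);
        combine(acc, 2013, "JFK", 5);
        combine(acc, 2014, "LAX", 30);
        long[] jfk = acc.get("2013,JFK");
        System.out.println(jfk[0] + " " + jfk[1]); // 20 2
    }
}
```

In a real job the same fold would live in a `Reducer` subclass registered via `job.setCombinerClass(...)`, with the pair serialized as a custom `Writable`; the point is that only one aggregated value per key leaves each mapper.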


Your approach is good, and it will achieve better parallelism as more data comes into the picture (more years of data).

Still, if you want your approach to utilize the maximum number of reduce slots, make sure you create the necessary number of partitions by adjusting your partitioner accordingly. You will then also need to handle the overhead of merging the files for the same year into a single sorted file.
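If you do adjust the partitioner, the routing arithmetic is simple. Below is a standalone sketch that mirrors the formula Hadoop's default `HashPartitioner` uses, but applied only to the year portion of a compound key so that all airports of one year land in the same partition; the `"year,airport"` key layout is an assumption about how you would build the key, not taken from your code.

```java
public class YearPartitioner {
    // Same modular-hash formula as Hadoop's HashPartitioner
    // (hash masked to non-negative, then mod the reducer count),
    // computed on the year prefix of a "year,airport" compound key.
    static int partitionForYear(String compoundKey, int numReduceTasks) {
        String year = compoundKey.split(",")[0];
        return (year.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Two airports in the same year route to the same reducer,
        // so each reducer can emit one sorted file per year.
        System.out.println(partitionForYear("2013,JFK", 10)
                == partitionForYear("2013,LAX", 10)); // true
    }
}
```

In a real job this would be a subclass of `org.apache.hadoop.mapreduce.Partitioner` registered with `job.setPartitionerClass(...)`; note that partitioning by year alone caps useful parallelism at the number of distinct years, which is exactly the trade-off against the compound-key option above.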