How to ensure that MapReduce tasks are independent of each other?

hadoop parallel-processing mapreduce batch-processing hadoop2

If the data is related, it is your job to ensure that the relation is carried along. MapReduce splits up the data and processes the pieces independently, ignoring any relations you have not implemented yourself:

Map just reads the input files in blocks and passes the data to the map function one "record" at a time. By default a record is one line, but that can be changed.

You can annotate the data in Map with its origin, but what Map basically lets you do is categorize the data: you emit a new key and new values, and MapReduce groups by the new key. So if there are relations between different records, emit them under the same (or a similar, *1) key so that they are grouped together.
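To make the "categorize by key" idea concrete, here is a minimal sketch in plain Python, not Hadoop code. The record layouts, the source names `customers` and `orders`, and the function `map_record` are all made up for illustration. It only shows that related records from two inputs end up under the same key when the map function emits the shared id as the key and tags each value with its origin.

```python
def map_record(source, line):
    """Emit (key, value) pairs; the key decides which records meet in one reduce call."""
    fields = line.split(",")
    if source == "customers":
        # hypothetical layout: customer_id,name
        customer_id, name = fields
        yield customer_id, ("customer", name)
    elif source == "orders":
        # hypothetical layout: order_id,customer_id,amount
        order_id, customer_id, amount = fields
        yield customer_id, ("order", float(amount))


# Records from two different inputs, each handled on its own
records = [
    ("customers", "c1,Alice"),
    ("orders", "o1,c1,9.5"),
    ("orders", "o2,c1,3.0"),
]

emitted = []
for source, line in records:
    emitted.extend(map_record(source, line))
```

Each call to `map_record` sees exactly one record and nothing else, which is what keeps the map tasks independent. The relation survives only because all three records are emitted under the key `c1`.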

For Reduce, the data is partitioned and sorted (that is where the grouping takes place), and afterwards the reduce function receives all data from one group: one key and all its associated values. Now you can aggregate over the values. That's it.
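The sort, group and reduce steps can be imitated in a few lines of plain Python. This is only an in-memory sketch of the idea, not how Hadoop implements its shuffle; `shuffle` and `reduce_sum` are illustrative names.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(pairs):
    """Sort by key, then yield (key, [values]) once per distinct key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_sum(key, values):
    """Aggregate all values that share one key."""
    return key, sum(values)


pairs = [("b", 2), ("a", 1), ("b", 5), ("a", 4)]
result = [reduce_sum(key, values) for key, values in shuffle(pairs)]
```

Every call to `reduce_sum` gets one key and all of its values and never sees another group, so the reduce calls are independent of each other as well.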

So MapReduce gives you an overall group-by. Everything else is your responsibility. You want a cross product of two sources? Implement it, for example, by introducing artificial keys and emitting each record multiple times (fragment-and-replicate join). Your imagination is the limit. And you can always pass the data through another job.
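One possible way to build such a cross product is sketched below in plain Python. The grid size, the hash-based bucketing and all function names are assumptions made for this example. The artificial key is a cell `(row, col)` in a grid of reducers: a record from the left source is replicated to every cell in its row, a record from the right source to every cell in its column, so each left/right pair meets in exactly one cell.

```python
from collections import defaultdict
from itertools import product
import zlib

ROWS, COLS = 2, 3  # hypothetical grid of reduce groups


def bucket(record, n):
    """Deterministically assign a record to one of n fragments."""
    return zlib.crc32(record.encode()) % n


def map_left(record):
    # Fragment by row, replicate across all columns
    row = bucket(record, ROWS)
    for col in range(COLS):
        yield (row, col), ("L", record)


def map_right(record):
    # Fragment by column, replicate across all rows
    col = bucket(record, COLS)
    for row in range(ROWS):
        yield (row, col), ("R", record)


def reduce_cell(values):
    """Pair every left record with every right record in this cell."""
    lefts = [rec for tag, rec in values if tag == "L"]
    rights = [rec for tag, rec in values if tag == "R"]
    return list(product(lefts, rights))


left = ["a", "b", "c"]
right = ["x", "y"]

groups = defaultdict(list)  # stands in for the group-by done by the framework
for rec in left:
    for key, value in map_left(rec):
        groups[key].append(value)
for rec in right:
    for key, value in map_right(rec):
        groups[key].append(value)

pairs = [pair for values in groups.values() for pair in reduce_cell(values)]
```

The price is the replication: every left record is emitted `COLS` times and every right record `ROWS` times, so the grid size trades parallelism against shuffle volume.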

*1: similar, because you can influence how keys are grouped later on. Normally records are grouped by key identity, but you can change this.
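As a rough illustration of changing the grouping: sort on a composite key, but group on only part of it. In Hadoop this is what a grouping comparator (set via `Job.setGroupingComparatorClass`) is typically used for. The snippet below is a plain-Python imitation with made-up data, not the Hadoop API.

```python
from itertools import groupby

# Composite keys of the form (user, timestamp); the values are event names
pairs = [
    (("u2", 1), "login"),
    (("u1", 2), "click"),
    (("u1", 1), "login"),
    (("u2", 3), "logout"),
]

# Sort on the full composite key: by user first, then by timestamp
ordered = sorted(pairs, key=lambda kv: kv[0])


def grouping_key(kv):
    """Group on the user part only, so 'similar' keys land in one group."""
    return kv[0][0]


grouped = {
    user: [value for _, value in group]
    for user, group in groupby(ordered, key=grouping_key)
}
```

Keys that differ only in the timestamp fall into the same group, and the values inside each group arrive ordered by timestamp.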

CodeHunter
