Configure Map Side join for multiple mappers in Hadoop Map/Reduce

hadoop mapreduce inner-join

There are map-side and reduce-side joins. You proposed a map-side join, which is executed inside the mapper, not before it. Both sides must have the same key and value types, so you can't join a LongWritable with a Text, even though they might hold the same value.

There are a few more subtle things to note:

The input files have to be sorted, so they are most likely reducer output.
You can control the number of mappers in the join's map phase by setting the number of reducers in the jobs that sort the datasets.

The whole procedure basically works like this: you have dataset A and dataset B, and both share the same key type, let's say LongWritable.

Run two jobs that sort the two datasets by their keys. Both jobs HAVE TO set the number of reducers to the same value, say 2.
This results in 2 sorted files for each dataset.
Now set up the job that joins the datasets. It will spawn 2 mappers, or more if you set a higher number of reducers in the previous jobs.
Do whatever you like in the reduce step.

If the two datasets do not have the same number of files to be joined, an exception is thrown during job setup.
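
The steps above can be sketched in plain Java without any Hadoop dependency, just to show why both sides need sorted input and the same number of partitions. This is only an illustration under stated assumptions: the class MapSideJoinSketch, its Rec type and its methods are hypothetical names, a long stands in for LongWritable, and partitioning by key modulo the reducer count is a simplification of what a real Hadoop partitioner does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, Hadoop-free illustration of the procedure; not Hadoop API.
class MapSideJoinSketch {
    // One record: a long key (standing in for LongWritable) and a value.
    static final class Rec implements Comparable<Rec> {
        final long key;
        final String value;
        Rec(long key, String value) { this.key = key; this.value = value; }
        @Override public int compareTo(Rec o) { return Long.compare(key, o.key); }
    }

    // Stands in for a sort job with numReducers reducers:
    // partition by key (assumed: key mod numReducers), then sort each partition.
    static List<List<Rec>> sortJob(List<Rec> data, int numReducers) {
        List<List<Rec>> parts = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) parts.add(new ArrayList<>());
        for (Rec r : data) {
            parts.get((int) Math.floorMod(r.key, (long) numReducers)).add(r);
        }
        for (List<Rec> p : parts) Collections.sort(p);
        return parts;
    }

    // Stands in for one join mapper: inner merge join of two sorted partitions.
    static List<String> joinPartition(List<Rec> a, List<Rec> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            long ka = a.get(i).key, kb = b.get(j).key;
            if (ka < kb) { i++; }
            else if (ka > kb) { j++; }
            else {
                // Equal keys: emit every pairing of the two runs.
                int jStart = j;
                while (i < a.size() && a.get(i).key == ka) {
                    for (j = jStart; j < b.size() && b.get(j).key == ka; j++) {
                        out.add(ka + "\t" + a.get(i).value + "\t" + b.get(j).value);
                    }
                    i++;
                }
            }
        }
        return out;
    }

    // One "mapper" per partition pair; unequal partition counts are rejected up front.
    static List<String> join(List<List<Rec>> a, List<List<Rec>> b) {
        if (a.size() != b.size()) {
            throw new IllegalArgumentException(
                "partition counts differ: " + a.size() + " vs " + b.size());
        }
        List<String> out = new ArrayList<>();
        for (int p = 0; p < a.size(); p++) {
            out.addAll(joinPartition(a.get(p), b.get(p)));
        }
        return out;
    }
}
```

Calling join(sortJob(a, 2), sortJob(b, 2)) joins partition 0 of A with partition 0 of B and partition 1 with partition 1, which only works because equal keys land in the same partition number on both sides. Sorting B with 3 reducers instead makes join fail before any record is read.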

Setting up a join is kind of painful, mainly because you have to use the old API for the mapper and reducer if your Hadoop version is older than 0.21.x.

This document describes very well how it works; scroll all the way to the bottom. Sadly, this documentation is somehow missing from the latest Hadoop docs.

Another good reference is "Hadoop: The Definitive Guide", which explains all of this in more detail and with examples.

I think you're missing the point. You don't control the number of mappers. It's the number of reducers that you have control over. Simply emit the correct keys from your mapper. Then run 10 reducers.
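
A minimal sketch of that reduce-side approach, again in plain Java rather than Hadoop code: every record is emitted under its join key with a tag naming its source dataset, the shuffle groups records by key, and the reducer pairs the two sides. The class ReduceSideJoinSketch and the "A"/"B" tags are hypothetical, a TreeMap stands in for the shuffle, and the number of reducers is not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of a reduce-side inner join.
class ReduceSideJoinSketch {
    static List<String> join(List<Map.Entry<Long, String>> a,
                             List<Map.Entry<Long, String>> b) {
        // "Map" phase: emit (key, tagged value). The TreeMap plays the role
        // of the shuffle, grouping all values that share a key.
        Map<Long, List<String>> shuffled = new TreeMap<>();
        for (Map.Entry<Long, String> e : a) {
            shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add("A" + e.getValue());
        }
        for (Map.Entry<Long, String> e : b) {
            shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add("B" + e.getValue());
        }

        // "Reduce" phase: split each group by tag, then emit every A/B pairing.
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, List<String>> group : shuffled.entrySet()) {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (String tagged : group.getValue()) {
                (tagged.charAt(0) == 'A' ? left : right).add(tagged.substring(1));
            }
            for (String l : left) {
                for (String r : right) {
                    out.add(group.getKey() + "\t" + l + "\t" + r);
                }
            }
        }
        return out;
    }
}
```

Unlike the map-side variant, this needs no pre-sorted input and no matching file counts, at the cost of sending both datasets through the shuffle.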
