Join of two datasets in MapReduce/Hadoop


Use a mapper that outputs tiles as keys and points/lines as values. You have to differentiate between the point output values and the line output values; for instance, you can prefix each value with a special character (even though a binary flag would be cleaner).

So the map output will be something like:

    tile0, _point0
    tile1, _point0
    tile2, _point1
    ...
    tileX, *lineL
    tileY, *lineK
    ...
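
A minimal mapper sketch along these lines (using the newer mapreduce API) could look like the following; the tab-separated "tileId<TAB>pointData" record layout and the class name are assumptions about your data:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (tile, "_" + point); assumes records look like "tileId<TAB>pointData".
    public class PointMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = record.toString().split("\t", 2);
            // tag the value so the reducer can tell points from lines
            ctx.write(new Text(fields[0]), new Text("_" + fields[1]));
        }
    }

    // A LineMapper would be symmetric, tagging its values with "*" instead of "_".

In the driver you could wire each dataset to its own mapper with MultipleInputs.addInputPath(job, path, TextInputFormat.class, PointMapper.class) and likewise for the line dataset.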

Then, at the reducer, your input will have this structure:

    tileX, [*lineK, ..., _pointP, ..., *lineM, ..., _pointR]

and you will have to take the values, separate the points from the lines, compute their cross product and output each pair of the cross product, like this:

    tileX (lineK, pointP)
    tileX (lineK, pointR)
    ...

If you can already easily differentiate between the point values and the line values (depending on your application's specifics), you don't need the special characters (*, _).

Regarding the cross product that you have to do in the reducer: first iterate through the entire values list and separate it into two lists:

    List<String> points;
    List<String> lines;

Then do the cross product using two nested for loops, iterate through the resulting pairs, and for each one output (see the reducer sketch below):

    tile (current key), element_of_the_resulting_cross_product_list
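
A reducer along these lines might look like the sketch below; it assumes Text values tagged with "_" for points and "*" for lines, as in the mapper above:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TileJoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text tile, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> points = new ArrayList<>();
            List<String> lines = new ArrayList<>();

            // Separate the tagged values into points and lines
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("_")) {
                    points.add(v.substring(1));
                } else {
                    lines.add(v.substring(1));
                }
            }

            // Cross product: emit one record per (line, point) pair
            for (String line : lines) {
                for (String point : points) {
                    ctx.write(tile, new Text("(" + line + ", " + point + ")"));
                }
            }
        }
    }

This sketch buffers both sides per tile for simplicity; with a secondary sort that guarantees, say, all points arrive before the lines, you would only need to buffer the points, which is the point made below about holding just one side in memory.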


So basically you have two options here: reduce-side join or map-side join.

Here your group key is "tile". In a single reducer call you are going to get all the point and line values for that tile, but you will have to cache either the points or the lines in an array. If either side (points or lines) is so large that it cannot fit in memory for a single group key (each unique tile), then this method will not work for you. Remember you don't have to hold both sides for a single group key ("tile") in memory; one will be sufficient.

If both sides are large for a single group key, then you will have to try a map-side join. It has some peculiar requirements (both inputs must be sorted and partitioned identically on the join key), but you can fulfill them by pre-processing your data with MapReduce jobs that run an equal number of reducers over both datasets.
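
For illustration, a driver for such a map-side join using Hadoop's CompositeInputFormat (newer mapreduce API) might be configured roughly like this; the input/output paths are placeholders, and both inputs are assumed to already be sorted by tile and identically partitioned by the pre-processing jobs:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapSideJoinDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Both inputs must already be sorted by tile and written with the same
            // partitioner and the same number of reducers (the pre-processing above).
            conf.set(CompositeInputFormat.JOIN_EXPR,
                    CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                            new Path(args[0]), new Path(args[1])));

            Job job = Job.getInstance(conf, "tile map-side join");
            job.setJarByClass(MapSideJoinDriver.class);
            job.setInputFormatClass(CompositeInputFormat.class);
            // Each map() call then receives (tile, TupleWritable) pairs in which the
            // matching point and line records are already paired; no reduce phase.
            job.setNumReduceTasks(0);
            FileOutputFormat.setOutputPath(job, new Path(args[2]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }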