Join of two datasets in MapReduce/Hadoop


Use a mapper that outputs tiles as keys and points/lines as values. You have to differentiate between the point output values and the line output values; for instance, you can prefix each value with a special character (even though a binary flag would be cleaner).

So the map output will be something like:

    tile0, _point0
    tile1, _point0
    tile2, _point1
    ...
    tileX, *lineL
    tileY, *lineK
    ...
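
A minimal mapper sketch along these lines (using the newer mapreduce API) could look like the following; the tab-separated "tileId<TAB>pointData" record layout and the class name are assumptions about your data:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (tile, "_" + point); assumes records look like "tileId<TAB>pointData".
    public class PointMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = record.toString().split("\t", 2);
            // tag the value so the reducer can tell points from lines
            ctx.write(new Text(fields[0]), new Text("_" + fields[1]));
        }
    }

    // A LineMapper would be symmetric, tagging its values with "*" instead of "_".

In the driver you could wire each dataset to its own mapper with MultipleInputs.addInputPath(job, path, TextInputFormat.class, PointMapper.class) and likewise for the line dataset.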

Then, at the reducer, your input will have this structure:

    tileX, [*lineK, ..., _pointP, ..., *lineM, ..., _pointR]

and you will have to take the values, separate the points from the lines, compute their cross product and output each pair of the cross product, like this:

    tileX (lineK, pointP)
    tileX (lineK, pointR)
    ...

If you can already easily differentiate between the point values and the line values (depending on your application's specifics), you don't need the special characters (*, _).

Regarding the cross product that you have to do in the reducer: first iterate through the entire values list and separate it into two lists:

    List<String> points;
    List<String> lines;

Then do the cross product using two nested for loops, iterate through the resulting pairs, and for each one output (see the reducer sketch below):

    tile (current key), element_of_the_resulting_cross_product_list
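
A reducer along these lines might look like the sketch below; it assumes Text values tagged with "_" for points and "*" for lines, as in the mapper above:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TileJoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text tile, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> points = new ArrayList<>();
            List<String> lines = new ArrayList<>();

            // Separate the tagged values into points and lines
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("_")) {
                    points.add(v.substring(1));
                } else {
                    lines.add(v.substring(1));
                }
            }

            // Cross product: emit one record per (line, point) pair
            for (String line : lines) {
                for (String point : points) {
                    ctx.write(tile, new Text("(" + line + ", " + point + ")"));
                }
            }
        }
    }

This sketch buffers both sides per tile for simplicity; with a secondary sort that guarantees, say, all points arrive before the lines, you would only need to buffer the points, which is the point made below about holding just one side in memory.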


So basically you have two options here: reduce-side join or map-side join.

Here your group key is "tile". In a single reducer call you are going to get all the point and line values for that tile, but you will have to cache either the points or the lines in an array. If either side (points or lines) is so large that it cannot fit in memory for a single group key (each unique tile), then this method will not work for you. Remember you don't have to hold both sides for a single group key ("tile") in memory; one will be sufficient.

If both sides are large for a single group key, then you will have to try a map-side join. It has some peculiar requirements (both inputs must be sorted and partitioned identically on the join key), but you can fulfill them by pre-processing your data with MapReduce jobs that run an equal number of reducers over both datasets.
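
For illustration, a driver for such a map-side join using Hadoop's CompositeInputFormat (newer mapreduce API) might be configured roughly like this; the input/output paths are placeholders, and both inputs are assumed to already be sorted by tile and identically partitioned by the pre-processing jobs:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapSideJoinDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Both inputs must already be sorted by tile and written with the same
            // partitioner and the same number of reducers (the pre-processing above).
            conf.set(CompositeInputFormat.JOIN_EXPR,
                    CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                            new Path(args[0]), new Path(args[1])));

            Job job = Job.getInstance(conf, "tile map-side join");
            job.setJarByClass(MapSideJoinDriver.class);
            job.setInputFormatClass(CompositeInputFormat.class);
            // Each map() call then receives (tile, TupleWritable) pairs in which the
            // matching point and line records are already paired; no reduce phase.
            job.setNumReduceTasks(0);
            FileOutputFormat.setOutputPath(job, new Path(args[2]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }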