How to remove duplicate columns after a JOIN in Pig?


I faced the same issue when joining data sets and doing other data processing in Pig: column names get repeated in the output.

So I wrote a UDF that removes duplicate columns based on the field names in the schema, keeping the data from the first occurrence of each name.

Prerequisites:

Every field must have a name in the schema.

You need to download the UDF source file and build it into a jar before you can use it.

UDF file location on GitHub: GitHub UDF Java File Location

We will use the question above as the example.

-- Data Set A contains this data
-- 1,5.3
-- 2,4.9
-- 3,4.9

-- Data Set B contains this data
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,Akhila,3.3,IT,C,1.3,0.3

PIG Script:

REGISTER /home/user/
DSA = LOAD '/home/user/DSALOC' AS (ROLLNO:int, CGPA:float);
DSB = LOAD '/home/user/DSBLOC' AS (ROLLNO:int, NAME:chararray, SUB1:float, BRANCH:chararray, GRADE:chararray, SUB2:float);
JOINOP = JOIN DSA BY ROLLNO, DSB BY ROLLNO;

After the join, the column names are:

DSA::ROLLNO:int, DSA::CGPA:float, DSB::ROLLNO:int, DSB::NAME:chararray, DSB::SUB1:float, DSB::BRANCH:chararray, DSB::GRADE:chararray, DSB::SUB2:float

What we want instead is the same schema with DSB::ROLLNO:int removed:

DSA::ROLLNO:int, DSA::CGPA:float, DSB::NAME:chararray, DSB::SUB1:float, DSB::BRANCH:chararray, DSB::GRADE:chararray, DSB::SUB2:float

To do this, apply the UDF as follows:

JOINOP_NODUPLICATES = FOREACH JOINOP GENERATE FLATTEN(org.imagine.REMOVEDUPLICATECOLUMNS(*));

Here org.imagine.REMOVEDUPLICATECOLUMNS is the UDF.

The UDF removes duplicate columns by their names in the schema, so DSA::ROLLNO:int is retained and DSB::ROLLNO:int is removed from the data set.
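To make the idea concrete, here is a minimal sketch of name-based de-duplication in plain Java. It is not the code of the UDF linked above and does not use the Pig API. The class DedupColumnsSketch and its methods are hypothetical names, and it assumes that two columns count as duplicates when their names match after the relation prefix (the part up to "::") is stripped. The sample row is the first joined record from the example, limited to the fields declared in the LOAD schemas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: plain Java lists stand in for a Pig
// schema and tuple. This is not the UDF from the answer.
final class DedupColumnsSketch {

    // Strip a relation prefix, e.g. "DSA::ROLLNO" -> "ROLLNO".
    // Assumption: duplicates are decided on this base name.
    static String baseName(String alias) {
        int idx = alias.lastIndexOf("::");
        return idx < 0 ? alias : alias.substring(idx + 2);
    }

    // Positions of the columns to keep: the first occurrence of each
    // base name, in the original column order.
    static List<Integer> keptIndexes(List<String> aliases) {
        Set<String> seen = new HashSet<>();
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < aliases.size(); i++) {
            // Set.add returns false when the name was already seen.
            if (seen.add(baseName(aliases.get(i)))) {
                keep.add(i);
            }
        }
        return keep;
    }

    // Project one row onto the kept positions.
    static List<Object> project(List<?> row, List<Integer> keep) {
        List<Object> out = new ArrayList<>();
        for (int i : keep) {
            out.add(row.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList(
                "DSA::ROLLNO", "DSA::CGPA", "DSB::ROLLNO", "DSB::NAME",
                "DSB::SUB1", "DSB::BRANCH", "DSB::GRADE", "DSB::SUB2");
        List<Object> row = Arrays.<Object>asList(
                1, 5.3f, 1, "Anju", 3.6f, "IT", "A", 1.6f);

        List<Integer> keep = keptIndexes(schema);
        System.out.println(keep);               // [0, 1, 3, 4, 5, 6, 7]
        System.out.println(project(row, keep)); // [1, 5.3, Anju, 3.6, IT, A, 1.6]
    }
}
```

Index 2 (DSB::ROLLNO) is dropped because ROLLNO was already seen at index 0, which matches the behaviour described above. In a real Pig UDF the same decision would be made once from the input schema and then applied to every tuple.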