How to remove duplicate columns after a JOIN in Pig?

java hadoop join apache-pig

I faced the same issue while joining data sets and doing other data processing where column names end up repeated in the output.

So I wrote a UDF that removes duplicate columns by looking at each field's name in the schema, keeping the data of the first occurrence of each name.

Prerequisites:

Every field must have a name in the schema.

You need to download the UDF source file and build it into a jar before you can use it.

UDF file location on GitHub: GitHub UDF Java File Location

We will take the question above as the example.

-- Data Set A contains this data
-- 1,5.3
-- 2,4.9
-- 3,4.9
-- Data Set B contains this data
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,Akhila,3.3,IT,C,1.3,0.3

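Pig script:

-- Register the jar built from the UDF source
REGISTER /home/user/
-- The sample data is comma-separated, so the delimiter is passed to PigStorage explicitly
DSA = LOAD '/home/user/DSALOC' USING PigStorage(',') AS (ROLLNO:int, CGPA:float);
DSB = LOAD '/home/user/DSBLOC' USING PigStorage(',') AS (ROLLNO:int, NAME:chararray, SUB1:float, BRANCH:chararray, GRADE:chararray, SUB2:float);
JOINOP = JOIN DSA BY ROLLNO, DSB BY ROLLNO;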

After the join, the column names are: DSA::ROLLNO:int, DSA::CGPA:float, DSB::ROLLNO:int, DSB::NAME:chararray, DSB::SUB1:float, DSB::BRANCH:chararray, DSB::GRADE:chararray, DSB::SUB2:float

What we want is: DSA::ROLLNO:int, DSA::CGPA:float, DSB::NAME:chararray, DSB::SUB1:float, DSB::BRANCH:chararray, DSB::GRADE:chararray, DSB::SUB2:float

That is, DSB::ROLLNO:int is removed.

To get that, apply the UDF like this:

JOINOP_NODUPLICATES = FOREACH JOINOP GENERATE FLATTEN(org.imagine.REMOVEDUPLICATECOLUMNS(*));

Where org.imagine.REMOVEDUPLICATECOLUMNS is the UDF.

The UDF removes duplicate columns based on the field name in the schema, so DSA::ROLLNO:int is retained and DSB::ROLLNO:int is removed from the dataset.
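To make the idea concrete, here is a minimal plain-Java sketch of the "keep the first occurrence of each field name" logic. It is not the actual UDF source and does not use the Pig API. The class and method names are made up for illustration, and it assumes two columns count as duplicates when their names match after stripping the relation prefix (the part before "::").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; the real UDF would extend Pig's EvalFunc instead.
class DuplicateColumnSketch {

    // Strip the relation prefix that Pig adds after a JOIN: "DSB::ROLLNO" -> "ROLLNO".
    // (Assumption: this is how the field names are compared.)
    static String baseName(String alias) {
        int idx = alias.lastIndexOf("::");
        return idx < 0 ? alias : alias.substring(idx + 2);
    }

    // Positions of the columns to keep: the first occurrence of each base name.
    static List<Integer> positionsToKeep(List<String> aliases) {
        Set<String> seen = new HashSet<>();
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < aliases.size(); i++) {
            // Set.add returns false when the name was already seen.
            if (seen.add(baseName(aliases.get(i)))) {
                keep.add(i);
            }
        }
        return keep;
    }

    // Project a row (or the alias list itself) onto the kept positions.
    static <T> List<T> project(List<T> row, List<Integer> keep) {
        List<T> out = new ArrayList<>();
        for (int i : keep) {
            out.add(row.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> aliases = Arrays.asList(
                "DSA::ROLLNO", "DSA::CGPA", "DSB::ROLLNO", "DSB::NAME",
                "DSB::SUB1", "DSB::BRANCH", "DSB::GRADE", "DSB::SUB2");
        List<String> row = Arrays.asList(
                "1", "5.3", "1", "Anju", "3.6", "IT", "A", "1.6");

        List<Integer> keep = positionsToKeep(aliases);
        System.out.println(project(aliases, keep)); // schema without DSB::ROLLNO
        System.out.println(project(row, keep));     // same row without the second ROLLNO value
    }
}
```

The kept positions are computed once from the schema and then applied to every row, which is why every field needs a name for this approach to work.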
