hadoop hive using row_number()
Given:
create table dupes (id string,democode string,extract_timestamp string);
And:
insert into dupes (id, democode,extract_timestamp) values ('1','code','2020'),('2','code2','2020'),('2','code22','2021'),('3','code3','2020'),('3','code33','2021'),('3','code333','2012');
When:
SELECT id,democode,extract_timestampFROM ( SELECT id,democode,extract_timestamp, ROW_NUMBER() OVER (PARTITION BY id ORDER BY extract_timestamp DESC) AS row_num FROM dupes ) t1WHERE row_num = 1;
Then:
+-----+-----------+--------------------+--+| id | democode | extract_timestamp |+-----+-----------+--------------------+--+| 1 | code | 2020 || 2 | code22 | 2021 || 3 | code33 | 2021 |+-----+-----------+--------------------+--+
Note that often tables are partitioned and that we might want to deduplicate within each partition. In which case we would add the partition key(s) into the OVER
statement. For example if the table was partition by report_date DATE
then we might use:
ROW_NUMBER() OVER (PARTITION BY id, report_date ORDER BY extract_timestamp DESC) AS row_num