How can I remove duplicate rows? sql-server

How can I remove duplicate rows?


Assuming no nulls, you GROUP BY the unique columns and SELECT the MIN (or MAX) RowId as the row to keep. Then just delete every row whose RowId is not among the ones kept:

-- The target table is named first; the FROM clause then carries the join
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
   SELECT MIN(RowId) as RowId, Col1, Col2, Col3
   FROM MyTable
   GROUP BY Col1, Col2, Col3
) as KeepRows ON
   MyTable.RowId = KeepRows.RowId
WHERE
   KeepRows.RowId IS NULL
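
To try the GROUP BY approach on throwaway data first, something like the following sketch may help. The temp table #DedupeDemo and its sample values are made up for illustration, and the multi-row VALUES syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical demo table; column names mirror the placeholders used above.
CREATE TABLE #DedupeDemo (
    RowId INT IDENTITY(1,1) PRIMARY KEY,
    Col1  INT NOT NULL,
    Col2  INT NOT NULL,
    Col3  INT NOT NULL
);

-- Six rows in three groups: one group of 3, one of 2, one unique row.
INSERT INTO #DedupeDemo (Col1, Col2, Col3)
VALUES (1, 1, 1), (1, 1, 1), (1, 1, 1),
       (2, 2, 2), (2, 2, 2),
       (3, 3, 3);

-- Delete every row that is not the lowest RowId of its group.
DELETE d
FROM #DedupeDemo AS d
LEFT OUTER JOIN (
    SELECT MIN(RowId) AS RowId
    FROM #DedupeDemo
    GROUP BY Col1, Col2, Col3
) AS KeepRows ON d.RowId = KeepRows.RowId
WHERE KeepRows.RowId IS NULL;

-- One row per (Col1, Col2, Col3) combination should remain.
SELECT RowId, Col1, Col2, Col3
FROM #DedupeDemo
ORDER BY RowId;

DROP TABLE #DedupeDemo;
```

With this sample data, three of the six rows are deleted and three remain, one per group.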

In case you have a GUID instead of an integer, you can replace

MIN(RowId)

with

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
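
Put together, the GUID variant might look like the sketch below. It assumes the table is keyed by a uniqueidentifier column called MyGuidColumn (the name used above); the alias KeepGuid is made up for illustration. Which duplicate survives is effectively arbitrary here, since the minimum is taken over the string form of the GUID.

```sql
-- Assumes MyTable has a uniqueidentifier key named MyGuidColumn
-- instead of an integer RowId. KeepGuid is a hypothetical alias.
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
    SELECT CONVERT(uniqueidentifier,
                   MIN(CONVERT(char(36), MyGuidColumn))) AS KeepGuid
    FROM MyTable
    GROUP BY Col1, Col2, Col3
) AS KeepRows ON MyTable.MyGuidColumn = KeepRows.KeepGuid
WHERE KeepRows.KeepGuid IS NULL;
```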


Another possible way of doing this is:

; --Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                                       ORDER BY ( SELECT 0)) RN
         FROM   #MyTable)
DELETE FROM cte
WHERE  RN > 1;

I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.

To preserve the latest one in RowID order, for example, you could use ORDER BY RowID DESC.
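
As a sketch, the "keep the latest" variant could be written as follows. It assumes #MyTable has a RowID column, which the ROW_NUMBER query does not otherwise require.

```sql
-- Keep the row with the highest RowID in each duplicate group.
-- Assumption: #MyTable has a RowID column to order by.
-- The leading semicolon terminates any preceding statement.
;WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                              ORDER BY RowID DESC) AS RN
    FROM #MyTable
)
DELETE FROM cte
WHERE RN > 1;
```

Swapping DESC for ASC would keep the earliest row instead.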

Execution Plans

The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.

This is not always the case, however. One place where the GROUP BY solution might be preferred is in situations where a hash aggregate would be chosen in preference to a stream aggregate.

The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.

Factors which might favour the hash aggregate approach would be:

  • No useful index on the partitioning columns
  • Relatively fewer groups with relatively more duplicates in each group

In extreme versions of this second case (very few groups with many duplicates in each), one could also consider inserting the rows to keep into a new table, then TRUNCATE-ing the original and copying them back. This minimises logging compared to deleting a very high proportion of the rows.
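
A rough sketch of that copy, truncate and copy-back idea is shown below. It rests on several assumptions that are not in the original: the table consists only of Col1, Col2 and Col3 (so SELECT DISTINCT is enough to pick the rows to keep), no foreign keys reference it (TRUNCATE TABLE would fail otherwise), and nothing else writes to it while this runs. The staging table #KeepRows is a made-up name.

```sql
-- Sketch only; see the assumptions listed above.
BEGIN TRANSACTION;

-- 1. Copy one row per group into a hypothetical staging table.
SELECT DISTINCT Col1, Col2, Col3
INTO #KeepRows
FROM MyTable;

-- 2. Empty the original table.
TRUNCATE TABLE MyTable;

-- 3. Copy the de-duplicated rows back.
INSERT INTO MyTable (Col1, Col2, Col3)
SELECT Col1, Col2, Col3
FROM #KeepRows;

COMMIT TRANSACTION;

DROP TABLE #KeepRows;
```

If the table has further columns, the staging step would need a different way of choosing the row to keep, for example the ROW_NUMBER approach above with a filter on RN = 1.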


There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.

I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:

DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
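
Before running a delete like this, it can be worth listing the duplicate groups to see what will be affected. The query below is only a check, not a HAVING-based rewrite of the delete; it reuses the column names from the snippet above, and the alias copies is made up for illustration.

```sql
-- List each combination that occurs more than once, with its row count.
SELECT dupField, secondDupField, COUNT(*) AS copies
FROM MyTable
GROUP BY dupField, secondDupField
HAVING COUNT(*) > 1;
```

Running it again after the delete should return no rows if the duplicates are gone.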