How can I remove duplicate rows?

sql-server tsql duplicates

Assuming no nulls, you GROUP BY the columns that define a duplicate and SELECT the MIN (or MAX) RowId of each group as the row to keep. Then just delete every row whose RowId is not among the ones kept:

DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
   SELECT MIN(RowId) as RowId, Col1, Col2, Col3
   FROM MyTable
   GROUP BY Col1, Col2, Col3
) as KeepRows ON
   MyTable.RowId = KeepRows.RowId
WHERE
   KeepRows.RowId IS NULL
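To see which rows a delete of this shape would hit, the same join can be run as a SELECT first. The following is only a sketch: the temp table #Demo and its sample values are made up for illustration, and the multi-row VALUES list assumes SQL Server 2008 or later.

```sql
-- Hypothetical demo table; the name #Demo and the sample values are made up.
CREATE TABLE #Demo (RowId int PRIMARY KEY, Col1 int, Col2 int, Col3 int);

INSERT INTO #Demo (RowId, Col1, Col2, Col3)
VALUES (1, 1, 1, 1), (2, 1, 1, 1),
       (3, 2, 2, 2), (4, 2, 2, 2), (5, 2, 2, 2),
       (6, 3, 3, 3);

-- Same join as the delete, written as a SELECT:
-- lists the rows that would be removed.
SELECT t.RowId, t.Col1, t.Col2, t.Col3
FROM #Demo AS t
LEFT OUTER JOIN (
    SELECT MIN(RowId) AS RowId, Col1, Col2, Col3
    FROM #Demo
    GROUP BY Col1, Col2, Col3
) AS KeepRows ON
    t.RowId = KeepRows.RowId
WHERE
    KeepRows.RowId IS NULL
ORDER BY t.RowId;
-- Returns RowId 2, 4 and 5; rows 1, 3 and 6 are the ones kept.

DROP TABLE #Demo;
```

Swapping the SELECT list for a DELETE on the same table turns the preview into the actual clean-up.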

In case you have a GUID instead of an integer, you can replace

MIN(RowId)

with

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
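Put together, the GUID variant might look like the sketch below. It assumes MyGuidColumn (the name used above) is the unique key of MyTable and is of type uniqueidentifier. The round trip through char(36) is there because MIN does not accept uniqueidentifier operands directly; which GUID ends up being the "minimum" is arbitrary, which is fine as long as exactly one row per group survives.

```sql
-- Sketch: the GROUP BY / MIN pattern keyed on a uniqueidentifier column.
-- Assumes MyGuidColumn uniquely identifies each row of MyTable.
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
    SELECT CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn))) AS RowId,
           Col1, Col2, Col3
    FROM MyTable
    GROUP BY Col1, Col2, Col3
) AS KeepRows ON
    MyTable.MyGuidColumn = KeepRows.RowId
WHERE
    KeepRows.RowId IS NULL;
```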

Another possible way of doing this is

; --Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                                       ORDER BY ( SELECT 0)) RN
         FROM   #MyTable)
DELETE FROM cte
WHERE  RN > 1;

I am using ORDER BY (SELECT 0) above because it is arbitrary which row is preserved in the event of a tie.

To preserve the latest one in RowID order, for example, you could use ORDER BY RowID DESC.
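Spelled out, that variant could look like the sketch below. It assumes #MyTable has a RowID column where a larger value means a later row.

```sql
-- Sketch: keep the row with the highest RowID in each duplicate group.
-- Assumes #MyTable has a RowID column; larger RowID = later row.
;WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                                       ORDER BY RowID DESC) AS RN
         FROM   #MyTable)
DELETE FROM cte
WHERE  RN > 1;
```

Row number 1 in each partition is now the row with the largest RowID, so everything older is deleted.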

Execution Plans

The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.

This is not always the case, however. The GROUP BY solution might be preferred in situations where a hash aggregate would be chosen in preference to a stream aggregate.

The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.

Factors which might favour the hash aggregate approach would be:

No useful index on the partitioning columns
Relatively fewer groups with relatively more duplicates in each group

In extreme versions of this second case (very few groups with many duplicates in each), one could also consider simply inserting the rows to keep into a new table, TRUNCATE-ing the original, and copying them back. Compared to deleting a very high proportion of the rows, this minimises logging.
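A rough sketch of that copy-out / truncate / copy-back idea follows. The holding table #KeepRows is a hypothetical name, and the sketch assumes #MyTable has exactly the columns RowID, Col1, Col2, Col3, that RowID is not an IDENTITY column, and that nothing (such as a foreign key) prevents TRUNCATE TABLE from running.

```sql
-- Sketch only. Assumptions: #MyTable(RowID, Col1, Col2, Col3), RowID is not
-- an IDENTITY column, and TRUNCATE TABLE is allowed on the table.
SET XACT_ABORT ON;  -- roll the whole transaction back on a run-time error

BEGIN TRANSACTION;

-- 1. Copy one row per duplicate group into a holding table (hypothetical name).
SELECT RowID, Col1, Col2, Col3
INTO   #KeepRows
FROM   (SELECT RowID, Col1, Col2, Col3,
               ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                                      ORDER BY (SELECT 0)) AS RN
        FROM   #MyTable) AS numbered
WHERE  RN = 1;

-- 2. Empty the original table.
TRUNCATE TABLE #MyTable;

-- 3. Copy the kept rows back.
INSERT INTO #MyTable (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3
FROM   #KeepRows;

COMMIT TRANSACTION;

DROP TABLE #KeepRows;
```

Whether this actually beats a plain DELETE depends on the proportion of rows removed, so it is worth measuring on a copy of the data first.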


There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.

I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:

DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
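One possible reading of the HAVING remark is a grouped query that first lists the duplicate groups, so you can see what the self-join delete is going to touch. This is a sketch using the same placeholder column names as above; the aliases copies and keptUniqueField are made up.

```sql
-- Sketch: list each duplicate group, how many copies it has, and the
-- uniqueField value that the self-join delete above would leave in place.
SELECT dupField,
       secondDupField,
       COUNT(*)         AS copies,
       MIN(uniqueField) AS keptUniqueField
FROM MyTable
GROUP BY dupField, secondDupField
HAVING COUNT(*) > 1;
```

The self-join deletes every row that has a matching row with a smaller uniqueField, so the survivor in each group is the one with the minimum uniqueField, which is what keptUniqueField reports.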
