How to delete duplicate entries?

sql postgresql duplicate-removal unique-constraint sql-delete

Some of these approaches seem a little complicated, and I generally do this as:

Given table table, want to unique it on (field1, field2) keeping the row with the max field3:

DELETE FROM table USING table alias   WHERE table.field1 = alias.field1 AND table.field2 = alias.field2 AND    table.max_field < alias.max_field

For example, I have a table, user_accounts, and I want to add a unique constraint on email, but I have some duplicates. Say also that I want to keep the most recently created one (max id among duplicates).

DELETE FROM user_accounts USING user_accounts ua2  WHERE user_accounts.email = ua2.email AND user_account.id < ua2.id;

Note - USING is not standard SQL, it is a PostgreSQL extension (but a very useful one), but the original question specifically mentions PostgreSQL.

sql postgresql duplicate-removal unique-constraint sql-delete

For example you could:

CREATE TABLE tmp ...INSERT INTO tmp SELECT DISTINCT * FROM t;DROP TABLE t;ALTER TABLE tmp RENAME TO t;

sql postgresql duplicate-removal unique-constraint sql-delete

Instead of creating a new table, you can also re-insert unique rows into the same table after truncating it. Do it all in one transaction.

This approach is only useful where there are lots of rows to delete from all over the table. For just a few duplicates, use a plain DELETE.

You mentioned millions of rows. To make the operation fast you want to allocate enough temporary buffers for the session. The setting has to be adjusted before any temp buffer is used in your current session. Find out the size of your table:

SELECT pg_size_pretty(pg_relation_size('tbl'));

Set temp_buffers at least a bit above that.

SET temp_buffers = 200MB;   -- example valueBEGIN;CREATE TEMP TABLE t_tmp AS  -- retains temp for duration of sessionSELECT DISTINCT * FROM tbl  -- DISTINCT folds duplicatesORDER  BY id;               -- optionally "cluster" dataTRUNCATE tbl;INSERT INTO tblSELECT * FROM t_tmp;        -- retains order (implementation detail)COMMIT;

This method can be superior to creating a new table if depending objects exist. Views, indexes, foreign keys or other objects referencing the table. TRUNCATE makes you begin with a clean slate anyway (new file in the background) and is much faster than DELETE FROM tbl with big tables (DELETE can actually be faster with small tables).

For big tables, it is regularly faster to drop indexes and foreign keys (FK), refill the table and recreate these objects. As far as FK constraints are concerned you have to be certain the new data is valid, of course, or you'll run into exceptions on trying to create the FK.

Note that TRUNCATE requires more aggressive locking than DELETE. This may be an issue for tables with heavy, concurrent load. But it's still less disruptive than to drop and replace the table completely.

If TRUNCATE is not an option or generally for small to medium tables there is a similar technique with a data-modifying CTE (Postgres 9.1+):

WITH del AS (DELETE FROM tbl RETURNING *)INSERT INTO tblSELECT DISTINCT * FROM del;ORDER  BY id; -- optionally "cluster" data while being at it.

Slower for big tables, because TRUNCATE is faster there. But may be faster (and simpler!) for small tables.

If you have no depending objects at all, you might create a new table and delete the old one, but you hardly gain anything over this universal approach.

For very big tables that would not fit into available RAM, creating a new table will be considerably faster. You'll have to weigh this against possible troubles / overhead with depending objects.

CodeHunter

How to delete duplicate entries?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last