Best way to delete millions of rows by ID



It all depends ...

  • This assumes no concurrent write access to the involved tables; otherwise you may have to lock tables exclusively, or this route may not be for you at all.

  • Drop all indexes (possibly except the ones needed for the delete itself).
    Recreate them afterwards. That's typically much faster than incremental updates to indexes.

  • Check if you have triggers that can safely be deleted / disabled temporarily.

  • Do foreign keys reference your table? Can they be dropped? Dropped temporarily?

  • Depending on your autovacuum settings it may help to run VACUUM ANALYZE before the operation.

  • Some of the points listed in the related chapter of the manual Populating a Database may also be of use, depending on your setup.

  • If you delete large portions of the table and the rest fits into RAM, the fastest and easiest way may be this:

BEGIN;                    -- typically faster and safer wrapped in a single transaction
SET LOCAL temp_buffers = '1000MB';  -- enough to hold the temp table

CREATE TEMP TABLE tmp AS
SELECT t.*
FROM   tbl t
LEFT   JOIN del_list d USING (id)
WHERE  d.id IS NULL;      -- copy surviving rows into temporary table

TRUNCATE tbl;             -- empty table - truncate is very fast for big tables

INSERT INTO tbl
SELECT * FROM tmp;        -- insert back surviving rows
-- ORDER BY ?             -- optionally order favorably while being at it

COMMIT;

This way you don't have to recreate views, foreign keys or other depending objects. And you get a pristine (sorted) table without bloat.

Read about the temp_buffers setting in the manual. This method is fast as long as the table fits into memory, or at least most of it. The transaction wrapper defends against losing data if your server crashes in the middle of this operation.

Run VACUUM ANALYZE afterwards. Or VACUUM FULL ANALYZE if you want to bring the table to minimum size (takes an exclusive lock). For big tables consider the alternatives CLUSTER / pg_repack or similar.
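If you take the drop-and-recreate-indexes route from the list above, the skeleton might look like this. All index, constraint, table and column names here are made up for illustration; adapt them to your schema:

```sql
-- Illustrative sketch: drop secondary indexes and referencing FKs,
-- run the bulk delete, then recreate everything.
DROP INDEX tbl_foo_idx;                              -- secondary index on tbl
ALTER TABLE child_tbl DROP CONSTRAINT child_tbl_fk;  -- FK referencing tbl

DELETE FROM tbl t
USING  del_list d
WHERE  t.id = d.id;

CREATE INDEX tbl_foo_idx ON tbl (foo);               -- recreate afterwards
ALTER TABLE child_tbl
   ADD CONSTRAINT child_tbl_fk FOREIGN KEY (tbl_id) REFERENCES tbl (id);

VACUUM ANALYZE tbl;
```

Recreating the FK constraint at the end re-validates all referencing rows, so that step itself can take a while on a big child table.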

For small tables, a simple DELETE instead of TRUNCATE is often faster:

DELETE FROM tbl t
USING  del_list d
WHERE  t.id = d.id;

Read the Notes section for TRUNCATE in the manual. In particular (as Pedro also pointed out in his comment):

TRUNCATE cannot be used on a table that has foreign-key references from other tables, unless all such tables are also truncated in the same command. [...]

And:

TRUNCATE will not fire any ON DELETE triggers that might exist for the tables.


We know that PostgreSQL's update/delete performance is not as strong as Oracle's. When we need to delete millions or tens of millions of rows, it's really difficult and takes a long time.

However, we can still do this in production dbs. The following is my idea:

First, we should create a log table with 2 columns: id and flag (id refers to the id you want to delete; flag can be Y or null, with Y signifying that the record was successfully deleted).

Later, we create a function that does the delete task in batches of 10,000 rows. You can see more details on my blog. Though it's in Chinese, you can still get the info you want from the SQL code there.

Make sure the id columns of both tables are indexed, as that will make it run faster.
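The batching idea above might look something like this sketch. The table, column and procedure names are illustrative (not taken from the original blog), and the COMMIT between batches requires a procedure, so PostgreSQL 11 or later:

```sql
-- Hypothetical sketch of the batched approach described above.
-- del_log holds the ids to delete; flag is set to 'Y' once a row is handled.
CREATE TABLE del_log (id bigint PRIMARY KEY, flag char(1));

CREATE OR REPLACE PROCEDURE delete_in_batches(batch_size int DEFAULT 10000)
LANGUAGE plpgsql AS
$$
DECLARE
   rows_done int;
BEGIN
   LOOP
      -- pick one batch of unflagged ids, delete the matching rows,
      -- then mark those ids as done
      WITH batch AS (
         SELECT id FROM del_log WHERE flag IS NULL ORDER BY id LIMIT batch_size
      ),
      del AS (
         DELETE FROM tbl t USING batch WHERE t.id = batch.id
      )
      UPDATE del_log l SET flag = 'Y' FROM batch WHERE l.id = batch.id;

      GET DIAGNOSTICS rows_done = ROW_COUNT;  -- ids flagged in this batch
      EXIT WHEN rows_done = 0;

      COMMIT;  -- release locks and let autovacuum keep up between batches
   END LOOP;
END
$$;

CALL delete_in_batches(10000);
```

Committing per batch keeps each transaction short, which is the main point of this approach on a busy production database.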


I just hit this issue myself, and for me the fastest method by far was using a WITH query in combination with USING.

Basically, the WITH query builds a temporary result set holding the primary keys of the rows you want to delete from the target table.

WITH to_delete AS (
   SELECT item_id FROM other_table WHERE condition_x = true
)
DELETE FROM table USING to_delete
WHERE  table.item_id = to_delete.item_id
   AND NOT to_delete.item_id IS NULL;

Of course, the SELECT inside the WITH query can be as complex as any other SELECT, with multiple joins etc. It just has to return one or more columns that identify the rows in the target table that need to be deleted.

NOTE: AND NOT to_delete.item_id IS NULL most likely is not necessary, but I didn't dare to try.
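As an example of a more complex identifying query, the CTE can join several tables. The table and column names below are invented purely for illustration:

```sql
-- Hypothetical schema: delete items belonging to old orders
-- of inactive customers.
WITH to_delete AS (
   SELECT o.item_id
   FROM   orders o
   JOIN   customers c ON c.customer_id = o.customer_id
   WHERE  c.inactive
   AND    o.created_at < now() - interval '5 years'
)
DELETE FROM items USING to_delete
WHERE  items.item_id = to_delete.item_id;
```

The DELETE itself stays the same; only the SELECT inside the CTE changes.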

Other things to consider are:

  1. Creating indexes on the foreign-key columns of other tables that reference this one. In certain situations this can reduce a delete from hours to mere seconds.
  2. Deferring constraint checks: it's not clear how much improvement, if any, this achieves, but according to this it can increase performance. The downside is that if you have a foreign-key violation, you will learn about it only at the very last moment.
  3. DANGEROUS but potentially a big boost: disable constraint checks and triggers during the delete.
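Points 2 and 3 might be sketched as follows. Table and constraint names are placeholders; for point 2 the foreign key must have been declared DEFERRABLE, and point 3 bypasses integrity checks entirely (and typically needs superuser or table-owner rights), so use it with great care:

```sql
-- 2. Deferring constraint checks (FK must be declared DEFERRABLE):
BEGIN;
SET CONSTRAINTS ALL DEFERRED;
DELETE FROM tbl t USING del_list d WHERE t.id = d.id;
COMMIT;  -- constraints are checked here, at commit time

-- 3. DANGEROUS: disable all triggers on the table, including the
-- internal FK triggers; nothing verifies integrity while they are off.
ALTER TABLE tbl DISABLE TRIGGER ALL;
DELETE FROM tbl t USING del_list d WHERE t.id = d.id;
ALTER TABLE tbl ENABLE TRIGGER ALL;
```

If you go the route of point 3, verify referential integrity yourself afterwards; PostgreSQL will not re-check rows modified while the triggers were disabled.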