
DELETE records which do not have a match in another table


I benchmarked four typical queries, with different settings for {work_mem, effective_cache_size, random_page_cost}; these settings have the largest influence on the selected plan. I first did a warm-up run with my default settings to warm the cache. Note: the test set is small enough for all needed pages to be present in cache.

The test-set

SET search_path=tmp;
/************************/
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp;
SET search_path=tmp;

CREATE TABLE one
        ( id SERIAL NOT NULL PRIMARY KEY
        , payload varchar
        );

CREATE TABLE two
        ( id SERIAL NOT NULL PRIMARY KEY
        , one_id INTEGER REFERENCES one
        , payload varchar
        );

INSERT INTO one (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;
INSERT INTO two (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;

UPDATE two t
SET one_id = o.id
FROM one o
WHERE o.id = t.id
AND random() < 0.1;

INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;

VACUUM ANALYZE one;
VACUUM ANALYZE two;
/***************/

The queries:

\echo NOT EXISTS()
EXPLAIN ANALYZE
DELETE FROM one o
WHERE NOT EXISTS ( SELECT * FROM two t
        WHERE t.one_id = o.id
        );

\echo NOT IN()
EXPLAIN ANALYZE
DELETE FROM one o
WHERE o.id NOT IN ( SELECT one_id FROM two t)
        ;

\echo USING (subquery self LEFT JOIN two where NULL)
EXPLAIN ANALYZE
DELETE FROM one o
USING (
        SELECT o2.id
        FROM one o2
        LEFT JOIN two t ON t.one_id = o2.id
        WHERE t.one_id IS NULL
        ) sq
WHERE sq.id = o.id
        ;

\echo USING (subquery self WHERE NOT EXISTS(two))
EXPLAIN ANALYZE
DELETE FROM one o
USING (
        SELECT o2.id
        FROM one o2
        WHERE NOT EXISTS ( SELECT *
                FROM two t WHERE t.one_id = o2.id
                )
        ) sq
WHERE sq.id = o.id
        ;

The result (summarised)

(total runtime in ms)

                              NOT EXISTS()   NOT IN()    USING(LEFT JOIN NULL)   USING(NOT EXISTS)
1) rpc=4.0 csz=1M wmm=64            80.358  14389.026    77.620                  72.917
2) rpc=4.0 csz=1M wmm=64000         60.527     69.104    51.851                  51.004
3) rpc=1.5 csz=1M wmm=64            69.804  10758.480    80.402                  77.356
4) rpc=1.5 csz=1M wmm=64000         50.872     69.366    50.763                  53.339
5) rpc=4.0 csz=1G wmm=64            84.117   7625.792    69.790                  69.627
6) rpc=4.0 csz=1G wmm=64000         49.964     67.018    49.968                  49.380
7) rpc=1.5 csz=1G wmm=64            68.567   3650.008    70.283                  69.933
8) rpc=1.5 csz=1G wmm=64000         49.800     67.298    50.116                  50.345

legend:
rpc := "random_page_cost"
csz := "effective_cache_size"
wmm := "work_mem"

As you can see, the NOT IN() variant is very sensitive to a shortage of work_mem. Agreed, the setting of 64 (KB) is very low, but it more or less corresponds to large data sets, which won't fit in hash tables either.
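To reproduce a single cell of the table, the three settings can be changed per session before running a query. For example, row 1 with the NOT IN() variant (values taken from the table above):

SET random_page_cost     = 4.0;
SET effective_cache_size = '1MB';
SET work_mem             = '64kB';

EXPLAIN ANALYZE
DELETE FROM one o
WHERE o.id NOT IN ( SELECT one_id FROM two t);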

EXTRA: during the warm-up phase, the NOT EXISTS() query suffered from extreme FK-trigger contention. This appears to be the result of a conflict with the vacuum daemon, which is still active after the table set-up:

PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, 64-bit

NOT EXISTS()
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Delete on one o  (cost=6736.00..7623.94 rows=27962 width=12) (actual time=80.596..80.596 rows=0 loops=1)
   ->  Hash Anti Join  (cost=6736.00..7623.94 rows=27962 width=12) (actual time=49.174..61.327 rows=27050 loops=1)
         Hash Cond: (o.id = t.one_id)
         ->  Seq Scan on one o  (cost=0.00..463.00 rows=30000 width=10) (actual time=0.003..5.156 rows=30000 loops=1)
         ->  Hash  (cost=3736.00..3736.00 rows=240000 width=10) (actual time=49.121..49.121 rows=23600 loops=1)
               Buckets: 32768  Batches: 1  Memory Usage: 1015kB
               ->  Seq Scan on two t  (cost=0.00..3736.00 rows=240000 width=10) (actual time=0.006..33.790 rows=240000 loops=1)
 Trigger for constraint two_one_id_fkey: time=467720.117 calls=27050
 Total runtime: 467824.652 ms
(9 rows)


First off: your text says:

I would like to delete those orphan records from item_tbl.

But your code says:

DELETE FROM link_tbl lnk ...

Update: On rereading the Q I find it more likely you want to delete orphaned rows in link_tbl. The row-counts point in that direction. @Lucas' query would be correct in this case. But I am afraid NOT EXISTS is actually slower than NOT IN in this case.

To verify, I ran a test case that's remotely like your setup. I couldn't make it much bigger, or SQLfiddle would run into a timeout.

-> SQLfiddle.

NOT EXISTS would be faster for the reversed case. (I tested that, too.) EXISTS is better suited for testing the "many" side. And generally, there is more to gain with EXISTS than with NOT EXISTS - the latter has to check the whole table anyway. It's much harder to prove that something does not exist than to prove that something exists. This universal truth also applies to databases.
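For illustration, the reversed case (deleting the orphaned rows from item_tbl instead, with the table names from the question) would look like this:

DELETE FROM item_tbl i
WHERE NOT EXISTS (
   SELECT 1
   FROM   link_tbl l
   WHERE  l.item_id = i.id
   );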

Divide and conquer

This operation lends itself to being split up. Especially if you have concurrent transactions (but even without them), I would consider splitting the DELETE into several slices, so that each transaction can COMMIT after a decent amount of time.

Something like:

DELETE FROM link_tbl l
WHERE  l.item_id <= 100000
AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i);

Then l.item_id BETWEEN 100001 AND 200000, etc.

You cannot automate this with a function, as that would wrap everything into a single transaction and defeat the purpose. So you'd have to script it from any client.
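A minimal sketch of such a script, assuming the slices from above; saved to a file and run with psql -f, each statement commits on its own, because psql runs in autocommit mode by default:

DELETE FROM link_tbl l
WHERE  l.item_id <= 100000
AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i);

DELETE FROM link_tbl l
WHERE  l.item_id BETWEEN 100001 AND 200000
AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i);

-- ... one statement per slice, up to the maximum item_id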
Or you could use ...

dblink

This additional module lets you run separate transactions in any database, including the one it's running in. And that can be done via a persistent connection, which should remove most of the connection overhead. For instructions on how to install it:
How to use (install) dblink in PostgreSQL?

DO would do the job (PostgreSQL 9.0 or later), running 100 DELETE commands for 50000 item_id values at a time:

DO
$$
DECLARE
   _sql text;
BEGIN

PERFORM dblink_connect('port=5432 dbname=mydb');  -- your connection parameters

FOR i IN 0 .. 100
LOOP
   _sql := format('
      DELETE FROM link_tbl l
      WHERE  l.item_id BETWEEN %s AND %s
      AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i)'
    , (50000 * i)::text
    , (50000 * (i+1))::text);

   PERFORM dblink_exec(_sql);
END LOOP;

PERFORM dblink_disconnect();

END
$$

If the script gets interrupted: dblink_exec writes what it executed to the DB log, so you can see what's done already.


Perhaps this:

DELETE FROM link_tbl lnk
WHERE NOT EXISTS
  ( SELECT 1 FROM item_tbl item WHERE item.id = lnk.item_id );

When dealing with large numbers of records, it can be much more efficient to create a temp table, perform INSERT INTO ... SELECT * FROM ..., then drop the original table, rename the temp table, and add your indexes back...
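A minimal sketch of that copy-and-swap approach, assuming the table and column names from the question and that nothing else writes to link_tbl in the meantime (a plain table is used rather than an actual TEMP table, since a temporary table cannot be moved into a permanent schema):

BEGIN;

-- copy only the rows that still have a matching parent
CREATE TABLE link_tbl_new AS
SELECT l.*
FROM   link_tbl l
WHERE  EXISTS (SELECT 1 FROM item_tbl i WHERE i.id = l.item_id);

-- swap the tables
DROP TABLE link_tbl;
ALTER TABLE link_tbl_new RENAME TO link_tbl;

-- recreate whatever indexes and constraints the original table had;
-- the index name here is hypothetical
CREATE INDEX link_tbl_item_id_idx ON link_tbl (item_id);

COMMIT;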