Postgres not using index when index scan is much better option


Index (Only) Scan --> Bitmap Index Scan --> Sequential Scan

For a few rows it pays to run an index scan. If enough data pages are visible to all transactions (= vacuumed enough, and not too much concurrent write load) and the index can provide all column values needed, then a faster index-only scan is used. With more rows expected to be returned (a higher percentage of the table, and depending on data distribution, value frequencies and row width) it becomes more likely to find several rows on one data page. Then it pays to switch to a bitmap index scan. (Or to combine multiple distinct indexes.) Once a large percentage of data pages has to be visited anyway, it's cheaper to run a sequential scan, filter surplus rows and skip the overhead for indexes altogether.
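You can watch the planner tip from one method to the next by widening the predicate. A sketch against the question's table (the constants are invented; where the switches happen depends entirely on your data):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM email_activities WHERE email_recipient_id = 123;      -- few rows: typically index (only) scan

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM email_activities WHERE email_recipient_id < 50000;    -- more rows: typically bitmap heap scan

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM email_activities WHERE email_recipient_id < 5000000;  -- large share: typically sequential scan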

Index usage becomes (much) cheaper and more likely when accessing data pages in random order is not (much) more expensive than accessing them sequentially. That's the case when using SSDs instead of spinning disks, and even more so the more data is cached in RAM - provided the respective configuration parameters random_page_cost and effective_cache_size are set accordingly.
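For example (sketch; these are common starting points for SSD storage, not values measured for your system):

SET random_page_cost = 1.1;        -- default is 4.0, geared toward spinning disks
SET effective_cache_size = '8GB';  -- roughly shared_buffers plus the OS file cache

-- to persist for the whole cluster instead of the session:
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();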

In your case, Postgres switches to a sequential scan because it expects to find rows=263962, which is already 3% of the whole table. (While only rows=47935 are actually found, see below.)

More in this related answer:

Beware of forcing query plans

You cannot force a certain planner method directly in Postgres, but you can make other methods seem extremely expensive for debugging purposes. See Planner Method Configuration in the manual.

SET enable_seqscan = off (as suggested in another answer) does that to sequential scans. But that's intended for debugging purposes in your session only. Do not use it as a general setting in production unless you know exactly what you are doing. It can force ridiculous query plans. Quoting the manual:

These configuration parameters provide a crude method of influencing the query plans chosen by the query optimizer. If the default plan chosen by the optimizer for a particular query is not optimal, a temporary solution is to use one of these configuration parameters to force the optimizer to choose a different plan. Better ways to improve the quality of the plans chosen by the optimizer include adjusting the planner cost constants (see Section 19.7.2), running ANALYZE manually, increasing the value of the default_statistics_target configuration parameter, and increasing the amount of statistics collected for specific columns using ALTER TABLE SET STATISTICS.

That's already most of the advice you need.
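A contained debugging session typically looks like this (sketch; the constants are borrowed from the alternative query at the end of this answer):

SET enable_seqscan = off;          -- debugging only, affects the whole session

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(DISTINCT email_recipient_id)
FROM   email_activities
WHERE  email_recipient_id IN (
   SELECT id
   FROM   email_recipients
   WHERE  email_campaign_id = 1607
   LIMIT  43000);

RESET enable_seqscan;              -- back to the default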

In this particular case, Postgres expects 5-6 times more hits on email_activities.email_recipient_id than are actually found:

estimated rows=227007 vs. actual ... rows=40789
estimated rows=263962 vs. actual ... rows=47935

If you run this query often it will pay to have ANALYZE look at a bigger sample for more accurate statistics on the particular column. Your table is big (~ 10M rows), so make that:

ALTER TABLE email_activities ALTER COLUMN email_recipient_id
SET STATISTICS 3000;  -- max 10000, default 100

Then ANALYZE email_activities;
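To verify what the planner now sees for the column, inspect the standard system view pg_stats (just a selection of relevant columns):

SELECT n_distinct, null_frac, most_common_vals
FROM   pg_stats
WHERE  tablename = 'email_activities'
AND    attname   = 'email_recipient_id';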

Measure of last resort

In very rare cases you might resort to forcing an index with SET LOCAL enable_seqscan = off in a separate transaction or in a function with its own environment. Like:

CREATE OR REPLACE FUNCTION f_count_dist_recipients(_email_campaign_id int, _limit int)
  RETURNS bigint AS
$func$
   SELECT COUNT(DISTINCT a.email_recipient_id)
   FROM   email_activities a
   WHERE  a.email_recipient_id IN (
      SELECT id
      FROM   email_recipients
      WHERE  email_campaign_id = $1
      LIMIT  $2)       -- or consider query below
$func$  LANGUAGE sql VOLATILE COST 100000 SET enable_seqscan = off;

The setting only applies to the local scope of the function.
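A possible call, reusing the constants from the alternative query below as stand-ins:

SELECT f_count_dist_recipients(1607, 43000);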

Warning: This is just a proof of concept. Even this much less radical manual intervention might bite you in the long run. Cardinalities, value frequencies, your schema, global Postgres settings: everything changes over time. You are going to upgrade to a new Postgres version. The query plan you force now may become a very bad idea later.

And typically this is just a workaround for a problem with your setup. Better find and fix it.

Alternative query

Essential information is missing in the question, but this equivalent query is probably faster and more likely to use an index on (email_recipient_id) - increasingly so for a bigger LIMIT.

SELECT COUNT(*) AS ct
FROM  (
   SELECT id
   FROM   email_recipients
   WHERE  email_campaign_id = 1607
   LIMIT  43000
   ) r
WHERE  EXISTS (
   SELECT FROM email_activities
   WHERE  email_recipient_id = r.id);
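If the index on (email_recipient_id) does not exist yet, creating it would look like this (the index name is illustrative):

CREATE INDEX email_activities_email_recipient_id_idx
ON     email_activities (email_recipient_id);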


A sequential scan can be more efficient, even when an index exists. In this case, Postgres seems to estimate things rather wrongly. An ANALYZE <table> on all related tables can help in such cases. If it doesn't, you can set the variable enable_seqscan to OFF to force Postgres to use an index whenever technically possible, at the expense that an index scan will sometimes be used when a sequential scan would perform better.
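For the tables in the question that would be:

ANALYZE email_recipients;
ANALYZE email_activities;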