Count the number of matching rows for high magnitude counts (100,000+)


Prior to PostgreSQL 9.2, the implementation of MVCC required any query to visit each row of the table to check whether that version of the row was visible to the current transaction. This happened even if the query only involved indexed columns. It manifests as slow counts on large tables, even in simple cases.

PostgreSQL 9.2 introduced index-only scans, which may alleviate this issue for some workloads.
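Whether the planner actually uses an index-only scan depends on the columns involved and on how recently the table was vacuumed, since index-only scans consult the visibility map. A quick way to check is EXPLAIN; this is a minimal sketch assuming an incidents table with an index on time_occurred (hypothetical names):

    -- Look for "Index Only Scan" in the plan output; a "Seq Scan" means
    -- every row version must still be visited to determine visibility.
    explain
    select count(*)
    from incidents
    where time_occurred >= '2013-01-01';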

If you are stuck below 9.2 and only need an approximate row count for a simple query, there are some known workarounds. See http://wiki.postgresql.org/wiki/Count_estimate .
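For reference, the simplest workaround on that page reads the planner's estimate out of the system catalog instead of scanning the table (shown here against a hypothetical incidents table):

    -- Approximate row count maintained by VACUUM and ANALYZE;
    -- cheap to read, but may be stale between maintenance runs.
    select reltuples::bigint as approximate_count
    from pg_class
    where relname = 'incidents';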


Keep a table of incidents aggregated by day.

    create table incidents_aggregated_by_day (
        "day" date primary key,
        total integer
    );

Every day, run:

    insert into incidents_aggregated_by_day ("day", total)
    select date_trunc('day', time_occurred)::date, count(*) as total
    from incidents
    where time_occurred < current_date
      and date_trunc('day', time_occurred)::date not in (
          select "day" from incidents_aggregated_by_day
      )
    group by 1;

Suppose you want the total between '2013-01-01 10:37' and '2013-03-02 11:20':

    select
        (
            select sum(total)
            from incidents_aggregated_by_day
            where "day" >= '2013-01-02'::date
              and "day" < '2013-03-02'::date
        ) +
        (
            select count(*)
            from incidents
            where
                (time_occurred >= '2013-01-01 10:37'::timestamp
                 and time_occurred < '2013-01-02')
                or
                (time_occurred <= '2013-03-02 11:20'::timestamp
                 and time_occurred >= '2013-03-02')
        ) as total

This way, instead of reading 100 million rows you read only a few hundred or thousand. Properly indexed, it will be fast.
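"Properly indexed" here means, at minimum, an index on the raw timestamp column; the aggregated table's primary key on "day" is already indexed. A minimal sketch, using the table names above:

    -- Supports the two small edge-range counts against the raw table.
    create index idx_incidents_time_occurred on incidents (time_occurred);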


Another approach might be to partition the table. This guy seems to have solved a very similar problem with partitioning:

http://www.if-not-true-then-false.com/2009/performance-testing-between-partitioned-and-non-partitioned-postgresql-tables-part-3/

My concern with using his approach would be maintainability. In his example (you have to click through to part 1 of the tutorial to see how he created the partitions), he manually creates each child table and has hard-coded routing to the child tables in a trigger. If your table is constantly growing, you would be doing a lot of DBA work.
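For illustration, the pattern he describes looks roughly like this; a minimal sketch with hypothetical names, showing one child table and the kind of hard-coded routing trigger mentioned above:

    -- One child table per month, each with a CHECK constraint so that
    -- constraint exclusion can skip non-matching children at query time.
    create table incidents_2013_01 (
        check (time_occurred >= date '2013-01-01'
           and time_occurred <  date '2013-02-01')
    ) inherits (incidents);

    -- Hard-coded routing: every new month means editing this function.
    create or replace function incidents_insert_trigger()
    returns trigger as $$
    begin
        if NEW.time_occurred >= date '2013-01-01'
           and NEW.time_occurred < date '2013-02-01' then
            insert into incidents_2013_01 values (NEW.*);
        else
            raise exception 'no partition for time_occurred = %', NEW.time_occurred;
        end if;
        return null;  -- suppress the insert into the parent table
    end;
    $$ language plpgsql;

    create trigger incidents_partition_insert
        before insert on incidents
        for each row execute procedure incidents_insert_trigger();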

However, he does seem to get a big performance boost. So, if you can figure out how to make it more maintainable, this might be a good way to proceed.