Count the number of matching rows for high magnitude counts (100,000+)


Prior to PostgreSQL 9.2, the implementation of MVCC required any query to visit each row of the table to check whether that version of the row was visible to the current transaction. This happened even if the query only involved indexed columns. It manifests as slow counts on large tables, even in simple cases.

PostgreSQL 9.2 introduced index-only scans, which may alleviate this issue for some workloads.
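Whether the planner actually uses an index-only scan depends on the columns involved and on how recently the table was vacuumed, since index-only scans consult the visibility map. A quick way to check is EXPLAIN; this is a minimal sketch assuming an incidents table with an index on time_occurred (hypothetical names):

    -- Look for "Index Only Scan" in the plan output; a "Seq Scan" means
    -- every row version must still be visited to determine visibility.
    explain
    select count(*)
    from incidents
    where time_occurred >= '2013-01-01';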

If you are stuck below 9.2 and only need an approximate row count for a simple query, there are some known workarounds. See http://wiki.postgresql.org/wiki/Count_estimate .
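For reference, the simplest workaround on that page reads the planner's estimate out of the system catalog instead of scanning the table (shown here against a hypothetical incidents table):

    -- Approximate row count maintained by VACUUM and ANALYZE;
    -- cheap to read, but may be stale between maintenance runs.
    select reltuples::bigint as approximate_count
    from pg_class
    where relname = 'incidents';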


Keep a table of incidents aggregated by day.

    create table incidents_aggregated_by_day (
        "day" date primary key,
        total integer
    );

Every day, run:

    insert into incidents_aggregated_by_day ("day", total)
    select date_trunc('day', time_occurred)::date, count(*) as total
    from incidents
    where time_occurred < current_date
      and date_trunc('day', time_occurred)::date not in (
          select "day" from incidents_aggregated_by_day
      )
    group by 1;

Suppose you want the total between '2013-01-01 10:37' and '2013-03-02 11:20':

    select
        (
            select sum(total)
            from incidents_aggregated_by_day
            where "day" >= '2013-01-02'::date
              and "day" < '2013-03-02'::date
        ) +
        (
            select count(*)
            from incidents
            where
                (time_occurred >= '2013-01-01 10:37'::timestamp
                 and time_occurred < '2013-01-02')
                or
                (time_occurred <= '2013-03-02 11:20'::timestamp
                 and time_occurred >= '2013-03-02')
        ) as total

This way, instead of reading 100 million rows you read only a few hundred or thousand. Properly indexed, it will be fast.
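"Properly indexed" here means, at minimum, an index on the raw timestamp column; the aggregated table's primary key on "day" is already indexed. A minimal sketch, using the table names above:

    -- Supports the two small edge-range counts against the raw table.
    create index idx_incidents_time_occurred on incidents (time_occurred);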


Another approach might be to partition the table. This guy seems to have solved a very similar problem with partitioning:

http://www.if-not-true-then-false.com/2009/performance-testing-between-partitioned-and-non-partitioned-postgresql-tables-part-3/

My concern with using his approach would be maintainability. In his example (you have to click through to part 1 of the tutorial to see how he created the partitions), he manually creates each child table and has hard-coded routing to the child tables in a trigger. If your table is constantly growing, you would be doing a lot of DBA work.
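For illustration, the pattern he describes looks roughly like this; a minimal sketch with hypothetical names, showing one child table and the kind of hard-coded routing trigger mentioned above:

    -- One child table per month, each with a CHECK constraint so that
    -- constraint exclusion can skip non-matching children at query time.
    create table incidents_2013_01 (
        check (time_occurred >= date '2013-01-01'
           and time_occurred <  date '2013-02-01')
    ) inherits (incidents);

    -- Hard-coded routing: every new month means editing this function.
    create or replace function incidents_insert_trigger()
    returns trigger as $$
    begin
        if NEW.time_occurred >= date '2013-01-01'
           and NEW.time_occurred < date '2013-02-01' then
            insert into incidents_2013_01 values (NEW.*);
        else
            raise exception 'no partition for time_occurred = %', NEW.time_occurred;
        end if;
        return null;  -- suppress the insert into the parent table
    end;
    $$ language plpgsql;

    create trigger incidents_partition_insert
        before insert on incidents
        for each row execute procedure incidents_insert_trigger();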

However, he does seem to get a big performance boost. So, if you can figure out how to make it more maintainable, this might be a good way to proceed.