Select first row in each GROUP BY group?

postgresql


On databases that support CTE and windowing functions:

WITH summary AS (
    SELECT p.id,
           p.customer,
           p.total,
           ROW_NUMBER() OVER(PARTITION BY p.customer
                                 ORDER BY p.total DESC) AS rank
      FROM PURCHASES p)
SELECT *
  FROM summary
 WHERE rank = 1

Supported by any database:

But you need to add logic to break ties:

SELECT MIN(x.id),  -- change to MAX if you want the highest
       x.customer,
       x.total
  FROM PURCHASES x
  JOIN (SELECT p.customer,
               MAX(total) AS max_total
          FROM PURCHASES p
      GROUP BY p.customer) y ON y.customer = x.customer
                            AND y.max_total = x.total
GROUP BY x.customer, x.total


In PostgreSQL this is typically simpler and faster (more performance optimization below):

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

SELECT DISTINCT ON (2)
       id, customer, total
FROM   purchases
ORDER  BY 2, 3 DESC, 1;

If total can be NULL (won't hurt either way, but you'll want to match existing indexes):

...
ORDER  BY customer, total DESC NULLS LAST, id;

Major points

DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).

List any number of expressions in the DISTINCT ON clause; the combined row value defines duplicates. The manual:

Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.

Bold emphasis mine.
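For illustration, a minimal sketch with two expressions in DISTINCT ON. The purchased_at column is hypothetical (not part of the example table); rows only count as duplicates if they share both customer and calendar day:

-- Sketch only: assumes a hypothetical purchased_at timestamptz column
-- Returns the biggest purchase per customer and per day
SELECT DISTINCT ON (customer, purchased_at::date)
       id, customer, total, purchased_at
FROM   purchases
ORDER  BY customer, purchased_at::date, total DESC, id;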

DISTINCT ON can be combined with ORDER BY. Leading expressions in ORDER BY must be in the set of expressions in DISTINCT ON, but you can rearrange order among those freely. Example.
You can add additional expressions to ORDER BY to pick a particular row from each group of peers. Or, as the manual puts it:

The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.

I added id as last item to break ties:
"Pick the row with the smallest id from each group sharing the highest total."

To order results in a way that disagrees with the sort order determining the first per group, you can nest the above query in an outer query with another ORDER BY. Example:
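A sketch of that nesting: the inner ORDER BY still picks the row per customer, the outer ORDER BY only rearranges the output (here by total):

SELECT *
FROM  (
   SELECT DISTINCT ON (customer)
          id, customer, total
   FROM   purchases
   ORDER  BY customer, total DESC, id
   ) sub
ORDER  BY total DESC, id;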

If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST as demonstrated above.

The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way. (Not needed in the simple case above):

  • You don't have to include any of the expressions in DISTINCT ON or ORDER BY.

  • You can include any other expression in the SELECT list. This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.
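To illustrate both points, a sketch that returns neither total nor id, but an unrelated expression instead (the vat calculation is purely illustrative):

-- Sketch: the SELECT list omits the ORDER BY columns total and id
-- and adds an arbitrary expression instead
SELECT DISTINCT ON (customer)
       customer, total * 0.19 AS vat
FROM   purchases
ORDER  BY customer, total DESC, id;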

I tested with Postgres versions 8.3 – 13. But the feature has been there at least since version 7.1, so basically always.

Index

The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

May be too specialized. But use it if read performance for the particular query is crucial. If you have DESC NULLS LAST in the query, use the same in the index so that sort order matches and the index is applicable.
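If the query uses DESC NULLS LAST, a matching index could look like this (sketch; the index name is arbitrary):

CREATE INDEX purchases_3c_nulls_last_idx ON purchases (customer, total DESC NULLS LAST, id);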

Effectiveness / Performance optimization

Weigh cost and benefit before creating tailored indexes for each query. The potential of above index largely depends on data distribution.

The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.

For few rows per customer (high cardinality in column customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer.
Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. But generally setting work_mem too high can have adverse effects. Consider SET LOCAL for exceptionally big queries. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more.
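A sketch combining both suggestions; the 256MB value is only a placeholder to be replaced by whatever EXPLAIN ANALYZE tells you:

BEGIN;
SET LOCAL work_mem = '256MB';  -- placeholder; SET LOCAL lasts until end of transaction

EXPLAIN (ANALYZE, BUFFERS)     -- check the sort node for "Sort Method: ... Disk: ..."
SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

COMMIT;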

For many rows per customer (low cardinality in column customer), a loose index scan (a.k.a. "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 14. (An implementation for index-only scans is in development for Postgres 15. See here and here.)
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:

Benchmarks

See separate answer.


Benchmarks

Testing the most interesting candidates with Postgres 9.4 and 9.5 with a halfway realistic table of 200k rows in purchases and 10k distinct customer_id (avg. 20 rows per customer).

For Postgres 9.5 I ran a 2nd test with effectively 86446 distinct customers. See below (avg. 2.3 rows per customer).

Setup

Main table

CREATE TABLE purchases (
  id          serial
, customer_id int  -- REFERENCES customer
, total       int  -- could be amount of money in Cent
, some_column text -- to make the row bigger, more realistic
);

I use a serial (PK constraint added below) and an integer customer_id since that's a more typical setup. Also added some_column to make up for typically more columns.

Dummy data, PK, index - a typical table also has some dead tuples:

INSERT INTO purchases (customer_id, total, some_column)    -- insert 200k rows
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9; -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;

customer table - for superior query:

CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM   purchases
GROUP  BY 1
ORDER  BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;

In my second test for 9.5 I used the same setup, but with random() * 100000 to generate customer_id to get only few rows per customer_id.
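In other words, only the customer_id expression in the INSERT statements changes, roughly like this (sketch):

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 100000)::int            AS customer_id  -- ~ 86k distinct customers
     , (random() * random() * 100000)::int AS total
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;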

Object sizes for table purchases

Generated with a query taken from this related answer:
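That query is not reproduced here. A minimal sketch using standard system size functions covers most of the listed figures; the per-row numbers and tuple counts below come from the more detailed original:

-- Sketch only, not the query from the linked answer
SELECT pg_relation_size(c.oid)        AS core_relation_size
     , pg_relation_size(c.oid, 'vm')  AS visibility_map
     , pg_relation_size(c.oid, 'fsm') AS free_space_map
     , pg_table_size(c.oid)           AS table_size_incl_toast
     , pg_indexes_size(c.oid)         AS indexes_size
     , pg_total_relation_size(c.oid)  AS total_size_incl_toast_and_indexes
FROM   pg_class c
WHERE  c.relname = 'purchases';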

               what                | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
 core_relation_size                | 20496384 | 20 MB        |           102
 visibility_map                    |        0 | 0 bytes      |             0
 free_space_map                    |    24576 | 24 kB        |             0
 table_size_incl_toast             | 20529152 | 20 MB        |           102
 indexes_size                      | 10977280 | 10 MB        |            54
 total_size_incl_toast_and_indexes | 31506432 | 30 MB        |           157
 live_rows_in_text_representation  | 13729802 | 13 MB        |            68
 ------------------------------    |          |              |
 row_count                         |   200045 |              |
 live_tuples                       |   200045 |              |
 dead_tuples                       |    19955 |              |

Queries

1. row_number() in CTE, (see other answer)

WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;

2. row_number() in subquery (my optimization)

SELECT id, customer_id, total
FROM   (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;

3. DISTINCT ON (see other answer)

SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;

4. rCTE with LATERAL subquery (see here)

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;

5. customer table with LATERAL (see here)

SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;

6. array_agg() with ORDER BY (see other answer)

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;

Results

Execution time for above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.

All queries used an Index Only Scan on purchases_3c_idx (among other steps). Some of them only profited from the smaller size of the index, others used it more effectively.

A. Postgres 9.4 with 200k rows and ~ 20 per customer_id

1. 273.274 ms
2. 194.572 ms
3. 111.067 ms
4.  92.922 ms  -- !
5.  37.679 ms  -- winner
6. 189.495 ms

B. Same as A. with Postgres 9.5

1. 288.006 ms
2. 223.032 ms
3. 107.074 ms
4.  78.032 ms  -- !
5.  33.944 ms  -- winner
6. 211.540 ms

C. Same as B., but with ~ 2.3 rows per customer_id

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms

Retest with Postgres 13 on 2021-08-11

Simplified test setup: not deleting rows, because VACUUM ANALYZE cleans the table completely for the simple case.

Important changes:

  • General performance improvements.
  • CTEs can be inlined since Postgres 12, so queries 1. and 2. now perform almost identically (same query plan).
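If the old behavior is ever needed, it can still be requested explicitly. A sketch using the MATERIALIZED keyword (available since Postgres 12):

WITH cte AS MATERIALIZED (   -- forces the pre-Postgres-12 behavior: CTE is not inlined
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;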

D. Like B. ~ 20 rows per customer_id

1. 103 ms
2. 103 ms
3.  23 ms  -- winner
4.  71 ms
5.  22 ms  -- winner
6.  81 ms

db<>fiddle here

E. Like C. ~ 2.3 rows per customer_id

1. 127 ms
2. 126 ms
3.  36 ms  -- winner
4. 620 ms
5. 145 ms
6. 203 ms

db<>fiddle here

Accented tests with Postgres 13

1M rows, 10,000 vs. 100 vs. 1.6 rows per customer.

F. with ~ 10,000 rows per customer

1. 526 ms
2. 527 ms
3. 127 ms
4.   2 ms  -- winner !
5.   1 ms  -- winner !
6. 356 ms

db<>fiddle here

G. with ~ 100 rows per customer

1. 535 ms
2. 529 ms
3. 132 ms
4. 108 ms  -- !
5.  71 ms  -- winner
6. 376 ms

db<>fiddle here

H. with ~ 1.6 rows per customer

1.  691 ms
2.  684 ms
3.  234 ms  -- winner
4. 4669 ms
5. 1089 ms
6. 1264 ms

db<>fiddle here

Conclusions:

  • DISTINCT ON uses the index effectively and typically performs best for few rows per group. And it performs decently even with many rows per group.

  • For many rows per group, emulating an index skip scan with an rCTE performs best - second only to the query technique with a separate lookup table (if that's available).

  • Using row_number() (technique of the currently accepted answer) never wins any performance test. Not then, not now. It never comes even close to DISTINCT ON, not even when the data distribution is unfavorable for it. The only good thing about it: it does not scale terribly, just mediocrely.

Related benchmarks

Here is a new one by "ogr" testing with 10M rows and 60k unique "customers" on Postgres 11.5 (current as of Sep. 2019). Results are still in line with what we have seen so far.

Original (outdated) benchmark from 2011

I ran three tests with PostgreSQL 9.1 on a real life table of 65579 rows and single-column btree indexes on each of the three columns involved and took the best execution time of 5 runs.
Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):

  1. Select the whole table, results in 5958 rows in this case.

A: 567.218 ms
B: 386.673 ms

  2. Use condition WHERE customer BETWEEN x AND y resulting in 1000 rows.

A: 249.136 ms
B:  55.111 ms

  3. Select a single customer with WHERE customer = x.

A:   0.143 ms
B:   0.072 ms

Same test repeated with the index described in the other answer:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

1A: 277.953 ms
1B: 193.547 ms

2A: 249.796 ms -- special index not used
2B:  28.679 ms

3A:   0.120 ms
3B:   0.048 ms