Optimize GROUP BY query to retrieve latest row per user
For best read performance you need a multicolumn index:
CREATE INDEX log_combo_idx ON log (user_id, log_date DESC NULLS LAST);
To make index-only scans possible, add the otherwise unneeded column payload in a covering index with the INCLUDE clause (Postgres 11 or later):
CREATE INDEX log_combo_covering_idx ON log (user_id, log_date DESC NULLS LAST) INCLUDE (payload);
Fallback for older versions:
CREATE INDEX log_combo_covering_idx ON log (user_id, log_date DESC NULLS LAST, payload);
Why DESC NULLS LAST?
For few rows per user_id or small tables, DISTINCT ON is typically fastest and simplest:
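A minimal sketch, assuming the log table and the :mydate parameter used throughout this answer:

-- latest row per user_id, considering only rows up to :mydate
SELECT DISTINCT ON (user_id)
       user_id, log_date, payload
FROM   log
WHERE  log_date <= :mydate
ORDER  BY user_id, log_date DESC NULLS LAST;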
For many rows per user_id, an index skip scan (or loose index scan) is (much) more efficient. That's not implemented up to Postgres 12; work is ongoing for Postgres 14. But there are ways to emulate it efficiently.
Common Table Expressions require Postgres 8.4+. LATERAL requires Postgres 9.3+.
The following solutions go beyond what's covered in the Postgres Wiki.
1. No separate table with unique users
With a separate users table, the solutions in 2. below are typically simpler and faster. Skip ahead.
1a. Recursive CTE with LATERAL join
WITH RECURSIVE cte AS (
   (                                  -- parentheses required
   SELECT user_id, log_date, payload
   FROM   log
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT l.user_id, l.log_date, l.payload
      FROM   log l
      WHERE  l.user_id > c.user_id    -- lateral reference
      AND    log_date <= :mydate      -- repeat condition
      ORDER  BY l.user_id, l.log_date DESC NULLS LAST
      LIMIT  1
      ) l
   )
TABLE  cte
ORDER  BY user_id;
This makes it simple to retrieve arbitrary columns and is probably the best option in current Postgres. More explanation in chapter 2a. below.
1b. Recursive CTE with correlated subquery
WITH RECURSIVE cte AS (
   (                        -- parentheses required
   SELECT l AS my_row       -- whole row
   FROM   log l
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT l         -- whole row
           FROM   log l
           WHERE  l.user_id > (c.my_row).user_id
           AND    l.log_date <= :mydate  -- repeat condition
           ORDER  BY l.user_id, l.log_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.my_row).user_id IS NOT NULL  -- note parentheses
   )
SELECT (my_row).*           -- decompose row
FROM   cte
WHERE  (my_row).user_id IS NOT NULL
ORDER  BY (my_row).user_id;
Convenient to retrieve a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.
To assert a row was found in the previous iteration, test a single NOT NULL column (like the primary key).
More explanation for this query in chapter 2b. below.
2. With separate users table
Table layout hardly matters as long as exactly one row per relevant user_id is guaranteed. Example:
CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);
Ideally, the table is physically sorted in sync with the log table. Or it's small enough (low cardinality) that it hardly matters. Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition below. If the physical sort order of the users table happens to match the index on log, this may be irrelevant.
2a. LATERAL join
SELECT u.user_id, l.log_date, l.payload
FROM   users u
CROSS  JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id   -- lateral reference
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l;
JOIN LATERAL allows referencing preceding FROM items on the same query level.
Results in one index (-only) look-up per user.
Returns no row for users missing in the users table. Typically, a foreign key constraint enforcing referential integrity would rule that out.
Also, no row for users without matching entry in log, conforming to the original question. To keep those users in the result, use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL:
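A sketch of that variant, assuming the same tables and parameter as above:

SELECT u.user_id, l.log_date, l.payload
FROM   users u
LEFT   JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id   -- lateral reference
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l ON true;    -- users without log rows get NULL for log_date / payload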
Use LIMIT n instead of LIMIT 1 to retrieve more than one row (but not all) per user.
Effectively, all of these do the same:
JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...
The last one has lower priority, though. Explicit JOIN binds before comma. That subtle difference can matter with more join tables.
2b. Correlated subquery
Good choice to retrieve a single column from a single row. Code example:
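A minimal sketch returning just the latest payload per user; the alias latest_payload is illustrative:

SELECT u.user_id
     , (SELECT l.payload               -- single column from the latest row
        FROM   log l
        WHERE  l.user_id = u.user_id
        AND    l.log_date <= :mydate
        ORDER  BY l.log_date DESC NULLS LAST
        LIMIT  1) AS latest_payload
FROM   users u;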
The same is possible for multiple columns, but you need more smarts:
CREATE TEMP TABLE combo (log_date date, payload int);

SELECT user_id, (combo1).*           -- note parentheses
FROM  (
   SELECT u.user_id
        , (SELECT (l.log_date, l.payload)::combo
           FROM   log l
           WHERE  l.user_id = u.user_id
           AND    l.log_date <= :mydate
           ORDER  BY l.log_date DESC NULLS LAST
           LIMIT  1) AS combo1
   FROM   users u
   ) sub;
Like LEFT JOIN LATERAL above, this variant includes all users, even without entries in log. You get NULL for combo1, which you can easily filter with a WHERE clause in the outer query if need be.
Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all column values happen to be NULL - same result. You need a NOT NULL column in the subquery to avoid this ambiguity.
A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.
Use a registered type like the row type of an existing table. Or register a composite type explicitly (and permanently) with CREATE TYPE. Or create a temporary table (dropped automatically at end of session) to register its row type temporarily. Cast syntax: (log_date, payload)::combo
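A minimal sketch of the permanent alternative, assuming the same column types as the temp table above (the type name combo is illustrative):

-- registers the composite type "combo" permanently, instead of the temp table
CREATE TYPE combo AS (log_date date, payload int);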
Finally, we do not want to decompose combo1 on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (still true in Postgres 12). Instead, make it a subquery and decompose it in the outer query.
Demonstrating all 4 queries with 100k log entries and 1k users:
db<>fiddle here - pg 11
Old sqlfiddle
This is not a standalone answer but rather a comment to @Erwin's answer. For 2a, the lateral join example, the query can be improved by sorting the users table to exploit the locality of the index on log.
SELECT u.user_id, l.log_date, l.payload
FROM  (SELECT user_id FROM users ORDER BY user_id) u
    , LATERAL (
        SELECT log_date, payload
        FROM   log
        WHERE  user_id = u.user_id   -- lateral reference
        AND    log_date <= :mydate
        ORDER  BY log_date DESC NULLS LAST
        LIMIT  1
      ) l;
The rationale is that index lookups are expensive if user_id values are random. By sorting user_id first, the subsequent lateral join becomes like a simple scan on the index of log. Even though both query plans look alike, the running times can differ considerably, especially for large tables.
The cost of the sorting is minimal, especially if there is an index on the user_id field.
Perhaps a different index on the table would help. Try this one: log(user_id, log_date). I am not positive that Postgres will make optimal use of it with DISTINCT ON.
So, I would stick with that index and try this version:
select *
from   log l
where  l.log_date <= :mydate            -- also exclude rows after :mydate
and    not exists (select 1
                   from   log l2
                   where  l2.user_id  = l.user_id
                   and    l2.log_date <= :mydate
                   and    l2.log_date >  l.log_date);
This should replace the sorting/grouping with index look ups. It might be faster.