Aggregating the most recent joined records per week
You need one data item per week and goal (before aggregating counts per company). That's a plain CROSS JOIN
between generate_series()
and goals
. The (possibly) expensive part is to get the current state
from updates
for each. Like @Paul already suggested, a LATERAL
join seems like the best tool. Do it only for updates
, though, and use a faster technique with LIMIT 1
.
And simplify date handling with date_trunc()
.
SELECT w_start , g.company_id , count(*) FILTER (WHERE u.status = 'green') AS green_count , count(*) FILTER (WHERE u.status = 'amber') AS amber_count , count(*) FILTER (WHERE u.status = 'red') AS red_countFROM generate_series(date_trunc('week', NOW() - interval '2 months') , date_trunc('week', NOW()) , interval '1 week') w_startCROSS JOIN goals gLEFT JOIN LATERAL ( SELECT status FROM updates WHERE goal_id = g.id AND created_at < w_start ORDER BY created_at DESC LIMIT 1 ) u ON trueGROUP BY w_start, g.company_idORDER BY w_start, g.company_id;
To make this fast you need a multicolumn index:
CREATE INDEX updates_special_idx ON updates (goal_id, created_at DESC, status);
Descending order for created_at
is best, but not strictly necessary. Postgres can scan indexes backwards almost exactly as fast. (Not applicable for inverted sort order of multiple columns, though.)
Index columns in that order. Why?
And the third column status
is only appended to allow fast index-only scans on updates
. Related case:
1k goals for 9 weeks (your interval of 2 months overlaps with at least 9 weeks) only require 9k index look-ups for the 2nd table of only 1k rows. For small tables like this, performance shouldn't be much of a problem. But once you have a couple of thousand more in each table, performance will deteriorate with sequential scans.
w_start
represents the start of each week. Consequently, counts are for the start of the week. You can still extract year and week (or any other details represent your week), if you insist:
EXTRACT(isoyear from w_start) AS year , EXTRACT(week from w_start) AS week
Best with ISOYEAR
, like @Paul explained.
Related:
This seems like a good use for LATERAL
joins:
SELECT EXTRACT(ISOYEAR FROM s) AS year, EXTRACT(WEEK FROM s) AS week, u.company_id, COUNT(u.goal_id) FILTER (WHERE u.status = 'green') AS green_count, COUNT(u.goal_id) FILTER (WHERE u.status = 'amber') AS amber_count, COUNT(u.goal_id) FILTER (WHERE u.status = 'red') AS red_countFROM generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(w)LEFT OUTER JOIN LATERAL ( SELECT DISTINCT ON (g.company_id, u2.goal_id) g.company_id, u2.goal_id, u2.status FROM updates u2 INNER JOIN goals g ON g.id = u2.goal_id WHERE u2.created_at <= s.w ORDER BY g.company_id, u2.goal_id, u2.created_at DESC) u ON trueWHERE u.company_id IS NOT NULLGROUP BY year, week, u.company_idORDER BY u.company_id, year, week;
Btw I am extracting ISOYEAR
not YEAR
to ensure I get sensible results around the beginning of January. For instance EXTRACT(YEAR FROM '2016-01-01 08:49:56.734556-08')
is 2016
but EXTRACT(WEEK FROM '2016-01-01 08:49:56.734556-08')
is 53
!
EDIT: You should test on your real data, but I feel like this ought to be faster:
SELECT year, week, company_id, COUNT(goal_id) FILTER (WHERE last_status = 'green') AS green_count, COUNT(goal_id) FILTER (WHERE last_status = 'amber') AS amber_count, COUNT(goal_id) FILTER (WHERE last_status = 'red') AS red_countFROM ( SELECT EXTRACT(ISOYEAR FROM s) AS year, EXTRACT(WEEK FROM s) AS week, u.company_id, u.goal_id, (array_agg(u.status ORDER BY u.created_at DESC))[1] AS last_status FROM generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(t) LEFT OUTER JOIN ( SELECT g.company_id, u2.goal_id, u2.created_at, u2.status FROM updates u2 INNER JOIN goals g ON g.id = u2.goal_id ) u ON s.t >= u.created_at WHERE u.company_id IS NOT NULL GROUP BY year, week, u.company_id, u.goal_id) xGROUP BY year, week, company_idORDER BY company_id, year, week;
Still no window functions though. :-) Also you can speed it up a bit more by replacing (array_agg(...))[1]
with a real first
function. You'll have to define that yourself, but there are implementations on the Postgres wiki that are easy to Google for.
I use PostgreSQL 9.3. I'm interested in your question. I examined your data structure. Than I create the following tables.
I insert the following records;
Company
Goals
Updates
After that I wrote the following query, for correction
SELECT c.id company_id, c.name company_name, u.status goal_status, EXTRACT(week from u.created_at) goal_status_week, EXTRACT(year from u.created_at) AS goal_status_year FROM company cINNER JOIN goals g ON g.company_id = c.id INNER JOIN updates u ON u.goal_id = g.idORDER BY goal_status_year DESC, goal_status_week DESC;
At last I merge this query with week series
SELECT gs.company_id, gs.company_name, gs.goal_status, EXTRACT(year from w) AS year, EXTRACT(week from w) AS week, COUNT(gs.*) cntFROM generate_series(NOW() - INTERVAL '3 MONTHS', NOW(), '1 week') wLEFT JOIN(SELECT c.id company_id, c.name company_name, u.status goal_status, EXTRACT(week from u.created_at) goal_status_week, EXTRACT(year from u.created_at) AS goal_status_year FROM company cINNER JOIN goals g ON g.company_id = c.id INNER JOIN updates u ON u.goal_id = g.id ) gs ON gs.goal_status_week = EXTRACT(week from w) AND gs.goal_status_year = EXTRACT(year from w)GROUP BY company_id, company_name, goal_status, year, weekORDER BY year DESC, week DESC;
I get this result
Have a good day.