DISTINCT ON in an aggregate function in postgres
Whenever you have a central table and want to left-join it to many rows in table A and also left-join it to many rows in table B, you get these problems of duplicating rows. It can especially throw off aggregrate functions like COUNT
and SUM
if you're not careful! So I think you need to construct your tags-for-each-photo and comments-for-each-photo separately, and then join them together:
WITH tags AS ( SELECT photo.photo_id, json_agg(row_to_json(tag.*)) AS tags FROM photo LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id GROUP BY photo.photo_id),comments AS ( SELECT photo.photo_id, json_agg(row_to_json(comment.*)) AS comments FROM photo LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id GROUP BY photo.photo_id)SELECT COALESCE(tags.photo_id, comments.photo_id) AS photo_id, tags.tags, comments.commentsFROM tagsFULL OUTER JOIN commentsON tags.photo_id = comments.photo_id
EDIT: If you really want to join everything together without CTEs, this looks like it gives correct results:
SELECT photo.photo_id, to_json(array_agg(DISTINCT tag.*)) AS tags, to_json(array_agg(DISTINCT comment.*)) AS commentsFROM photoLEFT OUTER JOIN comment ON comment.photo_id = photo.photo_idLEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_idLEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_idGROUP BY photo.photo_id
The cheapest and simplest DISTINCT
operation is .. not to multiply rows in a "proxy cross join" in the first place. Aggregate first, then join. See:
Best for returning few selected rows
Assuming you actually don't want to retrieve the whole table, but just one or few selected photos at a time, with aggregated details, the most elegant and probably fastest way is with LATERAL
subqueries:
SELECT *FROM photo pCROSS JOIN LATERAL ( SELECT json_agg(c) AS comments FROM comment c WHERE photo_id = p.photo_id ) c1CROSS JOIN LATERAL ( SELECT json_agg(t) AS tags FROM photo_tag pt JOIN tag t USING (tag_id) WHERE pt.photo_id = p.photo_id ) tWHERE p.photo_id = 2; -- arbitrary selection
This returns whole rows from comment
and tag
, aggregated into JSON arrays separately. Rows are not multiplies like in your attempt, but they are only as "distinct" as they are in your base tables.
To additionally fold duplicates in the base data, see below.
Notes:
LATERAL
andjson_agg()
require Postgres 9.3 or later.json_agg(c)
is short forjson_agg(c.*)
.We do not need to
LEFT JOIN
because an aggregate function likejson_agg()
always returns a row.
Typically, you'd only want a subset of columns - at least excluding the redundant photo_id
:
SELECT *FROM photo pCROSS JOIN LATERAL ( SELECT json_agg(json_build_object('comment_id', comment_id , 'comment', comment)) AS comments FROM comment WHERE photo_id = p.photo_id ) cCROSS JOIN LATERAL ( SELECT json_agg(t) AS tags FROM photo_tag pt JOIN tag t USING (tag_id) WHERE pt.photo_id = p.photo_id ) tWHERE p.photo_id = 2;
json_build_object()
was introduced with Postgres 9.4. Used to be cumbersome in older versions because a ROW
constructor doesn't preserve column names. But there are generic workarounds:
Also allows to choose JSON key names freely, you don't have to stick to column names.
Best for returning the whole table
To return all rows, this is more efficient:
SELECT p.* , COALESCE(c1.comments, '[]') AS comments , COALESCE(t.tags, '[]') AS tagsFROM photo pLEFT JOIN ( SELECT photo_id , json_agg(json_build_object('comment_id', comment_id , 'comment', comment)) AS comments FROM comment c GROUP BY 1 ) c1 USING (photo_id)LEFT JOIN LATERAL ( SELECT photo_id , json_agg(t) AS tags FROM photo_tag pt JOIN tag t USING (tag_id) GROUP BY 1 ) t USING (photo_id);
Once we retrieve enough rows, this gets cheaper than LATERAL
subqueries. Works for Postgres 9.3+.
Note the USING
clause in the join condition. This way we can conveniently use SELECT *
in the outer query without getting duplicate columns for photo_id
. I didn't use SELECT *
here because your deleted answer indicates you want empty JSON arrays instead of NULL for no tags / no comments.
Also remove existing duplicates in base tables
You can't just json_agg(DISTINCT json_build_object(...))
because there is no equality operator for the data type json
. See:
There are various better ways:
SELECT *FROM photo pCROSS JOIN LATERAL ( SELECT json_agg(to_json(c1.comment)) AS comments1 , json_agg(json_build_object('comment', c1.comment)) AS comments2 , json_agg(to_json(c1)) AS comments3 FROM ( SELECT DISTINCT c.comment -- folding dupes here FROM comment c WHERE c.photo_id = p.photo_id -- ORDER BY comment -- any particular order? ) c1 ) c2CROSS JOIN LATERAL ( SELECT jsonb_agg(DISTINCT t) AS tags -- demonstrating jsonb_agg FROM photo_tag pt JOIN tag t USING (tag_id) WHERE pt.photo_id = p.photo_id ) tWHERE p.photo_id = 2;
Demonstrating 4 different techniques in comments1
, comments2
, comments3
(redundantly) and tags
.
db<>fiddle here
Old SQL Fiddle backpatched to Postgres 9.3
Old SQL Fiddle for Postgres 9.6
The most simple thing I discovered is to use DISTINCT
over jsonb
(not json!). (jsonb_build_object
creates jsonb objects)
SELECT JSON_AGG( DISTINCT jsonb_build_object('tag_id', photo_tag.tag_id, 'name', tag.name)) AS tagsFROM photo LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_idGROUP BY photo.photo_id