DISTINCT ON in an aggregate function in postgres DISTINCT ON in an aggregate function in postgres postgresql postgresql

DISTINCT ON in an aggregate function in postgres


Whenever you have a central table and want to left-join it to many rows in table A and also left-join it to many rows in table B, you get these problems of duplicating rows. It can especially throw off aggregrate functions like COUNT and SUM if you're not careful! So I think you need to construct your tags-for-each-photo and comments-for-each-photo separately, and then join them together:

WITH tags AS (  SELECT  photo.photo_id, json_agg(row_to_json(tag.*)) AS tags  FROM    photo  LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id  LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id  GROUP BY photo.photo_id),comments AS (  SELECT  photo.photo_id, json_agg(row_to_json(comment.*)) AS comments  FROM    photo  LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id  GROUP BY photo.photo_id)SELECT  COALESCE(tags.photo_id, comments.photo_id) AS photo_id,        tags.tags,        comments.commentsFROM    tagsFULL OUTER JOIN commentsON      tags.photo_id = comments.photo_id

EDIT: If you really want to join everything together without CTEs, this looks like it gives correct results:

SELECT  photo.photo_id,        to_json(array_agg(DISTINCT tag.*)) AS tags,        to_json(array_agg(DISTINCT comment.*)) AS commentsFROM    photoLEFT OUTER JOIN comment ON comment.photo_id = photo.photo_idLEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_idLEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_idGROUP BY photo.photo_id


The cheapest and simplest DISTINCT operation is .. not to multiply rows in a "proxy cross join" in the first place. Aggregate first, then join. See:

Best for returning few selected rows

Assuming you actually don't want to retrieve the whole table, but just one or few selected photos at a time, with aggregated details, the most elegant and probably fastest way is with LATERAL subqueries:

SELECT *FROM   photo pCROSS  JOIN LATERAL (   SELECT json_agg(c) AS comments   FROM   comment c   WHERE  photo_id = p.photo_id   ) c1CROSS  JOIN LATERAL (   SELECT json_agg(t) AS tags   FROM   photo_tag pt   JOIN   tag       t USING (tag_id)   WHERE  pt.photo_id = p.photo_id   ) tWHERE  p.photo_id = 2;  -- arbitrary selection

This returns whole rows from comment and tag, aggregated into JSON arrays separately. Rows are not multiplies like in your attempt, but they are only as "distinct" as they are in your base tables.

To additionally fold duplicates in the base data, see below.

Notes:

  • LATERAL and json_agg() require Postgres 9.3 or later.

  • json_agg(c) is short for json_agg(c.*).

  • We do not need to LEFT JOIN because an aggregate function like json_agg() always returns a row.

Typically, you'd only want a subset of columns - at least excluding the redundant photo_id:

SELECT *FROM   photo pCROSS  JOIN LATERAL (   SELECT json_agg(json_build_object('comment_id', comment_id                                   , 'comment', comment)) AS comments   FROM   comment   WHERE  photo_id = p.photo_id   ) cCROSS  JOIN LATERAL (   SELECT json_agg(t) AS tags   FROM   photo_tag pt   JOIN   tag       t USING (tag_id)   WHERE  pt.photo_id = p.photo_id   ) tWHERE  p.photo_id = 2;

json_build_object() was introduced with Postgres 9.4. Used to be cumbersome in older versions because a ROW constructor doesn't preserve column names. But there are generic workarounds:

Also allows to choose JSON key names freely, you don't have to stick to column names.

Best for returning the whole table

To return all rows, this is more efficient:

SELECT p.*     , COALESCE(c1.comments, '[]') AS comments     , COALESCE(t.tags, '[]') AS tagsFROM   photo pLEFT   JOIN (   SELECT photo_id        , json_agg(json_build_object('comment_id', comment_id                                   , 'comment', comment)) AS comments   FROM   comment c   GROUP  BY 1   ) c1 USING (photo_id)LEFT  JOIN LATERAL (   SELECT photo_id , json_agg(t) AS tags   FROM   photo_tag pt   JOIN   tag       t USING (tag_id)   GROUP  BY 1   ) t USING (photo_id);

Once we retrieve enough rows, this gets cheaper than LATERAL subqueries. Works for Postgres 9.3+.

Note the USING clause in the join condition. This way we can conveniently use SELECT * in the outer query without getting duplicate columns for photo_id. I didn't use SELECT * here because your deleted answer indicates you want empty JSON arrays instead of NULL for no tags / no comments.

Also remove existing duplicates in base tables

You can't just json_agg(DISTINCT json_build_object(...)) because there is no equality operator for the data type json. See:

There are various better ways:

SELECT *FROM   photo pCROSS  JOIN LATERAL (   SELECT json_agg(to_json(c1.comment)) AS comments1        , json_agg(json_build_object('comment', c1.comment)) AS comments2        , json_agg(to_json(c1)) AS comments3   FROM  (      SELECT DISTINCT c.comment  -- folding dupes here      FROM   comment c      WHERE  c.photo_id = p.photo_id   -- ORDER  BY comment --  any particular order?      ) c1   ) c2CROSS  JOIN LATERAL (   SELECT jsonb_agg(DISTINCT t) AS tags  -- demonstrating jsonb_agg   FROM   photo_tag pt   JOIN   tag       t USING (tag_id)   WHERE  pt.photo_id = p.photo_id   ) tWHERE  p.photo_id = 2;

Demonstrating 4 different techniques in comments1, comments2, comments3 (redundantly) and tags.

db<>fiddle here
Old SQL Fiddle backpatched to Postgres 9.3
Old SQL Fiddle for Postgres 9.6


The most simple thing I discovered is to use DISTINCT over jsonb (not json!). (jsonb_build_object creates jsonb objects)

SELECT    JSON_AGG(       DISTINCT jsonb_build_object('tag_id', photo_tag.tag_id,                                   'name', tag.name)) AS tagsFROM photo    LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id    LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id    LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_idGROUP BY photo.photo_id