Is there a standard approach for dealing with unordered arrays (sets) in PostgreSQL?


There's no built-in way at this time.

As arrays

If you consistently normalize them on save you can treat arrays as sets, by always storing them sorted and de-duplicated. It'd be great if PostgreSQL had a built-in C function to do this, but it doesn't. I took a look at writing one but the C array API is horrible, so even though I've written a bunch of extensions I just backed carefully away from this one.

If you don't mind moderately icky performance you can do it in SQL:

CREATE OR REPLACE FUNCTION array_uniq_sort(anyarray) RETURNS anyarray AS $$
SELECT array_agg(DISTINCT f ORDER BY f) FROM unnest($1) f;
$$ LANGUAGE sql IMMUTABLE;
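For example (a sketch; note that array_agg over zero rows returns NULL, so an empty input array comes back as NULL rather than an empty array):

SELECT array_uniq_sort(ARRAY['b','a','c','a']);  -- {a,b,c}
SELECT array_uniq_sort('{}'::text[]);            -- NULL, not {}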

Then wrap all saves in calls to array_uniq_sort, or enforce it with a trigger. After that you can just compare your arrays for equality. You can skip the array_uniq_sort calls for data coming from the app if you do the sort/de-duplication on the app side instead.
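A minimal sketch of the trigger approach, assuming a hypothetical items table with a text[] tags column (the table and column names are illustrative):

CREATE TABLE items (
    id serial PRIMARY KEY,
    tags text[] NOT NULL DEFAULT '{}'
);

-- Normalize the array on every write so stored arrays are always
-- sorted and de-duplicated. COALESCE guards against the NULL that
-- array_agg produces for an empty input array.
CREATE OR REPLACE FUNCTION items_normalize_tags() RETURNS trigger AS $$
BEGIN
    NEW.tags := COALESCE(array_uniq_sort(NEW.tags), '{}');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_normalize_tags
BEFORE INSERT OR UPDATE ON items
FOR EACH ROW EXECUTE PROCEDURE items_normalize_tags();

With this in place, ARRAY['b','a','a'] and ARRAY['a','b'] are stored identically, so a plain = comparison behaves like set equality.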

If you do this please store your "sets" as array columns, like text[], not comma- or space-delimited text. See this question for some of the reasons.

You need to watch out for a few things, like the fact that casts between arrays are stricter than casts between their base types. E.g.:

regress=> SELECT 'a' = 'a'::varchar, 'b' = 'b'::varchar;
 ?column? | ?column? 
----------+----------
 t        | t
(1 row)

regress=> SELECT ARRAY['a','b'] = ARRAY['a','b']::varchar[];
ERROR:  operator does not exist: text[] = character varying[]
LINE 1: SELECT ARRAY['a','b'] = ARRAY['a','b']::varchar[];
                              ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.

regress=> SELECT ARRAY['a','b']::varchar[] = ARRAY['a','b']::varchar[];
 ?column? 
----------
 t
(1 row)

Such columns are GIN-indexable for operations like array-contains (@>) or array-overlaps (&&); see the PostgreSQL documentation on array indexing. (GiST support for plain arrays isn't built in; for integer arrays the intarray extension, discussed below, adds it.)
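For example (a sketch, reusing the hypothetical items table from above):

CREATE INDEX items_tags_gin ON items USING gin (tags);

-- Both of these can use the GIN index:
SELECT * FROM items WHERE tags @> ARRAY['red', 'heavy'];  -- contains
SELECT * FROM items WHERE tags && ARRAY['red', 'blue'];   -- overlaps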

As normalized rows

The other option is to just store normalized rows with a suitable key. I'd still use array_agg for sorting and comparing them, as SQL set operations can be clunky to use for this (especially given the lack of an XOR / double-sided set difference operation).

This is generally known as EAV (entity-attribute-value). I'm not a fan myself, but it does have its place occasionally. Except you'd be using it without the value component.

You create a table:

CREATE TABLE item_attributes (
    item_id integer references items(id),
    attribute_name text,
    primary key(item_id, attribute_name)
);

and insert one row per set element per item, instead of giving each item an array-valued column. The unique constraint enforced by the primary key ensures that no item can have duplicates of a given attribute. Attribute ordering is irrelevant/undefined.

Comparisons can be done with SQL set operators like EXCEPT, or by using array_agg(attribute_name ORDER BY attribute_name) to form consistently sorted arrays for comparison.
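For example, a sketch comparing the attribute sets of two items (the item ids are illustrative):

-- Attributes item 1 has that item 2 lacks. This is one-directional;
-- run it both ways for a full symmetric difference.
SELECT attribute_name FROM item_attributes WHERE item_id = 1
EXCEPT
SELECT attribute_name FROM item_attributes WHERE item_id = 2;

-- Or test the two sets for equality as consistently sorted arrays:
SELECT (SELECT array_agg(attribute_name ORDER BY attribute_name)
        FROM item_attributes WHERE item_id = 1)
     = (SELECT array_agg(attribute_name ORDER BY attribute_name)
        FROM item_attributes WHERE item_id = 2);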

Indexing is limited to determining whether a given item has/doesn't have a given attribute.
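If you also need the reverse lookup, finding all items that carry a given attribute, a secondary index helps (a sketch; the index name is illustrative):

CREATE INDEX item_attributes_by_name ON item_attributes (attribute_name);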

Personally I'd use arrays over this approach.

hstore

You can also use hstores with empty values to store sets, as hstore de-duplicates keys. 9.4's jsonb will also work for this.

regress=# create extension hstore;
CREATE EXTENSION
regress=# SELECT hstore('a => 1, b => 1') = hstore('b => 1, a => 1, b => 1');
 ?column? 
----------
 t
(1 row)

It's only really useful for text types, though; keys are compared as text, so numerically equal keys with different representations don't match. E.g.:

regress=# SELECT hstore('"1.0" => 1, "2.0" => 1') = hstore('"1.00" => 1, "1.000" => 1, "2.0" => 1');
 ?column? 
----------
 f
(1 row)

and I think it's ugly. So again, I'd favour arrays.

For integer arrays only

The intarray extension provides fast functions for treating arrays as sets. They're only available for integer arrays, but where they apply they're very useful.
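For example, a sketch using intarray's sort and uniq functions and its set operators:

CREATE EXTENSION intarray;

-- uniq() only removes adjacent duplicates, so sort first
-- to normalize an integer array into set form.
SELECT uniq(sort(ARRAY[3, 1, 2, 3, 1]));  -- {1,2,3}

-- Set operators:
SELECT ARRAY[1, 2, 3] & ARRAY[2, 3, 4];   -- intersection: {2,3}
SELECT ARRAY[1, 2] | ARRAY[2, 3];         -- union: {1,2,3}
SELECT ARRAY[1, 2, 3] - ARRAY[2];         -- difference: {1,3}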