C - How to implement Set data structure? C - How to implement Set data structure? c c

C - How to implement Set data structure?


There are multiple ways of implementing set (and map) functionality, for example:

  • tree-based approach (ordered traversal)
  • hash-based approach (unordered traversal)

Since you mentioned value-indexed arrays, let's try the hash-based approach which builds naturally on top of the value-indexed array technique.

Beware of the advantages and disadvantages of hash-based vs. tree-based approaches.

You can design a hash-set (a special case of hash-tables) of pointers to hashable PODs, with chaining, internally represented as a fixed-size array of buckets of hashables, where:

  • all hashables in a bucket have the same hash value
  • a bucket can be implemented as a dynamic array or linked list of hashables
  • a hashable's hash value is used to index into the array of buckets (hash-value-indexed array)
  • one or more of the hashables contained in the hash-set could be (a pointer to) another hash-set, or even to the hash-set itself (i.e. self-inclusion is possible)

With large amounts of memory at your disposal, you can size your array of buckets generously and, in combination with a good hash method, drastically reduce the probability of collision, achieving virtually constant-time performance.

You would have to implement:

  • the hash function for the type being hashed
  • an equality function for the type being used to test whether two hashables are equal or not
  • the hash-set contains/insert/remove functionality.

You can also use open addressing as an alternative to maintaining and managing buckets.


Sets are usually implemented as some variety of a binary tree. Red black trees have good worst case performance.

These can also be used to build an map to allow key / value lookups.

This approach requires some sort of ordering on the elements of the set and the key values in a map.

I'm not sure how you would manage a set that could possibly contain itself using binary trees if you limit set membership to well defined types in C ... comparison between such constructs could be problematic. You could do it easily enough in C++, though.


The way to get genericity in C is by void *, so you're going to be using pointers anyway, and pointers to different objects are unique. This means you need a hash map or binary tree containing pointers, and this will work for all data objects.

The downside of this is that you can't enter rvalues independently. You can't have a set containing the value 5; you have to assign 5 to a variable, which means it won't match a random 5. You could enter it as (void *) 5, and for practical purposes this is likely to work with small integers, but if your integers can get into large enough sizes to compete with pointers this has a very small probability of failing.

Nor does this work with string values. Given char a[] = "Hello, World!"; char b[] = "Hello, World!";, a set of pointers would find a and b to be different. You would probably want to hash the values, but if you're concerned about hash collisions you should save the string in the set and do a strncmp() to compare the stored string with the probing string.

(There's similar problems with floating-point numbers, but trying to represent floating-point numbers in sets is a bad idea in the first place.)

Therefore, you'd probably want a tagged value, one tag for any sort of object, one for integer value, and one for string value, and possibly more for different sorts of values. It's complicated, but doable.