Algorithm: efficient way to remove duplicate integers from an array


A solution suggested by my girlfriend is a variation of merge sort. The only modification is that, during the merge step, duplicated values are simply discarded. This solution is also O(n log n); the sorting and the duplicate removal are combined into one pass. I'm not sure whether that makes any practical difference, though.
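
For illustration, here is a minimal sketch of that idea in D (the name mergeSortUnique and the details are my own, not from the suggestion above): whenever the merge step emits a value, it skips every remaining copy of that value in both halves.

import std.stdio : writeln;

int[] mergeSortUnique(int[] a) {
    if (a.length < 2)
        return a.dup;
    auto left  = mergeSortUnique(a[0 .. $ / 2]);
    auto right = mergeSortUnique(a[$ / 2 .. $]);
    int[] result;
    result.reserve(left.length + right.length);
    size_t i, j;
    while (i < left.length && j < right.length) {
        // Emit the smaller head value once, then skip every copy of it
        // in both halves; this is the "disregard duplicates" merge step.
        int smaller = left[i] <= right[j] ? left[i] : right[j];
        result ~= smaller;
        while (i < left.length && left[i] == smaller) i++;
        while (j < right.length && right[j] == smaller) j++;
    }
    // Each half is already sorted and duplicate-free, and every remaining
    // element is larger than anything emitted so far.
    result ~= left[i .. $];
    result ~= right[j .. $];
    return result;
}

void main() {
    writeln(mergeSortUnique([4, 2, 4, 1, 2, 3])); // [1, 2, 3, 4]
}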


I've posted this once before on SO, but I'll reproduce it here because it's pretty cool. It uses hashing, building something like a hash set in place. It's guaranteed to be O(1) in auxiliary space (the recursion is a tail call) and is typically O(N) in time. The algorithm is as follows:

  1. Take the first element of the array; this will be the sentinel.
  2. Reorder the rest of the array, as much as possible, so that each element is in the position corresponding to its hash. As this step completes, duplicates will be discovered; set them equal to the sentinel.
  3. Move all elements whose index equals their hash to the beginning of the array.
  4. Move all elements that are equal to the sentinel, except the first element of the array, to the end of the array.
  5. What's left between the properly hashed elements and the duplicates are the elements that couldn't be placed at the index corresponding to their hash because of a collision. Recurse to deal with these elements.

This can be shown to be O(N) provided there is no pathological scenario in the hashing: even if there are no duplicates, approximately 2/3 of the elements are eliminated at each recursion. Each level of recursion is O(n), where n is the number of elements left. The only problem is that, in practice, it's slower than a quicksort when there are few duplicates, i.e. lots of collisions. However, when there are huge numbers of duplicates, it's amazingly fast.

Edit: In current implementations of D, hash_t is 32 bits. Everything about this algorithm assumes that there will be very few, if any, hash collisions in full 32-bit space. Collisions may, however, occur frequently in the modulus space. However, this assumption will in all likelihood be true for any reasonably sized data set. If the key is less than or equal to 32 bits, it can be its own hash, meaning that a collision in full 32-bit space is impossible. If it is larger, you simply can't fit enough of them into 32-bit memory address space for it to be a problem. I assume hash_t will be increased to 64 bits in 64-bit implementations of D, where datasets can be larger. Furthermore, if this ever did prove to be a problem, one could change the hash function at each level of recursion.

Here's an implementation in the D programming language:

import std.algorithm : swap;

void uniqueInPlace(T)(ref T[] dataIn) {
    uniqueInPlaceImpl(dataIn, 0);
}

void uniqueInPlaceImpl(T)(ref T[] dataIn, size_t start) {
    if (dataIn.length - start < 2)
        return;

    immutable T sentinel = dataIn[start];
    T[] data = dataIn[start + 1 .. $];

    static hash_t getHash(T elem) {
        static if (is(T == uint) || is(T == int)) {
            return cast(hash_t) elem;
        } else static if (__traits(compiles, elem.toHash)) {
            return elem.toHash;
        } else {
            static auto ti = typeid(typeof(elem));
            return ti.getHash(&elem);
        }
    }

    // Step 2: place each element at the index given by its hash,
    // overwriting discovered duplicates with the sentinel.
    for (size_t index = 0; index < data.length;) {
        if (data[index] == sentinel) {
            index++;
            continue;
        }
        auto hash = getHash(data[index]) % data.length;
        if (index == hash) {
            index++;
            continue;
        }
        if (data[index] == data[hash]) {
            data[index] = sentinel;  // duplicate discovered
            index++;
            continue;
        }
        if (data[hash] == sentinel) {
            swap(data[hash], data[index]);
            index++;
            continue;
        }
        auto hashHash = getHash(data[hash]) % data.length;
        if (hashHash != hash) {
            swap(data[index], data[hash]);
            if (hash < index)
                index++;
        } else {
            index++;  // genuine collision; leave it for the recursion
        }
    }

    // Step 3: move properly hashed elements to the front.
    size_t swapPos = 0;
    foreach (i; 0 .. data.length) {
        if (data[i] != sentinel && i == getHash(data[i]) % data.length) {
            swap(data[i], data[swapPos++]);
        }
    }

    // Step 4: move sentinel copies (the duplicates) to the end.
    size_t sentinelPos = data.length;
    for (size_t i = swapPos; i < sentinelPos;) {
        if (data[i] == sentinel) {
            swap(data[i], data[--sentinelPos]);
        } else {
            i++;
        }
    }

    // Step 5: trim the duplicates off and recurse on the collided middle.
    dataIn = dataIn[0 .. sentinelPos + start + 1];
    uniqueInPlaceImpl(dataIn, start + swapPos + 1);
}


If you are looking for the best asymptotic complexity, then sorting the array with an O(n log n) sort and then doing an O(n) traversal may be the best route. Without sorting, you are looking at O(n^2).
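
A minimal sketch of that route in D (removeDuplicates is a name of my own): sort in place, then make one linear pass that keeps a value only when it differs from its predecessor.

import std.algorithm : sort;
import std.stdio : writeln;

int[] removeDuplicates(int[] a) {
    if (a.length == 0)
        return a;
    sort(a);                       // O(n log n)
    size_t write = 1;
    foreach (i; 1 .. a.length) {   // O(n) compaction pass
        if (a[i] != a[write - 1])
            a[write++] = a[i];
    }
    return a[0 .. write];
}

void main() {
    writeln(removeDuplicates([3, 1, 3, 2, 1])); // [1, 2, 3]
}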

Edit: if you are just doing integers, then you can also use a radix sort to get O(n).
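
As a sketch of the radix route for 32-bit unsigned integers (all names here are my own): four stable counting-sort passes, one byte per pass, followed by the same linear compaction as above.

import std.stdio : writeln;

void radixSort(uint[] a) {
    auto buf = new uint[a.length];
    foreach (shift; [0, 8, 16, 24]) {
        size_t[256] count;
        foreach (x; a)                      // histogram of this byte
            count[(x >> shift) & 0xFF]++;
        size_t total = 0;
        foreach (ref c; count) {            // prefix sums -> start offsets
            auto tmp = c;
            c = total;
            total += tmp;
        }
        foreach (x; a)                      // stable scatter into the buffer
            buf[count[(x >> shift) & 0xFF]++] = x;
        a[] = buf[];
    }
}

void main() {
    uint[] a = [5, 3, 5, 1, 3];
    radixSort(a);
    size_t write = a.length ? 1 : 0;        // compact out adjacent duplicates
    foreach (i; 1 .. a.length)
        if (a[i] != a[write - 1])
            a[write++] = a[i];
    writeln(a[0 .. write]); // [1, 3, 5]
}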