
Fastest way to check if duplicates exist in a python list / numpy ndarray


Here are the four ways I thought of doing it.

TL;DR: if you expect very few (less than 1/1000) duplicates:

    def contains_duplicates(X):
        return len(np.unique(X)) != len(X)

If you expect frequent (more than 1/1000) duplicates:

    def contains_duplicates(X):
        seen = set()
        seen_add = seen.add
        for x in X:
            if (x in seen or seen_add(x)):
                return True
        return False

The first method is an early exit from this answer, which wants to return the unique values; the second applies the same idea to this answer.
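For reference, either version gives the same result on a tiny example (assuming numpy has been imported for the first one):

    >>> contains_duplicates([1, 2, 3])
    False
    >>> contains_duplicates([1, 2, 2, 3])
    True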

    >>> import numpy as np
    >>> X = np.random.normal(0,1,[10000])
    >>> def terhorst_early_exit(X):
    ...:     elems = set()
    ...:     for i in X:
    ...:         if i in elems:
    ...:             return True
    ...:         elems.add(i)
    ...:     return False
    >>> %timeit terhorst_early_exit(X)
    100 loops, best of 3: 10.6 ms per loop
    >>> def peterbe_early_exit(X):
    ...:     seen = set()
    ...:     seen_add = seen.add
    ...:     for x in X:
    ...:         if (x in seen or seen_add(x)):
    ...:             return True
    ...:     return False
    >>> %timeit peterbe_early_exit(X)
    100 loops, best of 3: 9.35 ms per loop
    >>> %timeit len(set(X)) != len(X)
    100 loops, best of 3: 4.54 ms per loop
    >>> %timeit len(np.unique(X)) != len(X)
    1000 loops, best of 3: 967 µs per loop

Do things change if you start with an ordinary Python list, and not a numpy.ndarray?

    >>> X = X.tolist()
    >>> %timeit terhorst_early_exit(X)
    100 loops, best of 3: 9.34 ms per loop
    >>> %timeit peterbe_early_exit(X)
    100 loops, best of 3: 8.07 ms per loop
    >>> %timeit len(set(X)) != len(X)
    100 loops, best of 3: 3.09 ms per loop
    >>> %timeit len(np.unique(X)) != len(X)
    1000 loops, best of 3: 1.83 ms per loop

Edit: what if we have a prior expectation of the number of duplicates?

The above comparison assumes that either a) there are likely to be no duplicates, or b) we're more worried about the worst case than the average case.

    >>> X = np.random.normal(0, 1, [10000])
    >>> for n_duplicates in [1, 10, 100]:
    >>>     print("{} duplicates".format(n_duplicates))
    >>>     duplicate_idx = np.random.choice(len(X), n_duplicates, replace=False)
    >>>     X[duplicate_idx] = 0
    >>>     print("terhost_early_exit")
    >>>     %timeit terhorst_early_exit(X)
    >>>     print("peterbe_early_exit")
    >>>     %timeit peterbe_early_exit(X)
    >>>     print("set length")
    >>>     %timeit len(set(X)) != len(X)
    >>>     print("numpy unique length")
    >>>     %timeit len(np.unique(X)) != len(X)
    1 duplicates
    terhost_early_exit
    100 loops, best of 3: 12.3 ms per loop
    peterbe_early_exit
    100 loops, best of 3: 9.55 ms per loop
    set length
    100 loops, best of 3: 4.71 ms per loop
    numpy unique length
    1000 loops, best of 3: 1.31 ms per loop
    10 duplicates
    terhost_early_exit
    1000 loops, best of 3: 1.81 ms per loop
    peterbe_early_exit
    1000 loops, best of 3: 1.47 ms per loop
    set length
    100 loops, best of 3: 5.44 ms per loop
    numpy unique length
    1000 loops, best of 3: 1.37 ms per loop
    100 duplicates
    terhost_early_exit
    10000 loops, best of 3: 111 µs per loop
    peterbe_early_exit
    10000 loops, best of 3: 99 µs per loop
    set length
    100 loops, best of 3: 5.16 ms per loop
    numpy unique length
    1000 loops, best of 3: 1.19 ms per loop

So if you expect very few duplicates, the numpy.unique function is the way to go. As the number of expected duplicates increases, the early exit methods dominate.
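If you want a single entry point, one option is to fold that rule of thumb into a wrapper that picks a strategy from an expected duplicate rate. This is just a sketch; the expected_dup_rate parameter and the 1/1000 cutoff are illustrative choices, not something taken from the benchmarks above.

    import numpy as np

    def contains_duplicates(X, expected_dup_rate=0.0):
        # Illustrative dispatcher: expected_dup_rate and the 1/1000 cutoff are
        # assumptions, not measured values.
        if expected_dup_rate < 1e-3:
            # Few or no duplicates expected: one vectorized pass is cheapest.
            return len(np.unique(X)) != len(X)
        # Duplicates likely: the early-exit scan usually wins on average.
        seen = set()
        seen_add = seen.add
        for x in X:
            if x in seen or seen_add(x):
                return True
        return False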


Depending on how large your array is, and how likely duplicates are, the answer will be different.

For example, if you expect the average array to have around 3 duplicates, early exit will cut your average-case time (and space) by 2/3rds; if you expect only 1 in 1000 arrays to have any duplicates at all, it will just add a bit of complexity without improving anything.

Meanwhile, if the arrays are big enough that building a temporary set as large as the array is likely to be expensive, sticking a probabilistic test like a bloom filter in front of it will probably speed things up dramatically, but if not, it's again just wasted effort.

Finally, you want to stay within numpy if at all possible. Looping over an array of floats (or whatever) and boxing each one into a Python object is going to take almost as much time as hashing and checking the values, and of course storing things in a Python set instead of optimized numpy storage is wasteful as well. But you have to trade that off against the other issues: you can't do early exit with numpy, and there may be nice C-optimized bloom filter implementations a pip install away, but none of them may be numpy-friendly.
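If staying within numpy matters more than a true element-by-element early exit, one compromise (not benchmarked here, just a sketch with an arbitrary chunk size) is to scan the array a chunk at a time:

    import numpy as np

    def contains_duplicates_chunked(X, chunk_size=1000):
        # Sketch: approximate early exit while staying in numpy by checking one
        # chunk at a time; chunk_size is an arbitrary choice.
        X = np.asarray(X)
        seen = np.empty(0, dtype=X.dtype)
        for start in range(0, len(X), chunk_size):
            chunk = X[start:start + chunk_size]
            # Duplicate within this chunk?
            if len(np.unique(chunk)) != len(chunk):
                return True
            # Duplicate against anything from earlier chunks?
            if np.isin(chunk, seen).any():
                return True
            seen = np.concatenate([seen, chunk])
        return False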

So, there's no one best solution for all possible scenarios.


Just to give an idea of how easy it is to write a bloom filter, here's one I hacked together in a couple minutes:

    from bitarray import bitarray  # pip3 install bitarray

    def dupcheck(X):
        # Hardcoded values to give about 5% false positives for 10000 elements
        size = 62352
        hashcount = 4
        bits = bitarray(size)
        bits.setall(0)
        def check(x, hash=hash):  # TODO: default-value bits, hashcount, size?
            for i in range(hashcount):
                if not bits[hash((x, i)) % size]: return False
            return True
        def add(x):
            for i in range(hashcount):
                bits[hash((x, i)) % size] = True
        seen = set()
        seen_add = seen.add
        for x in X:
            if check(x) or add(x):
                if x in seen or seen_add(x):
                    return True
        return False
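The hardcoded size and hashcount look like they come from the standard bloom-filter sizing formulas for n = 10000 items at roughly a 5% false-positive rate; a quick sanity check of that assumption:

    from math import ceil, log

    # Standard bloom-filter sizing (assumed, not stated in the answer):
    # m = -n*ln(p) / (ln 2)^2 bits, k = (m/n)*ln 2 hash functions.
    n, p = 10000, 0.05
    m = ceil(-n * log(p) / log(2) ** 2)   # ~62353 bits, close to size = 62352
    k = round(m / n * log(2))             # 4, matching hashcount = 4
    print(m, k)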

This only uses 12KB (a 62352-bit bitarray plus a 500-float set) instead of 80KB (a 10000-float set or np.array). Which doesn't matter when you're only dealing with 10K elements, but with, say, 10B elements that use up more than half of your physical RAM, it would be a different story.

Of course it's almost certainly going to be an order of magnitude or so slower than using np.unique, or maybe even set, because we're doing all that slow looping in Python. But if this turns out to be worth doing, it should be a breeze to rewrite in Cython (and to directly access the numpy array without boxing and unboxing).


My timing tests differ from Scott's for small lists. Using Python 3.7.3, set() is much faster than np.unique for a small numpy array from randint (length 8), but np.unique is faster for a larger array (length 1000).

Length 8

    Timing test iterations: 10000
    Function          Min      Avg Sec  Conclusion      p-value
    ----------  ---------  -----------  ------------  ---------
    set_len     0          7.73486e-06  Baseline
    unique_len  9.644e-06  2.55573e-05  Slower                0

Length 1000

    Timing test iterations: 10000
    Function           Min      Avg Sec  Conclusion      p-value
    ----------  ----------  -----------  ------------  ---------
    set_len     0.00011066  0.000270466  Baseline
    unique_len  4.3684e-05  8.95608e-05  Faster                0

Then I tried my own implementation, but I think it would require optimized C code to beat set:

    def check_items(key_rand, **kwargs):
        # Brute-force O(n**2) pairwise comparison; returns True on the first duplicate found
        for i, vali in enumerate(key_rand):
            for j in range(i + 1, len(key_rand)):
                valj = key_rand[j]
                if vali == valj:
                    return True
        return False

Length 8

    Timing test iterations: 10000
    Function            Min      Avg Sec  Conclusion      p-value
    -----------  ----------  -----------  ------------  ---------
    set_len      0           6.74221e-06  Baseline
    unique_len   0           2.14604e-05  Slower                0
    check_items  1.1138e-05  2.16369e-05  Slower                0

(using my randomized compare_time() function from easyinfo)
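If you don't have easyinfo installed, a rough equivalent of the set_len vs. unique_len comparison can be put together with the standard library's timeit; this is only a sketch, and the repeat/number values are arbitrary:

    import timeit
    import numpy as np

    # Rough stand-in for the compare_time() runs above; repeat/number are arbitrary.
    for length in (8, 1000):
        X = np.random.randint(0, length, length)
        for name, stmt in [("set_len", "len(set(X)) != len(X)"),
                           ("unique_len", "len(np.unique(X)) != len(X)")]:
            best = min(timeit.repeat(stmt, globals={"X": X, "np": np},
                                     repeat=5, number=1000))
            print("length {:5d}  {:12s} {:.3e} s per call".format(length, name, best / 1000))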