Determining duplicate values in an array



As of numpy version 1.9.0, np.unique has an argument return_counts which greatly simplifies your task:

u, c = np.unique(a, return_counts=True)
dup = u[c > 1]
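For a self-contained sketch, here is the same idea run end to end (the sample array is borrowed from the Counter answer below):

```python
import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])
u, c = np.unique(a, return_counts=True)  # sorted unique values and their counts
dup = u[c > 1]                           # values that occur more than once
print(dup.tolist())                      # [1, 3]
```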

This is similar to using Counter, except you get a pair of arrays instead of a mapping. I'd be curious to see how they perform relative to each other.

It's probably worth mentioning that even though np.unique is quite fast in practice due to its numpyness, it has worse algorithmic complexity than the Counter solution. np.unique is sort-based, so runs asymptotically in O(n log n) time. Counter is hash-based, so has O(n) complexity. This will not matter much for anything but the largest datasets.
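Out of that curiosity, a rough timeit sketch comparing the two approaches might look like this (the array size and value range are arbitrary illustration choices, and the numbers will vary by machine):

```python
import timeit
from collections import Counter

import numpy as np

# Arbitrary test data: 100k integers drawn from a range small enough
# to guarantee plenty of duplicates.
rng = np.random.default_rng(0)
a = rng.integers(0, 1000, size=100_000)

def dup_unique(arr):
    # sort-based: O(n log n)
    u, c = np.unique(arr, return_counts=True)
    return u[c > 1]

def dup_counter(arr):
    # hash-based: O(n)
    return [item for item, count in Counter(arr).items() if count > 1]

t_unique = timeit.timeit(lambda: dup_unique(a), number=10)
t_counter = timeit.timeit(lambda: dup_counter(a), number=10)
print(f"np.unique: {t_unique:.4f}s  Counter: {t_counter:.4f}s")
```

Both functions find the same set of duplicates; only the constant factors and asymptotics differ.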


I think this is clearest when done outside of numpy. You'll have to time it against your numpy solutions if you are concerned with speed.

>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]

Note: this is similar to Burhan Khalid's answer, but unpacking the pairs from items() directly, rather than subscripting inside the condition, should be faster.


People have already suggested Counter variants, but here's one which doesn't use a listcomp:

>>> from collections import Counter
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> list((Counter(a) - Counter(set(a))).keys())
[1, 3]

(On Python 3, keys() returns a view, hence the list() call; on Python 2 it returned a list directly.)

[Posted not because it's efficient -- it's not -- but because I think it's cute that you can subtract Counter instances.]
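For readers puzzled by why the subtraction works: Counter subtraction keeps only elements whose resulting count is positive, so subtracting one occurrence of every distinct value leaves exactly the duplicates. A quick illustration:

```python
from collections import Counter

a = [1, 2, 1, 3, 3, 3, 0]
full = Counter(a)        # counts every occurrence: 1 -> 2, 2 -> 1, 3 -> 3, 0 -> 1
ones = Counter(set(a))   # every distinct value exactly once
diff = full - ones       # subtraction drops zero and negative counts
print(sorted(diff))      # [1, 3]
```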