NumPy: search an array for multiple values and return their indices
A classic way of checking one array against another is to adjust the shape and use '==':
In [250]: arr==query[:,None]
Out[250]:
array([[False, False, False, False, False,  True],
       [False,  True, False, False, False, False],
       [ True, False, False, False, False, False]], dtype=bool)

In [251]: np.where(arr==query[:,None])
Out[251]: (array([0, 1, 2]), array([5, 1, 0]))
If an element of query isn't found in the array, its 'row' will be missing from the first result, e.g. [0,2] instead of [0,1,2]:
In [261]: np.where(arr==np.array(['a','x','v'],dtype='S')[:,None])
Out[261]: (array([0, 2]), array([5, 1]))
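The broadcast comparison above can be sketched as a self-contained script. The contents of arr and query are not shown in this answer, so the arrays below are hypothetical values chosen only to reproduce the indices printed above; find_indices is an illustrative helper name, not part of NumPy.

```python
import numpy as np

# Hypothetical data (assumption): picked so the output matches the session above.
arr = np.array(['c', 'v', 'b', 'd', 'e', 'a'])
query = np.array(['a', 'v', 'c'])

def find_indices(arr, query):
    """Return (rows, cols) such that query[rows[j]] == arr[cols[j]]."""
    # query[:, None] has shape (k, 1); broadcasting against arr (n,) gives (k, n).
    mask = arr == query[:, None]
    return np.nonzero(mask)

rows, cols = find_indices(arr, query)

# A query containing a value ('x') that is absent from arr: row 1 drops out.
query_missing = np.array(['a', 'x', 'v'])
rows_missing, cols_missing = find_indices(arr, query_missing)

# One flag per query element: was it found at least once?
found = (arr == query_missing[:, None]).any(axis=1)
```

Checking found (or comparing rows against np.arange(len(query))) is one way to notice missing elements before trusting the column indices.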
For this small example, it is considerably faster than a list comprehension equivalent:
np.hstack([(arr==i).nonzero()[0] for i in query])
It's a little slower than the searchsorted solution. (In that solution, i can be out of bounds if a query element is not found.)
Stefano suggested fromiter. It saves some time compared to hstack of a list:
In [313]: timeit np.hstack([(arr==i).nonzero()[0] for i in query])
10000 loops, best of 3: 49.5 us per loop

In [314]: timeit np.fromiter(((arr==i).nonzero()[0] for i in query), dtype=int, count=len(query))
10000 loops, best of 3: 35.3 us per loop
But it raises an error if an element is missing, or if there are multiple occurrences. hstack can handle variable-length entries; fromiter cannot.
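A minimal sketch of the variable-length case that hstack tolerates, assuming hypothetical data in which one query value occurs twice and another not at all. The names per_query, counts, and owner are illustrative only.

```python
import numpy as np

# Hypothetical data (assumption): 'a' occurs twice in arr, 'x' does not occur.
arr = np.array(['a', 'b', 'a', 'c'])
query = np.array(['a', 'x', 'c'])

# One index array per query value; their lengths differ (2, 0 and 1 here).
per_query = [(arr == q).nonzero()[0] for q in query]
counts = np.array([len(hits) for hits in per_query])

# hstack concatenates the variable-length pieces into one flat index array.
flat = np.hstack(per_query)

# The flat result no longer says which query value each hit belongs to;
# repeating each query position by its hit count recovers that mapping.
owner = np.repeat(np.arange(len(query)), counts)
```

Keeping counts alongside the flat result is one way to tell a missing value (count 0) from a repeated one (count greater than 1).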
np.flatnonzero(arr==i) is slower than (arr==i).nonzero()[0], but I haven't looked into why.
You can use np.searchsorted on the sorted array, then map the returned indices back to the original array. For that you can use np.argsort, as in:
>>> indx = a.argsort()  # indices that would sort the array
>>> i = np.searchsorted(a[indx], query)  # indices in the sorted array
>>> indx[i]  # indices with respect to the original array
array([5, 1, 0])
If a is of size n and query is of size k, this will be O(n log n + k log n), which is faster than the O(n k) of a linear search if log n < k.
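The searchsorted approach gives no signal when a query value is absent: the insertion position may point at a different element, or past the end of the array. A hedged sketch of one way to guard against that follows. The function name find_sorted and the -1 sentinel are assumptions for illustration, and the sample arrays are hypothetical; the sketch assumes a is non-empty.

```python
import numpy as np

def find_sorted(a, query):
    """Sketch: index in a of each query value, or -1 where it is absent."""
    order = np.argsort(a)                          # indices that would sort a
    # sorter=order lets searchsorted treat a as sorted without copying it.
    pos = np.searchsorted(a, query, sorter=order)  # positions in sorted order
    pos = np.minimum(pos, len(a) - 1)              # clip to stay in bounds
    idx = order[pos]                               # back to original indices
    found = a[idx] == query                        # did we land on the value?
    return np.where(found, idx, -1)

# Hypothetical data (assumption), consistent with the output shown above.
a = np.array(['c', 'v', 'b', 'd', 'e', 'a'])
hits = find_sorted(a, np.array(['a', 'v', 'c']))
with_missing = find_sorted(a, np.array(['a', 'x', 'v']))
```

If a contains duplicates, this returns one matching index per query value rather than all of them.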