Finding entries containing a substring in a numpy array?
We can use np.core.defchararray.find to find the position of the string foo in each element of bar; it returns -1 for elements where foo is not found. Checking the output of find against -1 therefore tells us whether foo is present in each element. Finally, np.flatnonzero gives the indices of the matches. So, we would have an implementation like so -
np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Sample run -
In [91]: bar
Out[91]: array(['aaa', 'aab', 'aca'], dtype='|S3')

In [92]: foo
Out[92]: 'aa'

In [93]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[93]: array([0, 1])

In [94]: bar[2] = 'jaa'

In [95]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[95]: array([0, 1, 2])
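The approach above can be wrapped in a small helper. This is only a sketch: the function name substring_indices is made up for illustration, and it uses np.char.find, the alias that the docstring quoted further down describes as preferred over np.core.defchararray (in recent NumPy releases np.core appears to be treated as a private namespace, so the alias is likely the safer spelling).

```python
import numpy as np

def substring_indices(arr, sub):
    """Return the indices of the elements of `arr` that contain `sub`.

    Hypothetical helper for illustration; not part of NumPy.
    """
    arr = np.asarray(arr)
    # np.char.find gives the lowest index of `sub` in each element,
    # or -1 where `sub` does not occur.
    positions = np.char.find(arr, sub)
    # Keep the indices of the elements where a match was found.
    return np.flatnonzero(positions != -1)

bar = np.array(['aaa', 'aab', 'aca'])
print(substring_indices(bar, 'aa'))
```

The helper returns an integer index array, so it can be used directly for fancy indexing, e.g. bar[substring_indices(bar, 'aa')].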
Look at some examples of using in:
In [19]: bar = np.array(["aaa", "aab", "aca"])

In [20]: 'aa' in bar
Out[20]: False

In [21]: 'aaa' in bar
Out[21]: True

In [22]: 'aab' in bar
Out[22]: True

In [23]: 'aab' in list(bar)
It looks like in, when used with an array, works as though the array were a list. ndarray does have a __contains__ method, so in works, but its implementation is probably simple.
But in any case, note that in alist does not check for substrings. The string's __contains__ does the substring test, but I don't know of any builtin class that propagates the test down to the component strings.
As Divakar shows, there is a collection of numpy functions that apply string methods to the individual elements of an array.
In [42]: np.char.find(bar, 'aa')
Out[42]: array([ 0,  0, -1])
Docstring:
This module contains a set of functions for vectorized string operations and methods. The preferred alias for defchararray is numpy.char.
For operations like this I think the np.char speeds are about the same as with:
In [49]: np.frompyfunc(lambda x: x.find('aa'), 1, 1)(bar)
Out[49]: array([0, 0, -1], dtype=object)

In [50]: np.frompyfunc(lambda x: 'aa' in x, 1, 1)(bar)
Out[50]: array([True, True, False], dtype=object)
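Note that np.frompyfunc returns an object-dtype array, so to get match indices out of it the result has to be cast to bool first. A minimal sketch, where the names contains_aa, mask and idx are only illustrative:

```python
import numpy as np

bar = np.array(['aaa', 'aab', 'aca'])

# Wrap the plain Python substring test as a ufunc-like callable
# taking one input and producing one output.
contains_aa = np.frompyfunc(lambda x: 'aa' in x, 1, 1)

# The raw result has dtype=object; cast it to a real boolean mask.
mask = contains_aa(bar).astype(bool)

# Indices of the elements that contain the substring.
idx = np.flatnonzero(mask)
print(mask, idx)
```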
Further tests suggest that the ndarray __contains__ operates on the flat version of the array - that is, shape doesn't affect its behavior.
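A quick check along those lines, as a sketch: the 2-D array bar2d is made up for illustration, and the comparison with (bar2d == 'jaa').any() reflects my understanding of how the ndarray __contains__ behaves rather than anything stated above.

```python
import numpy as np

bar = np.array(['aaa', 'aab', 'aca'])
bar2d = np.array([['aaa', 'aab'], ['aca', 'jaa']])  # hypothetical 2-D example

# Whole-element matches are found regardless of shape.
whole_1d = 'aab' in bar
whole_2d = 'jaa' in bar2d

# A substring of an element is not a match.
sub_1d = 'aa' in bar

# Element-wise equality followed by any() gives the same answer.
equiv = bool((bar2d == 'jaa').any())

print(whole_1d, whole_2d, sub_1d, equiv)
```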
The way you are trying to use np.where is incorrect. The first argument of np.where should be a boolean array, and you are simply passing it a single boolean.
foo in bar
>>> False
np.where(False)
>>> (array([], dtype=int32),)
np.where(np.array([True, True, False]))
>>> (array([0, 1], dtype=int32),)
The problem is that numpy does not define the in operator as an element-wise boolean operation.
One way you could accomplish what you want is with a list comprehension.
foo = 'aa'
bar = np.array(['aaa', 'aab', 'aca'])
out = [i for i, v in enumerate(bar) if foo in v]
# out = [0, 1]
bar = ['aca', 'bba', 'baa', 'aaf', 'ccc']
out = [i for i, v in enumerate(bar) if foo in v]
# out = [2, 3]
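If you would still rather go through np.where, the same comprehension can build the boolean array it expects. A small sketch, with mask and indices as illustrative names:

```python
import numpy as np

foo = 'aa'
bar = np.array(['aaa', 'aab', 'aca'])

# Build an element-wise boolean array using Python's substring test.
mask = np.array([foo in v for v in bar], dtype=bool)

# np.where on a 1-D boolean array returns a 1-tuple of index arrays.
indices = np.where(mask)[0]
print(indices)
```

This still loops in Python, so it is mainly useful when the boolean mask itself is needed later, for example for bar[mask].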