Finding entries containing a substring in a numpy array?


We can use np.core.defchararray.find to find the position of the string foo in each element of bar; it returns -1 where foo is not found. Comparing that output against -1 therefore tells us whether foo is present in each element. Finally, np.flatnonzero gives the indices of the matches. So an implementation would look like this -

np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)

Sample run -

In [91]: bar
Out[91]: array(['aaa', 'aab', 'aca'], dtype='|S3')

In [92]: foo
Out[92]: 'aa'

In [93]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[93]: array([0, 1])

In [94]: bar[2] = 'jaa'

In [95]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[95]: array([0, 1, 2])
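The same idea can be wrapped in a small helper. This is only a minimal sketch: the function name substring_indices is made up for illustration, and it assumes the np.char alias for defchararray and a unicode string array rather than the bytes array shown above.

```python
import numpy as np

def substring_indices(arr, sub):
    """Hypothetical helper: indices of the elements of arr that contain sub."""
    # np.char.find returns the first position of sub in each element, or -1.
    positions = np.char.find(arr, sub)
    # Keep the indices where a position was found.
    return np.flatnonzero(positions != -1)

bar = np.array(['aaa', 'aab', 'aca'])
print(substring_indices(bar, 'aa'))

bar[2] = 'jaa'
print(substring_indices(bar, 'aa'))
```

Returning indices rather than the boolean mask mirrors the one-liner above; if a mask is more convenient, the comparison `positions != -1` can be returned directly.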


Look at some examples of using in:

In [19]: bar = np.array(["aaa", "aab", "aca"])

In [20]: 'aa' in bar
Out[20]: False

In [21]: 'aaa' in bar
Out[21]: True

In [22]: 'aab' in bar
Out[22]: True

In [23]: 'aab' in list(bar)

It looks like in, when used with an array, works as though the array were a list: it tests whether any whole element equals the given value. ndarray does have a __contains__ method, so in works, but it is probably simple.

But in any case, note that x in alist does not check for substrings. The string's own __contains__ does the substring test, but I don't know of any builtin class that propagates the test down to the component strings.
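To make the distinction concrete, here is a small sketch contrasting the two tests. The variable names are illustrative only, and the equality-based line is my reading of how the array membership test behaves for a scalar, not something taken from the numpy source.

```python
import numpy as np

bar = np.array(['aaa', 'aab', 'aca'])

# Membership on the array compares whole elements, as a list would.
whole_match = 'aa' in bar                 # no element equals 'aa'
same_as_any = bool((bar == 'aa').any())   # equivalent elementwise check

# The substring test has to be applied to each string individually.
per_element = ['aa' in s for s in bar]

print(whole_match, same_as_any, per_element)
```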

As Divakar shows, there is a collection of numpy functions that apply string methods to the individual elements of an array.

In [42]: np.char.find(bar, 'aa')
Out[42]: array([ 0,  0, -1])

Docstring:
This module contains a set of functions for vectorized string operations and methods. The preferred alias for defchararray is numpy.char.

For operations like this I think the np.char speeds are about the same as with:

In [49]: np.frompyfunc(lambda x: x.find('aa'), 1, 1)(bar)
Out[49]: array([0, 0, -1], dtype=object)

In [50]: np.frompyfunc(lambda x: 'aa' in x, 1, 1)(bar)
Out[50]: array([True, True, False], dtype=object)
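Note that frompyfunc returns an object-dtype array, so it usually needs a cast before it can serve as a mask. A minimal sketch of that extra step, with contains_aa as a made-up name for the wrapped function:

```python
import numpy as np

bar = np.array(['aaa', 'aab', 'aca'])

# Wrap the plain Python substring test so it is applied to each element.
contains_aa = np.frompyfunc(lambda s: 'aa' in s, 1, 1)

# The result has dtype=object; cast it to bool to get a proper mask.
mask = contains_aa(bar).astype(bool)

# Indices of the matching elements.
idx = np.flatnonzero(mask)
print(mask, idx)
```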

Further tests suggest that the ndarray __contains__ operates on the flat version of the array - that is, shape doesn't affect its behavior.
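A quick way to check this is to compare the result on a 2-D array with the result on its flattened version. The array bar2d below is an invented example, not one from the question.

```python
import numpy as np

# A 2x2 array of strings (made-up data for illustration).
bar2d = np.array([['aaa', 'aab'], ['aca', 'jaa']])

# Whole-element membership gives the same answer regardless of shape.
in_2d = 'aab' in bar2d
in_flat = 'aab' in bar2d.ravel()

# A substring that is not a whole element is still not found.
sub_in_2d = 'aa' in bar2d

print(in_2d, in_flat, sub_in_2d)
```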


The way you are trying to use np.where is incorrect. The first argument of np.where should be a boolean array, and you are simply passing it a boolean.

foo in bar
>>> False
np.where(False)
>>> (array([], dtype=int32),)
np.where(np.array([True, True, False]))
>>> (array([0, 1], dtype=int32),)

The problem is that numpy does not define the in operator as an element-wise boolean operation.
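If you still want to use np.where, you can build the element-wise boolean array yourself and pass that in. A minimal sketch, assuming foo and bar as in the question; the names mask and idx are just for illustration:

```python
import numpy as np

foo = 'aa'
bar = np.array(['aaa', 'aab', 'aca'])

# Apply the substring test to each element to get one boolean per entry.
mask = np.array([foo in v for v in bar])

# np.where returns a tuple with one index array per dimension.
(idx,) = np.where(mask)
print(idx)
```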

One way you could accomplish what you want is with a list comprehension.

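import numpy as np

foo = 'aa'
bar = np.array(['aaa', 'aab', 'aca'])
# Collect the index of every element that contains foo.
out = [i for i, v in enumerate(bar) if foo in v]
# out = [0, 1]

# The same comprehension works on a plain list.
bar = ['aca', 'bba', 'baa', 'aaf', 'ccc']
out = [i for i, v in enumerate(bar) if foo in v]
# out = [2, 3]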