Searching a sequence in a NumPy array Searching a sequence in a NumPy array numpy numpy

Searching a sequence in a NumPy array


Well, this is basically a template-matching problem that comes up in image-processing a lot. Listed in this post are two approaches: Pure NumPy based and OpenCV (cv2) based.

Approach #1: With NumPy, one can create a 2D array of sliding indices across the entire length of the input array. Thus, each row would be a sliding window of elements. Next, match up each row with the input sequence, which will bring in broadcasting for a vectorized solution. We look for all True rows indicating those are the ones that are the perfect matches and as such would be the starting indices of the matches. Finally, using those indices, create a range of indices extending up to the length of the sequence, to give us the desired output. The implementation would be -

def search_sequence_numpy(arr,seq):    """ Find sequence in an array using NumPy only.    Parameters    ----------        arr    : input 1D array    seq    : input 1D array    Output    ------        Output : 1D Array of indices in the input array that satisfy the     matching of input sequence in the input array.    In case of no match, an empty list is returned.    """    # Store sizes of input array and sequence    Na, Nseq = arr.size, seq.size    # Range of sequence    r_seq = np.arange(Nseq)    # Create a 2D array of sliding indices across the entire length of input array.    # Match up with the input sequence & get the matching starting indices.    M = (arr[np.arange(Na-Nseq+1)[:,None] + r_seq] == seq).all(1)    # Get the range of those indices as final output    if M.any() >0:        return np.where(np.convolve(M,np.ones((Nseq),dtype=int))>0)[0]    else:        return []         # No match found

Approach #2: With OpenCV (cv2), we have a built-in function for template-matching : cv2.matchTemplate. Using this, we would have the starting matching indices. Rest of the steps would be same as for the previous approach. Here's the implementation with cv2 :

from cv2 import matchTemplate as cv2mdef search_sequence_cv2(arr,seq):    """ Find sequence in an array using cv2.    """    # Run a template match with input sequence as the template across    # the entire length of the input array and get scores.    S = cv2m(arr.astype('uint8'),seq.astype('uint8'),cv2.TM_SQDIFF)    # Now, with floating point array cases, the matching scores might not be     # exactly zeros, but would be very small numbers as compared to others.    # So, for that use a very small to be used to threshold the scorees     # against and decide for matches.    thresh = 1e-5 # Would depend on elements in seq. So, be careful setting this.    # Find the matching indices    idx = np.where(S.ravel() < thresh)[0]    # Get the range of those indices as final output    if len(idx)>0:        return np.unique((idx[:,None] + np.arange(seq.size)).ravel())    else:        return []         # No match found

Sample run

In [512]: arr = np.array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0])In [513]: seq = np.array([0,0])In [514]: search_sequence_numpy(arr,seq)Out[514]: array([1, 2, 3, 4, 8, 9])In [515]: search_sequence_cv2(arr,seq)Out[515]: array([1, 2, 3, 4, 8, 9])

Runtime test

In [477]: arr = np.random.randint(0,9,(100000))     ...: seq = np.array([3,6,8,4])     ...: In [478]: np.allclose(search_sequence_numpy(arr,seq),search_sequence_cv2(arr,seq))Out[478]: TrueIn [479]: %timeit search_sequence_numpy(arr,seq)100 loops, best of 3: 11.8 ms per loopIn [480]: %timeit search_sequence_cv2(arr,seq)10 loops, best of 3: 20.6 ms per loop

Seems like the Pure NumPy based one is the safest and fastest!


I find that the most succinct, intuitive and general way to do this is using regular expressions.

import reimport numpy as np# Set the threshold for string printing to infinitenp.set_printoptions(threshold=np.inf)# Remove spaces and linebreaks that would come through when printing your vectoryourarray_string = re.sub('\n|\s','',np.array_str( yourarray ))[1:-1]# The next line is the most important, set the arguments in the braces# such that the first argument is the shortest sequence you want# and the second argument is the longest (using empty as infinite length)r = re.compile(r"[0]{1,}") zero_starts = [m.start() for m in r.finditer( yourarray_string )]zero_ends = [m.end() for m in r.finditer( yourarray_string )]