Pythonic and efficient way to do an elementwise "in" using numpy


To take advantage of NumPy's broadcasting rules, first turn b into a rectangular array by padding the shorter lists, which can be done with itertools.izip_longest (named zip_longest in Python 3):

from itertools import izip_longest
c = np.array(list(izip_longest(*b))).astype(float)

resulting in:

array([[  1.,   2.,   5.,   7.],
       [  2.,   8.,   6.,  nan],
       [ 13.,   9.,  nan,  nan]])

Then np.isclose(c, a) gives a 2D Boolean array showing whether each entry of column c[:, i] equals a[i]. Following the broadcasting rules, a is compared against every row of c, giving:

array([[ True,  True, False, False],
       [False, False, False, False],
       [False, False, False, False]], dtype=bool)

Which can be used to obtain your answer:

np.any(np.isclose(c, a), axis=0)
# array([ True,  True, False, False], dtype=bool)
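For reference, the steps above can be put together as one self-contained sketch. It assumes Python 3, where izip_longest is named zip_longest, and it pads with fillvalue=np.nan directly instead of converting None with astype(float). The function name elementwise_in is made up for illustration, and a and b are the sample data used elsewhere in this thread.

```python
from itertools import zip_longest

import numpy as np


def elementwise_in(a, b):
    """Return a boolean array: result[i] is True if a[i] occurs in b[i]."""
    # Transpose the ragged lists; short lists are padded with NaN.
    c = np.array(list(zip_longest(*b, fillvalue=np.nan)), dtype=float)
    # c has shape (longest_list, len(a)), so a broadcasts along each row.
    return np.any(np.isclose(c, a), axis=0)


a = np.array([1, 2, 3, 4])
b = [[1, 2, 13], [2, 8, 9], [5, 6], [7]]
result = elementwise_in(a, b)
print(result)
```

Note that np.isclose compares within a tolerance, so for very large integers two neighbouring values can count as equal. If exact matching matters, c == a is a drop-in alternative.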


Is there an upper limit on the length of the small lists in b? If so, you could store b as a matrix of, say, 1000x5, and use nan to fill the gaps for the sub-arrays that are too short. You can then use numpy.any to get the answer you want, something like this:

In [42]: a = np.array([1, 2, 3, 4])
    ...: b = np.array([[1, 2, 13], [2, 8, 9], [5, 6], [7]])

In [43]: bb = np.full((len(b), max(len(i) for i in b)), np.nan)

In [44]: for irow, row in enumerate(b):
    ...:     bb[irow, :len(row)] = row

In [45]: bb
Out[45]:
array([[  1.,   2.,  13.],
       [  2.,   8.,   9.],
       [  5.,   6.,  nan],
       [  7.,  nan,  nan]])

In [46]: a[:,np.newaxis] == bb
Out[46]:
array([[ True, False, False],
       [ True, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

In [47]: np.any(a[:,np.newaxis] == bb, axis=1)
Out[47]: array([ True,  True, False, False], dtype=bool)

No idea if this is faster for your data.
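The nan padding is safe here because nan compares unequal to every value, itself included, so a padded slot can never produce a false match. A minimal sketch of the same idea wrapped in a function, assuming Python 3 and a plain list of lists for b; pad_with_nan is a hypothetical helper name, not a NumPy function:

```python
import numpy as np


def pad_with_nan(rows):
    """Copy ragged rows into a 2D float array, filling gaps with NaN."""
    width = max(len(row) for row in rows)
    padded = np.full((len(rows), width), np.nan)
    for i, row in enumerate(rows):
        padded[i, :len(row)] = row
    return padded


a = np.array([1, 2, 3, 4])
b = [[1, 2, 13], [2, 8, 9], [5, 6], [7]]
bb = pad_with_nan(b)

# NaN != x for every x (including NaN), so padding never matches.
matches = np.any(a[:, np.newaxis] == bb, axis=1)
print(matches)
```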


Summary

The approach from Sauldo Castro is the fastest of those posted so far (about 0.084 seconds for 100 executions). The generator expression in the original post is second fastest (about 0.158 seconds).

Code to generate test data:

import numpy
import random

alength = 100
a = numpy.array([random.randint(1, 6) for i in range(alength)])
b = []
for i in range(alength):
    length = random.randint(1, 5)
    element = []
    for i in range(length):
        element.append(random.randint(1, 6))
    b.append(element)
b = numpy.array(b)
print a, b

The options:

from itertools import izip_longest

def magic_function1(a, b):  # From OP Martin Fixman
    return [x in y for x, y in zip(a, b)]

def magic_function2(a, b):  # What I thought might be better.
    bools = []
    for x, y in zip(a, b):
        found = False
        for j in y:
            if x == j:
                found = True
                break
        bools.append(found)
    return bools

def magic_function3(a, b):  # What I tried first
    bools = []
    for i in range(len(a)):
        found = False
        for j in range(len(b[i])):
            if a[i] == b[i][j]:
                found = True
                break
        bools.append(found)
    return bools

def magic_function4(a, b):  # From Bas Swinkels
    bb = numpy.full((len(b), max(len(i) for i in b)), numpy.nan)
    for irow, row in enumerate(b):
        bb[irow, :len(row)] = row
    a[:, numpy.newaxis] == bb  # redundant: the result is discarded
    return numpy.any(a[:, numpy.newaxis] == bb, axis=1)

def magic_function5(a, b):  # From Sauldo Castro, revised version
    c = numpy.array(list(izip_longest(*b))).astype(float)
    return numpy.any(numpy.isclose(c, a), axis=0)

Timing over n_executions runs:

import timeit

n_executions = 100
clock = timeit.Timer(stmt="magic_function1(a, b)",
                     setup="from __main__ import magic_function1, a, b")
print clock.timeit(n_executions), "seconds"
# Repeat with each candidate function

The results:

  • 0.158078225475 seconds for magic_function1
  • 0.181080926835 seconds for magic_function2
  • 0.259621047822 seconds for magic_function3
  • 0.287054750224 seconds for magic_function4
  • 0.0839162196207 seconds for magic_function5
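The timings are only comparable if every candidate returns the same answer, so it may be worth checking that before comparing speed. The following is a rough sketch of such a check, assuming Python 3 and a plain list of lists for b. The function names are made up for illustration and cover only three of the five candidates.

```python
import random
from itertools import zip_longest

import numpy as np


def with_generator(a, b):
    # Plain Python: the approach from the original post.
    return [x in y for x, y in zip(a, b)]


def with_padding(a, b):
    # Row-wise NaN padding, then an exact comparison.
    bb = np.full((len(b), max(len(row) for row in b)), np.nan)
    for i, row in enumerate(b):
        bb[i, :len(row)] = row
    return np.any(a[:, np.newaxis] == bb, axis=1)


def with_zip_longest(a, b):
    # Column-wise NaN padding via zip_longest, then np.isclose.
    c = np.array(list(zip_longest(*b, fillvalue=np.nan)), dtype=float)
    return np.any(np.isclose(c, a), axis=0)


random.seed(0)  # fixed seed so the check is repeatable
n = 100
a = np.array([random.randint(1, 6) for _ in range(n)])
b = [[random.randint(1, 6) for _ in range(random.randint(1, 5))]
     for _ in range(n)]

expected = with_generator(a, b)
agree = (with_padding(a, b).tolist() == expected
         and with_zip_longest(a, b).tolist() == expected)
print("all approaches agree:", agree)
```

In Python 3 each function can then be timed with, for example, timeit.timeit(lambda: with_padding(a, b), number=100). The absolute numbers will differ from the ones listed above.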