
Removing elements from an array that are in another array


There is an easy solution with a list comprehension:

A = [i for i in A if i not in B]

Result

[[1, 1, 2], [1, 1, 3]]

Note that the list comprehension does not remove elements from the original list; it builds a new list and rebinds the name A to it.

If you want to remove the elements from the original list in place, use this instead:

for i in B:
    if i in A:
        A.remove(i)
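
For reference, here is a minimal runnable sketch (with sample lists assumed from the result shown above, and a shortened B for brevity) contrasting the two: the comprehension rebinds the name, while the loop mutates the original object, which matters if other references to it exist:

A = [[1, 1, 1], [1, 1, 2], [1, 1, 3], [1, 1, 4]]
B = [[0, 0, 0], [1, 1, 1], [1, 1, 4]]

alias = A                         # second reference to the same list object
A = [i for i in A if i not in B]
print(A)                          # [[1, 1, 2], [1, 1, 3]]
print(alias)                      # unchanged: still has all four rows

A = alias                         # restore, then remove in place
for i in B:
    if i in A:
        A.remove(i)
print(alias)                      # [[1, 1, 2], [1, 1, 3]] -- same object, mutated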


Based on this solution to Find the row indexes of several values in a numpy array, here's a NumPy-based solution with a smaller memory footprint, which can be beneficial when working with large arrays -

dims = np.maximum(B.max(0), A.max(0)) + 1
out = A[~np.in1d(np.ravel_multi_index(A.T, dims), np.ravel_multi_index(B.T, dims))]

Sample run -

In [38]: A
Out[38]: 
array([[1, 1, 1],
       [1, 1, 2],
       [1, 1, 3],
       [1, 1, 4]])

In [39]: B
Out[39]: 
array([[0, 0, 0],
       [1, 0, 2],
       [1, 0, 3],
       [1, 0, 4],
       [1, 1, 0],
       [1, 1, 1],
       [1, 1, 4]])

In [40]: out
Out[40]: 
array([[1, 1, 2],
       [1, 1, 3]])
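
The idea is that np.ravel_multi_index treats each row as a multi-dimensional index into a virtual grid of shape dims and maps it to a unique scalar, so row membership reduces to 1D membership via np.in1d. A small sketch of the encoding, using the sample arrays above:

import numpy as np

A = np.array([[1, 1, 1], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.array([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4],
              [1, 1, 0], [1, 1, 1], [1, 1, 4]])

dims = np.maximum(B.max(0), A.max(0)) + 1   # [2, 2, 5]: per-column value ranges
keys_A = np.ravel_multi_index(A.T, dims)    # row (i, j, k) -> (i*2 + j)*5 + k
keys_B = np.ravel_multi_index(B.T, dims)
print(keys_A)                               # [16 17 18 19]
print(A[~np.in1d(keys_A, keys_B)])          # [[1 1 2]
                                            #  [1 1 3]]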

Runtime test on large arrays -

In [107]: def in1d_approach(A,B):
     ...:     dims = np.maximum(B.max(0),A.max(0))+1
     ...:     return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
     ...:                     np.ravel_multi_index(B.T,dims))]
     ...: 

In [108]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(1000,3))
     ...: A = np.random.randint(0,9,(100,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...: 

Timings with the broadcasting-based solutions -

In [109]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)] # @Kasramvd's soln
100 loops, best of 3: 4.64 ms per loop

In [110]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
100 loops, best of 3: 3.66 ms per loop

Timing with the lower-memory-footprint solution -

In [111]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 231 µs per loop

Further performance boost

in1d_approach reduces each row to a scalar by treating it as an indexing tuple. We can do the same a bit more efficiently with matrix multiplication via np.dot, like so -

def in1d_dot_approach(A,B):
    cumdims = (np.maximum(A.max(),B.max())+1)**np.arange(B.shape[1])
    return A[~np.in1d(A.dot(cumdims),B.dot(cumdims))]
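
Here cumdims holds the place values of a positional number system in base max+1 (with column 0 least significant, the reverse of ravel_multi_index's convention), so A.dot(cumdims) again encodes each row as a unique integer. A quick sanity check, assuming in1d_approach from above is in scope, that both encodings select the same rows:

import numpy as np

A = np.random.randint(0, 9, (100, 3))
B = np.random.randint(0, 9, (1000, 3))

# The scalar keys differ between the two encodings, but both are injective
# on rows, so the filtered outputs must be identical.
assert np.array_equal(in1d_approach(A, B), in1d_dot_approach(A, B))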

Let's test it against the previous one on much larger arrays -

In [251]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(10000,3))
     ...: A = np.random.randint(0,9,(1000,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...: 

In [252]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 1.28 ms per loop

In [253]: %timeit in1d_dot_approach(A, B)
1000 loops, best of 3: 1.2 ms per loop


Here is a Numpythonic approach with broadcasting:

In [83]: A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
Out[83]: 
array([[1, 1, 2],
       [1, 1, 3]])
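
To unpack the broadcasting: B[:, None] inserts an axis so the subtraction compares every row of B against every row of A at once. A commented sketch of the intermediates, using the sample arrays from above:

import numpy as np

A = np.array([[1, 1, 1], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.array([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4],
              [1, 1, 0], [1, 1, 1], [1, 1, 4]])

diff = A - B[:, None]               # (7, 4, 3): every B row minus every A row
row_differs = np.any(diff, axis=2)  # (7, 4): True where an A row != a B row
keep = np.all(row_differs, axis=0)  # (4,): A rows that differ from every B row
print(A[keep])                      # [[1 1 2]
                                    #  [1 1 3]]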

Here is a timeit comparison with another answer:

In [90]: def cal_diff(A, B):
   ....:     A_rows = A.view([('', A.dtype)] * A.shape[1])
   ....:     B_rows = B.view([('', B.dtype)] * B.shape[1])
   ....:     return np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])
   ....: 

In [93]: %timeit cal_diff(A, B)
10000 loops, best of 3: 54.1 µs per loop

In [94]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
100000 loops, best of 3: 9.41 µs per loop

# Even better with Divakar's suggestion
In [97]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
100000 loops, best of 3: 7.41 µs per loop

Well, if you are looking for a faster way, you should look for ways that reduce the number of comparisons. In this case (ignoring the order of elements within a row) you can generate a single number from each row and compare those numbers instead, which can be done by summing the squares of the items.
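
A sketch of the idea as a function (note that this hashes each row order-insensitively by design, and unrelated rows can also collide on the same sum of squares, e.g. (0, 0, 3) and (1, 2, 2), so it is only safe when the data rules such collisions out):

import numpy as np

def sum_of_squares_approach(A, B):
    # One number per row: the sum of its squared entries.
    # Permutations of a row hash identically; beware other collisions too.
    return A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]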

Here is the benchmark with Divakar's in1d approach:

In [144]: def in1d_approach(A,B):
   .....:         dims = np.maximum(B.max(0),A.max(0))+1
   .....:         return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
   .....:                          np.ravel_multi_index(B.T,dims))]
   .....: 

In [146]: %timeit in1d_approach(A, B)
10000 loops, best of 3: 23.8 µs per loop

In [145]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
10000 loops, best of 3: 20.2 µs per loop

You can use np.diff to get an order-independent result:

In [194]: B = np.array([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4],
                        [1, 1, 0], [1, 1, 1], [1, 1, 4], [4, 1, 1]])

In [195]: A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
Out[195]: 
array([[1, 1, 2],
       [1, 1, 3]])

In [196]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
10000 loops, best of 3: 30.7 µs per loop

Benchmark with Divakar's setup:

In [198]: B = np.random.randint(0,9,(1000,3))

In [199]: A = np.random.randint(0,9,(100,3))

In [200]: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)

In [201]: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)

In [202]: A[A_idx] = B[B_idx]

In [203]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
10000 loops, best of 3: 137 µs per loop

In [204]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
10000 loops, best of 3: 112 µs per loop

In [205]: %timeit in1d_approach(A, B)
10000 loops, best of 3: 115 µs per loop

Timing with larger arrays (Divakar's solution is slightly faster):

In [231]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
1000 loops, best of 3: 1.01 ms per loop

In [232]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
1000 loops, best of 3: 880 µs per loop

In [233]: %timeit in1d_approach(A, B)
1000 loops, best of 3: 807 µs per loop