
(Python) Find the indices of the values in two arrays that are equal to the values in two other arrays


The numpy_indexed package (disclaimer: I am its author) contains functionality to do this kind of thing efficiently and elegantly. Memory requirements are linear and computational requirements O(N log N) for this method. For the substantial arrays you are considering, the speed benefit relative to the currently accepted brute-force method could easily be orders of magnitude:

import numpy as np
import numpy_indexed as npi

A = np.asarray([400.5, 100,  700,   200,  15, 900])
B = np.asarray([500.5, 200,  500, 600.5,   8, 999])
X = np.asarray([400.5, 700,  100,   300,  15, 555, 900])
Y = np.asarray([500.5, 500, 600.5,  100,   8, 555, 999])

AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)

# casting the AB and XY arrays to npi.index first is not required, but a
# performance optimization; without this each call to npi.indices would
# have to re-index the arrays, which is the expensive part
AB = npi.as_index(AB)
XY = npi.as_index(XY)

# npi.indices(list, items) is a vectorized nd-equivalent of list.index(item)
indAB = npi.indices(AB, XY, missing='mask').compressed()
indXY = npi.indices(XY, AB, missing='mask').compressed()

Note that you can choose how to handle missing values as well. Also take a look at the set-operations, such as npi.intersection(XY, AB); they might provide a simpler route to what you aim to achieve at a higher level.
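For example, a minimal sketch of that set-based route (this assumes the raw stacked AB and XY arrays from before the as_index cast, and that npi.intersection returns the rows present in both inputs):

import numpy as np
import numpy_indexed as npi

A = np.asarray([400.5, 100,  700,   200,  15, 900])
B = np.asarray([500.5, 200,  500, 600.5,   8, 999])
X = np.asarray([400.5, 700,  100,   300,  15, 555, 900])
Y = np.asarray([500.5, 500, 600.5,  100,   8, 555, 999])

AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)

# sketch: get the (a, b) rows that occur in both stacked arrays,
# then look up where each common row sits in AB and in XY
common = npi.intersection(AB, XY)
indAB = npi.indices(AB, common)
indXY = npi.indices(XY, common)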


Try this:

import numpy as np

A = np.asarray([400.5, 100,  700,   200,  15, 900])
B = np.asarray([500.5, 200,  500, 600.5,   8, 999])
X = np.asarray([400.5, 700,  100,   300,  15, 555, 900])
Y = np.asarray([500.5, 500, 600.5,  100,   8, 555, 999])

AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)

eq = AB[:, np.newaxis, :] == XY[np.newaxis, :, :]
eq = np.logical_and.reduce(eq, axis=-1)
indAB, = np.where(np.logical_or.reduce(eq, axis=1))
indXY, = np.where(np.logical_or.reduce(eq, axis=0))

print("indAB", indAB)
print("indXY", indXY)

Output:

indAB [0 2 4 5]
indXY [0 1 4 6]

Explanation

AB and XY are just the arrays A and B, and X and Y respectively, "stacked" into two-dimensional arrays. eq holds the all-against-all comparison of the elements in AB and XY; np.newaxis is used to add dimensions to AB and XY (note that AB gets a new dimension in position 1 and XY in position 0). The equality operator == broadcasts the arrays through their new dimensions. The first np.logical_and.reduce ensures that both "components" are equal (A to X and B to Y), and the np.logical_or.reduce operations check whether there is any full equality from AB to XY and from XY to AB. Finally, np.where gets the indices.
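To make the broadcasting concrete, the shapes for the example data work out as follows (a small illustrative check):

import numpy as np

A = np.asarray([400.5, 100,  700,   200,  15, 900])
B = np.asarray([500.5, 200,  500, 600.5,   8, 999])
X = np.asarray([400.5, 700,  100,   300,  15, 555, 900])
Y = np.asarray([500.5, 500, 600.5,  100,   8, 555, 999])

AB = np.stack([A, B], axis=-1)   # shape (6, 2)
XY = np.stack([X, Y], axis=-1)   # shape (7, 2)

# (6, 1, 2) == (1, 7, 2) broadcasts to (6, 7, 2)
eq = AB[:, np.newaxis, :] == XY[np.newaxis, :, :]
print(eq.shape)                                   # (6, 7, 2)
print(np.logical_and.reduce(eq, axis=-1).shape)   # (6, 7)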

As a downside, note that this requires a boolean array of size len(A) x len(X) x 2, so if the original arrays are very large you might run into memory problems.

Update

As indicated, very large arrays could be an issue. If you want to make all the comparisons "in one go" there is not really a way around it (the size of the intermediate array is simply the number of comparisons). However, you can also run the algorithm "by pieces", for example something like this:

import numpy as np

MAX_SIZE = 2  # Biggest array will be MAX_SIZE x MAX_SIZE x 2

A = np.asarray([400.5, 100,  700,   200,  15, 900])
B = np.asarray([500.5, 200,  500, 600.5,   8, 999])
X = np.asarray([400.5, 700,  100,   300,  15, 555, 900])
Y = np.asarray([500.5, 500, 600.5,  100,   8, 555, 999])

AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)

maskAB = np.full(len(AB), False, dtype=bool)
maskXY = np.full(len(XY), False, dtype=bool)

for iAB in range(0, len(AB), MAX_SIZE):
    pAB = np.expand_dims(AB[iAB:iAB + MAX_SIZE], axis=1)
    for iXY in range(0, len(XY), MAX_SIZE):
        pXY = np.expand_dims(XY[iXY:iXY + MAX_SIZE], axis=0)
        eq = pAB == pXY
        eq = np.logical_and.reduce(eq, axis=-1)
        maskAB[iAB:iAB + MAX_SIZE] |= np.logical_or.reduce(eq, axis=1)
        maskXY[iXY:iXY + MAX_SIZE] |= np.logical_or.reduce(eq, axis=0)

indAB, = np.where(maskAB)
indXY, = np.where(maskXY)

print("indAB", indAB)
print("indXY", indXY)

And the output is still:

indAB [0 2 4 5]
indXY [0 1 4 6]

I'm using a MAX_SIZE of 2 just to show that it works in the example, but in practice you can choose it depending on the maximum amount of memory you are willing to use (e.g. for MAX_SIZE = 10000 it should be on the order of hundreds of megabytes). MAX_SIZE does not need to be smaller than the size of the arrays, nor does it have to be a divisor of their size.
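As a rough check of that memory figure (one byte per boolean, times the trailing axis of length 2; the boolean intermediate dominates):

MAX_SIZE = 10000
# size of the MAX_SIZE x MAX_SIZE x 2 boolean intermediate
bytes_needed = MAX_SIZE * MAX_SIZE * 2
print(bytes_needed / 1024 ** 2, "MiB")  # ~190.7 MiB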


Here's an alternative method. I dare say it's relatively clear; it should be efficient thanks to the use of sets, and it only requires O(len(A) + len(X)) memory.

numpy isn't even needed, but can be used for the arrays.

from collections import defaultdict

A = [400.5, 100, 700, 200, 15, 900]
B = [500.5, 200, 500, 600.5, 8, 999]
X = [400.5, 700, 100, 300, 15, 555, 900]
Y = [500.5, 500, 600.5, 100, 8, 555, 999]

def get_indices(values):
    d = defaultdict(set)
    for i, value in enumerate(values):
        d[value].add(i)
    return d

iA, iB, iX, iY = [get_indices(values) for values in [A, B, X, Y]]

print(iA)
# {400.5: {0}, 100: {1}, 200: {3}, 900: {5}, 700: {2}, 15: {4}}
print(iX)
# {400.5: {0}, 100: {2}, 300: {3}, 900: {6}, 555: {5}, 700: {1}, 15: {4}}

for i, (a, b) in enumerate(zip(A, B)):
    common_indices = iX[a] & iY[b]
    if common_indices:
        print("A B : %d" % i)
        print("X Y : %d" % common_indices.pop())
        print()
#   A B : 0
#   X Y : 0
#   A B : 2
#   X Y : 1
#   A B : 4
#   X Y : 4
#   A B : 5
#   X Y : 6
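If you want the same indAB / indXY index arrays as in the other answers, the loop above can be adapted along these lines (a sketch; min() is just a deterministic stand-in for the original pop()):

from collections import defaultdict

A = [400.5, 100, 700, 200, 15, 900]
B = [500.5, 200, 500, 600.5, 8, 999]
X = [400.5, 700, 100, 300, 15, 555, 900]
Y = [500.5, 500, 600.5, 100, 8, 555, 999]

def get_indices(values):
    # map each value to the set of positions at which it occurs
    d = defaultdict(set)
    for i, value in enumerate(values):
        d[value].add(i)
    return d

iX, iY = get_indices(X), get_indices(Y)

indAB, indXY = [], []
for i, (a, b) in enumerate(zip(A, B)):
    common_indices = iX[a] & iY[b]   # positions where both components match
    if common_indices:
        indAB.append(i)
        indXY.append(min(common_indices))  # deterministic choice among matches

print("indAB", indAB)  # indAB [0, 2, 4, 5]
print("indXY", indXY)  # indXY [0, 1, 4, 6]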