(Python) Find the indices of the values in two arrays that are equal to the values in two other arrays
The numpy_indexed package (disclaimer: I am its author) contains functionality to do this kind of thing efficiently and elegantly. Memory requirements are linear, and computational requirements are O(N log N) for this method. For the substantial arrays you are considering, the speed benefit relative to the currently accepted brute-force method could easily be orders of magnitude:
import numpy as np
import numpy_indexed as npi

A = np.asarray([400.5, 100, 700, 200, 15, 900])
B = np.asarray([500.5, 200, 500, 600.5, 8, 999])
X = np.asarray([400.5, 700, 100, 300, 15, 555, 900])
Y = np.asarray([500.5, 500, 600.5, 100, 8, 555, 999])

AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)

# Casting the AB and XY arrays to npi.index first is not required, but is a
# performance optimization; without this, each call to npi.indices would have
# to re-index the arrays, which is the expensive part.
AB = npi.as_index(AB)
XY = npi.as_index(XY)

# npi.indices(list, items) is a vectorized nd-equivalent of list.index(item)
indAB = npi.indices(AB, XY, missing='mask').compressed()
indXY = npi.indices(XY, AB, missing='mask').compressed()
Note that you can choose how to handle missing values as well. Also take a look at the set operations, such as npi.intersection(XY, AB); they might provide a simpler route to what you aim to achieve at a higher level.
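If you would rather stay with plain NumPy, a similar set-style lookup can be sketched by packing each (a, b) pair into a single complex number so that whole pairs can be compared with np.isin. This is my own illustration, not part of the numpy_indexed API, and it assumes the inputs are real floats:

```python
import numpy as np

A = np.asarray([400.5, 100, 700, 200, 15, 900])
B = np.asarray([500.5, 200, 500, 600.5, 8, 999])
X = np.asarray([400.5, 700, 100, 300, 15, 555, 900])
Y = np.asarray([500.5, 500, 600.5, 100, 8, 555, 999])

# Pack each (a, b) pair into one complex value so a pair compares as a
# single element; assumes the inputs are real-valued.
AB_c = A + 1j * B
XY_c = X + 1j * Y

indAB, = np.nonzero(np.isin(AB_c, XY_c))
indXY, = np.nonzero(np.isin(XY_c, AB_c))
print(indAB)  # [0 2 4 5]
print(indXY)  # [0 1 4 6]
```

Like the sorting-based approach, this avoids the quadratic intermediate array, at the cost of a trick that only works for two real-valued components.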
Try this:
import numpy as np

A = np.asarray([400.5, 100, 700, 200, 15, 900])
B = np.asarray([500.5, 200, 500, 600.5, 8, 999])
X = np.asarray([400.5, 700, 100, 300, 15, 555, 900])
Y = np.asarray([500.5, 500, 600.5, 100, 8, 555, 999])

AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)

eq = AB[:, np.newaxis, :] == XY[np.newaxis, :, :]
eq = np.logical_and.reduce(eq, axis=-1)
indAB, = np.where(np.logical_or.reduce(eq, axis=1))
indXY, = np.where(np.logical_or.reduce(eq, axis=0))
print("indAB", indAB)
print("indXY", indXY)
Output:
indAB [0 2 4 5]
indXY [0 1 4 6]
Explanation
AB and XY are just the arrays A and B, and X and Y, respectively, "stacked" into two-dimensional arrays. eq holds the all-against-all comparison of the elements in AB and XY; np.newaxis is used to add dimensions to AB and XY (note that AB gets a new dimension in position 1 and XY in position 0). The equality operator == broadcasts the arrays through their new dimensions. The first np.logical_and.reduce ensures that both "components" are equal (A to X and B to Y), and the np.logical_or.reduce operations check whether there are any full equalities from AB to XY and from XY to AB. Finally, np.where gets the indices.

As a downside, note that this requires a boolean array of size len(A) x len(X) x 2, so if the original arrays are very large you might run into memory problems.
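To make the broadcasting concrete, here is a reduced sketch with only a few pairs (the values are taken from the example arrays) showing the shapes at each step:

```python
import numpy as np

AB = np.stack([np.asarray([400.5, 100]), np.asarray([500.5, 200])], axis=-1)        # shape (2, 2)
XY = np.stack([np.asarray([400.5, 700, 100]), np.asarray([500.5, 500, 600.5])], axis=-1)  # shape (3, 2)

# (2, 1, 2) == (1, 3, 2) broadcasts to (2, 3, 2): one bool per
# (AB row, XY row, component) combination.
eq = AB[:, np.newaxis, :] == XY[np.newaxis, :, :]
print(eq.shape)  # (2, 3, 2)

# Reducing over the last axis leaves (2, 3): True where a full
# (a, b) pair equals a full (x, y) pair.
both = np.logical_and.reduce(eq, axis=-1)
print(both.shape)  # (2, 3)
```

The (2, 3) array is exactly the all-against-all pair comparison; the or-reductions over its two axes then say, per row of AB and per row of XY, whether any match exists.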
Update
As indicated, very large arrays could be an issue. If you want to make all the comparisons "in one go" there is not really a way around it (the size of the intermediate array is simply the number of comparisons). However, you can also run the algorithm "by pieces", for example something like this:
import numpy as np

MAX_SIZE = 2  # Biggest array will be MAX_SIZE x MAX_SIZE x 2

A = np.asarray([400.5, 100, 700, 200, 15, 900])
B = np.asarray([500.5, 200, 500, 600.5, 8, 999])
X = np.asarray([400.5, 700, 100, 300, 15, 555, 900])
Y = np.asarray([500.5, 500, 600.5, 100, 8, 555, 999])

AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)

maskAB = np.full(len(AB), False, dtype=bool)
maskXY = np.full(len(XY), False, dtype=bool)
for iAB in range(0, len(AB), MAX_SIZE):
    pAB = np.expand_dims(AB[iAB:iAB + MAX_SIZE], axis=1)
    for iXY in range(0, len(XY), MAX_SIZE):
        pXY = np.expand_dims(XY[iXY:iXY + MAX_SIZE], axis=0)
        eq = pAB == pXY
        eq = np.logical_and.reduce(eq, axis=-1)
        maskAB[iAB:iAB + MAX_SIZE] |= np.logical_or.reduce(eq, axis=1)
        maskXY[iXY:iXY + MAX_SIZE] |= np.logical_or.reduce(eq, axis=0)
indAB, = np.where(maskAB)
indXY, = np.where(maskXY)
print("indAB", indAB)
print("indXY", indXY)
And the output is still:
indAB [0 2 4 5]
indXY [0 1 4 6]
I'm using a MAX_SIZE of 2 just to show that it works in the example, but in practice you can choose it depending on the maximum amount of memory you are willing to use (e.g. for MAX_SIZE = 10000 it should be on the order of hundreds of megabytes). MAX_SIZE does not need to be smaller than the size of the arrays, nor does it have to be a divisor of their size.
Here's an alternative method. I dare say it's relatively clear, it should be efficient thanks to the use of sets, and it only requires O(len(A) + len(X)) memory. numpy isn't even needed, but can be used for the arrays.
from collections import defaultdict

A = [400.5, 100, 700, 200, 15, 900]
B = [500.5, 200, 500, 600.5, 8, 999]
X = [400.5, 700, 100, 300, 15, 555, 900]
Y = [500.5, 500, 600.5, 100, 8, 555, 999]

def get_indices(values):
    d = defaultdict(set)
    for i, value in enumerate(values):
        d[value].add(i)
    return d

iA, iB, iX, iY = [get_indices(values) for values in [A, B, X, Y]]

print(iA)
# {400.5: {0}, 100: {1}, 200: {3}, 900: {5}, 700: {2}, 15: {4}}
print(iX)
# {400.5: {0}, 100: {2}, 300: {3}, 900: {6}, 555: {5}, 700: {1}, 15: {4}}

for i, (a, b) in enumerate(zip(A, B)):
    common_indices = iX[a] & iY[b]
    if common_indices:
        print("A B : %d" % i)
        print("X Y : %d" % common_indices.pop())
        print()
# A B : 0
# X Y : 0
# A B : 2
# X Y : 1
# A B : 4
# X Y : 4
# A B : 5
# X Y : 6
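If you want the same indAB/indXY index arrays that the NumPy versions produce, the dictionary lookups above can be collected into lists instead of printed; this is a small extension sketch of the same technique, not part of the original answer:

```python
from collections import defaultdict

A = [400.5, 100, 700, 200, 15, 900]
B = [500.5, 200, 500, 600.5, 8, 999]
X = [400.5, 700, 100, 300, 15, 555, 900]
Y = [500.5, 500, 600.5, 100, 8, 555, 999]

def get_indices(values):
    # Map each value to the set of positions where it occurs.
    d = defaultdict(set)
    for i, value in enumerate(values):
        d[value].add(i)
    return d

iX, iY = get_indices(X), get_indices(Y)

indAB, indXY = [], []
for i, (a, b) in enumerate(zip(A, B)):
    # Positions where both the X component and the Y component match.
    common = iX[a] & iY[b]
    if common:
        indAB.append(i)
        indXY.extend(sorted(common))

print(indAB)          # [0, 2, 4, 5]
print(sorted(indXY))  # [0, 1, 4, 6]
```

This keeps the O(len(A) + len(X)) memory profile and also handles duplicate pairs on the X/Y side, since every matching index in the intersection is collected rather than just one popped element.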