Fast (vectorized) way to find points in one DF belonging to equally sized rectangles (given by two points) from the second DF
You can do something like this:
Assuming we have the following DFs:
In [150]: AOut[150]: type latw late lngs lngn0 1000 45.457966 45.458030 9.174864 9.1749071 1001 45.457966 45.458030 9.174864 9.1749072 1002 45.458030 45.458094 9.174864 9.1749073 1003 45.458094 45.458157 9.174864 9.17490716 1004 45.458923 45.458987 9.174864 9.17490717 1005 45.458987 45.459051 9.174864 9.174907970 1006 45.460583 45.460647 9.175545 9.175587971 1007 45.460647 45.460711 9.175545 9.175587972 1008 45.460711 45.460775 9.175545 9.175587996 1009 45.458540 45.458604 9.175587 9.175630997 1010 45.458604 45.458668 9.175587 9.175630998 1011 45.458668 45.458732 9.175587 9.175630999 1012 45.458732 45.458796 9.175587 9.175630In [151]: BOut[151]: type Lat Lng0 0 45.4581 9.17491 1 45.4590 9.17492 2 45.4586 9.1756
Solution:
B['new'] = B.apply(lambda x: A.loc[ (A.latw < x.Lat) & (x.Lat < A.late) & (A.lngs < x.Lng) & (x.Lng < A.lngn), 'type'].head(1), axis=1) \ .values.diagonal()In [153]: BOut[153]: type Lat Lng new0 0 45.4581 9.1749 1003.01 1 45.4590 9.1749 1005.02 2 45.4586 9.1756 1009.0
PS I'm not sure this is the fastest way to achieve that...
High Level Explanation
This answer is quite involved. The following are high-level points
- Use
np.searchsorted
with the west latitudes inA
relative to the sorted latitudes inB
. Repeat this for east latitudes but with a parameter ofside='right'
. This tells me where each west/east latitudes fall in the spectrum defined inB
.np.searchsorted
are index values. So if the east based results are greater than the west based results, that implies that there was a latitude fromB
between west and east.- Track where east searches are greater than west searches.
- Find the difference. Not only do I care if the east search is greater than the west search, the difference tells me how many
B.Lat
's are in between. I expand arrays to track all possible matches.
- Repeat this process for south and north longitudes but only over the subset in which we found matching latitudes. No need to waste effort looking where we know we don't have a match.
- Perform tricky slicing to unwind positions
- Use
numpy
structured arrays to find intersections of 2-D arrays- I used this answer from @JoeKington please up vote his answer because this is awesome!
- This final intersection has ordinal positions from
A
and matching ordinal positions fromB
.- remember I said we could possible match more than one row from
B
, I take the first.
- remember I said we could possible match more than one row from
Demonstration
use A
and B
provided by @MaxU
p = process(A, B)print(p)[[3 0] [5 1] [9 2]]
To show the results and compare. I'll put them side by side
pd.concat([ A.iloc[p[:, 0]].reset_index(drop=True), B.iloc[p[:, 1]].reset_index(drop=True) ], axis=1, keys=['A', 'B'])
However, to swap the type
values, I put a flag in the process
function to reassign. Just run process(A, B, reassign_type=True)
timing
functions
I defined the functions to return the same thing and make it an apples to apples comparison
def pir(A, B): return A.values[process(A, B)[:, 0], 0]def maxu(A, B): return B.apply( lambda x: A.loc[ (A.latw < x.Lat) & (x.Lat < A.late) & (A.lngs < x.Lng) & (x.Lng < A.lngn), 'type' ].head(1), axis=1 ).values.diagonal()
more thorough testing
from timeit import timeitfrom itertools import productrng = np.arange(20, 51, 5)idx = pd.MultiIndex.from_arrays([rng, rng], names=['B', 'A'])functions = pd.Index(['pir', 'maxu'], name='method')results = pd.DataFrame(index=idx, columns=functions)for bi, ai in idx: n = ai ws = np.random.rand(n, 2) * 10 + np.array([[40, 30]]) A = pd.DataFrame( np.hstack([-np.ones((n, 1)), ws, ws + np.random.rand(n, 2) * 2]), columns=['type', 'latw', 'lngs', 'late', 'lngn'] ) m = int(bi ** .5) prng = np.arange(m) / m * 10 + .5 B = pd.DataFrame( np.array(list(product(prng, prng))) + np.array([[40, 30]]), columns=['Lat', "Lng"]) B.insert(0, 'type', np.random.randint(3, size=len(B))) for j in functions: results.set_value( (bi, ai), j, timeit( '{}(A, B)'.format(j), 'from __main__ import {}, A, B'.format(j), number=10 ) )results.plot(rot=45)
Code
from collections import namedtupleTwoSided = namedtuple('TwoSided', ['match', 'idx', 'found'])def two_sided_search(left, right, a): # sort `a` and save order for later argsort = a.argsort() asorted = a[argsort] # where does `left` fall in `asorted` s_left = asorted.searchsorted(left) # where does `right` fall in `asorted` s_right = asorted.searchsorted(right, side='right') rng = np.arange(left.size) # a match happens when where we found `left`, `right` was found later match_idx = rng[s_left < s_right] # ignore positions with no match left_pos = s_left[match_idx] right_pos = s_right[match_idx] # distance from where left was found to where right was found # this represents how many items in a are wrapped by left and right diff_pos = right_pos - left_pos # build an index on `match_idx` paired with all positions in # `asorted`. d2 = left_pos - np.append(0, diff_pos[:-1].cumsum()) p1 = np.arange(len(left_pos)).repeat(diff_pos) p2 = np.arange(diff_pos.sum()) + d2.repeat(diff_pos) # returns # `match_idx` which is an integer slice of either `left` or `right` # for each element in `match_idx` there are one or more elements in # `a` that were surrounded by `left` and `right` # `p1` are the positions in `match_idx` that are paired with `argsort[p2]` # for example, consider `p1 = [1, 1, 2, 2]` and `argsort[p2] = [3, 5, 12, 5]` # this means that `left[match_idx[1]] < a[3] < right[match_idx[1]]` # and `left[match_idx[1]] < a[5] < right[match_idx[1]]` # and `left[match_idx[2]] < a[12] < right[match_idx[2]]` # and `left[match_idx[2]] < a[5] < right[match_idx[2]]` return TwoSided(match_idx, p1, argsort[p2])def intersectNd(a, b): # taken from https://stackoverflow.com/a/8317403/2336654 # returns the interesection of 2 2-D arrays # the strategy is to convert to a structured array then # use np.intersect1d then convert back to unstructured ncols = a.shape[1] dtype = dict( names=['f{}'.format(i) for i in range(ncols)], formats=ncols * [a.dtype] ) return np.intersect1d( a.view(dtype), b.view(dtype) ).view(a.dtype).reshape(-1, ncols)def where_in(b1, b2): # return a mask of lenght len(b2) where True # when element of b2 is in b1 bsort = b1.argsort() bsrtd = b1[bsort] bsrch = bsrtd.searchsorted(b2) % bsrtd.size return bsrtd[bsrch] == b2def process(A, B, reassign_type=False): lats = two_sided_search( A.latw.values, A.late.values, B.Lat.values, ) # subset A & B # no point looking for longitude matches # in rows without a latitude match lngs = A.lngs.values[lats.match] lngn = A.lngn.values[lats.match] u, i = np.unique(lats.found, return_index=1) lng = B.Lng.values[u] lons = two_sided_search( lngs, lngn, lng ) # `lats.match` is where we found successful latitudes # `lons.match` is where we found successful longitudes # since `lons.match` was based on `lats.match` we can slice # to get the original positions of `A` that satisfy both # the latw < blat < late & lngs < blng < lngn match = lats.match[lons.match] # however, for every row in A that might be a match for B, # there may be multilple B's # # We start addressing this by finding # where in lats.idx do we have matches # this gives me a boolean array of lats.match # index values repeated for each row in B # that satisfies our conditions wi = where_in(lons.match, lats.idx) # a (? x 2) array representing all combinations of # rows in A that matched the lon_pos = np.hstack([ lats.match[lons.match[lons.idx], None], u[lons.found, None] ]) lat_pos = np.hstack([ lats.match[lats.idx[wi], None], lats.found[wi, None] ]) x = intersectNd(lat_pos, lon_pos) p, pi = np.unique(x[:, 0], return_index=1) if reassign_type: A.iloc[x[pi, 0], A.columns.get_loc('type')] = \ B.iloc[x[pi, 1], B.columns.get_loc('type')].values else: return x[pi]
I ran the solution suggested by MaxU on datasize of 1000 and the timeit shows 1.2 s as best of 3. I achieved some improvement in time by creating a list 'c' which holds the boolean results of comparison and the using this list to assign as follows:
c=list((A.latw < B.Lat) & (A.late > B.Lat ) & (A.lngs < B.Lng) & (A.lngn>B.Lng))B.apply(A.loc[ c , 'type'].head(1),axis=1).values.diagonal()
The first part runs in the order of micro seconds and the second part takes miliseconds.
Can you please try it on your large data set and see if it really improves the performance.