Fast (vectorized) way to find points in one DF belonging to equally sized rectangles (given by two points) from the second DF

python pandas numpy dataframe parallel-processing

You can do something like this:

Assuming we have the following DFs:

In [150]: AOut[150]:     type       latw       late      lngs      lngn0    1000  45.457966  45.458030  9.174864  9.1749071    1001  45.457966  45.458030  9.174864  9.1749072    1002  45.458030  45.458094  9.174864  9.1749073    1003  45.458094  45.458157  9.174864  9.17490716   1004  45.458923  45.458987  9.174864  9.17490717   1005  45.458987  45.459051  9.174864  9.174907970  1006  45.460583  45.460647  9.175545  9.175587971  1007  45.460647  45.460711  9.175545  9.175587972  1008  45.460711  45.460775  9.175545  9.175587996  1009  45.458540  45.458604  9.175587  9.175630997  1010  45.458604  45.458668  9.175587  9.175630998  1011  45.458668  45.458732  9.175587  9.175630999  1012  45.458732  45.458796  9.175587  9.175630In [151]: BOut[151]:   type      Lat     Lng0     0  45.4581  9.17491     1  45.4590  9.17492     2  45.4586  9.1756

Solution:

B['new'] = B.apply(lambda x: A.loc[  (A.latw < x.Lat) & (x.Lat < A.late)                                   & (A.lngs < x.Lng) & (x.Lng < A.lngn), 'type'].head(1),                   axis=1) \            .values.diagonal()In [153]: BOut[153]:   type      Lat     Lng     new0     0  45.4581  9.1749  1003.01     1  45.4590  9.1749  1005.02     2  45.4586  9.1756  1009.0

PS I'm not sure this is the fastest way to achieve that...

python pandas numpy dataframe parallel-processing

High Level Explanation

This answer is quite involved. The following are high-level points

Use np.searchsorted with the west latitudes in A relative to the sorted latitudes in B. Repeat this for east latitudes but with a parameter of side='right'. This tells me where each west/east latitudes fall in the spectrum defined in B. np.searchsorted are index values. So if the east based results are greater than the west based results, that implies that there was a latitude from B between west and east.
- Track where east searches are greater than west searches.
- Find the difference. Not only do I care if the east search is greater than the west search, the difference tells me how many B.Lat's are in between. I expand arrays to track all possible matches.
Repeat this process for south and north longitudes but only over the subset in which we found matching latitudes. No need to waste effort looking where we know we don't have a match.
- Perform tricky slicing to unwind positions
Use numpy structured arrays to find intersections of 2-D arrays
- I used this answer from @JoeKington please up vote his answer because this is awesome!
This final intersection has ordinal positions from A and matching ordinal positions from B.
- remember I said we could possible match more than one row from B, I take the first.

Demonstration

use A and B provided by @MaxU

p = process(A, B)print(p)[[3 0] [5 1] [9 2]]

To show the results and compare. I'll put them side by side

pd.concat([        A.iloc[p[:, 0]].reset_index(drop=True),        B.iloc[p[:, 1]].reset_index(drop=True)    ], axis=1, keys=['A', 'B'])

However, to swap the type values, I put a flag in the process function to reassign. Just run process(A, B, reassign_type=True)

timing
functions
I defined the functions to return the same thing and make it an apples to apples comparison

def pir(A, B):    return A.values[process(A, B)[:, 0], 0]def maxu(A, B):    return B.apply(        lambda x: A.loc[            (A.latw < x.Lat) &            (x.Lat < A.late) &            (A.lngs < x.Lng) &            (x.Lng < A.lngn), 'type'        ].head(1),        axis=1    ).values.diagonal()

more thorough testing

from timeit import timeitfrom itertools import productrng = np.arange(20, 51, 5)idx = pd.MultiIndex.from_arrays([rng, rng], names=['B', 'A'])functions = pd.Index(['pir', 'maxu'], name='method')results = pd.DataFrame(index=idx, columns=functions)for bi, ai in idx:    n = ai    ws = np.random.rand(n, 2) * 10 + np.array([[40, 30]])    A = pd.DataFrame(        np.hstack([-np.ones((n, 1)), ws, ws + np.random.rand(n, 2) * 2]),        columns=['type', 'latw', 'lngs', 'late', 'lngn']    )    m = int(bi ** .5)    prng = np.arange(m) / m * 10 + .5    B = pd.DataFrame(        np.array(list(product(prng, prng))) + np.array([[40, 30]]),        columns=['Lat', "Lng"])    B.insert(0, 'type', np.random.randint(3, size=len(B)))    for j in functions:        results.set_value(            (bi, ai), j,            timeit(                '{}(A, B)'.format(j),                'from __main__ import {}, A, B'.format(j),                number=10            )        )results.plot(rot=45)

Code

from collections import namedtupleTwoSided = namedtuple('TwoSided', ['match', 'idx', 'found'])def two_sided_search(left, right, a):    # sort `a` and save order for later    argsort = a.argsort()    asorted = a[argsort]    # where does `left` fall in `asorted`    s_left = asorted.searchsorted(left)    # where does `right` fall in `asorted`    s_right = asorted.searchsorted(right, side='right')    rng = np.arange(left.size)    # a match happens when where we found `left`, `right` was found later    match_idx = rng[s_left < s_right]    # ignore positions with no match    left_pos = s_left[match_idx]    right_pos = s_right[match_idx]    # distance from where left was found to where right was found    # this represents how many items in a are wrapped by left and right    diff_pos = right_pos - left_pos    # build an index on `match_idx` paired with all positions in    # `asorted`.    d2 = left_pos - np.append(0, diff_pos[:-1].cumsum())    p1 = np.arange(len(left_pos)).repeat(diff_pos)    p2 = np.arange(diff_pos.sum()) + d2.repeat(diff_pos)    # returns    # `match_idx` which is an integer slice of either `left` or `right`    # for each element in `match_idx` there are one or more elements in    # `a` that were surrounded by `left` and `right`    # `p1` are the positions in `match_idx` that are paired with `argsort[p2]`    # for example, consider `p1 = [1, 1, 2, 2]` and `argsort[p2] = [3, 5, 12, 5]`    # this means that `left[match_idx[1]] < a[3] < right[match_idx[1]]`    # and `left[match_idx[1]] < a[5] < right[match_idx[1]]`    # and `left[match_idx[2]] < a[12] < right[match_idx[2]]`    # and `left[match_idx[2]] < a[5] < right[match_idx[2]]`    return TwoSided(match_idx, p1, argsort[p2])def intersectNd(a, b):    # taken from https://stackoverflow.com/a/8317403/2336654    # returns the interesection of 2 2-D arrays    # the strategy is to convert to a structured array then    # use np.intersect1d then convert back to unstructured    ncols = a.shape[1]    dtype = dict(        names=['f{}'.format(i) for i in range(ncols)],        formats=ncols * [a.dtype]    )    return np.intersect1d(        a.view(dtype), b.view(dtype)    ).view(a.dtype).reshape(-1, ncols)def where_in(b1, b2):    # return a mask of lenght len(b2) where True    # when element of b2 is in b1    bsort = b1.argsort()    bsrtd = b1[bsort]    bsrch = bsrtd.searchsorted(b2) % bsrtd.size    return bsrtd[bsrch] == b2def process(A, B, reassign_type=False):    lats = two_sided_search(        A.latw.values,        A.late.values,        B.Lat.values,    )    # subset A & B    # no point looking for longitude matches    # in rows without a latitude match    lngs = A.lngs.values[lats.match]    lngn = A.lngn.values[lats.match]    u, i = np.unique(lats.found, return_index=1)    lng = B.Lng.values[u]    lons = two_sided_search(        lngs, lngn, lng    )    # `lats.match` is where we found successful latitudes    # `lons.match` is where we found successful longitudes    # since `lons.match` was based on `lats.match` we can slice    # to get the original positions of `A` that satisfy both    # the latw < blat < late & lngs < blng < lngn    match = lats.match[lons.match]    # however, for every row in A that might be a match for B,    # there may be multilple B's    #    # We start addressing this by finding    # where in lats.idx do we have matches    # this gives me a boolean array of lats.match    # index values repeated for each row in B    # that satisfies our conditions    wi = where_in(lons.match, lats.idx)    # a (? x 2) array representing all combinations of    # rows in A that matched the     lon_pos = np.hstack([            lats.match[lons.match[lons.idx], None],            u[lons.found, None]        ])    lat_pos = np.hstack([            lats.match[lats.idx[wi], None],            lats.found[wi, None]        ])    x = intersectNd(lat_pos, lon_pos)    p, pi = np.unique(x[:, 0], return_index=1)    if reassign_type:        A.iloc[x[pi, 0], A.columns.get_loc('type')] = \            B.iloc[x[pi, 1], B.columns.get_loc('type')].values    else:        return x[pi]

python pandas numpy dataframe parallel-processing

I ran the solution suggested by MaxU on datasize of 1000 and the timeit shows 1.2 s as best of 3. I achieved some improvement in time by creating a list 'c' which holds the boolean results of comparison and the using this list to assign as follows:

c=list((A.latw < B.Lat) & (A.late > B.Lat  ) & (A.lngs < B.Lng) & (A.lngn>B.Lng))B.apply(A.loc[ c , 'type'].head(1),axis=1).values.diagonal()

The first part runs in the order of micro seconds and the second part takes miliseconds.

Can you please try it on your large data set and see if it really improves the performance.

CodeHunter

Fast (vectorized) way to find points in one DF belonging to equally sized rectangles (given by two points) from the second DF

High Level Explanation

Demonstration

Code

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last