Best way to join / merge by range in pandas Best way to join / merge by range in pandas pandas pandas

Best way to join / merge by range in pandas


Setup
Consider the dataframes A and B

A = pd.DataFrame(dict(        A_id=range(10),        A_value=range(5, 105, 10)    ))B = pd.DataFrame(dict(        B_id=range(5),        B_low=[0, 30, 30, 46, 84],        B_high=[10, 40, 50, 54, 84]    ))A   A_id  A_value0     0        51     1       152     2       253     3       354     4       455     5       556     6       657     7       758     8       859     9       95B   B_high  B_id  B_low0      10     0      01      40     1     302      50     2     303      54     3     464      84     4     84

numpy
The ✌easiest✌ way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.

a = A.A_value.valuesbh = B.B_high.valuesbl = B.B_low.valuesi, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))pd.DataFrame(    np.column_stack([A.values[i], B.values[j]]),    columns=A.columns.append(B.columns))   A_id  A_value  B_high  B_id  B_low0     0        5      10     0      01     3       35      40     1     302     3       35      50     2     303     4       45      50     2     30

To address the comments and give something akin to a left join, I appended the part of A that doesn't match.

pd.DataFrame(    np.column_stack([A.values[i], B.values[j]]),    columns=A.columns.append(B.columns)).append(    A[~np.in1d(np.arange(len(A)), np.unique(i))],    ignore_index=True, sort=False)    A_id  A_value  B_id  B_low  B_high0      0        5   0.0    0.0    10.01      3       35   1.0   30.0    40.02      3       35   2.0   30.0    50.03      4       45   2.0   30.0    50.04      1       15   NaN    NaN     NaN5      2       25   NaN    NaN     NaN6      5       55   NaN    NaN     NaN7      6       65   NaN    NaN     NaN8      7       75   NaN    NaN     NaN9      8       85   NaN    NaN     NaN10     9       95   NaN    NaN     NaN


Not sure that is more efficient, however you can use sql directly (from the module sqlite3 for instance) with pandas (inspired from this question) like:

conn = sqlite3.connect(":memory:") df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])df1.to_sql("df1", conn, index=False)df2.to_sql("df2", conn, index=False)qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"tt = pd.read_sql_query(qry,conn)

You can adapt the query as needed in your application


I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. That's called pandasql. The documentation explicitly states that joins are supported. This might be at least easier to read since SQL syntax is very readable.