Best way to join / merge by range in pandas
Setup
Consider the dataframes A
and B
A = pd.DataFrame(dict( A_id=range(10), A_value=range(5, 105, 10) ))B = pd.DataFrame(dict( B_id=range(5), B_low=[0, 30, 30, 46, 84], B_high=[10, 40, 50, 54, 84] ))A A_id A_value0 0 51 1 152 2 253 3 354 4 455 5 556 6 657 7 758 8 859 9 95B B_high B_id B_low0 10 0 01 40 1 302 50 2 303 54 3 464 84 4 84
numpy
The ✌easiest✌ way is to use numpy
broadcasting.
We look for every instance of A_value
being greater than or equal to B_low
while at the same time A_value
is less than or equal to B_high
.
a = A.A_value.valuesbh = B.B_high.valuesbl = B.B_low.valuesi, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))pd.DataFrame( np.column_stack([A.values[i], B.values[j]]), columns=A.columns.append(B.columns)) A_id A_value B_high B_id B_low0 0 5 10 0 01 3 35 40 1 302 3 35 50 2 303 4 45 50 2 30
To address the comments and give something akin to a left join, I appended the part of A
that doesn't match.
pd.DataFrame( np.column_stack([A.values[i], B.values[j]]), columns=A.columns.append(B.columns)).append( A[~np.in1d(np.arange(len(A)), np.unique(i))], ignore_index=True, sort=False) A_id A_value B_id B_low B_high0 0 5 0.0 0.0 10.01 3 35 1.0 30.0 40.02 3 35 2.0 30.0 50.03 4 45 2.0 30.0 50.04 1 15 NaN NaN NaN5 2 25 NaN NaN NaN6 5 55 NaN NaN NaN7 6 65 NaN NaN NaN8 7 75 NaN NaN NaN9 8 85 NaN NaN NaN10 9 95 NaN NaN NaN
Not sure that is more efficient, however you can use sql directly (from the module sqlite3 for instance) with pandas (inspired from this question) like:
conn = sqlite3.connect(":memory:") df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])df1.to_sql("df1", conn, index=False)df2.to_sql("df2", conn, index=False)qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"tt = pd.read_sql_query(qry,conn)
You can adapt the query as needed in your application
I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. That's called pandasql. The documentation explicitly states that joins are supported. This might be at least easier to read since SQL syntax is very readable.