How do you do (or work around) a conditional join in Python pandas?
Well, I can think of a few ways:
- Essentially blow up the dataframe by just merging on the exact field (`company`), then filter on the 30-day windows after the merge. This should be fast but could use lots of memory.
- Move the merging and the filtering on the 30-day window into a `groupby()`. This results in a merge for each group, so it is slower but should use less memory.
Option #1
Suppose your data looks like the following (I expanded your sample data):

    print df
        company       date  measure
    0         0 2010-01-01       10
    1         0 2010-01-15       10
    2         0 2010-02-01       10
    3         0 2010-02-15       10
    4         0 2010-03-01       10
    5         0 2010-03-15       10
    6         0 2010-04-01       10
    7         1 2010-03-01        5
    8         1 2010-03-15        5
    9         1 2010-04-01        5
    10        1 2010-04-15        5
    11        1 2010-05-01        5
    12        1 2010-05-15        5

    print windows
       company   end_date
    0        0 2010-02-01
    1        0 2010-03-15
    2        1 2010-04-01
    3        1 2010-05-15
Create a beginning date for the 30-day windows:

    import numpy as np

    # start of the 30-day look-back window for each end_date
    windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]')
                           - np.timedelta64(30, 'D'))
    print windows
       company   end_date   beg_date
    0        0 2010-02-01 2010-01-02
    1        0 2010-03-15 2010-02-13
    2        1 2010-04-01 2010-03-02
    3        1 2010-05-15 2010-04-15
Now do a merge and then select the rows where `date` falls between `beg_date` and `end_date`:

    df = df.merge(windows, on='company', how='left')
    df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
    print df
        company       date  measure   end_date   beg_date
    2         0 2010-01-15       10 2010-02-01 2010-01-02
    4         0 2010-02-01       10 2010-02-01 2010-01-02
    7         0 2010-02-15       10 2010-03-15 2010-02-13
    9         0 2010-03-01       10 2010-03-15 2010-02-13
    11        0 2010-03-15       10 2010-03-15 2010-02-13
    16        1 2010-03-15        5 2010-04-01 2010-03-02
    18        1 2010-04-01        5 2010-04-01 2010-03-02
    21        1 2010-04-15        5 2010-05-15 2010-04-15
    23        1 2010-05-01        5 2010-05-15 2010-04-15
    25        1 2010-05-15        5 2010-05-15 2010-04-15
You can compute the 30-day window sums by grouping on `company` and `end_date`:

    print df.groupby(['company','end_date']).sum()
                        measure
    company end_date
    0       2010-02-01       20
            2010-03-15       30
    1       2010-04-01       10
            2010-05-15       15
Option #2
Move all the merging into a groupby. This should be better on memory, but I would expect it to be much slower:

    windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]')
                           - np.timedelta64(30, 'D'))

    def cond_merge(g, windows):
        g = g.merge(windows, on='company', how='left')
        g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
        return g.groupby('end_date')['measure'].sum()

    print df.groupby('company').apply(cond_merge, windows)
    company  end_date
    0        2010-02-01    20
             2010-03-15    30
    1        2010-04-01    10
             2010-05-15    15
Another option
If your windows never overlap (as in the example data), you can instead do the following, which doesn't blow up the dataframe and is still pretty fast:

    windows['date'] = windows['end_date']
    df = df.merge(windows, on=['company','date'], how='outer')
    print df
        company       date  measure   end_date
    0         0 2010-01-01       10        NaT
    1         0 2010-01-15       10        NaT
    2         0 2010-02-01       10 2010-02-01
    3         0 2010-02-15       10        NaT
    4         0 2010-03-01       10        NaT
    5         0 2010-03-15       10 2010-03-15
    6         0 2010-04-01       10        NaT
    7         1 2010-03-01        5        NaT
    8         1 2010-03-15        5        NaT
    9         1 2010-04-01        5 2010-04-01
    10        1 2010-04-15        5        NaT
    11        1 2010-05-01        5        NaT
    12        1 2010-05-15        5 2010-05-15
This merge essentially inserts your window end dates into the dataframe; back-filling the end dates (by group) then gives you a structure from which you can easily build your summation windows:

    df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())
    print df
        company       date  measure   end_date
    0         0 2010-01-01       10 2010-02-01
    1         0 2010-01-15       10 2010-02-01
    2         0 2010-02-01       10 2010-02-01
    3         0 2010-02-15       10 2010-03-15
    4         0 2010-03-01       10 2010-03-15
    5         0 2010-03-15       10 2010-03-15
    6         0 2010-04-01       10        NaT
    7         1 2010-03-01        5 2010-04-01
    8         1 2010-03-15        5 2010-04-01
    9         1 2010-04-01        5 2010-04-01
    10        1 2010-04-15        5 2010-05-15
    11        1 2010-05-01        5 2010-05-15
    12        1 2010-05-15        5 2010-05-15

    df = df[df.end_date.notnull()]
    df['beg_date'] = (df['end_date'].values.astype('datetime64[D]')
                      - np.timedelta64(30, 'D'))
    print df
        company       date  measure   end_date   beg_date
    0         0 2010-01-01       10 2010-02-01 2010-01-02
    1         0 2010-01-15       10 2010-02-01 2010-01-02
    2         0 2010-02-01       10 2010-02-01 2010-01-02
    3         0 2010-02-15       10 2010-03-15 2010-02-13
    4         0 2010-03-01       10 2010-03-15 2010-02-13
    5         0 2010-03-15       10 2010-03-15 2010-02-13
    7         1 2010-03-01        5 2010-04-01 2010-03-02
    8         1 2010-03-15        5 2010-04-01 2010-03-02
    9         1 2010-04-01        5 2010-04-01 2010-03-02
    10        1 2010-04-15        5 2010-05-15 2010-04-15
    11        1 2010-05-01        5 2010-05-15 2010-04-15
    12        1 2010-05-15        5 2010-05-15 2010-04-15

    df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
    print df.groupby(['company','end_date']).sum()
                        measure
    company end_date
    0       2010-02-01       20
            2010-03-15       30
    1       2010-04-01       10
            2010-05-15       15
Another alternative is to resample your first dataframe to daily data, compute rolling sums over a 30-day window, and then select just the end dates you are interested in. This could also be quite memory intensive.
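As a rough sketch of that idea, assuming a reasonably recent pandas with `.resample()` and time-based `.rolling()` (which post-date this answer), and starting again from the original `df` and `windows` frames (`daily` and `roll30` are just illustrative names):

    # Upsample each company's measures to a daily series, then take a
    # trailing rolling sum. A window of [end_date - 30 days, end_date]
    # spans 31 calendar days, and time-based windows are closed on the
    # right, hence '31D'.
    daily = (df.set_index('date')
               .groupby('company')['measure']
               .resample('D').sum())

    roll30 = (daily.reset_index('company')
                   .groupby('company')['measure']
                   .rolling('31D').sum())

    # Pull out only the (company, end_date) pairs you actually want.
    result = roll30.loc[list(zip(windows['company'], windows['end_date']))]
    print(result)

The results should match the windowed sums above, though the exact boundary behaviour depends on how you define the 30-day window.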
There is an easy and practical way to get a conditional join in pandas, although it is not a direct one: since pandas has no built-in conditional join, you will need an additional library, pandasql.
Install the pandasql library from pip using the command `pip install pandasql`. This library allows you to query pandas dataframes using SQL.

    import pandas as pd
    from pandasql import sqldf

    df = pd.read_excel(r'play_data.xlsx')
    df
         id Name  Amount
    0  A001    A     100
    1  A002    B     110
    2  A003    C     120
    3  A005    D     150
Now let's do a conditional join to compare the Amounts of the IDs:

    # Make your pysqldf object:
    pysqldf = lambda q: sqldf(q, globals())

    # Write your query in SQL syntax; here you can use df as a normal SQL table
    cond_join = '''
        select
            df_left.*,
            df_right.*
        from df as df_left
        join df as df_right
        on df_left.[Amount] > (df_right.[Amount] + 10)
    '''

    # Now get your query's results as a dataframe using the sqldf object you created
    pysqldf(cond_join)
         id Name  Amount    id Name  Amount
    0  A003    C     120  A001    A     100
    1  A005    D     150  A001    A     100
    2  A005    D     150  A002    B     110
    3  A005    D     150  A003    C     120
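For comparison, the same conditional self-join can also be written with the plain-pandas "merge everything, then filter" trick from the first answer. This is only a sketch (the `_key` helper column and the `_left`/`_right` suffixes are arbitrary names introduced here), but it avoids the extra dependency:

    # Cross join df with itself via a constant helper key, then keep only
    # the pairs where the left Amount exceeds the right Amount by more than 10.
    tmp = df.assign(_key=1)
    pairs = tmp.merge(tmp, on='_key', suffixes=('_left', '_right')).drop('_key', axis=1)
    pairs = pairs[pairs['Amount_left'] > pairs['Amount_right'] + 10]
    pairs

On large frames this cross join has the same memory blow-up caveat as Option #1 above.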