How to do/workaround a conditional join in python Pandas?


Well, I can think of a few ways:

  1. Essentially blow up the dataframe by just merging on the exact field (company), then filter on the 30-day windows after the merge.
  • should be fast but could use lots of memory
  2. Move the merging and filtering on the 30-day window into a groupby().
  • results in a merge for each group, so slower but should use less memory

Option #1

Suppose your data looks like the following (I expanded your sample data):

print df
    company       date  measure
0         0 2010-01-01       10
1         0 2010-01-15       10
2         0 2010-02-01       10
3         0 2010-02-15       10
4         0 2010-03-01       10
5         0 2010-03-15       10
6         0 2010-04-01       10
7         1 2010-03-01        5
8         1 2010-03-15        5
9         1 2010-04-01        5
10        1 2010-04-15        5
11        1 2010-05-01        5
12        1 2010-05-15        5

print windows
   company   end_date
0        0 2010-02-01
1        0 2010-03-15
2        1 2010-04-01
3        1 2010-05-15

Create a beginning date for each 30-day window:

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30,'D'))
print windows
   company   end_date   beg_date
0        0 2010-02-01 2010-01-02
1        0 2010-03-15 2010-02-13
2        1 2010-04-01 2010-03-02
3        1 2010-05-15 2010-04-15

Now do a merge and then select the rows where date falls between beg_date and end_date:

df = df.merge(windows,on='company',how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df
    company       date  measure   end_date   beg_date
2         0 2010-01-15       10 2010-02-01 2010-01-02
4         0 2010-02-01       10 2010-02-01 2010-01-02
7         0 2010-02-15       10 2010-03-15 2010-02-13
9         0 2010-03-01       10 2010-03-15 2010-02-13
11        0 2010-03-15       10 2010-03-15 2010-02-13
16        1 2010-03-15        5 2010-04-01 2010-03-02
18        1 2010-04-01        5 2010-04-01 2010-03-02
21        1 2010-04-15        5 2010-05-15 2010-04-15
23        1 2010-05-01        5 2010-05-15 2010-04-15
25        1 2010-05-15        5 2010-05-15 2010-04-15

You can compute the 30-day window sums by grouping on company and end_date:

print df.groupby(['company','end_date']).sum()
                    measure
company end_date
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15
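For reference, the same Option #1 steps in modern (Python 3) pandas syntax, with the sample data built inline, might look like this sketch:

```python
import pandas as pd

# Sample data from the example above
df = pd.DataFrame({
    'company': [0]*7 + [1]*6,
    'date': pd.to_datetime([
        '2010-01-01', '2010-01-15', '2010-02-01', '2010-02-15',
        '2010-03-01', '2010-03-15', '2010-04-01',
        '2010-03-01', '2010-03-15', '2010-04-01',
        '2010-04-15', '2010-05-01', '2010-05-15']),
    'measure': [10]*7 + [5]*6,
})
windows = pd.DataFrame({
    'company': [0, 0, 1, 1],
    'end_date': pd.to_datetime(['2010-02-01', '2010-03-15',
                                '2010-04-01', '2010-05-15']),
})

# Option #1: merge on company alone, then filter to the 30-day window
windows['beg_date'] = windows['end_date'] - pd.Timedelta(days=30)
merged = df.merge(windows, on='company', how='left')
in_window = merged[(merged.date >= merged.beg_date) &
                   (merged.date <= merged.end_date)]
sums = in_window.groupby(['company', 'end_date'])['measure'].sum()
print(sums)
```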

Option #2

Move all of the merging into a groupby. This should be easier on memory, but I would expect it to be much slower:

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30,'D'))

def cond_merge(g,windows):
    g = g.merge(windows,on='company',how='left')
    g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
    return g.groupby('end_date')['measure'].sum()

print df.groupby('company').apply(cond_merge,windows)
company  end_date
0        2010-02-01    20
         2010-03-15    30
1        2010-04-01    10
         2010-05-15    15
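If you prefer to avoid groupby().apply() (whose handling of the grouping column has changed across pandas versions), the per-group merge can also be spelled as an explicit loop. A sketch, rebuilding the same sample data:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'company': [0]*7 + [1]*6,
                   'date': pd.to_datetime(
                       ['2010-01-01', '2010-01-15', '2010-02-01', '2010-02-15',
                        '2010-03-01', '2010-03-15', '2010-04-01',
                        '2010-03-01', '2010-03-15', '2010-04-01',
                        '2010-04-15', '2010-05-01', '2010-05-15']),
                   'measure': [10]*7 + [5]*6})
windows = pd.DataFrame({'company': [0, 0, 1, 1],
                        'end_date': pd.to_datetime(
                            ['2010-02-01', '2010-03-15',
                             '2010-04-01', '2010-05-15'])})
windows['beg_date'] = windows['end_date'] - pd.Timedelta(days=30)

# One merge per company group: only that group's rows and windows
# are materialized at any one time
pieces = {}
for comp, g in df.groupby('company'):
    m = g.merge(windows, on='company')
    m = m[(m.date >= m.beg_date) & (m.date <= m.end_date)]
    pieces[comp] = m.groupby('end_date')['measure'].sum()
result = pd.concat(pieces, names=['company'])
print(result)
```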

Another option

If your windows never overlap (as in the example data), you can use the following alternative, which doesn't blow up the dataframe and is pretty fast:

windows['date'] = windows['end_date']
df = df.merge(windows,on=['company','date'],how='outer')
print df
    company       date  measure   end_date
0         0 2010-01-01       10        NaT
1         0 2010-01-15       10        NaT
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10        NaT
4         0 2010-03-01       10        NaT
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5        NaT
8         1 2010-03-15        5        NaT
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5        NaT
11        1 2010-05-01        5        NaT
12        1 2010-05-15        5 2010-05-15

This merge essentially inserts your window end dates into the dataframe; backfilling the end dates (by group) then gives you a structure from which you can easily build your summation windows:

df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())
print df
    company       date  measure   end_date
0         0 2010-01-01       10 2010-02-01
1         0 2010-01-15       10 2010-02-01
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10 2010-03-15
4         0 2010-03-01       10 2010-03-15
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5 2010-04-01
8         1 2010-03-15        5 2010-04-01
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5 2010-05-15
11        1 2010-05-01        5 2010-05-15
12        1 2010-05-15        5 2010-05-15

df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
                  np.timedelta64(30,'D'))
print df
   company       date  measure   end_date   beg_date
0         0 2010-01-01       10 2010-02-01 2010-01-02
1         0 2010-01-15       10 2010-02-01 2010-01-02
2         0 2010-02-01       10 2010-02-01 2010-01-02
3         0 2010-02-15       10 2010-03-15 2010-02-13
4         0 2010-03-01       10 2010-03-15 2010-02-13
5         0 2010-03-15       10 2010-03-15 2010-02-13
7         1 2010-03-01        5 2010-04-01 2010-03-02
8         1 2010-03-15        5 2010-04-01 2010-03-02
9         1 2010-04-01        5 2010-04-01 2010-03-02
10        1 2010-04-15        5 2010-05-15 2010-04-15
11        1 2010-05-01        5 2010-05-15 2010-04-15
12        1 2010-05-15        5 2010-05-15 2010-04-15

df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()
                    measure
company end_date
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15
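In later pandas versions, the insert-and-backfill trick has a built-in equivalent: pd.merge_asof with direction='forward' attaches to each row the next end_date for its company. A sketch, again assuming non-overlapping windows:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'company': [0]*7 + [1]*6,
                   'date': pd.to_datetime(
                       ['2010-01-01', '2010-01-15', '2010-02-01', '2010-02-15',
                        '2010-03-01', '2010-03-15', '2010-04-01',
                        '2010-03-01', '2010-03-15', '2010-04-01',
                        '2010-04-15', '2010-05-01', '2010-05-15']),
                   'measure': [10]*7 + [5]*6})
windows = pd.DataFrame({'company': [0, 0, 1, 1],
                        'end_date': pd.to_datetime(
                            ['2010-02-01', '2010-03-15',
                             '2010-04-01', '2010-05-15'])})

# Both frames must be sorted on the asof keys; rows with no forward
# end_date get NaT and drop out in the window filter below
tagged = pd.merge_asof(df.sort_values('date'),
                       windows.sort_values('end_date'),
                       left_on='date', right_on='end_date',
                       by='company', direction='forward')
tagged['beg_date'] = tagged['end_date'] - pd.Timedelta(days=30)
tagged = tagged[(tagged.date >= tagged.beg_date) &
                (tagged.date <= tagged.end_date)]
sums = tagged.groupby(['company', 'end_date'])['measure'].sum()
print(sums)
```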

Another alternative is to resample your first dataframe to daily data, compute rolling sums over a 30-day window, and then select the dates you are interested in at the end. This could also be quite memory intensive.
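That rolling-sum idea can be sketched with a time-based rolling window, which avoids the daily resample as long as every window end date actually appears in the data; closed='both' makes the window inclusive on both ends, matching the [beg_date, end_date] filters above:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'company': [0]*7 + [1]*6,
                   'date': pd.to_datetime(
                       ['2010-01-01', '2010-01-15', '2010-02-01', '2010-02-15',
                        '2010-03-01', '2010-03-15', '2010-04-01',
                        '2010-03-01', '2010-03-15', '2010-04-01',
                        '2010-04-15', '2010-05-01', '2010-05-15']),
                   'measure': [10]*7 + [5]*6})
windows = pd.DataFrame({'company': [0, 0, 1, 1],
                        'end_date': pd.to_datetime(
                            ['2010-02-01', '2010-03-15',
                             '2010-04-01', '2010-05-15'])})

# 30-day trailing sum at every observed date, per company
rolled = (df.sort_values('date')
            .set_index('date')
            .groupby('company')['measure']
            .rolling('30D', closed='both')
            .sum())

# Pick out only the window end dates of interest
sums = rolled.loc[list(zip(windows['company'], windows['end_date']))]
print(sums)
```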


Since there is no direct way to do a conditional join in pandas itself, there is a very easy and practical workaround: use an additional library, pandasql.

Install pandasql from pip with the command pip install pandasql. This library lets you manipulate pandas dataframes using SQL queries.

import pandas as pd
from pandasql import sqldf

df = pd.read_excel(r'play_data.xlsx')
df
     id Name  Amount
0  A001    A     100
1  A002    B     110
2  A003    C     120
3  A005    D     150

Now let's do a conditional join to compare the Amount values across IDs:

# Make your pysqldf object:
pysqldf = lambda q: sqldf(q, globals())

# Write your query in SQL syntax; here you can use df as a normal SQL table
cond_join = '''
    select
        df_left.*,
        df_right.*
    from df as df_left
    join df as df_right
    on
        df_left.[Amount] > (df_right.[Amount]+10)
'''

# Now get your query results as a dataframe using the sqldf object you created
pysqldf(cond_join)
     id Name  Amount    id Name  Amount
0  A003    C     120  A001    A     100
1  A005    D     150  A001    A     100
2  A005    D     150  A002    B     110
3  A005    D     150  A003    C     120
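If you'd rather not pull in pandasql, the same inequality self-join can be emulated in plain pandas (1.2+) with a cross join followed by a filter. A sketch using the same hypothetical play_data values:

```python
import pandas as pd

# The same hypothetical play_data values as above
df = pd.DataFrame({'id': ['A001', 'A002', 'A003', 'A005'],
                   'Name': ['A', 'B', 'C', 'D'],
                   'Amount': [100, 110, 120, 150]})

# Cross join every row against every row, then keep only the pairs
# that satisfy the join condition Amount_left > Amount_right + 10
pairs = df.merge(df, how='cross', suffixes=('_left', '_right'))
result = pairs[pairs['Amount_left'] > pairs['Amount_right'] + 10]
print(result)
```

This produces the same four row pairs as the pandasql query, at the cost of materializing the full cross product first.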