
Pandas: conditional shift


I am using a new column here to record the condition df2['close_time'] < df2['open_time']:

# New[i] is True when row i's close_time falls before row i+1's open_time,
# i.e. row i's value is already "closed" when the next row opens.
df['New']=((df.open_time-df.close_time.shift()).dt.days>0).shift(-1)
# Per-user running sum of closed values, shifted down one row so each row
# only sees what was closed before it opened.
s=df.groupby('user').apply(lambda x : (x['value']*x['New']).cumsum().shift()).reset_index(level=0,drop=True)
# Rows whose previous row was not flagged as closed get no cumulative sum.
s.loc[~(df.New.shift()==True)]=np.nan
df['Cumsum']=s
df
Out[1043]: 
   user  open_time close_time  value    New  Cumsum
0     1 2017-01-01 2017-03-01      5  False     NaN
1     1 2017-01-02 2017-02-01      6   True     NaN
2     1 2017-02-03 2017-02-05      7   True       6
3     1 2017-02-07 2017-04-01      3  False      13
4     2 2017-01-01 2017-02-01     15   True     NaN
5     2 2017-03-01 2017-04-01      3    NaN      15

Update: since the OP updated the question (data from Gabriel A):

# For each row, collect the user's full arrays of close_times and values...
df['New']=df.user.map(df.groupby('user').close_time.apply(lambda x: np.array(x)))
df['New1']=df.user.map(df.groupby('user').value.apply(lambda x: np.array(x)))
# ...mark which close_times fall before this row's open_time...
df['New2']=[[x>m for m in y] for x,y in zip(df['open_time'],df['New'])]
# ...and sum the corresponding values.
df['Yourtarget']=list(map(sum,df['New2']*df['New1'].values))
df.drop(['New','New1','New2'],axis=1)
Out[1376]: 
   user  open_time close_time  value  Yourtarget
0     1 2016-12-30 2016-12-31      1           0
1     1 2017-01-01 2017-03-01      5           1
2     1 2017-01-02 2017-02-01      6           1
3     1 2017-02-03 2017-02-05      7           7
4     1 2017-02-07 2017-04-01      3          14
5     1 2017-09-07 2017-09-11      1          22
6     2 2018-01-01 2018-02-01     15           0
7     2 2018-03-01 2018-04-01      3          15


I made a modification to your test case that I think you should include. This solution does handle your edit.

import pandas as pd
import numpy as np

df = pd.read_csv("cond_shift.csv")
df

input:

   user  open_time   close_time  value
0  1     12/30/2016  12/31/2016  1
1  1     1/1/2017    3/1/2017    5
2  1     1/2/2017    2/1/2017    6
3  1     2/3/2017    2/5/2017    7
4  1     2/7/2017    4/1/2017    3
5  1     9/7/2017    9/11/2017   1
6  2     1/1/2018    2/1/2018    15
7  2     3/1/2018    4/1/2018    3
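(If you don't have the CSV on hand, an equivalent frame can be built inline; this is just a sketch of the same test data shown above:)

import pandas as pd

# Same test data as cond_shift.csv, built inline.
df = pd.DataFrame({
    'user':       [1, 1, 1, 1, 1, 1, 2, 2],
    'open_time':  ['12/30/2016', '1/1/2017', '1/2/2017', '2/3/2017',
                   '2/7/2017', '9/7/2017', '1/1/2018', '3/1/2018'],
    'close_time': ['12/31/2016', '3/1/2017', '2/1/2017', '2/5/2017',
                   '4/1/2017', '9/11/2017', '2/1/2018', '4/1/2018'],
    'value':      [1, 5, 6, 7, 3, 1, 15, 3],
})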

create columns to shift:

df["open_time"] = pd.to_datetime(df["open_time"])df["close_time"] = pd.to_datetime(df["close_time"])df.sort_values(['user','close_time'],inplace=True)df['close_cumsum']=df.groupby('user')['value'].cumsum()df.sort_values(['user','open_time'],inplace=True)df   user open_time   close_time  value   close_cumsum0   1   2016-12-30  2016-12-31  1       11   1   2017-01-01  2017-03-01  5       192   1   2017-01-02  2017-02-01  6       73   1   2017-02-03  2017-02-05  7       144   1   2017-02-07  2017-04-01  3       225   1   2017-09-07  2017-09-11  1       236   2   2018-01-01  2018-02-01  15      157   2   2018-03-01  2018-04-01  3       18

Shift columns (explanation below):

df["cumulated_closed_value"] = df.groupby("user")["close_cumsum"].transform("shift")condition = ~(df.groupby("user")['close_time'].transform("shift") < df["open_time"])df.loc[ condition,"cumulated_closed_value" ] = Nonedf["cumulated_closed_value"] =df.groupby("user")["cumulated_closed_value"].fillna(method="ffill").fillna(0)dfuser    open_time   close_time  value   close_cumsum    cumulated_closed_value0   1   2016-12-30  2016-12-31  1       1               0.01   1   2017-01-01  2017-03-01  5       19              1.02   1   2017-01-02  2017-02-01  6       7               1.03   1   2017-02-03  2017-02-05  7       14              7.04   1   2017-02-07  2017-04-01  3       22              14.05   1   2017-09-07  2017-09-11  1       23              22.06   2   2018-01-01  2018-02-01  15      15              0.07   2   2018-03-01  2018-04-01  3       18              15.0

All of this has been written in such a way that it's done across all users. I believe the logic is easier to follow if you focus on one user at a time (see the single-user sketch after this list):

  • Assume no events happen at the same time. This is the same thing as shifting the cumulative sum down one row.
  • Remove events that happen at the same time as other events.
  • Fill in the missing values with a forward fill.
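For example, here is a minimal sketch of those three steps for a single user, using the user-1 rows from the original question (the frame name `one` is just for illustration):

import pandas as pd
import numpy as np

# Single-user frame: the user-1 rows from the original question.
one = pd.DataFrame({
    'open_time':  pd.to_datetime(['2017-01-01', '2017-01-02', '2017-02-03', '2017-02-07']),
    'close_time': pd.to_datetime(['2017-03-01', '2017-02-01', '2017-02-05', '2017-04-01']),
    'value':      [5, 6, 7, 3],
})

# Step 1: cumulative sum in close_time order, then shift it down one row.
one = one.sort_values('close_time')
one['close_cumsum'] = one['value'].cumsum()
one = one.sort_values('open_time')
one['cumulated_closed_value'] = one['close_cumsum'].shift()

# Step 2: blank out rows where the previous close_time is not strictly
# before this row's open_time (the intervals overlap).
overlap = ~(one['close_time'].shift() < one['open_time'])
one.loc[overlap, 'cumulated_closed_value'] = np.nan

# Step 3: forward-fill, then treat "nothing closed yet" as 0.
one['cumulated_closed_value'] = one['cumulated_closed_value'].ffill().fillna(0)
print(one)  # cumulated_closed_value: 0.0, 0.0, 6.0, 13.0

The grouped version above does exactly this, with groupby("user") supplying the per-user boundaries.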

I would still thoroughly test this before you use it. Time intervals are weird and there are a lot of edge cases.


(Note: @wen's answer seems fine to me, so I'm not sure if the OP is looking for something more or something different. In any event, here's an alternate approach using merge_asof that should also work well.)

First reshape the dataframes as follows:

lookup = ( df[['close_time','value','user']].set_index(['user','close_time'])
           .sort_index().groupby('user').cumsum().reset_index(0) )
df = df.set_index('open_time').sort_index()

The idea with "lookup" is simply to sort by "close_time" and then take a (grouped) cumulative sum:

            user  value
close_time             
2017-02-01     1      6
2017-02-05     1     13
2017-03-01     1     18
2017-04-01     1     21
2017-09-11     1     22
2018-02-01     2     15
2018-04-01     2     18

For "df" we just take a subset of the original dataframe:

            user close_time  value
open_time                         
2017-01-01     1 2017-03-01      5
2017-01-02     1 2017-02-01      6
2017-02-03     1 2017-02-05      7
2017-02-07     1 2017-04-01      3
2017-09-07     1 2017-09-11      1
2018-01-01     2 2018-02-01     15
2018-03-01     2 2018-04-01      3

From here, you conceptually just want to merge the two datasets on "user" and "open_time"/"close_time", but the complicating factor is that we don't want an exact match on the time, but rather a sort of "nearest" match.

For these sorts of merges you can use merge_asof, which is a great tool for various non-exact matches (including 'nearest', 'backward', and 'forward'). Unfortunately, due to the inclusion of groupby, it's necessary to also loop over the users, but it's still pretty simple code to read:

# pd.DataFrame.append was removed in pandas 2.0, so collect the per-user
# merges in a list and concatenate at the end.
frames = []
for u in df['user'].unique():
    frames.append( pd.merge_asof( df[df.user==u], lookup[lookup.user==u],
                                  left_index=True, right_index=True,
                                  direction='backward' ) )
df_merged = pd.concat(frames)
df_merged.drop('user_y',axis=1).rename({'value_y':'close_cumsum'},axis=1)

Results:

            user_x close_time  value_x  close_cumsum
open_time                                           
2017-01-01       1 2017-03-01        5           NaN
2017-01-02       1 2017-02-01        6           NaN
2017-02-03       1 2017-02-05        7           6.0
2017-02-07       1 2017-04-01        3          13.0
2017-09-07       1 2017-09-11        1          21.0
2018-01-01       2 2018-02-01       15           NaN
2018-03-01       2 2018-04-01        3          15.0
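As a side note, merge_asof also has a by= parameter that does the per-group matching itself, which avoids the explicit loop over users. A minimal sketch, assuming the reshaped df and lookup from above (both frames must be sorted by their time key for merge_asof):

# Move the time indexes back into columns so we can use the on=/by= form.
left = df.reset_index().sort_values('open_time')
right = ( lookup.rename(columns={'value': 'close_cumsum'})
                .reset_index().sort_values('close_time') )

df_merged = pd.merge_asof(
    left, right,
    left_on='open_time', right_on='close_time',
    by='user',                  # match rows only within the same user
    direction='backward',
    suffixes=('', '_lookup'),   # right's close_time becomes close_time_lookup
)

This should produce the same close_cumsum column as the loop, though I've only checked it against the data shown here.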