Pandas: conditional shift
I am adding a new column here to record the condition `close_time < open_time` (comparing each row's open_time with the previous row's close_time):
```
import pandas as pd
import numpy as np

df['New'] = ((df.open_time - df.close_time.shift()).dt.days > 0).shift(-1)
s = (df.groupby('user')
       .apply(lambda x: (x['value'] * x['New']).cumsum().shift())
       .reset_index(level=0, drop=True))
s.loc[~(df.New.shift() == True)] = np.nan
df['Cumsum'] = s
df
Out[1043]: 
   user  open_time close_time  value    New Cumsum
0     1 2017-01-01 2017-03-01      5  False    NaN
1     1 2017-01-02 2017-02-01      6   True    NaN
2     1 2017-02-03 2017-02-05      7   True      6
3     1 2017-02-07 2017-04-01      3  False     13
4     2 2017-01-01 2017-02-01     15   True    NaN
5     2 2017-03-01 2017-04-01      3    NaN     15
```
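To unpack the one-liner that builds New (a restatement with an intermediate name of my own):

```
# Flag rows whose open_time falls after the PREVIOUS row's close_time ...
gap_after_prev = (df.open_time - df.close_time.shift()).dt.days > 0
# ... then shift the flag up one row, so New on a given row answers:
# "does the NEXT event open strictly after this row's close_time?"
df['New'] = gap_after_prev.shift(-1)
```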
Update: since the OP updated the question (data from Gabriel A):
```
df['New'] = df.user.map(df.groupby('user').close_time.apply(lambda x: np.array(x)))
df['New1'] = df.user.map(df.groupby('user').value.apply(lambda x: np.array(x)))
df['New2'] = [[x > m for m in y] for x, y in zip(df['open_time'], df['New'])]
df['Yourtarget'] = list(map(sum, df['New2'] * df['New1'].values))
df.drop(['New', 'New1', 'New2'], axis=1)
Out[1376]: 
   user  open_time close_time  value  Yourtarget
0     1 2016-12-30 2016-12-31      1           0
1     1 2017-01-01 2017-03-01      5           1
2     1 2017-01-02 2017-02-01      6           1
3     1 2017-02-03 2017-02-05      7           7
4     1 2017-02-07 2017-04-01      3          14
5     1 2017-09-07 2017-09-11      1          22
6     2 2018-01-01 2018-02-01     15           0
7     2 2018-03-01 2018-04-01      3          15
```
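If the per-user arrays feel opaque, the same target can be computed more directly; a hedged sketch (the helper `target_for_group` is my own name, not from the answer): for each row, sum the values of the same user's events that closed strictly before that row's open_time.

```
def target_for_group(g):
    # Sum this user's values whose close_time is strictly before each
    # open_time (O(n^2) per user, fine for small groups).
    close = g['close_time'].to_numpy()
    vals = g['value'].to_numpy()
    return pd.Series([vals[close < t].sum() for t in g['open_time']],
                     index=g.index)

df['Yourtarget'] = df.groupby('user', group_keys=False).apply(target_for_group)
```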
I made a modification to your test case that I think you should include. This solution does handle your edit.
```
import pandas as pd
import numpy as np

df = pd.read_csv("cond_shift.csv")
df
```
input:
```
   user  open_time close_time  value
0     1 12/30/2016 12/31/2016      1
1     1   1/1/2017   3/1/2017      5
2     1   1/2/2017   2/1/2017      6
3     1   2/3/2017   2/5/2017      7
4     1   2/7/2017   4/1/2017      3
5     1   9/7/2017  9/11/2017      1
6     2   1/1/2018   2/1/2018     15
7     2   3/1/2018   4/1/2018      3
```
create columns to shift:
df["open_time"] = pd.to_datetime(df["open_time"])df["close_time"] = pd.to_datetime(df["close_time"])df.sort_values(['user','close_time'],inplace=True)df['close_cumsum']=df.groupby('user')['value'].cumsum()df.sort_values(['user','open_time'],inplace=True)df user open_time close_time value close_cumsum0 1 2016-12-30 2016-12-31 1 11 1 2017-01-01 2017-03-01 5 192 1 2017-01-02 2017-02-01 6 73 1 2017-02-03 2017-02-05 7 144 1 2017-02-07 2017-04-01 3 225 1 2017-09-07 2017-09-11 1 236 2 2018-01-01 2018-02-01 15 157 2 2018-03-01 2018-04-01 3 18
Shift columns (explanation below):
df["cumulated_closed_value"] = df.groupby("user")["close_cumsum"].transform("shift")condition = ~(df.groupby("user")['close_time'].transform("shift") < df["open_time"])df.loc[ condition,"cumulated_closed_value" ] = Nonedf["cumulated_closed_value"] =df.groupby("user")["cumulated_closed_value"].fillna(method="ffill").fillna(0)dfuser open_time close_time value close_cumsum cumulated_closed_value0 1 2016-12-30 2016-12-31 1 1 0.01 1 2017-01-01 2017-03-01 5 19 1.02 1 2017-01-02 2017-02-01 6 7 1.03 1 2017-02-03 2017-02-05 7 14 7.04 1 2017-02-07 2017-04-01 3 22 14.05 1 2017-09-07 2017-09-11 1 23 22.06 2 2018-01-01 2018-02-01 15 15 0.07 2 2018-03-01 2018-04-01 3 18 15.0
All of this has been written in such a way that it's done across all users. I believe the logic is easier to follow if you only focus on one user at a time (see the single-user sketch after the list below):
- Assume no events happen at the same time. This is the same thing as shifting the cumulative sum down one row.
- Remove events that happen at the same time as other events.
- Fill in the missing values with a forward fill.
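A minimal single-user sketch of those three steps, using a toy frame of my own (not the OP's data):

```
one = pd.DataFrame({
    "open_time":  pd.to_datetime(["2017-01-01", "2017-01-02", "2017-02-03"]),
    "close_time": pd.to_datetime(["2017-03-01", "2017-02-01", "2017-02-05"]),
    "value": [5, 6, 7],
})
one = one.sort_values("close_time")
one["close_cumsum"] = one["value"].cumsum()
one = one.sort_values("open_time")

# Step 1: shift the cumulative sum down one row.
one["cumulated_closed_value"] = one["close_cumsum"].shift()

# Step 2: blank out rows whose open_time is not strictly after the
# previous row's close_time (overlapping events).
one.loc[~(one["close_time"].shift() < one["open_time"]),
        "cumulated_closed_value"] = np.nan

# Step 3: forward-fill the gaps; the leading NaN becomes 0.
one["cumulated_closed_value"] = one["cumulated_closed_value"].ffill().fillna(0)
```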
I would still thoroughly test this before you use it. Time intervals are weird and there are a lot of edge cases.
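For instance, a quick tie check (toy data of my own): one event closes at the exact instant the next one opens, and the strict `<` in the condition above means the tied value is not counted:

```
edge = pd.DataFrame({
    "user": [1, 1],
    "open_time":  pd.to_datetime(["2017-01-01", "2017-02-01"]),
    "close_time": pd.to_datetime(["2017-02-01", "2017-03-01"]),
    "value": [5, 3],
})
edge.sort_values(["user", "close_time"], inplace=True)
edge["close_cumsum"] = edge.groupby("user")["value"].cumsum()
edge.sort_values(["user", "open_time"], inplace=True)
edge["cumulated_closed_value"] = edge.groupby("user")["close_cumsum"].transform("shift")
cond = ~(edge.groupby("user")["close_time"].transform("shift") < edge["open_time"])
edge.loc[cond, "cumulated_closed_value"] = None
edge["cumulated_closed_value"] = edge.groupby("user")["cumulated_closed_value"].fillna(method="ffill").fillna(0)
# The second row ends up with 0.0, not 5.0: the tie is excluded.
```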
(Note: @wen's answer seems fine to me, so I'm not sure if the OP is looking for something more or something different. In any event, here's an alternate approach using merge_asof that should also work well.)
First reshape the dataframes as follows:
```
lookup = (df[['close_time', 'value', 'user']]
          .set_index(['user', 'close_time'])
          .sort_index()
          .groupby('user')
          .cumsum()
          .reset_index(0))
df = df.set_index('open_time').sort_index()
```
The idea with "lookup" is simply to sort by "close_time" and then take a (grouped) cumulative sum:
```
            user  value
close_time
2017-02-01     1      6
2017-02-05     1     13
2017-03-01     1     18
2017-04-01     1     21
2017-09-11     1     22
2018-02-01     2     15
2018-04-01     2     18
```
For "df" we just take a subset of the original dataframe:
```
            user close_time  value
open_time
2017-01-01     1 2017-03-01      5
2017-01-02     1 2017-02-01      6
2017-02-03     1 2017-02-05      7
2017-02-07     1 2017-04-01      3
2017-09-07     1 2017-09-11      1
2018-01-01     2 2018-02-01     15
2018-03-01     2 2018-04-01      3
```
From here, you conceptually just want to merge the two datasets on "user" and "open_time"/"close_time", but the complicating factor is that we don't want an exact match on the time; rather, we want a sort of "nearest" match.
For these sorts of merges you can use merge_asof, which is a great tool for various non-exact matches (including 'nearest', 'backward', and 'forward'). Unfortunately, due to the inclusion of the groupby, it's necessary to also loop over the users (though see the note after the results below), but it's still pretty simple code to read:
```
df_merged = pd.DataFrame()

for u in df['user'].unique():
    df_merged = df_merged.append(
        pd.merge_asof(df[df.user == u], lookup[lookup.user == u],
                      left_index=True, right_index=True,
                      direction='backward'))

df_merged.drop('user_y', axis=1).rename({'value_y': 'close_cumsum'}, axis=1)
```
Results:
```
            user_x close_time  value_x  close_cumsum
open_time
2017-01-01       1 2017-03-01        5           NaN
2017-01-02       1 2017-02-01        6           NaN
2017-02-03       1 2017-02-05        7           6.0
2017-02-07       1 2017-04-01        3          13.0
2017-09-07       1 2017-09-11        1          21.0
2018-01-01       2 2018-02-01       15           NaN
2018-03-01       2 2018-04-01        3          15.0
```
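As an aside (not part of the loop above): merge_asof also accepts a by= argument that does the per-user matching itself, so the explicit loop, and the append call that newer pandas versions have removed, can be avoided. A hedged sketch, assuming the reshaped df and lookup from above; the names left, right, and merged are mine:

```
# merge_asof needs both frames globally sorted by their time keys,
# so move the indexes back into columns and sort.
left = df.reset_index().sort_values('open_time')
right = lookup.reset_index().sort_values('close_time')

merged = pd.merge_asof(left, right,
                       left_on='open_time', right_on='close_time',
                       by='user',             # match only within the same user
                       direction='backward')  # last close_time <= open_time
merged = merged.rename({'value_y': 'close_cumsum'}, axis=1)
```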