Pandas sum time interval in a group excluding overlaps
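The answers below work on a frame with one row per interval and columns group, start and end (plus an id in the first example). A minimal reconstruction of that sample data, inferred from the printouts in the answers rather than quoted from the question, could look like this:

# Reconstructed sample data (inferred from the answers' printouts, not the original question)
import pandas as pd

df = pd.DataFrame({
    'id':    range(10),
    'group': [0, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    'start': pd.to_datetime([
        '2019-10-21 16:20', '2019-10-21 16:22', '2019-10-21 16:22',
        '2019-10-21 16:15', '2019-10-21 16:22', '2019-10-21 16:58',
        '2019-10-21 17:02', '2019-10-21 17:03', '2019-10-21 17:04',
        '2019-10-21 17:20']),
    'end': pd.to_datetime([
        '2019-10-21 16:25', '2019-10-21 16:24', '2019-10-21 16:24',
        '2019-10-21 16:18', '2019-10-21 16:26', '2019-10-21 17:02',
        '2019-10-21 17:06', '2019-10-21 17:07', '2019-10-21 17:08',
        '2019-10-21 17:22']),
})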
First, add a column which tracks the latest end time seen so far (but only considering the same group):
df['notbefore'] = df.groupby('group').end.transform(lambda s: s.shift().cummax())
The end times are shifted by 1 so that each row only sees the latest end time of the previous rows in its group, not its own. Both shift() and cummax() run inside the groupby (via transform), so values cannot "leak" between groups.
Then add a column containing the "effective" start time:
df['effstart'] = df[['start', 'notbefore']].max(axis=1)
This is the start time modified so that it is not before any previous end time (to avoid overlap).
Then compute the total seconds covered:
df['effsec'] = (df.end - df.effstart).clip(np.timedelta64(0))
df is now:
   id  group               start                 end           notbefore            effstart   effsec
0   0      0 2019-10-21 16:20:00 2019-10-21 16:25:00                 NaT 2019-10-21 16:20:00 00:05:00
1   1      1 2019-10-21 16:22:00 2019-10-21 16:24:00                 NaT 2019-10-21 16:22:00 00:02:00
2   2      1 2019-10-21 16:22:00 2019-10-21 16:24:00 2019-10-21 16:24:00 2019-10-21 16:24:00 00:00:00
3   3      2 2019-10-21 16:15:00 2019-10-21 16:18:00                 NaT 2019-10-21 16:15:00 00:03:00
4   4      2 2019-10-21 16:22:00 2019-10-21 16:26:00 2019-10-21 16:18:00 2019-10-21 16:22:00 00:04:00
5   5      3 2019-10-21 16:58:00 2019-10-21 17:02:00                 NaT 2019-10-21 16:58:00 00:04:00
6   6      4 2019-10-21 17:02:00 2019-10-21 17:06:00                 NaT 2019-10-21 17:02:00 00:04:00
7   7      4 2019-10-21 17:03:00 2019-10-21 17:07:00 2019-10-21 17:06:00 2019-10-21 17:06:00 00:01:00
8   8      4 2019-10-21 17:04:00 2019-10-21 17:08:00 2019-10-21 17:07:00 2019-10-21 17:07:00 00:01:00
9   9      4 2019-10-21 17:20:00 2019-10-21 17:22:00 2019-10-21 17:08:00 2019-10-21 17:20:00 00:02:00
To get the final results:
df.groupby('group').effsec.sum()
Which gives you:
group
0   00:05:00
1   00:02:00
2   00:07:00
3   00:04:00
4   00:08:00
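If you also want these totals attached to every row of df (similar to what the last answer in this thread does with a merge), one possible sketch is to map the per-group sums back by group key; the column name total below is illustrative, not part of the original answer:

# Sketch: broadcast the per-group totals back onto df ('total' is a hypothetical column name)
totals = df.groupby('group').effsec.sum()
df['total'] = df['group'].map(totals)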
Use:
def merge_intervals(intervals):
    sorted_by_lower_bound = sorted(intervals, key=lambda tup: tup[0])
    merged = []
    for higher in sorted_by_lower_bound:
        if not merged:
            merged.append(higher)
        else:
            lower = merged[-1]
            # test for intersection between lower and higher:
            # we know via sorting that lower[0] <= higher[0]
            if higher[0] <= lower[1]:
                upper_bound = max(lower[1], higher[1])
                merged[-1] = (lower[0], upper_bound)  # replace by merged interval
            else:
                merged.append(higher)
    return merged

df['dt'] = df[['start', 'end']].apply(tuple, axis=1)
op = df.groupby(['group'])['dt'].apply(list)
f_op = op.apply(merge_intervals)
op_d = f_op.apply(lambda x: sum([(y[1] - y[0]).seconds for y in x]))
Output
group
0    300
1    120
2    420
3    240
4    480
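One caveat on the last line of the code above: Timedelta.seconds returns only the seconds component of an interval, so for intervals spanning a day or more it undercounts. A variant using total_seconds() (a sketch, not part of the original answer) would be safer:

# Sketch: count full interval lengths even when they exceed one day
op_d = f_op.apply(lambda x: sum(int((y[1] - y[0]).total_seconds()) for y in x))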
As the source data I took the following DataFrame:
  group            start              end
0    G1 2019-09-01 12:00 2019-09-01 12:02
1    G1 2019-09-01 12:01 2019-09-01 12:04
2    G1 2019-09-01 12:07 2019-09-01 12:10
3    G2 2019-09-01 12:05 2019-09-01 12:12
4    G2 2019-09-01 12:10 2019-09-01 12:15
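For completeness, this source frame can be built along these lines (a sketch; the original answer does not show this construction step):

# Sketch: building the source frame shown above
import pandas as pd

df = pd.DataFrame({
    'group': ['G1', 'G1', 'G1', 'G2', 'G2'],
    'start': pd.to_datetime(['2019-09-01 12:00', '2019-09-01 12:01',
                             '2019-09-01 12:07', '2019-09-01 12:05',
                             '2019-09-01 12:10']),
    'end':   pd.to_datetime(['2019-09-01 12:02', '2019-09-01 12:04',
                             '2019-09-01 12:10', '2019-09-01 12:12',
                             '2019-09-01 12:15']),
})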
The first step is to define a function counting seconds within a group of rows:
def getSecs(grp):
    return pd.DatetimeIndex([]).union_many([
        pd.date_range(row.start, row.end, freq='s', closed='left')
        for _, row in grp.iterrows()
    ]).size
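Note that newer pandas releases no longer provide DatetimeIndex.union_many, and date_range takes inclusive= instead of closed=. Under those assumptions, a sketch of an equivalent getSecs using functools.reduce over Index.union:

# Sketch: same idea without union_many (assumes a recent pandas where it was removed)
from functools import reduce
import pandas as pd

def getSecs(grp):
    # one per-second grid per row, left-closed like the original version
    ranges = [pd.date_range(row.start, row.end, freq='s', inclusive='left')
              for _, row in grp.iterrows()]
    # the union collapses overlapping seconds, so each second is counted once
    return reduce(lambda a, b: a.union(b), ranges, pd.DatetimeIndex([])).size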
Then apply this function to each group, grouping by group:
secs = df.groupby('group').apply(getSecs).rename('secs')
For my test data, the result is:
group
G1    420
G2    600
Name: secs, dtype: int64
And the last step is to create a new column in df by merging with secs:
df = df.merge(secs, left_on='group', right_index=True)
The result is:
  group            start              end  secs
0    G1 2019-09-01 12:00 2019-09-01 12:02   420
1    G1 2019-09-01 12:01 2019-09-01 12:04   420
2    G1 2019-09-01 12:07 2019-09-01 12:10   420
3    G2 2019-09-01 12:05 2019-09-01 12:12   600
4    G2 2019-09-01 12:10 2019-09-01 12:15   600
A quite concise solution: just 6 lines of code, substantially less than some of the other solutions.
Note also that only my solution creates a new column, with equal values for all rows within each group (one of the other solutions failed on this detail). All the other solutions stopped at computing the number of seconds for each group.