
Pandas sum time interval in a group excluding overlaps


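For reference, the sample data used by the first two answers can be reconstructed from the outputs they show; a minimal setup under that assumption:

import pandas as pd

df = pd.DataFrame({
    'id': range(10),
    'group': [0, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    'start': pd.to_datetime([
        '2019-10-21 16:20', '2019-10-21 16:22', '2019-10-21 16:22',
        '2019-10-21 16:15', '2019-10-21 16:22', '2019-10-21 16:58',
        '2019-10-21 17:02', '2019-10-21 17:03', '2019-10-21 17:04',
        '2019-10-21 17:20']),
    'end': pd.to_datetime([
        '2019-10-21 16:25', '2019-10-21 16:24', '2019-10-21 16:24',
        '2019-10-21 16:18', '2019-10-21 16:26', '2019-10-21 17:02',
        '2019-10-21 17:06', '2019-10-21 17:07', '2019-10-21 17:08',
        '2019-10-21 17:22'])
})
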
First, add a column which tracks the latest end time seen so far (but only considering the same group):

df['notbefore'] = df.groupby('group').end.shift().cummax()

It's shifted by 1 so that it reflects the latest end time seen on previous rows, excluding the current row. It's important to have shift() before cummax(); otherwise the shift would "leak" values between groups.

Then add a column containing the "effective" start time:

df['effstart'] = df[['start', 'notbefore']].max(axis=1)

This is the start time modified so that it is not before any previous end time (to avoid overlap).

Then compute the total seconds covered:

import numpy as np

df['effsec'] = (df.end - df.effstart).clip(np.timedelta64(0))
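
An equivalent pandas-only expression (an assumption on my part, not part of the original answer) avoids the NumPy dependency:

df['effsec'] = (df.end - df.effstart).clip(lower=pd.Timedelta(0))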

df is now:

   id  group               start                 end           notbefore            effstart   effsec
0   0      0 2019-10-21 16:20:00 2019-10-21 16:25:00                 NaT 2019-10-21 16:20:00 00:05:00
1   1      1 2019-10-21 16:22:00 2019-10-21 16:24:00                 NaT 2019-10-21 16:22:00 00:02:00
2   2      1 2019-10-21 16:22:00 2019-10-21 16:24:00 2019-10-21 16:24:00 2019-10-21 16:24:00 00:00:00
3   3      2 2019-10-21 16:15:00 2019-10-21 16:18:00                 NaT 2019-10-21 16:15:00 00:03:00
4   4      2 2019-10-21 16:22:00 2019-10-21 16:26:00 2019-10-21 16:24:00 2019-10-21 16:24:00 00:02:00
5   5      3 2019-10-21 16:58:00 2019-10-21 17:02:00                 NaT 2019-10-21 16:58:00 00:04:00
6   6      4 2019-10-21 17:02:00 2019-10-21 17:06:00                 NaT 2019-10-21 17:02:00 00:04:00
7   7      4 2019-10-21 17:03:00 2019-10-21 17:07:00 2019-10-21 17:06:00 2019-10-21 17:06:00 00:01:00
8   8      4 2019-10-21 17:04:00 2019-10-21 17:08:00 2019-10-21 17:07:00 2019-10-21 17:07:00 00:01:00
9   9      4 2019-10-21 17:20:00 2019-10-21 17:22:00 2019-10-21 17:08:00 2019-10-21 17:20:00 00:02:00

To get the final results:

df.groupby('group').effsec.sum()

Which gives you:

group
0   00:05:00
1   00:02:00
2   00:05:00
3   00:04:00
4   00:08:00
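
If plain second counts are preferred (matching the format of the answers below), the Timedelta totals can be converted; a small sketch assuming the columns computed above:

df.groupby('group').effsec.sum().dt.total_seconds()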


Use:

def merge_intervals(intervals):
    sorted_by_lower_bound = sorted(intervals, key=lambda tup: tup[0])
    merged = []
    for higher in sorted_by_lower_bound:
        if not merged:
            merged.append(higher)
        else:
            lower = merged[-1]
            # test for intersection between lower and higher:
            # we know via sorting that lower[0] <= higher[0]
            if higher[0] <= lower[1]:
                upper_bound = max(lower[1], higher[1])
                merged[-1] = (lower[0], upper_bound)  # replace by merged interval
            else:
                merged.append(higher)
    return merged

df['dt'] = df[['start', 'end']].apply(tuple, axis=1)
op = df.groupby(['group'])['dt'].apply(list)
f_op = op.apply(merge_intervals)
op_d = f_op.apply(lambda x: sum([(y[1] - y[0]).seconds for y in x]))

Output

group
0    300
1    120
2    420
3    240
4    480
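
For illustration, merge_intervals also works on plain tuples, independently of the DataFrame; a quick example with made-up values:

# Overlapping (1, 4) and (2, 6) collapse into (1, 6); (8, 10) stays separate.
print(merge_intervals([(2, 6), (1, 4), (8, 10)]))  # [(1, 6), (8, 10)]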


As the source data I took the following DataFrame:

  group             start               end
0    G1  2019-09-01 12:00  2019-09-01 12:02
1    G1  2019-09-01 12:01  2019-09-01 12:04
2    G1  2019-09-01 12:07  2019-09-01 12:10
3    G2  2019-09-01 12:05  2019-09-01 12:12
4    G2  2019-09-01 12:10  2019-09-01 12:15
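
This frame can be built as follows (a sketch; the answer assumes start and end are already datetime columns):

import pandas as pd

df = pd.DataFrame({
    'group': ['G1', 'G1', 'G1', 'G2', 'G2'],
    'start': pd.to_datetime(['2019-09-01 12:00', '2019-09-01 12:01',
                             '2019-09-01 12:07', '2019-09-01 12:05',
                             '2019-09-01 12:10']),
    'end':   pd.to_datetime(['2019-09-01 12:02', '2019-09-01 12:04',
                             '2019-09-01 12:10', '2019-09-01 12:12',
                             '2019-09-01 12:15']),
})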

The first step is to define a function counting seconds within a group of rows:

def getSecs(grp):
    return pd.DatetimeIndex([]).union_many(
        [pd.date_range(row.start, row.end, freq='s', closed='left')
         for _, row in grp.iterrows()]).size
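
Note that DatetimeIndex.union_many has been deprecated in recent pandas versions, and date_range's closed= argument was replaced by inclusive=. On newer releases, a functools.reduce over Index.union is a possible drop-in replacement, sketched below under that assumption:

from functools import reduce
import pandas as pd

def getSecs(grp):
    # One second-resolution range per row (end excluded), then union
    # them all and count the distinct seconds covered by the group.
    ranges = [pd.date_range(row.start, row.end, freq='s', inclusive='left')
              for _, row in grp.iterrows()]
    return reduce(lambda acc, rng: acc.union(rng), ranges, pd.DatetimeIndex([])).size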

Then apply this function to each group, grouping by group:

secs = df.groupby('group').apply(getSecs).rename('secs')

For my test data, the result is:

group
G1    420
G2    600
Name: secs, dtype: int64

And the last step is to create a new column in df by merging with secs:

df = df.merge(secs, left_on='group', right_index=True)

The result is:

  group             start               end  secs
0    G1  2019-09-01 12:00  2019-09-01 12:02   420
1    G1  2019-09-01 12:01  2019-09-01 12:04   420
2    G1  2019-09-01 12:07  2019-09-01 12:10   420
3    G2  2019-09-01 12:05  2019-09-01 12:12   600
4    G2  2019-09-01 12:10  2019-09-01 12:15   600
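
An equivalent, slightly shorter way to attach the per-group totals (a hedged alternative to the merge, not from the original answer) is to map the group labels through secs:

df['secs'] = df['group'].map(secs)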

A quite concise solution, just 6 lines of code, substantially less than some other solutions.

Note also that only my solution creates a new column, with equal values for all rows within each group (one of the other solutions failed on this detail). All other solutions stopped at computing the number of seconds for each group.