
How to efficiently compute a rolling unique count in a pandas time series?


I had 2 errors in the fast method windowed_nunique, now corrected in windowed_nunique_corrected below:

  1. The array for memoizing the count of each person ID within the window, pid_cts, was too small: it must be indexed by person ID, so its size is max(pids) + 1, not the number of distinct IDs.
  2. Because both the leading and trailing edges of the window include whole integer days, date_min should be updated when (date - date_min + 1) > window.
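A minimal sketch of the two fixes (illustrative values only, not taken from the data above):

```python
import numpy as np

# Fix 1: pid_cts is indexed directly by person ID, so its size must be
# max(pid) + 1, not the number of distinct IDs.
pids = np.array([0, 5, 5, 0])
n_unique = len(np.unique(pids))   # 2 -- too small; pid 5 would be out of bounds
pid_max = np.max(pids) + 1        # 6 -- correct size for direct indexing
pid_cts = np.zeros(pid_max, dtype=np.int64)

# Fix 2: both window edges are inclusive integer days, so day `date_min`
# falls out of the window once (date - date_min + 1) > window.
def in_window(date, date_min, window):
    return (date - date_min + 1) <= window

print(in_window(2, 0, 3))  # True: window=3 covers days 0, 1, 2
print(in_window(3, 0, 3))  # False: day 0 has left the window
```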


In [14]:

import numba
import numpy as np

# Define a custom function and compile it just-in-time.
@numba.jit(nopython=True)
def windowed_nunique_corrected(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
            Required: min(pids) >= 0
        window (int): Width of window in units of difference of `dates`.
            Required: window >= 1

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if not...
            * len(dates) == len(pids)
            * min(pids) >= 0
            * window >= 1

    Notes:
        * Matches `pandas.core.window.Rolling`
            with a time series alias offset.

    """
    # Check arguments.
    assert len(dates) == len(pids)
    assert np.min(pids) >= 0
    assert window >= 1
    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids) + 1
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1
    # For each (date, person)...
    while idx < idx_max:
        # Lookup date, person.
        date = dates[idx]
        pid = pids[idx]
        # If person count went from 0 to 1, increment unique person count.
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1
        # For past dates outside of window...
        # Note: If window=3, it includes day0, day1, day2.
        while (date - date_min + 1) > window:
            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]
        # Record unique person count.
        ucts[idx] = uct
        idx += 1
    return ucts
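For reference, the same logic can be exercised without numba installed; a minimal pure-Python sketch (hypothetical helper name windowed_nunique_py, same algorithm with the decorator dropped) plus a tiny worked example:

```python
import numpy as np

def windowed_nunique_py(dates, pids, window):
    """Same algorithm as windowed_nunique_corrected, without @numba.jit."""
    idx_min = 0
    date_min = dates[idx_min]
    # Sized by the maximum person ID, not the number of distinct IDs (fix 1).
    pid_cts = np.zeros(np.max(pids) + 1, dtype=np.int64)
    pid_cts[pids[idx_min]] = 1
    uct = 1
    ucts = np.zeros(len(dates), dtype=np.int64)
    ucts[0] = uct
    for idx in range(1, len(dates)):
        date, pid = dates[idx], pids[idx]
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1
        # Evict trailing days once the inclusive window is exceeded (fix 2).
        while (date - date_min + 1) > window:
            pid_min = pids[idx_min]
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
        ucts[idx] = uct
    return ucts

# Days 0-3; person 2 first appears on day 2; window of 2 inclusive days.
out = windowed_nunique_py(np.array([0, 1, 2, 3]), np.array([1, 1, 2, 1]), window=2)
print(out.tolist())  # [1, 1, 2, 2]
```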

In [15]:

# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)
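A quick sanity check of that cast on a hypothetical two-row frame (column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2017-01-01', '2017-01-03'])})
# Subtracting the epoch gives timedeltas; dividing by one day gives day counts.
df['DateEpoch'] = ((df['Date'] - pd.to_datetime('1970-01-01'))
                   / pd.to_timedelta(1, unit='D')).astype(int)
print(df['DateEpoch'].tolist())  # [17167, 17169]
```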

In [16]:

%%timeit
windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)

98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]:

# Check accuracy of results.
test = windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
assert all(ref == test)


Very close to your time in the second seed test, but as a one-liner, resampled over each year.

df.resample('AS', on='Date')['PersonId'].expanding(0).apply(lambda x: np.unique(x).shape[0])

Time results

1 loop, best of 3: 483 ms per loop
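The expanding-nunique piece of that one-liner can be illustrated on a plain series (toy data; the resample('AS') step then just applies the same thing per calendar year):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 2, 3])
# Running count of distinct values seen so far.
out = s.expanding(1).apply(lambda x: np.unique(x).shape[0], raw=True)
print(out.tolist())  # [1.0, 1.0, 2.0, 3.0]
```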


If you only want the number of unique persons who entered the building in the last 365 days, you could first limit your dataset to the last 365 days with .loc:

df = df.loc[df['Date'] > '2016-09-28',:]

and then, with a groupby, you get as many rows as there are unique people who came in; if you aggregate with count, you also get the number of times each of them came in:

df = df.groupby('PersonId').count()
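A toy example of that groupby (hypothetical four-row frame using the question's column names):

```python
import pandas as pd

df = pd.DataFrame({'PersonId': ['a', 'b', 'a', 'a'],
                   'Date': pd.to_datetime(['2017-01-01', '2017-01-02',
                                           '2017-01-02', '2017-01-03'])})
visits = df.groupby('PersonId').count()
print(len(visits))              # 2 -- one row per unique person
print(visits.loc['a', 'Date'])  # 3 -- 'a' came in three times
```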

That seems to answer your question, but maybe I got it wrong. Have a good day.