
How to efficiently compute a rolling unique count in a pandas time series?


I had 2 errors in the fast method windowed_nunique, now corrected in windowed_nunique_corrected below:

  1. The array for memoizing the count of each person ID within the window, pid_cts, was too small: it must be indexed by person ID, so its size is max(pids) + 1, not the number of distinct IDs.
  2. Because both the leading and trailing edges of the window include whole integer days, date_min should be updated when (date - date_min + 1) > window.
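A minimal sketch of the two fixes (illustrative values only, not taken from the data above):

```python
import numpy as np

# Fix 1: pid_cts is indexed directly by person ID, so its size must be
# max(pid) + 1, not the number of distinct IDs.
pids = np.array([0, 5, 5, 0])
n_unique = len(np.unique(pids))   # 2 -- too small; pid 5 would be out of bounds
pid_max = np.max(pids) + 1        # 6 -- correct size for direct indexing
pid_cts = np.zeros(pid_max, dtype=np.int64)

# Fix 2: both window edges are inclusive integer days, so day `date_min`
# falls out of the window once (date - date_min + 1) > window.
def in_window(date, date_min, window):
    return (date - date_min + 1) <= window

print(in_window(2, 0, 3))  # True: window=3 covers days 0, 1, 2
print(in_window(3, 0, 3))  # False: day 0 has left the window
```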


In [14]:

import numba
import numpy as np

# Define a custom function and compile it just-in-time.
@numba.jit(nopython=True)
def windowed_nunique_corrected(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
            Required: min(pids) >= 0
        window (int): Width of window in units of difference of `dates`.
            Required: window >= 1

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if not...
            * len(dates) == len(pids)
            * min(pids) >= 0
            * window >= 1

    Notes:
        * Matches `pandas.core.window.Rolling`
            with a time series alias offset.

    """
    # Check arguments.
    assert len(dates) == len(pids)
    assert np.min(pids) >= 0
    assert window >= 1
    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids) + 1
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1
    # For each (date, person)...
    while idx < idx_max:
        # Lookup date, person.
        date = dates[idx]
        pid = pids[idx]
        # If person count went from 0 to 1, increment unique person count.
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1
        # For past dates outside of window...
        # Note: If window=3, it includes day0, day1, day2.
        while (date - date_min + 1) > window:
            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]
        # Record unique person count.
        ucts[idx] = uct
        idx += 1
    return ucts
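For reference, the same logic can be exercised without numba installed; a minimal pure-Python sketch (hypothetical helper name windowed_nunique_py, same algorithm with the decorator dropped) plus a tiny worked example:

```python
import numpy as np

def windowed_nunique_py(dates, pids, window):
    """Same algorithm as windowed_nunique_corrected, without @numba.jit."""
    idx_min = 0
    date_min = dates[idx_min]
    # Sized by the maximum person ID, not the number of distinct IDs (fix 1).
    pid_cts = np.zeros(np.max(pids) + 1, dtype=np.int64)
    pid_cts[pids[idx_min]] = 1
    uct = 1
    ucts = np.zeros(len(dates), dtype=np.int64)
    ucts[0] = uct
    for idx in range(1, len(dates)):
        date, pid = dates[idx], pids[idx]
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1
        # Evict trailing days once the inclusive window is exceeded (fix 2).
        while (date - date_min + 1) > window:
            pid_min = pids[idx_min]
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
        ucts[idx] = uct
    return ucts

# Days 0-3; person 2 first appears on day 2; window of 2 inclusive days.
out = windowed_nunique_py(np.array([0, 1, 2, 3]), np.array([1, 1, 2, 1]), window=2)
print(out.tolist())  # [1, 1, 2, 2]
```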

In [15]:

# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)
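A quick sanity check of that cast on a hypothetical two-row frame (column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2017-01-01', '2017-01-03'])})
# Subtracting the epoch gives timedeltas; dividing by one day gives day counts.
df['DateEpoch'] = ((df['Date'] - pd.to_datetime('1970-01-01'))
                   / pd.to_timedelta(1, unit='D')).astype(int)
print(df['DateEpoch'].tolist())  # [17167, 17169]
```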

In [16]:

%%timeit
windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)

98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]:

# Check accuracy of results.
test = windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
assert all(ref == test)


Very close to your time in the second seed test, but as a one-liner, resampled over each year.

df.resample('AS', on='Date')['PersonId'].expanding(0).apply(lambda x: np.unique(x).shape[0])

Time results

1 loop, best of 3: 483 ms per loop
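The expanding-nunique piece of that one-liner can be illustrated on a plain series (toy data; the resample('AS') step then just applies the same thing per calendar year):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 2, 3])
# Running count of distinct values seen so far.
out = s.expanding(1).apply(lambda x: np.unique(x).shape[0], raw=True)
print(out.tolist())  # [1.0, 1.0, 2.0, 3.0]
```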


If you only want the number of unique persons who entered the building in the last 365 days, you could first limit your dataset to the last 365 days with .loc:

df = df.loc[df['Date'] > '2016-09-28',:]

and then, with a groupby, you get as many rows as there are unique people who came in; if you aggregate with count, you also get the number of times each of them came in:

df = df.groupby('PersonId').count()
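A toy example of that groupby (hypothetical four-row frame using the question's column names):

```python
import pandas as pd

df = pd.DataFrame({'PersonId': ['a', 'b', 'a', 'a'],
                   'Date': pd.to_datetime(['2017-01-01', '2017-01-02',
                                           '2017-01-02', '2017-01-03'])})
visits = df.groupby('PersonId').count()
print(len(visits))              # 2 -- one row per unique person
print(visits.loc['a', 'Date'])  # 3 -- 'a' came in three times
```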

That seems to answer your question, but maybe I got it wrong. Have a good day.