Python Pandas: calculate rolling mean (moving average) over variable number of rows

python pandas dataframe time-series moving-average

Not a particularly pandasy solution, but it sounds like you want to do something like

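import numpy as np

df['rv'] = np.nan
for i in range(len(df)):
    # Walk backwards from row i until the accumulated distance reaches 5.
    j = i
    s = 0
    while j >= 0 and s < 5:
        s += df['distance'].iloc[j]
        j -= 1
    if s >= 5:
        # Rows j+1..i form the window; a single .loc call avoids chained assignment.
        df.loc[df.index[i], 'rv'] = df['velocity'].iloc[j+1:i+1].mean()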

Update: Since this answer, the OP stated that they want a "valid Pandas solution (e.g. without loops)". If we take this to mean that they want something more performant than the above, then, perhaps ironically given the comment, the first optimization that comes to mind is to avoid the data frame unless needed:

l = len(df)
a = np.full(l, np.nan)  # NaN where the accumulated distance never reaches 5
d = df['distance'].values
v = df['velocity'].values
for i in range(l):
    j = i
    s = 0
    while j >= 0 and s < 5:
        s += d[j]
        j -= 1
    if s >= 5:
        a[i] = v[j+1:i+1].mean()
df['rv'] = a

Moreover, as suggested by @JohnE, numba quickly comes in handy for further optimization. While it won't do much for the first solution above, the second can be wrapped in a function and decorated with @numba.jit out of the box, with immediate benefits. Benchmarking all three solutions on

pd.DataFrame({'velocity': 50*np.random.random(10000), 'distance': 5*np.random.rand(10000)})

I get the following results:

          Method                 Benchmark
-----------------------------------------------
Original data frame based     4.65 s ± 325 ms
Pure numpy array based       80.8 ms ± 9.95 ms
Jitted numpy array based      766 µs ± 52 µs

Even the innocent-looking mean is enough to throw off numba; if we get rid of that and go instead with

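import numba
import numpy as np

@numba.jit
def numba_example(d, v):
    # d, v: 1-D float arrays of distances and velocities (same length).
    l = len(d)
    a = np.full(l, np.nan)  # NaN where the accumulated distance never reaches 5
    for i in range(l):
        j = i
        s = 0.0
        while j >= 0 and s < 5:
            s += d[j]
            j -= 1
        if s >= 5:
            # Explicit loop instead of .mean() over a slice.
            total = 0.0
            for k in range(j+1, i+1):
                total += v[k]
            a[i] = total / (i-j)
    return a

# Pass plain arrays in; the data frame itself stays outside the jitted function.
df['rv'] = numba_example(df['distance'].values, df['velocity'].values)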

then the benchmark reduces to 158 µs ± 8.41 µs.
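These timings depend on the machine and on library versions, so it may be worth re-measuring locally. Below is a minimal sketch using the standard `timeit` module on the pure-NumPy variant only; the function name `rolling_velocity`, its `threshold` parameter, and the repeat counts are illustrative choices rather than anything from the original benchmark.

```python
import timeit

import numpy as np


def rolling_velocity(d, v, threshold=5.0):
    """Pure-NumPy loop from above, wrapped in a function (name is illustrative)."""
    n = len(d)
    out = np.full(n, np.nan)
    for i in range(n):
        j = i
        s = 0.0
        while j >= 0 and s < threshold:
            s += d[j]
            j -= 1
        if s >= threshold:
            out[i] = v[j + 1:i + 1].mean()
    return out


# Same shape of test data as in the benchmark above, with a fixed seed.
rng = np.random.default_rng(0)
velocity = 50 * rng.random(10_000)
distance = 5 * rng.random(10_000)

# Best of 5 repeats, 3 calls each; report the time per call.
calls = 3
timings = timeit.repeat(lambda: rolling_velocity(distance, velocity),
                        number=calls, repeat=5)
best = min(timings) / calls
print(f"pure NumPy loop: {best * 1e3:.1f} ms per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; the absolute numbers will not match the table above.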

Now, if you happen to know more about the structure of df['distance'], the while loop can probably be optimized further. (For example, if the values happen to always be much lower than 5, it will be faster to cut the cumulative sum from its tail, rather than recalculating everything.)
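One way to read "cutting the cumulative sum from its tail" is a two-pointer sweep: keep a running sum for the current window and drop rows from its far end as the near end advances. The sketch below assumes every distance is non-negative, so the start of the window never has to move backwards; the function name and the `threshold` parameter are made up for illustration.

```python
import numpy as np


def rolling_velocity_two_pointer(d, v, threshold=5.0):
    """Mean of v over the shortest trailing window whose summed d reaches threshold.

    Assumes every entry of d is >= 0, so the window start only moves forward.
    """
    d = np.asarray(d, dtype=float)
    v = np.asarray(v, dtype=float)
    n = len(d)
    out = np.full(n, np.nan)
    start = 0   # first row of the current window
    s = 0.0     # running sum of d[start..i]
    vs = 0.0    # running sum of v[start..i]
    for i in range(n):
        s += d[i]
        vs += v[i]
        # Drop rows from the tail while the remaining window still reaches the threshold.
        while start < i and s - d[start] >= threshold:
            s -= d[start]
            vs -= v[start]
            start += 1
        if s >= threshold:
            out[i] = vs / (i - start + 1)
    return out
```

Each row enters and leaves the window at most once, so the work is linear in the number of rows regardless of how small the distances are. Because the sums are updated incrementally, results can differ from the recomputing versions by floating-point rounding.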


How about

df.rolling(window=3, min_periods=2).mean()

   distance   velocity
0       NaN        NaN
1  2.500000  15.000000
2  2.000000  11.666667
3  2.666667  21.666667

To combine them

df['rv'] = df.velocity.rolling(window=3, min_periods=2).mean()

It looks like something's a little off with the window shape.
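The mismatch arises because a fixed `window=3` always covers the same number of rows, whereas the question asks for a window whose length depends on the accumulated distance. As a rough, loop-free sketch of the latter, a cumulative sum can be combined with `np.searchsorted`. This assumes non-negative distances (so the cumulative sum is sorted), and the sample values below are hypothetical, chosen only because they reproduce the fixed-window output shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, consistent with the fixed-window output above.
df = pd.DataFrame({"distance": [1, 4, 1, 3], "velocity": [10, 20, 5, 40]})

# Fixed window: always the last 3 rows (at least 2), regardless of distance.
fixed = df["velocity"].rolling(window=3, min_periods=2).mean()

# Variable window: shortest run of trailing rows whose distance sums to >= threshold.
threshold = 5
cd = np.concatenate(([0.0], df["distance"].to_numpy(dtype=float).cumsum()))
cv = np.concatenate(([0.0], df["velocity"].to_numpy(dtype=float).cumsum()))
end = np.arange(1, len(df) + 1)  # exclusive end position of each window
# Largest start with cd[end] - cd[start] >= threshold, or -1 if there is none.
start = np.searchsorted(cd, cd[end] - threshold, side="right") - 1
valid = start >= 0
safe_start = np.where(valid, start, 0)  # placeholder start for invalid rows
variable = np.where(valid, (cv[end] - cv[safe_start]) / (end - safe_start), np.nan)

print(pd.DataFrame({"fixed": fixed, "variable": variable}))
```

On this sample the two columns agree in rows 0, 1 and 3 but differ in row 2, where the distance-based window covers only rows 1 and 2 (mean 12.5) while the fixed window covers rows 0 to 2.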
