Python Pandas: calculate rolling mean (moving average) over variable number of rows
Not a particularly pandasy solution, but it sounds like you want to do something like
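    import numpy as np

    df['rv'] = np.nan
    for i in range(len(df)):
        # Walk backwards from row i until the accumulated distance reaches 5
        j = i
        s = 0
        while j >= 0 and s < 5:
            s += df['distance'].loc[j]
            j -= 1
        if s >= 5:
            # Rows j+1..i form the window; average their velocities
            df.loc[i, 'rv'] = df['velocity'].iloc[j+1:i+1].mean()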
Update: Since this answer, the OP stated that they want a "valid Pandas solution (e.g. without loops)". If we take this to mean that they want something more performant than the above, then, perhaps ironically given the comment, the first optimization that comes to mind is to avoid the data frame unless needed:
    l = len(df)
    a = np.full(l, np.nan)  # rows without enough distance stay NaN
    d = df['distance'].values
    v = df['velocity'].values
    for i in range(l):
        j = i
        s = 0
        while j >= 0 and s < 5:
            s += d[j]
            j -= 1
        if s >= 5:
            a[i] = v[j+1:i+1].mean()
    df['rv'] = a
Moreover, as suggested by @JohnE, numba quickly comes in handy for further optimization. While it won't do much for the first solution above, the second solution can be decorated with @numba.jit out of the box, with immediate benefits. Benchmarking all three solutions on
pd.DataFrame({'velocity': 50*np.random.random(10000), 'distance': 5*np.random.rand(10000)})
I get the following results:
    Method                        Benchmark
    -----------------------------------------------
    Original data frame based     4.65 s ± 325 ms
    Pure numpy array based        80.8 ms ± 9.95 ms
    Jitted numpy array based      766 µs ± 52 µs
Even the innocent-looking mean is enough to throw off numba; if we get rid of that and go instead with
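    @numba.jit
    def numba_example(d, v):
        l = len(d)
        a = np.full(l, np.nan)  # rows without enough distance stay NaN
        for i in range(l):
            j = i
            s = 0.0
            while j >= 0 and s < 5:
                s += d[j]
                j -= 1
            if s >= 5:
                # Explicit summation loop instead of .mean()
                total = 0.0
                for k in range(j+1, i+1):
                    total += v[k]
                a[i] = total / (i - j)
        return a

    df['rv'] = numba_example(df['distance'].values, df['velocity'].values)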
then the benchmark reduces to 158 µs ± 8.41 µs.
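For anyone who wants to reproduce timings of this kind, here is a minimal sketch of a harness using only NumPy and the standard-library timeit module. The function name rolling_velocity, the threshold parameter, and the fixed seed are my own choices for illustration; the data mimics the benchmark frame above but is built as plain arrays. Absolute numbers will depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np


def rolling_velocity(d, v, threshold=5.0):
    """Mean of v over the shortest trailing window whose d values sum to >= threshold."""
    n = len(d)
    out = np.full(n, np.nan)  # rows without enough distance stay NaN
    for i in range(n):
        j = i
        s = 0.0
        while j >= 0 and s < threshold:
            s += d[j]
            j -= 1
        if s >= threshold:
            out[i] = v[j + 1:i + 1].mean()
    return out


rng = np.random.default_rng(0)  # fixed seed (an assumption) so runs are comparable
n = 10_000
velocity = 50 * rng.random(n)
distance = 5 * rng.random(n)

# Best of 5 repeats, one call per repeat
best = min(timeit.repeat(lambda: rolling_velocity(distance, velocity),
                         number=1, repeat=5))
print(f"pure NumPy loop: {best * 1e3:.1f} ms per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; a jitted variant should be called once before timing so that compilation is not included.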
Now, if you happen to know more about the structure of df['distance'], the while loop can probably be optimized further. (For example, if the values happen to always be much lower than 5, it will be faster to cut the cumulative sum from its tail, rather than recalculating everything.)
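One possible reading of the cumulative-sum idea, as a sketch rather than a drop-in replacement: if all distances are non-negative (an assumption), the prefix sums are non-decreasing, so the start of each window can be located with np.searchsorted instead of an inner loop. The function name rolling_mean_by_distance and the threshold parameter are hypothetical. Because window sums are computed as differences of prefix sums, rounding can differ slightly from the loop versions for rows that sit exactly on the threshold.

```python
import numpy as np


def rolling_mean_by_distance(d, v, threshold=5.0):
    """Vectorized sketch; assumes every value in d is >= 0 and threshold > 0."""
    d = np.asarray(d, dtype=float)
    v = np.asarray(v, dtype=float)
    n = len(d)

    # Prefix sums with a leading zero: P[k] = d[0] + ... + d[k-1]
    P = np.concatenate(([0.0], np.cumsum(d)))
    V = np.concatenate(([0.0], np.cumsum(v)))

    ends = np.arange(1, n + 1)  # exclusive end of the window for row i = ends - 1
    # Largest start with P[start] <= P[end] - threshold,
    # i.e. the shortest trailing window whose distance sum reaches the threshold
    starts = np.searchsorted(P, P[ends] - threshold, side='right') - 1

    valid = starts >= 0  # -1 means even the full history is below the threshold
    out = np.full(n, np.nan)
    out[valid] = (V[ends[valid]] - V[starts[valid]]) / (ends[valid] - starts[valid])
    return out


out = rolling_mean_by_distance([1, 2, 3, 4], [10, 20, 30, 40])
print(out)
```

With the data frame from the answer, this would be called as df['rv'] = rolling_mean_by_distance(df['distance'].values, df['velocity'].values). In the small example the first two rows stay NaN because the accumulated distance has not yet reached 5.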
How about
    df.rolling(window=3, min_periods=2).mean()

       distance   velocity
    0       NaN        NaN
    1  2.500000  15.000000
    2  2.000000  11.666667
    3  2.666667  21.666667
To combine them
df['rv'] = df.velocity.rolling(window=3, min_periods=2).mean()
It looks like something's a little off with the window shape.