dataframe representation of a rolling window
We could use NumPy to get views into those sliding windows with its esoteric stride tricks (numpy.lib.stride_tricks.as_strided). If you are using this new dimension for some reduction like matrix multiplication, this would be ideal. If, for some reason, you want a 2D output, we need a reshape at the end, which may create a copy, depending on the memory layout of df.values.
Thus, the implementation would look something like this:
    from numpy.lib.stride_tricks import as_strided as strided

    def get_sliding_window(df, W, return2D=0):
        a = df.values
        s0,s1 = a.strides
        m,n = a.shape
        out = strided(a,shape=(m-W+1,W,n),strides=(s0,s0,s1))
        if return2D==1:
            return out.reshape(a.shape[0]-W+1,-1)
        else:
            return out
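As a possible alternative to hand-computing strides, newer NumPy versions (1.20 and later) ship sliding_window_view in the same stride_tricks module. The sketch below is an assumption about how it could stand in for the as_strided call above, not part of the original answer; the helper name get_sliding_window_safe and the sample frame are made up for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def get_sliding_window_safe(df, W):
    # Hypothetical helper name; needs NumPy >= 1.20
    a = df.to_numpy()
    # Slide along the rows; the window axis is appended last: (m-W+1, n, W)
    windows = sliding_window_view(a, W, axis=0)
    # Reorder the axes to (m-W+1, W, n), the layout used by get_sliding_window
    return windows.transpose(0, 2, 1)

# Sample frame with the same values as the sample run in this answer
df = pd.DataFrame({"A": [0.44, 0.46, 0.46, 0.85, 0.78],
                   "B": [0.41, 0.47, 0.02, 0.82, 0.76]})
out = get_sliding_window_safe(df, 3)
print(out.shape)
```

The result is a read-only view by default, which guards against accidentally writing into overlapping windows.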
Sample run, starting with the 2D output:
    In [68]: df
    Out[68]:
          A     B
    0  0.44  0.41
    1  0.46  0.47
    2  0.46  0.02
    3  0.85  0.82
    4  0.78  0.76

    In [70]: get_sliding_window(df, 3,return2D=1)
    Out[70]:
    array([[ 0.44,  0.41,  0.46,  0.47,  0.46,  0.02],
           [ 0.46,  0.47,  0.46,  0.02,  0.85,  0.82],
           [ 0.46,  0.02,  0.85,  0.82,  0.78,  0.76]])
Here's what the 3D view output looks like:
    In [69]: get_sliding_window(df, 3,return2D=0)
    Out[69]:
    array([[[ 0.44,  0.41],
            [ 0.46,  0.47],
            [ 0.46,  0.02]],

           [[ 0.46,  0.47],
            [ 0.46,  0.02],
            [ 0.85,  0.82]],

           [[ 0.46,  0.02],
            [ 0.85,  0.82],
            [ 0.78,  0.76]]])
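To illustrate the kind of reduction mentioned earlier, here is a minimal sketch that contracts the window axis of the 3D view against a weight vector with np.einsum. The weights (a plain moving average) are an assumption chosen for illustration, and the array is built directly rather than from a DataFrame to keep the example short.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.array([[0.44, 0.41],
              [0.46, 0.47],
              [0.46, 0.02],
              [0.85, 0.82],
              [0.78, 0.76]])
W = 3
m, n = a.shape
s0, s1 = a.strides
# Same stride pattern as get_sliding_window: shape (m-W+1, W, n)
windows = as_strided(a, shape=(m - W + 1, W, n), strides=(s0, s0, s1))

# Hypothetical weights: equal weighting, i.e. a moving average
weights = np.full(W, 1.0 / W)

# Contract the window axis (k) with the weights; result shape is (m-W+1, n)
rolled = np.einsum("wkn,k->wn", windows, weights)
print(rolled)
```

Because the reduction reads straight from the view, the (m-W+1, W, n) array is never materialized in memory.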
Let's time the 3D view output for various window sizes:
    In [331]: df = pd.DataFrame(np.random.rand(1000, 3).round(2))

    In [332]: %timeit get_3d_shfted_array(df,2) # @Yakym Pirozhenko's soln
    10000 loops, best of 3: 47.9 µs per loop

    In [333]: %timeit get_sliding_window(df,2)
    10000 loops, best of 3: 39.2 µs per loop

    In [334]: %timeit get_3d_shfted_array(df,5) # @Yakym Pirozhenko's soln
    10000 loops, best of 3: 89.9 µs per loop

    In [335]: %timeit get_sliding_window(df,5)
    10000 loops, best of 3: 39.4 µs per loop

    In [336]: %timeit get_3d_shfted_array(df,15) # @Yakym Pirozhenko's soln
    1000 loops, best of 3: 258 µs per loop

    In [337]: %timeit get_sliding_window(df,15)
    10000 loops, best of 3: 38.8 µs per loop
Let's verify that we are indeed getting views:
    In [338]: np.may_share_memory(get_sliding_window(df,2), df.values)
    Out[338]: True
The almost constant timings with get_sliding_window, even across various window sizes, suggest the huge benefit of getting a view instead of copying.
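For a stricter check than may_share_memory, np.shares_memory can be applied to both the 3D and the reshaped 2D output. The sketch below is an illustration on plain arrays (the helper windows_3d is a made-up name that mirrors the stride pattern of get_sliding_window); it suggests that whether the final reshape copies depends on whether the input is row-major or column-major.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def windows_3d(a, W):
    # Hypothetical helper: same stride pattern as get_sliding_window,
    # applied to a plain 2D array instead of a DataFrame
    m, n = a.shape
    s0, s1 = a.strides
    return as_strided(a, shape=(m - W + 1, W, n), strides=(s0, s0, s1))

a_c = np.arange(10, dtype=float).reshape(5, 2)   # row-major (C order)
a_f = np.asfortranarray(a_c)                     # column-major (F order)

for name, a in (("C order", a_c), ("F order", a_f)):
    out3d = windows_3d(a, 3)
    # Flatten each window into one row, as the return2D=1 branch does
    out2d = out3d.reshape(out3d.shape[0], -1)
    print(name,
          "| 3D shares memory:", np.shares_memory(out3d, a),
          "| 2D shares memory:", np.shares_memory(out2d, a))
```

Since the layout of df.values depends on how the DataFrame was built, it seems safest to treat the 2D output as a possible copy and the 3D output as a view that should not be written to.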
Disclaimers:
First, I would not call the method you provide clunky. It is readable and you can easily generalize it with a list comprehension to any window size. At the same time, this is somewhat of an open-ended question that may have many solutions, including your own.
/Disclaimers
Here is one other method that I think qualifies under your description:
Use np.dstack on df.values. One benefit over the existing approach is construction speed.
    import pandas as pd
    import numpy as np
    from io import StringIO

    # sep=r'\s+' splits on any run of whitespace and tolerates the padded header
    df = pd.read_csv(StringIO('''
       A     B     C
    a  0.44  0.41  0.46
    b  0.47  0.46  0.02
    c  0.85  0.82  0.78
    d  0.76  0.93  0.83
    e  0.88  0.93  0.72
    f  0.12  0.15  0.20
    g  0.44  0.10  0.28
    h  0.61  0.09  0.84
    i  0.74  0.87  0.69
    j  0.38  0.23  0.44'''), sep=r'\s+')

    window = 2

    def get_3d_shfted_array(df, window=window):
        rows = df.values
        n_windows = len(rows) - window + 1
        # np.dstack needs a sequence (recent NumPy rejects a generator);
        # slice i is offset by i rows, and every slice has n_windows rows
        res = np.dstack([rows[i:i + n_windows] for i in range(window)])
        return res

    # 100000 loops, best of 3: 15.5 µs per loop

    res = get_3d_shfted_array(df)
    zero = res[...,0]   # unshifted rows
    one = res[...,1]    # rows shifted up by one

    # current method
    def get_multiindexed_array(df, window=window):
        return pd.concat([df, df.shift(-1)], axis=1, keys=[0, 1]).dropna()

    # 1000 loops, best of 3: 928 µs per loop
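The "current method" above is hard-coded to a window of 2. The list-comprehension generalization mentioned in the disclaimers could look roughly like the sketch below; the function name get_multiindexed_windows and the sample frame are assumptions for illustration, not code from the question.

```python
import numpy as np
import pandas as pd

def get_multiindexed_windows(df, window=2):
    # Hypothetical generalization of the shift-and-concat approach:
    # one shifted copy of the frame per position inside the window
    shifted = [df.shift(-i) for i in range(window)]
    # keys label each shifted copy in the outer level of the column MultiIndex;
    # dropna removes the trailing rows that have no complete window
    return pd.concat(shifted, axis=1, keys=list(range(window))).dropna()

df = pd.DataFrame({"A": [0.44, 0.46, 0.46, 0.85, 0.78],
                   "B": [0.41, 0.47, 0.02, 0.82, 0.76]})
res = get_multiindexed_windows(df, window=3)
print(res)
```

Note that dropna would also discard windows containing NaNs that were already present in the data, so this sketch assumes a frame without missing values.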