Pandas Rolling window Spearman correlation Pandas Rolling window Spearman correlation pandas pandas

Pandas Rolling window Spearman correlation


rolling.corr does Pearson, so you can use it for that. For Spearman, use something like this:

import pandas as pdfrom numpy.lib.stride_tricks import as_stridedfrom numpy.lib import padimport numpy as npdef rolling_spearman(seqa, seqb, window):    stridea = seqa.strides[0]    ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])    strideb = seqa.strides[0]    ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides =[strideb, strideb])    ar = pd.DataFrame(ssa)    br = pd.DataFrame(ssb)    ar = ar.rank(1)    br = br.rank(1)    corrs = ar.corrwith(br, 1)    return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)

E.g.:

In [144]: df = pd.DataFrame(np.random.randint(0,1000,size=(10,2)), columns = list('ab'))In [145]: df['corr'] = rolling_spearman(df.a, df.b, 4)In [146]: dfOut[146]:      a    b  corr0  429  922   NaN1  618  382   NaN2  476  517   NaN3  267  536  -0.84  582  844  -0.45  254  895  -0.46  583  974   0.47  687  298  -0.48  697  447  -0.69  383   35   0.4

Explanation: numpy.lib.stride_tricks.as_strided is a hacky method that in this case gives us a view of the sequences that looks like a 2d array with the rolling window sections of the sequence we're looking at.

From then on, it's simple. Spearman correlation is equivalent to transforming the sequences to ranks, and taking the Pearson correlation coefficient. Pandas has, helpfully, got fast implementations to do this row-wise on DataFrames. Then at the end we pad the start of the resulting Series with NaN values (so you can add it as a column to your dataframe or whatever).

(Personal note: I spent so long trying to figure out how to do this efficiently with numpy and scipy before I realised everything you need is in pandas already...!).

To show the speed advantage of this method over just looping over the sliding windows, I made a little file called srsmall.py containing:

import pandas as pdfrom numpy.lib.stride_tricks import as_stridedimport scipy.statsfrom numpy.lib import padimport numpy as npdef rolling_spearman_slow(seqa, seqb, window):    stridea = seqa.strides[0]    ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])    strideb = seqa.strides[0]    ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides =[strideb, strideb])    corrs = [scipy.stats.spearmanr(a, b)[0] for (a,b) in zip(ssa, ssb)]    return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)def rolling_spearman_quick(seqa, seqb, window):    stridea = seqa.strides[0]    ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window], strides=[stridea, stridea])    strideb = seqa.strides[0]    ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window], strides =[strideb, strideb])    ar = pd.DataFrame(ssa)    br = pd.DataFrame(ssb)    ar = ar.rank(1)    br = br.rank(1)    corrs = ar.corrwith(br, 1)    return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)

And compare the performance:

In [1]: import pandas as pdIn [2]: import numpy as npIn [3]: from srsmall import rolling_spearman_slow as slowIn [4]: from srsmall import rolling_spearman_quick as quickIn [5]: for i in range(6):   ...:     df = pd.DataFrame(np.random.randint(0,1000,size=(10*(10**i),2)), columns=list('ab'))   ...:     print len(df), " rows"   ...:     print "quick: ",   ...:     %timeit quick(df.a, df.b, 4)   ...:     print "slow: ",   ...:     %timeit slow(df.a, df.b, 4)   ...:     10  rowsquick: 100 loops, best of 3: 3.52 ms per loopslow: 100 loops, best of 3: 3.2 ms per loop100  rowsquick: 100 loops, best of 3: 3.53 ms per loopslow: 10 loops, best of 3: 42 ms per loop1000  rowsquick: 100 loops, best of 3: 3.82 ms per loopslow: 1 loop, best of 3: 430 ms per loop10000  rowsquick: 100 loops, best of 3: 7.47 ms per loopslow: 1 loop, best of 3: 4.33 s per loop100000  rowsquick: 10 loops, best of 3: 50.2 ms per loopslow: 1 loop, best of 3: 43.4 s per loop1000000  rowsquick: 1 loop, best of 3: 404 ms per loopslow:

On a million rows (on my machine), the quick (pandas) version runs in less than half a second. Not shown above but 10 million took 8.43 seconds. The slow one is still running, but assuming the linear growth continues it should take around 7 minutes for 1M and over an hour for 10M.