Pandas Rolling window Spearman correlation
rolling.corr does Pearson, so you can use it for that. For Spearman, use something like this:
import pandas as pd
from numpy.lib.stride_tricks import as_strided
from numpy.lib import pad
import numpy as np

def rolling_spearman(seqa, seqb, window):
    stridea = seqa.strides[0]
    ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window],
                     strides=[stridea, stridea])
    strideb = seqb.strides[0]
    ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window],
                     strides=[strideb, strideb])
    ar = pd.DataFrame(ssa)
    br = pd.DataFrame(ssb)
    ar = ar.rank(1)
    br = br.rank(1)
    corrs = ar.corrwith(br, 1)
    return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)
E.g.:
In [144]: df = pd.DataFrame(np.random.randint(0,1000,size=(10,2)), columns = list('ab'))

In [145]: df['corr'] = rolling_spearman(df.a, df.b, 4)

In [146]: df
Out[146]:
     a    b  corr
0  429  922   NaN
1  618  382   NaN
2  476  517   NaN
3  267  536  -0.8
4  582  844  -0.4
5  254  895  -0.4
6  583  974   0.4
7  687  298  -0.4
8  697  447  -0.6
9  383   35   0.4
Explanation: numpy.lib.stride_tricks.as_strided is a hacky method that, in this case, gives us a view of each sequence that looks like a 2D array whose rows are the rolling-window sections of the sequence.
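To make that concrete, here is a minimal sketch (not part of the original answer) of what the as_strided call does on a small array — reusing the element stride for both dimensions makes consecutive rows overlap by window - 1 elements, without copying any data:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

seq = np.arange(6)          # [0, 1, 2, 3, 4, 5]
window = 3
stride = seq.strides[0]     # bytes between consecutive elements

# Each row of this view is one rolling window over seq:
windows = as_strided(seq,
                     shape=[len(seq) - window + 1, window],
                     strides=[stride, stride])
print(windows)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]
#  [3 4 5]]
```

Because it is a view, writing to `windows` would modify `seq` — which is fine here since we only read from it.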
From then on, it's simple. Spearman correlation is equivalent to transforming the sequences to ranks and taking the Pearson correlation coefficient, and pandas helpfully has fast implementations to do this row-wise on DataFrames. Then at the end we pad the start of the resulting Series with NaN values (so you can add it as a column to your dataframe or whatever).
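As a quick sanity check (not in the original answer), the rank-then-Pearson equivalence can be verified directly against scipy.stats.spearmanr on one window — the values below are the first full window from the example output above, which produced -0.8:

```python
import numpy as np
import pandas as pd
import scipy.stats

# First window of the example data (rows 0-3 of columns a and b)
a = pd.Series([429, 618, 476, 267])
b = pd.Series([922, 382, 517, 536])

# Pearson correlation of the ranks...
via_ranks = a.rank().corr(b.rank())

# ...agrees with scipy's Spearman computed directly
direct = scipy.stats.spearmanr(a, b)[0]

print(via_ranks, direct)
# -0.8 -0.8
```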
(Personal note: I spent so long trying to figure out how to do this efficiently with numpy and scipy before I realised everything you need is in pandas already...!).
To show the speed advantage of this method over just looping over the sliding windows, I made a little file called srsmall.py containing:
import pandas as pd
from numpy.lib.stride_tricks import as_strided
import scipy.stats
from numpy.lib import pad
import numpy as np

def rolling_spearman_slow(seqa, seqb, window):
    stridea = seqa.strides[0]
    ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window],
                     strides=[stridea, stridea])
    strideb = seqb.strides[0]
    ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window],
                     strides=[strideb, strideb])
    corrs = [scipy.stats.spearmanr(a, b)[0] for (a, b) in zip(ssa, ssb)]
    return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)

def rolling_spearman_quick(seqa, seqb, window):
    stridea = seqa.strides[0]
    ssa = as_strided(seqa, shape=[len(seqa) - window + 1, window],
                     strides=[stridea, stridea])
    strideb = seqb.strides[0]
    ssb = as_strided(seqb, shape=[len(seqb) - window + 1, window],
                     strides=[strideb, strideb])
    ar = pd.DataFrame(ssa)
    br = pd.DataFrame(ssb)
    ar = ar.rank(1)
    br = br.rank(1)
    corrs = ar.corrwith(br, 1)
    return pad(corrs, (window - 1, 0), 'constant', constant_values=np.nan)
And compare the performance:
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from srsmall import rolling_spearman_slow as slow

In [4]: from srsmall import rolling_spearman_quick as quick

In [5]: for i in range(6):
   ...:     df = pd.DataFrame(np.random.randint(0,1000,size=(10*(10**i),2)), columns=list('ab'))
   ...:     print len(df), " rows"
   ...:     print "quick: ",
   ...:     %timeit quick(df.a, df.b, 4)
   ...:     print "slow: ",
   ...:     %timeit slow(df.a, df.b, 4)
   ...:
10  rows
quick:  100 loops, best of 3: 3.52 ms per loop
slow:  100 loops, best of 3: 3.2 ms per loop
100  rows
quick:  100 loops, best of 3: 3.53 ms per loop
slow:  10 loops, best of 3: 42 ms per loop
1000  rows
quick:  100 loops, best of 3: 3.82 ms per loop
slow:  1 loop, best of 3: 430 ms per loop
10000  rows
quick:  100 loops, best of 3: 7.47 ms per loop
slow:  1 loop, best of 3: 4.33 s per loop
100000  rows
quick:  10 loops, best of 3: 50.2 ms per loop
slow:  1 loop, best of 3: 43.4 s per loop
1000000  rows
quick:  1 loop, best of 3: 404 ms per loop
slow:
On a million rows (on my machine), the quick (pandas) version runs in less than half a second. Not shown above, but 10 million rows took 8.43 seconds. The slow version is still running; assuming the linear growth continues, it should take around 7 minutes for 1M rows and over an hour for 10M.