How to tackle inconsistent results while using pandas rolling correlation?
What if you replace the sums in your Pearson formula with rolling sums?
    def rolling_pearson(a, b, n):
        a_sum = a.rolling(n).sum()
        b_sum = b.rolling(n).sum()
        ab_sum = (a * b).rolling(n).sum()
        aa_sum = (a ** 2).rolling(n).sum()
        bb_sum = (b ** 2).rolling(n).sum()
        num = n * ab_sum - a_sum * b_sum
        den = (n * aa_sum - a_sum ** 2) * (n * bb_sum - b_sum ** 2)
        return num / den ** 0.5

    rolling_pearson(df.a, df.b, 100)
    ...
    12977    1.109077e-06
    12978    9.555249e-07
    12979    7.761921e-07
    12980    5.460717e-07
    12981             inf
    Length: 12982, dtype: float64
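Note that the formula above is just the expanded, sum-based form of Pearson's r; algebraically it is equivalent to the usual mean-centred definition, as a small pure-Python check shows (the helper names `pearson_sums` and `pearson_centered` are made up for this sketch):

```python
import math

def pearson_sums(xs, ys):
    # Pearson's r via the expanded sum formula used above
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = (n * sxx - sx * sx) * (n * syy - sy * sy)
    return num / math.sqrt(den)

def pearson_centered(xs, ys):
    # the mean-centred definition: cov(x, y) / (std(x) * std(y))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

xs = [1.0, 2.0, 4.0, 3.0, 5.0]
ys = [2.0, 1.0, 3.0, 5.0, 4.0]
print(abs(pearson_sums(xs, ys) - pearson_centered(xs, ys)) < 1e-12)  # True
```

So the formula is mathematically correct; the differences come entirely from floating-point behaviour.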
Why is this so?
In order to answer this question I needed to check the implementation, because indeed the variance of the last 100 samples of b is zero, and the rolling correlation is computed as a.cov(b) / (a.var() * b.var())**0.5.
After some searching I found the rolling variance implementation here; the method they use is Welford's online algorithm. This algorithm is nice because you can add one sample using only one multiplication (the same cost as the methods with cumulative sums), and you can finalize the variance with a single division. Here it is rewritten in Python:
    def welford_add(existingAggregate, newValue):
        if pd.isna(newValue):
            return existingAggregate
        (count, mean, M2) = existingAggregate
        count += 1
        delta = newValue - mean
        mean += delta / count
        delta2 = newValue - mean
        M2 += delta * delta2
        return (count, mean, M2)

    def welford_remove(existingAggregate, newValue):
        if pd.isna(newValue):
            return existingAggregate
        (count, mean, M2) = existingAggregate
        count -= 1
        delta = newValue - mean
        mean -= delta / count
        delta2 = newValue - mean
        M2 -= delta * delta2
        return (count, mean, M2)

    def finalize(existingAggregate):
        (count, mean, M2) = existingAggregate
        variance = M2 / count if count > 0 else None
        sampleVariance = M2 / (count - 1) if count > 1 else None
        return (mean, variance, sampleVariance)
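As a quick sanity check of the sliding-window idea, here is a compact pandas-free version of the same add/remove recurrence (plain floats, no NaN handling), compared against statistics.pvariance on the final window — a sketch using made-up random data:

```python
import random
import statistics

def w_add(agg, x):
    # add one sample to the (count, mean, M2) aggregate
    count, mean, m2 = agg
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return (count, mean, m2)

def w_remove(agg, x):
    # inverse update: remove one sample from the aggregate
    count, mean, m2 = agg
    count -= 1
    delta = x - mean
    mean -= delta / count
    m2 -= delta * (x - mean)
    return (count, mean, m2)

random.seed(0)
data = [random.random() for _ in range(500)]
n = 100

s = (0, 0.0, 0.0)
for i, x in enumerate(data):
    if i >= n:
        s = w_remove(s, data[i - n])  # drop the sample leaving the window
    s = w_add(s, x)                   # add the sample entering the window

count, mean, m2 = s
var = m2 / count  # population variance of the last n samples
print(abs(var - statistics.pvariance(data[-n:])) < 1e-9)  # True
```

For well-behaved data the streaming result agrees with the exact two-pass variance to high precision; the interesting cases are the near-constant windows discussed below.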
In the pandas implementation they mention Kahan summation, which is important to get better precision in additions, but the results are not improved by that (I didn't check whether it is properly implemented or not).
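For reference, Kahan (compensated) summation itself is a small trick worth seeing. This is a generic sketch of the algorithm, not the pandas code: summing 0.1 a million times, the plain sum drifts away from the correctly rounded value while the compensated sum does not:

```python
import math

def kahan_sum(values):
    # compensated (Kahan) summation: c carries the low-order bits
    # that each plain addition would otherwise discard
    total = 0.0
    c = 0.0
    for x in values:
        y = x - c            # apply the stored correction
        t = total + y        # big + small: low-order bits of y are lost
        c = (t - total) - y  # recover exactly what was lost
        total = t
    return total

vals = [0.1] * 1_000_000
naive = sum(vals)
kahan = kahan_sum(vals)
exact = math.fsum(vals)  # correctly rounded reference sum

print(abs(naive - exact), abs(kahan - exact))
```

The compensated error stays at the level of one or two rounding steps regardless of how many terms are summed.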
Applying the Welford algorithm with n=100
    n = 100
    s = (0, 0, 0)
    for i in range(len(df.b)):
        if i >= n:
            s = welford_remove(s, df.b[i - n])
        s = welford_add(s, df.b[i])
    finalize(s)
It gives
    (6.000000000000152, 4.7853099260919405e-12, 4.8336463899918594e-12)
And df.b.rolling(100).var() gives
    0        NaN
    1        NaN
    2        NaN
    3        NaN
    4        NaN
             ...
    12977    6.206061e-01
    12978    4.703030e-01
    12979    3.167677e-01
    12980    1.600000e-01
    12981    6.487273e-12
    Name: b, Length: 12982, dtype: float64
with an error of 6.4e-12, slightly higher than the 4.83e-12 given by direct application of Welford's method.
On the other hand, (df.b**2).rolling(n).sum() - df.b.rolling(n).sum()**2/n gives 0.0 for the last entry:
    0        NaN
    1        NaN
    2        NaN
    3        NaN
    4        NaN
             ...
    12977    61.44
    12978    46.56
    12979    31.36
    12980    15.84
    12981     0.00
    Name: b, Length: 12982, dtype: float64
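This cancellation can be reproduced without pandas. With a hypothetical window that has a large mean and a tiny spread (an exaggerated version of the nearly-constant tail of df.b), the expanded sum formula loses the entire signal to rounding, while a mean-centred computation keeps it:

```python
import statistics

# hypothetical data for illustration: 100 samples with a huge mean
# (1e8) and a tiny alternating spread (+/- 1e-4), true variance ~1e-8
xs = [1e8 + (1e-4 if i % 2 else -1e-4) for i in range(100)]
n = len(xs)

# expanded sum-of-squares formula: both terms are ~1e18, so their
# difference is dominated by rounding error, not by the real signal
sum_x = sum(xs)
sum_xx = sum(x * x for x in xs)
naive_var = (sum_xx - sum_x ** 2 / n) / n

# mean-centred (two-pass) variance keeps the small signal intact
centered_var = statistics.pvariance(xs)

print(naive_var, centered_var)
```

Near 1e18 the spacing between representable doubles is 128, so the naive numerator can only be 0 or a multiple of 128: the true centred sum of squares (about 1e-6) is simply unrepresentable in that subtraction, which is exactly why the pandas-style sum formula returns 0.0 for the last window above.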
I hope this explanation is satisfactory :)