How to tackle inconsistent results while using pandas rolling correlation?

What if you replace the sums in your Pearson formula with rolling sums?

def rolling_pearson(a, b, n):
    a_sum = a.rolling(n).sum()
    b_sum = b.rolling(n).sum()
    ab_sum = (a * b).rolling(n).sum()
    aa_sum = (a**2).rolling(n).sum()
    bb_sum = (b**2).rolling(n).sum()
    num = n * ab_sum - a_sum * b_sum
    den = (n * aa_sum - a_sum**2) * (n * bb_sum - b_sum**2)
    return num / den**0.5

rolling_pearson(df.a, df.b, 100)
             ...
12977    1.109077e-06
12978    9.555249e-07
12979    7.761921e-07
12980    5.460717e-07
12981             inf
Length: 12982, dtype: float64

Why is this so?

In order to answer this question I needed to check the implementation, because the variance of the last 100 samples of b is indeed zero, and the rolling correlation is computed as a.cov(b) / (a.var() * b.var())**0.5.
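The precision loss in the cumulative-sum version can be reproduced without the original data. A sketch with synthetic values (the offset and spread are made up; df is not shown in the question): when the values carry a large offset and a tiny spread, the two terms of the denominator are nearly equal, so their difference is dominated by rounding error.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data (hypothetical values):
# a large offset with a tiny spread, so the true variance is small
# but nonzero.
rng = np.random.default_rng(0)
b = pd.Series(1e8 + rng.normal(scale=1e-3, size=100))
n = len(b)

# Variance-like denominator term via the cumulative-sum formula ...
den_sums = n * (b**2).sum() - b.sum() ** 2

# ... and the same quantity from a numerically stable two-pass variance.
den_stable = n * n * np.var(b.to_numpy())

print(den_sums, den_stable)
```

At this magnitude the subtraction of two terms around 1e20 can only resolve multiples of about 16384 (the float64 spacing there), so the true value of roughly 0.01 is unrecoverable with this formula.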

After some searching I found the rolling variance implementation here; the method they are using is Welford's online algorithm. This algorithm is nice because adding one sample costs only one multiplication and one division (the same as the cumulative-sum methods), and removing one sample is just as cheap. Here it is rewritten in Python.

import pandas as pd

def welford_add(existingAggregate, newValue):
    if pd.isna(newValue):
        return existingAggregate
    (count, mean, M2) = existingAggregate
    count += 1
    delta = newValue - mean
    mean += delta / count
    delta2 = newValue - mean
    M2 += delta * delta2
    return (count, mean, M2)

def welford_remove(existingAggregate, newValue):
    if pd.isna(newValue):
        return existingAggregate
    (count, mean, M2) = existingAggregate
    count -= 1
    delta = newValue - mean
    mean -= delta / count
    delta2 = newValue - mean
    M2 -= delta * delta2
    return (count, mean, M2)

def finalize(existingAggregate):
    (count, mean, M2) = existingAggregate
    (mean, variance, sampleVariance) = (
        mean,
        M2 / count if count > 0 else None,
        M2 / (count - 1) if count > 1 else None,
    )
    return (mean, variance, sampleVariance)
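As a quick sanity check (restating the add/remove steps without the NaN guard so the snippet stands alone, with made-up sample data), the forward and reverse updates reproduce the two-pass variance of every window:

```python
import numpy as np

def welford_add(agg, x):
    count, mean, M2 = agg
    count += 1
    delta = x - mean
    mean += delta / count
    M2 += delta * (x - mean)
    return (count, mean, M2)

def welford_remove(agg, x):
    count, mean, M2 = agg
    count -= 1
    delta = x - mean
    mean -= delta / count
    M2 -= delta * (x - mean)
    return (count, mean, M2)

# Rolling population variance with window size 3 over a small array,
# checked against numpy's two-pass variance for each window.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = 3
s = (0, 0.0, 0.0)
for i, x in enumerate(data):
    if i >= n:
        s = welford_remove(s, data[i - n])
    s = welford_add(s, x)
    if i >= n - 1:
        count, mean, M2 = s
        assert abs(M2 / count - np.var(data[i - n + 1 : i + 1])) < 1e-12
```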

The pandas implementation also mentions Kahan's summation, which is important for better precision in additions, but the results are not improved by it here (I didn't check whether it is properly implemented or not).
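For reference, Kahan (compensated) summation carries a running correction term for the low-order bits that each addition discards; this is a minimal sketch of the idea, not the pandas code:

```python
def kahan_sum(values):
    """Compensated summation: c accumulates the low-order bits
    that plain addition to total would discard."""
    total = 0.0
    c = 0.0
    for x in values:
        y = x - c            # apply the stored correction
        t = total + y        # big + small: low bits of y are lost here
        c = (t - total) - y  # algebraically zero; recovers the lost bits
        total = t
    return total

# 0.1 has no exact binary representation, so naive accumulation drifts,
# while Kahan keeps the error at a few ulps of the result.
vals = [0.1] * 1000
print(sum(vals), kahan_sum(vals))
```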

Applying the Welford algorithm with n = 100:

n = 100
s = (0, 0, 0)
for i in range(len(df.b)):
    if i >= n:
        s = welford_remove(s, df.b[i - n])
    s = welford_add(s, df.b[i])
finalize(s)

It gives

(6.000000000000152, 4.7853099260919405e-12, 4.8336463899918594e-12)

And the df.b.rolling(100).var() gives

0                 NaN
1                 NaN
2                 NaN
3                 NaN
4                 NaN
             ...
12977    6.206061e-01
12978    4.703030e-01
12979    3.167677e-01
12980    1.600000e-01
12981    6.487273e-12
Name: b, Length: 12982, dtype: float64

Its error of 6.5e-12 is slightly higher than the 4.83e-12 given by direct application of Welford's method.

On the other hand, (df.b**2).rolling(n).sum() - df.b.rolling(n).sum()**2 / n gives exactly 0.0 for the last entry.

0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
         ...
12977    61.44
12978    46.56
12979    31.36
12980    15.84
12981     0.00
Name: b, Length: 12982, dtype: float64
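The exact zero is no accident. The last window of b is constant (mean 6, variance 0), and for a constant that is exactly representable in binary, both terms of the sum formula round to the same float and cancel exactly; a constant like 0.1 would not necessarily behave this way. A small illustration with a stand-in constant window:

```python
import numpy as np

# A constant window of an exactly representable value: both terms of the
# sum formula evaluate to exactly 3600.0, so the subtraction gives 0.0.
window = np.full(100, 6.0)
n = len(window)
result = (window**2).sum() - window.sum() ** 2 / n
print(result)  # → 0.0
```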

I hope this explanation is satisfactory :)