How to tackle inconsistent results while using pandas rolling correlation?
What if you replace the sums in your Pearson formula with rolling sums?
    def rolling_pearson(a, b, n):
        a_sum = a.rolling(n).sum()
        b_sum = b.rolling(n).sum()
        ab_sum = (a * b).rolling(n).sum()
        aa_sum = (a ** 2).rolling(n).sum()
        bb_sum = (b ** 2).rolling(n).sum()
        num = n * ab_sum - a_sum * b_sum
        den = (n * aa_sum - a_sum ** 2) * (n * bb_sum - b_sum ** 2)
        return num / den ** 0.5

    rolling_pearson(df.a, df.b, 100)
    ...
    12977    1.109077e-06
    12978    9.555249e-07
    12979    7.761921e-07
    12980    5.460717e-07
    12981             inf
    Length: 12982, dtype: float64
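Note that the formula above is just the expanded, sum-based form of Pearson's r; algebraically it is equivalent to the usual mean-centred definition, as a small pure-Python check shows (the helper names `pearson_sums` and `pearson_centered` are made up for this sketch):

```python
import math

def pearson_sums(xs, ys):
    # Pearson's r via the expanded sum formula used above
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = (n * sxx - sx * sx) * (n * syy - sy * sy)
    return num / math.sqrt(den)

def pearson_centered(xs, ys):
    # the mean-centred definition: cov(x, y) / (std(x) * std(y))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

xs = [1.0, 2.0, 4.0, 3.0, 5.0]
ys = [2.0, 1.0, 3.0, 5.0, 4.0]
print(abs(pearson_sums(xs, ys) - pearson_centered(xs, ys)) < 1e-12)  # True
```

So the formula is mathematically correct; the differences come entirely from floating-point behaviour.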
Why is this so?
In order to answer this question I needed to check the implementation, because indeed the variance of the last 100 samples of b is zero, and the rolling correlation is computed as a.cov(b) / (a.var() * b.var())**0.5.
After some searching I found the rolling variance implementation here; the method they use is Welford's online algorithm. This algorithm is nice because you can add one sample using only one multiplication (the same cost as the methods with cumulative sums), and you can finalize the variance with a single division. Here it is rewritten in Python:
    def welford_add(existingAggregate, newValue):
        if pd.isna(newValue):
            return existingAggregate
        (count, mean, M2) = existingAggregate
        count += 1
        delta = newValue - mean
        mean += delta / count
        delta2 = newValue - mean
        M2 += delta * delta2
        return (count, mean, M2)

    def welford_remove(existingAggregate, newValue):
        if pd.isna(newValue):
            return existingAggregate
        (count, mean, M2) = existingAggregate
        count -= 1
        delta = newValue - mean
        mean -= delta / count
        delta2 = newValue - mean
        M2 -= delta * delta2
        return (count, mean, M2)

    def finalize(existingAggregate):
        (count, mean, M2) = existingAggregate
        variance = M2 / count if count > 0 else None
        sampleVariance = M2 / (count - 1) if count > 1 else None
        return (mean, variance, sampleVariance)
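As a quick sanity check of the sliding-window idea, here is a compact pandas-free version of the same add/remove recurrence (plain floats, no NaN handling), compared against statistics.pvariance on the final window — a sketch using made-up random data:

```python
import random
import statistics

def w_add(agg, x):
    # add one sample to the (count, mean, M2) aggregate
    count, mean, m2 = agg
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return (count, mean, m2)

def w_remove(agg, x):
    # inverse update: remove one sample from the aggregate
    count, mean, m2 = agg
    count -= 1
    delta = x - mean
    mean -= delta / count
    m2 -= delta * (x - mean)
    return (count, mean, m2)

random.seed(0)
data = [random.random() for _ in range(500)]
n = 100

s = (0, 0.0, 0.0)
for i, x in enumerate(data):
    if i >= n:
        s = w_remove(s, data[i - n])  # drop the sample leaving the window
    s = w_add(s, x)                   # add the sample entering the window

count, mean, m2 = s
var = m2 / count  # population variance of the last n samples
print(abs(var - statistics.pvariance(data[-n:])) < 1e-9)  # True
```

For well-behaved data the streaming result agrees with the exact two-pass variance to high precision; the interesting cases are the near-constant windows discussed below.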
In the pandas implementation they mention Kahan summation, which is important to get better precision in additions, but the results are not improved by that (I didn't check whether it is properly implemented or not).
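For reference, Kahan (compensated) summation itself is a small trick worth seeing. This is a generic sketch of the algorithm, not the pandas code: summing 0.1 a million times, the plain sum drifts away from the correctly rounded value while the compensated sum does not:

```python
import math

def kahan_sum(values):
    # compensated (Kahan) summation: c carries the low-order bits
    # that each plain addition would otherwise discard
    total = 0.0
    c = 0.0
    for x in values:
        y = x - c            # apply the stored correction
        t = total + y        # big + small: low-order bits of y are lost
        c = (t - total) - y  # recover exactly what was lost
        total = t
    return total

vals = [0.1] * 1_000_000
naive = sum(vals)
kahan = kahan_sum(vals)
exact = math.fsum(vals)  # correctly rounded reference sum

print(abs(naive - exact), abs(kahan - exact))
```

The compensated error stays at the level of one or two rounding steps regardless of how many terms are summed.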
Applying the Welford algorithm with n=100
    n = 100
    s = (0, 0, 0)
    for i in range(len(df.b)):
        if i >= n:
            s = welford_remove(s, df.b[i - n])
        s = welford_add(s, df.b[i])
    finalize(s)
It gives
    (6.000000000000152, 4.7853099260919405e-12, 4.8336463899918594e-12)
And df.b.rolling(100).var() gives
    0        NaN
    1        NaN
    2        NaN
    3        NaN
    4        NaN
             ...
    12977    6.206061e-01
    12978    4.703030e-01
    12979    3.167677e-01
    12980    1.600000e-01
    12981    6.487273e-12
    Name: b, Length: 12982, dtype: float64
with an error of 6.4e-12, slightly higher than the 4.83e-12 given by direct application of Welford's method.
On the other hand, (df.b**2).rolling(n).sum() - df.b.rolling(n).sum()**2/n gives 0.0 for the last entry:
    0        NaN
    1        NaN
    2        NaN
    3        NaN
    4        NaN
             ...
    12977    61.44
    12978    46.56
    12979    31.36
    12980    15.84
    12981     0.00
    Name: b, Length: 12982, dtype: float64
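This cancellation can be reproduced without pandas. With a hypothetical window that has a large mean and a tiny spread (an exaggerated version of the nearly-constant tail of df.b), the expanded sum formula loses the entire signal to rounding, while a mean-centred computation keeps it:

```python
import statistics

# hypothetical data for illustration: 100 samples with a huge mean
# (1e8) and a tiny alternating spread (+/- 1e-4), true variance ~1e-8
xs = [1e8 + (1e-4 if i % 2 else -1e-4) for i in range(100)]
n = len(xs)

# expanded sum-of-squares formula: both terms are ~1e18, so their
# difference is dominated by rounding error, not by the real signal
sum_x = sum(xs)
sum_xx = sum(x * x for x in xs)
naive_var = (sum_xx - sum_x ** 2 / n) / n

# mean-centred (two-pass) variance keeps the small signal intact
centered_var = statistics.pvariance(xs)

print(naive_var, centered_var)
```

Near 1e18 the spacing between representable doubles is 128, so the naive numerator can only be 0 or a multiple of 128: the true centred sum of squares (about 1e-6) is simply unrepresentable in that subtraction, which is exactly why the pandas-style sum formula returns 0.0 for the last window above.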
I hope this explanation is satisfactory :)