
Pandas expanding/rolling window correlation calculation with p-value


I could not think of a clever way to do this in pandas using rolling directly, but note that you can calculate the p-value given the correlation coefficient.

Pearson's correlation coefficient follows Student's t-distribution, and you can get the p-value by plugging it into the CDF defined by the incomplete beta function, scipy.special.betainc. It sounds complicated but can be done in a few lines of code. Below is a function that computes the p-value given the correlation coefficient corr and the sample size n. It is actually based on the scipy implementation you have been using.

import pandas as pd
from scipy.special import betainc

def pvalue(corr, n=50):
    # Two-sided p-value for a Pearson correlation over n samples,
    # via the t-distribution CDF expressed with the incomplete beta function
    df = n - 2
    t_squared = corr**2 * (df / ((1.0 - corr) * (1.0 + corr)))
    prob = betainc(0.5*df, 0.5, df/(df + t_squared))
    return prob

You can then apply this function to the correlation values you already have.

rolling_corr = df['x'].rolling(50).corr(df['y'])
pvalue(rolling_corr)

It might not be the perfect vectorised numpy solution but should be tens of times faster than calculating the correlations over and over again.
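
As a quick sanity check, the formula can be compared with scipy.stats.pearsonr on a single window; a minimal sketch, assuming the pvalue function above:

import numpy as np
from scipy import stats

x = np.random.rand(50)
y = np.random.rand(50)
r, p = stats.pearsonr(x, y)

# The betainc-based formula should reproduce pearsonr's two-sided p-value
assert np.isclose(p, pvalue(r, n=50))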


Approach #1

corr2_coeff_rowwise shows how to do element-wise correlation between rows. We could strip it down to the case of element-wise correlation between two columns, which would give us a loop that uses corr2_coeff_rowwise. Then, we would try to vectorize it and see that there are pieces in it that could be vectorized:

  1. Getting the average values with mean. This can be vectorized with a uniform filter (scipy.ndimage.uniform_filter).
  2. Getting the differences between those average values and the sliding elements from the input arrays. To vectorize this, we make use of broadcasting. A minimal sketch of both pieces follows this list.
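
A minimal sketch of those two pieces, with a hypothetical array a and window length W chosen only for illustration:

import numpy as np
from scipy.ndimage import uniform_filter

a = np.arange(10, dtype=float)
W = 4

# Step 1: sliding means, trimmed to the len(a)-W+1 valid window positions;
# equivalent to np.convolve(a, np.ones(W)/W, mode='valid')
am = uniform_filter(a, W)[W//2:-W//2+1]
assert np.allclose(am, np.convolve(a, np.ones(W)/W, mode='valid'))

# Step 2: differences via broadcasting -> shape (len(a), len(a)-W+1),
# one column of deviations per window position
da = a[:, None] - am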

The rest stays the same to get the first of the two outputs from pearsonr.

To get the second output, we go back to the pearsonr source code. This should be straightforward given the first (coefficient) output.

So, with those in mind, we would end up with something like this -

import numpy as np
import scipy.special as special
from scipy.ndimage import uniform_filter

def sliding_corr1(a, b, W):
    # a, b are input arrays; W is window length
    am = uniform_filter(a.astype(float), W)
    bm = uniform_filter(b.astype(float), W)
    amc = am[W//2:-W//2+1]
    bmc = bm[W//2:-W//2+1]

    da = a[:, None] - amc
    db = b[:, None] - bmc

    # Get sliding mask of valid windows
    m, n = da.shape
    mask1 = np.arange(m)[:, None] >= np.arange(n)
    mask2 = np.arange(m)[:, None] < np.arange(n) + W
    mask = mask1 & mask2
    dam = da * mask
    dbm = db * mask

    ssAs = np.einsum('ij,ij->j', dam, dam)
    ssBs = np.einsum('ij,ij->j', dbm, dbm)
    D = np.einsum('ij,ij->j', dam, dbm)
    coeff = D / np.sqrt(ssAs * ssBs)

    n = W
    ab = n/2 - 1
    pval = 2 * special.btdtr(ab, ab, 0.5 * (1 - abs(np.float64(coeff))))
    return coeff, pval

Thus, to get the final output from the pandas series inputs -

out = sliding_corr1(df['x'].to_numpy(copy=False), df['y'].to_numpy(copy=False), 50)
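
As a sanity check (a sketch, assuming the same df with columns x and y as above), the coefficients should agree with pandas' own rolling correlation, which is NaN for the first W-1 positions:

coeff, pval = sliding_corr1(df['x'].to_numpy(copy=False), df['y'].to_numpy(copy=False), 50)
ref = df['x'].rolling(50).corr(df['y']).to_numpy()[49:]  # drop the NaN head
assert np.allclose(coeff, ref)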

Approach #2

Very similar to Approach #1, but we will use numba to improve memory efficiency, replacing step #2 from the previous approach.

import math
import numpy as np
import scipy.special as special
from scipy.ndimage import uniform_filter
from numba import njit

@njit(parallel=True)
def sliding_corr2_coeff(a, b, amc, bmc, W):
    # W is passed explicitly so numba doesn't rely on a global;
    # loop over the L valid windows, accumulating centred products
    L = len(a) - W + 1
    out00 = np.empty(L)
    for i in range(L):
        out_a = 0.0
        out_b = 0.0
        out_D = 0.0
        for j in range(W):
            d_a = a[i+j] - amc[i]
            d_b = b[i+j] - bmc[i]
            out_D += d_a*d_b
            out_a += d_a**2
            out_b += d_b**2
        out00[i] = out_D/math.sqrt(out_a*out_b)
    return out00

def sliding_corr2(a, b, W):
    am = uniform_filter(a.astype(float), W)
    bm = uniform_filter(b.astype(float), W)
    amc = am[W//2:-W//2+1]
    bmc = bm[W//2:-W//2+1]
    coeff = sliding_corr2_coeff(a, b, amc, bmc, W)
    ab = W/2 - 1
    pval = 2*special.btdtr(ab, ab, 0.5*(1 - abs(np.float64(coeff))))
    return coeff, pval

Approach #3

Very similar to the previous one, except that we are pushing all of the coefficient work to numba -

@njit(parallel=True)
def sliding_corr3_coeff(a, b, W):
    L = len(a) - W + 1
    out00 = np.empty(L)
    for i in range(L):
        # First pass: window means
        a_mean = 0.0
        b_mean = 0.0
        for j in range(W):
            a_mean += a[i+j]
            b_mean += b[i+j]
        a_mean /= W
        b_mean /= W

        # Second pass: centred cross- and self-products
        out_a = 0.0
        out_b = 0.0
        out_D = 0.0
        for j in range(W):
            d_a = a[i+j] - a_mean
            d_b = b[i+j] - b_mean
            out_D += d_a*d_b
            out_a += d_a*d_a
            out_b += d_b*d_b
        out00[i] = out_D/math.sqrt(out_a*out_b)
    return out00

def sliding_corr3(a, b, W):
    coeff = sliding_corr3_coeff(a, b, W)
    ab = W/2 - 1
    pval = 2*special.btdtr(ab, ab, 0.5*(1 - np.abs(coeff)))
    return coeff, pval

Timings -

In [181]: df = pd.DataFrame({'x': np.random.rand(10000), 'y': np.random.rand(10000)})

In [182]: %timeit sliding_corr2(df['x'].to_numpy(copy=False),df['y'].to_numpy(copy=False),50)
100 loops, best of 3: 5.05 ms per loop

In [183]: %timeit sliding_corr3(df['x'].to_numpy(copy=False),df['y'].to_numpy(copy=False),50)
100 loops, best of 3: 5.51 ms per loop

Note:

  • sliding_corr1 seems to take long on this dataset, most probably because of the memory requirement of its step #2.

  • After switching to the numba funcs, the bottleneck then transfers to the p-value computation with special.btdtr.
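
One note on that last point: scipy.special.btdtr has been deprecated in recent SciPy releases in favour of the (equally vectorised) regularized incomplete beta function, so on newer versions the same p-values can be obtained with:

# btdtr(a, b, x) is the regularized incomplete beta function betainc(a, b, x),
# so this is a drop-in replacement on SciPy versions without btdtr
pval = 2*special.betainc(ab, ab, 0.5*(1 - np.abs(coeff)))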