Pandas expanding/rolling window correlation calculation with p-value
I could not think of a clever way to do this in pandas using rolling directly, but note that you can calculate the p-value given the correlation coefficient.

Under the null hypothesis, the test statistic t = corr * sqrt(df / (1 - corr^2)) with df = n - 2 follows a Student's t-distribution, and you can get the two-sided p-value by plugging it into the CDF defined by the incomplete beta function, scipy.special.betainc: p = betainc(df/2, 1/2, df/(df + t^2)). It sounds complicated but can be done in a few lines of code. Below is a function that computes the p-value given the correlation coefficient corr and the sample size n. It is actually based on scipy's implementation, which you have been using.
```python
import pandas as pd
from scipy.special import betainc

def pvalue(corr, n=50):
    # Two-sided p-value for a Pearson correlation `corr` over `n` samples,
    # via the regularized incomplete beta function (as in scipy's pearsonr)
    df = n - 2
    t_squared = corr**2 * (df / ((1.0 - corr) * (1.0 + corr)))
    prob = betainc(0.5*df, 0.5, df/(df + t_squared))
    return prob
```
You can then apply this function to the correlation values you already have.
```python
rolling_corr = df['x'].rolling(50).corr(df['y'])
pvalue(rolling_corr)
```
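As a quick sanity check (a sketch, assuming a toy numeric DataFrame df with columns x and y), the last full window should reproduce scipy.stats.pearsonr:

```python
import numpy as np
from scipy.stats import pearsonr

# Correlation and p-value over the final 50-sample window
r, p = pearsonr(df['x'].iloc[-50:], df['y'].iloc[-50:])
assert np.isclose(rolling_corr.iloc[-1], r)
assert np.isclose(pvalue(rolling_corr.iloc[-1]), p)
```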
It might not be the perfect vectorised numpy solution, but it should be tens of times faster than recalculating the correlations from scratch for every window.
Approach #1
corr2_coeff_rowwise shows how to do element-wise correlation between rows. We could strip it down to the case of element-wise correlation between two columns, which leaves us with a loop that uses corr2_coeff_rowwise. Trying to vectorize that loop, we see pieces in it that can be vectorized:
- Getting the average values with `mean`. This could be vectorized with a uniform filter.
- Getting the differences between those average values and the sliding elements from the input arrays. This can be vectorized with broadcasting (see the sketch just below).
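A minimal sketch of those two vectorized pieces (sliding means via `uniform_filter`, differences via broadcasting), on a toy array purely for illustration:

```python
import numpy as np
from scipy.ndimage import uniform_filter

a = np.arange(10, dtype=float)   # toy input (assumption)
W = 4                            # window length

am = uniform_filter(a, W)        # sliding means, same length as a
amc = am[W//2:-W//2 + 1]         # keep only the means of the full windows
da = a[:, None] - amc            # broadcasting: element i minus mean of window j
print(amc)                       # [1.5 2.5 3.5 4.5 5.5 6.5 7.5]
```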
The rest stays the same to get the first of the two outputs from pearsonr.

To get the second output, we go back to pearsonr's source code. This should be straightforward given the first (coefficient) output.
So, with those in mind, we would end up with something like this -
```python
import numpy as np
import scipy.special as special
from scipy.ndimage import uniform_filter

def sliding_corr1(a, b, W):
    # a, b : 1D input arrays; W : window length
    am = uniform_filter(a.astype(float), W)
    bm = uniform_filter(b.astype(float), W)

    amc = am[W//2:-W//2+1]
    bmc = bm[W//2:-W//2+1]

    da = a[:, None] - amc
    db = b[:, None] - bmc

    # Get sliding mask of valid windows
    m, n = da.shape
    mask1 = np.arange(m)[:, None] >= np.arange(n)
    mask2 = np.arange(m)[:, None] < np.arange(n) + W
    mask = mask1 & mask2
    dam = da * mask
    dbm = db * mask

    ssAs = np.einsum('ij,ij->j', dam, dam)
    ssBs = np.einsum('ij,ij->j', dbm, dbm)
    D = np.einsum('ij,ij->j', dam, dbm)
    coeff = D / np.sqrt(ssAs * ssBs)

    n = W
    ab = n/2 - 1
    pval = 2 * special.btdtr(ab, ab, 0.5 * (1 - np.abs(coeff)))
    return coeff, pval
```
Thus, to get the final output from the pandas Series inputs -
```python
out = sliding_corr1(df['x'].to_numpy(copy=False), df['y'].to_numpy(copy=False), 50)
```
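Note that sliding_corr1 returns arrays of length len(a) - W + 1, whereas rolling(W).corr leaves the first W - 1 positions as NaN. To line the outputs back up with the original index, one could do something like this (a sketch, assuming the toy df above):

```python
coeff, pval = out
coeff_s = pd.Series(coeff, index=df.index[50-1:])  # first full window ends at row 49
pval_s = pd.Series(pval, index=df.index[50-1:])
```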
Approach #2
Very similar to Approach #1, but we will use numba to improve the memory efficiency, replacing step #2 from the previous approach.
```python
import numpy as np
import math
import scipy.special as special
from scipy.ndimage import uniform_filter
from numba import njit, prange

@njit(parallel=True)
def sliding_corr2_coeff(a, b, amc, bmc, W):
    L = len(a) - W + 1
    out00 = np.empty(L)
    for i in prange(L):
        out_a = 0.0
        out_b = 0.0
        out_D = 0.0
        for j in range(W):
            d_a = a[i+j] - amc[i]
            d_b = b[i+j] - bmc[i]
            out_D += d_a * d_b
            out_a += d_a**2
            out_b += d_b**2
        out00[i] = out_D / math.sqrt(out_a * out_b)
    return out00

def sliding_corr2(a, b, W):
    am = uniform_filter(a.astype(float), W)
    bm = uniform_filter(b.astype(float), W)

    amc = am[W//2:-W//2+1]
    bmc = bm[W//2:-W//2+1]

    coeff = sliding_corr2_coeff(a, b, amc, bmc, W)  # W passed in explicitly
    ab = W/2 - 1
    pval = 2 * special.btdtr(ab, ab, 0.5 * (1 - np.abs(coeff)))
    return coeff, pval
```
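Usage mirrors Approach #1, and (a quick hypothetical check) the coefficients should agree with sliding_corr1 up to floating-point noise:

```python
x = df['x'].to_numpy(copy=False)
y = df['y'].to_numpy(copy=False)
coeff2, pval2 = sliding_corr2(x, y, 50)
assert np.allclose(coeff2, sliding_corr1(x, y, 50)[0])
```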
Approach #3
Very similar to the previous one, except that we push all of the coefficient work into numba -
```python
@njit(parallel=True)
def sliding_corr3_coeff(a, b, W):
    L = len(a) - W + 1
    out00 = np.empty(L)
    for i in prange(L):
        # Window means computed inside the loop, no uniform_filter needed
        a_mean = 0.0
        b_mean = 0.0
        for j in range(W):
            a_mean += a[i+j]
            b_mean += b[i+j]
        a_mean /= W
        b_mean /= W

        out_a = 0.0
        out_b = 0.0
        out_D = 0.0
        for j in range(W):
            d_a = a[i+j] - a_mean
            d_b = b[i+j] - b_mean
            out_D += d_a * d_b
            out_a += d_a * d_a
            out_b += d_b * d_b
        out00[i] = out_D / math.sqrt(out_a * out_b)
    return out00

def sliding_corr3(a, b, W):
    coeff = sliding_corr3_coeff(a, b, W)
    ab = W/2 - 1
    pval = 2 * special.btdtr(ab, ab, 0.5 * (1 - np.abs(coeff)))
    return coeff, pval
```
Timings -
```
In [181]: df = pd.DataFrame({'x': np.random.rand(10000), 'y': np.random.rand(10000)})

In [182]: %timeit sliding_corr2(df['x'].to_numpy(copy=False),df['y'].to_numpy(copy=False),50)
100 loops, best of 3: 5.05 ms per loop

In [183]: %timeit sliding_corr3(df['x'].to_numpy(copy=False),df['y'].to_numpy(copy=False),50)
100 loops, best of 3: 5.51 ms per loop
```
Note: sliding_corr1 seems to take long on this dataset, most probably because of the memory requirement of its step #2. After switching to the numba funcs, the bottleneck then transfers to the p-value computation with special.btdtr.
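As an aside, special.btdtr has been deprecated in recent SciPy releases in favor of the regularized incomplete beta function; an equivalent drop-in (a sketch, same parametrization as above) would be:

```python
import numpy as np
import scipy.special as special

def pval_from_coeff(coeff, W):
    # btdtr(a, b, x) is the beta-distribution CDF, i.e. the regularized
    # incomplete beta function betainc(a, b, x)
    ab = W/2 - 1
    return 2 * special.betainc(ab, ab, 0.5 * (1 - np.abs(coeff)))
```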