Quickest way to calculate subset of correlation matrix Quickest way to calculate subset of correlation matrix pandas pandas

Quickest way to calculate subset of correlation matrix


Using dot products to compute the correlation (as in your example) seems like a good approach. I'll describe two improvements, then code implementing them.

Improvement 1: Pull means out of dot product

We can pull the means out of the dot product, to avoid having to subtract them from every value (similar to how you pulled the standard deviations out of the dot product, which we'll also do).

Let x, y be vectors with n elements. Let a, b be scalars. Let <x,y> denote the dot product between x and y.

The correlation between x and y can be expressed using the dot product

<(x-mean(x))/std(x), (y-mean(y))/std(y)> / n

To pull the standard deviations out of the dot product, we can use the following identity (as you did above):

<ax, by> = a*b*<x, y>

To pull the means out of the dot product, we can derive another identity:

<x+a, y+b> = <x,y> + a*sum(y) + b*sum(x) + a*b*n

In the case where a = -mean(x), b = -mean(y), this simplifies to:

<x-mean(x), y-mean(y)> = <x, y> - sum(x)*sum(y)/n

Using these identities, the correlation between x and y is equivalent to:

(<x, y> - sum(x)*sum(y)/n) / (std(x)*std(y)*n)

In the function below, this will be expressed using matrix multiplication and outer products to handle multiple variables simultaneously (as in your example).

Improvement 2: Pre-compute sums and standard deviations

We can pre-compute the sums and standard deviations, to avoid re-computing them for all columns every time the function is called.

Code

Putting the two improvements together, we have the following (I don't speak pandas, so it's in numpy):

def corr_cols(x, xsum, xstd, lo, hi):    n = x.shape[0]    return (        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi])/n)        / (np.outer(xstd, xstd[lo:hi])*n)    )# fake data w/ 10 points, 5 dimensionsx = np.random.rand(10, 5)# precompute sums and standard deviations along each dimensionxsum = np.sum(x, 0)xstd = np.std(x, 0)# calculate columns of correlation matrix for dimensions 1 thru 3r = corr_cols(x, xsum, xstd, 1, 4)

Better code

Pre-computing and storing the sums and standard deviations can be hidden inside a closure, to give a nicer interface and keep the main code cleaner. Functionally, the operations are equivalent to the previous code.

def col_correlator(x):    n = x.shape[0]    xsum = np.sum(x, 0)    xstd = np.std(x, 0)    return lambda lo, hi: (        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi])/n)        / (np.outer(xstd, xstd[lo:hi])*n)    )# construct function to compute columns of correlation matrixcc = col_correlator(x)# compute columns of correlation matrix for dimensions 1 thru 3r = cc(1, 4)

EDIT: (piRSquared)

I wanted to put my edit in this post to further encourage upvoting of this answer.

This is the code I implemented utilizing this advice. This solution translates back and forth between pandas and numpy.

def corr_closure(df):    d = df.values    sums = d.sum(0, keepdims=True)    stds = d.std(0, keepdims=True)    n = d.shape[0]    def corr(k=0, l=10):        d2 = d.T.dot(d[:, k:l])        sums2 = sums.T.dot(sums[:, k:l])        stds2 = stds.T.dot(stds[:, k:l])        return pd.DataFrame((d2 - sums2 / n) / stds2 / n,                            df.columns, df.columns[k:l])    return corr

Use case:

corr = corr_closure(df)corr(0, 2)

enter image description here


Credit goes to @user20160, @piRsquared.

I had a very similar issue. I was trying to compute just a quarter of the matrix: the correlations between some group of columns to another one.

I modified the code a bit, that it takes 4 parameters, for the 2 groups of vectors:

def col_correlator(x):n = x.shape[0]xsum = np.sum(x, 0)xstd = np.std(x, 0)return lambda lo_c, hi_c, lo_r, hi_r: (    (np.dot(x[:, lo_r:hi_r].T, x[:, lo_c:hi_c]) - np.outer(xsum[lo_r:hi_r], xsum[lo_c:hi_c]) / n)    / (np.outer(xstd[lo_r:hi_r], xstd[lo_c:hi_c]) * n))# construct function to compute columns of correlation matrixcc = col_correlator(x)# compute columns of correlation matrix for dimensions 1 thru 3r = cc(n, m,0,n)