Quickest way to calculate subset of correlation matrix

python numpy pandas

Using dot products to compute the correlation (as in your example) seems like a good approach. I'll describe two improvements, then code implementing them.

Improvement 1: Pull means out of dot product

We can pull the means out of the dot product, to avoid having to subtract them from every value (similar to how you pulled the standard deviations out of the dot product, which we'll also do).

Let x, y be vectors with n elements. Let a, b be scalars. Let <x,y> denote the dot product between x and y.

The correlation between x and y can be expressed using the dot product

<(x-mean(x))/std(x), (y-mean(y))/std(y)> / n

To pull the standard deviations out of the dot product, we can use the following identity (as you did above):

<ax, by> = a*b*<x, y>

To pull the means out of the dot product, we can derive another identity:

<x+a, y+b> = <x,y> + a*sum(y) + b*sum(x) + a*b*n

In the case where a = -mean(x), b = -mean(y), this simplifies to:

<x-mean(x), y-mean(y)> = <x, y> - sum(x)*sum(y)/n

Using these identities, the correlation between x and y is equivalent to:

(<x, y> - sum(x)*sum(y)/n) / (std(x)*std(y)*n)

In the function below, this will be expressed using matrix multiplication and outer products to handle multiple variables simultaneously (as in your example).

Improvement 2: Pre-compute sums and standard deviations

We can pre-compute the sums and standard deviations, to avoid re-computing them for all columns every time the function is called.

Code

Putting the two improvements together, we have the following (I don't speak pandas, so it's in numpy):

def corr_cols(x, xsum, xstd, lo, hi):    n = x.shape[0]    return (        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi])/n)        / (np.outer(xstd, xstd[lo:hi])*n)    )# fake data w/ 10 points, 5 dimensionsx = np.random.rand(10, 5)# precompute sums and standard deviations along each dimensionxsum = np.sum(x, 0)xstd = np.std(x, 0)# calculate columns of correlation matrix for dimensions 1 thru 3r = corr_cols(x, xsum, xstd, 1, 4)

Better code

Pre-computing and storing the sums and standard deviations can be hidden inside a closure, to give a nicer interface and keep the main code cleaner. Functionally, the operations are equivalent to the previous code.

def col_correlator(x):    n = x.shape[0]    xsum = np.sum(x, 0)    xstd = np.std(x, 0)    return lambda lo, hi: (        (np.dot(x.T, x[:, lo:hi]) - np.outer(xsum, xsum[lo:hi])/n)        / (np.outer(xstd, xstd[lo:hi])*n)    )# construct function to compute columns of correlation matrixcc = col_correlator(x)# compute columns of correlation matrix for dimensions 1 thru 3r = cc(1, 4)

EDIT: (piRSquared)

I wanted to put my edit in this post to further encourage upvoting of this answer.

This is the code I implemented utilizing this advice. This solution translates back and forth between pandas and numpy.

def corr_closure(df):    d = df.values    sums = d.sum(0, keepdims=True)    stds = d.std(0, keepdims=True)    n = d.shape[0]    def corr(k=0, l=10):        d2 = d.T.dot(d[:, k:l])        sums2 = sums.T.dot(sums[:, k:l])        stds2 = stds.T.dot(stds[:, k:l])        return pd.DataFrame((d2 - sums2 / n) / stds2 / n,                            df.columns, df.columns[k:l])    return corr

Use case:

corr = corr_closure(df)corr(0, 2)

python numpy pandas

Credit goes to @user20160, @piRsquared.

I had a very similar issue. I was trying to compute just a quarter of the matrix: the correlations between some group of columns to another one.

I modified the code a bit, that it takes 4 parameters, for the 2 groups of vectors:

def col_correlator(x):n = x.shape[0]xsum = np.sum(x, 0)xstd = np.std(x, 0)return lambda lo_c, hi_c, lo_r, hi_r: (    (np.dot(x[:, lo_r:hi_r].T, x[:, lo_c:hi_c]) - np.outer(xsum[lo_r:hi_r], xsum[lo_c:hi_c]) / n)    / (np.outer(xstd[lo_r:hi_r], xstd[lo_c:hi_c]) * n))# construct function to compute columns of correlation matrixcc = col_correlator(x)# compute columns of correlation matrix for dimensions 1 thru 3r = cc(n, m,0,n)

CodeHunter

Quickest way to calculate subset of correlation matrix

Improvement 1: Pull means out of dot product

Improvement 2: Pre-compute sums and standard deviations

Code

Better code

EDIT: (piRSquared)

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last