Computing the correlation coefficient between two multi-dimensional arrays Computing the correlation coefficient between two multi-dimensional arrays arrays arrays

Computing the correlation coefficient between two multi-dimensional arrays


Correlation (default 'valid' case) between two 2D arrays:

You can simply use matrix-multiplication np.dot like so -

out = np.dot(arr_one,arr_two.T)

Correlation with the default "valid" case between each pairwise row combinations (row1,row2) of the two input arrays would correspond to multiplication result at each (row1,row2) position.


Row-wise Correlation Coefficient calculation for two 2D arrays:

def corr2_coeff(A, B):    # Rowwise mean of input arrays & subtract from input arrays themeselves    A_mA = A - A.mean(1)[:, None]    B_mB = B - B.mean(1)[:, None]    # Sum of squares across rows    ssA = (A_mA**2).sum(1)    ssB = (B_mB**2).sum(1)    # Finally get corr coeff    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None],ssB[None]))

This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB

Benchmarking

This section compares runtime performance with the proposed approach against generate_correlation_map & loopy pearsonr based approach listed in the other answer.(taken from the function test_generate_correlation_map() without the value correctness verification code at the end of it). Please note the timings for the proposed approach also include a check at the start to check for equal number of columns in the two input arrays, as also done in that other answer. The runtimes are listed next.

Case #1:

In [106]: A = np.random.rand(1000, 100)In [107]: B = np.random.rand(1000, 100)In [108]: %timeit corr2_coeff(A, B)100 loops, best of 3: 15 ms per loopIn [109]: %timeit generate_correlation_map(A, B)100 loops, best of 3: 19.6 ms per loop

Case #2:

In [110]: A = np.random.rand(5000, 100)In [111]: B = np.random.rand(5000, 100)In [112]: %timeit corr2_coeff(A, B)1 loops, best of 3: 368 ms per loopIn [113]: %timeit generate_correlation_map(A, B)1 loops, best of 3: 493 ms per loop

Case #3:

In [114]: A = np.random.rand(10000, 10)In [115]: B = np.random.rand(10000, 10)In [116]: %timeit corr2_coeff(A, B)1 loops, best of 3: 1.29 s per loopIn [117]: %timeit generate_correlation_map(A, B)1 loops, best of 3: 1.83 s per loop

The other loopy pearsonr based approach seemed too slow, but here are the runtimes for one small datasize -

In [118]: A = np.random.rand(1000, 100)In [119]: B = np.random.rand(1000, 100)In [120]: %timeit corr2_coeff(A, B)100 loops, best of 3: 15.3 ms per loopIn [121]: %timeit generate_correlation_map(A, B)100 loops, best of 3: 19.7 ms per loopIn [122]: %timeit pearsonr_based(A, B)1 loops, best of 3: 33 s per loop


@Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.

In order to calculate the correlation coefficient, a bit more is required:

import numpy as npdef generate_correlation_map(x, y):    """Correlate each n with each m.    Parameters    ----------    x : np.array      Shape N X T.    y : np.array      Shape M X T.    Returns    -------    np.array      N X M array in which each element is a correlation coefficient.    """    mu_x = x.mean(1)    mu_y = y.mean(1)    n = x.shape[1]    if n != y.shape[1]:        raise ValueError('x and y must ' +                         'have the same number of timepoints.')    s_x = x.std(1, ddof=n - 1)    s_y = y.std(1, ddof=n - 1)    cov = np.dot(x,                 y.T) - n * np.dot(mu_x[:, np.newaxis],                                  mu_y[np.newaxis, :])    return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])

Here's a test of this function, which passes:

from scipy.stats import pearsonrdef test_generate_correlation_map():    x = np.random.rand(10, 10)    y = np.random.rand(20, 10)    desired = np.empty((10, 20))    for n in range(x.shape[0]):        for m in range(y.shape[0]):            desired[n, m] = pearsonr(x[n, :], y[m, :])[0]    actual = generate_correlation_map(x, y)    np.testing.assert_array_almost_equal(actual, desired)


For those interested in computing the Pearson correlation coefficient between a 1D and 2D array, I wrote the following function, where x is a 1D array and y a 2D array.

def pearsonr_2D(x, y):    """computes pearson correlation coefficient       where x is a 1D and y a 2D array"""    upper = np.sum((x - np.mean(x)) * (y - np.mean(y, axis=1)[:,None]), axis=1)    lower = np.sqrt(np.sum(np.power(x - np.mean(x), 2)) * np.sum(np.power(y - np.mean(y, axis=1)[:,None], 2), axis=1))        rho = upper / lower        return rho

Example run:

>>> xOut[1]: array([1, 2, 3])>>> yOut[2]: array([[ 1,  2,  3],               [ 6,  7, 12],               [ 9,  3,  1]])>>> pearsonr_2D(x, y)Out[3]: array([ 1.        ,  0.93325653, -0.96076892])