Correlation coefficients and p values for all pairs of rows of a matrix Correlation coefficients and p values for all pairs of rows of a matrix numpy numpy

Correlation coefficients and p values for all pairs of rows of a matrix


I have encountered the same problem today.

After half an hour of googling, I can't find any code in numpy/scipy library can help me do this.

So I wrote my own version of corrcoef

import numpy as npfrom scipy.stats import pearsonr, betaidef corrcoef(matrix):    r = np.corrcoef(matrix)    rf = r[np.triu_indices(r.shape[0], 1)]    df = matrix.shape[1] - 2    ts = rf * rf * (df / (1 - rf * rf))    pf = betai(0.5 * df, 0.5, df / (df + ts))    p = np.zeros(shape=r.shape)    p[np.triu_indices(p.shape[0], 1)] = pf    p[np.tril_indices(p.shape[0], -1)] = p.T[np.tril_indices(p.shape[0], -1)]    p[np.diag_indices(p.shape[0])] = np.ones(p.shape[0])    return r, pdef corrcoef_loop(matrix):    rows, cols = matrix.shape[0], matrix.shape[1]    r = np.ones(shape=(rows, rows))    p = np.ones(shape=(rows, rows))    for i in range(rows):        for j in range(i+1, rows):            r_, p_ = pearsonr(matrix[i], matrix[j])            r[i, j] = r[j, i] = r_            p[i, j] = p[j, i] = p_    return r, p

The first version use the result of np.corrcoef, and then calculate p-value based on triangle-upper values of corrcoef matrix.

The second loop version just iterating over rows, do pearsonr manually.

def test_corrcoef():    a = np.array([        [1, 2, 3, 4],        [1, 3, 1, 4],        [8, 3, 8, 5],        [2, 3, 2, 1]])    r1, p1 = corrcoef(a)    r2, p2 = corrcoef_loop(a)    assert np.allclose(r1, r2)    assert np.allclose(p1, p2)

The test passed, they are the same.

def test_timing():    import time    a = np.random.randn(100, 2500)    def timing(func, *args, **kwargs):        t0 = time.time()        loops = 10        for _ in range(loops):            func(*args, **kwargs)        print('{} takes {} seconds loops={}'.format(            func.__name__, time.time() - t0, loops))    timing(corrcoef, a)    timing(corrcoef_loop, a)if __name__ == '__main__':    test_corrcoef()    test_timing()

The performance on my Macbook against 100x2500 matrix

corrcoef takes 0.06608104705810547 seconds loops=10

corrcoef_loop takes 7.585600137710571 seconds loops=10


The most consice way of doing it might be the buildin method .corr in pandas, to get r:

In [79]:import pandas as pdm=np.random.random((6,6))df=pd.DataFrame(m)print df.corr()          0         1         2         3         4         50  1.000000 -0.282780  0.455210 -0.377936 -0.850840  0.1905451 -0.282780  1.000000 -0.747979 -0.461637  0.270770  0.0088152  0.455210 -0.747979  1.000000 -0.137078 -0.683991  0.5573903 -0.377936 -0.461637 -0.137078  1.000000  0.511070 -0.8016144 -0.850840  0.270770 -0.683991  0.511070  1.000000 -0.4992475  0.190545  0.008815  0.557390 -0.801614 -0.499247  1.000000

To get p values using t-test:

In [84]:n=6r=df.corr()t=r*np.sqrt((n-2)/(1-r*r))import scipy.stats as ssss.t.cdf(t, n-2)Out[84]:array([[ 1.        ,  0.2935682 ,  0.817826  ,  0.23004382,  0.01585695,         0.64117917],       [ 0.2935682 ,  1.        ,  0.04363408,  0.17836685,  0.69811422,         0.50661121],       [ 0.817826  ,  0.04363408,  1.        ,  0.39783538,  0.06700715,         0.8747497 ],       [ 0.23004382,  0.17836685,  0.39783538,  1.        ,  0.84993082,         0.02756579],       [ 0.01585695,  0.69811422,  0.06700715,  0.84993082,  1.        ,         0.15667393],       [ 0.64117917,  0.50661121,  0.8747497 ,  0.02756579,  0.15667393,         1.        ]])In [85]:ss.pearsonr(m[:,0], m[:,1])Out[85]:(-0.28277983892175751, 0.58713640696703184)In [86]:#be careful about the difference of 1-tail test and 2-tail test:0.58713640696703184/2Out[86]:0.2935682034835159 #the value in ss.t.cdf(t, n-2) [0,1] cell

Also you can just use the scipy.stats.pearsonr you mentioned in OP:

In [95]:#returns a list of tuples of (r, p, index1, index2)import itertools[ss.pearsonr(m[:,i],m[:,j])+(i, j) for i, j in itertools.product(range(n), range(n))]Out[95]:[(1.0, 0.0, 0, 0), (-0.28277983892175751, 0.58713640696703184, 0, 1), (0.45521036266021014, 0.36434799921123057, 0, 2), (-0.3779357902414715, 0.46008763115463419, 0, 3), (-0.85083961671703368, 0.031713908656676448, 0, 4), (0.19054495489542525, 0.71764166168348287, 0, 5), (-0.28277983892175751, 0.58713640696703184, 1, 0), (1.0, 0.0, 1, 1),#etc, etc


Sort of hackish and possibly inefficient, but I think this could be what you're looking for:

import scipy.spatial.distance as distimport scipy.stats as ss# Pearson's correlation coefficientsprint dist.squareform(dist.pdist(data, lambda x, y: ss.pearsonr(x, y)[0]))    # p-valuesprint dist.squareform(dist.pdist(data, lambda x, y: ss.pearsonr(x, y)[1]))

Scipy's pdist is a very helpful function, which is primarily meant for finding Pairwise distances between observations in n-dimensional space.

But it allows user defined callable 'distance metrics', which can be exploited to carry out any kind of pair-wise operation. The result is returned in a condensed distance matrix form, which can be easily changed to the square matrix form using Scipy's 'squareform' function.