Generate correlated data in Python (3.3) Generate correlated data in Python (3.3) numpy numpy

Generate correlated data in Python (3.3)


numpy.random.multivariate_normal is the function that you want.

Example:

import numpy as npimport matplotlib.pyplot as pltnum_samples = 400# The desired mean values of the sample.mu = np.array([5.0, 0.0, 10.0])# The desired covariance matrix.r = np.array([        [  3.40, -2.75, -2.00],        [ -2.75,  5.50,  1.50],        [ -2.00,  1.50,  1.25]    ])# Generate the random samples.y = np.random.multivariate_normal(mu, r, size=num_samples)# Plot various projections of the samples.plt.subplot(2,2,1)plt.plot(y[:,0], y[:,1], 'b.')plt.plot(mu[0], mu[1], 'ro')plt.ylabel('y[1]')plt.axis('equal')plt.grid(True)plt.subplot(2,2,3)plt.plot(y[:,0], y[:,2], 'b.')plt.plot(mu[0], mu[2], 'ro')plt.xlabel('y[0]')plt.ylabel('y[2]')plt.axis('equal')plt.grid(True)plt.subplot(2,2,4)plt.plot(y[:,1], y[:,2], 'b.')plt.plot(mu[1], mu[2], 'ro')plt.xlabel('y[1]')plt.axis('equal')plt.grid(True)plt.show()

Result:

enter image description here

See also CorrelatedRandomSamples in the SciPy Cookbook.


If you Cholesky-decompose a covariance matrix C into L L^T, and generate anindependent random vector x, then Lx will be a random vector with covarianceC.

import numpy as npimport matplotlib.pyplot as pltlinalg = np.linalgnp.random.seed(1)num_samples = 1000num_variables = 2cov = [[0.3, 0.2], [0.2, 0.2]]L = linalg.cholesky(cov)# print(L.shape)# (2, 2)uncorrelated = np.random.standard_normal((num_variables, num_samples))mean = [1, 1]correlated = np.dot(L, uncorrelated) + np.array(mean).reshape(2, 1)# print(correlated.shape)# (2, 1000)plt.scatter(correlated[0, :], correlated[1, :], c='green')plt.show()

enter image description here

Reference: See Cholesky decomposition


If you want to generate two series, X and Y, with a particular (Pearson) correlation coefficient (e.g. 0.2):

rho = cov(X,Y) / sqrt(var(X)*var(Y))

you could choose the covariance matrix to be

cov = [[1, 0.2],       [0.2, 1]]

This makes the cov(X,Y) = 0.2, and the variances, var(X) and var(Y) both equal to 1. So rho would equal 0.2.

For example, below we generate pairs of correlated series, X and Y, 1000 times. Then we plot a histogram of the correlation coefficients:

import numpy as npimport matplotlib.pyplot as pltimport scipy.stats as statslinalg = np.linalgnp.random.seed(1)num_samples = 1000num_variables = 2cov = [[1.0, 0.2], [0.2, 1.0]]L = linalg.cholesky(cov)rhos = []for i in range(1000):    uncorrelated = np.random.standard_normal((num_variables, num_samples))    correlated = np.dot(L, uncorrelated)    X, Y = correlated    rho, pval = stats.pearsonr(X, Y)    rhos.append(rho)plt.hist(rhos)plt.show()

enter image description here

As you can see, the correlation coefficients are generally near 0.2, but for any given sample, the correlation will most likely not be 0.2 exactly.