Generate correlated data in Python (3.3)

numpy.random.multivariate_normal is the function that you want.

Example:

import numpy as npimport matplotlib.pyplot as pltnum_samples = 400# The desired mean values of the sample.mu = np.array([5.0, 0.0, 10.0])# The desired covariance matrix.r = np.array([        [  3.40, -2.75, -2.00],        [ -2.75,  5.50,  1.50],        [ -2.00,  1.50,  1.25]    ])# Generate the random samples.y = np.random.multivariate_normal(mu, r, size=num_samples)# Plot various projections of the samples.plt.subplot(2,2,1)plt.plot(y[:,0], y[:,1], 'b.')plt.plot(mu[0], mu[1], 'ro')plt.ylabel('y[1]')plt.axis('equal')plt.grid(True)plt.subplot(2,2,3)plt.plot(y[:,0], y[:,2], 'b.')plt.plot(mu[0], mu[2], 'ro')plt.xlabel('y[0]')plt.ylabel('y[2]')plt.axis('equal')plt.grid(True)plt.subplot(2,2,4)plt.plot(y[:,1], y[:,2], 'b.')plt.plot(mu[1], mu[2], 'ro')plt.xlabel('y[1]')plt.axis('equal')plt.grid(True)plt.show()

Result:

enter image description here

See also CorrelatedRandomSamples in the SciPy Cookbook.

python r numpy scipy correlation

If you Cholesky-decompose a covariance matrix C into L L^T, and generate anindependent random vector x, then Lx will be a random vector with covarianceC.

import numpy as npimport matplotlib.pyplot as pltlinalg = np.linalgnp.random.seed(1)num_samples = 1000num_variables = 2cov = [[0.3, 0.2], [0.2, 0.2]]L = linalg.cholesky(cov)# print(L.shape)# (2, 2)uncorrelated = np.random.standard_normal((num_variables, num_samples))mean = [1, 1]correlated = np.dot(L, uncorrelated) + np.array(mean).reshape(2, 1)# print(correlated.shape)# (2, 1000)plt.scatter(correlated[0, :], correlated[1, :], c='green')plt.show()

enter image description here

Reference: See Cholesky decomposition

If you want to generate two series, X and Y, with a particular (Pearson) correlation coefficient (e.g. 0.2):

rho = cov(X,Y) / sqrt(var(X)*var(Y))

you could choose the covariance matrix to be

cov = [[1, 0.2],       [0.2, 1]]

This makes the cov(X,Y) = 0.2, and the variances, var(X) and var(Y) both equal to 1. So rho would equal 0.2.

For example, below we generate pairs of correlated series, X and Y, 1000 times. Then we plot a histogram of the correlation coefficients:

import numpy as npimport matplotlib.pyplot as pltimport scipy.stats as statslinalg = np.linalgnp.random.seed(1)num_samples = 1000num_variables = 2cov = [[1.0, 0.2], [0.2, 1.0]]L = linalg.cholesky(cov)rhos = []for i in range(1000):    uncorrelated = np.random.standard_normal((num_variables, num_samples))    correlated = np.dot(L, uncorrelated)    X, Y = correlated    rho, pval = stats.pearsonr(X, Y)    rhos.append(rho)plt.hist(rhos)plt.show()

enter image description here

As you can see, the correlation coefficients are generally near 0.2, but for any given sample, the correlation will most likely not be 0.2 exactly.

CodeHunter

Generate correlated data in Python (3.3)

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last