20

I'd like to generate correlated arrays of x and y coordinates, in order to test various matplotlib plotting approaches, but I'm failing somewhere, because I can't get numpy.random.multivariate_normal to give me the samples I want. Ideally, I want my x values between -0.51, and 51.2, and my y values between 0.33 and 51.6 (though I suppose equal ranges would be OK, since I can constrain the plot afterwards), but I'm not sure what mean (0, 0?) and covariance values I should be using to get these samples from the function.

elyase
41.2k12 gold badges121 silver badges123 bronze badges
asked Sep 8, 2013 at 12:44

2 Answers 2

40

As the name implies numpy.random.multivariate_normal generates normal distributions, this means that there is a non-null probability of finding points outside of any given interval. You can generate correlated uniform distributions but this a little more convoluted. Take a look here for two possible methods.

If you want to go with the normal distribution you can set up the sigmas so that your half-interval correspond to 3 standard deviations (you can also filter out the bad points if needed). In this way you will have ~99% of your points inside your interval, ex:

import numpy as np
from matplotlib.pyplot import scatter
xx = np.array([-0.51, 51.2])
yy = np.array([0.33, 51.6])
means = [xx.mean(), yy.mean()] 
stds = [xx.std() / 3, yy.std() / 3]
corr = 0.8 # correlation
covs = [[stds[0]**2 , stds[0]*stds[1]*corr], 
 [stds[0]*stds[1]*corr, stds[1]**2]] 
m = np.random.multivariate_normal(means, covs, 1000).T
scatter(m[0], m[1])

enter image description here

Peter O.
33.1k14 gold badges86 silver badges97 bronze badges
answered Sep 8, 2013 at 13:49
Sign up to request clarification or add additional context in comments.

4 Comments

Would it be possible to implement a variation in order to create a correlated variable (i.e. numpy array) from an existing numpy array which has its own average and stdev?
Not sure if it's my mistake, but I had to do m[:,0] and m[:,1] for the scatterplot to render correctly.
@KetilMalde: You likely forgot the .T after the function call. multivariate_normal generates an array with a row per observation (1000 rows), and a column per variable (2 columns). .T transposes the array and the shape becomes (2, 1000).
This is correct. But perhaps it is easier to understand the positioning and scaling of the normal marginals compared to the given bounds if you simply write, say, means[0]=(xx[0]+xx[1])/2 and stds[0]=(xx[1]-xx[0])/6. I was confused a bit why you need the sample standard deviations of xx and yy. Then I realized that for two data points it simplies to the half-width of the given range (if .std() uses the biased sample variance definition with the 1/N factor instead of 1/(N-1), which it does.)
2

can use Cholesky-decomposition to simulate correlated random variables given a correlation matrix - like here

import numpy as np
import matplotlib.pyplot as plt
# desired correlation matrix
cor_matrix = np.array([[1.0, 0.8],
 [0.8, 1.0]])
L = np.linalg.cholesky(cor_matrix)
# build some signals that will result in the desired correlation matrix
X = L.dot(np.random.normal(0,1,(2,1000))) # the more the sample (1000) the better
# estimate their correlation matrix
print(np.corrcoef(X))
print(X.shape)
plt.scatter(X[0], X[1])
plt.show()

enter image description here

answered Jan 10, 2024 at 9:51

Comments

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.