Generating random correlated x and y points using Numpy

Question 1

I'd like to generate correlated arrays of x and y coordinates, in order to test various matplotlib plotting approaches, but I'm failing somewhere, because I can't get numpy.random.multivariate_normal to give me the samples I want. Ideally, I want my x values between -0.51, and 51.2, and my y values between 0.33 and 51.6 (though I suppose equal ranges would be OK, since I can constrain the plot afterwards), but I'm not sure what mean (0, 0?) and covariance values I should be using to get these samples from the function.

Question 2

As the name implies numpy.random.multivariate_normal generates normal distributions, this means that there is a non-null probability of finding points outside of any given interval. You can generate correlated uniform distributions but this a little more convoluted. Take a look here for two possible methods.

If you want to go with the normal distribution you can set up the sigmas so that your half-interval correspond to 3 standard deviations (you can also filter out the bad points if needed). In this way you will have ~99% of your points inside your interval, ex:

import numpy as np
from matplotlib.pyplot import scatter
xx = np.array([-0.51, 51.2])
yy = np.array([0.33, 51.6])
means = [xx.mean(), yy.mean()] 
stds = [xx.std() / 3, yy.std() / 3]
corr = 0.8 # correlation
covs = [[stds[0]**2 , stds[0]*stds[1]*corr], 
 [stds[0]*stds[1]*corr, stds[1]**2]] 
m = np.random.multivariate_normal(means, covs, 1000).T
scatter(m[0], m[1])

enter image description here

Question 3

Would it be possible to implement a variation in order to create a correlated variable (i.e. numpy array) from an existing numpy array which has its own average and stdev?

Question 4

Not sure if it's my mistake, but I had to do m[:,0] and m[:,1] for the scatterplot to render correctly.

Question 5

@KetilMalde: You likely forgot the .T after the function call. multivariate_normal generates an array with a row per observation (1000 rows), and a column per variable (2 columns). .T transposes the array and the shape becomes (2, 1000).

Question 6

This is correct. But perhaps it is easier to understand the positioning and scaling of the normal marginals compared to the given bounds if you simply write, say, means[0]=(xx[0]+xx[1])/2 and stds[0]=(xx[1]-xx[0])/6. I was confused a bit why you need the sample standard deviations of xx and yy. Then I realized that for two data points it simplies to the half-width of the given range (if .std() uses the biased sample variance definition with the 1/N factor instead of 1/(N-1), which it does.)

Question 7

can use Cholesky-decomposition to simulate correlated random variables given a correlation matrix - like here

import numpy as np
import matplotlib.pyplot as plt
# desired correlation matrix
cor_matrix = np.array([[1.0, 0.8],
 [0.8, 1.0]])
L = np.linalg.cholesky(cor_matrix)
# build some signals that will result in the desired correlation matrix
X = L.dot(np.random.normal(0,1,(2,1000))) # the more the sample (1000) the better
# estimate their correlation matrix
print(np.corrcoef(X))
print(X.shape)
plt.scatter(X[0], X[1])
plt.show()

enter image description here

elyase 41.2k12 gold badges121 silver badges123 bronze badges · Accepted Answer · 2013-09-08 13:49:49Z

As the name implies numpy.random.multivariate_normal generates normal distributions, this means that there is a non-null probability of finding points outside of any given interval. You can generate correlated uniform distributions but this a little more convoluted. Take a look here for two possible methods.

If you want to go with the normal distribution you can set up the sigmas so that your half-interval correspond to 3 standard deviations (you can also filter out the bad points if needed). In this way you will have ~99% of your points inside your interval, ex:

import numpy as np
from matplotlib.pyplot import scatter
xx = np.array([-0.51, 51.2])
yy = np.array([0.33, 51.6])
means = [xx.mean(), yy.mean()] 
stds = [xx.std() / 3, yy.std() / 3]
corr = 0.8 # correlation
covs = [[stds[0]**2 , stds[0]*stds[1]*corr], 
 [stds[0]*stds[1]*corr, stds[1]**2]] 
m = np.random.multivariate_normal(means, covs, 1000).T
scatter(m[0], m[1])

enter image description here

Would it be possible to implement a variation in order to create a correlated variable (i.e. numpy array) from an existing numpy array which has its own average and stdev?
Not sure if it's my mistake, but I had to do m[:,0] and m[:,1] for the scatterplot to render correctly.
@KetilMalde: You likely forgot the .T after the function call. multivariate_normal generates an array with a row per observation (1000 rows), and a column per variable (2 columns). .T transposes the array and the shape becomes (2, 1000).
This is correct. But perhaps it is easier to understand the positioning and scaling of the normal marginals compared to the given bounds if you simply write, say, means[0]=(xx[0]+xx[1])/2 and stds[0]=(xx[1]-xx[0])/6. I was confused a bit why you need the sample standard deviations of xx and yy. Then I realized that for two data points it simplies to the half-width of the given range (if .std() uses the biased sample variance definition with the 1/N factor instead of 1/(N-1), which it does.)

CollectivesTM on Stack Overflow

Generating random correlated x and y points using Numpy

2 Answers 2

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

2 Answers 2

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related