Randomly generate a list with predetermined mean

Question 1

Aim: randomly generate \$n\$ numbers between \$a\$ and \$b\$ with (roughly) mean \$m\$.

Problem: The simple beginner's code I wrote becomes very inefficient as \$m\$ moves away from \$\frac{a + b}{2}\$.

import random
import numpy as np
import time
low = 0
high = 100
data = 50
def lst_w_mean(target):
 HT = [x for x in range(low, high+1)]
 target_mean = 0
 while target_mean <> int(target):
 outcomes = [random.choice(HT) for x in range(data)]
 target_mean = np.mean(outcomes)
 return outcomes
t1 = time.clock()
data_new = lst_w_mean(54)
print "elapsed time: ", time.clock() - t1
print data_new
t1 = time.clock()
data_new = lst_w_mean(32)
print "elapsed time: ", time.clock() - t1
print data_new

Works pretty decently for means close to the the mean of HT (e.g., means between between 40 and 60). Takes for ever (without generating a run-time error) beyond that narrow range (roughly 28 minutes for mean =30)

%run 
elapsed time for mean = 54: 0.00734800000009
[46, 82, 6, 0, 90, 73, 99, 2, 34, 88, 51, 14, 89, 72, 40, 8, 97, 44, 46, 45, 89, 10, 22, 52, 96, 98, 43, 52, 59, 74, 52, 54, 64, 66, 21, 71, 92, 34, 76, 33, 26, 36, 53, 74, 64, 85, 59, 26, 69, 24]
elapsed time for mean = 30: 1695.210358
[67, 43, 30, 78, 67, 33, 13, 24, 27, 1, 2, 4, 4, 2, 49, 1, 41, 16, 6, 9, 6, 90, 1, 31, 52, 91, 5, 11, 4, 94, 2, 74, 60, 38, 8, 2, 10, 20, 92, 25, 17, 36, 12, 9, 7, 22, 16, 72, 44, 32]

Reformulating HT with something like:

if target < 40:
 high = 70

etc., so that the target mean is closer to the mean of HT does not generate realistic data besides being an ad hoc solution. And yes, I would like what I call "realistic data" to have a "high" standard deviation perhaps with a number of "outliers".

The process speeds up significantly as the value of data diminishes, so maybe the better approach is to try to use recursion with smaller data values (sort of like in merge-sort)?

Question 2

Do you have any other requirement for the variables? In other words, could you just sample from a gaussian with mean target?

Question 3

which version of python is used?

Question 4

@Alex Judging from the print statements 2.x...

Question 5

Yes, 2.7 @Graipher Sorry, neither a programmer nor a statistician--just a philosophy teacher who is interested in these things. So I am not sure exactly what you mean by "sampling from a Gaussian". If it means generating samples that have Gaussian distributions, and checking their means, that would sort of go against my intention of accepting large variation and outliers. I noticed that python's random module had some tools that seem to do things like that, is that what you had in mind?

Question 6

@200_success If you can please explain what was wrong with my edit (since you deleted it), I will refrain from repeating the error. Such edits are common in Stackoverflow--mine was intended to help others.

Question 7

Review

The official Python style-guide recommends using lower_case for variables (your variable HT violates this).

I would make all relevant parameters actual parameters of the function. This makes it a lot easier to test.

The main code of a module should be wrapped in a if __name__ == "__main__": guard. This allows you to do in another script:

from random_with_mean import lst_w_mean

without executing the tests.

I would try to find a different name for the function, without too many abbreviations.

Alternative approach

I would use a completely different approach to this.

First, note that the Poisson distribution is a positive-semidefinite distribution for integers. If we want to generate a integers in the range (0, 100) (similar to your example, but note the off-by-one error), with mean 50, We can just take a np.random.poisson(50) (which I will call P(50) from now on), ensuring to throw away all values which are larger than 100 and redraw them.

This also works when generating values in the range (0, 100) but with mean 10. Only when the mean becomes larger than the half-way point do we start getting into trouble. Suddenly we are starting to loose a larger and larger part of the tail to the upper boundary. To avoid this, we can just swap the problem around and declare the upper boundary to be zero and generate the values as high - P(high - target), so we generate how far the value is away from the boundary.

Similarly, when the lower bound is not zero, we need to shift the values upwards and generate low + P(target - low).

An implementation of this is:

import numpy as np
def lst_w_mean(low, high, num, target):
 swap = target > (high - low) / 2
 out = []
 to_generate = num
 while to_generate:
 if swap:
 x = high - np.random.poisson(high - target, size=to_generate)
 x = x[low <= x] # x can't be larger than high by construction
 else:
 x = low + np.random.poisson(target - low, size=to_generate)
 x = x[x < high] # x can't be smaller than low by construction
 out.extend(x)
 to_generate = num - len(out)
 return out
if __name__ == "__main__":
 print np.mean(lst_w_mean(0, 100, 50, 50))
 print np.mean(lst_w_mean(0, 100, 50, 10))
 print np.mean(lst_w_mean(0, 100, 50, 70))

Note that I always generate multiple values (enough to fill the out list if all were inside the boundaries) at the same time, then throw away all values outside of the boundaries and loop until the out list is the right size. Most of the time only one iteration is needed, because it can only go outside of the boundary on the side away from the closer boundary.

Note that the variance of the Poisson distribution is the same as its mean and so its standard deviation is the square root of that. This means that you will get some outliers, but the distribution is far from uniform over your range.

To visualize this, here are three different samples (with n=5000), one with mean 10, 50, 75 and 99 each using the Poisson distribution:

Poisson distribution

As an alternative, you could take the Beta distribution, which is only defined on the interval [0, 1], which helps us a lot here, because we don't need to take care of edge effects, we only need to shift and rescale our values. Given a mean and sigma (I took sigma to be an arbitrary value), we can derive the necessary parameters \$\alpha\$ and \$\beta\,ドル as given as alternative parametrization on wikipedia:

\$\alpha = \mu \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right),\quad \beta = (1 - \mu) \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right)\$

which only holds when \$\sigma^2 < \mu(1 - \mu)\$.

With this, I ended up with the following code:

from __future__ import division
def lst_beta(low, high, num, mean):
 diff = high - low
 mu = (mean - low) / diff
 sigma = mu * (1 - mu)
 c = 1 / sigma - 1
 a, b = mu * c, (1 - mu) * c
 return low + diff * np.random.beta(a, b, size=num)

Visualized for \$\mu = 10, 50, 75, 99\$ this looks like this:

Beta distribution

In comparison, this is what @Peilonrayz's algorithm produces for these means:

Peilonrayz

Edit:

Here is the code to generate the graphs:

import matplotlib.pyplot as plt
#function definitions here
def lst_w_mean(low, high, num, target):
 ...
plt.figure()
means = 10, 50, 75, 99
for i, mean in enumerate(means, 1):
 plt.subplot(len(means), 1, i)
 l = lst_w_mean(0, 100, 5000, mean)
 plt.hist(l, bins=100, range=(0, 100))
plt.show()

Question 8

With your alternate approach, I would be tempted to use a normal distribution with trimmed tails for target means near the middle of the range, rather than only poisson, as trimming changes the mean. Something like lower third poisson, middle third normal, upper third flipped poisson

Question 9

There are probably multiple useful distributions to choose from. Working out the error should be analytically tractable, all it requires is looking at CDFs at the cutoff points

Question 10

Thank you, works and fast. However, yields very low standard deviations (between 4 and 6). I am looking to generate data with stdev over 10, preferably higher. I commented above about this worry regarding Gaussian or other tailored distributions.

Question 11

@user2738815 Yes, that is the disadvantage of the poisson distribution, as noted in my last paragraph. A gaussian would have the advantage that you can choose the std dev, at the price of manually having to deal with the edge effects. I will think if I can find a better distribution.

Question 12

Yes, this is excellent! However, initially, the code generated a "divison by zero" error when called with, say: lst_beta(0, 100, 164, 70), so I patched it by changing the second line of the function to: mu = (mean - low) / float(diff). (Might not be the best way to patch, but it works.) Here is a plot where an actual distribution I have (blue), and a distribution generated by the code (green) with the same mean is plotted together. Impressive.

Question 13

If you want to keep your algorithm, then you should:

Take; low, high, and data, as arguments to lst_w_mean.
Use != not <>.
Upgrade to Python 3 to have the option to remove the reliance on NumPy. With the new statistics.mean

Your algorithm reminds me of the infamous Bogosort. You're generating a bunch of random numbers and hoping for the best. This is clearly not going to be performant.

First off I'd recommend that you create a helper function. This is as thinking of algorithms as \0ドル-\text{stop}\$ with a step of \1ドル\$ is easier than thinking of \$\text{start}-\text{stop}\$ with a step of \$\text{step}\$. To do this you can perform a transform on the output of you function so that you multiply by the step, and add by the start. But you should take into account that some means are not possible with certain steps. Finally if you make your input work like a Python range, then it'll be familiar to other Python programmers. And so you could get a decorator like:

from functools import wraps
from math import copysign
def normalize_input(fn):
 @wraps(fn)
 def wrapper(mean, length, start, stop=None, step=None):
 if stop is None:
 stop = start
 start = 0
 if step is None:
 step = 1
 if not (start <= mean * copysign(1, step) < stop):
 raise ValueError('mean is not within distribution range.')
 if mean * length % step != 0:
 raise ValueError('unreachable mean.')
 stop -= start
 mean -= start
 if step != 1:
 stop /= step
 mean /= step
 numbers = fn(mean, length, stop)
 for i, num in enumerate(numbers):
 numbers[i] = num * step + start
 return numbers
 return wrapper

This allows you to define functions that only take a mean, length and stop. And so could make your code:

@normalize_input
def list_with_mean(mean, length, stop):
 HT = [x for x in range(stop+1)]
 target_mean = 0
 while target_mean != mean:
 outcomes = [random.choice(HT) for x in range(length)]
 target_mean = sum(outcomes)/len(outcomes)
 return outcomes
numbers = list_with_mean(50, 100, 25, 75)

Which will generate \100ドル\$ random numbers with a mean of \50ドル\,ドル and with numbers ranging from \25ドル-75\$.

After this you want to look into a better way to generate these random numbers. At first I used an algorithm that generates \$n\$ random numbers that add to a number. As then we can use an algorithm documented by David Schwartz. To prove that we can use this algorithm rather than the one that you are using, we have two of the numbers in the mean equation:

$$\text{mean} = \frac{\Sigma_{i=0}^{\text{len}(n)}(n_i)}{\text{len}(n)}$$
$$\Sigma_{i=0}^{\text{len}(n)}(n_i) = \text{mean} * \text{len}(n)$$

And so we can quickly implement the algorithm with:

from random import randrange
from itertools import tee
# Itertools Recipe 
def pairwise(iterable):
 "s -> (s0,s1), (s1,s2), (s2, s3), ..."
 a, b = tee(iterable)
 next(b, None)
 return zip(a, b)
def fixed_sum(total, length):
 for a, b in pairwise(sorted([0, total] + [randrange(total) for _ in range(length - 1)])):
 yield b - a
def bias_mean(lower=False):
 def bias_mean_inner(fn):
 @wraps(fn)
 def wrapper(mean, length, stop):
 flip_mean = (mean > stop / 2) is lower
 if flip_mean:
 mean = stop - mean
 numbers = fn(mean, length, stop)
 if flip_mean:
 for i, num in enumerate(numbers):
 numbers[i] = stop - num
 return numbers
 return wrapper
 return bias_mean_inner
@normalize_input
@bias_mean(True)
def by_sum(mean, length, stop):
 numbers = list(fixed_sum(int(mean * length), length))
 add = 0
 while True:
 for i in range(0, length):
 num = numbers[i]
 if num >= stop:
 add += (num - stop)
 numbers[i] = stop
 else:
 if add + num > stop:
 n = stop
 add -= stop - num
 else:
 n = num + add
 add = 0
 numbers[i] = n
 if add == 0:
 break
 return numbers

As pointed out by @Graipher this algorithm isn't very good to generate the random numbers. And so instead of using the above algorithm, you can use one that adds a random amount to each number, as long as it doesn't exceed the maximum size. To do this you can create an amount variable that you subtract from, when you generate a random number. You then want to loop through you list and pick a random number to add to the current item. The domain of this random item is \0ドル-\min(\text{amount}, \text{stop} - \text{item})\$. This is so that it doesn't exceed the maximum number size, and all the items add to the total we want. This can get you the code:

@normalize_input
@bias_mean(False)
def by_sum_distribution(mean, length, stop):
 numbers = [0] * length
 amount = length * mean
 while amount:
 for i in range(length):
 n = randrange(min(amount, stop - numbers[i]) + 1)
 numbers[i] += n
 amount -= n
 return numbers

This has a uniform output when the \$\text{mean} = \frac{\text{stop}}{2}\,ドル but otherwise the beta function is much better.

Question 14

Thanks, I am studying/testing your answer, but am hitting a snag. When I make the call numbers_with_mean(x, 50, 1, 100, 1) using a loop with x ranging from 30 to 70, output time for each step takes around .0002 seconds until a certain mean -- then the program gets stuck in run-time (no error message, stuck at different mean each time). Also, where did you get the idea that my code was an "algorithm" :) I am just a beginner who is asking for help.

Question 15

@user2738815 Yeah, I messed up. I was transforming the array whilst modifying it, and so if you randomly got stop at the end you'd infinitely loop. It should now be fixed, this was also the cause of the odd means. If you look at a definition of algorithm, "an algorithm is a self-contained step-by-step set of operations to be performed" - Wikipedia, it's what your code is.

Question 16

Sincerely grateful for the time you are putting into this. Yes, since I have been teaching logic for years, I am familiar with that wider sense of "algorithm" according to which even well-written pizza recipes are one :) However, I do not think they will be taught in a CS course called "Algorithms".

Graipher Graipher 41.6k7 gold badges70 silver badges134 bronze badges · Accepted Answer · 2016-12-14 16:52:28Z

Review

The official Python style-guide recommends using lower_case for variables (your variable HT violates this).

I would make all relevant parameters actual parameters of the function. This makes it a lot easier to test.

The main code of a module should be wrapped in a if __name__ == "__main__": guard. This allows you to do in another script:

from random_with_mean import lst_w_mean

without executing the tests.

I would try to find a different name for the function, without too many abbreviations.

Alternative approach

I would use a completely different approach to this.

First, note that the Poisson distribution is a positive-semidefinite distribution for integers. If we want to generate a integers in the range (0, 100) (similar to your example, but note the off-by-one error), with mean 50, We can just take a np.random.poisson(50) (which I will call P(50) from now on), ensuring to throw away all values which are larger than 100 and redraw them.

This also works when generating values in the range (0, 100) but with mean 10. Only when the mean becomes larger than the half-way point do we start getting into trouble. Suddenly we are starting to loose a larger and larger part of the tail to the upper boundary. To avoid this, we can just swap the problem around and declare the upper boundary to be zero and generate the values as high - P(high - target), so we generate how far the value is away from the boundary.

Similarly, when the lower bound is not zero, we need to shift the values upwards and generate low + P(target - low).

An implementation of this is:

import numpy as np
def lst_w_mean(low, high, num, target):
 swap = target > (high - low) / 2
 out = []
 to_generate = num
 while to_generate:
 if swap:
 x = high - np.random.poisson(high - target, size=to_generate)
 x = x[low <= x] # x can't be larger than high by construction
 else:
 x = low + np.random.poisson(target - low, size=to_generate)
 x = x[x < high] # x can't be smaller than low by construction
 out.extend(x)
 to_generate = num - len(out)
 return out
if __name__ == "__main__":
 print np.mean(lst_w_mean(0, 100, 50, 50))
 print np.mean(lst_w_mean(0, 100, 50, 10))
 print np.mean(lst_w_mean(0, 100, 50, 70))

Note that I always generate multiple values (enough to fill the out list if all were inside the boundaries) at the same time, then throw away all values outside of the boundaries and loop until the out list is the right size. Most of the time only one iteration is needed, because it can only go outside of the boundary on the side away from the closer boundary.

Note that the variance of the Poisson distribution is the same as its mean and so its standard deviation is the square root of that. This means that you will get some outliers, but the distribution is far from uniform over your range.

To visualize this, here are three different samples (with n=5000), one with mean 10, 50, 75 and 99 each using the Poisson distribution:

Poisson distribution

As an alternative, you could take the Beta distribution, which is only defined on the interval [0, 1], which helps us a lot here, because we don't need to take care of edge effects, we only need to shift and rescale our values. Given a mean and sigma (I took sigma to be an arbitrary value), we can derive the necessary parameters \$\alpha\$ and \$\beta\,ドル as given as alternative parametrization on wikipedia:

\$\alpha = \mu \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right),\quad \beta = (1 - \mu) \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right)\$

which only holds when \$\sigma^2 < \mu(1 - \mu)\$.

With this, I ended up with the following code:

from __future__ import division
def lst_beta(low, high, num, mean):
 diff = high - low
 mu = (mean - low) / diff
 sigma = mu * (1 - mu)
 c = 1 / sigma - 1
 a, b = mu * c, (1 - mu) * c
 return low + diff * np.random.beta(a, b, size=num)

Visualized for \$\mu = 10, 50, 75, 99\$ this looks like this:

Beta distribution

In comparison, this is what @Peilonrayz's algorithm produces for these means:

Peilonrayz

Edit:

Here is the code to generate the graphs:

import matplotlib.pyplot as plt
#function definitions here
def lst_w_mean(low, high, num, target):
 ...
plt.figure()
means = 10, 50, 75, 99
for i, mean in enumerate(means, 1):
 plt.subplot(len(means), 1, i)
 l = lst_w_mean(0, 100, 5000, mean)
 plt.hist(l, bins=100, range=(0, 100))
plt.show()

With your alternate approach, I would be tempted to use a normal distribution with trimmed tails for target means near the middle of the range, rather than only poisson, as trimming changes the mean. Something like lower third poisson, middle third normal, upper third flipped poisson
There are probably multiple useful distributions to choose from. Working out the error should be analytically tractable, all it requires is looking at CDFs at the cutoff points
Thank you, works and fast. However, yields very low standard deviations (between 4 and 6). I am looking to generate data with stdev over 10, preferably higher. I commented above about this worry regarding Gaussian or other tailored distributions.
@user2738815 Yes, that is the disadvantage of the poisson distribution, as noted in my last paragraph. A gaussian would have the advantage that you can choose the std dev, at the price of manually having to deal with the edge effects. I will think if I can find a better distribution.
Yes, this is excellent! However, initially, the code generated a "divison by zero" error when called with, say: lst_beta(0, 100, 164, 70), so I patched it by changing the second line of the function to: mu = (mean - low) / float(diff). (Might not be the best way to patch, but it works.) Here is a plot where an actual distribution I have (blue), and a distribution generated by the code (green) with the same mean is plotted together. Impressive.

Stack Exchange Network

Randomly generate a list with predetermined mean

2 Answers 2

Review

Alternative approach

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Randomly generate a list with predetermined mean

2 Answers 2

Review

Alternative approach

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions