Aim: randomly generate \$n\$ numbers between \$a\$ and \$b\$ with (roughly) mean \$m\$.
Problem: The simple beginner's code I wrote becomes very inefficient as \$m\$ moves away from \$\frac{a + b}{2}\$.
import random
import numpy as np
import time
low = 0
high = 100
data = 50
def lst_w_mean(target):
HT = [x for x in range(low, high+1)]
target_mean = 0
while target_mean <> int(target):
outcomes = [random.choice(HT) for x in range(data)]
target_mean = np.mean(outcomes)
return outcomes
t1 = time.clock()
data_new = lst_w_mean(54)
print "elapsed time: ", time.clock() - t1
print data_new
t1 = time.clock()
data_new = lst_w_mean(32)
print "elapsed time: ", time.clock() - t1
print data_new
Works pretty decently for means close to the the mean of HT (e.g., means between between 40 and 60). Takes for ever (without generating a run-time error) beyond that narrow range (roughly 28 minutes for mean =30)
%run
elapsed time for mean = 54: 0.00734800000009
[46, 82, 6, 0, 90, 73, 99, 2, 34, 88, 51, 14, 89, 72, 40, 8, 97, 44, 46, 45, 89, 10, 22, 52, 96, 98, 43, 52, 59, 74, 52, 54, 64, 66, 21, 71, 92, 34, 76, 33, 26, 36, 53, 74, 64, 85, 59, 26, 69, 24]
elapsed time for mean = 30: 1695.210358
[67, 43, 30, 78, 67, 33, 13, 24, 27, 1, 2, 4, 4, 2, 49, 1, 41, 16, 6, 9, 6, 90, 1, 31, 52, 91, 5, 11, 4, 94, 2, 74, 60, 38, 8, 2, 10, 20, 92, 25, 17, 36, 12, 9, 7, 22, 16, 72, 44, 32]
Reformulating HT with something like:
if target < 40:
high = 70
etc., so that the target mean is closer to the mean of HT does not generate realistic data besides being an ad hoc solution. And yes, I would like what I call "realistic data" to have a "high" standard deviation perhaps with a number of "outliers".
The process speeds up significantly as the value of data
diminishes, so maybe the better approach is to try to use recursion with smaller data
values (sort of like in merge-sort)?
2 Answers 2
Review
The official Python style-guide recommends using lower_case
for variables (your variable HT
violates this).
I would make all relevant parameters actual parameters of the function. This makes it a lot easier to test.
The main code of a module should be wrapped in a if __name__ == "__main__":
guard. This allows you to do in another script:
from random_with_mean import lst_w_mean
without executing the tests.
I would try to find a different name for the function, without too many abbreviations.
Alternative approach
I would use a completely different approach to this.
First, note that the Poisson distribution is a positive-semidefinite distribution for integers. If we want to generate a integers in the range (0, 100)
(similar to your example, but note the off-by-one error), with mean 50, We can just take a np.random.poisson(50)
(which I will call P(50)
from now on), ensuring to throw away all values which are larger than 100 and redraw them.
This also works when generating values in the range (0, 100)
but with mean 10. Only when the mean becomes larger than the half-way point do we start getting into trouble. Suddenly we are starting to loose a larger and larger part of the tail to the upper boundary. To avoid this, we can just swap the problem around and declare the upper boundary to be zero and generate the values as high - P(high - target)
, so we generate how far the value is away from the boundary.
Similarly, when the lower bound is not zero, we need to shift the values upwards and generate low + P(target - low)
.
An implementation of this is:
import numpy as np
def lst_w_mean(low, high, num, target):
swap = target > (high - low) / 2
out = []
to_generate = num
while to_generate:
if swap:
x = high - np.random.poisson(high - target, size=to_generate)
x = x[low <= x] # x can't be larger than high by construction
else:
x = low + np.random.poisson(target - low, size=to_generate)
x = x[x < high] # x can't be smaller than low by construction
out.extend(x)
to_generate = num - len(out)
return out
if __name__ == "__main__":
print np.mean(lst_w_mean(0, 100, 50, 50))
print np.mean(lst_w_mean(0, 100, 50, 10))
print np.mean(lst_w_mean(0, 100, 50, 70))
Note that I always generate multiple values (enough to fill the out
list if all were inside the boundaries) at the same time, then throw away all values outside of the boundaries and loop until the out
list is the right size. Most of the time only one iteration is needed, because it can only go outside of the boundary on the side away from the closer boundary.
Note that the variance of the Poisson distribution is the same as its mean and so its standard deviation is the square root of that. This means that you will get some outliers, but the distribution is far from uniform over your range.
To visualize this, here are three different samples (with n=5000), one with mean 10, 50, 75 and 99 each using the Poisson distribution:
As an alternative, you could take the Beta distribution, which is only defined on the interval [0, 1], which helps us a lot here, because we don't need to take care of edge effects, we only need to shift and rescale our values. Given a mean and sigma (I took sigma to be an arbitrary value), we can derive the necessary parameters \$\alpha\$ and \$\beta\,ドル as given as alternative parametrization on wikipedia:
\$\alpha = \mu \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right),\quad \beta = (1 - \mu) \left(\frac{\mu(1 - \mu)}{\sigma^2} - 1\right)\$
which only holds when \$\sigma^2 < \mu(1 - \mu)\$.
With this, I ended up with the following code:
from __future__ import division
def lst_beta(low, high, num, mean):
diff = high - low
mu = (mean - low) / diff
sigma = mu * (1 - mu)
c = 1 / sigma - 1
a, b = mu * c, (1 - mu) * c
return low + diff * np.random.beta(a, b, size=num)
Visualized for \$\mu = 10, 50, 75, 99\$ this looks like this:
In comparison, this is what @Peilonrayz's algorithm produces for these means:
Edit:
Here is the code to generate the graphs:
import matplotlib.pyplot as plt
#function definitions here
def lst_w_mean(low, high, num, target):
...
plt.figure()
means = 10, 50, 75, 99
for i, mean in enumerate(means, 1):
plt.subplot(len(means), 1, i)
l = lst_w_mean(0, 100, 5000, mean)
plt.hist(l, bins=100, range=(0, 100))
plt.show()
-
\$\begingroup\$ With your alternate approach, I would be tempted to use a normal distribution with trimmed tails for target means near the middle of the range, rather than only poisson, as trimming changes the mean. Something like lower third poisson, middle third normal, upper third flipped poisson \$\endgroup\$Caleth– Caleth2016年12月14日 17:15:03 +00:00Commented Dec 14, 2016 at 17:15
-
\$\begingroup\$ There are probably multiple useful distributions to choose from. Working out the error should be analytically tractable, all it requires is looking at CDFs at the cutoff points \$\endgroup\$Caleth– Caleth2016年12月14日 17:19:48 +00:00Commented Dec 14, 2016 at 17:19
-
\$\begingroup\$ Thank you, works and fast. However, yields very low standard deviations (between 4 and 6). I am looking to generate data with stdev over 10, preferably higher. I commented above about this worry regarding Gaussian or other tailored distributions. \$\endgroup\$user2738815– user27388152016年12月15日 04:39:18 +00:00Commented Dec 15, 2016 at 4:39
-
\$\begingroup\$ @user2738815 Yes, that is the disadvantage of the poisson distribution, as noted in my last paragraph. A gaussian would have the advantage that you can choose the std dev, at the price of manually having to deal with the edge effects. I will think if I can find a better distribution. \$\endgroup\$Graipher– Graipher2016年12月15日 07:54:14 +00:00Commented Dec 15, 2016 at 7:54
-
1\$\begingroup\$ Yes, this is excellent! However, initially, the code generated a "divison by zero" error when called with, say:
lst_beta(0, 100, 164, 70)
, so I patched it by changing the second line of the function to:mu = (mean - low) / float(diff)
. (Might not be the best way to patch, but it works.) Here is a plot where an actual distribution I have (blue), and a distribution generated by the code (green) with the same mean is plotted together. Impressive. \$\endgroup\$user2738815– user27388152016年12月15日 20:30:07 +00:00Commented Dec 15, 2016 at 20:30
If you want to keep your algorithm, then you should:
- Take;
low
,high
, anddata
, as arguments tolst_w_mean
. - Use
!=
not<>
. - Upgrade to Python 3 to have the option to remove the reliance on NumPy. With the new
statistics.mean
Your algorithm reminds me of the infamous Bogosort. You're generating a bunch of random numbers and hoping for the best. This is clearly not going to be performant.
First off I'd recommend that you create a helper function. This is as thinking of algorithms as \0ドル-\text{stop}\$ with a step of \1ドル\$ is easier than thinking of \$\text{start}-\text{stop}\$ with a step of \$\text{step}\$. To do this you can perform a transform on the output of you function so that you multiply by the step, and add by the start. But you should take into account that some means are not possible with certain steps. Finally if you make your input work like a Python range, then it'll be familiar to other Python programmers. And so you could get a decorator like:
from functools import wraps
from math import copysign
def normalize_input(fn):
@wraps(fn)
def wrapper(mean, length, start, stop=None, step=None):
if stop is None:
stop = start
start = 0
if step is None:
step = 1
if not (start <= mean * copysign(1, step) < stop):
raise ValueError('mean is not within distribution range.')
if mean * length % step != 0:
raise ValueError('unreachable mean.')
stop -= start
mean -= start
if step != 1:
stop /= step
mean /= step
numbers = fn(mean, length, stop)
for i, num in enumerate(numbers):
numbers[i] = num * step + start
return numbers
return wrapper
This allows you to define functions that only take a mean, length and stop. And so could make your code:
@normalize_input
def list_with_mean(mean, length, stop):
HT = [x for x in range(stop+1)]
target_mean = 0
while target_mean != mean:
outcomes = [random.choice(HT) for x in range(length)]
target_mean = sum(outcomes)/len(outcomes)
return outcomes
numbers = list_with_mean(50, 100, 25, 75)
Which will generate \100ドル\$ random numbers with a mean of \50ドル\,ドル and with numbers ranging from \25ドル-75\$.
After this you want to look into a better way to generate these random numbers. At first I used an algorithm that generates \$n\$ random numbers that add to a number. As then we can use an algorithm documented by David Schwartz. To prove that we can use this algorithm rather than the one that you are using, we have two of the numbers in the mean equation:
$$\text{mean} = \frac{\Sigma_{i=0}^{\text{len}(n)}(n_i)}{\text{len}(n)}$$
$$\Sigma_{i=0}^{\text{len}(n)}(n_i) = \text{mean} * \text{len}(n)$$
And so we can quickly implement the algorithm with:
from random import randrange
from itertools import tee
# Itertools Recipe
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = tee(iterable)
next(b, None)
return zip(a, b)
def fixed_sum(total, length):
for a, b in pairwise(sorted([0, total] + [randrange(total) for _ in range(length - 1)])):
yield b - a
def bias_mean(lower=False):
def bias_mean_inner(fn):
@wraps(fn)
def wrapper(mean, length, stop):
flip_mean = (mean > stop / 2) is lower
if flip_mean:
mean = stop - mean
numbers = fn(mean, length, stop)
if flip_mean:
for i, num in enumerate(numbers):
numbers[i] = stop - num
return numbers
return wrapper
return bias_mean_inner
@normalize_input
@bias_mean(True)
def by_sum(mean, length, stop):
numbers = list(fixed_sum(int(mean * length), length))
add = 0
while True:
for i in range(0, length):
num = numbers[i]
if num >= stop:
add += (num - stop)
numbers[i] = stop
else:
if add + num > stop:
n = stop
add -= stop - num
else:
n = num + add
add = 0
numbers[i] = n
if add == 0:
break
return numbers
As pointed out by @Graipher this algorithm isn't very good to generate the random numbers. And so instead of using the above algorithm, you can use one that adds a random amount to each number, as long as it doesn't exceed the maximum size. To do this you can create an amount variable that you subtract from, when you generate a random number. You then want to loop through you list and pick a random number to add to the current item. The domain of this random item is \0ドル-\min(\text{amount}, \text{stop} - \text{item})\$. This is so that it doesn't exceed the maximum number size, and all the items add to the total we want. This can get you the code:
@normalize_input
@bias_mean(False)
def by_sum_distribution(mean, length, stop):
numbers = [0] * length
amount = length * mean
while amount:
for i in range(length):
n = randrange(min(amount, stop - numbers[i]) + 1)
numbers[i] += n
amount -= n
return numbers
This has a uniform output when the \$\text{mean} = \frac{\text{stop}}{2}\,ドル but otherwise the beta function is much better.
-
\$\begingroup\$ Thanks, I am studying/testing your answer, but am hitting a snag. When I make the call numbers_with_mean(x, 50, 1, 100, 1) using a loop with x ranging from 30 to 70, output time for each step takes around .0002 seconds until a certain mean -- then the program gets stuck in run-time (no error message, stuck at different mean each time). Also, where did you get the idea that my code was an "algorithm" :) I am just a beginner who is asking for help. \$\endgroup\$user2738815– user27388152016年12月15日 04:08:40 +00:00Commented Dec 15, 2016 at 4:08
-
1\$\begingroup\$ @user2738815 Yeah, I messed up. I was transforming the array whilst modifying it, and so if you randomly got
stop
at the end you'd infinitely loop. It should now be fixed, this was also the cause of the odd means. If you look at a definition of algorithm, "an algorithm is a self-contained step-by-step set of operations to be performed" - Wikipedia, it's what your code is. \$\endgroup\$2016年12月15日 10:06:50 +00:00Commented Dec 15, 2016 at 10:06 -
\$\begingroup\$ Sincerely grateful for the time you are putting into this. Yes, since I have been teaching logic for years, I am familiar with that wider sense of "algorithm" according to which even well-written pizza recipes are one :) However, I do not think they will be taught in a CS course called "Algorithms". \$\endgroup\$user2738815– user27388152016年12月15日 15:33:45 +00:00Commented Dec 15, 2016 at 15:33
Explore related questions
See similar questions with these tags.
target
? \$\endgroup\$print
statements 2.x... \$\endgroup\$