Description: I am given a list of possible values, each of which has an associated probability of occurrence.

How could I improve the algorithm that randomly generates a value based on the given distribution? I want to let the number of possible values grow very large. I essentially use the cumulative probabilities in my solution.

import random
import numpy as np

N = 100
possible_values = list(range(N))
# Random probability vector summing to 1.
probabilities = np.random.dirichlet(np.ones(N)).tolist()

def sampleValue(possible_values, probabilities):
    U = random.random()
    cum_probs = np.cumsum(probabilities).tolist()
    # Return the first value whose cumulative probability exceeds U.
    for i, c in enumerate(cum_probs):
        if U < c:
            return possible_values[i]
    # Fallback in case floating-point rounding leaves the last cumulative sum below U.
    return possible_values[-1]

testValues = [sampleValue(possible_values, probabilities) for x in range(100000)]
#testing
print('Sample test')
print('Theoretical', probabilities[10])
print('Simulated', testValues.count(10) / 100000)
asked Sep 28, 2017 at 18:08

2 Answers


Is there a reason why you start off with numpy, then switch to lists and a for loop?

If you are given lists, you can convert them to numpy arrays and refactor your sampleValue() function to work entirely in numpy. If you generate the values yourself, keep them as numpy arrays instead of converting them to lists, and refactor sampleValue() the same way.

probabilities = np.random.dirichlet(np.ones(N))  # keep it as a numpy array

def sampleValue2(possible_values, probabilities):
    U = random.random()
    # argmax on a boolean array returns the index of the first True,
    # i.e. the first cumulative probability that exceeds U.
    return possible_values[np.argmax(np.cumsum(probabilities) > U)]

This reduces the simulation time from

1.18 s ± 9.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

to

486 ms ± 8.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Then, if you know in advance how many random numbers you want (call it num_test), you can ditch the list comprehension by creating a probability matrix:

cum_probs = np.tile(np.cumsum(probabilities), (num_test, 1))

Then apply the comparison from the rewritten function over the entire matrix at once, instead of row by row:

indices = np.argmax(U < cum_probs, axis=1)
return np.array(possible_values)[indices]

Putting it all together into a new function:

def sampleValue3(possible_values, probabilities, num_test=100000):
    # One uniform draw per requested sample, shape (num_test, 1).
    U = np.random.random(size=(num_test, 1))
    # Repeat the cumulative distribution once per row of U.
    cum_probs = np.tile(np.cumsum(probabilities), (num_test, 1))
    # First column in each row where the cumulative probability exceeds U.
    indices = np.argmax(U < cum_probs, axis=1)
    return np.array(possible_values)[indices]

And that gives me a simulation time of:

65.9 ms ± 502 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Since you are just doing a lot of numeric operations, your algorithm could be sped up a lot just by keeping everything within numpy arrays/matrices.
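Along the same lines, here is a sketch of one more numpy-only variant (not from the benchmarks above; the name sampleValue4 and the num_test parameter are illustrative). np.searchsorted performs a binary search into the cumulative distribution, so each draw costs O(log N) instead of scanning a whole row, and no tiled matrix is needed at all:

def sampleValue4(possible_values, probabilities, num_test=100000):
    cum_probs = np.cumsum(probabilities)  # computed once, shape (N,)
    U = np.random.random(size=num_test)
    # For each U, find the first index whose cumulative probability
    # strictly exceeds it: a binary search per draw.
    indices = np.searchsorted(cum_probs, U, side='right')
    # Guard against floating-point rounding pushing an index past the end.
    indices = np.minimum(indices, len(cum_probs) - 1)
    return np.asarray(possible_values)[indices]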

Alternatively, from your problem definition, I don't see any reason why you can't just use numpy's random choice:

def sampleValue5(possible_values, probabilities):
    # Let numpy draw all 100000 samples directly from the distribution.
    return np.random.choice(possible_values, p=probabilities, size=100000)

which is the fastest of these at:

36.8 ms ± 315 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
answered Sep 29, 2017 at 2:36

See Walker's alias method for sampling random objects with different probabilities, e.g. the Python class Walkerrandom (old, 2008).
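For context, here is a minimal sketch of the alias method (Vose's variant of Walker's construction), written from the standard published algorithm rather than taken from the linked Walkerrandom class. After O(N) preprocessing it samples in O(1) per draw, which is what makes it attractive as the number of possible values grows very large:

import random

def build_alias_table(probabilities):
    # O(N) preprocessing (Vose's method): split the distribution into
    # n equal-size buckets, each mixing at most two outcomes.
    n = len(probabilities)
    scaled = [p * n for p in probabilities]
    prob = [1.0] * n          # chance of keeping bucket i's own outcome
    alias = list(range(n))    # the alternative outcome stored in bucket i
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        # The large outcome donates the mass needed to fill bucket s.
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(possible_values, prob, alias):
    # O(1) per draw: pick a bucket uniformly, then one of its two outcomes.
    i = random.randrange(len(prob))
    return possible_values[i if random.random() < prob[i] else alias[i]]

Usage, with the question's possible_values and probabilities:

prob, alias = build_alias_table(probabilities)
samples = [alias_sample(possible_values, prob, alias) for _ in range(100000)]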

answered Apr 29, 2018 at 10:17
