Description: I am given a list of possible values, each of which has an associated probability of occurrence.

How could I improve the algorithm that randomly generates a value based on the given distribution? I want to let the number of possible values grow very large. I essentially use the cumulative probabilities in my solution.

import random
import numpy as np

N = 100
possible_values = list(range(N))
# Random probability vector summing to 1.
probabilities = np.random.dirichlet(np.ones(N)).tolist()

def sampleValue(possible_values, probabilities):
    U = random.random()
    cum_probs = np.cumsum(probabilities).tolist()
    # Return the first value whose cumulative probability exceeds U.
    for i, c in enumerate(cum_probs):
        if U < c:
            return possible_values[i]
    # Fallback in case floating-point rounding leaves the last cumulative sum below U.
    return possible_values[-1]

testValues = [sampleValue(possible_values, probabilities) for x in range(100000)]
#testing
print('Sample test')
print('Theoretical', probabilities[10])
print('Simulated', testValues.count(10) / 100000)
asked Sep 28, 2017 at 18:08

2 Answers


Is there a reason why you start off with numpy, then switch to lists and a for loop?

If you are given lists, you can convert them to numpy arrays and refactor your sampleValue() function to work entirely in numpy. If you generate the values yourself, keep them as numpy arrays instead of converting them to lists, and refactor sampleValue() the same way.

probabilities = np.random.dirichlet(np.ones(N))  # keep it as a numpy array

def sampleValue2(possible_values, probabilities):
    U = random.random()
    # argmax on a boolean array returns the index of the first True,
    # i.e. the first cumulative probability that exceeds U.
    return possible_values[np.argmax(np.cumsum(probabilities) > U)]

This reduces the simulation time from

1.18 s ± 9.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

to

486 ms ± 8.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Then, if you know in advance how many random numbers you want (call it num_test), you can ditch the list comprehension by creating a probability matrix:

cum_probs = np.tile(np.cumsum(probabilities), (num_test, 1))

Then apply the comparison from the rewritten function over the entire matrix at once, instead of row by row:

indices = np.argmax(U < cum_probs, axis=1)
return np.array(possible_values)[indices]

Putting it all together into a new function:

def sampleValue3(possible_values, probabilities, num_test=100000):
    # One uniform draw per requested sample, shape (num_test, 1).
    U = np.random.random(size=(num_test, 1))
    # Repeat the cumulative distribution once per row of U.
    cum_probs = np.tile(np.cumsum(probabilities), (num_test, 1))
    # First column in each row where the cumulative probability exceeds U.
    indices = np.argmax(U < cum_probs, axis=1)
    return np.array(possible_values)[indices]

And that gives me a simulation time of:

65.9 ms ± 502 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Since you are just doing a lot of numeric operations, your algorithm could be sped up a lot just by keeping everything within numpy arrays/matrices.
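Along the same lines, here is a sketch of one more numpy-only variant (not from the benchmarks above; the name sampleValue4 and the num_test parameter are illustrative). np.searchsorted performs a binary search into the cumulative distribution, so each draw costs O(log N) instead of scanning a whole row, and no tiled matrix is needed at all:

def sampleValue4(possible_values, probabilities, num_test=100000):
    cum_probs = np.cumsum(probabilities)  # computed once, shape (N,)
    U = np.random.random(size=num_test)
    # For each U, find the first index whose cumulative probability
    # strictly exceeds it: a binary search per draw.
    indices = np.searchsorted(cum_probs, U, side='right')
    # Guard against floating-point rounding pushing an index past the end.
    indices = np.minimum(indices, len(cum_probs) - 1)
    return np.asarray(possible_values)[indices]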

Alternatively, from your problem definition, I don't see any reason why you can't just use numpy's random choice:

def sampleValue5(possible_values, probabilities):
    # Let numpy draw all 100000 samples directly from the distribution.
    return np.random.choice(possible_values, p=probabilities, size=100000)

which is the fastest of these at:

36.8 ms ± 315 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
answered Sep 29, 2017 at 2:36

See Walker's alias method for sampling random objects with different probabilities, e.g. the Python class Walkerrandom (old, 2008).
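For context, here is a minimal sketch of the alias method (Vose's variant of Walker's construction), written from the standard published algorithm rather than taken from the linked Walkerrandom class. After O(N) preprocessing it samples in O(1) per draw, which is what makes it attractive as the number of possible values grows very large:

import random

def build_alias_table(probabilities):
    # O(N) preprocessing (Vose's method): split the distribution into
    # n equal-size buckets, each mixing at most two outcomes.
    n = len(probabilities)
    scaled = [p * n for p in probabilities]
    prob = [1.0] * n          # chance of keeping bucket i's own outcome
    alias = list(range(n))    # the alternative outcome stored in bucket i
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        # The large outcome donates the mass needed to fill bucket s.
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(possible_values, prob, alias):
    # O(1) per draw: pick a bucket uniformly, then one of its two outcomes.
    i = random.randrange(len(prob))
    return possible_values[i if random.random() < prob[i] else alias[i]]

Usage, with the question's possible_values and probabilities:

prob, alias = build_alias_table(probabilities)
samples = [alias_sample(possible_values, prob, alias) for _ in range(100000)]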

answered Apr 29, 2018 at 10:17
