Weighted random selection

Question 1

If given two lists a: ['the', 'a', 'and', 'for'] and b: [0.2, 0.3, 0.4, 0.1], the positions of the numbers in b represents the weights for the respective probabilities of obtaining the corresponding word. E.g. the : 20%, a : 30%, etc. Create a function that will generate words based on their corresponding probabilities.

My solution:

import random
def weighted_word_selection(words, weights):
 """ 
 words : an array of strings (words) 
 weights : an array of floats (corresponding probabilties based on)
 index number.
 """
 start = 0
 for i in range(len(weights)):
 weights[i] = start+weights[i]
 start += weights[i]
 r = random.uniform(0, 1.0)
 for i in range(len(weights)):
 if r < weights[i]:
 return words[i]

Questions:

What are some alternative ways to solve this, bearing in mind the requirement to optimize performance as the array words and the array weights approaches a large n? Any detailed explanations of underlying processes such as Cython/C based Python solutions are helpful.
Can you discuss the \$O(n)\$ based complexity of the problem and some performance issues in my solution, and how those are addressed in your suggested solution at scale?

Question 2

You may be interested in this post. Its main result is a data structure for probability distribution that runs insert, delete and sample operations in \$\Theta(\log n)\$.

Question 3

You have probably an error in your first for loop:

 weights[i] = start+weights[i]
 start += weights[i]

as the resulting value of start (BTW not very descriptive name) will be increased twice by weights[i], so the fix may be

 start += weights[i]
 weights[i] = start

But anyway, why you accumulate weights of individual words and moreover you assign the final result (which will be always 1, of course) in the start variable, if you never use it? IMHO your program perform something other than you want.

As your two input lists correspond one to other, why not use the zip() function to make pairs of corresponding values:

word_weights = zip(words, weights)

and use it in your for loops?

Why don't utilize the enumerate() function for obtaining indices without any effort?

MarianD MarianD 1,9561 gold badge11 silver badges20 bronze badges · Answer 1 · 2017-08-25 17:48:44Z

You have probably an error in your first for loop:

 weights[i] = start+weights[i]
 start += weights[i]

as the resulting value of start (BTW not very descriptive name) will be increased twice by weights[i], so the fix may be

 start += weights[i]
 weights[i] = start

But anyway, why you accumulate weights of individual words and moreover you assign the final result (which will be always 1, of course) in the start variable, if you never use it? IMHO your program perform something other than you want.

As your two input lists correspond one to other, why not use the zip() function to make pairs of corresponding values:

word_weights = zip(words, weights)

and use it in your for loops?

Why don't utilize the enumerate() function for obtaining indices without any effort?

Stack Exchange Network

Weighted random selection

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Weighted random selection

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions