If given two lists a: ['the', 'a', 'and', 'for'] and b: [0.2, 0.3, 0.4, 0.1], the positions of the numbers in b represents the weights for the respective probabilities of obtaining the corresponding word. E.g. the : 20%, a : 30%, etc. Create a function that will generate words based on their corresponding probabilities.
My solution:
import random
def weighted_word_selection(words, weights):
"""
words : an array of strings (words)
weights : an array of floats (corresponding probabilties based on)
index number.
"""
start = 0
for i in range(len(weights)):
weights[i] = start+weights[i]
start += weights[i]
r = random.uniform(0, 1.0)
for i in range(len(weights)):
if r < weights[i]:
return words[i]
Questions:
- What are some alternative ways to solve this, bearing in mind the requirement to optimize performance as the array words and the array weights approaches a large n? Any detailed explanations of underlying processes such as Cython/C based Python solutions are helpful.
- Can you discuss the \$O(n)\$ based complexity of the problem and some performance issues in my solution, and how those are addressed in your suggested solution at scale?
-
2\$\begingroup\$ You may be interested in this post. Its main result is a data structure for probability distribution that runs insert, delete and sample operations in \$\Theta(\log n)\$. \$\endgroup\$coderodde– coderodde2017年08月25日 06:28:10 +00:00Commented Aug 25, 2017 at 6:28
1 Answer 1
You have probably an error in your first for
loop:
weights[i] = start+weights[i]
start += weights[i]
as the resulting value of start
(BTW not very descriptive name) will be increased twice by weights[i]
, so the fix may be
start += weights[i]
weights[i] = start
But anyway, why you accumulate weights of individual words and moreover you assign the final result (which will be always 1
, of course) in the start
variable, if you never use it? IMHO your program perform something other than you want.
As your two input lists correspond one to other, why not use the zip()
function to make pairs of corresponding values:
word_weights = zip(words, weights)
and use it in your for
loops?
Why don't utilize the enumerate()
function for obtaining indices without any effort?