1
\$\begingroup\$

If given two lists a: ['the', 'a', 'and', 'for'] and b: [0.2, 0.3, 0.4, 0.1], the positions of the numbers in b represents the weights for the respective probabilities of obtaining the corresponding word. E.g. the : 20%, a : 30%, etc. Create a function that will generate words based on their corresponding probabilities.

My solution:

import random
def weighted_word_selection(words, weights):
 """ 
 words : an array of strings (words) 
 weights : an array of floats (corresponding probabilties based on)
 index number.
 """
 start = 0
 for i in range(len(weights)):
 weights[i] = start+weights[i]
 start += weights[i]
 r = random.uniform(0, 1.0)
 for i in range(len(weights)):
 if r < weights[i]:
 return words[i]

Questions:

  1. What are some alternative ways to solve this, bearing in mind the requirement to optimize performance as the array words and the array weights approaches a large n? Any detailed explanations of underlying processes such as Cython/C based Python solutions are helpful.
  2. Can you discuss the \$O(n)\$ based complexity of the problem and some performance issues in my solution, and how those are addressed in your suggested solution at scale?
Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Aug 25, 2017 at 2:48
\$\endgroup\$
1
  • 2
    \$\begingroup\$ You may be interested in this post. Its main result is a data structure for probability distribution that runs insert, delete and sample operations in \$\Theta(\log n)\$. \$\endgroup\$ Commented Aug 25, 2017 at 6:28

1 Answer 1

1
\$\begingroup\$

You have probably an error in your first for loop:

 weights[i] = start+weights[i]
 start += weights[i]

as the resulting value of start (BTW not very descriptive name) will be increased twice by weights[i], so the fix may be

 start += weights[i]
 weights[i] = start

But anyway, why you accumulate weights of individual words and moreover you assign the final result (which will be always 1, of course) in the start variable, if you never use it? IMHO your program perform something other than you want.


As your two input lists correspond one to other, why not use the zip() function to make pairs of corresponding values:

word_weights = zip(words, weights)

and use it in your for loops?


Why don't utilize the enumerate() function for obtaining indices without any effort?

answered Aug 25, 2017 at 17:48
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.