3

Given an array of true/false values, what is the most efficient algorithm to select an index with a true value at random.

A sketch simple algorithm is

a <- the array
c <- 0
for i in a:
 if a[i] is true: c++
e <- random number in (0, c-1)
j <- 0
for i in e:
 while j is false: j++
return j

Can anyone come up with a faster algorithm? Maybe there is a way to only walk through the list once even if the number of true elements is not known at first?

asked Dec 3, 2009 at 15:26
2
  • Just curious to know, in which applications are these type of algorithms used? Sometime ago, I came across a similar question, given an infinite size array, first n places are filled with 1's, rest all are zeros. Now this array is given to a new user (who does not know value of n). Now find out an algorithm to mark the place where last 1 is there. This I solved by binary search. Please give some examples where these are used. Commented Dec 3, 2009 at 15:33
  • Near duplicate: stackoverflow.com/questions/1133942/…. In that question the array is of size 52, though, which could affect the answers (for instance, you're pretty sure that an arary of size 52 fits in memory, whereas a here might not fit). Commented Dec 3, 2009 at 15:40

2 Answers 2

8

Use the "pick a random element from an infinite list" algorithm.

Keep an index of your current pick, and also a count of how many true values you've seen.

When you see a true value, increment the count and then replace your pick with the current index with a probability of P=(1/count). (So you always pick the first one you find... then you might switch to the second one, with probability 1/2, then you might switch to the third one with probabilty 1/3 etc.)

This requires only one scan over the list and constant storage. (It does require you to work out a larger number of random numbers, however.) In particular, it doesn't ever require you to either buffer the list or go back to the start - so it can work on an unbounded input stream.

See this answer for a sample LINQ implementation of the simple "pick a random element" algorithm; it would just need minor tweaks.

answered Dec 3, 2009 at 15:32
Sign up to request clarification or add additional context in comments.

4 Comments

A few more details and a proof here: stackoverflow.com/questions/1133942/…. This question is functionally a duplicate of that one, although phrased a bit differently. My instinct is that it will most likely be slower than the two-pass algorithm, assuming data in memory. But worth testing, if the two-pass performance is unacceptable for whatever reason.
@Steve: It depends on the sparseness of "true" values vs cost of generating a random number. If you have a million entries in the list, only 2 of which are "true", then this is likely to be a win. If, on the other hand, you have a million entries all of which are true, the two-pass algorithm will probably be faster. In general I just like the elegance of one-pass constant storage algorithms :)
Heh, I just made the same comment about sparseness on Johannes' answer. I agree about elegance, too, although I worry slightly that using a lot of random numbers makes it harder to analyse the effects of any weaknesses in the RNG.
Thanks Jon and Steve, I suspected there was some technique like this but I wasn't aware of it before!
6

Build a list with indexes that point to true values and select one of those at random. Requires O(n) for list traversal and one try for the random number.

answered Dec 3, 2009 at 15:27

5 Comments

That certainly is faster than what I came up with, though it uses O(n) working space, where mine uses only constant working space. So there still might be room for improvement.
Is it certainly faster? If true values are very rare, then it's almost certainly faster. If false values are very rare, then it's almost certainly slower. Where the break-even point is, I don't know.
Yes, certainly the distribution of true/false values does matter for the question which algorithm is more efficient. But when that's not known all bets are off, as usual. Still, I find Jon's answer very nice and likely to be better than this.
@Steve: It really depends on how often you need to do it. Building the index only happens once, so even if false values are extremely rare, using the index will eventually be faster.
Yes, I hadn't thought of that: there may be guarantees that a certain number of selections will be done before the next time the contents of the array are changed. If we're allowed to widen the scope a little, then it might be better to get rid of the array of flags entirely, and just keep a structure containing the "true" indices. Insertion/removal will be slower, but selection will be faster.

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.