Fast random selection algorithm

Question 1

Given an array of true/false values, what is the most efficient algorithm to select an index with a true value at random.

A sketch simple algorithm is

a <- the array
c <- 0
for i in a:
 if a[i] is true: c++
e <- random number in (0, c-1)
j <- 0
for i in e:
 while j is false: j++
return j

Can anyone come up with a faster algorithm? Maybe there is a way to only walk through the list once even if the number of true elements is not known at first?

Question 2

Just curious to know, in which applications are these type of algorithms used? Sometime ago, I came across a similar question, given an infinite size array, first n places are filled with 1's, rest all are zeros. Now this array is given to a new user (who does not know value of n). Now find out an algorithm to mark the place where last 1 is there. This I solved by binary search. Please give some examples where these are used.

Question 3

Near duplicate: stackoverflow.com/questions/1133942/…. In that question the array is of size 52, though, which could affect the answers (for instance, you're pretty sure that an arary of size 52 fits in memory, whereas a here might not fit).

Question 4

Use the "pick a random element from an infinite list" algorithm.

Keep an index of your current pick, and also a count of how many true values you've seen.

When you see a true value, increment the count and then replace your pick with the current index with a probability of P=(1/count). (So you always pick the first one you find... then you might switch to the second one, with probability 1/2, then you might switch to the third one with probabilty 1/3 etc.)

This requires only one scan over the list and constant storage. (It does require you to work out a larger number of random numbers, however.) In particular, it doesn't ever require you to either buffer the list or go back to the start - so it can work on an unbounded input stream.

See this answer for a sample LINQ implementation of the simple "pick a random element" algorithm; it would just need minor tweaks.

Question 5

A few more details and a proof here: stackoverflow.com/questions/1133942/…. This question is functionally a duplicate of that one, although phrased a bit differently. My instinct is that it will most likely be slower than the two-pass algorithm, assuming data in memory. But worth testing, if the two-pass performance is unacceptable for whatever reason.

Question 6

@Steve: It depends on the sparseness of "true" values vs cost of generating a random number. If you have a million entries in the list, only 2 of which are "true", then this is likely to be a win. If, on the other hand, you have a million entries all of which are true, the two-pass algorithm will probably be faster. In general I just like the elegance of one-pass constant storage algorithms :)

Question 7

Heh, I just made the same comment about sparseness on Johannes' answer. I agree about elegance, too, although I worry slightly that using a lot of random numbers makes it harder to analyse the effects of any weaknesses in the RNG.

Question 8

Thanks Jon and Steve, I suspected there was some technique like this but I wasn't aware of it before!

Question 9

Build a list with indexes that point to true values and select one of those at random. Requires O(n) for list traversal and one try for the random number.

Question 10

That certainly is faster than what I came up with, though it uses O(n) working space, where mine uses only constant working space. So there still might be room for improvement.

Question 11

Is it certainly faster? If true values are very rare, then it's almost certainly faster. If false values are very rare, then it's almost certainly slower. Where the break-even point is, I don't know.

Question 12

Yes, certainly the distribution of true/false values does matter for the question which algorithm is more efficient. But when that's not known all bets are off, as usual. Still, I find Jon's answer very nice and likely to be better than this.

Question 13

@Steve: It really depends on how often you need to do it. Building the index only happens once, so even if false values are extremely rare, using the index will eventually be faster.

Question 14

Yes, I hadn't thought of that: there may be guarantees that a certain number of selections will be done before the next time the contents of the array are changed. If we're allowed to widen the scope a little, then it might be better to get rid of the array of flags entirely, and just keep a structure containing the "true" indices. Insertion/removal will be slower, but selection will be faster.

Jon Skeet 1.5m893 gold badges9.3k silver badges9.3k bronze badges · Accepted Answer · 2009-12-03 15:32:06Z

8

Use the "pick a random element from an infinite list" algorithm.

Keep an index of your current pick, and also a count of how many true values you've seen.

When you see a true value, increment the count and then replace your pick with the current index with a probability of P=(1/count). (So you always pick the first one you find... then you might switch to the second one, with probability 1/2, then you might switch to the third one with probabilty 1/3 etc.)

This requires only one scan over the list and constant storage. (It does require you to work out a larger number of random numbers, however.) In particular, it doesn't ever require you to either buffer the list or go back to the start - so it can work on an unbounded input stream.

See this answer for a sample LINQ implementation of the simple "pick a random element" algorithm; it would just need minor tweaks.

Share

Improve this answer

edited May 23, 2017 at 12:19

Community's user avatar

Community Bot

11 silver badge

answered Dec 3, 2009 at 15:32

Jon Skeet's user avatar

Jon Skeet

1.5m893 gold badges9.3k silver badges9.3k bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Steve Jessop

Steve Jessop Over a year ago

A few more details and a proof here: stackoverflow.com/questions/1133942/…. This question is functionally a duplicate of that one, although phrased a bit differently. My instinct is that it will most likely be slower than the two-pass algorithm, assuming data in memory. But worth testing, if the two-pass performance is unacceptable for whatever reason.

2009年12月03日T15:37:42.247Z+00:00

Jon Skeet

Jon Skeet Over a year ago

@Steve: It depends on the sparseness of "true" values vs cost of generating a random number. If you have a million entries in the list, only 2 of which are "true", then this is likely to be a win. If, on the other hand, you have a million entries all of which are true, the two-pass algorithm will probably be faster. In general I just like the elegance of one-pass constant storage algorithms :)

2009年12月03日T15:44:50.297Z+00:00

Steve Jessop

Steve Jessop Over a year ago

Heh, I just made the same comment about sparseness on Johannes' answer. I agree about elegance, too, although I worry slightly that using a lot of random numbers makes it harder to analyse the effects of any weaknesses in the RNG.

2009年12月03日T15:50:05.577Z+00:00

momeara

momeara Over a year ago

Thanks Jon and Steve, I suspected there was some technique like this but I wasn't aware of it before!

2009年12月04日T05:57:16.423Z+00:00

CollectivesTM on Stack Overflow

Fast random selection algorithm

2 Answers 2

4 Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

2 Answers 2

4 Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related