I'm sure this is a fully 'solved' problem and I'm expecting cross-references, but I can't find this on Google. I'm probably missing a keyword.
Here is a naive function for populating a bool array (b[i]) of length N with independent Bernoulli random variables: P(b[i]==true) == p and P(b[i]==false) == (1-p) for each i.
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
void randomize(bool *const bits, const size_t len, const double p) {
    for (size_t i = 0; i < len; ++i) {
        // Force the hand for p >= 1.0: rand() can return both 0 and RAND_MAX,
        // so rand() < p*RAND_MAX could fail even when p == 1.0.
        if (p >= 1.0 || rand() < p * RAND_MAX) {
            bits[i] = true;
        } else {
            bits[i] = false;
        }
    }
}
int main(void) {
    const size_t len = 100;                    // N in the text.
    bool *bits = malloc(len * sizeof(*bits));  // b[i] in the text.
    if (bits == NULL) {
        return EXIT_FAILURE;
    }
    srand(1234);  // Fixed seed for reproducibility.
    randomize(bits, len, 0.3);
    int count = 0;
    for (size_t i = 0; i < len; ++i) {
        printf("%c", bits[i] ? '1' : '0');
        if (bits[i]) {
            ++count;
        }
    }
    // The printed value is the observed fraction of set bits, not the raw count.
    printf("\nfraction=%f\n", (double)count / (double)len);
    free(bits);
    return EXIT_SUCCESS;
}
Obviously it's linear \$\mathcal{O}(N)\$ for 'seeks' in the array.
However, I could start with an array full of 0s, calculate a random variable \$B \sim \mathrm{Binomial}(N, p)\$ to get the number of bits to set, and then select those positions at random.
If \$p\$ is small, that will involve far fewer seeks. It will still be asymptotically linear in expectation, but with a far smaller constant.
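For concreteness, here is one way to draw \$B\$: invert the binomial CDF term by term. This is only a sketch, and binomial_draw is a name I'm introducing for illustration; the running product underflows for large \$N\$, so real code would use a dedicated sampler from a numerics library.
#include <stdlib.h>
#include <math.h>
/* Draw B ~ Binomial(n, p) by CDF inversion, using the recurrence
   P(B == k+1) = P(B == k) * ((n-k)/(k+1)) * (p/(1-p)). */
size_t binomial_draw(const size_t n, const double p) {
    if (p <= 0.0) return 0;
    if (p >= 1.0) return n;
    const double u = rand() / ((double)RAND_MAX + 1.0); /* uniform in [0,1) */
    double pk = pow(1.0 - p, (double)n);                /* P(B == 0) */
    double cdf = pk;
    size_t k = 0;
    while (cdf <= u && k < n) {
        pk *= ((double)(n - k) / (double)(k + 1)) * (p / (1.0 - p));
        ++k;
        cdf += pk;
    }
    return k;
}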
If I naively pick the \$B\$ elements to set, I might pick the same one twice (birthday paradox, anyone?) and will need to pick again. If I just 'scan' forwards to the first free slot, I will introduce a 'clumping' bias and violate the independence of the random variables: \$P(b[i+1] = 1 \mid b[i] = 1) \gt p\$. Clumping is a no-no for my application.
Therein lies the rub: if \$p\$ is near 1.0, it will take an increasingly massive amount of time to fill the array. When it's nearly full, randomly picking a remaining 0 will typically take a lot of retries.
I can be 'clever': if \$B \gt \frac{N}{2}\$ I can block-set everything to 1 and then set \$N-B\$ elements back to 0. Counting only 'seeks', that means I will always beat the full scan above, because the average number of retries at each step is \$\le 1\$, so the expected seek count is \$\lt N\$.
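Putting that together, here is a minimal sketch of the whole scheme under the assumptions above (binomial_draw is the illustrative sampler from earlier). On the array, the initial block fill is itself O(N) visits; on a structure with a cheap block fill, like the 'tree of false'/'tree of true' mentioned in the PS below, only the random flips count as seeks.
#include <stdlib.h>
#include <stdbool.h>
void randomize_binomial(bool *const bits, const size_t len, const double p) {
    size_t b = binomial_draw(len, p);  /* number of true bits to place */
    bool majority = false;
    if (b > len / 2) {   /* block-fill with the majority value... */
        majority = true;
        b = len - b;     /* ...so at most N/2 positions need flipping */
    }
    for (size_t i = 0; i < len; ++i) {
        bits[i] = majority;
    }
    while (b > 0) {
        const size_t i = (size_t)rand() % len;  /* modulo bias ignored in this sketch */
        if (bits[i] == majority) {  /* on a collision, just retry (the birthday-paradox case) */
            bits[i] = !majority;
            --b;
        }
    }
}
Since the fraction of already-flipped slots never exceeds 1/2, each flip needs at most 2 picks in expectation, so the total expected number of picks is at most \$N\$, matching the claim above.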
Does anyone have such a piece of code readily on hand?
PS: The data structure I'm working with isn't an array; it's a particular kind of tree. That's not important, but it might explain why I'm counting 'seeks' as the overriding parameter. However, I am able to easily conjure up a 'tree of false' and a 'tree of true' to implement that 'block fill' requirement.
-
@Jamal Thanks for the edits, particularly the maths formatting. Beyond my knowledge. PS: If I could be bothered, I'd propose upvoting for edits on Meta Code Review. (user59064, Jan 4, 2015 at 20:06)
-
You're very kind. :-) I don't do it for the rep; I do it for site quality. This isn't the place to talk about it, though. Feel free to post such a proposal if you wish. (Jamal, Jan 4, 2015 at 20:16)
1 Answer
From what I can see, the code fills an array with a given number of true values (the rest being false) in random order. I'd rather initialize the first p*N elements to true, and then random_shuffle the array.
That said, I don't understand why you need this array at all. Can we see how your application uses it?
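For reference, here is a minimal C sketch of that suggestion (random_shuffle is C++, and a comment below mentions the code is in C), using a Fisher-Yates shuffle; randomize_shuffle is an illustrative name. Note that this fixes the number of true values at exactly round(p*N), rather than drawing it from the binomial distribution as the question proposes.
#include <stdlib.h>
#include <stdbool.h>
/* Set the first round(p*N) slots to true, then Fisher-Yates shuffle. */
void randomize_shuffle(bool *const bits, const size_t len, const double p) {
    const size_t ones = (size_t)(p * (double)len + 0.5);
    for (size_t i = 0; i < len; ++i) {
        bits[i] = (i < ones);
    }
    for (size_t i = len; i > 1; --i) {        /* shuffle bits[0..i-1] */
        const size_t j = (size_t)rand() % i;  /* j in [0, i-1]; modulo bias ignored in this sketch */
        const bool tmp = bits[i - 1];
        bits[i - 1] = bits[j];
        bits[j] = tmp;
    }
}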
-
As to what I'm doing, I'll drop a hint. As stated, I'm not working with an array, but I think using my actual data structure may be distracting to this question. What I'm working with is a particular kind of tree in which each non-leaf node has exactly 4 child nodes. It may be that for higher values of p it's easier to recursively build the tree randomly from scratch. However, it's obvious that for small values of p it's easier to follow my proposed binomial approach, and the 'critical' point for p is somewhere in the range most relevant to me (say 0.1 to 0.5). (user59064, Jan 5, 2015 at 7:40)
-
That said, when p=0.5 there are numerous performance-critical cryptographic applications in which N (bit-length) is non-trivial. In part that's why I'm surprised by the lack of ready answers. Though crypto usually considers a packed array of bits, which doesn't massively apply to me, though it could build a 'palette' of sub-trees and put them together. I lied a bit about it being a tree. It looks like a tree, but duplicate sub-trees have been merged. Another clue. (user59064, Jan 5, 2015 at 7:55)
-
Random shuffle will require (2+p)*N visits on average: 1 visit for each set and 2 visits for each swap. I'm also working in C, though obviously the algorithm is easy to implement! (user59064, Jan 5, 2015 at 8:33)