I have a set of objects, and each object has (let's say) a color
. Given a distribution such as:
red : 50%
green : 30%
blue : 20%
I need to produce an N-length array of objects with a color distribution as close to this as possible. It is possible that the length of the original list is smaller than N, so duplicates may be required.
Is there an efficient algorithm that might be suited for this kind of task? I imagine there is, but I'm lacking the terminology to find it.
2 Answers 2
The term you are looking for is "sampling", "sampling without replacement", "sampling with replacement", "generating random samples from a given distribution", or "inverse transform sampling". Lots of nice articles and references can be found if you use those keywords to search.
Here is a simple algorithm to simulate the given distribution, red:50% green:30% blue:20%
, using sampling with replacement.
Let arr be an empty array.
Repeat the following N times
generate a random float-point number between 0 inclusive and 1 exclusive.
if the number is between 0 inclusive and 0.50 exclusive, select a random red object.
if the number is between 0.50 inclusive and 0.80 exclusive, select a random green object.
if the number is between 0.80 inclusive and 1.0 exclusive, select a random blue object.
Append the selected object to arr.
Return arr.
-
$\begingroup$ Thanks for the terminology, just what I needed. $\endgroup$Graham– Graham2019年08月05日 15:06:53 +00:00Commented Aug 5, 2019 at 15:06
Do you need to guarantee that, in every sample, 50% of the objects are red, 30% are green and 20% are blue? If so, just make your array consist of 0ドル.5,円N$ red objects chosen uniformly at random, followed by 0ドル.3,円N$ green and 0ドル.2,円N$ blue. You'll need to deal with the rounding error if $N$ isn't divisible by 10ドル$. You can always shuffle the array if it's not desirable to have it sorted by colour.
If you only need to guarantee that the proportions of the colours are correct in the long-run, Apass.Jack's solution will do the job. If $N$ is very large, then each array generation is "the long-run" and will be close to the correct proportions (but not guaranteed to be exact).