Random sampling without replacement [closed]

Question 1

Create a function that will output a set of distinct random numbers drawn from a range. The order of the elements in the set is unimportant (they can even be sorted), but it must be possible for the contents of the set to be different each time the function is called.

The function will receive 3 parameters in whatever order you want:

Count of numbers in output set
Lower limit (inclusive)
Upper limit (inclusive)

Assume all numbers are integers in the range 0 (inclusive) to 2³¹ (exclusive). The output can be passed back any way you want (write to console, as an array, etc.)

Judging

Criteria includes the 3 R's

Run-time - tested on a quad-core Windows 7 machine with whatever compiler is freely or easily available (provide a link if necessary)
Robustness - does the function handle corner cases or will it fall into an infinite loop or produce invalid results - an exception or error on invalid input is valid
Randomness - it should produce random results that are not easily predictable with a random distribution. Using the built in random number generator is fine. But there should be no obvious biases or obvious predictable patterns. Needs to be better than that random number generator used by the Accounting Department in Dilbert

If it is robust and random then it comes down to run-time. Failing to be robust or random greatly hurts its standings.

Question 2

Is the output supposed to pass something like the DIEHARD or TestU01 tests, or how will you judge its randomness? Oh, and should the code run in 32 or 64 bit mode? (That will make a big difference for optimization.)

Question 3

TestU01 is probably a bit harsh, I guess. Does criterion 3 imply a uniform distribution? Also, why the non-repeating requirement? That's not particularly random, then.

Question 4

@Joey, sure it is. It's random sampling without replacement. As long as no-one claims that the different positions in the list are independent random variables there's no problem.

Question 5

Ah, indeed. But I'm not sure whether there are well-established libraries and tools for measuring randomness of sampling :-)

Question 6

@IlmariKaronen: RE: Randomness: I've seen implementations before that were woefully unrandom. Either they had a heavy bias, or lacked the ability to produce different results on consecutive runs. So we are not talking cryptographic level randomness, but more random than the Accounting Department's random number generator in Dilbert.

Question 7

Python

import random
def sample(n, lower, upper):
 result = []
 pool = {}
 for _ in xrange(n):
 i = random.randint(lower, upper)
 x = pool.get(i, i)
 pool[i] = pool.get(lower, lower)
 lower += 1
 result.append(x)
 return result

I probably just re-invented some well-known algorithm, but the idea is to (conceptually) perform a partial Fisher-Yates shuffle of the range lower..upper to get the length n prefix of a uniformly shuffled range.

Of course, storing the whole range would be rather expensive, so I only store the locations where the elements have been swapped.

This way, the algorithm should perform well both in the case where you're sampling numbers from a tight range (e.g. 1000 numbers in the range 1..1000), as well as the case where you're sampling numbers from a large range.

I'm not sure about the quality of randomness from the built-in generator in Python, but it's relatively simple to swap in any generator that can generate integers uniformly from some range.

Question 8

Python uses Mersenne Twister, so it's relatively decent.

Question 9

python 2.7

import random
print(lambda x,y,z:random.sample(xrange(y,z),x))(input(),input(),input())

not sure what your standing is on using builtin random methods, but here you go anyways. nice and short

edit: just noticed that range() doesn't like to make big lists. results in a memory error. will see if there is any other way to do this...

edit2: range was the wrong function, xrange works. The maximum integer is actually 2**31-1 for python

test:

python sample.py
10
0
2**31-1
[786475923, 2087214992, 951609341, 1894308203, 173531663, 211170399, 426989602, 1909298419, 1424337410, 2090382873]

Question 10

C

Returns an array containing x unique random ints between min and max. (caller must free)

#include <stdlib.h>
#include <stdint.h>
#define MAX_ALLOC ((uint32_t)0x40000000) //max allocated bytes, fix per platform
#define MAX_SAMPLES (MAX_ALLOC/sizeof(uint32_t))
int* randsamp(uint32_t x, uint32_t min, uint32_t max)
{
 uint32_t r,i=x,*a;
 if (!x||x>MAX_SAMPLES||x>(max-min+1)) return NULL;
 a=malloc(x*sizeof(uint32_t));
 while (i--) {
 r= (max-min+1-i);
 a[i]=min+=(r ? rand()%r : 0);
 min++;
 }
 while (x>1) {
 r=a[i=rand()%x--];
 a[i]=a[x];
 a[x]=r;
 }
 return a;
}

Works by generating x sequential random integers in the range, then shuffling them. Add a seed(time) somewhere in caller if you don't want the same results every run.

Question 11

Ruby >= 1.8.7

def pick(num, min, max)
 (min..max).to_a.sample(num)
end
p pick(5, 10, 20) #=>[12, 18, 13, 11, 10]

Question 12

R

s <- function(n, lower, upper) sample(lower:upper,n); s(10,0,2^31-2)

Question 13

The question is not correct. Do you need uniform sampling or not? In the case uniform sampling is needed I have the following code in R, which has average complexity O(s log s), where s is the sample size.

# The Tree growing algorithm for uniform sampling without replacement
# by Pavel Ruzankin 
quicksample = function (n,size)
# n - the number of items to choose from
# size - the sample size
{
 s=as.integer(size)
 if (s>n) {
 stop("Sample size is greater than the number of items to choose from")
 }
 # upv=integer(s) #level up edge is pointing to
 leftv=integer(s) #left edge is poiting to; must be filled with zeros
 rightv=integer(s) #right edge is pointig to; must be filled with zeros
 samp=integer(s) #the sample
 ordn=integer(s) #relative ordinal number
 ordn[1L]=1L #initial value for the root vertex
 samp[1L]=sample(n,1L) 
 if (s > 1L) for (j in 2L:s) {
 curn=sample(n-j+1L,1L) #current number sampled
 curordn=0L #currend ordinal number
 v=1L #current vertice
 from=1L #how have come here: 0 - by left edge, 1 - by right edge
 repeat {
 curordn=curordn+ordn[v]
 if (curn+curordn>samp[v]) { #going down by the right edge
 if (from == 0L) {
 ordn[v]=ordn[v]-1L
 }
 if (rightv[v]!=0L) {
 v=rightv[v]
 from=1L
 } else { #creating a new vertex
 samp[j]=curn+curordn
 ordn[j]=1L
 # upv[j]=v
 rightv[v]=j
 break
 }
 } else { #going down by the left edge
 if (from==1L) {
 ordn[v]=ordn[v]+1L
 }
 if (leftv[v]!=0L) {
 v=leftv[v]
 from=0L
 } else { #creating a new vertex
 samp[j]=curn+curordn-1L
 ordn[j]=-1L
 # upv[j]=v
 leftv[v]=j
 break
 }
 }
 }
 }
 return(samp) 
}

Of course, one may rewrite it in C for better performance. The complexity of this algorithm is discussed in: Rouzankin, P. S.; Voytishek, A. V. On the cost of algorithms for random selection. Monte Carlo Methods Appl. 5 (1999), no. 1, 39-54. http://dx.doi.org/10.1515/mcma.1999年5月1日.39

You may look through this paper for another algorithm with the same average complexity.

But if you do not need uniform sampling, only requiring that all sampled numbers be different, then the situation changes dramatically. It is not hard to write an algorithm that has average complexity O(s).

See also for uniform sampling: P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random sampling without replacement. International Journal of Computer Mathematics 16:4, pages 201-209. DOI: 10.1080/00207168408803438

Teuhola, J. and Nevalainen, O. 1982. Two efficient algorithms for random sampling without replacement. /IJCM/, 11(2): 127–140. DOI: 10.1080/00207168208803304

In the last paper the authors use hash tables and claim that their algorithms have O(s) complexity. There is one more fast hash table algorithm, which will soon be implemented in pqR (pretty quick R): https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html

Question 14

APL, (削除) 18 (削除ここまで) 22 bytes

{⍵[0]+(1↑⍺)?⍵[1]-⍵[0]}

Declares an anonymous function that takes two arguments ⍺ and ⍵. ⍺ is the number of random numbers you want, ⍵ is a vector containing the lower and upper bounds, in that order.

a?b picks a random numbers between 0-b without replacement. By taking ⍵[1]-⍵[0] we get the range size. Then we pick ⍺ numbers (see below) from that range and add the lower bound. In C, this would be

lower + rand() * (upper - lower)

⍺ times without replacement. Parentheses not needed because APL operates right-to-left.

(削除) Assuming I've understood the conditions correctly, this fails the 'robustness' criteria because the function will fail if given improper arguments (e.g. passing a vector instead of a scalar as ⍺). (削除ここまで)

In the event that ⍺ is a vector rather than a scalar, 1↑⍺ takes the first element of ⍺. For a scalar, this is the scalar itself. For a vector, it's the first element. This should make the function meet the 'robustness' criteria.

Example:

Input: 100 {⍵[0]+⍺?⍵[1]-⍵[0]} 0 100
Output: 34 10 85 2 46 56 32 8 36 79 77 24 90 70 99 61 0 21 86 50 83 5 23 27 26 98 88 66 58 54 76 20 91 72 71 65 63 15 33 11 96 60 43 55 30 48 73 75 31 13 19 3 45 44 95 57 97 37 68 78 89 14 51 47 74 9 67 18 12 92 6 49 41 4 80 29 82 16 94 52 59 28 17 87 25 84 35 22 38 1 93 81 42 40 69 53 7 39 64 62

Question 15

This is not a code golf but a fastest cose, therefore the goal is producing the fastest code to perform the task rather than the shortest. Anyway, you don't really need to pick the items from the arguments like that, and you can determine their order, so {⍵+⍺?⎕-⍵} should suffice, where the prompt is for the upper bound and right arg is lower bound

Question 16

Scala

object RandSet {
 val random = util.Random 
 def rand (count: Int, lower: Int, upper: Int, sofar: Set[Int] = Set.empty): Set[Int] =
 if (count == sofar.size) sofar else 
 rand (count, lower, upper, sofar + (random.nextInt (upper-lower) + lower)) 
}
object RandSetRunner {
 def main (args: Array [String]) : Unit = {
 if (args.length == 4) 
 (0 until args (0).toInt).foreach { unused => 
 println (RandSet.rand (args (1).toInt, args (2).toInt, args (3).toInt).mkString (" "))
 }
 else Console.err.println ("usage: scala RandSetRunner OUTERCOUNT COUNT MIN MAX")
 }
}

compile and run:

scalac RandSetRunner.scala 
scala RandSetRunner 200 15 0 100

The second line will run 200 tests with 15 values from 0 to 100, because Scala produces fast bytecode but needs some startup time. So 200 starts with 15 values from 0 to 100 would consume more time.

Sample on a 2 Ghz Single Core:

time scala RandSetRunner 100000 10 0 1000000 > /dev/null
real 0m2.728s
user 0m2.416s
sys 0m0.168s

Logic:

Using the build-in random and recursively picking numbers in the range (max-min), adding min and checking, if the size of the set is the expected size.

Critique:

It will be fast for small samples of big ranges, but if the task is to pick nearly all elements of a sample (999 numbers out of 1000) it will repeatedly pick numbers, already in the set.
From the question, I'm unsure, whether I have to sanitize against unfulfillable requests like Take 10 distinct numbers from 4 to 8. This will now lead to an endless loop, but can easily be avoided with a prior check which I will append if requested.

Question 17

Scheme

Not sure why you need 3 parameters passed nor why i need to assume any range...

(import srfi-1) ;; for iota
(import srfi-27) ;; randomness
(import srfi-43) ;; for vector-swap!
(define rand (random-source-make-integers
 default-random-source))
;; n: length, i: lower limit
(define (random-range n i)
 (let ([v (list->vector (iota n i))])
 (let f ([n n])
 (let* ([i (rand n)] [n (- n 1)])
 (if (zero? n) v
 (begin (vector-swap! v n i) (f n)))))))

Question 18

R

random <- function(count, from, to) {
 rand.range <- to - from
 vec <- c()
 for (i in 1:count) {
 t <- sample(rand.range, 1) + from
 while(i %in% vec) {
 t <- sample(rand.range, 1) + from
 }
 vec <- c(vec, t)
 }
 return(vec)
}

Question 19

C++

This code is best when drawing many samples from the range.

#include <exception>
#include <stdexcept>
#include <cstdlib>
template<typename OutputIterator>
 void sample(OutputIterator out, int n, int min, int max)
{
 if (n < 0)
 throw std::runtime_error("negative sample size");
 if (max < min)
 throw std::runtime_error("invalid range");
 if (n > max-min+1)
 throw std::runtime_error("sample size larger than range");
 while (n>0)
 {
 double r = std::rand()/(RAND_MAX+1.0);
 if (r*(max-min+1) < n)
 {
 *out++ = min;
 --n;
 }
 ++min;
 }
}

Question 20

This can easily get stuck in an infinite loop unless max-min is much larger than n. Also, the output sequence is monotonically increasing, so you're getting very low quality randomness but still paying the cost of calling rand() multiple times per result. A random shuffle of the array would probably be worth the extra run-time.

Question 21

Q (19 characters)

f:{(neg x)?y+til z}

Then use f[x;y;z] as [count of numbers in output set;starting point;size of range]

e.g. f[5;10;10] will output 5 distinct random numbers between 10 and 19 inclusive.

q)\ts do[100000;f[100;1;10000]]
2418 131456j

Results above show performance at 100,000 iterations of picking 100 random numbers between 1 & 10,000.

Question 22

R, 31 or 40 bytes (depending on the meaning of the word "range")

If the input has 3 numbers, a[1], a[2], a[3], and by "range" you mean "an integer sequence from a[2] to a[3]", then you have this:

a=scan();sample(a[2]:a[3],a[1])

If you have an array n from which you are about to resample, but under the restriction of the lower and upper limits, like "resample values of the given array n from the range a[1]...a[2]", then use this:

a=scan();sample(n[n>=a[2]&n<=a[3]],a[1])

I am quite surprised why the previous result was not golfed considering the built-in sample with replacement facilities! We create a vector that satisfies the range condition and re-sample it.

Robustness: the corner cases (sequences of the same length as the range to sample from) are handled by default.
Run-time: extremely fast because it is built in.
Randomness: the seed is automatically changed every time the RNG is invoked.

Question 23

at least on my machine, 0:(2^31) causes an Error: cannot allocate a vector of size 16.0 Gb

Question 24

@Giuseppe Recently, I have been working with big-memory problems, and the solution to that is actually... running it on a better machine. The restrictions in the formulation of the task pertain to the processor, not to memory, so is it... rule abuse? Ah, I am an ass. I thought it was a code golf challenge, but actually it is... fastest-code. I lose I guess?

Question 25

Javascript (using external library) (64 bytes / 104 bytes??)

(a,b,n)=>_.Range(0,n).Select(x=>Math.random()*(b-a)+a).ToArray()

Link to lib: https://github.com/mvegh1/Enumerable/

Code explanation: Lambda expression accepts min,max,count as args. Create a collection of size n, and map each element to a random number fitting the min/max criteria. Convert to native JS array and return it. I ran this also on an input of size 5,000,000, and after applying a distinct transform still showed 5,000,000 elements. If it is agreed upon that this isn't safe enough of a guarantee for distinctness I will update the answer

I included some statistics in the image below...

enter image description here

EDIT: The below image shows code / performance that guarantees every element will be distinct. It's much much slower (6.65 seconds for 50,000 elements) vs the original code above for same args (0.012 seconds)

enter image description here

Question 26

K (oK), 14 bytes

Solution:

{y+(-x)?1+z-y}

Try it online!

Example:

> {y+(-x)?1+z-y}. 10 10 20 / note: there are two ways to provide input, dot or
13 20 16 17 19 10 14 12 11 18
> {y+(-x)?1+z-y}[10;10;20] / explicitly with [x;y;z]
12 11 13 19 15 17 18 20 14 10

Explanation:

Takes 3 implicit inputs per spec:

x, count of numbers in output set,
y, lower limit (inclusive)
z, upper limit (inclusive)

{y+(-x)?1+z-y} / the solution
{ } / lambda function with x, y and z as implicit inputs
 z-y / subtract lower limit from upper limit
 1+ / add 1
 (-x)? / take x many distinct items from 0..(1+z=y)
 y+ / add lower limit

Notes:

Also a polyglot in q/kdb+ with an extra set of brackets: {y+((-)x)?1+z-y} (16 bytes).

Question 27

Axiom + its library

f(n:PI,a:INT,b:INT):List INT==
 r:List INT:=[]
 a>b or n>99999999 =>r
 d:=1+b-a
 for i in 1..n repeat
 r:=concat(r,a+random(d)$INT)
 r

The above f() function return as error the empty list, in the case f(n,a,b) with a>b. In other cases of invalid input, it not run with one error message in Axiom window, because argument will be not of the right type. Examples

(6) -> f(1,1,5)
 (6) [2]
 Type: List Integer
(7) -> f(1,1,1)
 (7) [1]
 Type: List Integer
(10) -> f(10,1,1)
 (10) [1,1,1,1,1,1,1,1,1,1]
 Type: List Integer
(11) -> f(10,-20,-1)
 (11) [- 10,- 4,- 18,- 5,- 5,- 11,- 15,- 1,- 20,- 1]
 Type: List Integer
(12) -> f(10,-20,-1)
 (12) [- 4,- 5,- 3,- 4,- 18,- 1,- 2,- 14,- 19,- 8]
 Type: List Integer
(13) -> f(10,-20,-1)
 (13) [- 18,- 12,- 12,- 19,- 19,- 15,- 5,- 17,- 19,- 4]
 Type: List Integer
(14) -> f(10,-20,-1)
 (14) [- 8,- 11,- 20,- 10,- 4,- 8,- 11,- 3,- 10,- 16]
 Type: List Integer
(15) -> f(10,9,-1)
 (15) []
 Type: List Integer
(16) -> f(10,0,100)
 (16) [72,83,41,35,27,0,33,18,60,38]
 Type: List Integer

hammar hammar 4,30224 silver badges26 bronze badges · Accepted Answer · 2012-02-02 02:45:08Z

Python

import random
def sample(n, lower, upper):
 result = []
 pool = {}
 for _ in xrange(n):
 i = random.randint(lower, upper)
 x = pool.get(i, i)
 pool[i] = pool.get(lower, lower)
 lower += 1
 result.append(x)
 return result

I probably just re-invented some well-known algorithm, but the idea is to (conceptually) perform a partial Fisher-Yates shuffle of the range lower..upper to get the length n prefix of a uniformly shuffled range.

Of course, storing the whole range would be rather expensive, so I only store the locations where the elements have been swapped.

This way, the algorithm should perform well both in the case where you're sampling numbers from a tight range (e.g. 1000 numbers in the range 1..1000), as well as the case where you're sampling numbers from a large range.

I'm not sure about the quality of randomness from the built-in generator in Python, but it's relatively simple to swap in any generator that can generate integers uniformly from some range.

1

\$\begingroup\$ Python uses Mersenne Twister, so it's relatively decent. \$\endgroup\$

ESultanik
– ESultanik

2012年03月05日 13:49:20 +00:00
Commented Mar 5, 2012 at 13:49

Stack Exchange Network

Random sampling without replacement [closed]

16 Answers 16

Python

python 2.7

Ruby >= 1.8.7

R

APL, (削除) 18 (削除ここまで) 22 bytes

Scala

compile and run:

Sample on a 2 Ghz Single Core:

Logic:

Critique:

Scheme

R

C++

Q (19 characters)

R, 31 or 40 bytes (depending on the meaning of the word "range")

Javascript (using external library) (64 bytes / 104 bytes??)

K (oK), 14 bytes

Axiom + its library

Linked

Hot Network Questions

Random sampling without replacement [closed]

16 Answers 16

Python

python 2.7

Ruby >= 1.8.7

R

APL, (削除) 18 (削除ここまで) 22 bytes

Scala

compile and run:

Sample on a 2 Ghz Single Core:

Logic:

Critique:

Scheme

R

C++

Q (19 characters)

R, 31 or 40 bytes (depending on the meaning of the word "range")

Javascript (using external library) (64 bytes / 104 bytes??)

K (oK), 14 bytes

Axiom + its library

Linked

Related

Hot Network Questions