
Thursday, February 23, 2012

Example 9.21: The birthday "problem" re-examined



The so-called birthday paradox or birthday problem is simply the counter-intuitive discovery that the probability of (at least) two people in a group sharing a birthday goes up surprisingly fast as the group size increases. If the group has only 23 people, there is a 50% chance that two of them share a birthday, and with 40 people it's about 90%. There is an excellent Wikipedia page discussing this.
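These figures are easy to verify from the standard analytic formula under the uniform-birthday assumption; here's a quick sketch in R (our own check, not part of the simulation below):

# P(at least one shared birthday among n people), assuming 365 equally
# likely birthdays: one minus the probability that all n are distinct
bdprob = function(n) 1 - prod((365 - 0:(n - 1))/365)
bdprob(23)   # about 0.507
bdprob(40)   # about 0.891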

However, this analytically derived probability is based on the assumption that births are equally likely on any day of the year. (It also ignores the occasional February 29th, and any social factors that lead people born at the same time of year to seek like spouses, and so forth.) But this assumption does not appear to be true, as has been laid out anecdotally and in the press.

As noted in the latter link, any disparity in the probability of birth between days will improve the chances of a match. But how much? An analytic solution seems quite complex, even if we approximate the true daily distribution with a constant birth probability per month. Simulation will be simpler. While we're at it, we'll include leap days as well, since February 29th approaches.

SAS

Our approach here is based on the observation that the probability of at least one match among N people is equal to the sum of the probabilities that the first match occurs with exactly 2, ..., N people. In addition, rather than simulating groups of 2, estimating the probability of a match, and repeating for groups of 3, ..., N, we'll keep adding people to a group until we have a match; this way we find the probability of a match for all group sizes at once.
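As a quick arithmetic check of this identity for N = 3 (our own illustration, in R for brevity):

1/365 + (364/365) * (2/365)   # P(first match at 2) + P(first match at 3)
1 - (364*363)/365^2           # P(at least one match among 3 people): identical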

Here we use arrays (section 1.11.5) to keep track of the number of days in a month and of the people in our group. To reduce computation, we'll check for matches as we add people to the group, and only generate their birthdays if there is not yet a match. We also demonstrate the useful hyphen tool for referring to ranges of variables (1.11.4).

data bd1;
  array daysmo [12] _temporary_ (31 28.25 31 30 31 30 31 31 30 31 30 31);
  array dob [367] dob1 - dob367;
    /* these variables will hold the birthdays; the hyphen
       includes all the variables in the sequence */
  do group = 1 to 10000000;   * simulate this many groups;
    match = 0;                * initialize whether there is a match in this group, yet;
    do i = 1 to 367;          * loop through up to 367 subjects, the maximum possible;
      month = rantbl(0, 31*.0026123, 28*.0026785, 31*.0026838, 30*.0026426,
        31*.0026702, 30*.0027424, 31*.0028655, 31*.0028954, 30*.0029407,
        31*.0027705, 30*.0026842);
        /* choose a month of birth, by the probabilities reported in the
           Science News link, which are daily rates by month; rantbl()
           assigns the remaining probability to the 12th value, December */
      day = ceil((4 * daysmo[month] * uniform(0))/4);
        /* choose a day within the month; note the trick used to get
           leap days: the fractional February of 28.25 days gives
           day 29 a quarter of a day's probability */
      dob[i] = mdy(month, day, 1960);
        /* convert month and day into a day of the year --
           1960 is a convenient leap year */
      do j = 1 to (i-1) until (match gt 0);
        * compare each old person to the new one;
        if dob[j] = dob[i] then match = i;
        * if there was a match, we needed i people in the group to make it;
      end;
      if match gt 0 then leave;   * no need to generate the other 367-i people;
    end;
    output;
  end;
run;
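As a sanity check on that leap-day trick (our own quick verification, in R for brevity): with a fractional February of 28.25 days, day 29 should turn up with probability 0.25/28.25 = 1/113.

u = runif(1e6)
mean(ceiling(28.25 * u) == 29)   # ~ 0.00885, vs. 1/113 = 0.00885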

We note here that while we allow up to 367 birthdays before a match, the probability of needing more than 150 people is so infinitesimal that we could save the space and speed up processing by ignoring the possibility. Now that the groups have been simulated, we just need to summarize and present them. We tabulate how many groups required each number N of people, generate the simple analytic answer, and merge them.

proc freq data = bd1;
  tables match / out=bd2 outcum;   * the bd2 data set has the results;
run;

data simpreal;
  set bd2;
  prob = 1 - ((fact(match) * comb(365, match)) / 365**match);
  realprob = cum_freq/10000000;
  diff = realprob - prob;
  diffpct = 100 * diff/prob;
run;

It's easiest to interpret the results by way of a plot. We'll plot the absolute and the relative difference on the same image with two different axes. The axis and symbol statements will make it slightly prettier, and allow us to make 0 appear at the same point on both axes.

axis1 order = (-.75 to .75 by .25) minor = none;
axis2 order = (-.00025 to .00025 by .00005) minor = none;
symbol1 v = dot h = .75 c = blue;
symbol2 font = marker v = U h = .5 c = red;

proc gplot data = simpreal (obs = 89);
  plot diffpct * match / vref = 0 vaxis = axis1 legend;
  plot2 diff * match / vaxis = axis2 legend;
run; quit;

The results, shown below, are very clear-- the leap day and the disequilibrium in birth month probabilities do increase the probability of at least one match in a group of a given size, relative to the uniform distribution across days assumed in the analytic solution. But the difference is minuscule on both the absolute and the relative scale.

R
Here we mimic the approach used above, but use the apply() function family in place of some of the looping.

dayprobs = c(.0026123,.0026785,.0026838,.0026426,.0026702,.0027424,.0028655,
.0028954,.0029407,.0027705,.0026842,.0026864)
daysmo = c(31,28,31,30,31,30,31,31,30,31,30,31)
daysmo2 = c(31,28.25,31,30,31,30,31,31,30,31,30,31)
# need both: the former is how the probs are reported,
# while the latter allows leap days

moprob = daysmo * dayprobs

With the monthly probabilities established, we can sample a birth month for everyone, and then choose a birth day within the month. We use the same trick as above to allow birthdays of February 29th. Here we show code for 10,000 groups; on the simple cloud-based R instance this code was developed with, larger runs caused a crash.

We've stopped referencing our book exhaustively, and doing so here would be tedious. Instead, we'll just comment that the tools we use here can be found in sections 1.4.5, 1.4.15, 1.4.16, 1.5.2, 1.8.3, 1.8.4, 1.9.1, 1.11.1, 5.2.1, 5.6.1, B.5.2, and probably others.

mob = sample(1:12, 10000 * 367, replace=TRUE, prob=moprob)
dob = sapply(mob, function(x) ceiling(sample((4*daysmo2[x]), 1)/4))
# sample() has to draw from a different range for each month, so the
# expression can't be vectorized directly; sapply() loops over the elements

mobdob = paste(mob,dob)
# concatenate the month and day to make a single variable to compare
# between people. The ISOdate() function would approximate the SAS mdy()
# function but would be much longer, and we don't need it.

# convert the vector into a matrix with the maximum
# group size as the number of columns
# as noted above, this could safely be truncated, with great savings
mdmat = matrix(mobdob, ncol=367, nrow=10000)

To find duplicate birthdays in each row of the matrix, we'll write a function that compares the number of unique values with the length of the vector, calling the comparison repeatedly in a for() loop until there is a difference. Then, to save (a lot of) computation, we'll break out of the loop and report the number of people needed to make the match. Finally, we'll call this vector-based function using apply() to perform it on each row of the birthday matrix.

matchn = function(x) {
  for (i in 1:367) {
    if (length(unique(x[1:i])) != i) break
  }
  return(i)
}

groups = apply(mdmat, 1, matchn)

bdprobs = cumsum(table(groups)/10000)
# find the N with each group number, divide by number of groups
# and get the cumulative sum

rgroups = as.numeric(names(bdprobs))
# extract the group sizes from the table
probs = 1 - ((factorial(rgroups) * choose(365,rgroups)) / 365**rgroups)
# calculate the analytic answer, for any group size
# for which there was an observed simulated value

diffs = bdprobs - probs
diffpcts = diffs/probs

To plot the differences and percent differences in probabilities, we modify (slightly) the functions for a multiple-axis scatterplot we show in our book in section 5.6.1. You can find the code for this and all the book examples on the book web site.

addsecondy <- function(x, y, origy, yname="Y2") {
  prevlimits <- range(origy)
  axislimits <- range(y)
  axis(side=4, at=prevlimits[1] + diff(prevlimits)*c(0:5)/5,
       labels=round(axislimits[1] + diff(axislimits)*c(0:5)/5, 3))
  mtext(yname, side=4)
  newy <- (y - axislimits[1])/(diff(axislimits)/diff(prevlimits)) +
    prevlimits[1]
  points(x, newy, pch=2)
  abline(h=(-axislimits[1])/(diff(axislimits)/diff(prevlimits)) +
    prevlimits[1])
}

plottwoy <- function(x, y1, y2, xname="X", y1name="Y1", y2name="Y2") {
  plot(x, y1, ylab=y1name, xlab=xname)
  abline(h=0)
  addsecondy(x, y2, y1, yname=y2name)
}

plottwoy(rgroups, diffs, diffpcts, xname="Number in group",
y1name="Diff in prob", y2name="Diff in percent")
legend(80, .0013, pch=1:2, legend=c("Diffs", "Pcts"))

The resulting plot (based on 100,000 groups, a tolerable compute time on a laptop) is shown at the top. Aligning the 0 on each axis was more of a hassle than seemed worth it for today. However, the message is equally clear-- a clearly larger probability with the observed birth distribution, but not a meaningful difference.

Friday, April 29, 2011

Example 8.36: Quadratic equation with real roots

We often simulate data in SAS or R to confirm analytical results. For example, consider the following problem from the excellent text by Rice:

Let U1, U2, and U3 be independent random variables uniform on [0, 1]. What is the probability that the roots of the quadratic U1*x^2 + U2*x + U3 are real?

Recall that for a quadratic equation A*x^2 + B*x + C to have real roots, we need the discriminant (B^2 - 4*A*C) to be non-negative.
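For instance (with numbers of our own choosing), x^2 + 3*x + 2 has discriminant 9 - 8 = 1, which is non-negative, and a quick check in R confirms the roots are real:

polyroot(c(2, 3, 1))   # coefficients in ascending order; roots -1 and -2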

The answer given in the second and third editions of Rice is 1/9. Here's how you might get there:

Since B^2 > 4*A*C <=> B > 2*sqrt(A*C), we need to integrate B over the range 2*sqrt(A*C) to 1, then integrate over all possible values for A and C (each from 0 to 1).

Another answer can be found by taking Y = B^2 and W = 4*A*C and integrating over their joint distribution (they're independent, of course). That leads to an answer of approximately 0.254. Here's how to calculate this in R:

f = function(x) {
  A = x[1]; B = x[2]; C = x[3]
  return(as.numeric(B^2 > 4*A*C))
}
library(cubature)
adaptIntegrate(f, c(0,0,0), c(1,1,1), tol=0.0001, maxEval=1000000)

which generates the following output:

$integral
[1] 0.2543692

$error
[1] 0.005612558

$functionEvaluations
[1] 999999

$returnCode
[1] -1

We leave the details of the calculations aside for now, but both answers seem equally plausible at first glance. A quick simulation can suggest which is correct.

For those who want more details, here's a more complete summary of this problem and solution.

SAS

Neither the SAS nor the R code is especially challenging.


data test;
  do trial = 1 to 10000;
    u1 = uniform(0); u2 = uniform(0); u3 = uniform(0);
    res = u2**2 - 4*u1*u3;
    realroot = (res ge 0);
    output;
  end;
run;

proc print data=test (obs=10);
run;

proc means data=test;
  var realroot;
run;

Leading to the result:

The MEANS Procedure

Analysis Variable : realroot

N Mean Std Dev Minimum Maximum
-----------------------------------------------------------------
10000 0.2556000 0.4362197 0 1.0000000
-----------------------------------------------------------------


R


numsim = 10000
u1 = runif(numsim); u2 = runif(numsim); u3 = runif(numsim)
res = u2^2 - 4*u1*u3
realroots = (res >= 0)
table(realroots)/numsim

With the result

realroots
FALSE TRUE
0.747 0.253

The simulation demonstrates that the first solution is incorrect and supports the second. Here the simulation serves as a valuable check on a complicated analysis.
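As further corroboration (our own conditioning argument, not taken from Rice): for fixed B = b, the probability that 4*A*C <= b^2 is t - t*log(t), with t = b^2/4, from the distribution function of the product of two independent uniforms. Integrating over b agrees with the simulation:

# P(AC <= t) = t - t*log(t) for 0 < t <= 1, A and C independent Uniform(0,1)
pgiven = function(b) { t = b^2/4; ifelse(t > 0, t - t*log(t), 0) }
integrate(pgiven, 0, 1)   # 0.2544134
(5 + 3*log(4))/36         # the same integral in closed form, about 0.2544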

Insights into where the 1/9 solution fails would be welcomed in the comments.

Monday, March 15, 2010

Example 7.27: probability question reconsidered

In Example 7.26, we considered a problem, from the xkcd blog:

Suppose I choose two (different) real numbers, by any process I choose. Then I select one at random (p= .5) to show Nick. Nick must guess whether the other is smaller or larger. Being right 50% of the time is easy. Can he do better?

Randall Munroe offers a solution which we tested in the previous example. However, his logic may not be obvious to all readers. Instead, Martin Kulldorff, author of the SaTScan software for cluster detection, offers the following intuition. Suppose Nick chooses a number k. If the first number is bigger than k, he'll guess that the hidden number is smaller, and vice versa. Let's condition to see what will happen. If both numbers are bigger than k, Nick will be right 50% of the time. If both are smaller than k, he'll also be right 50% of the time. But whenever the two numbers straddle k, he'll be right 100% of the time. So marginally, he'll be right more than 50% of the time (because the two numbers are not identical, there is room for k to fall between them). This strategy will sometimes work if I don't know Nick is using it and he makes a lucky choice of k. But even if I know Nick's strategy, he will still achieve a rate over 50% by simply choosing k at random, as sketched below.
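Here's a minimal sketch of that conditioning argument in R, with a fixed threshold k = 0 and standard normal numbers (both our own illustrative choices, not part of the original strategy). The two numbers straddle 0 about half the time, so the success rate should be near 0.5*0.5 + 0.5*1 = 0.75:

set.seed(42)
n = 100000
x = rnorm(n); y = rnorm(n)            # two (different) numbers
shown = ifelse(runif(n) < .5, x, y)   # reveal one at random
hidden = ifelse(shown == x, y, x)
guesslower = (shown > 0)              # k = 0: guess the hidden number is smaller
correct = ifelse(guesslower, hidden < shown, hidden > shown)
mean(correct)                         # about 0.75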

Munroe's strategy is the same as that outlined here, if we choose

k = log(u/(1-u))

where u is a Uniform(0,1) random variate -- in other words, k is a standard logistic draw. This has a similar problem to Munroe's approach in practice: k rarely (only about 4 times in a billion) exceeds 20 in absolute value. A more effective strategy would be to choose k from a distribution with values spread more widely across the reals. In the following code we use a Cauchy distribution (see section 1.10.1), spreading it out even further by multiplying each pseudo-random number by 1000.
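(A quick check of that tail claim in R: since log(u/(1-u)) for uniform u is exactly a standard logistic draw, its tails are available from plogis().)

plogis(-20)       # one tail beyond -20: about 2.1e-9
2 * plogis(-20)   # P(|k| > 20): about 4.1e-9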

SAS

The code is similar to that shown in the previous example, but does not require computing the function of the observed value.


data test2;
  do i = 1 to 100000;
    real1 = uniform(0) * 1000 + 1000;
    real2 = uniform(0) * 1000 + 1000;
    if uniform(0) gt .5 then do;
      env1 = real1;
      env2 = real2;
    end;
    else do;
      env1 = real2;
      env2 = real1;
    end;
    if (env1 gt (rand("CAUCHY") * 1000)) then guess = "l";
    else guess = "h";
    correct = (((env1 < env2) and (guess eq "h")) or
               ((env1 > env2) and (guess eq "l")));
    output;
  end;
run;

proc freq data = test2;
  tables correct / binomial (level='1');
run;

The FREQ Procedure

Cumulative Cumulative
correct Frequency Percent Frequency Percent
------------------------------------------------------------
0 48315 48.32 48315 48.32
1 51685 51.69 100000 100.00


Proportion 0.5169
ASE 0.0016
95% Lower Conf Limit 0.5138
95% Upper Conf Limit 0.5199

Exact Conf Limits
95% Lower Conf Limit 0.5137
95% Upper Conf Limit 0.5200

ASE under H0 0.0016
Z 10.6569
One-sided Pr> Z <.0001
Two-sided Pr> |Z| <.0001

Sample Size = 100000



R


mk1 = function(n) {
  real1 = runif(n) * 1000 + 1000
  real2 = runif(n) * 1000 + 1000
  pickenv = (runif(n) < .5)
  env1 = ifelse(pickenv, real1, real2)
  env2 = ifelse(!pickenv, real1, real2)
  k = rcauchy(n) * 1000
  guess = ifelse(env1 > k, "lower", "higher")
  correct = ((env1 < env2) & (guess == "higher")) |
            ((env1 > env2) & (guess == "lower"))
  return(correct)
}

> binom.test(sum(mk1(100000)),100000)

Exact binomial test

data: sum(mk1(1e+05)) and 1e+05
number of successes = 51764, number of trials = 1e+05, p-value <
2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5145375 0.5207415
sample estimates:
probability of success
0.51764


This approach is so much better than the other that we can detect the effect in only 100,000 trials!
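A back-of-envelope check (our own arithmetic) of why 100,000 trials suffice here: the observed edge over 50% is several times the 95% CI half-width at that sample size.

1.96 * sqrt(0.25/1e5)   # 95% CI half-width at n = 100,000: ~ 0.0031
0.51764 - 0.5           # observed edge: ~ 0.0176, many half-widths from 0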

Monday, March 8, 2010

Example 7.26: probability question

Here's a surprising problem, from the xkcd blog.

Suppose I choose two (different) real numbers, by any process I choose. Then I select one at random (p= .5) to show Nick. Nick must guess whether the other is smaller or larger. Being right 50% of the time is easy. Can he do better?

Of course, it wouldn't be an interesting question if the answer was no. Randall Munroe (author of the xkcd webcomic) offers a solution as follows:


1. Call the number you see x.
2. Calculate p(x) = (e^x)/(1 + e^x).
3. Draw u, a random Uniform(0,1).
4. If u < p(x), then guess that the hidden number is
   smaller than the revealed one. Otherwise,
   guess that it is larger.


Munroe shows that the probability of guessing correctly is

P = 0.5 * (1 - p(A)) + 0.5 * p(B)

where A is the smaller and B the larger of the two numbers. Since p() is increasing, p(B) > p(A), so P exceeds 0.5. In practice, this only gains us an advantage when the observed value is between about -20 and 20-- outside that range, calculation of p(x) results in numbers so close to 0 or 1 that no improvement over a 50% rate can be achieved. If I knew Nick were using this method, I could keep him to an effectively 50% rate by always choosing numbers over 1000 or so. He can improve Munroe's solution by using p(x/100000), though at the cost of a smaller improvement near 0.
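A quick look at that saturation in R (a sketch; note that plogis() computes the same function as p()):

p = function(x) exp(x)/(1 + exp(x))   # Munroe's p(x); equivalent to plogis(x)
p(20)              # 0.9999999979: so close to 1 that u < p(x) almost always
p(1500/100000)     # 0.5037: rescaling by 100000 restores a usable gradient
p(2000/100000)     # 0.5050: numbers in [1000, 2000] now map to distinct p's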

We'll check the logic by simulating the experiment.

SAS

The SAS solution requires a lot of looping (section 1.11.1) and conditional execution (section 1.11.2).


data test;
  do i = 1 to 10000000;
    real1 = uniform(0) * 1000 + 1000;
    real2 = uniform(0) * 1000 + 1000;
    if uniform(0) gt .5 then do;
      env1 = real1;
      env2 = real2;
    end;
    else do;
      env1 = real2;
      env2 = real1;
    end;
    if uniform(0) lt (1/(1 + exp(-1 * env1/100000))) then guess = "l";
    else guess = "h";
    correct = ((env1 < env2) and (guess eq "h")) or
              ((env1 > env2) and (guess eq "l"));
    output;
  end;
run;


We can test the performance using proc freq with the binomial option (section 2.1.9).


proc freq data = test;
tables correct / binomial;
run;

The FREQ Procedure

Cumulative Cumulative
correct Frequency Percent Frequency Percent
------------------------------------------------------------
0 4996889 49.97 4996889 49.97
1 5003111 50.03 10000000 100.00


Proportion 0.5003
ASE 0.0002
95% Lower Conf Limit 0.5000
95% Upper Conf Limit 0.5006

Exact Conf Limits
95% Lower Conf Limit 0.5000
95% Upper Conf Limit 0.5006

ASE under H0 0.0002
Z 1.9676
One-sided Pr> Z 0.0246
Two-sided Pr> |Z| 0.0491

Sample Size = 10000000


We can just barely detect the improvement-- but it took 10,000,000 trials to get confidence intervals that reliably exclude a 50% success rate.
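To see why so many trials were needed (our own back-of-envelope arithmetic): the edge over 50% here is about 0.0003, essentially the same size as the 95% CI half-width at n = 10,000,000.

1.96 * sqrt(0.25/1e7)   # 95% CI half-width at n = 10,000,000: ~ 0.00031
0.5003 - 0.5            # observed edge: ~ 0.0003, right at the margin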




R

We begin by writing a function to calculate p(x/100000).


xkcdprob = function(x) {
  1/(1 + exp(-1 * x/100000))
}


Then we write a function to perform the experiment n times. Here we use the same process for picking the two numbers, choosing them to be Uniform between 1000 and 2000. In the code below we use the ifelse() function (section 1.11.2) to decide which envelope is shown first. The vectorization in R allows us to avoid loops entirely in writing the code.


larger1 = function(n) {
  real1 = runif(n) * 1000 + 1000
  real2 = runif(n) * 1000 + 1000
  pickenv = (runif(n) < .5)
  env1 = ifelse(pickenv, real1, real2)
  env2 = ifelse(!pickenv, real1, real2)
  guess = ifelse(runif(n) < xkcdprob(env1), "lower", "higher")
  correct = ((env1 < env2) & (guess == "higher")) |
            ((env1 > env2) & (guess == "lower"))
  return(correct)
}


Then we can run the experiment and check its results in one nested call, using the binom.test() function (section 2.1.9) to see whether the observed proportion is different from 0.5.


> binom.test(sum(larger1(10000000)),10000000)

Exact binomial test

data: sum(larger1(1e+07)) and 1e+07
number of successes = 5003996, number of trials = 1e+07, p-value =
0.01150
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5000897 0.5007095
sample estimates:
probability of success
0.5003996
