Showing posts with label replicate(). Show all posts
Showing posts with label replicate(). Show all posts

Monday, November 17, 2014

Example 2014.13: Statistics doesn't have to be so hard! Resampling in R and SAS

A recent post pointed us to a great talk that elegantly described how inferences from a trial could be analyzed with a purely resampling-based approach. The talk uses data from a paper that considered the association between beer consumption and mosquito attraction. We recommend the talk linked above for those thinking about creative ways to teach inference.

In this entry, we demonstrate how straightforward it is to replicate these analyses in R, and show how they can be done in SAS.

R

We'll repeat the exercise twice in R: first using the mosaic package that Nick and colleagues have been developing to help teach statistics, and then in base R.

For mosaic, we begin by entering the data and creating a dataframe. The do() operator and the shuffle() function facilitate carrying out a permutation test (see also section 5.4.5 of the book).

beer = c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20)
water = c(21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20)
ds = data.frame(y = c(beer, water), 
 x = c(rep("beer", length(beer)), rep("water", length(water))))
require(mosaic)
obsdiff = compareMean(y ~ x, data=ds)
nulldist = do(999)*compareMean(y ~ shuffle(x), data=ds)
histogram(~ result, xlab="permutation differences", data=nulldist)
ladd(panel.abline(v=obsdiff, col="red", lwd=2))
> obsdiff
[1] -4.377778
> tally(~ abs(result) > abs(obsdiff), format="percent", data=nulldist)
 TRUE FALSE 
 0.1 99.9 

The do() operator evaluates the expression on the right hand side a specified number of times. In this case we shuffle (or permute) the group indicators.

We observe a mean difference of 4.4 attractions (comparing the beer to water groups). The histogram of the results-- plotted with the lattice graphics package that mosaic loads by default-- demonstrates that this result would be highly unlikely to occur by chance: if the null hypothesis that the groups were equal was true, results more extreme than this would happen only 1 time out of 1000. This can be displayed using the tally() function, which adds some functionality to table(). We can calculate the p-value by including the observed statistic in the numerator and the denominator = (1+1)/(999 + 1) = .002.

For those not invested in the mosaic package, base R functions can be used to perform this analysis . We present a version here that begins after making the data vectors.
alldata = c(beer, water)
labels = c(rep("beer", length(beer)), rep("water", length(water)))
obsdiff = mean(alldata[labels=="beer"]) - mean(alldata[labels=="water"])
> obsdiff
[1] -4.377778

The sample() function re-orders the labels, effectively implementing the supposition that the number of bites might have happened under either the water or the beer drinking regimen.
resample_labels = sample(labels)
resample_diff = mean(alldata[resample_labels=="beer"]) - 
 mean(alldata[resample_labels=="water"])
resample_diff
[1] 1.033333

In a teaching setting, the preceding code could be re-run several times, to mimic the presentation seen in the video linked above. To repeat many times, the most suitable base R tool is replicate(). To use it, we make a function of the resampling procedure shown above.
resamp_means = function(data, labs){
 resample_labels = sample(labs)
 resample_diff = mean(data[resample_labels=="beer"]) - 
 mean(data[resample_labels=="water"])
 return(resample_diff)
}
nulldist = replicate(9999,resamp_means(alldata,labels))
hist(nulldist, col="cyan")
abline(v = obsdiff, col = "red")

The histogram is shown above. The p-value is obtained by counting the proportion of statistics (including the actual observed difference) among greater than or equal to the observed statistic:
alldiffs = c(obsdiff,nulldist)
p = sum(abs(alldiffs >= obsdiff)/ 10000)


SAS

The SAS code is relatively cumbersome in comparison. We begin by reading the data in, using the "line hold" double-ampersand and the infile datalines statement that allows us to specify a delimiter (other than a space) when reading data in directly in a data step. This let us copy the data from the R code. To identify the water and beer regimen subjects, we use the _n_ implied variable that SAS creates but does not save with the data.

The summary procedure generates the mean for each group and saves the results in a data set with a row for each group; the transpose procedure makes a data set with a single row and a variable for each group mean. Finally, we calculate the observed difference and use call symput to make it into a macro variable for later use.
data bites;;
if _n_ le 18 then drink="water";
 else drink="beer";
infile datalines delimiter=',';
input bites @@;
datalines;
21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20
27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24
28, 24, 29, 21, 21, 18, 27, 20
;
run;
proc summary nway data = bites;
class drink;
var bites;
output out=obsmeans mean=mean;
run;
proc transpose data = obsmeans out=om2;
var mean;
id drink;
run;
data om3; 
set om2;
obsdiff = beer-water;
call symput('obsdiff',obsdiff);
run;
proc print data = om3; var obsdiff; run;
 Obs obsdiff
 1 4.37778
(Yes, we could have done this with proc ttest and ODS. But the spirit of the video is that we don't understand t-tests, so we want to avoid them.)

To rerandomize, we can assign a random number to each row, sort on the random number, and assign drink labels based on the new order of the data.
data rerand;
set bites;
randorder = uniform(0);
run;
proc sort data = rerand; by randorder; run;
data rerand2;
set rerand;
if _n_ le 18 then redrink = "water";
 else redrink = "beer";
run;
proc summary nway data = rerand2;
class redrink;
var bites;
output out=rerand3 mean=mean;
run;
proc transpose data = rerand3 out=rerand4;
var mean;
id redrink;
run;
data rrdiff; 
set rerand4;
rrdiff = beer-water;
run;
proc print data = rrdiff; var rrdiff; run;
 Obs rrdiff
 1 -1.73778
One way to repeat this a bunch of times would be to make a macro out of the above and collect the resulting rrdiff into a data set. Instead, we use the surveyselect procedure to do this much more efficiently. The groups option sample groups of 18 and 25 from the data, while the reps option requests this be done 9,999 times. We can then use the summary and transpose procedures as before, with the addition of the by replicate statement to make a data set with columns for each group mean and a row for each replicate.
proc surveyselect data = bites groups=(18,25) reps = 9999 out = ssresamp; run;
proc summary nway data = ssresamp;
by replicate;
class groupid;
var bites;
output out=ssresamp2 mean=mean;
run;
proc transpose data = ssresamp2 out=ssresamp3 prefix=group;
by replicate;
var mean;
id groupid;
run;
To get a p-value and make a histogram, we use the macro variable created earlier.
data ssresamp4;
set ssresamp3;
diff = group2 - group1;
exceeds = abs(diff) ge &obsdiff;
run;
proc means data = ssresamp4 sum; var exceeds; run;
 The MEANS Procedure
 Analysis Variable : exceeds
 Sum
 9.0000000
proc sgplot data = ssresamp4;
histogram diff;
refline &obsdiff /axis=x lineattrs=(thickness=2 color=red);
run;
The p-value is 0.001 (= (9+1)/10000).


An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. With exceptions noted above, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, October 29, 2012

Example 10.7: Fisher vs. Pearson

In the early days of the discipline of statistics, R.A. Fisher argued with great vehemence against Egon Pearson (and Jerzy Neyman) over the foundational notions supporting statistical inference. The personal invective recorded is somewhat amusing and also reminds us how very puerile even very smart people can be.

Today, we'll compare Fisher's exact test for 2*2 tables with the Pearson chi-square, developed by Karl Pearson, Egon's father and another early pioneer of statistics. This blog entry was inspired by a questioner on LinkedIn who asked when should the one be preferred over the other. One commenter gave the typical rule of thumb-- "If the expected count in any cell is less than 5, use the exact test, otherwise use the chi-square." My copy of "Statistical rules of thumb" is AWOL at the moment, so I don't know if this one is covered there. A quick googling did not reveal an answer either.

The rule of thumb dates back to the days before the exact test became computationally feasible outside of small problems. In contrast, today it can be performed quickly for all tables, either through a complete enumeration of the possible tables or through Monte Carlo hypothesis testing, which is simple to apply in either SAS or R. My default in recent years has been to take advantage of this capability and use the exact test all the time, ignoring the traditional approximate chi-square test. My idea was that if there were any small cells, I'd be covered, while allowing a simpler methods section.

Let's develop some code to see what happens. Is the rule of thumb accurate? What's the power cost of using the exact test instead of the Chi-square?

Our approach will be to set cell probabilities and the sample size, and simulate data under this model, then perform each test and evaluate the proportion of rejections under each test. One complication here is that simulated data might result in a null margin, i.e., there might be no observed values in a row or in a column. We'll calculate rejections of the null only among the tables where this does not happen. This means that the average observed cell counts among included tables may be different from the expect cell counts. This makes sense from a practical perspective-- we probably would not do the test if we observed 0 subjects in one of our planned categories.

SAS
In SAS, we'll do the dumb straightforward thing and simulate 100 pairs of dichotomous variables. Here we just code the null case of no association, with margins of 70% in one column and 80% in one row. The smallest cell has an expected count of 6%, so that a total sample size of 83 will have an expected count of 5 in that cell.
data test;
pdot1 = .7;
p1dot = .8;
do tablen = 20, 50, 100, 200, 500, 1000;
 do ds = 1 to 10000;
 do i = 1 to tablen;
 xnull = uniform(0) gt pdot1;
 ynull = uniform(0) gt p1dot;
 output;
 end;
 end;
 end;
run;
Then proc freq can be used to generate the two p-values, using the by statement to do the calculations for all the tables at once. The output statement extracts the p-values into a data set.
ods select none;
options nonotes;
proc freq data = test;
by tablen ds;
tables ynull * xnull / chisq fisher;
output out = kk1 chisq fisher;
run;
options notes;
ods select all;
To get the proportion of rejections, we first use a data step to calculate whether each test was rejected, then go back to proc freq to find the proportion of rejections and the CI on the probability of rejections.
data summ;
set kk1 (keep = tablen p_pchi xp2_fish);
rej_pchi = (p_pchi lt 0.05);
rej_fish = (xp2_fish lt .05);
run;
ods output binomialprop = kk2;
proc freq data = summ;
by tablen;
tables rej_pchi rej_fish / binomial(level='1');
run;
You may have noticed that we didn't do anything to account for the tables with empty rows or columns. When the initial proc freq encounters such a table, it performs neither test. Thus the second proc freq is calculating the proportion and CI with a denominator that might be smaller than the number of tables we simulated. Happily, they'll still be correct, though the CI may be wider than we'd intended. Finally, we're ready to plot the results, using the hilob interpolation described in Example 10.4. Using hiloc instead shows the "close" as a tick mark between the high and low values.
data kk2a;
set kk2;
if table eq "Table rej_pchi" then tablen = tablen + 1;
run;
symbol1 i = hiloc;
symbol2 i = hiloc;
proc gplot data = kk2a;
where name1 in ("_BIN_","XL_BIN","XU_BIN");
plot nvalue1 * tablen = table / vref = 0.05 href=83;
/* ref lines where the expected count in the smallest cell is > 5, 
and the nominal alpha */
run; quit;
The results are shown above. The confidence limits should include 0.05 for all numbers of subjects in order to be generally recommended. Both tests reach this standard, with these margins, even for tables with only 20 subjects, i.e., with expected cell counts of 11, 5, 3, and 1. The exact test appears conservative (rejects less than 5% of the time), probably due to small cell counts and the resulting ties in the list of possible tables.

R
In R, we'll simulate observations from a multinomial distribution with the desired cell probabilities, and assemble the result into a table to calculate the p-values. This will make it easier to simulate tables under the alternative, as we need to do to assess power. If there are empty rows or columns, the chisq.test() function produces a p-value of "NaN", which will create problems later. To avoid this, we'll put the table generation inside a while() function. This operates like the do while construction in SAS (and other programming languages). The condition we check for is whether there is a row or column with 0 observations; if so, try generating the data again. We begin by initializing the table with 0's.
makeitm = function(n.in.table, probs) {
 myt = matrix(rep(0,4), ncol=2)
 while( (min(colSums(myt)) == 0) | (min(rowSums(myt)) == 0) ) { 
 myt = matrix(rmultinom(n=1, size=n.in.table, probs), ncol=2,byrow=TRUE)
}
 chisqp = chisq.test(myt, correct=FALSE)$p.value
 fishp = fisher.test(myt)$p.value
 return(c(chisqp, fishp))
}
With this basic building block in place, we can write a function to call it repeatedly (using the replicate() function, then calculate the proportion of rejections and get the CI for the probability of rejections.
many.ns = function(tablen, nds,probs) {
 res1 = t(replicate(nds,makeitm(tablen,probs)))
 res2 = res1 < 0.05
 pear = sum(res2[,1])/nds
 fish = sum(res2[,2])/nds
 pearci = binom.test(sum(res2[,1]),nds)$conf.int[1:2]
 fishci = binom.test(sum(res2[,2]),nds)$conf.int[1:2]
 return(c("N.ds" = nds, "N.table" = tablen, probs, 
 "Pearson.p" = pear, "PCIl"=pearci[1], "PCIu"=pearci[2],
 "Fisher.p" = fish, "FCIl" = fishci[1], "FCIu" = fishci[2]))
}
Finally, we're ready to actually do the experiment, using sapply() to call the function that calls replicate() to call the function that makes a table. The result is converted to a data frame to make the plotting simpler. The first call below replicates the SAS result shown above and has very similar estimates and CI, but is not displayed here. The second shows an odds ratio of 3, the third (plotted below) has an OR of 1.75, and the last an OR of 1.5.
#res3 = data.frame(t(sapply(c(20, 50, 100, 200, 500, 1000),many.ns, nds=10000, 
 probs = c(.56,.24,.14,.06))))
#res3 = data.frame(t(sapply(c(20, 50, 100, 200, 500, 1000),many.ns, nds=1000, 
 probs = c(.6,.2,.1,.1))))
res3 = data.frame(t(sapply(c(20, 50, 100, 200, 500, 1000),many.ns, nds=1000, 
 probs = c(.58,.22,.12,.08))))
#res3 = data.frame(t(sapply(c(20, 50, 100, 200, 500, 1000),many.ns, nds=1000, 
 probs = c(.57,.23,.13,.07))))
with(res3,plot(x = 1, y =1, type="n", ylim = c(0, max(c(PCIu,FCIu))), xlim=c(0,1000),
 xlab = "N in table", ylab="Rejections", main="Fisher vs. Pearson"))
with(res3,points(y=Pearson.p, x=N.table,col="blue"))
with(res3,segments(x0 = N.table, x1=N.table, y0 = PCIl, y1= PCIu, col = "blue"))
with(res3,points(y=Fisher.p, x=N.table + 2,col="red"))
with(res3,segments(x0 = N.table+2, x1=N.table+2, y0 = FCIl, y1= FCIu, col = "red"))
abline(v=83)
abline(h=0.05)
The plotting commands used above have been demonstrated in Examples 10.4 and 8.42.
Overall, the results show (in the SAS plot, at the top) that the Pearson chi-square test does perfectly well at protecting the alpha level under the null with these margins, even when the expected number of cases in one cell is as small as 1. In contrast, compared to the exact test, the Chi-square test has a bit more power, for these cell probabilities. The power difference is greatest when the N is smaller. Given this example, I would say that the rule of thumb may be too conservative, pushing people away from a more powerful test unnecessarily. In the future, I plan to be more positive about using the Pearson chi-square.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, October 15, 2012

Example 10.6: Should Poisson regression ever be used? Negative binomial vs. Poisson regression

In practice, we often find that count data is not well modeled by Poisson regression, though Poisson models are often presented as the natural approach for such data. In contrast, the negative binomial regression model is much more flexible and is therefore likely to fit better, if the data are not Poisson. In example 8.30 we compared the probability mass functions of the two distributions, and found that for a given mean, the negative binomial closely approximates the Poisson, as the scale parameter increases. But how does this affect the choice of regression model? How might another alternative, the overdispersed, or quasi-Poisson model compete with these? Today we generate a rudimentary toolkit for assessing the effects of Poisson, negative binomial, and quasi-Poisson models, assuming data are truly generated by one or the other process.

SAS
We'll begin by simulating Poisson and negative binomial data. Note that we also rely on the poismean_nb function that we created in example 8.30-- this is needed because SAS only accepts the natural parameters of the distribution, while the mean is a (simple) function of the two parameters. As is typical in such settings, we'll begin by generating data under the null of no association between, in this case, the normal covariate and the count outcome. The proportion of rejections should be no greater than alpha (5%, here). However, we'll include code to easily simulate data under the alternative as well. This will facilitate assessing the relative power of the models, later.
data nbp;
do ds = 1 to 10000;
 do i = 0 to 250;
 x = normal(0);
 mean = exp(-.25 + (0 * x));
 pois = rand("POISSON",mean);
 nb1 = rand("NEGBINOMIAL", poismean_nb(mean, 1), 1);
 output;
 end;
 end;
run;
The models will be fit in proc genmod. (See sections 4.1.3, 4.1.5, table 4.1.) It would be good to write a little macro to change the distribution and the output names, but it's not necessary. To save space here, the repetitive lines are omitted. The naming convention is that the true distribution (p or nb) is listed first, followed by the fit model (p, nb, or pod, for overdispersed).
 
ods select none;
ods output parameterestimates = ppests;
proc genmod data = nbp;
by ds;
model pois = x /dist=poisson;
run;
ods output parameterestimates = ppodests;
model pois = x /dist=poisson scale = p;
ods output parameterestimates = pnbests;
model pois = x /dist=negbin;
ods output parameterestimates = nbnbests;
model nb1 = x /dist=negbin;
ods output parameterestimates = nbpests;
model nb1 = x /dist=poisson;
ods output parameterestimates = nbpodests;
model nb1 = x /dist=poisson scale=p;
ods select all;
For analysis, we'll bring all the results together using the merge statement (section 1.5.7). Note that the output data sets contain the Wald CI limits as well as the estimates themselves; all have to be renamed in the merge, or they will overwrite each other.
data results;
merge
ppests (rename = (estimate = pp_est lowerwaldcl = pp_l 
 upperwaldcl = pp_u))
ppodests (rename = (estimate = ppod_est lowerwaldcl = ppod_l 
 upperwaldcl = ppod_u))
pnbests (rename = (estimate = pnb_est lowerwaldcl = pnb_l 
 upperwaldcl = pnb_u))
nbnbests (rename = (estimate = nbnb_est lowerwaldcl = nbnb_l 
 upperwaldcl = nbnb_u))
nbpests (rename = (estimate = nbp_est lowerwaldcl = nbp_l 
 upperwaldcl = nbp_u))
nbpodests (rename = (estimate = nbpod_est lowerwaldcl = nbpod_l 
 upperwaldcl = nbpod_u));
where parameter eq "x";
target = 0;
pprej = ((pp_l gt target) or (pp_u lt target));
ppodrej = ((ppod_l gt target) or (ppod_u lt target));
pnbrej = ((pnb_l gt target) or (pnb_u lt target));
nbnbrej = ((nbnb_l gt target) or (nbnb_u lt target));
nbprej = ((nbp_l gt target) or (nbp_u lt target));
nbpodrej = ((nbpod_l gt target) or (nbpod_u lt target));
run;
The indicators of CI that exclude the null are calculated with appropriate names using logical tests that are 1 if true (rejections) and 0 if false. (See, e.g., section 1.4.9.) The final results can be obtained from proc means
proc means data = results; 
var pp_est ppod_est pnb_est nbnb_est nbp_est nbpod_est 
 pprej ppodrej pnbrej nbnbrej nbprej nbpodrej; 
run;
 Variable Mean
 -------------------------
 pp_est -0.000349738
 ppod_est -0.000349738
 pnb_est -0.000344668
 nbnb_est 0.0013738
 nbp_est 0.0013588
 nbpod_est 0.0013588
 pprej 0.0505000
 ppodrej 0.0501000
 pnbrej 0.0468000
 nbnbrej 0.0535000
 nbprej 0.1427000
 nbpodrej 0.0555000
 -------------------------
All of the estimates appear to be unbiased. However, Poisson regression, when applied to the truly negative binomial data, appears to be dramatically anticonservative, rejecting the null (i.e., with CI excluding the null value) 14% of the time. The overdispersed model may be slightly biased as well. The estimated proportion of rejections is 5.55%, or 555 of 10,000 experiments. An exact CI for the proportion excludes 5%, here, although the anticonservative bias appears to be slight. To test other effect sizes, we'd change the mean, set in the first data step and the target in the results data. It would also be valuable to change the scale parameter for the negative binomial.


R
We begin by defining two simple functions: one to extract the standard errors from a model, and the second to assess whether Wald-type CI for parameter estimates exclude some value. It's a bit confusing that a standard error extracting function is not part of R. Or perhaps it is, and someone will point out the obvious function in the comments. It's useful to use the standard errors and construct the Wald CI in the current setting because the obvious alternative for constructing CI, the confint() function, uses profile likelihoods, which would be too time-consuming in a simulation setting. The second function accepts the parameter estimate, its standard error, and a fixed value which we want to know is in or out of the CI. Both functions are actually single expressions, but having them in hand will reduce the typing in the main function.
# this will work for any model object that works with vcov()
# the test for positive variance should be unnecessary but can't hurt
stderrs = function(model) {
 ifelse(min(diag(vcov(model)))> 0, sqrt(diag(vcov(model))), NA) 
}
# short and sweet: 1 if target is out of Wald CI, 0 if in
ciout = function(est, se, target){
 ifelse( (est - 1.96*se> target) | (est + 1.96*se < target), 1,0) } 
With these ingredients prepared, we're ready to write a function to fit the three models to the two sets of observed data. The function will accept a number of observations per data set and a true beta. The Poisson and overdispersed Poisson are fit with the glm() function (section 4.1.3, table 4.1) but the negative binomial uses the glm.nb() function found in the MASS package (section 4.1.5).
testpnb = function(n, beta) {
# make data
n = 250
x = rnorm(n)
mean = exp(-.25 + (beta * x))
pois = rpois(n,mean)
nb1 = rnbinom(n, size=1, mu=mean)
# fit models to Poisson data
pp = glm(pois ~x, family="poisson")
ppod = glm(pois ~x, family="quasipoisson")
pnb = glm.nb(pois~x)
# fit models to nb data
nbnb = glm.nb(nb1 ~x)
nbp = glm(nb1 ~x, family="poisson")
nbpod = glm(nb1 ~x, family="quasipoisson")
# extract parameter estimates using the coef() function
est = as.numeric(c(coef(pp)[2], coef(ppod)[2], coef(pnb)[2], coef(nbnb)[2], coef(nbp)[2], coef(nbpod)[2]))
# use our two new functions to get the SE and the CI indicator
se = c(stderrs(pp), stderrs(ppod), stderrs(pnb), stderrs(nbnb), stderrs(nbp), stderrs(nbpod))
ci = ciout(est, se, rep(beta,6))
return(matrix(c(est,se,ci),ncol=3))
}
Now we can use the convenient replicate() function to call the experiment many times. Since the output of testnb() is a matrix, the result of replicate() is a three-dimensional matrix, R * C * sheet, where sheet here corresponds to each experimental replicate. To summarize the results, we can use the rowMeans() function to get the proportion of rejections or the mean of the estimates.
mainout = replicate(10000,testpnb(250,0))
# the [,3,] below means "all rows, column 3, all sheets"> rowMeans(mainout[,3,])
[1] 0.0490 0.0514 0.0463 0.0490 0.1403 0.0493> rowMeans(mainout[,1,])
[1] 0.0003482834 0.0003482834 0.0003558526 -0.0004123949 -0.0003972441 -0.0003972441
The results agree completely with the SAS results discussed above. The naive Poisson regression would appear a bad idea--if the data are negative binomial, tests don't have the nominal size. It would be valuable to replicate the experiment with some other distribution for the real data as well. One approach to modeling count data would be to fit the Poisson and assess the quality of the fit, which can be done in several ways. However, this iterative fitting also jeopardizes the size of the test, in theory. Perhaps we'll explore the practical impact of this in a future entry. Fortunately, at least in this limited example, a nice alternative exists: We can just fit the negative binomial by default. The costs of this in terms of power could be assessed with a thorough simulation study, but are likely to be small, since only one additional parameter is estimated. And the size of the test is hardly affected at all. The quasi-Poisson model could also be recommended, but has the drawback of relying on what is actually not a viable distribution for the data. Some sources suggest that it may be even more flexible than the negative binomial, however.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, October 1, 2012

Example 10.4: Multiple comparisons and confidence limits


A colleague is a devotee of confidence intervals. To him, the CI have the magical property that they are immune to the multiple comparison problem-- in other words, he feels its OK to look at a bunch of 95% CI and focus on the ones that appear to exclude the null. This though he knows well the one-to-one relationship between 95% CIs that exclude the null and p-values below 0.05.

Today, we'll create a Monte Carlo experiment to demonstrate that fishing by CI is just as dangerous as fishing by p-value; generating the image above. We'll do this by replicating a bivariate experiment 100 times. Later, we'll examine the results of a single experiment with many predictors.


R
To begin with, we'll write a function to generate a single experiment, using a logistic regression. This is a simple modification of one of our first and most popular entries.
simci = function(){
 intercept = 0
 beta = 0
# beta = 0 because we're simulating under the null 
# make the variance of x in this experiment vary a bit
 xtest = rnorm(1000) * runif(1,.6,1.4)
 linpred = intercept + xtest*beta
 prob = exp(linpred)/(1 + exp(linpred))
 runis = runif(1000)
 ytest = ifelse(runis < prob,1,0)
# now, fit the model
 fit = glm(ytest~xtest,family=binomial(link="logit"))
# the standard error of the estimates is easiest to find in the
 pe = summary(fit)$coefficients
# calculate the Wald CI; an alternative would be confint(), but
# that calculated profile CI, which take longer to generate
 ci = exp(c(pe[2,1] - 1.96*pe[2,2], pe[2,1] + 1.96*pe[2,2] ))
 return(ci)
}

Then we can use the replicate() function to repeat the experiment 100 times. The t() function (section 1.9.2) transposes the resulting matrix to have one row per experiment.
sim100 = t(replicate(100,simci()))
plot(x = sim100[,1], y = 1:100, 
 xlim = c(min(sim100),max(sim100)), type="n")
segments(y0=1:100,x0=sim100[,1],y1 = 1:100,x1=sim100[,2], 
 col = ifelse(sim100[,1]>1 | sim100[,2]<1,"red","black")) abline(v=1) 

The result is shown at the top. In the code, we set the limits of the x-axis by finding the max and min across the whole matrix-- this is a little wasteful of CPU cycles, but saves some typing. The segments() function (see example 8.42) is a vector-enabled line-drawer. Here we draw a line from the lower CI limit to the upper, giving the experiment number as the x value for each. We assign a red plot line if the CI excludes 1, using the ifelse() function (section 1.11.2), a vectorwise logic test. Finally, a reference line helps the viewer see for far the end of the CI is from the null. We omit prettying up the axis labels.


SAS
In SAS, considerably more lines are required. We begin by simulating the data, as in example 7.2. The modifications are to generate 100 examples with an outside do loop (section 1.11.1) and the random element added to the variance.
data simci;
beta = 0;
intercept = 0;
do sim = 1 to 100; /* outer loop */
 xvar = (uniform(0) *.8) + .6; /* variance != 1 */
 do i = 1 to 1000;
 xtest = normal(0) * xvar;
 linpred = intercept + (xtest * beta);
 prob = exp(linpred)/(1 + exp(linpred));
 ytest = (uniform(0) < prob);
 output;
 end;
end;
run;

Then we fit the logistic regression. We leave in the ods trace commands (section A.7.1) to remind you how to find the SAS names of the output elements, needed to save the results in the ods output statement. The CI for the odds ratios are requested in the clodds statement, which accepts a pl value for a profile likelihood based interval.
*ods trace on/listing; 
ods select none;
ods output cloddswald = lrci;
proc logistic data = simci;
by sim;
model ytest(event='1')=xtest / clodds=wald;
run;
*ods trace off;
ods select all;

Our plotting approach will require the "long" data set style, with two rows for each experiment. We'll generate that while checking whether the null is excluded from the CI.
data lrp2;
set lrci;
red = 0;
if lowercl > 1 or uppercl < 1 then red = 1;
point = lowercl; output;
point = uppercl; output;
run;

Finally, we're ready to make the graphic. We use the hilob interpolation to connect the upper and lower CI for each experiment; the b requests bars instead of a line, and the bwidth option specifies a very narrow bar. These options prevent the default plotting of the "mean" value with a tick. The a*b=c syntax (section 5.2.2) allows the different line colors.
symbol1 i=hilob bwidth=.05 c=black;
symbol2 i=hilob bwidth=.05 c=red;
proc gplot data = lrp2;
plot point * sim = red / vref = 1;
run;quit;

The result is just below. The vertical alignment seen in the R plot seems more natural, but this would not be possible with the hilo interpolation. As theory and logic would suggest, quite a few of the hundred simulated CI exclude the null, sometimes by a large proportion of the CI width.




An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, June 18, 2012

Example 9.35: Discrete randomization and formatted output

A colleague asked for help with randomly choosing a kid within a family. This is for a trial in which families are recruited at well-child visits, but in each family only one of the children having a well-child visit that day can be in the study. The idea is that after recruiting the family, the research assistant needs to choose one child, but if they make that choice themselves, the children are unlikely to be representative. Instead, we'll allow them to make a random decision through an easily used slip that can be put into sealed envelopes. The envisioned process is that the RA will recruit the family, determine the number of eligible children, then open the envelope to find out which child was randomly selected.

One thought here would be to generate separate stacks of envelopes for each given family size, and have the research assistant open an envelope from the appropriate stack. However, this could be logistically challenging, especially since the RAs will spend weeks away from the home office. Instead, we'll include all plausible family sizes on each slip of paper. It seems unlikely that more than 5 children in a family will have well-child visits on the same day.

SAS
We'll use the SAS example to demonstrate using SAS macros to write SAS code, as well as showing a plausible use for SAS formats (section 1.4.12) and making use of proc print.

/* the following macro will write out equal probabilities for selecting
each integer between 1 and the argument, in the format needed for the
rand function. E.g., if the argument is 3,
it will write out
1/3,1/3,1/3
*/

%macro tbls(n);
%do i = 1 %to &n;
1/&n %if &i < &n %then ,
%end;
%mend tbls;

/* then we can use the %tbls macro to create the randomization
via rand("TABLE") (section 1.10.4). */
data kids;
do family = 1 to 10000;
nkids = 2; chosen = rand("TABLE",%tbls(2)); output;
nkids = 3; chosen = rand("TABLE",%tbls(3)); output;
nkids = 4; chosen = rand("TABLE",%tbls(4)); output;
nkids = 5; chosen = rand("TABLE",%tbls(5)); output;
end;
run;

/* check randomization */
proc freq data = kids;
table nkids * chosen / nocol nopercent;
run;

nkids chosen

Frequency|
Row Pct | 1| 2| 3| 4| 5| Total
---------+--------+--------+--------+--------+--------+
2 | 50256 | 49744 | 0 | 0 | 0 | 100000
| 50.26 | 49.74 | 0.00 | 0.00 | 0.00 |
---------+--------+--------+--------+--------+--------+
3 | 33429 | 33292 | 33279 | 0 | 0 | 100000
| 33.43 | 33.29 | 33.28 | 0.00 | 0.00 |
---------+--------+--------+--------+--------+--------+
4 | 25039 | 24839 | 25245 | 24877 | 0 | 100000
| 25.04 | 24.84 | 25.25 | 24.88 | 0.00 |
---------+--------+--------+--------+--------+--------+
5 | 19930 | 20074 | 20188 | 20036 | 19772 | 100000
| 19.93 | 20.07 | 20.19 | 20.04 | 19.77 |
---------+--------+--------+--------+--------+--------+
Total 128654 127949 78712 44913 19772 400000

Looks pretty good. Now we need to make the output usable to the research assistants, by formatting the results into English. We'll use the same format for each number of kids. This saves some keystrokes now, but may possibly cause the RAs some confusion-- it means that we might refer to the "4th oldest" of 4 children, rather than the "youngest". We could fix this using a different format for each number of children, analogous to the R version below.

proc format;
value chosen
1 = "oldest"
2 = '2nd oldest'
3 = '3rd oldest'
4 = '4th oldest'
5 = '5th oldest';
run;

/* now, make a text variable the concatenates (section 1.4.5) the variables
and some explanatory text */
data k2;
set kids;
if nkids eq 2 then
t1 = "If there are " || strip(nkids) ||" children then choose the " ||
strip(put(chosen,chosen.)) || " child.";
else
t1 = " " || strip(nkids) ||" ________________________ " ||
strip(put(chosen,chosen.));
run;

/* then we print. Notice the options to print in plain text, shorten the
page length and width, and remove the date and page number from the SAS output, as
well as in the proc print statement to remove the observation number and
show the line number, with a few other tricks */
options nonumber nodate ps = 60 ls = 68;
OPTIONS FORMCHAR="|----|+|---+=|-/\*";
proc print data = k2 (obs = 3) noobs label sumlabel;
by family;
var t1;
label t1 = '00'x family = "Envelope";
run;

---------------------------- Envelope=1 ----------------------------



If there are 2 children then choose the 2nd oldest child.
3 ________________________ 3rd oldest
4 ________________________ 4th oldest
5 ________________________ 5th oldest


---------------------------- Envelope=2 ----------------------------



If there are 2 children then choose the 2nd oldest child.
3 ________________________ oldest
4 ________________________ oldest
5 ________________________ 3rd oldest


---------------------------- Envelope=3 ----------------------------



If there are 2 children then choose the 2nd oldest child.
3 ________________________ 2nd oldest
4 ________________________ 3rd oldest
5 ________________________ 2nd oldest


R
For R, we leave some trial code in place, to demonstrate how one might discover, test, and build R code in this setting. Most results have been omitted.

sample(5, size = 1)
# choose a (discrete uniform) random integer between 1 and 5

apply(matrix(2:5),1,sample,size=1)
# choose a random integer between 1 and 2, then between 1 and 3, etc.,
# using apply() to repeat the call to sample() with different maximum number
# apply() needs a matrix or array input
# result of this is the raw data needed for one family

replicate(3,apply(matrix(2:5),1,sample,size=1))
# replicate() is in the apply() family and just repeats the
# function n times

[,1] [,2] [,3]
[1,] 2 1 2
[2,] 2 1 2
[3,] 2 2 2
[4,] 3 5 4

Now we have the raw data for the envelopes. Before formatting it for printing, let's check it to make sure it works correctly.

test=replicate(100000, apply(matrix(2:5), 1, sample, size=1))
apply(test, 1, summary)
[,1] [,2] [,3] [,4]
Min. 1.0 1 1.000 1.000
1st Qu. 1.0 1 1.000 2.000
Median 1.0 2 2.000 3.000
Mean 1.5 2 2.492 3.003
3rd Qu. 2.0 3 3.000 4.000
Max. 2.0 3 4.000 5.000
# this is not so helpful-- need the count or percent for each number
# this would be the default if the data were factors, but they aren't
# check to see if we can trick summary() into treating these integers
# as if they were factors
methods(summary)
# yes, there's a summary() method for factors-- let's apply it
# there's also apply(test,1,table) which might be better, if you remember it
apply(test, 1, summary.factor)
[[1]]
1 2
50025 49975

[[2]]
1 2 3
33329 33366 33305

[[3]]
1 2 3 4
25231 25134 24849 24786

[[4]]
1 2 3 4 5
19836 20068 20065 20022 20009
# apply(test,1,table) will give similar results, if you remember it

Well, that's not too pretty, but it's clear that the randomization is working. Now it's time to work on formatting the output.

mylist=replicate(5, apply(matrix(2:5), 1, sample, size=1))
# brief example data set

# We'll need to use some formatted values (section 1.14.12), as in SAS.
# Here, we'll make new value labels for each number of children,
# which will make the output easier to read. We add in an envelope
# number and wrap it all into a data frame.
df = data.frame(envelope = 1:5,
twokids=factor(mylist[1,],1:2,labels=c("youngest","oldest")),
threekids=factor(mylist[2,],1:3,labels=c("youngest", "middle", "oldest")),
fourkids=factor(mylist[3,],1:4,labels=c("youngest", "second youngest",
"second oldest", "oldest")),
fivekids=factor(mylist[4,],1:5,labels=c("youngest", "second youngest",
"middle", "second oldest", "oldest"))
)

# now we need a function to take a row of the data frame and make a single slip
# the paste() function (section 1.4.5) puts together the fixed and variable
# content of each row, while the cat() function will print it without quotes
slip = function(kidvec) {
cat(paste("------------- Envelope", kidvec[1], "------------------"))
cat(paste("\nIf there are", 2:5, " children, select the", kidvec[2:5],"child"))
cat("\n \n \n")
}

# test it on one row
slip(df[1,])

# looks good-- now we can apply() it to each row of the data frame
apply(df, 1, slip)

------------- Envelope 1 ------------------
If there are 2 children, select the youngest child
If there are 3 children, select the youngest child
If there are 4 children, select the second youngest child
If there are 5 children, select the youngest child


------------- Envelope 2 ------------------
If there are 2 children, select the youngest child
If there are 3 children, select the youngest child
If there are 4 children, select the second oldest child
If there are 5 children, select the middle child


------------- Envelope 3 ------------------
If there are 2 children, select the youngest child
If there are 3 children, select the youngest child
If there are 4 children, select the youngest child
If there are 5 children, select the second youngest child

# and so forth

# finally, we can save the result in a file with
# capture.output()
capture.output(apply(df,1,slip), file="testslip.txt")


An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
Subscribe to: Comments (Atom)

AltStyle によって変換されたページ (->オリジナル) /