Showing posts with label proc transpose.
Monday, November 17, 2014
Example 2014.13: Statistics doesn't have to be so hard! Resampling in R and SAS
A recent post pointed us to a great talk that elegantly described how inferences from a trial could be analyzed with a purely resampling-based approach. The talk uses data from a paper that considered the association between beer consumption and mosquito attraction. We recommend the talk linked above for those thinking about creative ways to teach inference.
In this entry, we demonstrate how straightforward it is to replicate these analyses in R, and show how they can be done in SAS.
R
We'll repeat the exercise twice in R: first using the mosaic package that Nick and colleagues have been developing to help teach statistics, and then in base R.
For mosaic, we begin by entering the data and creating a data frame. The do() operator and the shuffle() function facilitate carrying out a permutation test (see also section 5.4.5 of the book).
beer = c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20)
water = c(21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20)
ds = data.frame(y = c(beer, water),
x = c(rep("beer", length(beer)), rep("water", length(water))))
require(mosaic)
obsdiff = compareMean(y ~ x, data=ds)
nulldist = do(999)*compareMean(y ~ shuffle(x), data=ds)
histogram(~ result, xlab="permutation differences", data=nulldist)
ladd(panel.abline(v=obsdiff, col="red", lwd=2))
> obsdiff
[1] -4.377778
> tally(~ abs(result) > abs(obsdiff), format="percent", data=nulldist)
TRUE FALSE
0.1 99.9
The do() operator evaluates the expression on the right hand side a specified number of times. In this case we shuffle (or permute) the group indicators.
We observe a mean difference of 4.4 attractions (comparing the beer to water groups). The histogram of the results-- plotted with the lattice graphics package that mosaic loads by default-- demonstrates that this result would be highly unlikely to occur by chance: if the null hypothesis that the groups were equal were true, results more extreme than this would happen only 1 time out of 1000. This can be displayed using the tally() function, which adds some functionality to table(). We can calculate the p-value by including the observed statistic in both the numerator and the denominator: (1 + 1)/(999 + 1) = .002.
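A minimal sketch of that arithmetic, assuming the nulldist and obsdiff objects created above:
# count permuted differences at least as extreme as the observed one;
# the observed statistic is added to both numerator and denominator
pval = (sum(abs(nulldist$result) >= abs(obsdiff)) + 1) / (999 + 1)
pval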
For those not invested in the mosaic package, base R functions can be used to perform this analysis. We present a version here that begins after making the data vectors.
alldata = c(beer, water)
labels = c(rep("beer", length(beer)), rep("water", length(water)))
obsdiff = mean(alldata[labels=="beer"]) - mean(alldata[labels=="water"])
> obsdiff
[1] 4.377778
The sample() function re-orders the labels, effectively implementing the supposition that the number of bites might have happened under either the water or the beer drinking regimen.
resample_labels = sample(labels)
resample_diff = mean(alldata[resample_labels=="beer"]) -
  mean(alldata[resample_labels=="water"])
resample_diff
[1] 1.033333
In a teaching setting, the preceding code could be re-run several times, to mimic the presentation seen in the video linked above. To repeat many times, the most suitable base R tool is replicate(). To use it, we make a function of the resampling procedure shown above.
resamp_means = function(data, labs){
resample_labels = sample(labs)
resample_diff = mean(data[resample_labels=="beer"]) -
mean(data[resample_labels=="water"])
return(resample_diff)
}
nulldist = replicate(9999,resamp_means(alldata,labels))
hist(nulldist, col="cyan")
abline(v = obsdiff, col = "red")
The histogram is shown above. The p-value is obtained by counting the proportion of statistics (including the actual observed difference) that are greater than or equal to the observed statistic in absolute value:
alldiffs = c(obsdiff, nulldist)
p = sum(abs(alldiffs) >= abs(obsdiff)) / 10000
SAS
The SAS code is relatively cumbersome in comparison. We begin by reading the data in, using the "line hold" double trailing at sign (@@) and the infile datalines statement, which allows us to specify a delimiter (other than a space) when reading data in directly in a data step. This lets us copy the data from the R code. To identify the water and beer regimen subjects, we use the _n_ implied variable that SAS creates but does not save with the data.
The summary procedure generates the mean for each group and saves the results in a data set with a row for each group; the transpose procedure makes a data set with a single row and a variable for each group mean. Finally, we calculate the observed difference and use call symput to make it into a macro variable for later use.
data bites;
if _n_ le 18 then drink="water";
else drink="beer";
infile datalines delimiter=',';
input bites @@;
datalines;
21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20
27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24
28, 24, 29, 21, 21, 18, 27, 20
;
run;
proc summary nway data = bites;
class drink;
var bites;
output out=obsmeans mean=mean;
run;
proc transpose data = obsmeans out=om2;
var mean;
id drink;
run;
data om3;
set om2;
obsdiff = beer-water;
call symput('obsdiff',obsdiff);
run;
proc print data = om3; var obsdiff; run;
Obs obsdiff
1 4.37778
(Yes, we could have done this with proc ttest and ODS. But the spirit of the video is that we don't understand t-tests, so we want to avoid them.)
To rerandomize, we can assign a random number to each row, sort on the random number, and assign drink labels based on the new order of the data.
data rerand;
set bites;
randorder = uniform(0);
run;
proc sort data = rerand; by randorder; run;
data rerand2;
set rerand;
if _n_ le 18 then redrink = "water";
else redrink = "beer";
run;
proc summary nway data = rerand2;
class redrink;
var bites;
output out=rerand3 mean=mean;
run;
proc transpose data = rerand3 out=rerand4;
var mean;
id redrink;
run;
data rrdiff;
set rerand4;
rrdiff = beer-water;
run;
proc print data = rrdiff; var rrdiff; run;

Obs rrdiff
1 -1.73778

One way to repeat this a bunch of times would be to make a macro out of the above and collect the resulting rrdiff values into a data set. Instead, we use the surveyselect procedure to do this much more efficiently. The groups option samples groups of 18 and 25 from the data, while the reps option requests this be done 9,999 times. We can then use the summary and transpose procedures as before, with the addition of the by replicate statement to make a data set with columns for each group mean and a row for each replicate.
proc surveyselect data = bites groups=(18,25) reps = 9999 out = ssresamp;
run;
proc summary nway data = ssresamp;
by replicate;
class groupid;
var bites;
output out=ssresamp2 mean=mean;
run;
proc transpose data = ssresamp2 out=ssresamp3 prefix=group;
by replicate;
var mean;
id groupid;
run;

To get a p-value and make a histogram, we use the macro variable created earlier.
data ssresamp4;
set ssresamp3;
diff = group2 - group1;
exceeds = abs(diff) ge &obsdiff;
run;
proc means data = ssresamp4 sum;
var exceeds;
run;

The MEANS Procedure
Analysis Variable : exceeds
Sum
9.0000000

proc sgplot data = ssresamp4;
histogram diff;
refline &obsdiff / axis=x lineattrs=(thickness=2 color=red);
run;

The p-value is 0.001 (= (9+1)/10000).
An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. With exceptions noted above, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
Tuesday, October 21, 2014
Example 2014.12: Changing repeated measures data from wide to narrow format
Data with repeated measures often come to us in the "wide" format, as shown below for the HELP data set we use in our book. Here we show just an ID, the CESD depression measure from four follow-up assessments, plus the baseline CESD.
Obs  ID  CESD1  CESD2  CESD3  CESD4  CESD
  1   1      7      .      8      5    49
  2   2     11      .      .      .    30
  3   3     14      .      .     49    39
  4   4     44      .      .     20    15
  5   5     26     27     15     28    39
...
Frequently for data analysis we need to convert the data to the "long" format, with a single column for the repeated time-varying CESD measures and column indicating the time of measurement. This is needed, for example, in SAS proc mixed or in the lme4 package in R. The data should look something like this:
Obs  ID  time  CESD  cesd_tv
  1   1     1    49        7
  2   1     2    49        .
  3   1     3    49        8
  4   1     4    49        5
...
In section 2.3.7 (2nd Edition) we discuss this problem, and we provide an example in section 7.10.9. Today we're adding a blog post to demonstrate some handy features in SAS and how the problem can be approached using plain R and, alternatively, using the new-ish R packages dplyr and tidyr, contributed by Hadley Wickham.
R
We'll begin by making a narrower data frame with just the columns noted above. We use the select() function from the dplyr package to do this; the syntax is simply to provide the name of the input data frame as the first argument and then the names of the columns to be included in the output data frame. We use this function instead of the similar base R function subset(..., select=) because of dplyr's useful starts_with() function. This operates on the column names as character vectors in a hopefully obvious way.
load("c:/book/savedfile")
library(dplyr)
wide = select(ds, id, starts_with("cesd"))
Now we'll convert to the long format. The standard R approach is to use the reshape() function. The documentation for this is a bit of a slog, and the function can generate error messages that are not so helpful. But for simple problems like this one, it works well.
long = reshape(wide, varying = c("cesd1", "cesd2", "cesd3", "cesd4"),
v.names = "cesd_tv",
idvar = c("id", "cesd"), direction="long")
long[long$id == 1,]
id cesd time cesd_tv
1.49.1 1 49 1 7
1.49.2 1 49 2 NA
1.49.3 1 49 3 8
1.49.4 1 49 4 5
In the preceding, the varying parameter is a list of columns which vary over time, while the idvar columns appear at each time. The v.names parameter is the name of the column which will hold the values of the varying variables.
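The same call can also be written with the time values spelled out explicitly. This is just a sketch of the call above with the defaults made visible (long2 is an illustrative name), not a different approach:

# equivalent call; timevar names the new column and times supplies its values
long2 = reshape(wide, varying = c("cesd1", "cesd2", "cesd3", "cesd4"),
                v.names = "cesd_tv", timevar = "time", times = 1:4,
                idvar = c("id", "cesd"), direction = "long")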
Another option would be to use base R knowledge to separate, rename, and then recombine the data as follows. The main hassle here is renaming the columns in each separate data frame so that they can be combined later.
c1 = subset(wide, select = c(id, cesd, cesd1))
c1$time = 1
names(c1)[3] = "cesd_tv"
c2 = subset(wide, select = c(id, cesd, cesd2))
c2$time = 2
names(c2)[3] = "cesd_tv"
c3 = subset(wide, select = c(id, cesd, cesd3))
c3$time = 3
names(c3)[3] = "cesd_tv"
c4 = subset(wide, select = c(id, cesd, cesd4))
c4$time = 4
names(c4)[3] = "cesd_tv"
long = rbind(c1, c2, c3, c4)
long[long$id == 1,]

     id cesd cesd_tv time
1     1   49       7    1
454   1   49      NA    2
907   1   49       8    3
1360  1   49       5    4
This is cumbersome, but effective.
More interesting is to use the tools provided by dplyr and tidyr.
library(tidyr)
gather(wide, key = names, value = cesd_tv, cesd1, cesd2, cesd3, cesd4) %>%
  mutate(time = as.numeric(substr(names, 5, 5))) %>%
  arrange(id, time) -> long
head(long)

  id cesd names cesd_tv time
1  1   49 cesd1       7    1
2  1   49 cesd2      NA    2
3  1   49 cesd3       8    3
4  1   49 cesd4       5    4
5  2   30 cesd1      11    1
6  2   30 cesd2      NA    2
The gather() function takes a data frame (the first argument) and returns new columns named in the key and value parameters. The contents of the columns are the names (in the key) and the values (in the value) of the former columns listed. The result is a new data frame with a row for each listed column, for every row in the original data frame. Any columns not named are repeated in the output data frame. The mutate() function is like the base R function transform() but has some additional features and may be faster in some settings. Finally, the arrange() function is a much more convenient sorting facility than is available in standard R. The input is a data frame and a list of columns to sort by, and the output is a sorted data frame. This saves us having to select out a subject to display.
The %>% operator is a "pipe" or "chain" operator that may be familiar if you're a *nix user. It feeds the result of the last function into the next function as the first argument. This can cut down on the use of nested parentheses and may make reading R code easier for some folks. The effect of the piping is that the mutate() function should be read as taking the result of the gather() as its input data frame, and sending its output data frame into the arrange() function. For Ken, the right assignment arrow (-> long) makes sense as a way to finish off this set of piping rules, but Nick and many R users would prefer to write this as long = gather... or long <- gather.. , etc.
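For those who prefer the conventional assignment, here is the same pipeline written with long on the left-hand side (a sketch of the code above, assuming the wide data frame and the packages already loaded):

long = gather(wide, key = names, value = cesd_tv, cesd1, cesd2, cesd3, cesd4) %>%
  mutate(time = as.numeric(substr(names, 5, 5))) %>%
  arrange(id, time)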
SAS
In SAS, we'll make the narrow data set using the keep statement in the data step, demonstrating along the way the convenient colon operator, which performs the same function provided by starts_with() in dplyr.
data all;
set "c:/book/help.sas7bdat";
run;
data wide;
set all;
keep id cesd:;
run;
The simpler way to make the desired data set is with the transpose procedure. Here the by statement forces the variables listed in that statement not to be transposed. The notsorted option saves us from having to actually sort the data. Otherwise the procedure works like gather(): each transposed variable becomes a row in the output data set for every observation in the input data set. SAS uses standard variable names for gather()'s key (SAS: _NAME_) and value (SAS: COL1), though these can be changed.
proc transpose data = wide out = long_a;
by notsorted id notsorted cesd;
run;
data long;
set long_a;
time = substr(_name_, 5);
rename col1 = cesd_tv;
run;
proc print data = long;
where id eq 1;
var id time cesd cesd_tv;
run;

Obs  ID  time  CESD  cesd_tv
  1   1     1    49        7
  2   1     2    49        .
  3   1     3    49        8
  4   1     4    49        5
As with R, it's trivial, though somewhat cumbersome, to generate this effect using basic coding.
data long;
set wide;
time = 1; cesd_tv = cesd1; output;
time = 2; cesd_tv = cesd2; output;
time = 3; cesd_tv = cesd3; output;
time = 4; cesd_tv = cesd4; output;
run;
proc print data = long;
where id eq 1;
var id time cesd cesd_tv;
run;

Obs  ID  time  CESD  cesd_tv
  1   1     1    49        7
  2   1     2    49        .
  3   1     3    49        8
  4   1     4    49        5
An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
Labels:
arrange(),
dplyr,
gather(),
Hadley Wickham,
proc transpose,
reshape(),
tidyr,
wide to narrow
Tuesday, January 14, 2014
Example 2014.1: "Power" for a binomial probability, plus: News!
Hello, folks! I'm pleased to report that Nick and I have turned in the manuscript for the second edition of SAS and R: Data Management, Statistical Analysis, and Graphics. It should be available this summer. New material includes some of our more popular blog posts, plus reproducible analysis, RStudio, and more.
To celebrate, here's a new example. Parenthetically, I was fortunate to be able to present my course: R Boot Camp for SAS users at Boston University last week. One attendee cornered me after the course. She said: "Ken, R looks great, but you use SAS for all your real work, don't you?" Today's example might help a SAS diehard to see why it might be helpful to know R.
OK, the example: A colleague contacted me with a typical "5-minute" question. She needed to write a convincing power calculation for the sensitivity-- the probability that a test returns a positive result when the disease is present, for a fixed number of cases with the disease. I don't know how well this has been explored in the peer-reviewed literature, but I suggested the following process:
1. Guess at the true underlying sensitivity
2. Name a lower bound (less than the truth) which we would like the observed CI to exclude
3. Use basic probability results to report the probability of exclusion, marginally across the unknown number of observed positive tests.
This is not actually a power calculation, of course, but it provides some information about the kinds of statements that it's likely to be possible to make.
R
In R, this is almost trivial. We can get the probability of observing x positive tests simply, using the dbinom() function applied to a vector of numerators and the fixed denominator. Finding the confidence limits is a little trickier. Well, finding them is easy, using lapply() on binom.test(), but extracting them requires using sapply() on the results from lapply(). Then it's trivial to generate a logical vector indicating whether the value we want to exclude is in the CI or not, and the sum of the probabilities for the outcomes whose CI includes this value is our desired result.
> truesense = .9
> exclude = .6
> npos = 20
> probobs = dbinom(0:npos, npos, truesense)
> cis = t(sapply(lapply(0:npos, binom.test, n=npos), function(bt) return(bt$conf.int)))
> included = cis[,1] < exclude & cis[,2] > exclude
> myprob = sum(probobs*included)
> myprob
[1] 0.1329533

(Note that I calculated the inclusion probability, not the exclusion probability.)
Of course, the real beauty and power of R is how simple it is to turn this into a function:
> probinc = function(truesense, exclude, npos) {
probobs = dbinom(0:npos,npos,truesense)
cis = t(sapply(lapply(0:npos,binom.test, n=npos),
function(bt) return(bt$conf.int)))
included = cis[,1] < exclude & cis[,2] > exclude
return(sum(probobs*included))
}
> probinc(.9,.6,20)
[1] 0.1329533
SAS
My SAS process took about 4 times as long to write.
I begin by making a data set with a variable recording both the number of events (positive tests) and non-events (false negatives) for each possible value. These serve as weights in the proc freq I use to generate the confidence limits.
%let truesense = .9;
%let exclude = .6;
%let npos = 20;

data rej;
do i = 1 to &npos;
  w = i; event = 1; output;
  w = &npos - i; event = 0; output;
end;
run;

ods output binomialprop = rej2;
proc freq data = rej;
by i;
tables event / binomial(level='1');
weight w;
run;

Note that I repeat the proc freq for each number of events using the by statement. After saving the results with the ODS system, I have to use proc transpose to make a table with one row for each number of positive tests-- before this, every statistic in the output has its own row.
proc transpose data = rej2 out = rej3;
where name1 eq "XL_BIN" or name1 eq "XU_BIN";
by i;
id name1;
var nvalue1;
run;

In my fourth data set, I can find the probability of observing each number of events and multiply this by my logical test of whether the CI included my target value or not. But here there is another twist. The proc freq approach won't generate a CI for both the situation where there are 0 positive tests and the setting where all are positive in the same run. My solution was to omit the case with 0 positives from my do loop above, but now I need to account for that possibility. Here I use the end= option on the set statement to figure out when I've reached the case where all tests are positive (sensitivity = 1). Then I can use symmetry to find the confidence limits for the case with 0 events. Then I'm finally ready to sum up the probabilities associated with the numbers of positive tests where the CI includes the target value.
data rej4;
set rej3 end = eof;
prob = pdf('BINOMIAL',i,&truesense,&npos);
prob_include = prob * ((xl_bin < &exclude) and (xu_bin > &exclude));
output;
if eof then do;
prob = pdf('BINOMIAL',0,&truesense,&npos);
prob_include = prob * (((1 - xu_bin) < &exclude) and ((1 - xl_bin) > &exclude));
output;
end;
run;
proc means data = rej4 sum;
var prob_include;
run;
Elegance is a subjective thing, I suppose, but to my eye, the R solution is simple and graceful, while the SAS solution is rather awkward. And I didn't even make a macro out of it yet!
An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
Labels:
binomial probability,
end =,
exact CI,
lapply(),
proc transpose,
sapply(),
sensitivity
Monday, May 21, 2012
Example 9.32: Multiple testing simulation
In examples 9.30 and 9.31 we explored corrections for multiple testing and then extracting p-values adjusted by the Benjamini and Hochberg (or FDR) procedure. In this post we'll develop a simulation to explore the impact of "strong" and "weak" control of the family-wise error rate offered in multiple comparison corrections. Loosely put, weak control procedures may fail when some of the null hypotheses are actually false, in that the remaining (true) nulls may be rejected more than the nominal proportion of times.

For our simulation, we'll develop flexible code to generate some p-values from false nulls and others from true nulls. We'll assume that the true nulls have p-values distributed uniform(0,1); the false nulls will have p-values distributed uniform with a user-determined maximum. We'll also allow the number of tests overall and the number of false nulls to be set.
SAS
In SAS, a macro does the job. It accepts the user parameters described above, then generates false and true nulls for each desired simulation. With the data created, we can use proc multtest to apply the FDR procedure, with the ODS system saving the results. Note how the by statement allows us to replicate the analysis for each simulated set of p-values without creating a separate data set for each one. (Also note that we do not use proc sort before that by statement-- this can be risky, but works fine here.)
%macro fdr(nsims=1, ntests = 20, nfalse=10, howfalse=.01);
ods select none;
data test;
do sim = 1 to &nsims;
do i = 1 to &ntests;
raw_p = uniform(0) *
( ((i le &nfalse) * &howfalse ) + ((i gt &nfalse) * 1 ) );
output;
end;
end;
run;
ods output pvalues = __pv;
proc multtest inpvalues=test fdr;
by sim;
run;
With the results in hand, (still within the macro) we need to do some massaging to make the results usable. First we'll recode the rejections (assuming a 0.05 alpha level) so that non-rejections are 0 and rejections are 1/number of tests. That way we can just sum across the results to get the proportion of rejections. Next, we transform the data to get each simulation in a row (section 1.5.4). (The data output from proc multtest has nsims*ntests rows. After transposing, there are nsims rows.) Finally, we can sum across the rows to get the proportion of tests rejected in each simulated family of tests. The results are shown in a table made with proc freq.
data __pv1;
set __pv;
if falsediscoveryrate lt 0.05 then fdrprop = 1/&ntests;
else fdrprop =0;
run;
proc transpose data = __pv1 (keep =sim fdrprop) out = pvals_a;
by sim; run;
data pvals;
set pvals_a;
prop = sum(of col1 - col&ntests);
run;
ods select all;
proc freq data = pvals; tables prop; run;
%mend fdr;
%fdr(nsims = 1000, ntests = 20, nfalse = 10, howfalse=.001);
Cumulative Cumulative
prop Frequency Percent Frequency Percent
---------------------------------------------------------
0.5 758 75.80 758 75.80
0.55 210 21.00 968 96.80
0.6 27 2.70 995 99.50
0.65 5 0.50 1000 100.00
So true nulls were rejected 24% of the time, which seems like a lot. Multiple comparison procedures with "strong" control of the familywise error rate will reject them only 5% of the time. Building this simulation as a macro facilitates exploring the effects of the multiple comparison procedures in a variety of settings.
R
As in example 9.31, the R code is rather simpler, though perhaps a bit opaque. To make the p-values, we generate them first for all of the tests with false nulls, then for all of the tests with true nulls. The matrix() function reads these in by column, by default, meaning that the first nfalse columns get the nsims*nfalse observations. The apply() function generates the FDR p-values for each row of the data set. The t() function just transposes the resulting matrix so that we get back a row for each simulation. As in the SAS version, we'll count each rejection as 1/ntests, and non-rejections as 0; we do this with the ifelse() function. Then we sum across the tests within each simulation with another call to apply() and show the results with a simple table.
checkfdr = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
raw_p = matrix(c(runif(nfalse * nsims) * howfalse,
runif((ntests-nfalse) * nsims)), nrow=nsims)
fdr = t(apply(raw_p, 1, p.adjust, "fdr"))
reject = ifelse(fdr<.05, 1/ntests,0)
prop = apply(reject, 1, sum)
prop.table(table(prop))
}
> checkfdr(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6 0.65
0.755 0.210 0.032 0.003
The results are reassuringly similar to those from SAS. In this R code, it's particularly simple to try a different test-- just replace "fdr" in the p.adjust() call. Here's the result with the Hochberg test, which has strong control.
checkhoch = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
pvals = matrix(c(runif(nfalse * nsims) * howfalse,
runif((ntests-nfalse) * nsims)), nrow=nsims)
hochberg = t(apply(pvals, 1, p.adjust,"hochberg"))
reject = ifelse(hochberg<.05,1/ntests,0)
prop = apply(reject, 1, sum)
prop.table(table(prop))
}
> checkhoch(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6
0.951 0.046 0.003
With this procedure one or more of the true nulls is rejected an appropriate 4.9% of the time. For the most part, we feel more comfortable using multiple testing procedures with "strong control".
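As a sketch of that arithmetic, the share of simulated families with at least one rejected true null is the probability mass above 0.5 in the table returned by checkhoch(); a fresh call re-simulates, so the exact value will wobble around 0.05:

# families rejecting only the 10 false nulls have prop = 0.5;
# anything larger means at least one true null was also rejected
tab = checkhoch(nsims = 1000, ntests = 20, nfalse = 10, howfalse = .001)
sum(tab[as.numeric(names(tab)) > 0.5])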
An unrelated note about aggregators
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.