
Monday, May 21, 2012

Example 9.32: Multiple testing simulation

In examples 9.30 and 9.31 we explored corrections for multiple testing and then extracted p-values adjusted by the Benjamini and Hochberg (or FDR) procedure. In this post we'll develop a simulation to explore the impact of "strong" and "weak" control of the family-wise error rate offered by multiple comparison corrections. Loosely put, weak control procedures may fail when some of the null hypotheses are actually false, in that the remaining (true) nulls may be rejected more often than the nominal rate.

For our simulation, we'll develop flexible code to generate some p-values from false nulls and others from true nulls. We'll assume that the true nulls have p-values distributed uniform(0,1); the false nulls will have p-values distributed uniform between 0 and a user-determined maximum. We'll also allow the overall number of tests and the number of false nulls to be set.
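As a quick preview of that scheme, here is a minimal R sketch for a single family of tests (the object names are ours; the full SAS and R implementations follow below):

# one simulated family of tests: nfalse false nulls, the rest true nulls
ntests = 20; nfalse = 10; howfalse = 0.001
raw_p = c(runif(nfalse, 0, howfalse),       # false nulls: uniform(0, howfalse)
          runif(ntests - nfalse, 0, 1))     # true nulls: uniform(0, 1)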

SAS
In SAS, a macro does the job. It accepts the user parameters described above, then generates p-values from false and true nulls for each desired simulation. With the data created, we can use proc multtest to apply the FDR procedure, with the ODS system saving the results. Note how the by statement allows us to replicate the analysis for each simulated set of p-values without creating a separate data set for each one. (Also note that we do not use proc sort before the by statement-- this can be risky, but works fine here.)

%macro fdr(nsims=1, ntests=20, nfalse=10, howfalse=.01);
ods select none;
data test;
  do sim = 1 to &nsims;
    do i = 1 to &ntests;
      raw_p = uniform(0) *
        ( ((i le &nfalse) * &howfalse) + ((i gt &nfalse) * 1) );
      output;
    end;
  end;
run;

ods output pvalues = __pv;
proc multtest inpvalues=test fdr;
  by sim;
run;

With the results in hand, (still within the macro) we need to do some massaging to make the results usable. First we'll recode the rejections (assuming a 0.05 alpha level) so that non-rejections are 0 and rejections are 1/number of tests. That way we can just sum across the results to get the proportion of rejections. Next, we transform the data to get each simulation in a row (section 1.5.4). (The data output from proc multtest has nsims*ntests rows. After transposing, there are nsims rows.) Finally, we can sum across the rows to get the proportion of tests rejected in each simulated family of tests. The results are shown in a table made with proc freq.

data __pv1;
  set __pv;
  if falsediscoveryrate lt 0.05 then fdrprop = 1/&ntests;
  else fdrprop = 0;
run;

proc transpose data = __pv1 (keep = sim fdrprop) out = pvals_a;
  by sim;
run;

data pvals;
  set pvals_a;
  prop = sum(of col1-col&ntests);
run;
ods select all;

proc freq data = pvals; tables prop; run;
%mend fdr;

%fdr(nsims = 1000, ntests = 20, nfalse = 10, howfalse=.001);

                                Cumulative    Cumulative
   prop    Frequency   Percent   Frequency       Percent
   ------------------------------------------------------
    0.5          758     75.80         758         75.80
   0.55          210     21.00         968         96.80
    0.6           27      2.70         995         99.50
   0.65            5      0.50        1000        100.00

So one or more true nulls were rejected in about 24% of the simulated families (100% - 75.8% = 24.2%), which seems like a lot. Multiple comparison procedures with "strong" control of the familywise error rate will reject one or more true nulls in at most 5% of families. Building this simulation as a macro facilitates exploring the effects of the multiple comparison procedures in a variety of settings.

R
As in example 9.31, the R code is rather simpler, though perhaps a bit opaque. To make the p-values, we generate them first for all of the tests with false nulls, then for all of the tests with true nulls. The matrix() function reads these in by column, by default, meaning that the first nfalse columns get the nsims*nfalse observations from the false nulls. The apply() function generates the FDR p-values for each row of the data set. The t() function just transposes the resulting matrix so that we get back a row for each simulation. As in the SAS version, we'll count each rejection as 1/ntests and each non-rejection as 0; we do this with the ifelse() function. Then we sum across the tests within each simulation with another call to apply() and show the results with a simple table.

checkfdr = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
  raw_p = matrix(c(runif(nfalse * nsims) * howfalse,
                   runif((ntests - nfalse) * nsims)), nrow=nsims)
  fdr = t(apply(raw_p, 1, p.adjust, "fdr"))
  reject = ifelse(fdr < .05, 1/ntests, 0)
  prop = apply(reject, 1, sum)
  prop.table(table(prop))
}

> checkfdr(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6 0.65
0.755 0.210 0.032 0.003

The results are reassuringly similar to those from SAS. In this R code, it's particularly simple to try a different procedure-- just replace "fdr" in the p.adjust() call. Here's the result with the Hochberg procedure, which has strong control.

checkhoch = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
  pvals = matrix(c(runif(nfalse * nsims) * howfalse,
                   runif((ntests - nfalse) * nsims)), nrow=nsims)
  hochberg = t(apply(pvals, 1, p.adjust, "hochberg"))
  reject = ifelse(hochberg < .05, 1/ntests, 0)
  prop = apply(reject, 1, sum)
  prop.table(table(prop))
}

> checkhoch(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6
0.951 0.046 0.003

With this procedure, one or more of the true nulls is rejected an appropriate 4.9% of the time (0.046 + 0.003). For the most part, we feel more comfortable using multiple testing procedures with "strong control".
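Since checkfdr() and checkhoch() differ only in the string passed to p.adjust(), one natural refactoring (ours, not from the original code) is to pass the adjustment method as an argument:

# sketch: the same simulation, with the p.adjust() method as a parameter
checkmethod = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001,
                       method="fdr") {
  raw_p = matrix(c(runif(nfalse * nsims) * howfalse,
                   runif((ntests - nfalse) * nsims)), nrow=nsims)
  adj = t(apply(raw_p, 1, p.adjust, method))
  reject = ifelse(adj < .05, 1/ntests, 0)
  prop = apply(reject, 1, sum)
  prop.table(table(prop))
}

Any method accepted by p.adjust() (e.g., "holm", "bonferroni") can then be explored with a single call, for example checkmethod(nsims=1000, ntests=20, nfalse=10, howfalse=.001, method="holm").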


An unrelated note about aggregators
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Monday, May 14, 2012

Example 9.31: Exploring multiple testing procedures

In example 9.30 we explored the effects of adjusting for multiple testing using the Bonferroni and Benjamini-Hochberg (or false discovery rate, FDR) procedures. At the time we claimed that it would probably be inappropriate to extract the FDR-adjusted p-values from their context. In this entry we attempt to explain our misgivings about this practice.

The FDR procedure is described in Benjamini and Hochberg (JRSSB, 1995) as a "step-down" procedure. Put simply, the procedure has the following steps:

0. Choose the familywise alpha
1. Rank order the unadjusted p-values
2. Beginning with the Mth (largest) of the ordered p-values p(m),
2a. if p(m) < alpha*(m/M), then reject all tests 1 ... m
2b. if not, set m = m-1
3. Repeat steps 2a and 2b until the condition is met
   or p(1) > alpha/M

where M is the number of tests. The "adjusted p-value" based on this procedure is the smallest familywise alpha under which the current test would have been rejected. To calculate this, we can modify the routine above:

1. Rank order the unadjusted p-values
2. For the ordered p-values p(m), m = M down to 1,
2a. candidate ap(m) = p(m) * (M/m)
2b. if candidate ap(m) > ap(m+1), then ap(m) = ap(m+1)
2c. else ap(m) = candidate ap(m)
(for m = M there is no ap(M+1), so ap(M) = candidate ap(M) = p(M))

where ap(m) refers to the adjusted p-value corresponding to the mth ordered unadjusted p-value. It's interesting to note that the adjusted p-value for the Mth ordered test is the same as the unadjusted p-value, while the candidate adjusted p-value for the smallest test is the Bonferroni adjusted p-value. The primary difficulty with taking these p-values (as opposed to the test results) out of context is captured in steps 2b and 2c. They imply that the p-value for a given test may be lowered by other observed p-values in the family of tests. It's also true that the adjusted p-value depends on the number of tests included in the family, but this seems somewhat less troubling.
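To make steps 2a through 2c concrete, here is a short R sketch (our own, not from the original post; the function name is ours) that applies them directly and checks the result against p.adjust():

# direct implementation of the adjusted p-value steps above
fdrbyhand = function(p) {
  M = length(p)
  o = order(p)                      # step 1: rank order the unadjusted p-values
  candidate = p[o] * M/(1:M)        # step 2a: candidate ap(m) = p(m) * (M/m)
  ap = rev(cummin(rev(candidate)))  # steps 2b/2c: ap(m) = min(candidate ap(m), ap(m+1))
  ap[order(o)]                      # return in the original (unsorted) order
}

p = c(rep(.001, 5), .03, .035, .04, .05, .05)
all.equal(fdrbyhand(p), p.adjust(p, "fdr"))   # TRUE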

To examine the impact of the procedure on the adjusted p-values for the individual tests, we'll compare the candidate ap(m) from step 2a against the actual ap(m). Our sense is that to the degree these are different, the adjusted p-value should not be extracted from the context of the observed family of tests.


SAS
Our SAS code relies heavily on the array statement (section 1.11.5). We loop through the p-values from largest to smallest, calculating the candidate fdr p-value as above, before arriving at the final adjusted p-value. To compare the values conveniently, we make a new data set with two copies of the original data set, renaming first the candidate and then the adjusted p-values to have the same names. The in = data set option creates a temporary variable which identifies which data set an observation was read from; here it denotes which version of the same data set (and which set of p-values) was used.

data fdr;
  array pvals [10] pval1 - pval10
    (.001 .001 .001 .001 .001 .03 .035 .04 .05 .05);
  array cfdrpvals [10] cfdr1 - cfdr10;
  array fdrpvals [10] fdr1 - fdr10;
  fdrpvals[10] = pvals[10];
  do i = 9 to 1 by -1;
    cfdrpvals[i] = pvals[i] * 10/i;
    if cfdrpvals[i] > fdrpvals[i+1] then fdrpvals[i] = fdrpvals[i+1];
    else fdrpvals[i] = cfdrpvals[i];
  end;
run;

data compare;
  set fdr (in = cfdr rename=(cfdr1=c1 cfdr2=c2 cfdr3=c3 cfdr4=c4
             cfdr5=c5 cfdr6=c6 cfdr7=c7 cfdr8=c8 cfdr9=c9))
      fdr (in = fdr rename=(fdr1=c1 fdr2=c2 fdr3=c3 fdr4=c4 fdr5=c5
             fdr6=c6 fdr7=c7 fdr8=c8 fdr9=c9));
  if cfdr then adjustment = "Candidate fdr";
  if fdr then adjustment = "Final fdr";
run;

proc print data = compare; var adjustment c1-c9; run;

adjustment        c1     c2      c3      c4     c5    c6    c7    c8     c9

Candidate fdr  0.010   .005   .0033   .0025   .002   .05   .05   .05   .055
Final fdr      0.002   .002   .0020   .0020   .002   .05   .05   .05   .050

(We omit the last p-value because the adjustment does not affect it.) The result shows that for many of the tests in this family, the final FDR-adjusted p-value is substantially smaller than the candidate value. To this degree, the FDR p-value depends on the observed p-values of the other tests in the family, and ought not to be removed from the context of those other tests. We would recommend caution in displaying the FDR p-values in such settings, given readers' propensity to use them as if they were ordinary p-values, safely adjusted for multiple testing.

R
Comparison of the R and SAS code may make SAS programmers weep. The candidate values are easily calculated, and can be presented with the final p-values in one step using the p.adjust() function. Three lines of code, albeit incorporating multiple functions in each line. (And it could sensibly be done in two, calculating the candidate p-values within the rbind() function call.) Note especially the line calculating the candidate p-values, in which vectorization allows a for loop to be avoided in a very natural fashion.

fakeps = c(rep(.2, 5), 6, 7, 8, 10, 10)/200
cfdr = fakeps * 10/(1:10)
rbind(cfdr, fdr=p.adjust(fakeps, "fdr"))[,1:9]

       [,1]  [,2]   [,3]   [,4]  [,5] [,6] [,7] [,8]   [,9]
cfdr  0.010 0.005 0.0033 0.0025 0.002 0.05 0.05 0.05 0.0556
fdr   0.002 0.002 0.0020 0.0020 0.002 0.05 0.05 0.05 0.0500
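To see this dependence directly, we can change a single raw p-value and watch the adjusted value for a different test move (a small demonstration of our own):

p1 = c(rep(.001, 5), .03, .035, .04, .05, .05)
p2 = c(rep(.001, 5), .03, .035, .04, .05, .50)   # only the 10th raw p-value differs
p.adjust(p1, "fdr")[9]   # 0.05: borrows the smaller adjusted value of test 10
p.adjust(p2, "fdr")[9]   # 0.0556 (= .05 * 10/9): nothing smaller later in the ordering

Changing only the largest p-value in the family raises the adjusted p-value of the ninth test from 0.05 to about 0.056.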

