The generic version of what I am trying to do is a simulation study in which I manipulate a few variables and see how that impacts a result. I'm having some speed issues with R. The latest simulation worked fine with a few iterations (10 per experiment). However, when I moved to the large-scale version (10k per experiment), the simulation became far too slow (it has been running for 14 hours and is still going).
Below is the code (with comments) that I am running. Being a rookie with R, I am struggling to optimize the simulation. My hope is to learn from the comments and suggestions provided here, optimize this code, and reuse it in future simulation studies.
Let me say a few things about what this code is supposed to do. I am manipulating two variables: effect size and sample size. Each combination is run 10k times (i.e., 10k experiments per condition). I initialize a data frame to store my results (called Results), and I loop over three variables: effect size, sample size, and iterations (10k).
Within the loops, I initialize four NULL vectors: p.test, p.rep, d.test, and d.rep. The former two capture the p-value of the initial t-test and the p-value of the replication (i.e., the experiment repeated under similar conditions). The latter two store the corresponding effect sizes (Cohen's d).
I generate my random data from a standard normal for the control condition (DVcontrol), and I use my effect size as the mean for the experimental condition (DVexperiment). I take the difference between the values and feed the result into R's t-test function (as a paired-samples t-test). After the iterations for a condition are done, I store the summarized results in a list called Trial and rbind it to the Results data frame. This is repeated for every effect-size and sample-size combination until completion.
# Set Simulation Parameters
## Effect Sizes (ES is equal to mean difference when SD equals Variance equals 1)
effect_size_range <- seq(0, 2, .1)  ## ES
## Sample Sizes
sample_size_range <- seq(10, 1000, 10)  ## SS
## Iterations for each ES-SS Combination
iter <- 10000

# Initialize the Vector of Results
Results <- data.frame()

# Set Random Seed
set.seed(12)

# Loop over the Different ESs
for (ES in effect_size_range) {
  # Loop over the Different Sample Sizes
  for (SS in sample_size_range) {
    # Create p-value Vectors
    p.test <- NULL
    p.rep <- NULL
    d.test <- NULL
    d.rep <- NULL
    # Loop over the iterations
    for (i in 1:iter) {
      # Generate Test Data
      DVcontrol <- rnorm(SS, mean=0, sd=1)
      DVexperiment <- rnorm(SS, mean=ES, sd=1)
      DVdiff <- DVexperiment - DVcontrol
      p.test[i] <- t.test(DVdiff, alternative="greater")$p.value
      d.test[i] <- mean(DVdiff) / sd(DVdiff)
      # Generate Replication Data
      DVcontrol <- rnorm(iter, mean=0, sd=1)
      DVexperiment <- rnorm(iter, mean=ES, sd=1)
      DVdiff <- DVexperiment - DVcontrol
      p.rep[i] <- t.test(DVdiff, alternative="greater")$p.value
      d.rep[i] <- mean(DVdiff) / sd(DVdiff)
    }
    # Results
    Trial <- list(ES=ES, SS=SS,
                  d.test=mean(d.test), d.rep=mean(d.rep),
                  p.test=mean(p.test), p.rep=mean(p.rep),
                  r=cor(p.test, p.rep, method="kendall"),
                  r.log=cor(log2(p.test)*(-1), log2(p.rep)*(-1), method="kendall"))
    Results <- rbind(Results, Trial)
  }
}
1 Answer
Possible Bug
If I understand what you are trying to do correctly, the lines...
DVcontrol <- rnorm(iter, mean=0, sd=1)
DVexperiment <- rnorm(iter, mean=ES, sd=1)
...should probably be
DVcontrol <- rnorm(SS, mean=0, sd=1)
DVexperiment <- rnorm(SS, mean=ES, sd=1)
This was noted in the comments by @flodel.
Identifying slow steps
The best way to figure out which lines of the code are slowest is to use a line-profiling tool; I like profvis. In general, the slow steps in R are element-wise assignment to vectors/matrices/data.frames and unvectorized functions. In your case,

p.rep[i] <- t.test(DVdiff, alternative="greater")$p.value
d.rep[i] <- mean(DVdiff) / sd(DVdiff)

are likely to be slow, and the t.test() function in R is unvectorized, so it is also a bottleneck.
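For example, a minimal sketch of profiling a scaled-down version of one ES-SS condition with profvis might look like the following (the sample size, effect size, and iteration count here are purely illustrative):

library(profvis)

# Open an interactive profile showing which lines dominate the run time
profvis({
  p.test <- NULL
  d.test <- NULL
  for (i in 1:2000) {
    DVdiff <- rnorm(100, mean = 0.5) - rnorm(100, mean = 0)
    p.test[i] <- t.test(DVdiff, alternative = "greater")$p.value
    d.test[i] <- mean(DVdiff) / sd(DVdiff)
  }
})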
Possible solutions
Vectorization of assignments
I'm a big fan of the tidyverse system of packages. For this case, the tidyr, purrr, and dplyr packages allow a very nice functional implementation of your problem. However, they won't automatically vectorize functions like t.test().
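Independently of the tidyverse route, the element-wise growth noted above can also be avoided in base R by preallocating the per-iteration vectors and binding the per-condition summaries once at the end. A rough sketch of that idea, with a deliberately reduced grid and iteration count and the unvectorized t.test() calls left in place:

# Reduced grid and iteration count, purely for illustration
effect_size_range <- c(0, 0.5)
sample_size_range <- c(10, 50)
iter <- 100

trials <- vector("list", length(effect_size_range) * length(sample_size_range))
k <- 0
for (ES in effect_size_range) {
  for (SS in sample_size_range) {
    p.test <- numeric(iter)   # preallocated instead of NULL
    d.test <- numeric(iter)
    for (i in 1:iter) {
      DVdiff <- rnorm(SS, mean = ES) - rnorm(SS)
      p.test[i] <- t.test(DVdiff, alternative = "greater")$p.value
      d.test[i] <- mean(DVdiff) / sd(DVdiff)
    }
    k <- k + 1
    trials[[k]] <- data.frame(ES = ES, SS = SS,
                              p.test = mean(p.test), d.test = mean(d.test))
  }
}
Results <- do.call(rbind, trials)   # one rbind at the end instead of one per condition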
Vectorization of t.test()
Googling around, it appears there are a few vectorized implementations of t.test(). One I found easy to use is colttests() from the genefilter package. And although it isn't as much of a bottleneck, I also used a vectorized computation of d via the colSds() and colMeans() functions from matrixStats.
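As a stand-alone illustration of that idea, here is a minimal sketch that runs a one-sample t-test on every column of a matrix of difference scores (it assumes genefilter, which comes from Bioconductor, and matrixStats are installed; the sizes are arbitrary):

library(genefilter)
library(matrixStats)

SS <- 50          # observations per simulated experiment
n_iter <- 1000    # number of simulated experiments
ES <- 0.5         # true mean difference

# Each column holds the difference scores of one simulated experiment
diff.matrix <- matrix(rnorm(SS * n_iter, mean = ES), nrow = SS)

# A single-level grouping factor makes colttests() run a one-sample
# t-test (mean = 0) on each column, all in one call
p <- colttests(diff.matrix, gl(1, SS))$p.value

# Cohen's d for every column without an explicit loop
d <- colMeans(diff.matrix) / colSds(diff.matrix)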
Putting it together
Below I implement some of what your original code does. I didn't worry about the particular flavor of t-test that you were using (i.e. alternative = "greater"), and I didn't implement computing the correlation between p-values.
require(tidyverse)
require(purrr)
require(genefilter)
require(matrixStats)

N_ITER <- 10000
effect_size_range <- seq(0, 2, 0.5)
sample_size_range <- seq(10, 1000, 100)
test_or_repeat <- c('test', 'rep')

# One row per ES-SS combination, for both the test and the replication
input <- expand.grid(effect_size_range, sample_size_range, test_or_repeat)
names(input) <- c('ES', 'SS', 'test_or_rep')

output <-
  input %>%
  group_by(ES, SS) %>%
  mutate(ctrl = map(SS, ~rnorm(.x * N_ITER)),
         expt = map2(SS, ES, ~rnorm(.x * N_ITER, mean=.y)),
         diff = map2(expt, ctrl, ~ .x - .y),
         # one column per simulated experiment: SS rows, N_ITER columns
         diff.matrix = map2(diff, SS, ~matrix(.x, nrow=.y)),
         # one-sample t-test on every column in a single call
         p = map(diff.matrix, ~colttests(., gl(1, nrow(.)))$p.value),
         d = map(diff.matrix, ~colMeans(.)/colSds(.)),
         mean.p = map(p, ~mean(.)),
         mean.d = map(d, ~mean(.))) %>%
  select(ES, SS, test_or_rep, mean.p, mean.d) %>%
  unite(p_and_d, mean.p, mean.d, sep='___') %>%
  spread(test_or_rep, p_and_d) %>%
  separate(test, c('p.test', 'd.test'), sep='___') %>%
  separate(rep, c('p.rep', 'd.rep'), sep='___')
This run, which covers about 1/5th of the effect sizes and 1/10th of the sample sizes you are interested in, takes about 5 minutes on my laptop. Since it already uses N_ITER = 10000, I think scaling up to your full system size would take about 50 times as long, or about 4 hours. The bottlenecks here are still the t-tests, but also simply calling rnorm, a sign that we got rid of the assignment-by-element bottlenecks and reduced the t-test bottleneck at least somewhat.
Comment from @flodel: Why rnorm(iter, ...), unlike rnorm(SS, ...) for the test data? This makes your algorithm's complexity quadratic with respect to iter, so it is not surprising that your computation times are going through the roof as you go from 10 to 10k iterations.