The generic version of what I am trying to do is a simulation study in which I manipulate a few variables and see how that impacts a result. I'm having some speed issues with R. The latest simulation worked fine with a few iterations (10 per experiment). However, when I moved to the large-scale version (10k per experiment), the simulation became far too slow (it has been running for 14 hours and is still going).
Below is the code (with comments) that I am running. Being a rookie with R, I am struggling to optimize the simulation. My hope is to learn from the comments and suggestions provided here, optimize this code, and reuse it in future simulation studies.
Let me say a few things about what this code is supposed to do. I am manipulating two variables: effect size and sample size. Each combination is run 10k times (i.e., 10k experiments per condition). I initialize a data frame to store my results (called Results), and I loop over three variables: effect size, sample size, and iterations (10k).
Within the loops, I initialize four NULL vectors: p.test, p.rep, d.test, and d.rep. The former two capture the p-value of the initial t-test and the p-value of the replication (i.e., the experiment repeated under similar conditions). The latter two store the corresponding effect sizes (Cohen's d).
I generate my random data from a standard normal for the control condition (DVcontrol), and I use my effect size as the mean for the experimental condition (DVexperiment). I take the difference between the values and feed the result into R's t-test function (as a paired-samples t-test). After the iterations for a condition are done, I store the summarized results in a list called Trial and rbind it to the Results data frame. This is repeated for every effect-size and sample-size combination until completion.
# Set Simulation Parameters
## Effect Sizes (ES is equal to mean difference when SD equals Variance equals 1)
effect_size_range <- seq(0, 2, .1)  ## ES
## Sample Sizes
sample_size_range <- seq(10, 1000, 10)  ## SS
## Iterations for each ES-SS Combination
iter <- 10000

# Initialize the Vector of Results
Results <- data.frame()

# Set Random Seed
set.seed(12)

# Loop over the Different ESs
for (ES in effect_size_range) {
  # Loop over the Different Sample Sizes
  for (SS in sample_size_range) {
    # Create p-value Vectors
    p.test <- NULL
    p.rep <- NULL
    d.test <- NULL
    d.rep <- NULL
    # Loop over the iterations
    for (i in 1:iter) {
      # Generate Test Data
      DVcontrol <- rnorm(SS, mean=0, sd=1)
      DVexperiment <- rnorm(SS, mean=ES, sd=1)
      DVdiff <- DVexperiment - DVcontrol
      p.test[i] <- t.test(DVdiff, alternative="greater")$p.value
      d.test[i] <- mean(DVdiff) / sd(DVdiff)
      # Generate Replication Data
      DVcontrol <- rnorm(iter, mean=0, sd=1)
      DVexperiment <- rnorm(iter, mean=ES, sd=1)
      DVdiff <- DVexperiment - DVcontrol
      p.rep[i] <- t.test(DVdiff, alternative="greater")$p.value
      d.rep[i] <- mean(DVdiff) / sd(DVdiff)
    }
    # Results
    Trial <- list(ES=ES, SS=SS,
                  d.test=mean(d.test), d.rep=mean(d.rep),
                  p.test=mean(p.test), p.rep=mean(p.rep),
                  r=cor(p.test, p.rep, method="kendall"),
                  r.log=cor(log2(p.test)*(-1), log2(p.rep)*(-1), method="kendall"))
    Results <- rbind(Results, Trial)
  }
}
1 Answer
Possible Bug
If I understand what you are trying to do correctly, the lines...
DVcontrol <- rnorm(iter, mean=0, sd=1)
DVexperiment <- rnorm(iter, mean=ES, sd=1)
...should probably be
DVcontrol <- rnorm(SS, mean=0, sd=1)
DVexperiment <- rnorm(SS, mean=ES, sd=1)
This was noted in the comments by @flodel.
Identifying slow steps
The best way to figure out which lines of the code are slowest is to use a line-profiling tool; I like profvis. In general, the slow steps in R are element-wise assignment to vectors/matrices/data.frames and unvectorized functions. In your case,

p.rep[i] <- t.test(DVdiff, alternative="greater")$p.value
d.rep[i] <- mean(DVdiff) / sd(DVdiff)

are likely to be slow, and the t.test() function in R is unvectorized, so it is also a bottleneck.
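For example, a minimal sketch of profiling a scaled-down version of one ES-SS condition with profvis might look like the following (the sample size, effect size, and iteration count here are purely illustrative):

library(profvis)

# Open an interactive profile showing which lines dominate the run time
profvis({
  p.test <- NULL
  d.test <- NULL
  for (i in 1:2000) {
    DVdiff <- rnorm(100, mean = 0.5) - rnorm(100, mean = 0)
    p.test[i] <- t.test(DVdiff, alternative = "greater")$p.value
    d.test[i] <- mean(DVdiff) / sd(DVdiff)
  }
})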
Possible solutions
Vectorization of assignments
I'm a big fan of the tidyverse system of packages. For this case, the tidyr, purrr, and dplyr packages allow a very nice functional implementation of your problem. However, they won't automatically vectorize functions like t.test().
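Independently of the tidyverse route, the element-wise growth noted above can also be avoided in base R by preallocating the per-iteration vectors and binding the per-condition summaries once at the end. A rough sketch of that idea, with a deliberately reduced grid and iteration count and the unvectorized t.test() calls left in place:

# Reduced grid and iteration count, purely for illustration
effect_size_range <- c(0, 0.5)
sample_size_range <- c(10, 50)
iter <- 100

trials <- vector("list", length(effect_size_range) * length(sample_size_range))
k <- 0
for (ES in effect_size_range) {
  for (SS in sample_size_range) {
    p.test <- numeric(iter)   # preallocated instead of NULL
    d.test <- numeric(iter)
    for (i in 1:iter) {
      DVdiff <- rnorm(SS, mean = ES) - rnorm(SS)
      p.test[i] <- t.test(DVdiff, alternative = "greater")$p.value
      d.test[i] <- mean(DVdiff) / sd(DVdiff)
    }
    k <- k + 1
    trials[[k]] <- data.frame(ES = ES, SS = SS,
                              p.test = mean(p.test), d.test = mean(d.test))
  }
}
Results <- do.call(rbind, trials)   # one rbind at the end instead of one per condition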
Vectorization of t.test()
Googling around, it appears there are a few vectorized implementations of t.test(). One I found easy to use is colttests() from the genefilter package. And although it isn't as much of a bottleneck, I also used a vectorized computation of d via the colSds() and colMeans() functions from matrixStats.
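As a stand-alone illustration of that idea, here is a minimal sketch that runs a one-sample t-test on every column of a matrix of difference scores (it assumes genefilter, which comes from Bioconductor, and matrixStats are installed; the sizes are arbitrary):

library(genefilter)
library(matrixStats)

SS <- 50          # observations per simulated experiment
n_iter <- 1000    # number of simulated experiments
ES <- 0.5         # true mean difference

# Each column holds the difference scores of one simulated experiment
diff.matrix <- matrix(rnorm(SS * n_iter, mean = ES), nrow = SS)

# A single-level grouping factor makes colttests() run a one-sample
# t-test (mean = 0) on each column, all in one call
p <- colttests(diff.matrix, gl(1, SS))$p.value

# Cohen's d for every column without an explicit loop
d <- colMeans(diff.matrix) / colSds(diff.matrix)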
Putting it together
Below I implement some of what your original code does. I didn't worry about the particular flavor of t-test that you were using (i.e. alternative = "greater"), and I didn't implement computing the correlation between p-values.
require(tidyverse)
require(purrr)
require(genefilter)
require(matrixStats)

N_ITER <- 10000
effect_size_range <- seq(0, 2, 0.5)
sample_size_range <- seq(10, 1000, 100)
test_or_repeat <- c('test', 'rep')

# One row per ES-SS combination, for both the test and the replication
input <- expand.grid(effect_size_range, sample_size_range, test_or_repeat)
names(input) <- c('ES', 'SS', 'test_or_rep')

output <-
  input %>%
  group_by(ES, SS) %>%
  mutate(ctrl = map(SS, ~rnorm(.x * N_ITER)),
         expt = map2(SS, ES, ~rnorm(.x * N_ITER, mean=.y)),
         diff = map2(expt, ctrl, ~ .x - .y),
         # one column per simulated experiment: SS rows, N_ITER columns
         diff.matrix = map2(diff, SS, ~matrix(.x, nrow=.y)),
         # one-sample t-test on every column in a single call
         p = map(diff.matrix, ~colttests(., gl(1, nrow(.)))$p.value),
         d = map(diff.matrix, ~colMeans(.)/colSds(.)),
         mean.p = map(p, ~mean(.)),
         mean.d = map(d, ~mean(.))) %>%
  select(ES, SS, test_or_rep, mean.p, mean.d) %>%
  unite(p_and_d, mean.p, mean.d, sep='___') %>%
  spread(test_or_rep, p_and_d) %>%
  separate(test, c('p.test', 'd.test'), sep='___') %>%
  separate(rep, c('p.rep', 'd.rep'), sep='___')
This run, which covers about 1/5th of the effect sizes and 1/10th of the sample sizes you are interested in, takes about 5 minutes on my laptop. Since it already uses N_ITER = 10000, I think scaling up to your full system size would take about 50 times as long, or about 4 hours. The bottlenecks here are still the t-tests, but also simply calling rnorm, a sign that we got rid of the assignment-by-element bottlenecks and reduced the t-test bottleneck at least somewhat.
Comment from @flodel: Why rnorm(iter, ...), unlike rnorm(SS, ...) for the test data? This makes your algorithm's complexity quadratic with respect to iter, so it is not surprising that your computation times are going through the roof as you go from 10 to 10k iterations.