Repeated measures simulator optimisation

Question 1

I am learning to program in R and to do that and create something useful in the process, I have decided to rewrite this Java applet for repeated measures simulation and to implement some new functionality.

I have succeeded, but the code seems to run very slowly (5K simulations in around 8 seconds as opposed to ~ 0.8 seconds in that applet) so it is likely to be a bad solution. I am looking for improvements to the code, tips on how to speed up the process (surely, it can not be that slower than Java, best practices, and overall comments on my solution.

In order to compare, run the applet and click Simulate 5000 and then import the code to R and run simRM(5000). It is a lot, but I have tried to make it clear in the comments.

> system.time(simRM(5000))
 user system elapsed 
 8.896 0.063 8.843

library(MASS)
library(ggplot2)
# n = number of observations; m1 = mean of the first group; m2 = mean of the second group;
# sd = standard deviation of both groups; rho = correlation coefficient; type = experimental design
sml <- function(n = 8, m1 = 10, m2 = 15, sd = 5, rho = 0.5, type = "within") {
 n <<- n # Make number of observation pairs global
 type <<- type # Make type of experiment global
 if (type == "within") { # If paired (within group), then correlation rho
 cor <- matrix(rho, nrow = 2, ncol = 2)
 } else if (type == "between") { # If independent (between groups), then correlation 0
 rho <- 0
 cor <- matrix(rho, nrow = 2, ncol = 2)
 }
 diag(cor) <- 1 
 #sigma <- cor*as.matrix(c(sd,sd))%*%t(as.matrix(c(sd,sd))) # If different standard deviations
 sigma <- cor*(sd^2) # Compute covariance matrix.
 res <- as.data.frame(mvrnorm(n, c(m1, m2), sigma)) # Simulate from a multivariate normal distribution using MASS::mvrnorm()
 dFrame <- data.frame(por=c(rep("A",n),rep("B",n)),cis=c(1:n,1:n),val=c(res[,1],res[,2])) # Data frame: por = phase; cis = subject ID; val = value
 #print(describeBy(dFrame, group=dFrame$por)) # Debugging
 return(dFrame)
}
glRes <- vector(length=4) # Prepare a vector to write multiple results in
names(glRes) <- c("rep", "significant", "insignificant", "percent") # Name individual items
# rep = number of repetitions/simulations; alpha = the probability of making a type I error
simRM <- function(rep=1, alpha=0.05, ...) {
 t <- vector() # Prepare a vector to write t values in
 pb <- txtProgressBar(min = 0, max = rep, style = 3) # Set up the progress bar
 for (i in 1:rep) { 
 dFrame <- sml(...)
 #x <- t.test(dFrame[dFrame$por=="A",3],dFrame[dFrame$por=="B",3],paired=T)$p.value # For p values
 if (type == "within") { # Student's paired t-test 
 x <- unname(t.test(dFrame[dFrame$por=="A",3],dFrame[dFrame$por=="B",3],paired=T)$statistic)
 } else if (type == "between") { # Welch's t-test
 x <- unname(t.test(dFrame[dFrame$por=="A",3],dFrame[dFrame$por=="B",3])$statistic)
 }
 t <- abs(round(append(t, x),4)) # Add rounded t values to the vector t
 setTxtProgressBar(pb, i) # Update progress bar
 }
 cat("\n\n") # Introduce two line breaks to the output for better readability
 if (rep == 1) { # Print plot only if there is a single repetition/simulation
 if (type == "within") { # Plot the values and connect those coming from the same subject
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 geom_line() +
 theme_bw() 
 } else if(type == "between") { # Just plot the values
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 theme_bw()
 }
 print(plot)
 }
 if (type == "within") { # Compute the critical value
 criticalValue <- abs(qt(alpha/2, n-1))
 } else if (type == "between") {
 criticalValue <- abs(qt(alpha/2, 2*n-2))
 }
 sig <- length(t[t > criticalValue]) # The number of significant outcomes (t > critical value)
 insig <- length(t[t < criticalValue]) # The number of insignificant outcomes (t < critical value)
 res <- c(sig, insig, (sig/rep)*100) # The result containing sig, insig and the percentage of sig
 names(res) <- c("significant", "insignificant", "percent") # Name items accordingly
 glRes[1] <<- glRes[1] + rep # Update each value accordingly and add to the global result variable glRes
 glRes[2] <<- glRes[2] + sig
 glRes[3] <<- glRes[3] + insig
 glRes[4] <<- round(glRes[2] / glRes[1] * 100, 2)
 return(glRes) # Return the final result
}

Question 2

I am not sure if this will be under 2 seconds on your machine, but should be quite close.

The main points of optimisation:

Converting matrix (from mvrnorm) to data.frame is slow. I dropped data.frame and used resulting matrix directly.
sigma can be computed only once, so I removed it from the loop.
I generated multivariate normally distributed variables with length rep * n in one go. I subselect the rows in each iteration of the loop.
It is better to define t with required length, not to update it in loop.
Some other small tweeks.

It is possible to reduce the execution time even more by implementing parallel computing. All iterations are independent from each other so it will be effective approach.

Timing of the original code on my machine:

 user system elapsed 
 14.345 0.072 14.603

Timing of optimised code on my machine:

 user system elapsed 
 3.088 0.072 3.446

The code:

rm(list = ls())
gc()
library(MASS)
library(ggplot2)
sml <- function(n, m1, m2, sigma) mvrnorm(n, c(m1, m2), sigma)
glRes <- vector(length=4) # Prepare a vector to write multiple results in
names(glRes) <- c("rep", "significant", "insignificant", "percent") # Name individual items
# rep = number of repetitions/simulations; alpha = the probability of making a type I error
# n = number of observations; m1 = mean of the first group; m2 = mean of the second group;
# sd = standard deviation of both groups; rho = correlation coefficient; type = experimental design
simRM <- function(rep = 1, alpha = 0.05,
 n = 8, m1 = 10, m2 = 15,
 sd = 5, rho = 0.5, type = "within") {
 if (type == "within") { # If paired (within group), then correlation rho
 sigma <- matrix(c(1, rho, rho, 1), nrow = 2) * (sd^2)
 } else if (type == "between") { # If independent (between groups), then correlation 0
 sigma <- matrix(c(0, rho, rho, 0), nrow = 2) * (sd^2)
 } else stop("Wrong type")
 dFrame <- sml(n = n * rep, m1 = m1, m2 = m2, sigma = sigma)
 t <- vector(mode = "double", length = rep) # Prepare a vector to write t values in
 pb <- txtProgressBar(min = 0, max = rep, style = 3) # Set up the progress bar
 for (i in 1:rep) {
 dF <- dFrame[n*(i-1)+(1:n), ]
 t[i] <- (t.test(dF[, 1], dF[, 2], paired = (type == "within"))$statistic)
 setTxtProgressBar(pb, i) # Update progress bar
 }
 #t <- abs(round(t, 4)) ### Why do you round t?
 t <- abs(t)
 cat("\n\n") # Introduce two line breaks to the output for better readability
 if (rep == 1) { # Print plot only if there is a single repetition/simulation
 if (type == "within") { # Plot the values and connect those coming from the same subject
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 geom_line() +
 theme_bw() 
 } else if(type == "between") { # Just plot the values
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 theme_bw()
 }
 print(plot)
 }
 if (type == "within") { # Compute the critical value
 criticalValue <- abs(qt(alpha/2, n-1))
 } else if (type == "between") {
 criticalValue <- abs(qt(alpha/2, 2*n-2))
 }
 sig <- length(t[t > criticalValue]) # The number of significant outcomes (t > critical value)
 # insig <- length(t[t < criticalValue]) # The number of insignificant outcomes (t < critical value)
 insig <- rep - sig
 res <- c(sig, insig, (sig / rep) * 100) # The result containing sig, insig and the percentage of sig
 names(res) <- c("significant", "insignificant", "percent") # Name items accordingly
 glRes[1] <<- glRes[1] + rep # Update each value accordingly and add to the global result variable glRes
 glRes[2] <<- glRes[2] + sig
 glRes[3] <<- glRes[3] + insig
 glRes[4] <<- round(glRes[2] / glRes[1] * 100, 2)
 return(glRes) # Return the final result
}
set.seed(1)
system.time(simRM(5000))
glRes

Question 3

Thank you very much, you are the man! It is excellent and runs just under two seconds so I shall award the bounty as soon as the system allows me. :-) I had to make a couple of adjustments (e.g. ggplot2 needs a data frame) and realised that made a couple of really primitive mistakes. I especially adore how you have made the cycle—so simple, yet I managed to make it inefficient. I have learned a lot, thanks again!

Question 4

One more thing, actually—someone has suggested that if I really want to learn to program in R, I should do it first without using <<-, hence global variables. If possible, could you please elaborate on how could this task be done without using global variables? The one who had suggested it has not responded.

Question 5

You are right, I did not tested ggplot part. I agree that usage of <<- is not advisable. It is a bad practice of programming when function depends on global variables. I would use glRes as another argument for simRM, update it inside the function, and return the updated glRes.

Question 6

Here is another way to structure the inner loop. It builds off of @djhurio's solution but is slower. However, I think it is conceptually clearer. Also, global assignment is removed to instead return a vector of results.

library("plyr")
simRM <- function(rep = 1, alpha = 0.05,
 n = 8, m1 = 10, m2 = 15,
 sd = 5, rho = 0.5, type = "within") {
 if (type == "within") { # If paired (within group), then correlation rho
 sigma <- matrix(c(1, rho, rho, 1), nrow = 2) * (sd^2)
 } else if (type == "between") { # If independent (between groups), then correlation 0
 sigma <- matrix(c(0, rho, rho, 0), nrow = 2) * (sd^2)
 } else stop("Wrong type")
 dFrame <- sml(n = n * rep, m1 = m1, m2 = m2, sigma = sigma)
 dFrame2 <- aperm(array(dFrame, dim=c(n, rep, 2)), c(1,3,2))
 t <- aaply(dFrame2, 3, function(DF) {
 t.test(DF[, 1], DF[, 2], paired = (type=="within"))$statistic
 }, .progress="text")
 t <- abs(t)
 cat("\n\n") # Introduce two line breaks to the output for better readability
 if (rep == 1) { # Print plot only if there is a single repetition/simulation
 if (type == "within") { # Plot the values and connect those coming from the same subject
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 geom_line() +
 theme_bw() 
 } else if(type == "between") { # Just plot the values
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 theme_bw()
 }
 print(plot)
 }
 if (type == "within") { # Compute the critical value
 criticalValue <- abs(qt(alpha/2, n-1))
 } else if (type == "between") {
 criticalValue <- abs(qt(alpha/2, 2*n-2))
 }
 sig <- length(t[t > criticalValue]) # The number of significant outcomes (t > critical value)
 c("rep" = rep,
 "significant" = sig,
 "insignificant" = rep - sig,
 "percent" = 100*(sig / rep))
}

The key difference is the computation of the t vector inside the function. First, dFrame is restructured from a (n*rep)x2 matrix into an n x 2 x rep array (3 dimensional array). This gives the benefit of generating all the data at once, but also gives a data structure which more easily operates on each piece/repetition. The aaply function from plyr is then used to iterate over the 3rd dimension and operates on each n x 2 matrix separately. Advantages of using plyr is the availability of progress bars simply as well as being able to split the processing into parallel processing with not much additional effort.

Since the values for glRes are returned rather than assigned, the calling semantics are slightly different:

set.seed(1)
system.time(glRes <- simRM(5000))
glRes

The slowdown is not that great:

@djhurio's solution:

 user system elapsed 
 1.40 0.01 1.42

My solution:

 user system elapsed 
 2.20 0.03 2.23

The resulting values of glRes are identical.

djhurio djhurio 2861 silver badge7 bronze badges · Accepted Answer · 2013-05-05 08:40:04Z

I am not sure if this will be under 2 seconds on your machine, but should be quite close.

The main points of optimisation:

Converting matrix (from mvrnorm) to data.frame is slow. I dropped data.frame and used resulting matrix directly.
sigma can be computed only once, so I removed it from the loop.
I generated multivariate normally distributed variables with length rep * n in one go. I subselect the rows in each iteration of the loop.
It is better to define t with required length, not to update it in loop.
Some other small tweeks.

It is possible to reduce the execution time even more by implementing parallel computing. All iterations are independent from each other so it will be effective approach.

Timing of the original code on my machine:

 user system elapsed 
 14.345 0.072 14.603

Timing of optimised code on my machine:

 user system elapsed 
 3.088 0.072 3.446

The code:

rm(list = ls())
gc()
library(MASS)
library(ggplot2)
sml <- function(n, m1, m2, sigma) mvrnorm(n, c(m1, m2), sigma)
glRes <- vector(length=4) # Prepare a vector to write multiple results in
names(glRes) <- c("rep", "significant", "insignificant", "percent") # Name individual items
# rep = number of repetitions/simulations; alpha = the probability of making a type I error
# n = number of observations; m1 = mean of the first group; m2 = mean of the second group;
# sd = standard deviation of both groups; rho = correlation coefficient; type = experimental design
simRM <- function(rep = 1, alpha = 0.05,
 n = 8, m1 = 10, m2 = 15,
 sd = 5, rho = 0.5, type = "within") {
 if (type == "within") { # If paired (within group), then correlation rho
 sigma <- matrix(c(1, rho, rho, 1), nrow = 2) * (sd^2)
 } else if (type == "between") { # If independent (between groups), then correlation 0
 sigma <- matrix(c(0, rho, rho, 0), nrow = 2) * (sd^2)
 } else stop("Wrong type")
 dFrame <- sml(n = n * rep, m1 = m1, m2 = m2, sigma = sigma)
 t <- vector(mode = "double", length = rep) # Prepare a vector to write t values in
 pb <- txtProgressBar(min = 0, max = rep, style = 3) # Set up the progress bar
 for (i in 1:rep) {
 dF <- dFrame[n*(i-1)+(1:n), ]
 t[i] <- (t.test(dF[, 1], dF[, 2], paired = (type == "within"))$statistic)
 setTxtProgressBar(pb, i) # Update progress bar
 }
 #t <- abs(round(t, 4)) ### Why do you round t?
 t <- abs(t)
 cat("\n\n") # Introduce two line breaks to the output for better readability
 if (rep == 1) { # Print plot only if there is a single repetition/simulation
 if (type == "within") { # Plot the values and connect those coming from the same subject
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 geom_line() +
 theme_bw() 
 } else if(type == "between") { # Just plot the values
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 theme_bw()
 }
 print(plot)
 }
 if (type == "within") { # Compute the critical value
 criticalValue <- abs(qt(alpha/2, n-1))
 } else if (type == "between") {
 criticalValue <- abs(qt(alpha/2, 2*n-2))
 }
 sig <- length(t[t > criticalValue]) # The number of significant outcomes (t > critical value)
 # insig <- length(t[t < criticalValue]) # The number of insignificant outcomes (t < critical value)
 insig <- rep - sig
 res <- c(sig, insig, (sig / rep) * 100) # The result containing sig, insig and the percentage of sig
 names(res) <- c("significant", "insignificant", "percent") # Name items accordingly
 glRes[1] <<- glRes[1] + rep # Update each value accordingly and add to the global result variable glRes
 glRes[2] <<- glRes[2] + sig
 glRes[3] <<- glRes[3] + insig
 glRes[4] <<- round(glRes[2] / glRes[1] * 100, 2)
 return(glRes) # Return the final result
}
set.seed(1)
system.time(simRM(5000))
glRes

Thank you very much, you are the man! It is excellent and runs just under two seconds so I shall award the bounty as soon as the system allows me. :-) I had to make a couple of adjustments (e.g. ggplot2 needs a data frame) and realised that made a couple of really primitive mistakes. I especially adore how you have made the cycle—so simple, yet I managed to make it inefficient. I have learned a lot, thanks again!
One more thing, actually—someone has suggested that if I really want to learn to program in R, I should do it first without using <<-, hence global variables. If possible, could you please elaborate on how could this task be done without using global variables? The one who had suggested it has not responded.
You are right, I did not tested ggplot part. I agree that usage of <<- is not advisable. It is a bad practice of programming when function depends on global variables. I would use glRes as another argument for simRM, update it inside the function, and return the updated glRes.

Stack Exchange Network

Repeated measures simulator optimisation

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Repeated measures simulator optimisation

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions