1
\$\begingroup\$

I am learning to program in R and to do that and create something useful in the process, I have decided to rewrite this Java applet for repeated measures simulation and to implement some new functionality.

I have succeeded, but the code seems to run very slowly (5K simulations in around 8 seconds as opposed to ~ 0.8 seconds in that applet) so it is likely to be a bad solution. I am looking for improvements to the code, tips on how to speed up the process (surely, it can not be that slower than Java, best practices, and overall comments on my solution.

In order to compare, run the applet and click Simulate 5000 and then import the code to R and run simRM(5000). It is a lot, but I have tried to make it clear in the comments.

> system.time(simRM(5000))
 user system elapsed 
 8.896 0.063 8.843

library(MASS)
library(ggplot2)
# n = number of observations; m1 = mean of the first group; m2 = mean of the second group;
# sd = standard deviation of both groups; rho = correlation coefficient; type = experimental design
sml <- function(n = 8, m1 = 10, m2 = 15, sd = 5, rho = 0.5, type = "within") {
 n <<- n # Make number of observation pairs global
 type <<- type # Make type of experiment global
 if (type == "within") { # If paired (within group), then correlation rho
 cor <- matrix(rho, nrow = 2, ncol = 2)
 } else if (type == "between") { # If independent (between groups), then correlation 0
 rho <- 0
 cor <- matrix(rho, nrow = 2, ncol = 2)
 }
 diag(cor) <- 1 
 #sigma <- cor*as.matrix(c(sd,sd))%*%t(as.matrix(c(sd,sd))) # If different standard deviations
 sigma <- cor*(sd^2) # Compute covariance matrix.
 res <- as.data.frame(mvrnorm(n, c(m1, m2), sigma)) # Simulate from a multivariate normal distribution using MASS::mvrnorm()
 dFrame <- data.frame(por=c(rep("A",n),rep("B",n)),cis=c(1:n,1:n),val=c(res[,1],res[,2])) # Data frame: por = phase; cis = subject ID; val = value
 #print(describeBy(dFrame, group=dFrame$por)) # Debugging
 return(dFrame)
}
glRes <- vector(length=4) # Prepare a vector to write multiple results in
names(glRes) <- c("rep", "significant", "insignificant", "percent") # Name individual items
# rep = number of repetitions/simulations; alpha = the probability of making a type I error
simRM <- function(rep=1, alpha=0.05, ...) {
 t <- vector() # Prepare a vector to write t values in
 pb <- txtProgressBar(min = 0, max = rep, style = 3) # Set up the progress bar
 for (i in 1:rep) { 
 dFrame <- sml(...)
 #x <- t.test(dFrame[dFrame$por=="A",3],dFrame[dFrame$por=="B",3],paired=T)$p.value # For p values
 if (type == "within") { # Student's paired t-test 
 x <- unname(t.test(dFrame[dFrame$por=="A",3],dFrame[dFrame$por=="B",3],paired=T)$statistic)
 } else if (type == "between") { # Welch's t-test
 x <- unname(t.test(dFrame[dFrame$por=="A",3],dFrame[dFrame$por=="B",3])$statistic)
 }
 t <- abs(round(append(t, x),4)) # Add rounded t values to the vector t
 setTxtProgressBar(pb, i) # Update progress bar
 }
 cat("\n\n") # Introduce two line breaks to the output for better readability
 if (rep == 1) { # Print plot only if there is a single repetition/simulation
 if (type == "within") { # Plot the values and connect those coming from the same subject
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 geom_line() +
 theme_bw() 
 } else if(type == "between") { # Just plot the values
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 theme_bw()
 }
 print(plot)
 }
 if (type == "within") { # Compute the critical value
 criticalValue <- abs(qt(alpha/2, n-1))
 } else if (type == "between") {
 criticalValue <- abs(qt(alpha/2, 2*n-2))
 }
 sig <- length(t[t > criticalValue]) # The number of significant outcomes (t > critical value)
 insig <- length(t[t < criticalValue]) # The number of insignificant outcomes (t < critical value)
 res <- c(sig, insig, (sig/rep)*100) # The result containing sig, insig and the percentage of sig
 names(res) <- c("significant", "insignificant", "percent") # Name items accordingly
 glRes[1] <<- glRes[1] + rep # Update each value accordingly and add to the global result variable glRes
 glRes[2] <<- glRes[2] + sig
 glRes[3] <<- glRes[3] + insig
 glRes[4] <<- round(glRes[2] / glRes[1] * 100, 2)
 return(glRes) # Return the final result
}
Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked May 4, 2013 at 20:26
\$\endgroup\$

2 Answers 2

4
+50
\$\begingroup\$

I am not sure if this will be under 2 seconds on your machine, but should be quite close.

The main points of optimisation:

  1. Converting matrix (from mvrnorm) to data.frame is slow. I dropped data.frame and used resulting matrix directly.
  2. sigma can be computed only once, so I removed it from the loop.
  3. I generated multivariate normally distributed variables with length rep * n in one go. I subselect the rows in each iteration of the loop.
  4. It is better to define t with required length, not to update it in loop.
  5. Some other small tweeks.

It is possible to reduce the execution time even more by implementing parallel computing. All iterations are independent from each other so it will be effective approach.

Timing of the original code on my machine:

 user system elapsed 
 14.345 0.072 14.603 

Timing of optimised code on my machine:

 user system elapsed 
 3.088 0.072 3.446 

The code:

rm(list = ls())
gc()
library(MASS)
library(ggplot2)
sml <- function(n, m1, m2, sigma) mvrnorm(n, c(m1, m2), sigma)
glRes <- vector(length=4) # Prepare a vector to write multiple results in
names(glRes) <- c("rep", "significant", "insignificant", "percent") # Name individual items
# rep = number of repetitions/simulations; alpha = the probability of making a type I error
# n = number of observations; m1 = mean of the first group; m2 = mean of the second group;
# sd = standard deviation of both groups; rho = correlation coefficient; type = experimental design
simRM <- function(rep = 1, alpha = 0.05,
 n = 8, m1 = 10, m2 = 15,
 sd = 5, rho = 0.5, type = "within") {
 if (type == "within") { # If paired (within group), then correlation rho
 sigma <- matrix(c(1, rho, rho, 1), nrow = 2) * (sd^2)
 } else if (type == "between") { # If independent (between groups), then correlation 0
 sigma <- matrix(c(0, rho, rho, 0), nrow = 2) * (sd^2)
 } else stop("Wrong type")
 dFrame <- sml(n = n * rep, m1 = m1, m2 = m2, sigma = sigma)
 t <- vector(mode = "double", length = rep) # Prepare a vector to write t values in
 pb <- txtProgressBar(min = 0, max = rep, style = 3) # Set up the progress bar
 for (i in 1:rep) {
 dF <- dFrame[n*(i-1)+(1:n), ]
 t[i] <- (t.test(dF[, 1], dF[, 2], paired = (type == "within"))$statistic)
 setTxtProgressBar(pb, i) # Update progress bar
 }
 #t <- abs(round(t, 4)) ### Why do you round t?
 t <- abs(t)
 cat("\n\n") # Introduce two line breaks to the output for better readability
 if (rep == 1) { # Print plot only if there is a single repetition/simulation
 if (type == "within") { # Plot the values and connect those coming from the same subject
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 geom_line() +
 theme_bw() 
 } else if(type == "between") { # Just plot the values
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 theme_bw()
 }
 print(plot)
 }
 if (type == "within") { # Compute the critical value
 criticalValue <- abs(qt(alpha/2, n-1))
 } else if (type == "between") {
 criticalValue <- abs(qt(alpha/2, 2*n-2))
 }
 sig <- length(t[t > criticalValue]) # The number of significant outcomes (t > critical value)
 # insig <- length(t[t < criticalValue]) # The number of insignificant outcomes (t < critical value)
 insig <- rep - sig
 res <- c(sig, insig, (sig / rep) * 100) # The result containing sig, insig and the percentage of sig
 names(res) <- c("significant", "insignificant", "percent") # Name items accordingly
 glRes[1] <<- glRes[1] + rep # Update each value accordingly and add to the global result variable glRes
 glRes[2] <<- glRes[2] + sig
 glRes[3] <<- glRes[3] + insig
 glRes[4] <<- round(glRes[2] / glRes[1] * 100, 2)
 return(glRes) # Return the final result
}
set.seed(1)
system.time(simRM(5000))
glRes
answered May 5, 2013 at 8:40
\$\endgroup\$
3
  • \$\begingroup\$ Thank you very much, you are the man! It is excellent and runs just under two seconds so I shall award the bounty as soon as the system allows me. :-) I had to make a couple of adjustments (e.g. ggplot2 needs a data frame) and realised that made a couple of really primitive mistakes. I especially adore how you have made the cycle—so simple, yet I managed to make it inefficient. I have learned a lot, thanks again! \$\endgroup\$ Commented May 5, 2013 at 16:17
  • \$\begingroup\$ One more thing, actually—someone has suggested that if I really want to learn to program in R, I should do it first without using <<-, hence global variables. If possible, could you please elaborate on how could this task be done without using global variables? The one who had suggested it has not responded. \$\endgroup\$ Commented May 5, 2013 at 16:26
  • \$\begingroup\$ You are right, I did not tested ggplot part. I agree that usage of <<- is not advisable. It is a bad practice of programming when function depends on global variables. I would use glRes as another argument for simRM, update it inside the function, and return the updated glRes. \$\endgroup\$ Commented May 5, 2013 at 19:15
1
\$\begingroup\$

Here is another way to structure the inner loop. It builds off of @djhurio's solution but is slower. However, I think it is conceptually clearer. Also, global assignment is removed to instead return a vector of results.

library("plyr")
simRM <- function(rep = 1, alpha = 0.05,
 n = 8, m1 = 10, m2 = 15,
 sd = 5, rho = 0.5, type = "within") {
 if (type == "within") { # If paired (within group), then correlation rho
 sigma <- matrix(c(1, rho, rho, 1), nrow = 2) * (sd^2)
 } else if (type == "between") { # If independent (between groups), then correlation 0
 sigma <- matrix(c(0, rho, rho, 0), nrow = 2) * (sd^2)
 } else stop("Wrong type")
 dFrame <- sml(n = n * rep, m1 = m1, m2 = m2, sigma = sigma)
 dFrame2 <- aperm(array(dFrame, dim=c(n, rep, 2)), c(1,3,2))
 t <- aaply(dFrame2, 3, function(DF) {
 t.test(DF[, 1], DF[, 2], paired = (type=="within"))$statistic
 }, .progress="text")
 t <- abs(t)
 cat("\n\n") # Introduce two line breaks to the output for better readability
 if (rep == 1) { # Print plot only if there is a single repetition/simulation
 if (type == "within") { # Plot the values and connect those coming from the same subject
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 geom_line() +
 theme_bw() 
 } else if(type == "between") { # Just plot the values
 plot <- ggplot(data=dFrame, aes(x=por, y=val, group=cis)) +
 geom_point() +
 theme_bw()
 }
 print(plot)
 }
 if (type == "within") { # Compute the critical value
 criticalValue <- abs(qt(alpha/2, n-1))
 } else if (type == "between") {
 criticalValue <- abs(qt(alpha/2, 2*n-2))
 }
 sig <- length(t[t > criticalValue]) # The number of significant outcomes (t > critical value)
 c("rep" = rep,
 "significant" = sig,
 "insignificant" = rep - sig,
 "percent" = 100*(sig / rep))
}

The key difference is the computation of the t vector inside the function. First, dFrame is restructured from a (n*rep)x2 matrix into an n x 2 x rep array (3 dimensional array). This gives the benefit of generating all the data at once, but also gives a data structure which more easily operates on each piece/repetition. The aaply function from plyr is then used to iterate over the 3rd dimension and operates on each n x 2 matrix separately. Advantages of using plyr is the availability of progress bars simply as well as being able to split the processing into parallel processing with not much additional effort.

Since the values for glRes are returned rather than assigned, the calling semantics are slightly different:

set.seed(1)
system.time(glRes <- simRM(5000))
glRes

The slowdown is not that great:

@djhurio's solution:

 user system elapsed 
 1.40 0.01 1.42 

My solution:

 user system elapsed 
 2.20 0.03 2.23 

The resulting values of glRes are identical.

answered Nov 11, 2013 at 19:52
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.