\$\begingroup\$

I'm writing code that grows Random Forest models over a grid of parameters. In short:

  1. First, I declare a data frame in which the model parameters and some stats will be saved.
  2. Second, I declare the model parameters and the loop counter (it is shown after every loop iteration).
  3. Next, I have nested loops containing the model and the prediction function.
  4. Then the parameters and some stats from the confusion matrix are saved to the data frame.
  5. The iteration number is printed and incremented.
  6. Finally, the garbage collector is called.

The code looks like this:

## data frame in which model parameters and some stats will be saved
model_eff <- data.frame("ntrees" = numeric(0),
                        "zeros" = numeric(0),
                        "mvars" = numeric(0),
                        "eff" = numeric(0),
                        "0_0" = numeric(0),
                        "0_1" = numeric(0),
                        "1_0" = numeric(0),
                        "1_1" = numeric(0),
                        "predict_sum" = numeric(0),
                        "triangle" = numeric(0))
## parameters
ntrees <- c(300, 500)
zeros <- sum(train.target) * c(1, 2, 3, 4, 5)
mvars <- c(30, 50, 70, 90, 110, 130)
## loop counter
i = 1
## loop with model, prediction etc.
for (j in 1:length(ntrees)){
  for (k in 1:length(zeros)){
    for (l in 1:length(mvars)){
      ## i-th model
      model <- randomForest(train,
                            y = as.factor(train.target),
                            ntree = ntrees[j],
                            do.trace = T,
                            sampsize = c('0' = zeros[k], '1' = sum(train.target)),
                            mtry = mvars[l])
      ## prediction - my function, apart from a regular prediction
      ## outputs additional info
      predict.model(model, val, val.target)
      ## inserting model parameters and stats to a data frame for further comparisons
      model_eff <- rbind(model_eff,
                         c("ntrees" = ntrees[j],
                           "zeros" = zeros[k],
                           "mvars" = mvars[l],
                           "eff" = eff_measures$eff,
                           "0_0" = eff_measures$c.m[1, 1],
                           "0_1" = eff_measures$c.m[1, 2],
                           "1_0" = eff_measures$c.m[2, 1],
                           "1_1" = eff_measures$c.m[2, 2],
                           "predict_sum" = sum(TARGET3),
                           "triangle" = eff_measures$triangle))
      ## printing the number of iteration
      cat("iteration =", i)
      i <- i+1
      ## calling garbage collector to assure free space in RAM
      gc()
    }
  }
}

I have already split the train/validation data sets and their target variables, knowing that Random Forest deals with such data more efficiently. I also tried the "foreach" package to parallelize the computations; however, the growing time for only one tree was 10-15% longer than without using all the cores.

I would like to know whether I can shorten the execution time of this code, and in particular whether there is a way to avoid the nested loops, since I have heard that they are not the best way of programming in R.

asked Mar 14, 2016 at 13:32
\$\endgroup\$

1 Answer

\$\begingroup\$

Reproducible Example

Unfortunately, the code snippet you gave is not reproducible (the data and the predict.model function are missing), so the advice below is limited to what can be read off the code.

Caches are nice

When a value stays constant across iterations, compute it once and reuse it. Here that applies to sum(train.target), and to sum(TARGET3) as well, provided TARGET3 is not changed inside the loop (for example by predict.model). Say:

stt = sum(train.target)
st3 = sum(TARGET3)
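
To illustrate the idea on stand-in data (the vector target below is made up and is not the asker's train.target), hoisting the sum out of the loop gives the same values while computing the sum only once:

```r
# Stand-in 0/1 target vector; the real train.target is not available here.
set.seed(1)
target <- rbinom(1e5, size = 1, prob = 0.1)
multipliers <- c(1, 2, 3, 4, 5)

# Version 1: the sum is recomputed on every iteration
sizes_slow <- numeric(length(multipliers))
for (k in seq_along(multipliers)) {
  sizes_slow[k] <- sum(target) * multipliers[k]
}

# Version 2: the sum is computed once and the cached value is reused
stt <- sum(target)
sizes_fast <- stt * multipliers

# Both versions produce the same sample sizes
stopifnot(identical(sizes_slow, sizes_fast))
```

The saving is small next to the cost of fitting a forest, but it costs nothing to do.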

Knowledge (of size) is Power!

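One key issue is that you call rbind on the data frame 60 times (2 x 5 x 6 iterations), because it is declared with zero rows, and every call copies the whole data frame. Since the number of iterations is known in advance, preallocate the data frame to its final size: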

## parameters
ntrees <- c(300, 500)
zeros <- sum(train.target) * c(1, 2, 3, 4, 5)
mvars <- c(30, 50, 70, 90, 110, 130)
## total number of iterations, i.e. rows to preallocate
nitr = length(ntrees)*length(zeros)*length(mvars)
model_eff <- data.frame("ntrees" = numeric(nitr),
                        "zeros" = numeric(nitr),
                        "mvars" = numeric(nitr),
                        "eff" = numeric(nitr),
                        "0_0" = numeric(nitr),
                        "0_1" = numeric(nitr),
                        "1_0" = numeric(nitr),
                        "1_1" = numeric(nitr),
                        "predict_sum" = numeric(nitr),
                        "triangle" = numeric(nitr),
                        check.names = FALSE,   # otherwise "0_0" becomes "X0_0"
                        stringsAsFactors = FALSE)

Declare count = 1 before the three nested for loops, then save each iteration's results by row index:

model_eff[count,] = c("ntrees" = ntrees[j],
                      "zeros" = zeros[k],
                      "mvars" = mvars[l],
                      "eff" = eff_measures$eff,
                      "0_0" = eff_measures$c.m[1, 1],
                      "0_1" = eff_measures$c.m[1, 2],
                      "1_0" = eff_measures$c.m[2, 1],
                      "1_1" = eff_measures$c.m[2, 2],
                      "predict_sum" = st3,
                      "triangle" = eff_measures$triangle)
count = count + 1
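
Since the question also asks about avoiding the nested loops, one option is to build the whole parameter grid with expand.grid and loop over its rows once, which also gives the preallocated size for free. The following is only a sketch under stated assumptions: fit_and_score is a hypothetical stand-in for the randomForest and predict.model step, it returns made-up numbers, and the value 100 stands in for sum(train.target).

```r
# Hypothetical stand-in for fitting the forest and scoring it on validation
# data. It returns made-up numbers so the sketch runs without the real data.
fit_and_score <- function(ntree, zero, mvar) {
  c(eff = ntree / (zero + mvar), triangle = mvar / ntree)
}

stt <- 100  # stands in for sum(train.target)

# One row per parameter combination: 2 * 5 * 6 = 60 rows
grid <- expand.grid(ntrees = c(300, 500),
                    zeros  = stt * c(1, 2, 3, 4, 5),
                    mvars  = c(30, 50, 70, 90, 110, 130))

# Preallocate the result columns, then fill them row by row
grid$eff      <- numeric(nrow(grid))
grid$triangle <- numeric(nrow(grid))

for (i in seq_len(nrow(grid))) {
  res <- fit_and_score(grid$ntrees[i], grid$zeros[i], grid$mvars[i])
  grid$eff[i]      <- res[["eff"]]
  grid$triangle[i] <- res[["triangle"]]
}

head(grid)
```

Flattening the loops mainly buys readability: the run time is most likely dominated by the model fits themselves, not by the looping.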

Parallel RandomForest via caret

The only other suggestion I have is to parallelize the build of the random forest via caret:

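# caret modeling framework
library(caret)
# Parallel backend
library(doParallel)
# Register a cluster
registerDoParallel(cores = 5)
# The target must be a factor for classification, and allowParallel is an
# argument of trainControl(). The proximity matrix is costly to compute,
# so drop prox = TRUE if it is not needed.
rf_model = train(x = train, y = as.factor(train.target), method = "rf",
                 prox = TRUE,
                 trControl = trainControl(allowParallel = TRUE))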
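
Another way to use several cores is to fit separate parameter combinations on separate workers, instead of parallelizing inside a single forest. The following is only a sketch using the base parallel package: fit_and_score is a hypothetical placeholder for the model fit and scoring, the grid is shortened, and two worker processes are assumed.

```r
library(parallel)

# Hypothetical placeholder for "fit one forest and return its statistics".
fit_and_score <- function(ntree, mvar) {
  c(ntrees = ntree, mvars = mvar, eff = mvar / ntree)
}

# Shortened parameter grid, one row per model to fit
grid <- expand.grid(ntrees = c(300, 500), mvars = c(30, 50, 70))

cl <- makeCluster(2)  # assumed: two worker processes

# Each worker handles whole models. The grid and the scoring function are
# passed as extra arguments so that they are shipped to the workers.
results <- parLapply(cl, seq_len(nrow(grid)), function(i, grid, scorer) {
  scorer(grid$ntrees[i], grid$mvars[i])
}, grid = grid, scorer = fit_and_score)

stopCluster(cl)

# Bind the per-model vectors into one data frame
model_eff <- as.data.frame(do.call(rbind, results))
```

With real data each worker needs its own copy of the training set, so memory use grows with the number of workers.
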
answered Mar 19, 2016 at 17:59
\$\endgroup\$
