I'm writing code that grows Random Forest models over a grid of parameters. In short:
- First, I declare a data frame in which the model parameters and some stats will be saved.
- Second, I declare the model parameters and the loop counter (it is printed after every iteration).
- Next, I have nested loops containing the model and the prediction function.
- The parameters and some stats from the confusion matrix are then saved to the data frame.
- The iteration number is printed and incremented.
- Finally, the garbage collector is called.
The code looks like this:
## data frame in which model parameters and some stats will be saved
model_eff <- data.frame("ntrees" = numeric(0),
"zeros" = numeric(0),
"mvars"= numeric(0),
"eff" = numeric(0),
"0_0" = numeric(0),
"0_1" = numeric(0),
"1_0" = numeric(0),
"1_1" = numeric(0),
"predict_sum" = numeric(0),
"triangle" = numeric(0))
## parameters
ntrees <- c(300, 500)
zeros <- sum(train.target) * c(1, 2, 3, 4, 5)
mvars <- c(30, 50, 70, 90, 110, 130)
## loop counter
i = 1
## loop with model, prediction etc.
for (j in 1:length(ntrees)){
for (k in 1:length(zeros)){
for (l in 1:length(mvars)){
## i-th model
model <- randomForest(train,
y = as.factor(train.target),
ntree = ntrees[j],
do.trace = T,
sampsize = c('0' = zeros[k], '1' = sum(train.target)),
mtry = mvars[l])
## prediction - my function, apart from a regular prediction
## outputs additional info
predict.model(model, val, val.target)
## inserting model parameters and stats to a data frame for further comparisons
model_eff <- rbind(model_eff,
c("ntrees" = ntrees[j],
"zeros" = zeros[k],
"mvars"= mvars[l],
"eff" = eff_measures$eff,
"0_0" = eff_measures$c.m[1, 1],
"0_1" = eff_measures$c.m[1, 2],
"1_0" = eff_measures$c.m[2, 1],
"1_1" = eff_measures$c.m[2, 2],
"predict_sum" = sum(TARGET3),
"triangle" = eff_measures$triangle))
## printing the number of iteration
cat("iteration =", i)
i <- i+1
## calling garbage collector to assure free space in RAM
gc()
}
}
}
I have already split the train/validation data sets from their target variables, knowing that Random Forest handles such data more efficiently. I also tried the "foreach" package to parallelize the computations; however, the time to grow a single tree was 10-15% longer than without using all the cores.
I would like to know whether I can shorten the execution time of this code, and in particular whether there is a way to avoid the nested loops, since I have heard that they are not the best way to program in R.
1 Answer
Reproducible Example
Unfortunately, the code snippet you gave is not reproducible (the data and the predict.model function are missing), so the advice below is limited to what can be seen in the code.
Caches are nice
When a value stays constant across iterations, compute it once before the loops and reuse it instead of recomputing it every time. Here that applies to sum(train.target) and, provided TARGET3 is not changed by predict.model, to sum(TARGET3). Say:
stt = sum(train.target)
st3 = sum(TARGET3)
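As a minimal sketch of the idea, with a made-up 0/1 vector standing in for train.target (the real data is not shown in the question), the sum is computed once and then reused wherever the loops previously recomputed it:

```r
# Hypothetical stand-in for train.target; the real vector is not shown in the question
train.target <- c(0, 1, 0, 0, 1, 1, 0, 1)

# Compute the loop-invariant value once, outside the loops
stt <- sum(train.target)

# Reuse the cached value everywhere it was previously recomputed
zeros <- stt * c(1, 2, 3, 4, 5)
sampsizes <- lapply(zeros, function(z) c("0" = z, "1" = stt))
```

With only 60 iterations the saving from this is small next to the cost of fitting each forest, so treat it as tidying rather than a real speed-up.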
Knowledge (of size) is Power!
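One key issue is that you call rbind 60 times (2 x 5 x 6 parameter combinations), so the data frame is copied and grown on every iteration, because it is declared with zero rows. The number of iterations is known up front, so preallocate the data frame at its final size:
## parameters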
ntrees <- c(300, 500)
zeros <- sum(train.target) * c(1, 2, 3, 4, 5)
mvars <- c(30, 50, 70, 90, 110, 130)
nitr = length(ntrees)*length(zeros)*length(mvars)
model_eff <- data.frame("ntrees" = numeric(nitr),
"zeros" = numeric(nitr),
"mvars" = numeric(nitr),
"eff" = numeric(nitr),
"0_0" = numeric(nitr),
"0_1" = numeric(nitr),
"1_0" = numeric(nitr),
"1_1" = numeric(nitr),
"predict_sum" = numeric(nitr),
"triangle" = numeric(nitr),
stringsAsFactors = F)
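One detail worth checking, shown here as a small sketch: with the default check.names = TRUE, data.frame() rewrites names such as "0_0" into syntactically valid ones (here "X0_0"), so those columns are not stored under the names typed above. Assignment by position is unaffected, but later lookups by name are.

```r
# Default behaviour: names that start with a digit are made syntactically valid
df_default <- data.frame("0_0" = numeric(2), "eff" = numeric(2))
names(df_default)

# check.names = FALSE keeps the names exactly as written
df_exact <- data.frame("0_0" = numeric(2), "eff" = numeric(2), check.names = FALSE)
names(df_exact)
```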
Declare count = 1 before the three nested for loops. Then save the results using:
model_eff[count,] = c("ntrees" = ntrees[j],
"zeros" = zeros[k],
"mvars"= mvars[l],
"eff" = eff_measures$eff,
"0_0" = eff_measures$c.m[1, 1],
"0_1" = eff_measures$c.m[1, 2],
"1_0" = eff_measures$c.m[2, 1],
"1_1" = eff_measures$c.m[2, 2],
"predict_sum" = st3,
"triangle" = eff_measures$triangle)
count = count + 1
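On the question of avoiding the nested loops: one option, sketched here under assumptions, is to build every parameter combination with expand.grid() and iterate over its rows once. fit_and_score() is a hypothetical placeholder for the randomForest() call plus predict.model(); it returns a dummy value so that the sketch runs on its own, and stt is given a made-up value.

```r
# Parameter values from the question; stt stands in for sum(train.target)
stt <- 100  # hypothetical value
ntrees <- c(300, 500)
zeros <- stt * c(1, 2, 3, 4, 5)
mvars <- c(30, 50, 70, 90, 110, 130)

# One row per combination: 2 * 5 * 6 = 60 rows
grid <- expand.grid(ntrees = ntrees, zeros = zeros, mvars = mvars)

# Hypothetical placeholder for fitting the forest and scoring it on validation data
fit_and_score <- function(ntree, zero, mvar) {
  ntree + zero + mvar  # dummy "eff" value, not a real metric
}

# Preallocate the result column, then fill it in a single loop
grid$eff <- numeric(nrow(grid))
for (i in seq_len(nrow(grid))) {
  grid$eff[i] <- fit_and_score(grid$ntrees[i], grid$zeros[i], grid$mvars[i])
  cat("iteration =", i, "\n")
}
```

This flattens the code and removes the separate counter, but it is unlikely to change the run time much, since most of the time is probably spent inside randomForest() rather than in the loop itself.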
Parallel RandomForest via caret
The only other suggestion I have is to parallelize the model fitting through caret, which distributes its resampling and tuning iterations across the registered workers:
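# caret modeling framework
library(caret)
# Parallel backend
library(doParallel)
# Register a parallel backend with 5 workers
registerDoParallel(cores = 5)
# allowParallel is set through trainControl(), not passed to train() directly
ctrl = trainControl(allowParallel = TRUE)
# train.target is stored separately from train, so use the x/y interface
rf_model = train(x = train, y = as.factor(train.target), method = "rf",
prox = TRUE, trControl = ctrl)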
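An alternative that does not depend on caret, sketched here as an assumption rather than something tested on the question's data, is to run whole parameter combinations in parallel instead of parallelizing inside a single forest. The sketch uses mclapply() from the base parallel package (it forks, so it falls back to one core on Windows), and fit_one() is a hypothetical placeholder for fitting and scoring one model.

```r
library(parallel)

# Parameter grid as in the question; stt stands in for sum(train.target)
stt <- 100  # hypothetical value
grid <- expand.grid(ntrees = c(300, 500),
                    zeros = stt * c(1, 2, 3, 4, 5),
                    mvars = c(30, 50, 70, 90, 110, 130))

# Hypothetical placeholder: fit one forest and return a one-row data frame of stats
fit_one <- function(i) {
  p <- grid[i, ]
  data.frame(ntrees = p$ntrees, zeros = p$zeros, mvars = p$mvars,
             eff = p$mvars / p$ntrees)  # dummy value, not a real metric
}

# Forking is not available on Windows, so use a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each worker handles whole parameter combinations; results are bound once at the end
results <- mclapply(seq_len(nrow(grid)), fit_one, mc.cores = n_cores)
model_eff <- do.call(rbind, results)
```

Because each task here is a complete model fit, the per-task overhead that made growing a single tree slower under foreach should matter much less, though memory use grows with the number of workers.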