I use the caret package to train a randomForest object with 10x10 repeated cross-validation:
library(caret)
tc <- trainControl("repeatedcv", number=10, repeats=10, classProbs=TRUE, savePred=T)
RFFit <- train(Defect ~., data=trainingSet, method="rf", trControl=tc, preProc=c("center", "scale"))
After that, I test the random forest on testSet (new data):
RF.testSet$Prediction <- predict(RFFit, newdata=testSet)
The confusion matrix shows me that the model isn't that bad:
confusionMatrix(data=RF.testSet$Prediction, RF.testSet$Defect)
          Reference
Prediction   0   1
         0 886 179
         1  53 126

               Accuracy : 0.8135
                 95% CI : (0.7907, 0.8348)
    No Information Rate : 0.7548
    P-Value [Acc > NIR] : 4.369e-07
                  Kappa : 0.4145
I now want to test the $finalModel directly, and I think it should give the same result, but somehow I receive:
RF.testSet$Prediction <- predict(RFFit$finalModel, newdata=RF.testSet)
confusionMatrix(data=RF.testSet$Prediction, RF.testSet$Defect)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 323  66
         1 616 239

               Accuracy : 0.4518
                 95% CI : (0.4239, 0.4799)
    No Information Rate : 0.7548
    P-Value [Acc > NIR] : 1
                  Kappa : 0.0793
What am I missing?
Edit (for @topepo):
I also trained another random forest without the preProc option and got a different result:
RFFit2 <- train(Defect ~., data=trainingSet, method="rf", trControl=tc)
testSet$Prediction2 <- predict(RFFit2, newdata=testSet)
confusionMatrix(data=testSet$Prediction2, testSet$Defect)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 878 174
         1  61 131

               Accuracy : 0.8111
                 95% CI : (0.7882, 0.8325)
    No Information Rate : 0.7548
    P-Value [Acc > NIR] : 1.252e-06
                  Kappa : 0.4167
1 Answer
The difference is the pre-processing. predict.train automatically centers and scales the new data (since you asked for that), while predict.randomForest takes whatever it is given. Since the tree splits are based on the processed values, the predictions will be off.
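As a minimal sketch of what that means (assuming all predictors are numeric so the column names line up after the formula interface; pp, testPP, p1 and p2 are names introduced here purely for illustration), you can apply the stored pre-processing to the test set yourself, and the finalModel predictions should then agree with predict.train:

pp     <- RFFit$preProcess                         # centering/scaling estimated by train
testPP <- predict(pp, testSet)                     # apply that same transformation to the new data
p1 <- predict(RFFit, newdata = testSet)            # predict.train: pre-processing applied for you
p2 <- predict(RFFit$finalModel, newdata = testPP)  # predict.randomForest on already-processed data
mean(as.character(p1) == as.character(p2))         # should be at or very near 1 if the pre-processing matched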
Max
- but the RFFit object is created with the pre-processed train method... so it should return a centered and scaled object (shouldn't it?). If so, the $finalModel should also be scaled and centered. – Frank, Jan 9, 2014 at 6:30
- Yes, but according to the code above you have not applied the centering and scaling to testSet. predict.train does that but predict.randomForest does not. – topepo, Jan 9, 2014 at 12:55
- so there is no difference between using predict(RFFit$finalModel, testSet) and predict(RFFit, testSet) on the same testSet? – Frank, Jan 10, 2014 at 14:05
- predict(RFFit$finalModel, testSet) and predict(RFFit, testSet) will be different if you use the preProc option in train. If you do not, they are trained on the same data set. In other words, any pre-processing that you ask for is done to the training set prior to running randomForest. The same pre-processing is also applied to any data that you predict on (using predict(RFFit, testSet)). If you use the finalModel object, you are using predict.randomForest instead of predict.train and none of the pre-processing is done before prediction. – topepo, Jan 14, 2014 at 23:14
- The first time you predicted using the train object RFFit, and the second time you predicted using the model object, I guess. So the difference might be in passing other things along with the train object that processed your new test data somehow differently than without using the train object.
- Each time you run a train model you will get a slightly different result unless you set the random number seed before running it (see ?set.seed). The accuracy values are 0.8135 and 0.8111, which are pretty close and only due to the randomness of resampling and the model calculations.
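A short sketch of that last point (123 is just an arbitrary seed, not from the question): set the seed immediately before each call to train so both fits use identical resampling indices and become reproducible.

set.seed(123)   # arbitrary fixed seed so the CV folds are reproducible
RFFit  <- train(Defect ~ ., data = trainingSet, method = "rf",
                trControl = tc, preProc = c("center", "scale"))
set.seed(123)   # same seed again so RFFit2 gets the same resampling indices
RFFit2 <- train(Defect ~ ., data = trainingSet, method = "rf", trControl = tc)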