I use the caret package to train a randomForest object with 10x10 repeated cross-validation:
library(caret)
tc <- trainControl("repeatedcv", number=10, repeats=10, classProbs=TRUE, savePred=T)
RFFit <- train(Defect ~., data=trainingSet, method="rf", trControl=tc, preProc=c("center", "scale"))
After that, I test the random forest on testSet (new data):
RF.testSet$Prediction <- predict(RFFit, newdata=testSet)
The confusion matrix shows me that the model isn't that bad:
confusionMatrix(data=RF.testSet$Prediction, RF.testSet$Defect)
          Reference
Prediction   0   1
         0 886 179
         1  53 126

               Accuracy : 0.8135
                 95% CI : (0.7907, 0.8348)
    No Information Rate : 0.7548
    P-Value [Acc > NIR] : 4.369e-07
                  Kappa : 0.4145
I now want to test the $finalModel directly, and I think it should give the same result, but somehow I receive:
RF.testSet$Prediction <- predict(RFFit$finalModel, newdata=RF.testSet)
confusionMatrix(data=RF.testSet$Prediction, RF.testSet$Defect)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 323  66
         1 616 239

               Accuracy : 0.4518
                 95% CI : (0.4239, 0.4799)
    No Information Rate : 0.7548
    P-Value [Acc > NIR] : 1
                  Kappa : 0.0793
What am I missing?
Edit (for @topepo):
I also trained another random forest without the preProc option and got a different result:
RFFit2 <- train(Defect ~., data=trainingSet, method="rf", trControl=tc)
testSet$Prediction2 <- predict(RFFit2, newdata=testSet)
confusionMatrix(data=testSet$Prediction2, testSet$Defect)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 878 174
         1  61 131

               Accuracy : 0.8111
                 95% CI : (0.7882, 0.8325)
    No Information Rate : 0.7548
    P-Value [Acc > NIR] : 1.252e-06
                  Kappa : 0.4167
1 Answer
The difference is the pre-processing. predict.train automatically centers and scales the new data (since you asked for that), while predict.randomForest takes whatever it is given. Since the tree splits are based on the processed values, the predictions will be off.
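As a minimal sketch of what that means (assuming all predictors are numeric so the column names line up after the formula interface; pp, testPP, p1 and p2 are names introduced here purely for illustration), you can apply the stored pre-processing to the test set yourself, and the finalModel predictions should then agree with predict.train:

pp     <- RFFit$preProcess                         # centering/scaling estimated by train
testPP <- predict(pp, testSet)                     # apply that same transformation to the new data
p1 <- predict(RFFit, newdata = testSet)            # predict.train: pre-processing applied for you
p2 <- predict(RFFit$finalModel, newdata = testPP)  # predict.randomForest on already-processed data
mean(as.character(p1) == as.character(p2))         # should be at or very near 1 if the pre-processing matched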
Max
- but the RFFit object is created with the pre-processed train method... so it should return a centered and scaled object (shouldn't it?). If so, the $finalModel should also be scaled and centered. – Frank, Jan 9, 2014 at 6:30
- Yes, but according to the code above you have not applied the centering and scaling to testSet. predict.train does that but predict.randomForest does not. – topepo, Jan 9, 2014 at 12:55
- so there is no difference between using predict(RFFit$finalModel, testSet) and predict(RFFit, testSet) on the same testSet? – Frank, Jan 10, 2014 at 14:05
- predict(RFFit$finalModel, testSet) and predict(RFFit, testSet) will be different if you use the preProc option in train. If you do not, they are trained on the same data set. In other words, any pre-processing that you ask for is done to the training set prior to running randomForest. The same pre-processing is also applied to any data that you predict on (using predict(RFFit, testSet)). If you use the finalModel object, you are using predict.randomForest instead of predict.train and none of the pre-processing is done before prediction. – topepo, Jan 14, 2014 at 23:14
- The first time you predicted using the train object RFFit, and the second time you predicted using the model object, I guess. So the difference might be in passing other things along with the train object that processed your new test data somehow differently than without using the train object.
- Each time you run a train model you will get a slightly different result unless you set the random number seed before running it (see ?set.seed). The accuracy values are 0.8135 and 0.8111, which are pretty close and only due to the randomness of resampling and the model calculations.
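A short sketch of that last point (123 is just an arbitrary seed, not from the question): set the seed immediately before each call to train so both fits use identical resampling indices and become reproducible.

set.seed(123)   # arbitrary fixed seed so the CV folds are reproducible
RFFit  <- train(Defect ~ ., data = trainingSet, method = "rf",
                trControl = tc, preProc = c("center", "scale"))
set.seed(123)   # same seed again so RFFit2 gets the same resampling indices
RFFit2 <- train(Defect ~ ., data = trainingSet, method = "rf", trControl = tc)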