I would like to fit a random forest model:

library(randomForest)
rf.model <- randomForest(WIN ~ ., data = learn)

but I get this error:

Error in na.fail.default(list(WIN = c(2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, :
  missing values in object
I have a data frame learn with 16 numeric attributes, and WIN is a factor with levels 0 and 1.
-
In its current state, this question will be very difficult to answer. Can you update your question with some sample data? – Chase, Dec 3, 2011
-
@MattO'Brien It is also amusing that the quality of a question is judged by its view count rather than on its own merits, and by the answer: @Joran had no problem figuring out what was being asked and provided what appears to be a good solution to the asker's problem. – user7610, Jul 25, 2014
3 Answers
If your data has missing values, you have (basically) two choices:
- Use a different tool (rpart handles missing values nicely; see the sketch after this list).
- Impute the missing values.
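A minimal sketch of the first option, assuming the same learn data frame and WIN factor as in the question; by default rpart keeps rows with missing predictor values and handles them through surrogate splits, so no imputation is needed:

library(rpart)

# rpart's default na.action (na.rpart) keeps rows with missing predictors;
# surrogate splits route those rows down the tree.
tree.model <- rpart(WIN ~ ., data = learn, method = "class")
head(predict(tree.model, type = "class"))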
For the second option, imputation, the randomForest package (not surprisingly) has a function for doing just this: rfImpute. The documentation at ?rfImpute runs through a basic example of its use.
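For instance, a minimal sketch, assuming the learn data frame from the question and that the response WIN itself has no NAs (rfImpute needs a complete response):

library(randomForest)
set.seed(1)  # the imputation is stochastic, so fix a seed for reproducibility

# Impute the predictors using random forest proximities, then fit:
learn.imputed <- rfImpute(WIN ~ ., data = learn)
rf.model <- randomForest(WIN ~ ., data = learn.imputed)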
There's also na.roughfix, which will replace missing values with the column median/mode. You can use it by setting na.action = na.roughfix when you call randomForest.
If only a small number of cases have missing values, you might also try setting na.action = na.omit to simply drop those cases.
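In code, both options look roughly like this (again assuming the learn data frame from the question):

library(randomForest)

# Fill missing predictor values with column medians/modes at fit time:
rf.rough <- randomForest(WIN ~ ., data = learn, na.action = na.roughfix)

# Or drop the incomplete rows entirely:
rf.omit <- randomForest(WIN ~ ., data = learn, na.action = na.omit)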
And of course, this answer is a bit of a guess that your problem really is simply having missing values.
-
Do you happen to know what WIN ~ . in the first argument of the OP's call means? This is certainly not the best place to ask the question, but I was wondering if you would know. Thanks. – Amelio Vazquez-Reina, Feb 11, 2013
-
The question is about missing values in the response variable, not the predictors. – Brigitte, Jul 8, 2019
Breiman's random forest, which the randomForest package is based on, actually does handle missing values in the predictors. In the randomForest package, you can set

na.action = na.roughfix

It will start by using the median/mode for missing values, but then it grows a forest, computes proximities, and iterates, constructing a forest using the newly filled values, and so on. This is not well explained in the randomForest documentation (p. 10). It only states:
....NAs are replaced with column medians .... This is used as a starting point for imputing missing values by random forest
On Breiman's homepage you can find a little more information:
missfill= 1,2 does a fast replacement of the missing values, for the training set (if equal to 1) and a more careful replacement (if equal to 2).
mfixrep= k with missfill=2 does a slower, but usually more effective, replacement using proximities with k iterations on the training set only. (Requires nprox >0).
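In the R package, the closest analogue of missfill=2 with mfixrep=k appears to be rfImpute, whose iter argument controls the number of proximity-based refinement passes. A minimal sketch, assuming the learn data frame from the question:

library(randomForest)
set.seed(1)

# Start from the na.roughfix median/mode fill, then refine the imputed
# values using forest proximities for `iter` rounds (roughly Breiman's
# "careful replacement"):
learn.filled <- rfImpute(WIN ~ ., data = learn, iter = 5, ntree = 300)
rf.model <- randomForest(WIN ~ ., data = learn.filled)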
-
This answer is way more informative (and polite) than the accepted one. -_- – mrduhart, Dec 5, 2019
If there is a possibility that the missing values are informative, you can impute the missing values and also add binary indicator variables (with new.vars <- is.na(your_dataset)), then check whether that lowers the error. If new.vars is too large a set to add to your_dataset, you could use it alone, pick the significant variables with varImpPlot, and add those to your_dataset. You could also try adding a single variable to your_dataset that counts the number of NAs per row: new.var <- rowSums(new.vars).
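A minimal sketch of the idea, using the learn data frame from the question in place of your_dataset (column names such as na_count and the "_missing" suffix are just illustrative):

library(randomForest)
set.seed(1)

pred <- learn[, names(learn) != "WIN"]

# 0/1 indicator per predictor: was this value missing?
new.vars <- as.data.frame(is.na(pred) * 1L)
names(new.vars) <- paste0(names(pred), "_missing")

# A single count of NAs per row, as an alternative (or complement):
new.var <- rowSums(is.na(pred))

# Add the indicators and the count, impute the original predictors,
# refit, and see whether the missingness information carries importance:
learn.aug <- cbind(learn, new.vars, na_count = new.var)
learn.aug <- rfImpute(WIN ~ ., data = learn.aug)
rf.aug <- randomForest(WIN ~ ., data = learn.aug)
varImpPlot(rf.aug)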
This is not an off-topic answer: if the missing values are informative, accounting for them can correct for the increase in model error that an imperfect imputation procedure alone would cause. Missing values are informative when they arise from non-random causes, which is especially common in social-experiment settings.