I've been playing around with random forests for regression and am having difficulty working out exactly what the two measures of importance mean, and how they should be interpreted.
The importance() function gives two values for each variable: %IncMSE and IncNodePurity.
Are there simple interpretations for these two values?
For IncNodePurity in particular, is this simply the amount by which the RSS increases following the removal of that variable?
2 Answers
The first one can be 'interpreted' as follows: if a predictor is important in your current model, then assigning other values to that predictor randomly but 'realistically' (i.e. permuting the predictor's values over your dataset) should have a negative influence on prediction. That is, using the same model to predict from data that is identical except for that one variable should give worse predictions.
So you take a predictive measure (MSE) with the original dataset and then with the 'permuted' dataset, and compare them somehow. One way, particularly since we expect the original MSE to be smaller, is to take the difference. Finally, to make the values comparable across variables, they are scaled.
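The permutation idea can be sketched directly. Below is a minimal Python illustration (the toy data, the fixed stand-in "model", and all names are assumptions for illustration; the randomForest package computes this out-of-bag, per tree, which this sketch deliberately skips):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends strongly on x0 and not at all on x1.
n = 500
X = rng.normal(size=(n, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Stand-in "model" (the true relationship), so we can focus on the
# permutation logic rather than on fitting a forest.
def predict(X):
    return 3 * X[:, 0]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

base = mse(y, predict(X))

# Permutation importance: shuffle one column at a time, re-predict,
# and record how much the MSE increases relative to the baseline.
importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importances.append(mse(y, predict(Xp)) - base)

print(importances)  # large increase for x0, ~0 for x1
```

Permuting x0 destroys the information the model relies on, so the MSE jumps; permuting the irrelevant x1 changes nothing, so its importance is near zero.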
For the second one: at each split, you can calculate how much that split reduces node impurity (for regression trees, this is indeed the difference in RSS before and after the split). This is summed over all splits for that variable, over all trees.
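The per-split contribution can be made concrete. A minimal sketch in Python (toy data and function names are assumptions, not the randomForest internals; IncNodePurity is the sum of such gains over all splits on a variable, over all trees):

```python
import numpy as np

def rss(y):
    """Residual sum of squares around the node mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def split_purity_gain(x, y, threshold):
    """RSS of the parent node minus total RSS of the two children."""
    left = y[x <= threshold]
    right = y[x > threshold]
    return rss(y) - (rss(left) + rss(right))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 10.0, 10.0])

# Splitting exactly at the jump in y leaves both children constant,
# so the split removes all of the parent's RSS.
print(split_purity_gain(x, y, 2.5))  # 100.0
```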
Note: a good read is Elements of Statistical Learning by Hastie, Tibshirani and Friedman...
- Cheers, I actually have that book open now :) – dcl, Jul 4, 2011
- What does RSS mean? – DavideChicco.it, Oct 27, 2016
- RSS is Residual Sum of Squares. – Barker, Dec 6, 2016
- The link to Hastie et al. above is broken. This one works as of Feb 2025: hastie.su.domains/ElemStatLearn – InColorado, Feb 2, 2025
Random forest importance metrics as implemented in the randomForest package in R have the quirk that correlated predictors can receive low importance values.
http://bioinformatics.oxfordjournals.org/content/early/2010/04/12/bioinformatics.btq134.full.pdf
I have a modified implementation of random forests on CRAN which implements their approach of estimating empirical p-values and false discovery rates, here
- Does this explain the different output of variable importance if you use randomForest with the caret package, like caret::train(method = "rf", importance = TRUE, ...)? – Agile Bean, Dec 21, 2018
- See ?importance; there's an explanation there on what both measures mean...