
I've been playing around with random forests for regression and am having difficulty working out exactly what the two measures of importance mean, and how they should be interpreted.

The importance() function gives two values for each variable: %IncMSE and IncNodePurity. Are there simple interpretations for these two values?

For IncNodePurity in particular, is this simply the amount by which the RSS increases following the removal of that variable?
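For context, here is a minimal example of the call in question, fit on a built-in dataset (the formula and data are illustrative; the output columns are the two measures being asked about):

```r
library(randomForest)

set.seed(42)
# Fit a regression forest; importance = TRUE enables the permutation-based measure
fit <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

# One row per predictor, columns %IncMSE and IncNodePurity
importance(fit)
```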

Ferdi
asked Jul 4, 2011 at 8:25
  • Have you looked at ?importance? There's an explanation there of what both measures mean. Commented Jul 4, 2011 at 8:30
  • @Nick Sabbe, I have, and I am trying to wrap my head around them. I was wondering whether there are any nice intuitive interpretations for them. Commented Jul 4, 2011 at 8:53

2 Answers


The first one (%IncMSE) can be interpreted as follows: if a predictor is important in your current model, then assigning other values to that predictor randomly but 'realistically' (i.e., permuting that predictor's values over your dataset) should have a negative influence on prediction. In other words, using the same model to predict from data that is identical except for that one variable should give worse predictions.

So you take a predictive measure (MSE) on the original dataset and on the 'permuted' dataset, and you compare them. Since we expect the original MSE to be the smaller of the two, one natural comparison is simply the difference. Finally, to make the values comparable across variables, they are scaled.
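The permutation idea can be sketched directly. This is a simplified illustration with made-up data, not the package's internal computation (randomForest does this per tree on out-of-bag samples and scales the result):

```r
library(randomForest)

set.seed(1)
# Toy regression data: y depends on 'a' but not on 'b' (illustrative)
x <- data.frame(a = rnorm(200), b = rnorm(200))
y <- 2 * x$a + rnorm(200)
fit <- randomForest(x, y)

# MSE on the original data
mse_orig <- mean((predict(fit, x) - y)^2)

# MSE after permuting one predictor's values over the dataset
x_perm <- x
x_perm$a <- sample(x_perm$a)
mse_perm <- mean((predict(fit, x_perm) - y)^2)

# An (unscaled) permutation importance for 'a':
# large and positive when 'a' matters, near zero when it does not
mse_perm - mse_orig
```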

For the second one (IncNodePurity): at each split, you can calculate how much that split reduces node impurity (for regression trees, this is the difference in RSS before and after the split). These reductions are summed over all splits on that variable, over all trees.
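As a toy illustration of a single split's contribution (hand-made data and threshold, not the package's internal code): the purity decrease is the parent node's RSS minus the summed RSS of the two children.

```r
# RSS of a vector of responses around its own mean
rss <- function(v) sum((v - mean(v))^2)

# Responses and one predictor; split the node at threshold t on x
y <- c(1, 2, 3, 10, 11, 12)
x <- c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9)
t <- 0.5

left  <- y[x <= t]
right <- y[x >  t]

# This split's contribution to IncNodePurity for x
rss(y) - (rss(left) + rss(right))  # 121.5
```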

Note: a good read is The Elements of Statistical Learning by Hastie, Tibshirani and Friedman.

chl
answered Jul 4, 2011 at 9:17
  • Cheers, I actually have that book open now :) Commented Jul 4, 2011 at 9:24
  • What does RSS mean? Commented Oct 27, 2016 at 21:53
  • RSS is the Residual Sum of Squares. Commented Dec 6, 2016 at 21:06
  • The link to Hastie et al. above is broken. This one works as of Feb 2025: hastie.su.domains/ElemStatLearn Commented Feb 2 at 19:02

Random forest importance metrics as implemented in the randomForest package in R have a known quirk: correlated predictors receive low importance values.

http://bioinformatics.oxfordjournals.org/content/early/2010/04/12/bioinformatics.btq134.full.pdf

I have a modified implementation of random forests on CRAN that implements their approach of estimating empirical p-values and false discovery rates, here:

http://cran.r-project.org/web/packages/pRF/index.html

answered Apr 10, 2015 at 17:19
  • Does this explain the different output of variable importance if you use randomForest with the caret package, e.g. caret::train(method = "rf", importance = TRUE, ...)? Commented Dec 21, 2018 at 8:01
