I've been playing around with random forests for regression and am having difficulty working out exactly what the two measures of importance mean, and how they should be interpreted.
The importance() function gives two values for each variable: %IncMSE and IncNodePurity.
Are there simple interpretations for these two values?
For IncNodePurity in particular, is this simply the amount by which the RSS increases following the removal of that variable?
2 Answers
The first one can be 'interpreted' as follows: if a predictor is important in your current model, then assigning other values to that predictor randomly but 'realistically' (i.e. permuting the predictor's values over your dataset) should have a negative influence on prediction. That is, using the same model to predict from data that is identical except for that one variable should give worse predictions.
So you take a predictive measure (MSE) with the original dataset and then with the 'permuted' dataset, and compare them somehow. One way, particularly since we expect the original MSE to be smaller, is to take the difference. Finally, to make the values comparable across variables, they are scaled.
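The permutation idea can be sketched directly. Below is a minimal Python illustration (the toy data, the fixed stand-in "model", and all names are assumptions for illustration; the randomForest package computes this out-of-bag, per tree, which this sketch deliberately skips):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends strongly on x0 and not at all on x1.
n = 500
X = rng.normal(size=(n, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Stand-in "model" (the true relationship), so we can focus on the
# permutation logic rather than on fitting a forest.
def predict(X):
    return 3 * X[:, 0]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

base = mse(y, predict(X))

# Permutation importance: shuffle one column at a time, re-predict,
# and record how much the MSE increases relative to the baseline.
importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importances.append(mse(y, predict(Xp)) - base)

print(importances)  # large increase for x0, ~0 for x1
```

Permuting x0 destroys the information the model relies on, so the MSE jumps; permuting the irrelevant x1 changes nothing, so its importance is near zero.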
For the second one: at each split, you can calculate how much that split reduces node impurity (for regression trees, this is indeed the difference in RSS before and after the split). This is summed over all splits for that variable, over all trees.
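The per-split contribution can be made concrete. A minimal sketch in Python (toy data and function names are assumptions, not the randomForest internals; IncNodePurity is the sum of such gains over all splits on a variable, over all trees):

```python
import numpy as np

def rss(y):
    """Residual sum of squares around the node mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def split_purity_gain(x, y, threshold):
    """RSS of the parent node minus total RSS of the two children."""
    left = y[x <= threshold]
    right = y[x > threshold]
    return rss(y) - (rss(left) + rss(right))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 10.0, 10.0])

# Splitting exactly at the jump in y leaves both children constant,
# so the split removes all of the parent's RSS.
print(split_purity_gain(x, y, 2.5))  # 100.0
```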
Note: a good read is Elements of Statistical Learning by Hastie, Tibshirani and Friedman...
- Cheers, I actually have that book open now :) – dcl, Jul 4, 2011
- What does RSS mean? – DavideChicco.it, Oct 27, 2016
- RSS is Residual Sum of Squares. – Barker, Dec 6, 2016
- The link to Hastie et al. above is broken. This one works as of Feb 2025: hastie.su.domains/ElemStatLearn – InColorado, Feb 2, 2025
Random forest importance metrics as implemented in the randomForest package in R have the quirk that correlated predictors can receive low importance values.
http://bioinformatics.oxfordjournals.org/content/early/2010/04/12/bioinformatics.btq134.full.pdf
I have a modified implementation of random forests on CRAN which implements their approach of estimating empirical p-values and false discovery rates, here
- Does this explain the different output of variable importance if you use randomForest with the caret package, like caret::train(method = "rf", importance = TRUE, ...)? – Agile Bean, Dec 21, 2018
- See ?importance; there's an explanation there on what both measures mean...