Permutation Importance vs Random Forest Feature Importance (MDI)
In this example, we will compare the impurity-based feature importance of
RandomForestClassifier with the
permutation importance on the Titanic dataset using
permutation_importance. We will show that
impurity-based feature importance can inflate the importance of numerical
features.
Furthermore, the impurity-based feature importance of random forests suffers
from being computed on statistics derived from the training dataset: the
importances can be high even for features that are not predictive of the target
variable, as long as the model has the capacity to use them to overfit.
This example shows how to use Permutation Importances as an alternative that
can mitigate those limitations.
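The cells that load the data and fit the model did not survive the extraction of this page. Below is a minimal sketch of that setup, consistent with the pipeline parameters displayed further down this page (step names, encoder settings, and column lists); the exact distributions used to generate the random_cat and random_num features are assumptions.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Load the Titanic survival dataset from OpenML.
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

# Add two non-predictive features: a random categorical variable and a
# random numerical variable (their exact distributions are an assumption).
rng = np.random.RandomState(seed=42)
X["random_cat"] = rng.randint(3, size=X.shape[0])
X["random_num"] = rng.randn(X.shape[0])

categorical_columns = ["pclass", "sex", "embarked", "random_cat"]
numerical_columns = ["age", "sibsp", "parch", "fare", "random_num"]
X = X[categorical_columns + numerical_columns]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Preprocessing mirrors the parameters shown in the pipeline representation
# below: ordinal-encode the categorical columns, mean-impute the numerical.
categorical_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1, encoded_missing_value=-1
)
numerical_pipe = SimpleImputer(strategy="mean")

preprocessing = ColumnTransformer(
    [
        ("cat", categorical_encoder, categorical_columns),
        ("num", numerical_pipe, numerical_columns),
    ],
    verbose_feature_names_out=False,
)

rf = Pipeline(
    [
        ("preprocess", preprocessing),
        ("classifier", RandomForestClassifier(random_state=42)),
    ]
)
rf.fit(X_train, y_train)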
Before inspecting the feature importances, it is important to check that
the model predictive performance is high enough. Indeed, there would be little
interest in inspecting the important features of a non-predictive model.
print(f"RF train accuracy: {rf.score(X_train,y_train):.3f}")print(f"RF test accuracy: {rf.score(X_test,y_test):.3f}")
RF train accuracy: 1.000
RF test accuracy: 0.814
Here, one can observe that the train accuracy is very high (the forest model
has enough capacity to completely memorize the training set) but it can still
generalize well enough to the test set thanks to the built-in bagging of
random forests.
It might be possible to trade some accuracy on the training set for a
slightly better accuracy on the test set by limiting the capacity of the
trees (for instance by setting min_samples_leaf=5 or
min_samples_leaf=10) so as to limit overfitting while not introducing too
much underfitting.
However, let us keep our high-capacity random forest model for now, so that
we can illustrate some pitfalls of feature importance on variables with many
unique values.
Tree’s Feature Importance from Mean Decrease in Impurity (MDI)
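The cell that computes and plots the MDI importances is not shown on this page. The following is a minimal sketch, assuming rf is the fitted pipeline from above, with the random forest as its last step:

import pandas as pd

# MDI importances live on the fitted forest; map them back to the
# feature names produced by the preprocessing steps.
feature_names = rf[:-1].get_feature_names_out()
mdi_importances = pd.Series(
    rf[-1].feature_importances_, index=feature_names
).sort_values(ascending=True)
ax = mdi_importances.plot.barh()
ax.set_title("Random Forest Feature Importances (MDI)")
ax.figure.tight_layout()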
The impurity-based feature importance ranks the numerical features as the
most important ones. As a result, the non-predictive random_num
variable is ranked as one of the most important features!
This problem stems from two limitations of impurity-based feature
importances:
impurity-based importances are biased towards high-cardinality features;
impurity-based importances are computed on training set statistics and
therefore do not reflect the ability of a feature to be useful for making
predictions that generalize to the test set (when the model has enough
capacity).
The bias towards high-cardinality features explains why random_num has
a much larger importance than random_cat, while we would
expect both random features to have null importance.
The fact that we use training set statistics explains why both the
random_num and random_cat features have non-null importance.
As an alternative, the permutation importances of rf are computed on a
held-out test set. This shows that the low-cardinality categorical features
sex and pclass are the most important features. Indeed, permuting the
values of these features leads to the largest decrease in the accuracy score
of the model on the test set.
Also note that both random features have very low importances (close to 0),
as expected.
from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
)

sorted_importances_idx = result.importances_mean.argsort()
importances = pd.DataFrame(
    result.importances[sorted_importances_idx].T,
    columns=X.columns[sorted_importances_idx],
)
ax = importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances (test set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in accuracy score")
ax.figure.tight_layout()
[Figure: box plot of Permutation Importances (test set)]
It is also possible to compute the permutation importances on the training
set. This reveals that random_num and random_cat get a significantly
higher importance ranking than when computed on the test set. The difference
between those two plots confirms that the RF model has enough
capacity to use the random numerical and categorical features to overfit.
result = permutation_importance(
    rf, X_train, y_train, n_repeats=10, random_state=42, n_jobs=2
)

sorted_importances_idx = result.importances_mean.argsort()
importances = pd.DataFrame(
    result.importances[sorted_importances_idx].T,
    columns=X.columns[sorted_importances_idx],
)
ax = importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances (train set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in accuracy score")
ax.figure.tight_layout()
[Figure: box plot of Permutation Importances (train set)]
We can further retry the experiment by limiting the capacity of the trees
to overfit, setting min_samples_leaf to 20 data points.
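The refitting cell is not shown on this page; a one-line sketch, assuming the random forest step is named "classifier" as in the pipeline representation below:

# Lower the tree capacity and refit the whole pipeline.
rf.set_params(classifier__min_samples_leaf=20).fit(X_train, y_train)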
[Pipeline HTML representation: ColumnTransformer with an OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1, encoded_missing_value=-1) on ['pclass', 'sex', 'embarked', 'random_cat'] and a SimpleImputer(strategy="mean") on ['age', 'sibsp', 'parch', 'fare', 'random_num'], followed by RandomForestClassifier(min_samples_leaf=20, random_state=42)]
Looking at the accuracy scores on the training and testing sets, we observe
that the two metrics are now very similar. Therefore, our model is no longer
overfitting. We can then check the permutation importances with this new model.
print(f"RF train accuracy: {rf.score(X_train,y_train):.3f}")print(f"RF test accuracy: {rf.score(X_test,y_test):.3f}")
for name, importances in zip(["train", "test"], [train_importances, test_importances]):
    ax = importances.plot.box(vert=False, whis=10)
    ax.set_title(f"Permutation Importances ({name} set)")
    ax.set_xlabel("Decrease in accuracy score")
    ax.axvline(x=0, color="k", linestyle="--")
    ax.figure.tight_layout()
[Figures: box plots of Permutation Importances (train set) and Permutation Importances (test set)]
Now we can observe that, on both sets, the random_num and random_cat
features have lower importance than in the overfitting random forest.
However, the conclusions regarding the importance of the other features are
still valid.
Total running time of the script: (0 minutes 4.558 seconds)