Permutation Importance vs Random Forest Feature Importance (MDI)
In this example, we will compare the impurity-based feature importance of
RandomForestClassifier with the permutation importance on the Titanic dataset
using permutation_importance. We will show that the impurity-based feature
importance can inflate the importance of numerical features.
Furthermore, the impurity-based feature importance of random forests suffers from being computed on statistics derived from the training dataset: the importances can be high even for features that are not predictive of the target variable, as long as the model has the capacity to use them to overfit.
This example shows how to use Permutation Importances as an alternative that can mitigate those limitations.
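For reference, computing permutation importance in scikit-learn is a single call to permutation_importance from sklearn.inspection. The following is only a minimal sketch: model, X_val and y_val are placeholder names for a fitted estimator and held-out data, not objects defined in this example.

from sklearn.inspection import permutation_importance

# Permute each feature n_repeats times on held-out data and record the
# resulting drop in score (placeholder names: model, X_val, y_val).
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
print(result.importances_mean)
print(result.importances_std)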
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
Data Loading and Feature Engineering
Let’s use pandas to load a copy of the Titanic dataset. The following shows how to apply separate preprocessing on numerical and categorical features.
We further include two random variables that are not correlated in any way with the target variable (survived):

- random_num is a high cardinality numerical variable (as many unique values as records);
- random_cat is a low cardinality categorical variable (3 possible values).

A quick check after the loading code below confirms these cardinalities.
import numpy as np

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
rng = np.random.RandomState(seed=42)
X["random_cat"] = rng.randint(3, size=X.shape[0])
X["random_num"] = rng.randn(X.shape[0])

categorical_columns = ["pclass", "sex", "embarked", "random_cat"]
numerical_columns = ["age", "sibsp", "parch", "fare", "random_num"]

X = X[categorical_columns + numerical_columns]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
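As the promised sanity check (a short sketch added for illustration, not part of the original example), we can confirm the cardinalities of the two random columns:

# Illustrative check (assumes X from the cell above):
# random_num should have one unique value per record, random_cat only 3.
print(X["random_num"].nunique() == X.shape[0])  # True
print(X["random_cat"].nunique())  # 3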
We define a predictive model based on a random forest. To that end, we apply the following preprocessing steps:

- use OrdinalEncoder to encode the categorical features;
- use SimpleImputer to fill missing values for numerical features using a mean strategy.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

categorical_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1, encoded_missing_value=-1
)
numerical_pipe = SimpleImputer(strategy="mean")
preprocessing = ColumnTransformer(
    [
        ("cat", categorical_encoder, categorical_columns),
        ("num", numerical_pipe, numerical_columns),
    ],
    verbose_feature_names_out=False,
)

rf = Pipeline(
    [
        ("preprocess", preprocessing),
        ("classifier", RandomForestClassifier(random_state=42)),
    ]
)
rf.fit(X_train, y_train)
Pipeline(steps=[('preprocess',
ColumnTransformer(transformers=[('cat',
OrdinalEncoder(encoded_missing_value=-1,
handle_unknown='use_encoded_value',
unknown_value=-1),
['pclass', 'sex', 'embarked',
'random_cat']),
('num', SimpleImputer(),
['age', 'sibsp', 'parch',
'fare', 'random_num'])],
verbose_feature_names_out=False)),
                ('classifier', RandomForestClassifier(random_state=42))])
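Before inspecting any importances, it is worth checking that the fitted pipeline is accurate enough for its importances to be meaningful. A minimal sketch, reusing rf and the train/test splits defined above (the exact scores depend on the split):

# Accuracy of the fitted pipeline on the training and held-out sets
# (assumes rf, X_train, X_test, y_train, y_test from the cells above).
print(f"RF train accuracy: {rf.score(X_train, y_train):.3f}")
print(f"RF test accuracy: {rf.score(X_test, y_test):.3f}")

A large gap between the two scores would suggest the forest is overfitting, which is exactly the regime in which the impurity-based importances of the random columns can be misleading.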