Permutation Importance vs Random Forest Feature Importance (MDI)
In this example, we will compare the impurity-based feature importance of
RandomForestClassifier with the permutation importance on the Titanic dataset
using permutation_importance. We will show that the impurity-based feature
importance can inflate the importance of numerical features.
Furthermore, the impurity-based feature importance of random forests suffers from being computed on statistics derived from the training dataset: the importances can be high even for features that are not predictive of the target variable, as long as the model has the capacity to use them to overfit.
This example shows how to use Permutation Importances as an alternative that can mitigate those limitations.
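For reference, computing permutation importance in scikit-learn is a single call to permutation_importance from sklearn.inspection. The following is only a minimal sketch: model, X_val and y_val are placeholder names for a fitted estimator and held-out data, not objects defined in this example.

from sklearn.inspection import permutation_importance

# Permute each feature n_repeats times on held-out data and record the
# resulting drop in score (placeholder names: model, X_val, y_val).
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
print(result.importances_mean)
print(result.importances_std)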
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
Data Loading and Feature Engineering
Let’s use pandas to load a copy of the Titanic dataset. The following shows how to apply separate preprocessing on numerical and categorical features.
We further include two random variables that are not correlated in any way with the target variable (survived):

- random_num is a high cardinality numerical variable (as many unique values as records);
- random_cat is a low cardinality categorical variable (3 possible values).

A quick check after the loading code below confirms these cardinalities.
import numpy as np

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
rng = np.random.RandomState(seed=42)
X["random_cat"] = rng.randint(3, size=X.shape[0])
X["random_num"] = rng.randn(X.shape[0])

categorical_columns = ["pclass", "sex", "embarked", "random_cat"]
numerical_columns = ["age", "sibsp", "parch", "fare", "random_num"]

X = X[categorical_columns + numerical_columns]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
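As the promised sanity check (a short sketch added for illustration, not part of the original example), we can confirm the cardinalities of the two random columns:

# Illustrative check (assumes X from the cell above):
# random_num should have one unique value per record, random_cat only 3.
print(X["random_num"].nunique() == X.shape[0])  # True
print(X["random_cat"].nunique())  # 3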
We define a predictive model based on a random forest. To that end, we apply the following preprocessing steps:

- use OrdinalEncoder to encode the categorical features;
- use SimpleImputer to fill missing values for numerical features using a mean strategy.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

categorical_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1, encoded_missing_value=-1
)
numerical_pipe = SimpleImputer(strategy="mean")
preprocessing = ColumnTransformer(
    [
        ("cat", categorical_encoder, categorical_columns),
        ("num", numerical_pipe, numerical_columns),
    ],
    verbose_feature_names_out=False,
)

rf = Pipeline(
    [
        ("preprocess", preprocessing),
        ("classifier", RandomForestClassifier(random_state=42)),
    ]
)
rf.fit(X_train, y_train)
Pipeline(steps=[('preprocess',
ColumnTransformer(transformers=[('cat',
OrdinalEncoder(encoded_missing_value=-1,
handle_unknown='use_encoded_value',
unknown_value=-1),
['pclass', 'sex', 'embarked',
'random_cat']),
('num', SimpleImputer(),
['age', 'sibsp', 'parch',
'fare', 'random_num'])],
verbose_feature_names_out=False)),
                ('classifier', RandomForestClassifier(random_state=42))])
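Before inspecting any importances, it is worth checking that the fitted pipeline is accurate enough for its importances to be meaningful. A minimal sketch, reusing rf and the train/test splits defined above (the exact scores depend on the split):

# Accuracy of the fitted pipeline on the training and held-out sets
# (assumes rf, X_train, X_test, y_train, y_test from the cells above).
print(f"RF train accuracy: {rf.score(X_train, y_train):.3f}")
print(f"RF test accuracy: {rf.score(X_test, y_test):.3f}")

A large gap between the two scores would suggest the forest is overfitting, which is exactly the regime in which the impurity-based importances of the random columns can be misleading.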