1-way partial dependence with different models#
In this section, we will compute 1-way partial dependence with two different
machine-learning models: (i) a multi-layer perceptron and (ii) a
gradient-boosting model. With these two models, we illustrate how to compute and
interpret both the partial dependence plot (PDP), for numerical as well as
categorical features, and the individual conditional expectation (ICE).
Multi-layer perceptron#
Let’s fit an MLPRegressor and compute
single-variable partial dependence plots.
from time import time

from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

print("Training MLPRegressor...")
tic = time()
mlp_model = make_pipeline(
    mlp_preprocessor,
    MLPRegressor(
        hidden_layer_sizes=(30, 15),
        learning_rate_init=0.01,
        early_stopping=True,
        random_state=0,
    ),
)
mlp_model.fit(X_train, y_train)
print(f"done in {time() - tic:.3f}s")
print(f"Test R2 score: {mlp_model.score(X_test, y_test):.2f}")
Training MLPRegressor...
done in 0.647s
Test R2 score: 0.61
We configured a pipeline using the preprocessor that we created specifically for the
neural network and tuned the neural network size and learning rate to get a reasonable
compromise between training time and predictive performance on a test set.
Importantly, this tabular dataset has features with very different dynamic
ranges. Neural networks are very sensitive to such differences in scale, and
forgetting to preprocess the numeric features would lead to a very poor model.
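The mlp_preprocessor used above is defined earlier in the full example. As a rough, hypothetical sketch of what such a preprocessor could look like (assuming lists of column names numerical_features and categorical_features; the exact transformers in the original example may differ):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer

# Hypothetical reconstruction: rescale the numeric columns and one-hot encode
# the categorical ones so that all inputs reach the MLP on comparable scales.
mlp_preprocessor = ColumnTransformer(
    transformers=[
        ("num", QuantileTransformer(n_quantiles=100), numerical_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)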
It would be possible to get even higher predictive performance with a larger
neural network but the training would also be significantly more expensive.
Note that it is important to check that the model is accurate enough on a
test set before plotting the partial dependence since there would be little
use in explaining the impact of a given feature on the prediction function of
a model with poor predictive performance. In this regard, our MLP model works
reasonably well.
We will plot the averaged partial dependence.
import matplotlib.pyplot as plt

from sklearn.inspection import PartialDependenceDisplay

common_params = {
    "subsample": 50,
    "n_jobs": 2,
    "grid_resolution": 20,
    "random_state": 0,
}
print("Computing partial dependence plots...")
features_info = {
# features of interest
"features": ["temp", "humidity", "windspeed", "season", "weather", "hour"],
# type of partial dependence plot
"kind": "average",
# information regarding categorical features
"categorical_features": categorical_features,
}
tic = time ()
_, ax = plt.subplots (ncols=3, nrows=2, figsize=(9, 8), constrained_layout=True)
display = PartialDependenceDisplay.from_estimator (
mlp_model,
X_train,
**features_info,
ax=ax,
**common_params,
)
print(f"done in {time ()-tic:.3f}s")
_ = display.figure_.suptitle(
(
"Partial dependence of the number of bike rentals\n"
"for the bike rental dataset with an MLPRegressor"
),
fontsize=16,
)
[Figure: Partial dependence of the number of bike rentals for the bike rental dataset with an MLPRegressor]
Computing partial dependence plots...
done in 0.617s
Gradient boosting#
Let’s now fit a HistGradientBoostingRegressor and
compute the partial dependence on the same features. We also use the
specific preprocessor we created for this model.
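As with the MLP, hgbdt_preprocessor is defined earlier in the full example. A plausible sketch (an assumption, not necessarily the exact original code) would ordinal-encode the categorical columns and pass the numerical ones through, keeping the original column names so that the estimator can locate its categorical_features:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical reconstruction: ordinal-encode categorical columns, pass
# numerical columns through unchanged, and keep the original column names
# (pandas output) so HistGradientBoostingRegressor can find them by name.
hgbdt_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OrdinalEncoder(), categorical_features),
        ("num", "passthrough", numerical_features),
    ],
    verbose_feature_names_out=False,
).set_output(transform="pandas")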
from sklearn.ensemble import HistGradientBoostingRegressor

print("Training HistGradientBoostingRegressor...")
tic = time()
hgbdt_model = make_pipeline(
    hgbdt_preprocessor,
    HistGradientBoostingRegressor(
        categorical_features=categorical_features,
        random_state=0,
        max_iter=50,
    ),
)
hgbdt_model.fit(X_train, y_train)
print(f"done in {time() - tic:.3f}s")
print(f"Test R2 score: {hgbdt_model.score(X_test, y_test):.2f}")
Training HistGradientBoostingRegressor...
done in 0.132s
Test R2 score: 0.62
Here, we used mostly default hyperparameters for the gradient boosting model
(only capping max_iter) and no preprocessing of the numerical features, as
tree-based models are naturally robust to monotonic transformations of
numerical features.
Note that on this tabular dataset, Gradient Boosting Machines are both
significantly faster to train and more accurate than neural networks. It is
also significantly cheaper to tune their hyperparameters (the defaults tend
to work well while this is not often the case for neural networks).
We will plot the partial dependence for some of the numerical and categorical
features.
print("Computing partial dependence plots...")
tic = time ()
_, ax = plt.subplots (ncols=3, nrows=2, figsize=(9, 8), constrained_layout=True)
display = PartialDependenceDisplay.from_estimator (
hgbdt_model,
X_train,
**features_info,
ax=ax,
**common_params,
)
print(f"done in {time ()-tic:.3f}s")
_ = display.figure_.suptitle(
(
"Partial dependence of the number of bike rentals\n"
"for the bike rental dataset with a gradient boosting"
),
fontsize=16,
)
[Figure: Partial dependence of the number of bike rentals for the bike rental dataset with a gradient boosting model]
Computing partial dependence plots...
done in 1.084s
Analysis of the plots#
We will first look at the PDPs for the numerical features. For both models, the
general trend of the PDP for temperature is that the number of bike rentals
increases with temperature. The humidity feature shows the opposite trend: the
number of bike rentals decreases as humidity increases. The same holds for the
wind speed feature: the number of bike rentals decreases as wind speed increases
for both models. We also observe that the MLPRegressor produces much
smoother predictions than the HistGradientBoostingRegressor.
Now, we will look at the partial dependence plots for the categorical features.
We observe that the spring season is the lowest bar for the season feature. With the
weather feature, the rain category is the lowest bar. Regarding the hour feature,
we see two peaks, around 7 a.m. and 6 p.m. These findings are in line with the
observations we made earlier on the dataset.
However, it is worth noting that we are creating potentially meaningless
synthetic samples if features are correlated.
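As a quick sanity check of this caveat (assuming, as in the rest of this example, that X_train is a pandas DataFrame), one could inspect the correlations between the numerical features of interest:

# Strongly correlated features make the synthetic grid points used by the
# PDP less realistic, so it is worth checking the correlations first.
print(X_train[["temp", "humidity", "windspeed"]].corr())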
ICE vs. PDP#
The PDP is an average of the marginal effects of the features: we average the
response over all samples of the provided set, so some effects could be hidden. In
this regard, it is possible to plot each individual response. This representation is
called an individual conditional expectation (ICE) plot. In the plot below, we show
50 randomly selected ICE curves for the temperature and humidity features.
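To make the link between the two representations concrete, the sketch below (reusing the models and data defined above) computes the raw ICE curves with partial_dependence and checks that averaging them over the samples recovers the PDP:

import numpy as np

from sklearn.inspection import partial_dependence

# Compute both the per-sample ICE curves and their average (the PDP)
# for the "temp" feature.
pd_results = partial_dependence(
    hgbdt_model, X_train, features=["temp"], kind="both", grid_resolution=20
)
ice_curves = pd_results["individual"][0]  # shape: (n_samples, n_grid_points)
pdp_curve = pd_results["average"][0]  # shape: (n_grid_points,)

# The PDP is simply the average of the ICE curves over the samples.
assert np.allclose(ice_curves.mean(axis=0), pdp_curve)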
print("Computing partial dependence plots and individual conditional expectation...")
tic = time ()
_, ax = plt.subplots (ncols=2, figsize=(6, 4), sharey=True, constrained_layout=True)
features_info = {
"features": ["temp", "humidity"],
"kind": "both",
"centered": True,
}
display = PartialDependenceDisplay.from_estimator (
hgbdt_model,
X_train,
**features_info,
ax=ax,
**common_params,
)
print(f"done in {time ()-tic:.3f}s")
_ = display.figure_.suptitle("ICE and PDP representations", fontsize=16)
[Figure: ICE and PDP representations]
Computing partial dependence plots and individual conditional expectation...
done in 0.459s
We see that the ICE curves for the temperature feature give us some additional
information: some of the ICE lines are flat while others show a decrease of the
dependence for temperatures above 35 degrees Celsius. We observe a similar pattern
for the humidity feature: some of the ICE lines show a sharp decrease when the
humidity is above 80%.
Not all ICE lines are parallel; this indicates that the model finds
interactions between features. We can repeat the experiment by constraining the
gradient boosting model not to use any interactions between features, via the
parameter interaction_cst:
from sklearn.base import clone

interaction_cst = [[i] for i in range(X_train.shape[1])]
hgbdt_model_without_interactions = (
    clone(hgbdt_model)
    .set_params(histgradientboostingregressor__interaction_cst=interaction_cst)
    .fit(X_train, y_train)
)
print(f"Test R2 score: {hgbdt_model_without_interactions.score(X_test, y_test):.2f}")

_, ax = plt.subplots(ncols=2, figsize=(6, 4), sharey=True, constrained_layout=True)
features_info["centered"] = False
display = PartialDependenceDisplay.from_estimator(
    hgbdt_model_without_interactions,
    X_train,
    **features_info,
    ax=ax,
    **common_params,
)
_ = display.figure_.suptitle("ICE and PDP representations", fontsize=16)
[Figure: ICE and PDP representations]
2D interaction plots#
PDPs with two features of interest enable us to visualize interactions between them.
However, ICEs cannot be plotted and interpreted as easily in this setting. We will
show the representation available in
from_estimator, which is a 2D
heatmap.
print("Computing partial dependence plots...")
features_info = {
"features": ["temp", "humidity", ("temp", "humidity")],
"kind": "average",
}
_, ax = plt.subplots (ncols=3, figsize=(10, 4), constrained_layout=True)
tic = time ()
display = PartialDependenceDisplay.from_estimator (
hgbdt_model,
X_train,
**features_info,
ax=ax,
**common_params,
)
print(f"done in {time ()-tic:.3f}s")
_ = display.figure_.suptitle(
"1-way vs 2-way of numerical PDP using gradient boosting", fontsize=16
)
[Figure: 1-way vs 2-way of numerical PDP using gradient boosting]
Computing partial dependence plots...
done in 7.514s
The two-way partial dependence plot shows the dependence of the number of bike rentals
on joint values of temperature and humidity.
We clearly see an interaction between the two features. For a temperature higher than
20 degrees Celsius, the humidity has an impact on the number of bike rentals
that seems independent of the temperature.
On the other hand, for temperatures lower than 20 degrees Celsius, both the
temperature and the humidity continuously impact the number of bike rentals.
Furthermore, the slope of the impact ridge around the 20 degrees Celsius
threshold depends strongly on the humidity level: the ridge is steep under
dry conditions but much smoother under wetter conditions, above 70% humidity.
We now contrast those results with the same plots computed for the model
constrained to learn a prediction function that does not depend on such
non-linear feature interactions.
print("Computing partial dependence plots...")
features_info = {
"features": ["temp", "humidity", ("temp", "humidity")],
"kind": "average",
}
_, ax = plt.subplots (ncols=3, figsize=(10, 4), constrained_layout=True)
tic = time ()
display = PartialDependenceDisplay.from_estimator (
hgbdt_model_without_interactions,
X_train,
**features_info,
ax=ax,
**common_params,
)
print(f"done in {time ()-tic:.3f}s")
_ = display.figure_.suptitle(
"1-way vs 2-way of numerical PDP using gradient boosting", fontsize=16
)
[Figure: 1-way vs 2-way of numerical PDP using gradient boosting]
Computing partial dependence plots...
done in 6.893s
The 1D partial dependence plots for the model constrained not to model feature
interactions show local spikes for each feature individually, in particular for
the "humidity" feature. Those spikes might reflect a degraded behavior
of the model that attempts to somehow compensate for the forbidden interactions
by overfitting particular training points. Note that the predictive performance
of this model as measured on the test set is significantly worse than that of
the original, unconstrained model.
Also note that the number of local spikes visible on those plots depends on
the grid resolution of the PD plot itself.
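For instance, one could recompute the 1D partial dependence of the "humidity" feature with a finer grid to make more of these local spikes visible (a quick sketch reusing the names defined above):

# Recompute the humidity PDP of the constrained model with a finer grid than
# the grid_resolution=20 used in common_params to reveal more local spikes.
_, ax = plt.subplots(figsize=(4, 3), constrained_layout=True)
_ = PartialDependenceDisplay.from_estimator(
    hgbdt_model_without_interactions,
    X_train,
    features=["humidity"],
    kind="average",
    grid_resolution=100,
    ax=ax,
)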
Those local spikes result in a noisily gridded 2D PD plot. It is quite
challenging to tell whether or not there is any interaction between those
features because of the high-frequency oscillations in the humidity feature.
However, it can clearly be seen that the simple interaction effect observed when
the temperature crosses the 20 degrees boundary is no longer visible for this
model.
The partial dependence between categorical features provides a discrete
representation that can be shown as a heatmap. For instance, the interaction between
the season, the weather, and the target would be as follows:
print("Computing partial dependence plots...")
features_info = {
"features": ["season", "weather", ("season", "weather")],
"kind": "average",
"categorical_features": categorical_features,
}
_, ax = plt.subplots (ncols=3, figsize=(14, 6), constrained_layout=True)
tic = time ()
display = PartialDependenceDisplay.from_estimator (
hgbdt_model,
X_train,
**features_info,
ax=ax,
**common_params,
)
print(f"done in {time ()-tic:.3f}s")
_ = display.figure_.suptitle(
"1-way vs 2-way PDP of categorical features using gradient boosting", fontsize=16
)
[Figure: 1-way vs 2-way PDP of categorical features using gradient boosting]
Computing partial dependence plots...
done in 0.333s
3D representation#
Let’s make the same partial dependence plot for the interaction of the two features,
this time in 3 dimensions.
# unused but required import for doing 3d projections with matplotlib < 3.2
import mpl_toolkits.mplot3d  # noqa: F401
import numpy as np

from sklearn.inspection import partial_dependence

fig = plt.figure(figsize=(5.5, 5))

features = ("temp", "humidity")
pdp = partial_dependence(
    hgbdt_model, X_train, features=features, kind="average", grid_resolution=10
)
XX, YY = np.meshgrid(pdp["grid_values"][0], pdp["grid_values"][1])
Z = pdp.average[0].T
ax = fig.add_subplot(projection="3d")
fig.add_axes(ax)

surf = ax.plot_surface(XX, YY, Z, rstride=1, cstride=1, cmap=plt.cm.BuPu, edgecolor="k")
ax.set_xlabel(features[0])
ax.set_ylabel(features[1])
fig.suptitle(
    "PD of number of bike rentals on\nthe temperature and humidity GBDT model",
    fontsize=16,
)
# pretty init view
ax.view_init(elev=22, azim=122)
clb = plt.colorbar(surf, pad=0.08, shrink=0.6, aspect=10)
clb.ax.set_title("Partial\ndependence")
plt.show()
[Figure: PD of number of bike rentals on the temperature and humidity (GBDT model)]
Custom Inspection Points#
None of the examples so far specify which points are evaluated to create the
partial dependence plots. By default, we use percentiles defined by the input dataset.
In some cases it can be helpful to specify the exact points at which you would like the
model to be evaluated, for instance to test the model behavior on
out-of-distribution data or to compare two models that were fit on slightly different
data. The custom_values parameter allows the user to pass in the values at which they
want the model to be evaluated; it overrides the grid_resolution and
percentiles parameters. Let’s return to our gradient boosting example above,
but with custom values.
print("Computing partial dependence plots with custom evaluation values...")
tic = time ()
_, ax = plt.subplots (ncols=2, figsize=(6, 4), sharey=True, constrained_layout=True)
features_info = {
"features": ["temp", "humidity"],
"kind": "both",
}
display = PartialDependenceDisplay.from_estimator (
hgbdt_model,
X_train,
**features_info,
ax=ax,
**common_params,
# we set custom values for temp feature -
# all other features are evaluated based on the data
custom_values={"temp": np.linspace (0, 40, 10)},
)
print(f"done in {time ()-tic:.3f}s")
_ = display.figure_.suptitle(
(
"Partial dependence of the number of bike rentals\n"
"for the bike rental dataset with a gradient boosting"
),
fontsize=16,
)
[Figure: Partial dependence of the number of bike rentals for the bike rental dataset with a gradient boosting model]
Computing partial dependence plots with custom evaluation values...
done in 0.463s
Total running time of the script: (0 minutes 23.086 seconds)