Introducing the set_output API
This example demonstrates the set_output API, which configures transformers to
output pandas DataFrames. set_output can be configured per estimator by calling
the set_output method, or globally by setting set_config(transform_output="pandas").
For details, see SLEP018.
First, we load the iris dataset as a DataFrame to demonstrate the set_output API.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(as_frame=True, return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    X_train.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 60 | 5.0 | 2.0 | 3.5 | 1.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 |
| 93 | 5.0 | 2.3 | 3.3 | 1.0 |
| 106 | 4.9 | 2.5 | 4.5 | 1.7 |
To configure an estimator such as preprocessing.StandardScaler to return
DataFrames, call set_output. This feature requires pandas to be installed.

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().set_output(transform="pandas")
    scaler.fit(X_train)
    X_test_scaled = scaler.transform(X_test)
    X_test_scaled.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 39 | -0.894264 | 0.798301 | -1.271411 | -1.327605 |
| 12 | -1.244466 | -0.086944 | -1.327407 | -1.459074 |
| 48 | -0.660797 | 1.462234 | -1.271411 | -1.327605 |
| 23 | -0.894264 | 0.576989 | -1.159419 | -0.933197 |
| 81 | -0.427329 | -1.414810 | -0.039497 | -0.275851 |
set_output can be called after fit to configure transform after the fact.

    scaler2 = StandardScaler()

    scaler2.fit(X_train)
    X_test_np = scaler2.transform(X_test)
    print(f"Default output type: {type(X_test_np).__name__}")

    scaler2.set_output(transform="pandas")
    X_test_df = scaler2.transform(X_test)
    print(f"Configured pandas output type: {type(X_test_df).__name__}")

Output:

    Default output type: ndarray
    Configured pandas output type: DataFrame

In a pipeline.Pipeline, set_output configures all steps to output
DataFrames.

    from sklearn.feature_selection import SelectPercentile
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    clf = make_pipeline(
        StandardScaler(), SelectPercentile(percentile=75), LogisticRegression()
    )
    clf.set_output(transform="pandas")
    clf.fit(X_train, y_train)

    Pipeline(steps=[('standardscaler', StandardScaler()),
                    ('selectpercentile', SelectPercentile(percentile=75)),
                    ('logisticregression', LogisticRegression())])
