Training Pipeline to Create Inference Pipeline with Train/Test/Val Split · aws/amazon-sagemaker-examples · Discussion #3848

Rim921
Mar 14, 2023

Hi,

I am trying to modify the simple preprocess->train->evaluate->register->transform pipeline example,
from https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb,
so that the created model is an inference pipeline that includes the preprocessing step.
The intention is to run a batch transform on raw data that runs the preprocessing and then the XGBoost inference. To do that I need to save the fitted scikit-learn preprocessing pipeline (PP), but I also want to keep the train/test/val split functionality.

From what I understand, I need to register a PipelineModel, which only supports Estimators. If I change the preprocessing step from a ProcessingStep to a TrainingStep, I could easily save the PP, but I wouldn't be able to save the train/test/val datasets. I thought of two different solutions:

1. Separate the preprocessing step into a processing step and a training step, where the former performs the train/test/val split and the latter fits and saves the PP. This is straightforward, but the datasets would still need to be transformed, so I would have to add a transform step before the XGBoost training step. That seems overcomplicated.
2. Keep the preprocessing step as a ProcessingStep, save the PP to /opt/ml/processing, add the PP to the processing step's output list, and create a model using the PP output variable as the model data. This is much simpler. My only hesitation is that I want all my code on S3, and the ProcessingStep expects an S3 URI to a Python file whereas the model estimator expects a tar file with a specified entry point. For this to work I would need two copies of the processing script on S3 (.py and tar.gz). In addition, I don't want to be restricted to having all my preprocessing code written in one file.
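
To make the second option concrete, here is a minimal sketch of the part of the processing script that would split the data and write the fitted PP as a model.tar.gz. It uses only the standard library so it runs anywhere. The names split_rows, save_preprocessor and preprocessor.pkl are hypothetical, and the plain dict in the usage note stands in for a fitted scikit-learn pipeline (which would more typically be saved with joblib). The output directory is an assumption: any subdirectory of /opt/ml/processing that is declared as a processing output would do.

```python
import pickle
import random
import tarfile
from pathlib import Path


def split_rows(rows, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle rows and cut them into train/validation/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    # Whatever is left after train and validation becomes the test set.
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def save_preprocessor(preprocessor, model_dir):
    """Pickle the fitted preprocessor and wrap it in model.tar.gz."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    pickle_path = model_dir / "preprocessor.pkl"  # hypothetical file name
    with open(pickle_path, "wb") as f:
        pickle.dump(preprocessor, f)
    archive_path = model_dir / "model.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(pickle_path, arcname="preprocessor.pkl")
    pickle_path.unlink()  # leave only the archive in the output directory
    return archive_path
```

In the processing script this would be called as something like save_preprocessor(fitted_pp, "/opt/ml/processing/model"), with that directory (an assumed path) added to the step's outputs next to the train/validation/test directories.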

Does anyone know of a solution or hack around these limitations?
Is there a way to create a TrainingStep that can save more results than just the model artifacts?
Is there a way to create a ProcessingStep that uses a zipped file and entry_point as the code input?
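
On the "two copies" concern: one way to avoid maintaining the tar.gz by hand would be to generate it from the same source directory that holds the .py entry point, so the multi-file code lives in one place. The sketch below only shows the bundling with the standard library. The function name bundle_source_dir and the directory layout are hypothetical, and whether a given step accepts such an archive is not something this sketch settles.

```python
import tarfile
from pathlib import Path


def bundle_source_dir(source_dir, archive_path, entry_point):
    """Pack every file under source_dir into a gzip tarball.

    Paths inside the archive are relative to source_dir, so the entry
    point sits at the archive root next to any helper modules.
    """
    source_dir = Path(source_dir)
    if not (source_dir / entry_point).is_file():
        raise FileNotFoundError(f"{entry_point} not found in {source_dir}")
    # Collect the file list first so the archive never includes itself,
    # even if archive_path happens to be inside source_dir.
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in files:
            tar.add(path, arcname=path.relative_to(source_dir).as_posix())
    return Path(archive_path)
```

The resulting archive and the original entry-point file could then be uploaded to S3 by the same deployment script, so both artifacts always come from one source tree.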

Thanks

Replies: 0 comments
