Training Pipeline to Create Inference Pipeline with Train/Test/Val Split #3848

Unanswered
Rim921 asked this question in Q&A

Hi,

I am trying to modify the simple preprocess->train->evaluate->register->transform pipeline example,
from https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb,
so that the created model is an inference pipeline that includes the preprocessing step.
The intention is to run a batch transform on raw data that performs the preprocessing and then the XGBoost inference. To do that I need to save the fitted scikit-learn preprocessing pipeline (PP), but I also want to keep the train/test/val split functionality.
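For context, the end state I'm aiming for is roughly the following, with the preprocessing model and the XGBoost model chained into a single PipelineModel. This is only an illustrative sketch: the S3 paths are placeholders, and `role` and `xgb_image_uri` are assumed to be defined elsewhere, as in the notebook.

```python
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.sklearn.model import SKLearnModel

# The fitted scikit-learn preprocessing pipeline, packaged as model.tar.gz,
# which is exactly the artifact I am trying to produce inside the pipeline.
preprocessor_model = SKLearnModel(
    model_data="s3://my-bucket/preprocessor/model.tar.gz",  # placeholder path
    role=role,                       # execution role, defined elsewhere
    entry_point="preprocessing.py",  # serving script (input_fn/predict_fn/output_fn)
    framework_version="1.0-1",
)

xgb_model = Model(
    image_uri=xgb_image_uri,  # XGBoost container image, defined elsewhere
    model_data="s3://my-bucket/xgb/model.tar.gz",  # placeholder path
    role=role,
)

# Inference pipeline: raw data -> preprocessing container -> XGBoost container.
inference_pipeline = PipelineModel(models=[preprocessor_model, xgb_model], role=role)
transformer = inference_pipeline.transformer(
    instance_count=1, instance_type="ml.m5.xlarge"
)
transformer.transform("s3://my-bucket/raw-data/", content_type="text/csv")
```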

From what I understand, I need to register a PipelineModel, whose component models are built from model artifacts like those produced by Estimators. If I change the preprocessing step from a ProcessingStep to a TrainingStep, I could easily save the PP, but I wouldn't be able to save the train/test/val datasets. I thought of two possible solutions:

  1. Separate the preprocessing into a ProcessingStep and a TrainingStep, where the former performs the train/test/val split and the latter fits and saves the PP. This is straightforward, but the split datasets would still need to be transformed, so I would also have to add a transform step before the XGBoost training step. That seems overcomplicated.
  2. Keep the preprocessing step as a ProcessingStep, save the PP under /opt/ml/processing, add the PP to the processing step's output list, and create a model using that output as the model data (see the sketch after this list). This is much simpler; my only hesitation is that I want all my code on S3, and a ProcessingStep expects an S3 URI to a Python file whereas a model/estimator expects a tar file with a specified entry point. For this to work I would need two copies of the processing script on S3 (.py and .tar.gz). In addition, I don't want to be restricted to having all my preprocessing code in one file.
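To make option 2 concrete, here is a minimal sketch of what I have in mind; file names, paths, and the toy pipeline contents are illustrative, not my real code. Inside the processing script, the fitted PP is packaged as model.tar.gz so it can later be used as model_data for an SKLearnModel:

```python
# preprocessing.py (runs inside the ProcessingStep container)
import tarfile

import joblib
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("/opt/ml/processing/input/raw.csv")  # illustrative input file
# ... train/test/val split and feature engineering as in the original notebook ...

pp = Pipeline([("scaler", StandardScaler())])  # the preprocessing pipeline (PP)
pp.fit(df)

# SKLearnModel expects model_data to be a tar.gz, so package the artifact here.
joblib.dump(pp, "/opt/ml/processing/model/model.joblib")
with tarfile.open("/opt/ml/processing/model/model.tar.gz", "w:gz") as tar:
    tar.add("/opt/ml/processing/model/model.joblib", arcname="model.joblib")
```

On the pipeline-definition side, the tarball becomes a named output of the step, and its S3 URI can be referenced as model data (`sklearn_processor` is the processor from the notebook):

```python
from sagemaker.processing import ProcessingOutput
from sagemaker.workflow.functions import Join
from sagemaker.workflow.steps import ProcessingStep

step_process = ProcessingStep(
    name="Preprocess",
    processor=sklearn_processor,  # defined elsewhere, as in the notebook
    outputs=[
        ProcessingOutput(output_name="model", source="/opt/ml/processing/model"),
        # ... train/test/validation outputs as in the original notebook ...
    ],
    code="s3://my-bucket/code/preprocessing.py",  # illustrative path
)

# S3 URI of the packaged PP, usable as model_data for an SKLearnModel.
pp_model_data = Join(
    on="/",
    values=[
        step_process.properties.ProcessingOutputConfig.Outputs["model"].S3Output.S3Uri,
        "model.tar.gz",
    ],
)
```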

Does anyone know of a solution or a workaround for these limitations?
Is there a way to create a TrainingStep that can save more results than just the model artifacts?
Is there a way to create a ProcessingStep that takes a zipped file and an entry_point as its code input?
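On the second question: as far as I can tell, anything a training script writes to /opt/ml/output/data is uploaded alongside the model artifacts (as output.tar.gz), but referencing those files from later pipeline steps seems awkward. On the third question, the closest thing I've found is FrameworkProcessor, whose run() appears to accept a source_dir, including an S3 URI to a packaged code directory. A rough sketch, assuming a recent SDK version and with paths as placeholders:

```python
from sagemaker.processing import FrameworkProcessor
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

pipeline_session = PipelineSession()

processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version="1.0-1",
    role=role,  # execution role, defined elsewhere
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=pipeline_session,
)

step_args = processor.run(
    code="preprocessing.py",  # entry point inside the packaged source dir
    source_dir="s3://my-bucket/code/sourcedir.tar.gz",  # placeholder packaged code
    # inputs=... / outputs=... as in the original notebook
)
step_process = ProcessingStep(name="Preprocess", step_args=step_args)
```

If that works, it would remove the need to keep a separate single-file .py copy of the preprocessing code on S3.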

Thanks


Replies: 0 comments
