Hi,
I am trying to modify the simple preprocess->train->evaluate->register->transform pipeline example,
from https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb,
so that the created model is an inference pipeline that includes the preprocessing step.
The goal is to run a batch transform on raw data that applies the preprocessing and then the xgboost inference. To do that I need to save the fitted scikit-learn preprocessing pipeline (PP), but I also want to keep the train/test/val split functionality.
From what I understand, I need to register a PipelineModel, which only supports Estimators. If I change the preprocessing step from a ProcessingStep to a TrainingStep then I could easily save the PP but I won't be able to save the train/test/val datasets. I thought of two different solutions:
- Separate the preprocessing step into a processing step and a training step, where the former performs the train/test/val split and the latter fits and saves the PP. This is straightforward, but I would still need to transform the datasets, so would I need to add a transform step before the xgboost training step? That seems overcomplicated.
- Keep the preprocessing step as a ProcessingStep, save the PP to /opt/ml/processing, add the PP to the processing step's output list, and create a model using the PP output variable as the model data. This is much simpler. My only hesitation is that I want all my code on S3, and the ProcessingStep expects an S3 URI to a Python file, whereas the model estimator expects a tar file with a specified entry point. For this to work I would need two copies of the processing script on S3 (.py and tar.gz). In addition, I don't want to be restricted to having all my preprocessing code in a single file.
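To make the second option more concrete, here is a rough sketch of the helper I imagine calling at the end of the processing script. It pickles the fitted PP and wraps it in a `model.tar.gz`, since SageMaker models expect their model data as a gzipped tar archive. The function name, the `preprocessor.pkl` file name, and the `/opt/ml/processing/model` directory mentioned below are my own placeholders, not anything taken from the example notebook, and I use `pickle` only to keep the sketch dependency-free (joblib would work similarly).

```python
import pickle
import tarfile
from pathlib import Path


def save_preprocessor_as_model_tar(preprocessor, output_dir, artifact_name="preprocessor.pkl"):
    """Pickle `preprocessor` and wrap it in model.tar.gz under `output_dir`.

    Hypothetical helper: the names and layout are illustrative assumptions.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Serialize the fitted preprocessing object to a temporary file.
    artifact_path = output_dir / artifact_name
    with open(artifact_path, "wb") as f:
        pickle.dump(preprocessor, f)

    # Put the pickle at the root of the archive so the inference code
    # can find it directly in the extracted model directory.
    tar_path = output_dir / "model.tar.gz"
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(artifact_path, arcname=artifact_name)

    # Remove the loose pickle so only the archive is uploaded as output.
    artifact_path.unlink()
    return tar_path
```

In the processing script I would call something like `save_preprocessor_as_model_tar(fitted_pp, "/opt/ml/processing/model")` and list that directory as one of the step's outputs. The inference script for the model would then have to load `preprocessor.pkl` under the same name.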
Does anyone know of a solution or hack around these limitations?
Is there a way to create a TrainingStep that can save more results than just the model artifacts?
Is there a way to create a ProcessingStep that uses a zipped file and entry_point as the code input?
Thanks