feat: Support placeholders for TransformStep #157


Merged
ca-nguyen merged 13 commits into aws:main from ca-nguyen:support-placeholders-for-transform-step on Oct 18, 2021

Commits (13):
- b197594: Support placeholders for transform step (ca-nguyen, Sep 1, 2021)
- 9ff7624: Use max_payload in execution input (ca-nguyen, Sep 2, 2021)
- 8892552: Merge branch 'main' into support-placeholders-for-transform-step (ca-nguyen, Sep 2, 2021)
- 4ad008f: Adjust number of concurrent transform (ca-nguyen, Sep 3, 2021)
- bbe0f9e: Added parameters arg doc (ca-nguyen, Sep 3, 2021)
- e5d1eca: Merge branch 'main' into support-placeholders-for-transform-step (shivlaks, Sep 10, 2021)
- 44ff061: Merge branch 'main' into support-placeholders-for-transform-step (shivlaks, Sep 11, 2021)
- 52d8854: Merge branch 'main' into support-placeholders-for-transform-step (ca-nguyen, Sep 13, 2021)
- 58b1685: Add retry to integ test (ca-nguyen, Sep 13, 2021)
- d575518: Updated tests (ca-nguyen, Sep 15, 2021)
- f16041b: Merge branch 'main' into support-placeholders-for-transform-step (ca-nguyen, Oct 15, 2021)
- 3d7970a: Apply suggestions from code review (ca-nguyen, Oct 15, 2021)
- 1422243: Correct indentation and use placeholder for experiment_config in unit... (ca-nguyen, Oct 15, 2021)
46 changes: 28 additions & 18 deletions src/stepfunctions/steps/sagemaker.py
@@ -185,36 +185,42 @@ def __merge_hyperparameters(self, training_step_hyperparameters, estimator_hyper
             merged_hyperparameters[key] = value
         return merged_hyperparameters


 class TransformStep(Task):

     """
     Creates a Task State to execute a `SageMaker Transform Job <https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTransformJob.html>`_.
     """

-    def __init__(self, state_id, transformer, job_name, model_name, data, data_type='S3Prefix', content_type=None, compression_type=None, split_type=None, experiment_config=None, wait_for_completion=True, tags=None, input_filter=None, output_filter=None, join_source=None, **kwargs):
+    def __init__(self, state_id, transformer, job_name, model_name, data, data_type='S3Prefix', content_type=None,
+                 compression_type=None, split_type=None, experiment_config=None, wait_for_completion=True, tags=None,
+                 input_filter=None, output_filter=None, join_source=None, **kwargs):
         """
         Args:
             state_id (str): State name whose length **must be** less than or equal to 128 unicode characters. State names **must be** unique within the scope of the whole state machine.
             transformer (sagemaker.transformer.Transformer): The SageMaker transformer to use in the TransformStep.
             job_name (str or Placeholder): Specify a transform job name. We recommend using an :py:class:`~stepfunctions.inputs.ExecutionInput` placeholder collection to pass the value dynamically in each execution.
             model_name (str or Placeholder): Specify a model name for the transform job to use. We recommend using an :py:class:`~stepfunctions.inputs.ExecutionInput` placeholder collection to pass the value dynamically in each execution.
-            data (str): Input data location in S3.
-            data_type (str): What the S3 location defines (default: 'S3Prefix').
+            data (str or Placeholder): Input data location in S3.
+            data_type (str or Placeholder): What the S3 location defines (default: 'S3Prefix').
                 Valid values:

                 * 'S3Prefix' - the S3 URI defines a key name prefix. All objects with this prefix will
                   be used as inputs for the transform job.
                 * 'ManifestFile' - the S3 URI points to a single manifest file listing each S3 object
                   to use as an input for the transform job.
-            content_type (str): MIME type of the input data (default: None).
-            compression_type (str): Compression type of the input data, if compressed (default: None). Valid values: 'Gzip', None.
-            split_type (str): The record delimiter for the input object (default: 'None'). Valid values: 'None', 'Line', 'RecordIO', and 'TFRecord'.
-            experiment_config (dict, optional): Specify the experiment config for the transform. (Default: None)
+            content_type (str or Placeholder): MIME type of the input data (default: None).
+            compression_type (str or Placeholder): Compression type of the input data, if compressed (default: None). Valid values: 'Gzip', None.
+            split_type (str or Placeholder): The record delimiter for the input object (default: 'None'). Valid values: 'None', 'Line', 'RecordIO', and 'TFRecord'.
+            experiment_config (dict or Placeholder, optional): Specify the experiment config for the transform. (Default: None)
             wait_for_completion(bool, optional): Boolean value set to `True` if the Task state should wait for the transform job to complete before proceeding to the next step in the workflow. Set to `False` if the Task state should submit the transform job and proceed to the next step. (default: True)
-            tags (list[dict], optional): `List of tags <https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html>`_ to associate with the resource.
-            input_filter (str): A JSONPath to select a portion of the input to pass to the algorithm container for inference. If you omit the field, it gets the value ‘$’, representing the entire input. For CSV data, each row is taken as a JSON array, so only index-based JSONPaths can be applied, e.g. $[0], $[1:]. CSV data should follow the RFC format. See Supported JSONPath Operators for a table of supported JSONPath operators. For more information, see the SageMaker API documentation for CreateTransformJob. Some examples: "$[1:]", "$.features" (default: None).
-            output_filter (str): A JSONPath to select a portion of the joined/original output to return as the output. For more information, see the SageMaker API documentation for CreateTransformJob. Some examples: "$[1:]", "$.prediction" (default: None).
-            join_source (str): The source of data to be joined to the transform output. It can be set to ‘Input’ meaning the entire input record will be joined to the inference result. You can use OutputFilter to select the useful portion before uploading to S3. (default: None). Valid values: Input, None.
+            tags (list[dict] or Placeholder, optional): `List of tags <https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html>`_ to associate with the resource.
+            input_filter (str or Placeholder): A JSONPath to select a portion of the input to pass to the algorithm container for inference. If you omit the field, it gets the value ‘$’, representing the entire input. For CSV data, each row is taken as a JSON array, so only index-based JSONPaths can be applied, e.g. $[0], $[1:]. CSV data should follow the RFC format. See Supported JSONPath Operators for a table of supported JSONPath operators. For more information, see the SageMaker API documentation for CreateTransformJob. Some examples: "$[1:]", "$.features" (default: None).
+            output_filter (str or Placeholder): A JSONPath to select a portion of the joined/original output to return as the output. For more information, see the SageMaker API documentation for CreateTransformJob. Some examples: "$[1:]", "$.prediction" (default: None).
+            join_source (str or Placeholder): The source of data to be joined to the transform output. It can be set to ‘Input’ meaning the entire input record will be joined to the inference result. You can use OutputFilter to select the useful portion before uploading to S3. (default: None). Valid values: Input, None.
+            parameters(dict, optional): The value of this field is merged with other arguments to become the request payload for SageMaker `CreateTransformJob <https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html>`_.
+                You can use `parameters` to override the value provided by other arguments and specify any field's value dynamically using `Placeholders <https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/placeholders.html?highlight=placeholder#stepfunctions.inputs.Placeholder>`_.

         """
         if wait_for_completion:
             """
@@ -233,7 +239,7 @@ def __init__(self, state_id, transformer, job_name, model_name, data, data_type=
                                                    SageMakerApi.CreateTransformJob)

         if isinstance(job_name, str):
-            parameters = transform_config(
+            transform_parameters = transform_config(
                 transformer=transformer,
                 data=data,
                 data_type=data_type,
@@ -246,7 +252,7 @@ def __init__(self, state_id, transformer, job_name, model_name, data, data_type=
                 join_source=join_source
             )
         else:
-            parameters = transform_config(
+            transform_parameters = transform_config(
                 transformer=transformer,
                 data=data,
                 data_type=data_type,
@@ -259,17 +265,21 @@ def __init__(self, state_id, transformer, job_name, model_name, data, data_type=
             )

         if isinstance(job_name, Placeholder):
-            parameters['TransformJobName'] = job_name
+            transform_parameters['TransformJobName'] = job_name

-        parameters['ModelName'] = model_name
+        transform_parameters['ModelName'] = model_name

         if experiment_config is not None:
-            parameters['ExperimentConfig'] = experiment_config
+            transform_parameters['ExperimentConfig'] = experiment_config

         if tags:
-            parameters['Tags'] = tags_dict_to_kv_list(tags)
+            transform_parameters['Tags'] = tags if isinstance(tags, Placeholder) else tags_dict_to_kv_list(tags)

-        kwargs[Field.Parameters.value] = parameters
+        if Field.Parameters.value in kwargs and isinstance(kwargs[Field.Parameters.value], dict):
+            # Update transform_parameters with input parameters
+            merge_dicts(transform_parameters, kwargs[Field.Parameters.value])
+
+        kwargs[Field.Parameters.value] = transform_parameters
         super(TransformStep, self).__init__(state_id, **kwargs)
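With this change, every TransformStep argument that maps onto a CreateTransformJob request field accepts a Placeholder, and any remaining field can be supplied through the new `parameters` argument. A minimal sketch of a resulting call site, assuming an already-constructed sagemaker.transformer.Transformer; the function name and schema keys below are illustrative, not part of this PR:

    from stepfunctions.inputs import ExecutionInput
    from stepfunctions.steps import TransformStep

    def build_transform_step(transformer):
        # Values resolved from each execution's input at run time.
        execution_input = ExecutionInput(schema={
            'job_name': str,
            'model_name': str,
            'data': str,
            'max_payload': int,
        })
        return TransformStep(
            'Transform',
            transformer=transformer,
            job_name=execution_input['job_name'],      # Placeholder instead of str
            model_name=execution_input['model_name'],
            data=execution_input['data'],
            content_type='text/csv',
            # Fields without a dedicated constructor argument can be set (or
            # overridden) via `parameters`; they are merged into the generated
            # CreateTransformJob payload and take precedence.
            parameters={'MaxPayloadInMB': execution_input['max_payload']},
        )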
90 changes: 90 additions & 0 deletions tests/integ/test_sagemaker_steps.py
@@ -179,6 +179,96 @@ def test_transform_step(trained_estimator, sfn_client, sfn_role_arn):
     state_machine_delete_wait(sfn_client, workflow.state_machine_arn)
     # End of Cleanup

+
+def test_transform_step_with_placeholder(trained_estimator, sfn_client, sfn_role_arn):
+    # Create transformer from supplied estimator
+    job_name = generate_job_name()
+    pca_transformer = trained_estimator.transformer(instance_count=INSTANCE_COUNT, instance_type=INSTANCE_TYPE)
+
+    # Create a model step to save the model
+    model_step = ModelStep('create_model_step', model=trained_estimator.create_model(), model_name=job_name)
+    model_step.add_retry(SAGEMAKER_RETRY_STRATEGY)
+
+    # Upload data for transformation to S3
+    data_path = os.path.join(DATA_DIR, "one_p_mnist")
+    transform_input_path = os.path.join(data_path, "transform_input.csv")
+    transform_input_key_prefix = "integ-test-data/one_p_mnist/transform"
+    transform_input = pca_transformer.sagemaker_session.upload_data(
+        path=transform_input_path, key_prefix=transform_input_key_prefix
+    )
+
+    execution_input = ExecutionInput(schema={
+        'data': str,
+        'content_type': str,
+        'split_type': str,
+        'job_name': str,
+        'model_name': str,
+        'instance_count': int,
+        'instance_type': str,
+        'strategy': str,
+        'max_concurrent_transforms': int,
+        'max_payload': int,
+    })
+
+    parameters = {
+        'BatchStrategy': execution_input['strategy'],
+        'TransformInput': {
+            'SplitType': execution_input['split_type'],
+        },
+        'TransformResources': {
+            'InstanceCount': execution_input['instance_count'],
+            'InstanceType': execution_input['instance_type'],
+        },
+        'MaxConcurrentTransforms': execution_input['max_concurrent_transforms'],
+        'MaxPayloadInMB': execution_input['max_payload']
+    }
+
+    # Build workflow definition
+    transform_step = TransformStep(
+        'create_transform_job_step',
+        pca_transformer,
+        job_name=execution_input['job_name'],
+        model_name=execution_input['model_name'],
+        data=execution_input['data'],
+        content_type=execution_input['content_type'],
+        parameters=parameters
+    )
+    transform_step.add_retry(SAGEMAKER_RETRY_STRATEGY)
+    workflow_graph = Chain([model_step, transform_step])
+
+    with timeout(minutes=DEFAULT_TIMEOUT_MINUTES):
+        # Create workflow and check definition
+        workflow = create_workflow_and_check_definition(
+            workflow_graph=workflow_graph,
+            workflow_name=unique_name_from_base("integ-test-transform-step-workflow"),
+            sfn_client=sfn_client,
+            sfn_role_arn=sfn_role_arn
+        )
+
+        execution_input = {
+            'job_name': job_name,
+            'model_name': job_name,
+            'data': transform_input,
+            'content_type': "text/csv",
+            'instance_count': INSTANCE_COUNT,
+            'instance_type': INSTANCE_TYPE,
+            'split_type': 'Line',
+            'strategy': 'SingleRecord',
+            'max_concurrent_transforms': 2,
+            'max_payload': 5
+        }
+
+        # Execute workflow
+        execution = workflow.execute(inputs=execution_input)
+        execution_output = execution.get_output(wait=True)
+
+        # Check workflow output
+        assert execution_output.get("TransformJobStatus") == "Completed"
+
+        # Cleanup
+        state_machine_delete_wait(sfn_client, workflow.state_machine_arn)
+
+
 def test_endpoint_config_step(trained_estimator, sfn_client, sagemaker_session, sfn_role_arn):
     # Setup: Create model for trained estimator in SageMaker
     model = trained_estimator.create_model()
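The precedence this test exercises (MaxPayloadInMB, MaxConcurrentTransforms, and the rest coming from the execution input rather than from the transformer's own settings) is produced by the recursive merge of the user-supplied `parameters` into the generated config. A simplified stand-in for the SDK's merge helper, shown only to illustrate the override semantics, not the actual implementation:

    def merge_dicts(target, source):
        # Recursively overlay `source` onto `target`: scalar values in
        # `source` win, nested dicts are merged key by key.
        for key, value in source.items():
            if isinstance(value, dict) and isinstance(target.get(key), dict):
                merge_dicts(target[key], value)
            else:
                target[key] = value

    generated = {'TransformResources': {'InstanceCount': 1, 'InstanceType': 'ml.m4.xlarge'}}
    overrides = {'TransformResources': {'InstanceCount': 2}}
    merge_dicts(generated, overrides)
    assert generated == {'TransformResources': {'InstanceCount': 2, 'InstanceType': 'ml.m4.xlarge'}}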
111 changes: 111 additions & 0 deletions tests/unit/test_sagemaker_steps.py
@@ -901,6 +901,117 @@ def test_transform_step_creation(pca_transformer):
     }


+@patch.object(boto3.session.Session, 'region_name', 'us-east-1')
+def test_transform_step_creation_with_placeholder(pca_transformer):
+    execution_input = ExecutionInput(schema={
+        'data': str,
+        'data_type': str,
+        'content_type': str,
+        'compression_type': str,
+        'split_type': str,
+        'input_filter': str,
+        'output_filter': str,
+        'join_source': str,
+        'job_name': str,
+        'model_name': str,
+        'instance_count': int,
+        'strategy': str,
+        'assemble_with': str,
+        'output_path': str,
+        'output_kms_key': str,
+        'accept': str,
+        'max_concurrent_transforms': int,
+        'max_payload': int,
+        'tags': [{str: str}],
+        'env': str,
+        'volume_kms_key': str,
+        'experiment_config': str,
+    })
+
+    step_input = StepInput(schema={
+        'instance_type': str
+    })
+
+    parameters = {
+        'BatchStrategy': execution_input['strategy'],
+        'TransformOutput': {
+            'Accept': execution_input['accept'],
+            'AssembleWith': execution_input['assemble_with'],
+            'KmsKeyId': execution_input['output_kms_key'],
+            'S3OutputPath': execution_input['output_path']
+        },
+        'TransformResources': {
+            'InstanceCount': execution_input['instance_count'],
+            'InstanceType': step_input['instance_type'],
+            'VolumeKmsKeyId': execution_input['volume_kms_key']
+        },
+        'ExperimentConfig': execution_input['experiment_config'],
+        'Tags': execution_input['tags'],
+        'Environment': execution_input['env'],
+        'MaxConcurrentTransforms': execution_input['max_concurrent_transforms'],
+        'MaxPayloadInMB': execution_input['max_payload'],
+    }
+
+    step = TransformStep('Inference',
+                         transformer=pca_transformer,
+                         data=execution_input['data'],
+                         data_type=execution_input['data_type'],
+                         content_type=execution_input['content_type'],
+                         compression_type=execution_input['compression_type'],
+                         split_type=execution_input['split_type'],
+                         job_name=execution_input['job_name'],
+                         model_name=execution_input['model_name'],
+                         experiment_config={
+                             'ExperimentName': 'pca_experiment',
+                             'TrialName': 'pca_trial',
+                             'TrialComponentDisplayName': 'Transform'
+                         },
+                         tags=execution_input['tags'],
+                         join_source=execution_input['join_source'],
+                         output_filter=execution_input['output_filter'],
+                         input_filter=execution_input['input_filter'],
+                         parameters=parameters
+                         )
+
+    assert step.to_dict()['Parameters'] == {
+        'BatchStrategy.$': "$$.Execution.Input['strategy']",
+        'ModelName.$': "$$.Execution.Input['model_name']",
+        'TransformInput': {
+            'CompressionType.$': "$$.Execution.Input['compression_type']",
+            'ContentType.$': "$$.Execution.Input['content_type']",
+            'DataSource': {
+                'S3DataSource': {
+                    'S3DataType.$': "$$.Execution.Input['data_type']",
+                    'S3Uri.$': "$$.Execution.Input['data']"
+                }
+            },
+            'SplitType.$': "$$.Execution.Input['split_type']"
+        },
+        'TransformOutput': {
+            'Accept.$': "$$.Execution.Input['accept']",
+            'AssembleWith.$': "$$.Execution.Input['assemble_with']",
+            'KmsKeyId.$': "$$.Execution.Input['output_kms_key']",
+            'S3OutputPath.$': "$$.Execution.Input['output_path']"
+        },
+        'TransformJobName.$': "$$.Execution.Input['job_name']",
+        'TransformResources': {
+            'InstanceCount.$': "$$.Execution.Input['instance_count']",
+            'InstanceType.$': "$['instance_type']",
+            'VolumeKmsKeyId.$': "$$.Execution.Input['volume_kms_key']"
+        },
+        'ExperimentConfig.$': "$$.Execution.Input['experiment_config']",
+        'DataProcessing': {
+            'InputFilter.$': "$$.Execution.Input['input_filter']",
+            'OutputFilter.$': "$$.Execution.Input['output_filter']",
+            'JoinSource.$': "$$.Execution.Input['join_source']",
+        },
+        'Tags.$': "$$.Execution.Input['tags']",
+        'Environment.$': "$$.Execution.Input['env']",
+        'MaxConcurrentTransforms.$': "$$.Execution.Input['max_concurrent_transforms']",
+        'MaxPayloadInMB.$': "$$.Execution.Input['max_payload']"
+    }
+
+
 @patch('botocore.client.BaseClient._make_api_call', new=mock_boto_api_call)
 @patch.object(boto3.session.Session, 'region_name', 'us-east-1')
 def test_get_expected_model(pca_estimator):
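A note for readers comparing the expected dictionary with the constructor call: whenever an argument value is a Placeholder, the SDK serializes the key with a `.$` suffix and a JSONPath reference in place of a literal value. The paths can be inspected directly; a small sketch, assuming the `to_jsonpath()` accessor the SDK documents on Placeholder:

    from stepfunctions.inputs import ExecutionInput, StepInput

    execution_input = ExecutionInput(schema={'job_name': str})
    step_input = StepInput(schema={'instance_type': str})

    # ExecutionInput resolves against the workflow execution's input,
    # StepInput against the current state's input.
    print(execution_input['job_name'].to_jsonpath())    # $$.Execution.Input['job_name']
    print(step_input['instance_type'].to_jsonpath())    # $['instance_type']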
