Deploy a model by using the gcloud CLI or Vertex AI API
To deploy a model to a public endpoint by using the gcloud CLI or Vertex AI API, you need to get the endpoint ID for an existing endpoint and then deploy the model to it.
Get the endpoint ID
You need the endpoint ID to deploy the model.
gcloud
The following example uses the gcloud ai endpoints list command:
gcloud ai endpoints list \
  --region=LOCATION_ID \
  --filter=display_name=ENDPOINT_NAME
Replace the following:
- LOCATION_ID: The region where you are using Vertex AI.
- ENDPOINT_NAME: The display name for the endpoint.
Note the number that appears in the ENDPOINT_ID column. Use this ID in the following step.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_NAME: The display name for the endpoint.
HTTP method and URL:
GET https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME
To send your request, expand one of these options:
curl (Linux, macOS, or Cloud Shell)
Execute the following command:
curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME"
PowerShell (Windows)
Execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{ "endpoints": [ { "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID", "displayName": "ENDPOINT_NAME", "etag": "AMEw9yPz5pf4PwBHbRWOGh0PcAxUdjbdX2Jm3QO_amguy3DbZGP5Oi_YUKRywIE-BtLx", "createTime": "2020-04-17T18:31:11.585169Z", "updateTime": "2020-04-17T18:35:08.568959Z" } ] }
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
Replace the following:
- PROJECT_ID: Your project ID.
- LOCATION_ID: The region where you are using Vertex AI.
- ENDPOINT_NAME: The display name for the endpoint.
from google.cloud import aiplatform

PROJECT_ID = "PROJECT_ID"
LOCATION = "LOCATION_ID"
ENDPOINT_NAME = "ENDPOINT_NAME"

aiplatform.init(
    project=PROJECT_ID,
    location=LOCATION,
)

# Endpoint.list() returns a list of matching endpoints; take the first match.
endpoints = aiplatform.Endpoint.list(
    filter=f"display_name={ENDPOINT_NAME}",
)
endpoint_id = endpoints[0].name.split("/")[-1]
Deploy the model
When you deploy a model, you give the deployed model an ID to distinguish it from other models deployed to the endpoint.
Select the tab below for your language or environment:
gcloud
The following examples use the gcloud ai endpoints deploy-model command.
The first example deploys a Model to an Endpoint without using GPUs to accelerate prediction serving and without splitting traffic between multiple DeployedModel resources:
Before using any of the command data below, make the following replacements:
- ENDPOINT_ID: The ID for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
- MODEL_ID: The ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to the maximum number of nodes and never fewer than this number of nodes.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to this number of nodes and never fewer than the minimum number of nodes. If you omit the --max-replica-count flag, then the maximum number of nodes is set to the value of --min-replica-count.
Execute the gcloud ai endpoints deploy-model command:
Linux, macOS, or Cloud Shell
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=100
Windows (PowerShell)
gcloud ai endpoints deploy-model ENDPOINT_ID `
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME `
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=100
Windows (cmd.exe)
gcloud ai endpoints deploy-model ENDPOINT_ID ^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME ^
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=100
Splitting traffic
The --traffic-split=0=100 flag in the preceding examples sends 100% of prediction traffic that the Endpoint receives to the new DeployedModel, which is represented by the temporary ID 0. If your Endpoint already has other DeployedModel resources, then you can split traffic between the new DeployedModel and the old ones.
For example, to send 20% of traffic to the new DeployedModel and 80% to an older one, run the following command.
Before using any of the command data below, make the following replacements:
- OLD_DEPLOYED_MODEL_ID: The ID of the existing DeployedModel.
Execute the gcloud ai endpoints deploy-model command:
Linux, macOS, or Cloud Shell
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
Windows (PowerShell)
gcloud ai endpoints deploy-model ENDPOINT_ID `
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME `
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
Windows (cmd.exe)
gcloud ai endpoints deploy-model ENDPOINT_ID ^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME ^
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
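You can express the same 20/80 split with the Vertex AI SDK for Python by passing a traffic_split dictionary to Model.deploy. The following is a minimal sketch, assuming all placeholder IDs are replaced with real values; the "0" key refers to the model being deployed by this call:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION_ID")

# Deploy MODEL_ID to ENDPOINT_ID, sending 20% of traffic to the new
# DeployedModel and 80% to the existing one (placeholder IDs).
model = aiplatform.Model("MODEL_ID")
model.deploy(
    endpoint=aiplatform.Endpoint("ENDPOINT_ID"),
    machine_type="n1-standard-2",
    traffic_split={"0": 20, "OLD_DEPLOYED_MODEL_ID": 80},
    min_replica_count=1,
    max_replica_count=1,
)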
REST
Deploy the model.
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_ID: The ID for the endpoint.
- MODEL_ID: The ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
- MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
- ACCELERATOR_TYPE: The type of accelerator to be attached to the machine. Optional if ACCELERATOR_COUNT is not specified or is zero. Not recommended for AutoML models or custom-trained models that are using non-GPU images. Learn more.
- ACCELERATOR_COUNT: The number of accelerators for each replica to use. Optional. Should be zero or unspecified for AutoML models or custom-trained models that are using non-GPU images.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to this number of nodes and never fewer than the minimum number of nodes.
- REQUIRED_REPLICA_COUNT: Optional. The required number of nodes for this deployment to be marked as successful. Must be greater than or equal to 1 and fewer than or equal to the minimum number of nodes. If not specified, the default value is the minimum number of nodes.
- TRAFFIC_SPLIT_THIS_MODEL: The percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
- DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
- TRAFFIC_SPLIT_MODEL_N: The traffic split percentage value for the deployed model id key.
- PROJECT_NUMBER: Your project's automatically generated project number.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{ "deployedModel": { "model": "projects/PROJECT/locations/us-central1/models/MODEL_ID", "displayName": "DEPLOYED_MODEL_NAME", "dedicatedResources": { "machineSpec": { "machineType": "MACHINE_TYPE", "acceleratorType": "ACCELERATOR_TYPE", "acceleratorCount": "ACCELERATOR_COUNT" }, "minReplicaCount": MIN_REPLICA_COUNT, "maxReplicaCount": MAX_REPLICA_COUNT, "requiredReplicaCount": REQUIRED_REPLICA_COUNT }, }, "trafficSplit": { "0": TRAFFIC_SPLIT_THIS_MODEL, "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1, "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2 }, }
To send your request, expand one of these options:
curl (Linux, macOS, or Cloud Shell)
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"
PowerShell (Windows)
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployModelOperationMetadata", "genericMetadata": { "createTime": "2020-10-19T17:53:16.502088Z", "updateTime": "2020-10-19T17:53:16.502088Z" } } }
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.aiplatform.v1.DedicatedResources;
import com.google.cloud.aiplatform.v1.DeployModelOperationMetadata;
import com.google.cloud.aiplatform.v1.DeployModelResponse;
import com.google.cloud.aiplatform.v1.DeployedModel;
import com.google.cloud.aiplatform.v1.EndpointName;
import com.google.cloud.aiplatform.v1.EndpointServiceClient;
import com.google.cloud.aiplatform.v1.EndpointServiceSettings;
import com.google.cloud.aiplatform.v1.MachineSpec;
import com.google.cloud.aiplatform.v1.ModelName;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;

public class DeployModelCustomTrainedModelSample {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String endpointId = "ENDPOINT_ID";
    String modelName = "MODEL_NAME";
    String deployedModelDisplayName = "DEPLOYED_MODEL_DISPLAY_NAME";
    deployModelCustomTrainedModelSample(project, endpointId, modelName, deployedModelDisplayName);
  }

  static void deployModelCustomTrainedModelSample(
      String project, String endpointId, String model, String deployedModelDisplayName)
      throws IOException, ExecutionException, InterruptedException {
    EndpointServiceSettings settings =
        EndpointServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (EndpointServiceClient client = EndpointServiceClient.create(settings)) {
      MachineSpec machineSpec = MachineSpec.newBuilder().setMachineType("n1-standard-2").build();
      DedicatedResources dedicatedResources =
          DedicatedResources.newBuilder().setMinReplicaCount(1).setMachineSpec(machineSpec).build();

      String modelName = ModelName.of(project, location, model).toString();
      DeployedModel deployedModel =
          DeployedModel.newBuilder()
              .setModel(modelName)
              .setDisplayName(deployedModelDisplayName)
              // `dedicated_resources` must be used for non-AutoML models
              .setDedicatedResources(dedicatedResources)
              .build();

      // Key '0' assigns traffic for the newly deployed model.
      // Traffic percentage values must add up to 100.
      // Leave the map empty if the endpoint should not accept any traffic.
      Map<String, Integer> trafficSplit = new HashMap<>();
      trafficSplit.put("0", 100);
      EndpointName endpoint = EndpointName.of(project, location, endpointId);
      OperationFuture<DeployModelResponse, DeployModelOperationMetadata> response =
          client.deployModelAsync(endpoint, deployedModel, trafficSplit);

      // You can use OperationFuture.getInitialFuture to get a future representing the initial
      // response to the request, which contains information while the operation is in progress.
      System.out.format("Operation name: %s\n", response.getInitialFuture().get().getName());

      // OperationFuture.get() will block until the operation is finished.
      DeployModelResponse deployModelResponse = response.get();
      System.out.format("deployModelResponse: %s\n", deployModelResponse);
    }
  }
}
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
from typing import Dict, Optional, Sequence, Tuple

from google.cloud import aiplatform
from google.cloud.aiplatform import explain


def deploy_model_with_dedicated_resources_sample(
    project,
    location,
    model_name: str,
    machine_type: str,
    endpoint: Optional[aiplatform.Endpoint] = None,
    deployed_model_display_name: Optional[str] = None,
    traffic_percentage: Optional[int] = 0,
    traffic_split: Optional[Dict[str, int]] = None,
    min_replica_count: int = 1,
    max_replica_count: int = 1,
    accelerator_type: Optional[str] = None,
    accelerator_count: Optional[int] = None,
    explanation_metadata: Optional[explain.ExplanationMetadata] = None,
    explanation_parameters: Optional[explain.ExplanationParameters] = None,
    metadata: Optional[Sequence[Tuple[str, str]]] = (),
    sync: bool = True,
):
    """
    model_name: A fully-qualified model resource name or model ID.
      Example: "projects/123/locations/us-central1/models/456" or
      "456" when project and location are initialized or passed.
    """
    aiplatform.init(project=project, location=location)

    model = aiplatform.Model(model_name=model_name)

    # The explanation_metadata and explanation_parameters should only be
    # provided for a custom trained model and not an AutoML model.
    model.deploy(
        endpoint=endpoint,
        deployed_model_display_name=deployed_model_display_name,
        traffic_percentage=traffic_percentage,
        traffic_split=traffic_split,
        machine_type=machine_type,
        min_replica_count=min_replica_count,
        max_replica_count=max_replica_count,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        explanation_metadata=explanation_metadata,
        explanation_parameters=explanation_parameters,
        metadata=metadata,
        sync=sync,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    return model
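For example, the following is a hypothetical invocation of this sample, with every placeholder value standing in for the IDs from the previous steps:

# All IDs below are placeholders; replace them before running.
model = deploy_model_with_dedicated_resources_sample(
    project="PROJECT_ID",
    location="LOCATION_ID",
    model_name="MODEL_ID",
    machine_type="n1-standard-2",
    endpoint=aiplatform.Endpoint("ENDPOINT_ID"),
    deployed_model_display_name="DEPLOYED_MODEL_NAME",
    traffic_percentage=100,
)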
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
const automl = require('@google-cloud/automl');
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to create a model.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const datasetId = '[DATASET_ID]' e.g., "TBL2246891593778855936";
// const tableId = '[TABLE_ID]' e.g., "1991013247762825216";
// const columnId = '[COLUMN_ID]' e.g., "773141392279994368";
// const modelName = '[MODEL_NAME]' e.g., "testModel";
// const trainBudget = '[TRAIN_BUDGET]' e.g., "1000",
// `Train budget in milli node hours`;

// A resource that represents Google Cloud Platform location.
const projectLocation = client.locationPath(projectId, computeRegion);

// Get the full path of the column.
const columnSpecId = client.columnSpecPath(
  projectId,
  computeRegion,
  datasetId,
  tableId,
  columnId
);

// Set target column to train the model.
const targetColumnSpec = {name: columnSpecId};

// Set tables model metadata.
const tablesModelMetadata = {
  targetColumnSpec: targetColumnSpec,
  trainBudgetMilliNodeHours: trainBudget,
};

// Set datasetId, model name and model metadata for the dataset.
const myModel = {
  datasetId: datasetId,
  displayName: modelName,
  tablesModelMetadata: tablesModelMetadata,
};

// Create a model with the model metadata in the region.
client
  .createModel({parent: projectLocation, model: myModel})
  .then(responses => {
    const initialApiResponse = responses[1];
    console.log(`Training operation name: ${initialApiResponse.name}`);
    console.log('Training started...');
  })
  .catch(err => {
    console.error(err);
  });
Learn how to change the default settings for inference logging.
Get operation status
Some requests start long-running operations that require time to complete. These requests return an operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.
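For example, the deploy request above returns an operation name in the name field of its response. The following is a minimal sketch of polling that operation over REST from Python, assuming OPERATION_NAME is the returned name value, the requests library is installed, and you are authenticated with gcloud; the done field is true once the operation has finished:

import subprocess

import requests

# Placeholders; use the "name" value returned by your deploy request.
LOCATION_ID = "LOCATION_ID"
OPERATION_NAME = (
    "projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID"
    "/operations/OPERATION_ID"
)

# Reuse gcloud credentials for a short-lived access token.
token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()

response = requests.get(
    f"https://{LOCATION_ID}-aiplatform.googleapis.com/v1/{OPERATION_NAME}",
    headers={"Authorization": f"Bearer {token}"},
)
operation = response.json()
print(operation.get("done", False))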
What's next
- Learn how to get an online inference.
- Learn about private endpoints.