Deploy a model by using the gcloud CLI or Vertex AI API
To deploy a model to a public endpoint by using the gcloud CLI or Vertex AI API, you need to get the endpoint ID for an existing endpoint and then deploy the model to it.
Get the endpoint ID
You need the endpoint ID to deploy the model.
gcloud
The following example uses the gcloud ai endpoints list command:
gcloud ai endpoints list \
  --region=LOCATION_ID \
  --filter=display_name=ENDPOINT_NAME
Replace the following:
- LOCATION_ID: The region where you are using Vertex AI.
- ENDPOINT_NAME: The display name for the endpoint.
Note the number that appears in the ENDPOINT_ID column. Use this ID in the following step.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_NAME: The display name for the endpoint.
HTTP method and URL:
GET https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME
To send your request, expand one of these options:
curl (Linux, macOS, or Cloud Shell)
Execute the following command:
curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME"
PowerShell (Windows)
Execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{ "endpoints": [ { "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID", "displayName": "ENDPOINT_NAME", "etag": "AMEw9yPz5pf4PwBHbRWOGh0PcAxUdjbdX2Jm3QO_amguy3DbZGP5Oi_YUKRywIE-BtLx", "createTime": "2020-04-17T18:31:11.585169Z", "updateTime": "2020-04-17T18:35:08.568959Z" } ] }
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
Replace the following:
- PROJECT_ID: Your project ID.
- LOCATION_ID: The region where you are using Vertex AI.
- ENDPOINT_NAME: The display name for the endpoint.
from google.cloud import aiplatform

PROJECT_ID = "PROJECT_ID"
LOCATION = "LOCATION_ID"
ENDPOINT_NAME = "ENDPOINT_NAME"

aiplatform.init(
    project=PROJECT_ID,
    location=LOCATION,
)

# Endpoint.list() returns a list of matching endpoints; take the first match.
endpoints = aiplatform.Endpoint.list(
    filter=f"display_name={ENDPOINT_NAME}",
)
endpoint_id = endpoints[0].name.split("/")[-1]
Deploy the model
When you deploy a model, you give the deployed model an ID to distinguish it from other models deployed to the endpoint.
Select the tab below for your language or environment:
gcloud
The following examples use the gcloud ai endpoints deploy-model command.
The first example deploys a Model to an Endpoint without using GPUs to accelerate prediction serving and without splitting traffic between multiple DeployedModel resources:
Before using any of the command data below, make the following replacements:
- ENDPOINT_ID: The ID for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
- MODEL_ID: The ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to the maximum number of nodes and never fewer than this number of nodes.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to this number of nodes and never fewer than the minimum number of nodes. If you omit the --max-replica-count flag, then the maximum number of nodes is set to the value of --min-replica-count.
Execute the gcloud ai endpoints deploy-model command:
Linux, macOS, or Cloud Shell
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=100
Windows (PowerShell)
gcloud ai endpoints deploy-model ENDPOINT_ID `
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME `
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=100
Windows (cmd.exe)
gcloud ai endpoints deploy-model ENDPOINT_ID ^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME ^
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=100
Splitting traffic
The --traffic-split=0=100 flag in the preceding examples sends 100% of prediction traffic that the Endpoint receives to the new DeployedModel, which is represented by the temporary ID 0. If your Endpoint already has other DeployedModel resources, then you can split traffic between the new DeployedModel and the old ones.
For example, to send 20% of traffic to the new DeployedModel and 80% to an older one, run the following command.
Before using any of the command data below, make the following replacements:
- OLD_DEPLOYED_MODEL_ID: The ID of the existing DeployedModel.
Execute the gcloud ai endpoints deploy-model command:
Linux, macOS, or Cloud Shell
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=LOCATION_ID \
  --model=MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --min-replica-count=MIN_REPLICA_COUNT \
  --max-replica-count=MAX_REPLICA_COUNT \
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
Windows (PowerShell)
gcloud ai endpoints deploy-model ENDPOINT_ID `
  --region=LOCATION_ID `
  --model=MODEL_ID `
  --display-name=DEPLOYED_MODEL_NAME `
  --min-replica-count=MIN_REPLICA_COUNT `
  --max-replica-count=MAX_REPLICA_COUNT `
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
Windows (cmd.exe)
gcloud ai endpoints deploy-model ENDPOINT_ID ^
  --region=LOCATION_ID ^
  --model=MODEL_ID ^
  --display-name=DEPLOYED_MODEL_NAME ^
  --min-replica-count=MIN_REPLICA_COUNT ^
  --max-replica-count=MAX_REPLICA_COUNT ^
  --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
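You can express the same 20/80 split with the Vertex AI SDK for Python by passing a traffic_split dictionary to Model.deploy. The following is a minimal sketch, assuming all placeholder IDs are replaced with real values; the "0" key refers to the model being deployed by this call:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION_ID")

# Deploy MODEL_ID to ENDPOINT_ID, sending 20% of traffic to the new
# DeployedModel and 80% to the existing one (placeholder IDs).
model = aiplatform.Model("MODEL_ID")
model.deploy(
    endpoint=aiplatform.Endpoint("ENDPOINT_ID"),
    machine_type="n1-standard-2",
    traffic_split={"0": 20, "OLD_DEPLOYED_MODEL_ID": 80},
    min_replica_count=1,
    max_replica_count=1,
)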
REST
Deploy the model.
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_ID: The ID for the endpoint.
- MODEL_ID: The ID for the model to be deployed.
- DEPLOYED_MODEL_NAME: A name for the DeployedModel. You can use the display name of the Model for the DeployedModel as well.
- MACHINE_TYPE: Optional. The machine resources used for each node of this deployment. Its default setting is n1-standard-2. Learn more about machine types.
- ACCELERATOR_TYPE: The type of accelerator to be attached to the machine. Optional if ACCELERATOR_COUNT is not specified or is zero. Not recommended for AutoML models or custom-trained models that are using non-GPU images. Learn more.
- ACCELERATOR_COUNT: The number of accelerators for each replica to use. Optional. Should be zero or unspecified for AutoML models or custom-trained models that are using non-GPU images.
- MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to the maximum number of nodes and never fewer than this number of nodes. This value must be greater than or equal to 1.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the inference load, up to this number of nodes and never fewer than the minimum number of nodes.
- REQUIRED_REPLICA_COUNT: Optional. The required number of nodes for this deployment to be marked as successful. Must be greater than or equal to 1 and fewer than or equal to the minimum number of nodes. If not specified, the default value is the minimum number of nodes.
- TRAFFIC_SPLIT_THIS_MODEL: The percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
- DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
- TRAFFIC_SPLIT_MODEL_N: The traffic split percentage value for the deployed model id key.
- PROJECT_NUMBER: Your project's automatically generated project number.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{ "deployedModel": { "model": "projects/PROJECT/locations/us-central1/models/MODEL_ID", "displayName": "DEPLOYED_MODEL_NAME", "dedicatedResources": { "machineSpec": { "machineType": "MACHINE_TYPE", "acceleratorType": "ACCELERATOR_TYPE", "acceleratorCount": "ACCELERATOR_COUNT" }, "minReplicaCount": MIN_REPLICA_COUNT, "maxReplicaCount": MAX_REPLICA_COUNT, "requiredReplicaCount": REQUIRED_REPLICA_COUNT }, }, "trafficSplit": { "0": TRAFFIC_SPLIT_THIS_MODEL, "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1, "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2 }, }
To send your request, expand one of these options:
curl (Linux, macOS, or Cloud Shell)
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel"
PowerShell (Windows)
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployModelOperationMetadata", "genericMetadata": { "createTime": "2020-10-19T17:53:16.502088Z", "updateTime": "2020-10-19T17:53:16.502088Z" } } }
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.aiplatform.v1.DedicatedResources;
import com.google.cloud.aiplatform.v1.DeployModelOperationMetadata;
import com.google.cloud.aiplatform.v1.DeployModelResponse;
import com.google.cloud.aiplatform.v1.DeployedModel;
import com.google.cloud.aiplatform.v1.EndpointName;
import com.google.cloud.aiplatform.v1.EndpointServiceClient;
import com.google.cloud.aiplatform.v1.EndpointServiceSettings;
import com.google.cloud.aiplatform.v1.MachineSpec;
import com.google.cloud.aiplatform.v1.ModelName;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;

public class DeployModelCustomTrainedModelSample {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String endpointId = "ENDPOINT_ID";
    String modelName = "MODEL_NAME";
    String deployedModelDisplayName = "DEPLOYED_MODEL_DISPLAY_NAME";
    deployModelCustomTrainedModelSample(project, endpointId, modelName, deployedModelDisplayName);
  }

  static void deployModelCustomTrainedModelSample(
      String project, String endpointId, String model, String deployedModelDisplayName)
      throws IOException, ExecutionException, InterruptedException {
    EndpointServiceSettings settings =
        EndpointServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (EndpointServiceClient client = EndpointServiceClient.create(settings)) {
      MachineSpec machineSpec = MachineSpec.newBuilder().setMachineType("n1-standard-2").build();
      DedicatedResources dedicatedResources =
          DedicatedResources.newBuilder().setMinReplicaCount(1).setMachineSpec(machineSpec).build();

      String modelName = ModelName.of(project, location, model).toString();
      DeployedModel deployedModel =
          DeployedModel.newBuilder()
              .setModel(modelName)
              .setDisplayName(deployedModelDisplayName)
              // `dedicated_resources` must be used for non-AutoML models
              .setDedicatedResources(dedicatedResources)
              .build();

      // Key '0' assigns traffic for the newly deployed model.
      // Traffic percentage values must add up to 100.
      // Leave the map empty if the endpoint should not accept any traffic.
      Map<String, Integer> trafficSplit = new HashMap<>();
      trafficSplit.put("0", 100);
      EndpointName endpoint = EndpointName.of(project, location, endpointId);
      OperationFuture<DeployModelResponse, DeployModelOperationMetadata> response =
          client.deployModelAsync(endpoint, deployedModel, trafficSplit);

      // You can use OperationFuture.getInitialFuture to get a future representing the initial
      // response to the request, which contains information while the operation is in progress.
      System.out.format("Operation name: %s\n", response.getInitialFuture().get().getName());

      // OperationFuture.get() will block until the operation is finished.
      DeployModelResponse deployModelResponse = response.get();
      System.out.format("deployModelResponse: %s\n", deployModelResponse);
    }
  }
}
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
from typing import Dict, Optional, Sequence, Tuple

from google.cloud import aiplatform
from google.cloud.aiplatform import explain


def deploy_model_with_dedicated_resources_sample(
    project,
    location,
    model_name: str,
    machine_type: str,
    endpoint: Optional[aiplatform.Endpoint] = None,
    deployed_model_display_name: Optional[str] = None,
    traffic_percentage: Optional[int] = 0,
    traffic_split: Optional[Dict[str, int]] = None,
    min_replica_count: int = 1,
    max_replica_count: int = 1,
    accelerator_type: Optional[str] = None,
    accelerator_count: Optional[int] = None,
    explanation_metadata: Optional[explain.ExplanationMetadata] = None,
    explanation_parameters: Optional[explain.ExplanationParameters] = None,
    metadata: Optional[Sequence[Tuple[str, str]]] = (),
    sync: bool = True,
):
    """
    model_name: A fully-qualified model resource name or model ID.
      Example: "projects/123/locations/us-central1/models/456" or
      "456" when project and location are initialized or passed.
    """
    aiplatform.init(project=project, location=location)

    model = aiplatform.Model(model_name=model_name)

    # The explanation_metadata and explanation_parameters should only be
    # provided for a custom trained model and not an AutoML model.
    model.deploy(
        endpoint=endpoint,
        deployed_model_display_name=deployed_model_display_name,
        traffic_percentage=traffic_percentage,
        traffic_split=traffic_split,
        machine_type=machine_type,
        min_replica_count=min_replica_count,
        max_replica_count=max_replica_count,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        explanation_metadata=explanation_metadata,
        explanation_parameters=explanation_parameters,
        metadata=metadata,
        sync=sync,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    return model
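For example, the following is a hypothetical invocation of this sample, with every placeholder value standing in for the IDs from the previous steps:

# All IDs below are placeholders; replace them before running.
model = deploy_model_with_dedicated_resources_sample(
    project="PROJECT_ID",
    location="LOCATION_ID",
    model_name="MODEL_ID",
    machine_type="n1-standard-2",
    endpoint=aiplatform.Endpoint("ENDPOINT_ID"),
    deployed_model_display_name="DEPLOYED_MODEL_NAME",
    traffic_percentage=100,
)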
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
const automl = require('@google-cloud/automl');
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to create a model.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const datasetId = '[DATASET_ID]' e.g., "TBL2246891593778855936";
// const tableId = '[TABLE_ID]' e.g., "1991013247762825216";
// const columnId = '[COLUMN_ID]' e.g., "773141392279994368";
// const modelName = '[MODEL_NAME]' e.g., "testModel";
// const trainBudget = '[TRAIN_BUDGET]' e.g., "1000",
// `Train budget in milli node hours`;

// A resource that represents Google Cloud Platform location.
const projectLocation = client.locationPath(projectId, computeRegion);

// Get the full path of the column.
const columnSpecId = client.columnSpecPath(
  projectId,
  computeRegion,
  datasetId,
  tableId,
  columnId
);

// Set target column to train the model.
const targetColumnSpec = {name: columnSpecId};

// Set tables model metadata.
const tablesModelMetadata = {
  targetColumnSpec: targetColumnSpec,
  trainBudgetMilliNodeHours: trainBudget,
};

// Set datasetId, model name and model metadata for the dataset.
const myModel = {
  datasetId: datasetId,
  displayName: modelName,
  tablesModelMetadata: tablesModelMetadata,
};

// Create a model with the model metadata in the region.
client
  .createModel({parent: projectLocation, model: myModel})
  .then(responses => {
    const initialApiResponse = responses[1];
    console.log(`Training operation name: ${initialApiResponse.name}`);
    console.log('Training started...');
  })
  .catch(err => {
    console.error(err);
  });
Learn how to change the default settings for inference logging.
Get operation status
Some requests start long-running operations that require time to complete. These requests return an operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.
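For example, the deploy request above returns an operation name in the name field of its response. The following is a minimal sketch of polling that operation over REST from Python, assuming OPERATION_NAME is the returned name value, the requests library is installed, and you are authenticated with gcloud; the done field is true once the operation has finished:

import subprocess

import requests

# Placeholders; use the "name" value returned by your deploy request.
LOCATION_ID = "LOCATION_ID"
OPERATION_NAME = (
    "projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID"
    "/operations/OPERATION_ID"
)

# Reuse gcloud credentials for a short-lived access token.
token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()

response = requests.get(
    f"https://{LOCATION_ID}-aiplatform.googleapis.com/v1/{OPERATION_NAME}",
    headers={"Authorization": f"Bearer {token}"},
)
operation = response.json()
print(operation.get("done", False))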
What's next
- Learn how to get an online inference.
- Learn about private endpoints.