Create cluster

If you're interested in Vertex AI Managed Training, contact your sales representative for access.

This page describes the direct, API-driven method for creating and managing a Managed Training cluster. You'll learn how to define your cluster's complete configuration in a JSON file, including login nodes, high-performance GPU partitions such as A4, and Slurm orchestrator settings. You'll also learn how to use curl and REST API calls to deploy this configuration, creating the cluster and managing its lifecycle with GET, LIST, UPDATE, and DELETE operations.

Define the cluster configuration

Create a JSON file to define the complete configuration for your Managed Training cluster.

If your organizational policy prohibits Public IP addresses on compute instances, deploy the Managed Training cluster with the enable_public_ips: false parameter and utilize Cloud NAT for internet egress.
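For example, the following sketch shows one way to set up Cloud NAT for the cluster's network before deploying. The router and gateway names are placeholders chosen for illustration; adapt the commands to your own network layout.

# Create a Cloud Router in the cluster's VPC network and region.
gcloud compute routers create mt-nat-router \
    --project=PROJECT_ID \
    --network=NETWORK \
    --region=REGION

# Create a Cloud NAT gateway on that router so nodes without public IPs can reach the internet.
gcloud compute routers nats create mt-nat-gateway \
    --project=PROJECT_ID \
    --router=mt-nat-router \
    --region=REGION \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges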

The first step in provisioning a Managed Training cluster is to define its complete configuration in a JSON file. This file acts as the blueprint for your cluster, specifying everything from its name and network settings to the hardware for its login and worker nodes.

The following section provides several complete JSON configuration files that serve as practical templates for a variety of common use cases. Consult this list to find the example that most closely matches your needs and use it as a starting point.

Each example is followed by a detailed description of the key parameters used within that specific configuration.

GPU with Filestore only

This is the standard configuration. It uses a Filestore instance as the /home directory for the cluster, which is suitable for general use and storing user data.

The following example shows the content of gpu-filestore.json. This specification creates a cluster with a GPU partition. You can use this as a template and modify values such as the machineType or nodeCount to fit your needs.

For a list of parameters, see Parameter reference.

{
"display_name":"DISPLAY_NAME",
"network":{
"network":"projects/PROJECT_ID/global/networks/NETWORK",
"subnetwork":"projects/PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK"
},
"node_pools":[
{
"id":"login",
"machine_spec":{
"machine_type":"n2-standard-8"
},
"scaling_spec":{
"min_node_count":MIN_NODE_COUNT,
"max_node_count":MAX_NODE_COUNT
},
"enable_public_ips":true,
"zone":"ZONE",
"boot_disk":{
"boot_disk_type":"pd-standard",
"boot_disk_size_gb":200
}
},
{
"id":"a4",
"machine_spec":{
"machine_type":"a4-highgpu-8g",
"accelerator_type":"NVIDIA_B200",
"provisioning_model":"RESERVATION",
"accelerator_count":8,
"reservation_affinity":{
"reservationAffinityType":"RESERVATION_AFFINITY_TYPE",
"key":"compute.googleapis.com/reservation-name",
"values":[
"projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION"
]
}
},
"scaling_spec":{
"min_node_count":MIN_NODE_COUNT,
"max_node_count":MAX_NODE_COUNT
},
"enable_public_ips":true,
"zone":"ZONE",
"boot_disk":{
"boot_disk_type":"hyperdisk-balanced",
"boot_disk_size_gb":200
}
}
],
"orchestrator_spec":{
"slurm_spec":{
"home_directory_storage":"projects/PROJECT_ID/locations/ZONE/instances/FILESTORE",
"partitions":[
{
"id":"a4",
"node_pool_ids":[
"a4"
]
}
],
"login_node_pool_id":"login"
}
}
}

GPU with Filestore and Managed Lustre

This advanced configuration includes the standard Filestore instance in addition to a high-performance Lustre file system. Choose this option if your training jobs require high-throughput access to large datasets.

For a list of parameters, see Parameter reference.

{
"display_name":"DISPLAY_NAME",
"network":{
"network":"projects/PROJECT_ID/global/networks/NETWORK",
"subnetwork":"projects/PROJECT_ID/regions/asia-sREGION/subnetworks/SUBNETWORK"
},
"node_pools":[
{
"id":"login",
"machine_spec":{
"machine_type":"n2-standard-8"
},
"scaling_spec":{
"min_node_count":MIN_NODE_COUNT,
"max_node_count":MAX_NODE_COUNT
},
"enable_public_ips":true,
"zone":"ZONE",
"boot_disk":{
"boot_disk_type":"pd-standard",
"boot_disk_size_gb":200
},
"lustres":[
"projects/PROJECT_ID/locations/ZONE/instances/LUSTRE"
]
},
{
"id":"a4",
"machine_spec":{
"machine_type":"a4-highgpu-8g",
"accelerator_type":"NVIDIA_B200",
"accelerator_count":8,
"reservation_affinity":{
"reservation_affinity_type":RESERVATION_AFFINITY_TYPE,
"key":"compute.googleapis.com/reservation-name",
"values":[
"projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME"
]
}
},
"provisioning_model":"RESERVATION",
"scaling_spec":{
"min_node_count":MIN_NODE_COUNT,
"max_node_count":MAX_NODE_COUNT
},
"enable_public_ips":true,
"zone":"ZONE",
"boot_disk":{
"boot_disk_type":"hyperdisk-balanced",
"boot_disk_size_gb":200
},
"lustres":[
"projects/PROJECT_ID/locations/ZONE/instances/LUSTRE"
]
}
],
"orchestrator_spec":{
"slurm_spec":{
"home_directory_storage":"projects/PROJECT_ID/locations/ZONE/instances/FILESTORE",
"partitions":[
{
"id":"a4",
"node_pool_ids":[
"a4"
]
}
],
"login_node_pool_id":"login"
}
}
}

GPU with startup script

This example demonstrates how to add a custom script to a node pool. This script executes on all nodes in that pool at startup. To configure this, add the relevant fields to your node pool's definition in addition to the general settings. For a list of parameters and their descriptions, see Parameter reference.

{
"display_name":"DISPLAY_NAME",
"network":{
"network":"projects/PROJECT_ID/global/networks/NETWORK",
"subnetwork":"projects/PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK"
},
"node_pools":[
{
"id":"login",
"machine_spec":{
"machine_type":"n2-standard-8"
},
"scaling_spec":{
"min_node_count":MIN_NODE_COUNT,
"max_node_count":MAX_NODE_COUNT
},
"enable_public_ips":true,
"zone":"ZONE",
"boot_disk":{
"boot_disk_type":"pd-standard",
"boot_disk_size_gb":200
},
"startup_script":"#Example script\nsudo mkdir -p /data\necho 'Script Finished'\n",
},
{
"id":"a4",
"machine_spec":{
"machine_type":"a4-highgpu-8g",
"accelerator_type":"NVIDIA_B200",
"provisioning_model":"RESERVATION_NAME",
"accelerator_count":8,
"reservation_affinity":{
"reservationAffinityType":"RESERVATION_AFFINITY_TYPE",
"key":"compute.googleapis.com/reservation-name",
"values":[
"projects/PROJECT_ID/zones/ZONE/reservations/RESERVATION_NAME"
]
}
},
"scaling_spec":{
"min_node_count":MIN_NODE_COUNT,
"max_node_count":MAX_NODE_COUNT
},
"enable_public_ips":true,
"zone":"ZONE",
"boot_disk":{
"boot_disk_type":"hyperdisk-balanced",
"boot_disk_size_gb":200
},
"startup_script":"#Example script\nsudo mkdir -p /data\necho 'Script Finished'\n",
}
],
"orchestrator_spec":{
"slurm_spec":{
"home_directory_storage":"projects/PROJECT_ID/locations/ZONE/instances/FILESTORE",
"partitions":[
{
"id":"a4",
"node_pool_ids":[
"a4"
]
}
],
"login_node_pool_id":"login"
}
}
}
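
The startup_script value must be a single JSON string with escaped newlines, as shown in the fields above. If you keep the script in a separate file, one way to produce the escaped string is with jq (assuming jq is installed); the file name startup.sh is only an example:

# Print the contents of startup.sh as a single JSON-encoded string, with newlines escaped.
jq -Rs '.' startup.sh

You can then paste the resulting string as the value of the startup_script field in your configuration file.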

CPU only cluster

This configuration defines a CPU-only cluster with a login node pool and a CPU worker node pool, and no GPU partitions. As with the other examples, the JSON file acts as the blueprint for your cluster, specifying everything from its name and network settings to the hardware for its login and worker nodes.

For a list of parameters, see Parameter reference.

{
"display_name":"DISPLAY_NAME",
"network":{
"network":"projects/PROJECT_ID/global/networks/NETWORK",
"subnetwork":"projects/PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK"
},
"node_pools":[
{
"id":"cpu",
"machine_spec":{
"machine_type":"n2-standard-8"
},
"scaling_spec":{
"min_node_count":MIN_NODE_COUNT,
"max_node_count":MAX_NODE_COUNT
},
"zone":"ZONE",
"enable_public_ips":true,
"boot_disk":{
"boot_disk_type":"pd-standard",
"boot_disk_size_gb":120
}
},
{
"id":"login",
"machine_spec":{
"machine_type":"n2-standard-8",
}
"scaling_spec":{
"min_node_count":MIN_NODE_COUNT,
"max_node_count":MAX_NODE_COUNT
},
"zone":"ZONE",
"enable_public_ips":true,
"boot_disk":{
"boot_disk_type":"pd-standard",
"boot_disk_size_gb":120
}
}
],
"orchestrator_spec":{
"slurm_spec":{
"home_directory_storage":"projects/PROJECT_ID/locations/ZONE/instances/FILESTORE",
"partitions":[
{
"id":"cpu",
"node_pool_ids":[
"cpu"
]
}
],
"login_node_pool_id":"login"
}
}
}

CPU with advanced Slurm config

This example demonstrates how to customize the Slurm orchestrator with advanced parameters. Use this template if you need fine-grained control over job scheduling behavior, such as setting multifactor priority weights, configuring job preemption, and running prolog and epilog scripts for automated job setup and cleanup.

For a list of parameters, see Parameter reference.

{
"display_name":"DISPLAY_NAME",
"network":{
"network":"projects/PROJECT_ID/global/networks/NETWORK",
"subnetwork":"projects/PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK"
},
"node_pools":[
{
"id":"cpu",
"machine_spec":{
"machine_type":"n2-standard-8"
},
"scaling_spec":{
"min_node_count":MIN_NODE_COUNT,
"max_node_count":MAX_NODE_COUNT
},
"zone":"ZONE",
"enable_public_ips":true,
"boot_disk":{
"boot_disk_type":"pd-standard",
"boot_disk_size_gb":120
}
},
{
"id":"login",
"machine_spec":{
"machine_type":"n2-standard-8"
},
"scaling_spec":{
"min_node_count":MIN_NODE_COUNT,
"max_node_count":MAX_NODE_COUNT
},
"zone":"ZONE",
"enable_public_ips":true,
"boot_disk":{
"boot_disk_type":"pd-standard",
"boot_disk_size_gb":120
}
}
],
"orchestrator_spec":{
"slurm_spec":{
"home_directory_storage":"projects/PROJECT_ID/locations/ZONE/instances/FILESTORE",
"accounting":{
"accounting_storage_enforce":"ACCOUNTING_STORAGE_ENFORCE"
},
"scheduling":{
"priority_type":"PRIORITY_TYPE",
"priority_weight_age":PRIORITY_WEIGHT_AGE,
"priority_weight_assoc":PRIORITY_WEIGHT_ASSOC,
"priority_weight_fairshare":PRIORITY_WEIGHT_FAIRSHARE,
"priority_weight_job_size":PRIORITY_WEIGHT_JOB_SIZE,
"priority_weight_partition":PRIORITY_WEIGHT_PARTITION,
"priority_weight_qos":PRIORITY_WEIGHT_QOS,
"priority_weight_tres":"PRIORITY_WEIGHT_TRES",
"preempt_type":"PREEMPT_TYPE",
"preempt_mode":"PREEMPT_MODE",
"preempt_exempt_time":"PREEMPT_EXEMPT_TIME"
},
"prolog_bash_scripts":[
"#!/bin/bash\necho 'First prolog script running'",
"#!/bin/bash\necho 'Second prolog script running'"
],
"epilog_bash_scripts":[
"#!/bin/bash\necho 'Epilog script running'"
],
"partitions":[
{
"id":"cpu",
"node_pool_ids":[
"cpu"
]
}
],
"login_node_pool_id":"login"
}
}
}
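
After the cluster is running, you can check that these scheduler settings were applied by connecting to a login node and inspecting the Slurm configuration that the orchestrator generated. A quick sketch, assuming you have SSH access to a login node:

# Show the scheduler's priority and preemption settings as Slurm sees them.
scontrol show config | grep -iE 'priority|preempt'

# List the configured partitions and their nodes.
sinfo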

After you define your cluster in a JSON file, use the following REST API commands to deploy and manage it. The examples use a gcurl alias, a convenient, authenticated shortcut for interacting with the API endpoints. These commands cover the full lifecycle: deploying the cluster, getting its status, listing all clusters, updating a cluster, and deleting it.

Authentication

alias gcurl='curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"'
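The alias assumes that the gcloud CLI is installed and that you're authenticated with an account that has permission to create Vertex AI resources. If you haven't authenticated yet, a typical setup looks like this:

# Authenticate and set the default project used by gcloud.
gcloud auth login
gcloud config set project PROJECT_ID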

Create a JSON file

Create a JSON file (for example, cpu-cluster.json) to specify the configuration for your Managed Training cluster.
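
Before deploying, it's worth confirming that the file is valid JSON and that you've replaced all placeholders. A quick check, assuming jq is installed:

# Fails with a parse error if the file isn't valid JSON.
jq empty cpu-cluster.json

# Lists any placeholder values you may have forgotten to replace.
grep -nE 'PROJECT_ID|REGION|ZONE|NETWORK|SUBNETWORK|FILESTORE' cpu-cluster.json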

Deploy the cluster

Once you've created your JSON configuration file, you can deploy the cluster using the REST API.

Set environment variables

Before running the command, set the following environment variables. This keeps the API commands cleaner and easier to manage; a sample set of export commands follows the list.

  • PROJECT_ID: Your Google Cloud project ID where the cluster will be created.
  • REGION: The Google Cloud region for the cluster and its resources.
  • ZONE: The Google Cloud zone where the cluster resources will be provisioned.
  • CLUSTER_ID: A unique identifier for your Managed Training cluster, which is also used as a prefix for naming related resources.
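
For example, you might export them in your shell session as follows. The values shown are placeholders; replace them with your own.

export PROJECT_ID="PROJECT_ID"
export REGION="REGION"
export ZONE="ZONE"
export CLUSTER_ID="CLUSTER_ID"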

Run the create command

Now, execute the following gcurl command. It uses the JSON file (in this example, cpu-cluster.json) as the request body and the environment variables you just set to construct the API endpoint and query parameters.

gcurl -X POST -d @cpu-cluster.json "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/modelDevelopmentClusters?model_development_cluster_id=${CLUSTER_ID}"

Once the deployment starts, an Operation ID will be generated. Be sure to copy this ID. You'll need it to validate your cluster in the next step.

gcurl -X POST -d @cpu-cluster.json "https://us-central1-aiplatform.googleapis.com/v1beta1/projects/managedtraining-project/locations/us-central1/modelDevelopmentClusters?model_development_cluster_id=training"
{
"name":"projects/1059558423163/locations/us-central1/operations/2995239222190800896",
"metadata":{
"@type":"type.googleapis.com/google.cloud.aiplatform.v1beta1.CreateModelDevelopmentClusterOperationMetadata",
"genericMetadata":{
"createTime":"2025-10-24T14:16:59.233332Z",
"updateTime":"2025-10-24T14:16:59.233332Z"
},
"progressMessage":"Create Model Development Cluster request received, provisioning..."
}
}

Validate cluster deployment

Track the deployment's progress using the operation ID provided when you deployed the cluster. For example, 2995239222190800896 is the operation ID in the example cited earlier.

gcurl "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/operations/OPERATION_ID"
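
The operation is complete when the response includes "done": true. If you prefer to wait in a loop rather than polling manually, here is a minimal sketch, assuming jq is installed and the environment variables from the earlier step are set:

# Poll the long-running operation every 60 seconds until it reports done.
OP_URL="https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/operations/OPERATION_ID"
until gcurl "${OP_URL}" | jq -e '.done == true' > /dev/null; do
  echo "Still provisioning..."
  sleep 60
done
echo "Operation finished; inspect the final response for errors or the cluster resource."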
 

In summary

Submitting your cluster configuration with the gcurl POST command initiates the provisioning of your cluster, which is an asynchronous, long-running operation. The API immediately returns a response containing an Operation ID. It's crucial to save this ID, since you'll use it in the following steps to monitor the deployment's progress, verify that the cluster has been created successfully, and manage its lifecycle.
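
Once the cluster exists, the other lifecycle calls follow the same resource path. The following sketch shows the typical GET, LIST, and DELETE requests; an update is a PATCH to the same resource path with an update mask. Verify the exact methods and field names against the Vertex AI REST reference before relying on them.

# Get the status and configuration of a single cluster.
gcurl "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/modelDevelopmentClusters/${CLUSTER_ID}"

# List all clusters in the project and region.
gcurl "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/modelDevelopmentClusters"

# Delete a cluster. This also returns a long-running operation that you can poll.
gcurl -X DELETE "https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/modelDevelopmentClusters/${CLUSTER_ID}"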

Parameter reference

The following list describes all parameters used in the configuration examples. The parameters are organized into logical groups based on the resource they configure.

General and network settings

  • DISPLAY_NAME: A unique name for your Managed Training cluster. The string can only contain lowercase alphanumeric characters, must begin with a letter, and is limited to 10 characters.
  • PROJECT_ID: Your Google Cloud project ID.
  • REGION: The Google Cloud region where the cluster and its resources will be located.
  • NETWORK: The Virtual Private Cloud network to use for the cluster's resources.
  • ZONE: The Google Cloud zone for the cluster and its resources.
  • SUBNETWORK: The subnetwork to use for the cluster's resources.

Node pool configuration

The following parameters are used to define the node pools for both login and worker nodes.

Common node pool settings

  • ID: A unique identifier for the node pool within the cluster (for example, "login", "a4", "cpu").
  • MACHINE_TYPE: The machine type for the nodes in the pool. For GPU worker node pools, supported values are a3-megagpu-8g, a3-ultragpu-8g, and a4-highgpu-8g.
  • MIN_NODE_COUNT: The minimum number of nodes in the node pool. MIN_NODE_COUNT must be the same as MAX_NODE_COUNT.
  • MAX_NODE_COUNT: The maximum number of nodes in the node pool. For the login node pool, MAX_NODE_COUNT must be the same as MIN_NODE_COUNT.
  • ENABLE_PUBLIC_IPS: A boolean (true or false) that determines whether the nodes in the pool have public IP addresses.
  • BOOT_DISK_TYPE: The boot disk type for the nodes in the pool (for example, pd-standard, pd-ssd, hyperdisk-balanced).
  • BOOT_DISK_SIZE_GB: The boot disk size in GB for the nodes in the pool.

Worker-specific settings

  • ACCELERATOR_TYPE: The corresponding GPU accelerator to attach to the worker nodes. Supported values are:
    • NVIDIA_H100_MEGA_80GB
    • NVIDIA_H200_141GB
    • NVIDIA_B200
  • ACCELERATOR_COUNT: The number of accelerators to attach to each worker node.
  • PROVISIONING_MODEL: The provisioning model for the worker node (for example, ON_DEMAND, SPOT, RESERVATION, FLEX_START).
  • RESERVATION_AFFINITY_TYPE: The reservation affinity for the node pool (for example, SPECIFIC_RESERVATION).
  • RESERVATION_NAME: The name of the reservation to use for the node pool.

Orchestrator and storage configuration

These fields are defined within the orchestrator_spec.slurm_spec block of the JSON file.

Core Slurm and storage settings

  • FILESTORE (corresponds to home_directory_storage): The full resource name of the Filestore instance to be mounted as the /home directory.
  • LUSTRE (corresponds to lustres inside a node_pools object): A list of pre-existing Managed Lustre instances to mount on the cluster nodes for high-performance file access.
  • LOGIN_NODE_POOL_ID (corresponds to login_node_pool_id): The id of the node pool that should be used for login nodes.
  • partitions: A list of partition objects, where each object requires an id and a list of node_pool_ids.

Advanced Slurm settings

  • prolog_bash_scripts: A list of strings, where each string contains the full content of a Bash script to be executed before a job begins.
  • epilog_bash_scripts: A list of strings, where each string contains the full content of a Bash script to be executed after a job completes.
  • ACCOUNTING_STORAGE_ENFORCE: Controls which accounting enforcement options Slurm applies, such as requiring associations and enforcing limits or QOS (for example, associations,limits,qos).
  • PRIORITY_TYPE: The scheduling priority algorithm to be used (for example, priority/multifactor).
  • priority_weight_*: A set of integer values that assign weight to different factors in the scheduling priority calculation (for example, priority_weight_age, priority_weight_fairshare).
  • PREEMPT_TYPE: The preemption plugin to use (for example, preempt/partition_prio).
  • PREEMPT_MODE: The mode for the preemption plugin (for example, REQUEUE).
  • PREEMPT_EXEMPT_TIME: The time after a job starts during which it can't be preempted.
