Visualizing jobs with Vertex AI TensorBoard

If you're interested in Vertex AI Managed Training, contact your sales representative for access.

With Managed Training, you can visualize your training logs in near real-time using Vertex AI TensorBoard. Simply configure your workload to save logs to a Cloud Storage bucket, and they will be automatically streamed to the TensorBoard interface for analysis.

Prerequisites

Before you begin, ensure you have the following:

  • A running Managed Training cluster.
  • A Cloud Storage bucket to store your TensorBoard logs. This bucket must be in the same region as your TensorBoard instance. For setup instructions, see Create a Cloud Storage bucket.
  • A Vertex AI TensorBoard instance. For creation instructions, see Create a Vertex AI TensorBoard instance.
  • The correct IAM permissions. To allow Cloud Storage FUSE to read from and write to the storage bucket, the service account used by your cluster's VMs requires the Storage Object User (roles/storage.objectUser) role.

Enabling TensorBoard upload

To configure the TensorBoard integration for your job, pass the following arguments using the --extra flag in your Slurm job submission:

  • tensorboard_base_output_dir: Specifies the Cloud Storage path to upload logs to. For example, gs://my-bucket/my-logs.

  • tensorboard_url: Specifies the Vertex AI TensorBoard instance, experiment, or run URL. If only an instance is provided, a new experiment and run are created. If omitted, the default TensorBoard instance for the project is used. For example, projects/123/locations/us-central1/tensorboards/456.

Example

# Using a specific TensorBoard instance
sbatch --extra="tensorboard_base_output_dir=<your-cloud-storage-dir>,tensorboard_url=projects/<project-id>/locations/<location>/tensorboards/<tensorboard-instance-id>" your_script.sbatch
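
If you submit jobs from a Python launcher rather than typing the command by hand, the same arguments can be assembled programmatically. The following is a minimal sketch, not part of the Managed Training tooling: the function name build_sbatch_command and its parameters are hypothetical, and the gs:// prefix check is an assumption based on the example path above.

```python
from typing import List, Optional


def build_sbatch_command(
    script: str,
    base_output_dir: str,
    tensorboard_url: Optional[str] = None,
) -> List[str]:
    """Build an sbatch argument list with the TensorBoard --extra arguments.

    Hypothetical helper for illustration only.
    """
    # Assumption: the log destination is a Cloud Storage path (gs://...).
    if not base_output_dir.startswith("gs://"):
        raise ValueError("tensorboard_base_output_dir must start with gs://")

    extra_args = [f"tensorboard_base_output_dir={base_output_dir}"]
    # tensorboard_url is optional; when omitted, the project's default
    # TensorBoard instance is used.
    if tensorboard_url:
        extra_args.append(f"tensorboard_url={tensorboard_url}")

    extra_value = ",".join(extra_args)
    return ["sbatch", f"--extra={extra_value}", script]
```

The returned list can be passed to subprocess.run(cmd, check=True) on a node where sbatch is available. Passing a list instead of a single shell string avoids quoting problems with the comma-separated --extra value.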

Writing logs from your training job

Within your training script, access the AIP_TENSORBOARD_LOG_DIR environment variable. This variable provides the unique Cloud Storage path where your script should write its TensorBoard logs.

The path follows this structure:

gs://<your-cloud-storage-path>/<cluster-id>-<cluster-uuid>/tensorboard/job-<job-id>/
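
As a rough illustration of how a script might consume this path, the sketch below reads the environment variable, fails with a clear message when it is missing, and extracts the job ID from the final path segment. The helper names get_log_dir and job_id_from_log_dir are hypothetical, and the regular expression is an assumption derived from the path structure shown above.

```python
import os
import re

# Assumed pattern, based on the documented structure:
# gs://<path>/<cluster-id>-<cluster-uuid>/tensorboard/job-<job-id>/
_JOB_DIR_PATTERN = re.compile(r"^gs://.+/tensorboard/job-(?P<job_id>[^/]+)/?$")


def get_log_dir() -> str:
    """Return the TensorBoard log directory, or raise if it is not set."""
    log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR")
    if not log_dir:
        raise RuntimeError(
            "AIP_TENSORBOARD_LOG_DIR is not set; was the job submitted "
            "with tensorboard_base_output_dir in --extra?"
        )
    return log_dir


def job_id_from_log_dir(log_dir: str) -> str:
    """Extract the job ID from a log directory that matches the pattern."""
    match = _JOB_DIR_PATTERN.match(log_dir)
    if match is None:
        raise ValueError(f"Unexpected log directory format: {log_dir}")
    return match.group("job_id")
```

Checking for the variable up front surfaces a misconfigured submission immediately, instead of failing later when the summary writer is created.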

The following example shows a complete workflow with two key components: the Slurm submission script that configures the job, and the Python training script that reads the environment variable to write its logs.

Slurm Job Script (simple_job.sbatch):

#!/bin/bash
#SBATCH --job-name=tensorboard-simple-test
#SBATCH --output=tensorboard-simple-test-%j.out

# Activate your Python virtual environment if needed
# source /path/to/your/venv/bin/activate

python3 simple_logger.py

Python Script (simple_logger.py):

import os

import tensorflow as tf

# Get the log directory from the environment variable
log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR")
print(f"Writing TensorBoard logs to: {log_dir}")

writer = tf.summary.create_file_writer(log_dir)
with writer.as_default():
    for step in range(10):
        # Simulate some metrics
        loss = 1.0 - (step * 0.1)
        accuracy = 0.6 + (step * 0.04)
        # Log the metrics
        tf.summary.scalar('loss', loss, step=step)
        tf.summary.scalar('accuracy', accuracy, step=step)
        writer.flush()
        print(f"Step {step}: loss={loss:.4f}, accuracy={accuracy:.4f}")

writer.close()
print(f"--- Finished writing metrics to {log_dir} ---")

Real-time log synchronization

To visualize metrics from a job while it is still running, periodically close and recreate the summary writer in your training code. This is necessary because Cloud Storage FUSE (gcsfuse) syncs a log file to Cloud Storage only after the file is closed. Cycling the writer in this way ensures that intermediate results are visible in the TensorBoard console before the job completes.
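
One way to apply this is to wrap the writer in a small helper that closes and recreates it every fixed number of steps. The following is a minimal sketch under stated assumptions, not an official API: the class RotatingSummaryWriter, its rotate_every parameter, and the make_tf_writer factory are hypothetical names, and the interval should be tuned to your own workload.

```python
import os
from typing import Any, Callable


class RotatingSummaryWriter:
    """Closes and recreates a summary writer every `rotate_every` steps.

    Hypothetical helper: `make_writer` is any zero-argument callable that
    returns an object with a close() method.
    """

    def __init__(self, make_writer: Callable[[], Any], rotate_every: int = 100):
        if rotate_every < 1:
            raise ValueError("rotate_every must be at least 1")
        self._make_writer = make_writer
        self._rotate_every = rotate_every
        self._steps_since_rotation = 0
        self.writer = make_writer()

    def step(self) -> None:
        """Call once per training step, after logging that step's metrics."""
        self._steps_since_rotation += 1
        if self._steps_since_rotation >= self._rotate_every:
            # Closing the writer closes the current log file, which is the
            # point at which it becomes eligible for sync.
            self.writer.close()
            self.writer = self._make_writer()
            self._steps_since_rotation = 0

    def close(self) -> None:
        """Close the current writer at the end of training."""
        self.writer.close()


def make_tf_writer() -> Any:
    """Create a TensorFlow summary writer for the job's log directory."""
    import tensorflow as tf  # Imported lazily so the helper has no hard dependency.

    return tf.summary.create_file_writer(os.environ["AIP_TENSORBOARD_LOG_DIR"])
```

In a training loop, you would create rotating = RotatingSummaryWriter(make_tf_writer, rotate_every=50), log each step's scalars inside with rotating.writer.as_default():, call rotating.step() after each step, and call rotating.close() when training ends. A shorter interval makes metrics appear sooner but produces more event files in the log directory.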

Viewing Vertex AI TensorBoard

After you submit your job, you can monitor its progress on the Vertex AI Experiments page in the Google Cloud console.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-10-31 UTC.