AI/ML orchestration on GKE documentation

Google Kubernetes Engine (GKE) provides a single, unified platform to orchestrate your entire AI/ML lifecycle. It gives you the power and flexibility to supercharge your training, inference, and agentic workloads, so you can streamline your infrastructure and start delivering results. GKE's state-of-the-art orchestration capabilities provide the following:

  • Hardware accelerators: access and manage the high-powered GPUs and TPUs you need, for both training and inference, at scale.
  • Stack flexibility: integrate with the distributed computing, data processing, and model serving frameworks you already know and trust.
  • Managed Kubernetes simplicity: get all the benefits of a managed platform to automate, scale, and enhance the security of your entire AI/ML lifecycle while maintaining flexibility.

Explore our blogs, tutorials, and best practices to see how GKE can optimize your AI/ML workloads. For more information about benefits and available features, see the Introduction to AI/ML workloads on GKE overview.

Documentation resources

Find quickstarts and guides, review key references, and get help with common issues.
Explore self-paced training, use cases, reference architectures, and code samples with examples of how to use and connect Google Cloud services.
Training and tutorials

Optimize AI and ML workloads with Cloud Storage and GKE

Learn how to use Cloud Storage FUSE to optimize performance for AI and ML workloads on GKE.

AI/ML Inference AI/ML Training Storage
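
As a rough sketch of the pattern this tutorial covers, the following manifest mounts a Cloud Storage bucket into a Pod through the Cloud Storage FUSE CSI driver. The image, service account, and bucket name are hypothetical placeholders, not values from the tutorial.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  annotations:
    gke-gcsfuse/volumes: "true"       # injects the Cloud Storage FUSE sidecar
spec:
  serviceAccountName: training-ksa    # hypothetical KSA with access to the bucket
  containers:
  - name: trainer
    image: us-docker.pkg.dev/my-project/repo/trainer:latest   # hypothetical image
    volumeMounts:
    - name: dataset
      mountPath: /data
      readOnly: true
  volumes:
  - name: dataset
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-training-data  # hypothetical bucket
        mountOptions: implicit-dirs   # treat bucket prefixes as directories
```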

Optimize AI and ML workloads with Managed Lustre and GKE

Learn how to use Managed Lustre to optimize performance for AI and ML workloads on GKE.

AI/ML Inference AI/ML Training Storage

Isolate AI code execution with Agent Sandbox

Learn how to install and run the Agent Sandbox controller on GKE, and deploy a sandboxed environment on the cluster for testing untrusted shell commands.

Tutorial Agent Sandbox Agentic AI

Deploy an agentic AI application on GKE with the Agent Development Kit (ADK) and a self-hosted LLM

Learn how to deploy and manage a containerized agentic AI application on GKE, using the Agent Development Kit (ADK) and vLLM for scalable inference with Llama 3.1.

Tutorial AI/ML Inference Agentic AI

Deploy an agentic AI application on GKE with the Agent Development Kit (ADK) and Vertex AI

Learn how to deploy and manage a containerized agentic AI application on GKE, using the Agent Development Kit (ADK) and Vertex AI for scalable inference with Gemini 2.0 Flash.

Tutorial AI/ML Inference Agentic AI

Serve open source models using TPUs on GKE with Optimum TPU

Learn how to deploy LLMs using Tensor Processing Units (TPUs) on GKE with the Optimum TPU serving framework from Hugging Face.

Tutorial AI/ML Inference TPU

Serve LLMs on GKE with a cost-optimized and high-availability GPU provisioning strategy

Learn how to optimize costs for LLM-serving workloads on GKE using DWS Flex-start.

Cost optimization GPU DWS

Serving Large Language Models with KubeRay on TPUs

Learn how to serve large language models (LLMs) with KubeRay on TPUs, and how this can help improve the performance of your models.

Video Ray TPUs

Accelerate AI/ML data loading with Hyperdisk ML

Learn how to simplify and accelerate the loading of AI/ML model weights on GKE using Hyperdisk ML.

Tutorial AI/ML Data Loading
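
As a hedged sketch of the idea, the manifest below defines a Hyperdisk ML StorageClass and a read-only-many claim created from a snapshot of a disk previously loaded with model weights. The snapshot name and size are hypothetical; the hydration step itself is what the tutorial walks through.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-ml
provisioner: pd.csi.storage.gke.io
parameters:
  type: hyperdisk-ml              # Hyperdisk ML disk type
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  storageClassName: hyperdisk-ml
  accessModes: ["ReadOnlyMany"]   # many nodes attach the same weights read-only
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: weights-snapshot        # hypothetical snapshot of a preloaded disk
  resources:
    requests:
      storage: 300Gi              # hypothetical size
```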

Serve an LLM using TPUs on GKE with JetStream and PyTorch

Learn how to serve an LLM using Tensor Processing Units (TPUs) on GKE with JetStream and PyTorch.

Tutorial AI/ML Inference TPUs
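
To give a feel for how TPU slices are requested on GKE, here is a minimal sketch of a serving Deployment. The image is a hypothetical placeholder, and the accelerator type and topology (a single-host TPU v5e 2x4 slice with 8 chips) are one possible choice, not the tutorial's exact values.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jetstream-server            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jetstream
  template:
    metadata:
      labels:
        app: jetstream
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  # TPU v5e
        cloud.google.com/gke-tpu-topology: 2x4                      # single-host slice
      containers:
      - name: server
        image: us-docker.pkg.dev/my-project/repo/jetstream:latest   # hypothetical image
        resources:
          requests:
            google.com/tpu: "8"     # all 8 chips of the 2x4 slice
          limits:
            google.com/tpu: "8"
```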

Best practices for optimizing LLM inference with GPUs on GKE

Learn best practices for optimizing LLM inference performance with GPUs on GKE using the vLLM and Text Generation Inference (TGI) serving frameworks.

Tutorial AI/ML Inference GPUs

Manage the GPU Stack with the NVIDIA GPU Operator on GKE

Learn when to use the NVIDIA GPU Operator and how to enable it on GKE.

Tutorial GPUs

Configure autoscaling for LLM workloads on TPUs

Learn how to set up your autoscaling infrastructure by using the GKE Horizontal Pod Autoscaler (HPA) to deploy the Gemma LLM using single-host JetStream.

Tutorial TPUs

Fine-tune Gemma open models using multiple GPUs on GKE

Learn how to fine-tune the Gemma LLM using GPUs on GKE with the Hugging Face Transformers library.

Tutorial AI/ML Inference GPUs

Deploy a Ray Serve application with a Stable Diffusion model on GKE with TPUs

Learn how to deploy and serve a Stable Diffusion model on GKE using TPUs, Ray Serve, and the Ray Operator add-on.

Tutorial AI/ML Inference Ray TPUs

Configure autoscaling for LLM workloads on GPUs with GKE

Learn how to set up your autoscaling infrastructure by using the GKE Horizontal Pod Autoscaler (HPA) to deploy the Gemma LLM with the Hugging Face Text Generation Inference (TGI) serving framework.

Tutorial GPUs
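
For orientation, a minimal HorizontalPodAutoscaler of the kind this tutorial builds might look like the sketch below, assuming a Deployment named tgi-server and a server metric (here tgi_batch_current_size) already exported to the autoscaler through a custom metrics adapter. Names and thresholds are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tgi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tgi-server                 # hypothetical TGI Deployment serving Gemma
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: tgi_batch_current_size # assumed metric surfaced by a metrics adapter
      target:
        type: AverageValue
        averageValue: "10"           # illustrative per-pod batch-size target
```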

Train Llama2 with Megatron-LM on A3 Mega virtual machines

Learn how to run a container-based, Megatron-LM PyTorch workload on A3 Mega.

Tutorial AI/ML Training GPUs

Deploy GPU workloads in Autopilot

Learn how to request hardware accelerators (GPUs) in your GKE Autopilot workloads.

Tutorial GPUs
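
The core of this tutorial is a small Pod spec: in Autopilot you select a GPU type with a node selector and set a GPU resource limit, and GKE provisions matching nodes. A minimal sketch, with the GPU type and image chosen for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4   # GPU type to provision
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]                       # prints the attached GPU
    resources:
      limits:
        nvidia.com/gpu: "1"                       # number of GPUs for the container
```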

Serve an LLM with multiple GPUs in GKE

Learn how to serve Llama 2 70B or Falcon 40B using multiple NVIDIA L4 GPUs with GKE.

Tutorial AI/ML Inference GPUs

Getting started with Ray on GKE

Learn how to easily start using Ray on GKE by running a workload on a Ray cluster.

Tutorial Ray
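
With the Ray Operator add-on enabled on the cluster, a workload like the one in this tutorial starts from a RayCluster resource. The sketch below is a minimal, hypothetical example; the image version and resource sizing are illustrative.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: demo-cluster
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0      # assumed Ray image and version
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
  workerGroupSpecs:
  - groupName: workers
    replicas: 2
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
```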

Serve an LLM on L4 GPUs with Ray

Learn how to serve Falcon 7B, Llama 2 7B, Falcon 40B, or Llama 2 70B using the Ray framework in GKE.

Tutorial AI/ML Inference Ray GPUs

Orchestrate TPU Multislice workloads using JobSet and Kueue

Learn how to orchestrate a JAX workload on multiple TPU slices on GKE by using JobSet and Kueue.

Tutorial TPUs
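
As a hedged sketch of the shape of such a workload, the JobSet below runs one Job per TPU slice and is queued through Kueue by label. The queue name, image, accelerator type, and topology are hypothetical placeholders, not the tutorial's exact values.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-job
  labels:
    kueue.x-k8s.io/queue-name: tpu-queue   # hypothetical LocalQueue
spec:
  replicatedJobs:
  - name: slice
    replicas: 2                    # one Job per TPU slice
    template:
      spec:
        completionMode: Indexed
        parallelism: 4             # hosts per slice for this topology
        completions: 4
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice  # illustrative
              cloud.google.com/gke-tpu-topology: 2x2x4
            containers:
            - name: jax-worker
              image: us-docker.pkg.dev/my-project/repo/jax:latest   # hypothetical image
              resources:
                limits:
                  google.com/tpu: "4"   # chips per host
```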

Monitoring GPU workloads on GKE with NVIDIA Data Center GPU Manager (DCGM)

Learn how to observe GPU workloads on GKE with NVIDIA Data Center GPU Manager (DCGM).

Tutorial AI/ML Observability GPUs

Quickstart: Train a model with GPUs on GKE Standard clusters

This quickstart shows you how to deploy a training model with GPUs on GKE and store the predictions in Cloud Storage.

Tutorial AI/ML Training GPUs

Running large-scale machine learning on GKE

This video shows how GKE helps solve common challenges of training large AI models at scale, and the best practices for training and serving large-scale machine learning models on GKE.

Video AI/ML Training AI/ML Inference

TensorFlow on GKE Autopilot with GPU acceleration

This blog post is a step-by-step guide to the creation, execution, and teardown of a TensorFlow-enabled Jupyter notebook.

Blog AI/ML Training AI/ML Inference GPUs

Implement a Job queuing system with quota sharing between namespaces on GKE

This tutorial uses Kueue to show you how to implement a Job queuing system, and configure workload resource and quota sharing between different namespaces on GKE.

Tutorial AI/ML Batch
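
The quota-sharing mechanism the tutorial configures rests on Kueue cohorts: ClusterQueues in the same cohort can borrow each other's unused quota, and each namespace submits through its own LocalQueue. A minimal sketch with hypothetical names and quotas:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-queue
spec:
  cohort: all-teams              # queues in one cohort share unused quota
  namespaceSelector: {}          # illustrative: admit from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: "10"       # hypothetical quotas
      - name: memory
        nominalQuota: 40Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-local
  namespace: team-a              # hypothetical namespace
spec:
  clusterQueue: team-a-queue
```

Jobs then opt in to the queue with the label kueue.x-k8s.io/queue-name: team-a-local.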

Build a RAG chatbot with GKE and Cloud Storage

This tutorial shows you how to integrate a Large Language Model application based on retrieval-augmented generation with PDF files that you upload to a Cloud Storage bucket.

Tutorial AI/ML Data Loading

Analyze data on GKE using BigQuery, Cloud Run, and Gemma

This tutorial shows you how to analyze big datasets on GKE by leveraging BigQuery for data storage and processing, Cloud Run for request handling, and a Gemma LLM for data analysis and predictions.

Tutorial AI/ML Data Loading

Use cases

Distributed data preprocessing with GKE and Ray: Scaling for the enterprise

Learn how to leverage GKE and Ray to efficiently preprocess large datasets for machine learning.

MLOps Training Ray

Data loading best practices for AI/ML inference on GKE

Learn how to speed up data loading times for your machine learning applications on Google Kubernetes Engine.

Inference Hyperdisk ML Cloud Storage FUSE

Save on GPUs: Smarter autoscaling for your GKE inferencing workloads

Learn how to optimize your GPU inference costs by fine-tuning GKE's Horizontal Pod Autoscaler for maximum efficiency.

Inference GPU HPA

Efficiently serve optimized AI models with NVIDIA NIM microservices on GKE

Learn how to deploy cutting-edge NVIDIA NIM microservices on GKE with ease and accelerate your AI workloads.

AI NVIDIA NIM

Accelerate Ray in production with new Ray Operator on GKE

Learn how Ray Operator on GKE simplifies your AI/ML production deployments, boosting performance and scalability.

AI TPU Ray

Maximize your LLM serving throughput for GPUs on GKE — a practical guide

Learn how to maximize large language model (LLM) serving throughput for GPUs on GKE, including infrastructure decisions and model server optimizations.

LLM GPU NVIDIA

Best practices for running batch workloads on GKE

Learn how to build and optimize batch processing platforms on GKE.

Batch Performance Cost optimization

High performance AI/ML storage through Local SSD support on GKE

Learn how to use Local SSDs to provide high-performance AI/ML storage on GKE.

AI NVMe Local SSD

Machine learning with JAX on Kubernetes with NVIDIA GPUs

Learn how to run JAX multi-GPU, multi-node applications on GKE with NVIDIA GPUs.

GPUs JAX ML

Search engines made simple: A low-code approach with GKE and Vertex AI Agent Builder

How to build a search engine with Google Cloud, using Vertex AI Agent Builder, Vertex AI Search, and GKE.

Search Agent Vertex AI

LiveX AI reduces customer support costs with AI agents trained and served on GKE and NVIDIA AI

How LiveX AI uses GKE to build AI agents that enhance customer satisfaction and reduce costs.

GenAI NVIDIA GPU

Infrastructure for a RAG-capable generative AI application using GKE and Cloud SQL

Reference architecture for running a generative AI application with retrieval-augmented generation (RAG) using GKE, Cloud SQL, Ray, Hugging Face, and LangChain.

GenAI RAG Ray

Reference architecture for a batch processing platform on GKE

Reference architecture for a batch processing platform on GKE in Standard mode, using Kueue to manage resource quotas.

AI Kueue Batch

Innovating in patent search: How IPRally leverages AI with GKE and Ray

How IPRally uses GKE and Ray to build a scalable, efficient ML platform for faster patent searches with better accuracy.

AI Ray GPU

Performance deep dive of Gemma on Google Cloud

Leverage Gemma on Cloud GPUs and Cloud TPUs for inference and training efficiency on GKE.

AI Gemma Performance

Gemma on GKE deep dive: New innovations to serve open generative AI models

Use best-in-class Gemma open models to build portable, customizable AI applications and deploy them on GKE.

AI Gemma Performance

Advanced scheduling for AI/ML with Ray and Kueue

Orchestrate Ray applications in GKE with KubeRay and Kueue.

Kueue Ray KubeRay

How to secure Ray on Google Kubernetes Engine

Apply security insights and hardening techniques for training AI/ML workloads using Ray on GKE.

AI Ray Security

Design storage for AI and ML workloads in Google Cloud

Select the best combination of storage options for AI and ML workloads on Google Cloud.

AI ML Storage

Automatic driver installation simplifies using NVIDIA GPUs in GKE

Automatically install NVIDIA GPU drivers in GKE.

GPU NVIDIA Installation

Accelerate your generative AI journey with NVIDIA NeMo framework on GKE

Train generative AI models using GKE and NVIDIA NeMo framework.

GenAI NVIDIA NeMo

Why GKE for your Ray AI workloads?

Improve scalability, cost-efficiency, fault tolerance, isolation, and portability by using GKE for Ray workloads.

AI Ray Scale

Simplifying MLOps using Weights & Biases with Google Kubernetes Engine

Simplify the model development and deployment process using Weights & Biases with GKE.

Cost optimization TPUs GPUs

Running AI on fully managed GKE, now with new compute options, pricing and resource reservations

Get improved GPU support, better performance, and lower pricing for AI/ML workloads with GKE Autopilot.

GPU Autopilot Performance

How SEEN scaled output 89x and reduced GPU costs by 66% using GKE

Startup scales personalized video output with GKE.

GPU Scale Containers

How Spotify is unleashing ML Innovation with Ray and GKE

How Ray is transforming ML development at Spotify.

ML Ray Containers

How Ordaōs Bio takes advantage of generative AI on GKE

Ordaōs Bio, one of the leading AI accelerators for biomedical research and discovery, is finding novel immunotherapies for oncology and chronic inflammatory disease.

Performance TPU Cost optimization

GKE from a growing startup powered by ML

How Moloco, a Silicon Valley startup, harnessed the power of GKE and TensorFlow Enterprise to supercharge its machine learning (ML) infrastructure.

ML Scale Cost optimization

Improving launch time of Stable Diffusion on GKE by 4x

Learn how to improve the launch time of Stable Diffusion on GKE.

Performance Scaling PD

Code samples

Google Kubernetes Engine (GKE) Samples

View sample applications used in official GKE product tutorials.

GKE AI Labs Samples

View experimental samples for leveraging GKE to accelerate your AI/ML initiatives.

GKE Accelerated Platforms

View reference architectures and solutions for deploying accelerated workloads on GKE.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025年11月24日 UTC.