This page explains how to migrate your GKE Inference Gateway setup from the
preview v1alpha2 API to the generally available v1 API.
This document is intended for platform administrators and networking specialists
who are using the v1alpha2 version of the GKE Inference Gateway and want
to upgrade to the v1 version to use the latest features.
Before you start the migration, ensure you are familiar with the concepts and
deployment of the GKE Inference Gateway. We recommend you review Deploy
GKE Inference Gateway.
Before you begin
Before you start the migration, determine if you need to follow this guide.
Check for existing v1alpha2 APIs
To check if you're using the v1alpha2 GKE Inference Gateway API, run the
following commands:
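The following is a sketch, assuming the v1alpha2 CRDs use the upstream inference.networking.x-k8s.io API group:

kubectl get inferencepools.inference.networking.x-k8s.io --all-namespaces
kubectl get inferencemodels.inference.networking.x-k8s.io --all-namespaces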
The output of these commands determines if you need to migrate:
If either command returns one or more InferencePool or InferenceModel
resources, you are using the v1alpha2 API and must follow this guide.
If both commands return No resources found, you are not using the
v1alpha2 API. You can proceed with a fresh installation of the v1
GKE Inference Gateway.
Migration paths
There are two paths for migrating from v1alpha2 to v1:
Simple migration (with downtime): this path is faster and simpler but
results in a brief period of downtime. It is the recommended path if you
don't require a zero-downtime migration.
Zero-downtime migration: this path is for users who cannot afford any
service interruption. It involves running both v1alpha2 and v1 stacks
side-by-side and gradually shifting traffic.
Simple migration (with downtime)
This section describes how to perform a simple migration with downtime.
Delete existing v1alpha2 resources: to delete the v1alpha2
resources, choose one of the following options:
Option 1: Uninstall using Helm
helm uninstall HELM_PREVIEW_INFERENCEPOOL_NAME
Option 2: Manually delete resources
If you are not using Helm, manually delete all resources associated with
your v1alpha2 deployment:
Update or delete the HTTPRoute to remove the backendRef that points
to the v1alpha2 InferencePool.
Delete the v1alpha2 InferencePool, any InferenceModel resources
that point to it, and the corresponding Endpoint Picker (EPP) Deployment
and Service, as sketched after this list.
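For example, the cleanup might look like the following. The resource names are hypothetical placeholders; substitute the names from your own deployment:

# Hypothetical resource names; substitute your own.
kubectl delete inferencemodels.inference.networking.x-k8s.io food-review
kubectl delete inferencepools.inference.networking.x-k8s.io vllm-llama3-8b-instruct
# Delete the Endpoint Picker (EPP) Deployment and Service for the pool.
kubectl delete deployment vllm-llama3-8b-instruct-epp
kubectl delete service vllm-llama3-8b-instruct-epp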
After all v1alpha2 custom resources are deleted, remove the Custom
Resource Definitions (CRDs) from your cluster:
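Assuming the v1alpha2 CRDs use the upstream inference.networking.x-k8s.io API group, the removal looks like this:

kubectl delete crd inferencemodels.inference.networking.x-k8s.io
kubectl delete crd inferencepools.inference.networking.x-k8s.io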
After you remove the CRDs, install the v1 GKE Inference Gateway and send a
test request to your gateway. Ensure you receive a successful response with a
200 response code.
Zero-downtime migration
This migration path is designed for users who cannot afford any service
interruption. The following diagram illustrates how GKE Inference Gateway
facilitates serving multiple generative AI models, a key aspect of a
zero-downtime migration strategy.
Figure: GKE Inference Gateway routing requests to different generative AI models based on model name and priority
Distinguishing API versions with kubectl
During the zero-downtime migration, both v1alpha2 and v1 CRDs are installed
on your cluster. This can create ambiguity when using kubectl to query for
InferencePool resources. To ensure you are interacting with the correct
version, you must use the full resource name:
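For example, assuming the v1alpha2 API group is inference.networking.x-k8s.io and the v1 API group is inference.networking.k8s.io:

# v1alpha2 resources
kubectl get inferencepools.inference.networking.x-k8s.io
# v1 resources
kubectl get inferencepools.inference.networking.k8s.io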
The v1 API also provides a convenient short name, infpool, which you can use
to query v1 resources specifically:
kubectl get infpool
Stage 1: Side-by-side v1 deployment
In this stage, you deploy the new v1 InferencePool stack alongside the existing
v1alpha2 stack, which allows for a safe, gradual migration.
After you finish all the steps in this stage, your infrastructure resembles
the following diagram:
Figure: GKE Inference Gateway routing requests to different generative AI models based on model name and priority
Install the required Custom Resource Definitions (CRDs) in your GKE cluster:
For GKE versions earlier than 1.34.0-gke.1626000, run the following command to install both the v1 InferencePool and alpha InferenceObjective CRDs:
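The following sketch assumes that you install the CRDs from the upstream Gateway API Inference Extension v1.0.0 release manifest; verify the exact version and URL for your environment:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml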
Use Helm to install a new v1 InferencePool with a distinct release name,
such as vllm-llama3-8b-instruct-ga. The InferencePool must target the
same Model Server pods as the alpha InferencePool using
inferencePool.modelServers.matchLabels.app.
To install the InferencePool, use the following command:
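The following is a sketch, assuming the upstream inferencepool chart published at registry.k8s.io and a hypothetical model server label value of vllm-llama3-8b-instruct; adjust both for your deployment:

helm install vllm-llama3-8b-instruct-ga \
  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
  --set provider.name=gke \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool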
As part of migrating to the v1.0 release of the Gateway API Inference
Extension, you must also migrate from the alpha InferenceModel API to the new
InferenceObjective API.
Apply the following YAML to create the InferenceObjective resources:
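A minimal sketch, assuming the InferenceObjective API is served at inference.networking.x-k8s.io/v1alpha2; the object names and priority values are hypothetical placeholders:

kubectl apply -f - <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: food-review
spec:
  priority: 1
  poolRef:
    name: vllm-llama3-8b-instruct-ga
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: base-model
spec:
  priority: 2
  poolRef:
    name: vllm-llama3-8b-instruct-ga
EOF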
Stage 2: Traffic splitting
With both stacks running, you can start shifting traffic from v1alpha2 to v1
by updating the HTTPRoute to split traffic between the two InferencePool
resources. The following example shows a 50-50 split.
Update HTTPRoute for traffic splitting.
To update the HTTPRoute for traffic splitting, run the following command:
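A sketch of the split, assuming a hypothetical HTTPRoute named llm-route attached to the inference-gateway gateway, with the old pool named vllm-llama3-8b-instruct and the new pool named vllm-llama3-8b-instruct-ga:

kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    # 50% of traffic stays on the v1alpha2 InferencePool.
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
      weight: 50
    # 50% of traffic shifts to the new v1 InferencePool.
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct-ga
      weight: 50
EOF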
After applying the changes, monitor the performance and stability of the new
v1 stack. Verify that the inference-gateway gateway has a PROGRAMMED
status of TRUE.
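To check the status, list the gateway; the PROGRAMMED column appears in the default output:

kubectl get gateway inference-gateway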
Stage 3: Finalization and cleanup
After you verify that the v1 InferencePool is stable, you can direct all
traffic to it and decommission the old v1alpha2 resources.
Shift 100% of traffic to the v1 InferencePool.
To shift 100 percent of traffic to the v1 InferencePool, run the following
command:
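Continuing the earlier sketch, update the same hypothetical HTTPRoute so that the v1 InferencePool is the only backend:

kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    # All traffic now goes to the v1 InferencePool.
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct-ga
      weight: 100
EOF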
Send a test request to the gateway and ensure you receive a successful
response with a 200 response code.
Clean up v1alpha2 resources.
After confirming the v1 stack is fully operational, safely remove the old
v1alpha2 resources.
Check for remaining v1alpha2 resources.
Now that you've migrated to the v1 InferencePool API, it's safe to
delete the old CRDs. Repeat the checks in Check for existing v1alpha2 APIs to
ensure you no longer have any v1alpha2 resources in use. If any remain, you
can continue the migration process for them.
Delete v1alpha2 CRDs.
After all v1alpha2 custom resources are deleted, remove the Custom
Resource Definitions (CRDs) from your cluster:
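As in the simple migration path, assuming the upstream v1alpha2 CRD names:

kubectl delete crd inferencemodels.inference.networking.x-k8s.io
kubectl delete crd inferencepools.inference.networking.x-k8s.io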
After completing all steps, your infrastructure should resemble the
following diagram:
Figure: GKE Inference Gateway routing requests to different generative AI models based on model name and priority
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025年10月24日 UTC."],[],[]]