Serve an LLM with multiple GPUs in GKE
This tutorial demonstrates how to deploy and serve a large language model (LLM) using multiple GPUs on GKE for efficient and scalable inference. You create a GKE cluster that uses multiple L4 GPUs, and you prepare the infrastructure to serve any of the following models:
- Llama 3 70b
- Mixtral 8x7b
- Falcon 40b
Depending on the data format of the model, the required number of GPUs varies. In this tutorial, each model uses two L4 GPUs. To learn more, see Calculate the amount of GPUs.
This tutorial is intended for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who are interested in using Kubernetes container orchestration capabilities to serve LLMs. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.
Before reading this page, ensure that you're familiar with the following:
Objectives
In this tutorial, you:
- Create a cluster and node pools.
- Prepare your workload.
- Deploy your workload.
- Interact with the LLM interface.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Some models have additional requirements. Ensure that you meet the following requirements:
- To access models from Hugging Face, use a Hugging Face token.
- For the Mixtral 8x7b model, accept the conditions for the Mistral Mixtral model.
- For the Llama 3 70b model, make sure you have an active license for the Meta Llama models.
Prepare your environment
In the Google Cloud console, start a Cloud Shell instance.

Set the default environment variables:

gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CONTROL_PLANE_LOCATION=us-central1

Replace PROJECT_ID with your Google Cloud project ID.
Create a GKE cluster and node pool
You can serve LLMs on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Autopilot
In Cloud Shell, run the following command:
gcloud container clusters create-auto l4-demo \
  --project=${PROJECT_ID} \
  --location=${CONTROL_PLANE_LOCATION} \
  --release-channel=rapid
GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.
Configure kubectl to communicate with your cluster:

gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION}
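Optionally, you can confirm that kubectl is pointing at the new cluster before you deploy anything. This quick check isn't part of the tutorial flow; both commands are standard kubectl:

# Confirm the current context points at the l4-demo cluster
kubectl cluster-info
# Autopilot provisions GPU nodes on demand, so GPU nodes appear only after Pods request them
kubectl get nodes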
Standard
In Cloud Shell, run the following command to create a Standard cluster that uses Workload Identity Federation for GKE:
gcloud container clusters create l4-demo \
  --location ${CONTROL_PLANE_LOCATION} \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --enable-image-streaming \
  --node-locations=${CONTROL_PLANE_LOCATION}-a \
  --machine-type n2d-standard-4 \
  --num-nodes 1 --min-nodes 1 --max-nodes 5 \
  --release-channel=rapid
The cluster creation might take several minutes.
Run the following command to create a node pool for your cluster:
gcloud container node-pools create g2-standard-24 --cluster l4-demo \
  --location ${CONTROL_PLANE_LOCATION} \
  --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
  --machine-type g2-standard-24 \
  --enable-autoscaling --enable-image-streaming \
  --num-nodes=0 --min-nodes=0 --max-nodes=3 \
  --node-locations ${CONTROL_PLANE_LOCATION}-a,${CONTROL_PLANE_LOCATION}-c \
  --spot
GKE creates the following resources for the LLM:
- A public Google Kubernetes Engine (GKE) Standard edition cluster.
- A node pool with the g2-standard-24 machine type, scaled down to 0 nodes. You aren't charged for any GPUs until you launch Pods that request GPUs. This node pool provisions Spot VMs, which are priced lower than the default standard Compute Engine VMs and provide no guarantee of availability. You can remove the --spot flag from this command, and the cloud.google.com/gke-spot node selector in the text-generation-inference.yaml config, to use on-demand VMs.
Configure kubectl to communicate with your cluster:

gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION}
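Optionally, verify that the GPU node pool was created as expected. These checks aren't required by the tutorial; both commands are standard gcloud and kubectl calls:

# List node pools in the cluster; g2-standard-24 should appear alongside the default pool
gcloud container node-pools list --cluster l4-demo --location ${CONTROL_PLANE_LOCATION}
# The GPU node pool scales from zero, so no nodes are listed until a Pod requests GPUs
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-l4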
Prepare your workload
This section shows how to set up your workload depending on the model that you want to use. This tutorial uses Kubernetes Deployments to deploy the model. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.
Llama 3 70b
Set the default environment variables:
export HF_TOKEN=HUGGING_FACE_TOKEN

Replace HUGGING_FACE_TOKEN with your Hugging Face token.

Create a Kubernetes secret for the Hugging Face token:

kubectl create secret generic l4-demo \
  --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
Create the following text-generation-inference.yaml Deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310
        resources:
          requests:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: meta-llama/Meta-Llama-3-70B-Instruct
        - name: NUM_SHARD
          value: "2"
        - name: MAX_INPUT_TOKENS
          value: "2048"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: l4-demo
              key: HUGGING_FACE_TOKEN
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        # mountPath is set to /tmp because that's where the HUGGINGFACE_HUB_CACHE environment
        # variable in the TGI DLCs points, instead of the default /data used in the upstream TGI
        # image. This is where the model downloaded from the Hub is stored.
        - mountPath: /tmp
          name: ephemeral-volume
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: ephemeral-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: ephemeral
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "premium-rwo"
              resources:
                requests:
                  storage: 150Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
        cloud.google.com/gke-spot: "true"
In this manifest:
- NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
- QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4-bit precision instead of 32-bit. This allows GKE to reduce the amount of GPU memory needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the GPUs to request, see Calculate the amount of GPUs.
Apply the manifest:
kubectl apply -f text-generation-inference.yaml
The output is similar to the following:
deployment.apps/llm created
Verify the status of the model:
kubectl get deploy
The output is similar to the following:
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
llm    1/1     1            1           20m
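If the Deployment isn't ready yet, you can optionally wait for the rollout instead of polling, and inspect Pod events if scheduling or GPU provisioning stalls. This is a minimal sketch using standard kubectl commands; downloading and sharding the model can take a long time:

# Block until the Deployment reports all replicas available
kubectl rollout status deployment/llm --timeout=30m
# If the Pod stays Pending, check scheduling and node provisioning events
kubectl describe pods -l app=llm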
View the logs from the running deployment:
kubectl logs -l app=llm
The output is similar to the following:
{"timestamp":"2024年03月09日T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291} {"timestamp":"2024年03月09日T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328} {"timestamp":"2024年03月09日T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329} {"timestamp":"2024年03月09日T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
Mixtral 8x7b
Set the default environment variables:
export HF_TOKEN=HUGGING_FACE_TOKEN

Replace HUGGING_FACE_TOKEN with your Hugging Face token.

Create a Kubernetes secret for the Hugging Face token:

kubectl create secret generic l4-demo \
  --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
Create the following text-generation-inference.yaml Deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
        resources:
          requests:
            cpu: "5"
            memory: "40Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "5"
            memory: "40Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: mistralai/Mixtral-8x7B-Instruct-v0.1
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: l4-demo
              key: HUGGING_FACE_TOKEN
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        # mountPath is set to /tmp because that's where the HF_HOME environment variable in the
        # TGI DLCs points, instead of the default /data used in the upstream TGI image. This is
        # where the model downloaded from the Hub is stored.
        - mountPath: /tmp
          name: ephemeral-volume
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: ephemeral-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: ephemeral
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "premium-rwo"
              resources:
                requests:
                  storage: 100Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
        cloud.google.com/gke-spot: "true"
In this manifest:
- NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
- QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4-bit precision instead of 32-bit. This allows GKE to reduce the amount of GPU memory needed and improves the inference speed. However, this may reduce model accuracy. To learn how to calculate the GPUs to request, see Calculate the amount of GPUs.
Apply the manifest:
kubectl apply -f text-generation-inference.yaml
The output is similar to the following:
deployment.apps/llm created
Verify the status of the model:
watch kubectl get deploy
When the Deployment is ready, the output is similar to the following:
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
llm    1/1     1            1           10m
To exit the watch, type CTRL + C.

View the logs from the running deployment:
kubectl logs -l app=llm
The output is similar to the following:
{"timestamp":"2024年03月09日T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291} {"timestamp":"2024年03月09日T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328} {"timestamp":"2024年03月09日T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329} {"timestamp":"2024年03月09日T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
Falcon 40b
Create the following text-generation-inference.yaml Deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.1-4.ubuntu2204.py310
        resources:
          requests:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "10"
            memory: "60Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: tiiuae/falcon-40b-instruct
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        # mountPath is set to /data because that's where the HUGGINGFACE_HUB_CACHE environment
        # variable points in the TGI container image. This is where the model downloaded from
        # the Hub is stored.
        - mountPath: /data
          name: ephemeral-volume
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: ephemeral-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: ephemeral
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "premium-rwo"
              resources:
                requests:
                  storage: 175Gi
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
        cloud.google.com/gke-spot: "true"
In this manifest:
- NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
- QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4-bit precision instead of 32-bit. This allows GKE to reduce the amount of GPU memory needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the GPUs to request, see Calculate the amount of GPUs.
Apply the manifest:
kubectl apply -f text-generation-inference.yaml
The output is similar to the following:
deployment.apps/llm created
Verify the status of the model:
watch kubectl get deploy
When the Deployment is ready, the output is similar to the following:

NAME   READY   UP-TO-DATE   AVAILABLE   AGE
llm    1/1     1            1           10m
To exit the watch, type CTRL + C.

View the logs from the running deployment:
kubectl logs -l app=llm
The output is similar to the following:
{"timestamp":"2024年03月09日T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291} {"timestamp":"2024年03月09日T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328} {"timestamp":"2024年03月09日T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329} {"timestamp":"2024年03月09日T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
Create a Service of type ClusterIP
Expose your Pods internally within the cluster so they can be discovered and accessed by other applications.
Create the following llm-service.yaml manifest:

apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
Apply the manifest:
kubectl apply -f llm-service.yaml
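A ClusterIP Service is reachable only from inside the cluster. To optionally confirm that it routes traffic to the model server, you can run a short-lived Pod that sends a request to it. The curlimages/curl image below is an assumption; any image that includes curl works:

# Run a temporary Pod that calls the Service through cluster DNS (port 80 -> targetPort 8080)
kubectl run tgi-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s http://llm-service/generate -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 32}}'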
Deploy a chat interface
Use Gradio to build a web application that lets you interact with your model. Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.
Llama 3 70b
Create a file named gradio.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "meta-llama/Meta-Llama-3-70B-Instruct"
        - name: USER_PROMPT
          value: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>prompt<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
        - name: SYSTEM_PROMPT
          value: "prompt<|eot_id|>"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860
Apply the manifest:
kubectl apply -f gradio.yaml
Find the external IP address of the Service:
kubectl get svc
The output is similar to the following:
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
Copy the external IP address from the EXTERNAL-IP column.

View the model interface from your web browser by using the external IP address with the exposed port:
http://EXTERNAL_IP
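If you prefer not to wait for the load balancer to be assigned an external IP address, or external load balancers are restricted in your project, you can optionally reach the Gradio interface through a port-forward instead:

# Forward a local port to gradio-service, then open http://localhost:8080 in your browser
kubectl port-forward service/gradio-service 8080:80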
Mixtral 8x7b
Create a file named gradio.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "mixtral-8x7b"
        - name: USER_PROMPT
          value: "[INST]prompt[/INST]"
        - name: SYSTEM_PROMPT
          value: "prompt"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860
Apply the manifest:
kubectl apply -f gradio.yaml
Find the external IP address of the Service:
kubectl get svc
The output is similar to the following:
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
Copy the external IP address from the EXTERNAL-IP column.

View the model interface from your web browser by using the external IP address with the exposed port:
http://EXTERNAL_IP
Falcon 40b
Create a file named gradio.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gradio
  labels:
    app: gradio
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: gradio
  template:
    metadata:
      labels:
        app: gradio
    spec:
      containers:
      - name: gradio
        image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
        resources:
          requests:
            cpu: "512m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
        env:
        - name: CONTEXT_PATH
          value: "/generate"
        - name: HOST
          value: "http://llm-service"
        - name: LLM_ENGINE
          value: "tgi"
        - name: MODEL_ID
          value: "falcon-40b-instruct"
        - name: USER_PROMPT
          value: "User:prompt"
        - name: SYSTEM_PROMPT
          value: "Assistant:prompt"
        ports:
        - containerPort: 7860
---
apiVersion: v1
kind: Service
metadata:
  name: gradio-service
spec:
  type: LoadBalancer
  selector:
    app: gradio
  ports:
  - port: 80
    targetPort: 7860
Apply the manifest:
kubectl apply -f gradio.yaml
Find the external IP address of the Service:
kubectl get svc
The output is similar to the following:
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
Copy the external IP address from the EXTERNAL-IP column.

View the model interface from your web browser by using the external IP address with the exposed port:
http://EXTERNAL_IP
Calculate the amount of GPUs
The number of GPUs that you need depends on the value of QUANTIZE. In this tutorial, QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4-bit precision.

A 70-billion-parameter model requires a minimum of 40 GB of GPU memory: 70 billion parameters at 4 bits each is 280 gigabits, or 35 GB, plus about 5 GB of overhead. A single L4 GPU doesn't have enough memory, so the examples in this tutorial use two L4 GPUs with 48 GB of combined memory (2 x 24 GB). This configuration is sufficient to run Falcon 40b or Llama 3 70b on L4 GPUs.
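As a rough sanity check, this sketch reproduces the arithmetic in the shell. The 5 GB overhead figure comes from the paragraph above, and the estimate deliberately ignores runtime extras such as the KV cache:

# Approximate GPU memory: parameters * bits per parameter / 8 (bits per byte), plus overhead
PARAMS=70000000000   # 70 billion parameters
BITS=4               # bitsandbytes-nf4 loads weights in 4-bit precision
OVERHEAD_GB=5
awk -v p="$PARAMS" -v b="$BITS" -v o="$OVERHEAD_GB" \
  'BEGIN { printf "~%.0f GB needed; two L4 GPUs provide %d GB\n", p * b / 8 / 1e9 + o, 2 * 24 }'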
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the cluster
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, delete the GKE cluster:
gcloud container clusters delete l4-demo --location ${CONTROL_PLANE_LOCATION}