Run Gemma with Kubernetes Engine
Google Kubernetes Engine (GKE) provides a wide range of deployment options for running Gemma models with high performance and low latency using your preferred development framework. Check out the following deployment guides for Hugging Face TGI, vLLM, and TensorRT-LLM on GPUs, for TPU execution with JetStream, and for data analysis and fine-tuning:
Deploy and serve
- Serve Gemma on GPUs with Hugging Face TGI: Deploy Gemma models on GKE using GPUs and the Hugging Face Text Generation Inference (TGI) framework (see the request sketch after this list).
- Serve Gemma on GPUs with vLLM: Deploy Gemma with vLLM for convenient model load management and high throughput.
- Serve Gemma on GPUs with TensorRT-LLM: Deploy Gemma with NVIDIA TensorRT-LLM to maximize model serving efficiency.
- Serve Gemma on TPUs with JetStream: Deploy Gemma with JetStream on TPU processors for high performance and low latency.
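Whichever server you choose, the result is an HTTP inference endpoint running in your cluster. As a minimal sketch of what a client call looks like, here is a request against a TGI deployment, assuming you have forwarded the service to your machine with `kubectl port-forward`; the service name and ports below are placeholders, not fixed values:

```python
import requests

# Assumes the TGI service has been forwarded locally first, e.g.:
#   kubectl port-forward service/llm-service 8080:8000
# (service name and ports are placeholders; use the ones from your deployment)
TGI_URL = "http://localhost:8080/generate"

payload = {
    "inputs": "Explain Kubernetes in one sentence.",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

# TGI's /generate endpoint returns {"generated_text": "..."}
response = requests.post(TGI_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```

vLLM, TensorRT-LLM, and JetStream expose their own request schemas, so adapt the payload to the server you deploy.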
Analyze data
- Analyze data on GKE using BigQuery, Cloud Run, and Gemma: Build a data analysis pipeline with BigQuery, Cloud Run, and Gemma, as sketched below.
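As a rough sketch of that pipeline's shape (not the tutorial's exact code), the following pulls rows from BigQuery and sends each one to a Gemma inference endpoint; the query, table, and service URL are hypothetical placeholders:

```python
import requests
from google.cloud import bigquery

# Hypothetical names for illustration; substitute your own project,
# dataset, and the URL of your deployed Gemma service.
BQ_QUERY = "SELECT review_text FROM `my-project.reviews.raw` LIMIT 10"
GEMMA_URL = "https://gemma-service-example.a.run.app/generate"

client = bigquery.Client()
for row in client.query(BQ_QUERY).result():
    payload = {
        "inputs": f"Summarize this review in one sentence: {row['review_text']}",
        "parameters": {"max_new_tokens": 64},
    }
    resp = requests.post(GEMMA_URL, json=payload, timeout=60)
    resp.raise_for_status()
    print(resp.json()["generated_text"])
```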
Fine-tune
- Fine-tune Gemma open models using multiple GPUs: Customize the behavior of Gemma using your own dataset; a fine-tuning sketch follows.
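Fine-tuning Gemma on GKE is commonly done with Hugging Face tooling. Below is a minimal single-process sketch of parameter-efficient (LoRA) fine-tuning with `transformers` and `peft`, assuming access to the gated Gemma weights on Hugging Face; the dataset and hyperparameters are illustrative only:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Requires accepting the Gemma license on Hugging Face and a valid HF token.
MODEL_ID = "google/gemma-2b"  # any Gemma checkpoint works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# LoRA trains a small set of adapter weights instead of the full model.
lora = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)

# Placeholder public dataset; substitute your own data in practice.
dataset = load_dataset("Abirate/english_quotes", split="train")
dataset = dataset.map(
    lambda x: tokenizer(x["quote"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-finetuned",
                           per_device_train_batch_size=1,
                           num_train_epochs=1, bf16=True),
    train_dataset=dataset,
    # mlm=False gives standard causal-LM labels (inputs shifted by one).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

When the same script is launched with `torchrun` or `accelerate launch`, the `Trainer` picks up distributed data parallelism automatically, which is what lets it scale across multiple GPUs on GKE.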