Run Gemma with Kubernetes Engine
Google Kubernetes Engine (GKE) provides a wide range of deployment options for running Gemma models with high performance and low latency using your preferred development framework. Check out the following deployment guides for Hugging Face TGI, vLLM, and TensorRT-LLM on GPUs, for TPU execution with JetStream, and for data analysis and fine-tuning:
Deploy and serve
- Serve Gemma on GPUs with Hugging Face TGI: Deploy Gemma models on GKE using GPUs and the Hugging Face Text Generation Inference (TGI) framework (see the request sketch after this list).
- Serve Gemma on GPUs with vLLM: Deploy Gemma with vLLM for convenient model load management and high throughput.
- Serve Gemma on GPUs with TensorRT-LLM: Deploy Gemma with NVIDIA TensorRT-LLM to maximize model serving efficiency.
- Serve Gemma on TPUs with JetStream: Deploy Gemma with JetStream on TPU processors for high performance and low latency.
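Whichever server you choose, the result is an HTTP inference endpoint running in your cluster. As a minimal sketch of what a client call looks like, here is a request against a TGI deployment, assuming you have forwarded the service to your machine with `kubectl port-forward`; the service name and ports below are placeholders, not fixed values:

```python
import requests

# Assumes the TGI service has been forwarded locally first, e.g.:
#   kubectl port-forward service/llm-service 8080:8000
# (service name and ports are placeholders; use the ones from your deployment)
TGI_URL = "http://localhost:8080/generate"

payload = {
    "inputs": "Explain Kubernetes in one sentence.",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

# TGI's /generate endpoint returns {"generated_text": "..."}
response = requests.post(TGI_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```

vLLM, TensorRT-LLM, and JetStream expose their own request schemas, so adapt the payload to the server you deploy.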
Analyze data
- Analyze data on GKE using BigQuery, Cloud Run, and Gemma: Build a data analysis pipeline with BigQuery, Cloud Run, and Gemma, as sketched below.
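As a rough sketch of that pipeline's shape (not the tutorial's exact code), the following pulls rows from BigQuery and sends each one to a Gemma inference endpoint; the query, table, and service URL are hypothetical placeholders:

```python
import requests
from google.cloud import bigquery

# Hypothetical names for illustration; substitute your own project,
# dataset, and the URL of your deployed Gemma service.
BQ_QUERY = "SELECT review_text FROM `my-project.reviews.raw` LIMIT 10"
GEMMA_URL = "https://gemma-service-example.a.run.app/generate"

client = bigquery.Client()
for row in client.query(BQ_QUERY).result():
    payload = {
        "inputs": f"Summarize this review in one sentence: {row['review_text']}",
        "parameters": {"max_new_tokens": 64},
    }
    resp = requests.post(GEMMA_URL, json=payload, timeout=60)
    resp.raise_for_status()
    print(resp.json()["generated_text"])
```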
Fine-tune
- Fine-tune Gemma open models using multiple GPUs: Customize the behavior of Gemma using your own dataset; a fine-tuning sketch follows.
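Fine-tuning Gemma on GKE is commonly done with Hugging Face tooling. Below is a minimal single-process sketch of parameter-efficient (LoRA) fine-tuning with `transformers` and `peft`, assuming access to the gated Gemma weights on Hugging Face; the dataset and hyperparameters are illustrative only:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Requires accepting the Gemma license on Hugging Face and a valid HF token.
MODEL_ID = "google/gemma-2b"  # any Gemma checkpoint works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# LoRA trains a small set of adapter weights instead of the full model.
lora = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)

# Placeholder public dataset; substitute your own data in practice.
dataset = load_dataset("Abirate/english_quotes", split="train")
dataset = dataset.map(
    lambda x: tokenizer(x["quote"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-finetuned",
                           per_device_train_batch_size=1,
                           num_train_epochs=1, bf16=True),
    train_dataset=dataset,
    # mlm=False gives standard causal-LM labels (inputs shifted by one).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

When the same script is launched with `torchrun` or `accelerate launch`, the `Trainer` picks up distributed data parallelism automatically, which is what lets it scale across multiple GPUs on GKE.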