Serverless Gemma 12B on Azure Container Apps


and model details:

> check model details
 I will call the get_model_details tool on the MCP server to retrieve the active vLLM serving parameters and
 configurations.
 ● gpu-devops-agent/get_model_details(Call get_model_details tool) (ctrl+o to expand)
 Here are the serving model details retrieved from the active vLLM instance:
 ### 🧩 Model Details
 • Public Endpoint: https://gemma4-vllm-gpu.gentlefield-612b24aa.westus2.azurecontainerapps.io
 • Model ID: google/gemma-4-12B-it-qat-w4a16-ct 
 • Object Type: model 
 • Owned By: vllm 
 • API Health Endpoint Status: Healthy ✅
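The Model ID, Object Type, and Owned By fields line up with what an OpenAI-compatible `/v1/models` response carries. The article does not show the tool's source, so the following is only a minimal sketch of how such a payload might be reduced to the summary above. The helper name `summarize_models` and the sample payload are hypothetical.

```python
import json

# Illustrative payload shaped like an OpenAI-compatible /v1/models response.
# The values mirror the transcript above; the payload itself is an assumption.
SAMPLE = json.dumps({
    "object": "list",
    "data": [
        {
            "id": "google/gemma-4-12B-it-qat-w4a16-ct",
            "object": "model",
            "owned_by": "vllm",
        }
    ],
})


def summarize_models(raw: str) -> list[dict]:
    """Extract the id, object type, and owner of each served model."""
    payload = json.loads(raw)
    return [
        {
            "model_id": entry["id"],
            "object_type": entry["object"],
            "owned_by": entry.get("owned_by", "unknown"),
        }
        for entry in payload.get("data", [])
    ]


print(summarize_models(SAMPLE))
```

In a real tool the `raw` string would come from an HTTP GET against the serving endpoint rather than from a constant.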

and the endpoint:

> get_endpoint
 Here is the active vLLM service URL:
 🟢 vLLM Endpoint: https://gemma4-vllm-gpu.gentlefield-612b24aa.westus2.azurecontainerapps.io

and status:

> get_status
 Here is the verified GPU vLLM System Status:
 ### 🌀 GPU vLLM System Status
 • vLLM Health: 🟢 Online (https://gemma4-vllm-gpu.gentlefield-612b24aa.westus2.azurecontainerapps.io)
 • Hosting Status: 🟢 ACA Succeeded ( gemma4-vllm-gpu )

and metrics:

> get_metrics
 ### 📈 Request & Concurrency Metrics
 • Successfully Processed Requests: 2 (both completed via natural stop token)
 • Currently Active Running Requests: 0 
 • Currently Waiting Requests (Queued): 0 
 • KV-Cache Memory Usage: 0.0% 
 ### 🚀 Throughput & Token Stats
 • Total Prefill (Prompt) Tokens Processed: 36 tokens
 • Total Generated (Completion) Tokens Processed: 63 tokens
 ### ⏱️ Latency Metrics
 • Average Time to First Token (TTFT): 1.81 seconds (sum: 3.62s over 2 requests)
 • Average End-to-End Latency: 2.96 seconds (sum: 5.92s over 2 requests)
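The averages above are reported as a sum divided by a request count, which is how Prometheus-style histograms expose latency. As a rough sketch, assuming the tool scrapes the server's Prometheus text output, the calculation might look like the code below. The metric names and the sample text are assumptions for illustration and should be checked against the actual `/metrics` output of the deployed vLLM version.

```python
import re

# Illustrative scrape text; the numbers mirror the transcript above.
SAMPLE_METRICS = """\
# HELP vllm:time_to_first_token_seconds Time to first token.
vllm:time_to_first_token_seconds_sum{model_name="gemma"} 3.62
vllm:time_to_first_token_seconds_count{model_name="gemma"} 2
vllm:e2e_request_latency_seconds_sum{model_name="gemma"} 5.92
vllm:e2e_request_latency_seconds_count{model_name="gemma"} 2
"""

# metric name, optional {labels}, then the sample value
LINE = re.compile(r"^(?P<name>[^\s{]+)(?:\{[^}]*\})?\s+(?P<value>\S+)$")


def parse_metrics(text: str) -> dict:
    """Sum sample values per metric name, ignoring labels and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            name = match["name"]
            values[name] = values.get(name, 0.0) + float(match["value"])
    return values


def average(values: dict, base: str):
    """Return sum/count for a histogram, or None if nothing was recorded."""
    count = values.get(base + "_count", 0.0)
    if not count:
        return None
    return values.get(base + "_sum", 0.0) / count


metrics = parse_metrics(SAMPLE_METRICS)
print(round(average(metrics, "vllm:time_to_first_token_seconds"), 2))
print(round(average(metrics, "vllm:e2e_request_latency_seconds"), 2))
```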

Check the Azure Instance

Once deployed, the model is visible in the Azure console:

Review the Model

Check the model details:

> what are the model details?
 Here are the active model deployment details:
 • Serving Endpoint: https://gemma4-vllm-gpu.gentlefield-612b24aa.westus2.azurecontainerapps.io
 • Model ID: google/gemma-4-12B-it-qat-w4a16-ct 
 • Quantization Format: compressed-tensors (4-bit weights, 16-bit activations)
 • API Engine / Server: vLLM 
 • VRAM Allocation (Tesla T4): 95% GPU memory utilization, 18,187 KV cache tokens
 • API Health status: Healthy (Online) ✅
> query_gemma4_with_stats what is Gemma?
 I will call the query_gemma4_with_stats tool on the MCP server to query the model and retrieve both the generated
 response and model performance statistics (like TTFT and total generation time).
 ● gpu-devops-agent/query_gemma4_with_stats(Call query_gemma4_with_stats tool) (ctrl+o to expand)
 Here is the response from the serving Gemma 4 model along with the query performance statistics:
 ### 💬 Model Response
 │ Depending on the context, Gemma usually refers to one of two very popular things: a Google AI model or a gemstone.
 │
 │ Here is a breakdown of both:
 │
 │ ### 1. In Technology: Google’s AI Model
 │
 │ Gemma is a family of open-model Large Language Models (LLMs) developed by Google DeepMind.
 │
 │ • Relationship to Gemini: It is built from the same research, technology, and infrastructure used to create Google's
 │ most capable AI model, Gemini.
 │ • "Open" Nature: Unlike the "closed" Gemini models (which are accessed primarily via API or Google products), Gemma is
 │ released as "open weights." This means developers and researchers can download, customize, and run the model on their
 │ own hardware (like a personal laptop or a private cloud).
 │ • Size Varieties: Gemma is designed to be lightweight. It comes in different sizes (such as 2B, 7B, 9B, and 27B
 │ parameters), making it easy for people who don't have massive supercomputers to run high-quality AI locally.
 │ • Capabilities: It can understand and generate text, reason, and even process images or audio (depending on the
 │ version). It is commonly used by developers to build chatbots, summarize documents, or power coding assistants.
 │ ──────
 │ ### 2. In Gemology: The Gemstone
 │
 │ Gemma is the Latin word for "gem" or "precious stone." In the world of jewelry and geology:
 │
 │ • Gemma is often used as a prefix or root for various precious stones (e.g., Gemmae refers to jewels).
 │ • However, if you are looking for a specific stone, you might be thinking of Garnet, Garner, or a specific variety of
 │ Jade.
 │ ──────
 │ ### 3. In Other Contexts
 │
 │ • Name: Gemma is a popular female given name of Italian origin, meaning "gem" or "precious stone."
 │ • Software/Coding: There are various smaller libraries or internal projects in coding communities named "Gemma," but
 │ they are less common than the Google AI model.
 │
 │ Summary: If you are asking about AI, Gemma is Google's "mini" version of Gemini that anyone can download and use for
 │ free.
 ### 📊 Performance Stats
 • Serving Model: google/gemma-4-12B-it-qat-w4a16-ct 
 • Time to First Token (TTFT): 0.26 seconds 
 • Total Generation Time: 19.68 seconds 
 • Generation Throughput: 25.74 tokens/s (~500 tokens total)
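TTFT, total generation time, and throughput can all be derived by timing a streamed response. The sketch below shows one way to do that. It is not the source of `query_gemma4_with_stats`, which the article does not include, and it counts streamed chunks as a stand-in for tokens, which is only an approximation. Whether the reported throughput divides by total time or by decode time alone is not stated, so the sketch uses total time.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a stream of chunks and report TTFT, total time, and throughput."""
    start = clock()
    first_chunk_at = None
    count = 0
    for _ in chunks:
        now = clock()
        if first_chunk_at is None:
            first_chunk_at = now  # arrival of the first chunk defines TTFT
        count += 1
    total = clock() - start
    return {
        "chunks": count,
        "ttft_s": None if first_chunk_at is None else first_chunk_at - start,
        "total_s": total,
        "chunks_per_s": count / total if total > 0 else 0.0,
    }


# Usage with an in-memory stream; a real caller would pass the chunk
# iterator of a streaming chat-completion response instead.
print(measure_stream(iter(["Gem", "ma", " is", " a", " model"])))
```

The `clock` parameter exists so the timing logic can be exercised with a fake clock instead of real waiting.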

Benchmark the Local Model

The local model was benchmarked, and the agent summarized the results:

> summarize benchmark_report.md
 Here is a summary of the performance results from benchmark_report.md:
 ### 🕒 Average Latency Highlights
 • Low Concurrency (1–4 users): Extremely fast response times ( 0.33s – 1.5s ) for all valid context windows (up to
 2048 tokens).
 • High Concurrency (1024–2048 users): Latency scales up to 23s – 32s due to queuing under heavy load.
 • Tipping Point: Latency remains under 3s up to 128 concurrent users for short prompts (<= 64 tokens) and under 8s 
 for large prompts (512 tokens).
 ### 🚀 Throughput Highlights
 • Max Throughput: Tops out at 47.7 requests/sec (4 tokens context window at 1024 users).
 • Prefabricated Prefill Impact: Throughput drops linearly as context window size increases, reaching a maximum
 throughput of 5.2 requests/sec at 2048 tokens.
 • Concurrency Saturation: Throughput plateaus once concurrency reaches 128–256 users; additional users increase latency
 rather than total request throughput.
 ### 🚫 Limit Violations
 • 4096 tokens and above: Reports 0.0% success rate and 0.0 throughput across all concurrencies because the prompt
 length exceeds the serving engine's maximum limit ( max-model-len = 4096 ).
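The report's three headline numbers per cell (success rate, average latency, requests per second) can be produced by firing a fixed number of concurrent requests and timing them. The benchmark script itself is not shown in the article, so this is only a sketch of one concurrency level under stated assumptions: `run_level` and `fake_send` are hypothetical names, and `fake_send` stands in for a real HTTP call to the endpoint.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_level(send: Callable[[], Awaitable[bool]], concurrency: int) -> dict:
    """Fire `concurrency` requests at once and summarize the outcome."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    latencies = []
    successes = 0

    async def one_request() -> None:
        nonlocal successes
        started = time.perf_counter()
        try:
            ok = await send()
        except Exception:
            ok = False  # a rejected or failed request counts as a failure
        latencies.append(time.perf_counter() - started)
        successes += bool(ok)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(concurrency)))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "success_rate": successes / concurrency,
        "avg_latency_s": sum(latencies) / len(latencies),
        "throughput_rps": successes / wall if wall > 0 else 0.0,
    }


async def fake_send() -> bool:
    """Stand-in for an HTTP request; sleeps briefly and reports success."""
    await asyncio.sleep(0.01)
    return True


result = asyncio.run(run_level(fake_send, 8))
print(result)
```

A `send` that always fails yields a 0.0 success rate and 0.0 throughput, which matches the shape of the rows the report shows for prompts over the length limit.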

This is visualized:

GPU Memory Allocation of QAT 12B Models

The Tesla T4 provides 16 GB of GPU memory:

> what is the memory breakdown on the GPU?
 Here is the approximate memory breakdown on the NVIDIA Tesla T4 GPU (16 GB VRAM) running the
 google/gemma-4-12B-it-qat-w4a16-ct model via vLLM:
 ### 📟 VRAM Allocation Overview (16 GB Total)
 VRAM Category | Size (GiB) | % of total GPU | Purpose
 ---------------------------|--------------|----------------|-----------------------------------------------------------
 Model Weights | ~ 6.50 GiB | ~ 40.6% | Gemma-4 12B QAT 4-bit weights
 KV Cache | 5.85 GiB | 36.6% | Reserved space for context tokens ( 18,187 FP8 tokens)
 CUDA / Workspace Overhead | ~ 2.85 GiB | ~ 17.8% | Run-time activations, CUDA context, graph capture
 Free / System Reserve | 0.80 GiB | 5.0% | Unallocated (enforced by --gpu-memory-utilization 0.95 )
 ──────
 ### 🔍 Key Metrics Details
 1. Model Compression Advantage:
 • An unquantized 12B parameter model in bfloat16 requires at least 24 GB of VRAM just for the weights (which
 exceeds the Tesla T4's 16 GB limit).
 • Thanks to QAT (4-bit quantization), weights occupy only ~6.5 GB, leaving ample space (5.85 GB) for the KV cache
 on a single cost-effective T4 GPU.
 2. KV Cache Efficiency:
 • Quantized to FP8 (via --kv-cache-dtype fp8 ), cutting the token memory footprint in half.
 • This provides a total capacity of 18,187 tokens (5,990 blocks of size 16), which translates to a maximum
 concurrency of 4.44x concurrent users at a full 4,096 context length.
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>
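Several figures in this breakdown follow from simple arithmetic, reproduced here as a quick check. The 12e9 parameter count is taken at face value from the "12B" label, and the raw 4-bit estimate ignores quantization scales and any layers kept at higher precision, which plausibly accounts for the gap between 6.0 GB and the reported ~6.5 GiB.

```python
PARAMS = 12e9                  # "12B" taken at face value (assumption)
TOTAL_GIB = 16.0               # Tesla T4 VRAM as reported above
GPU_MEMORY_UTILIZATION = 0.95  # --gpu-memory-utilization
KV_CACHE_TOKENS = 18_187       # KV cache capacity as reported above
MAX_MODEL_LEN = 4_096          # --max-model-len


def weights_gb(params: float, bits_per_weight: int) -> float:
    """Raw weight storage in GB, ignoring scales and metadata."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = weights_gb(PARAMS, 16)                        # unquantized weights
int4_gb = weights_gb(PARAMS, 4)                         # 4-bit weights
budget_gib = TOTAL_GIB * GPU_MEMORY_UTILIZATION         # memory vLLM may claim
reserve_gib = TOTAL_GIB - budget_gib                    # left unallocated
full_context_seqs = KV_CACHE_TOKENS / MAX_MODEL_LEN     # sequences at 4,096 tokens

print(f"bf16 weights:  {bf16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"vLLM budget:   {budget_gib:.2f} GiB, reserve {reserve_gib:.2f} GiB")
print(f"full-length sequences in KV cache: {full_context_seqs:.2f}")
```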

What are the performance characteristics of the Model?

> what are the performance characteristics of the deployment
Here are the core performance characteristics of the vLLM deployment running the Gemma-4 12B QAT model on an NVIDIA 
 Tesla T4 GPU (Azure Container Apps):
### 1. Latency Profile (TTFT & E2E)
• Time to First Token (TTFT): 0.26 seconds under low concurrency, indicating highly optimized prefill compilation.
 • Average E2E Latency: 2.96 seconds for a standard completion task.
 • Queuing Scaling: Latency scales linearly with user concurrency once the queue saturates (reaching ~21s at 256 users
 and ~31s at 1024+ users).
### 2. Throughput & Scaling
• Prefill Throughput Peak: Achieves 47.7 requests per second for very small context queries (4 tokens) at concurrency
 1024.
 • Payload Throughput Bounds: Peak throughput drops to 5.2 requests per second at the model's active 2048-token context
 window boundary.
 • Saturation Plateau: The deployment saturates around 128–256 concurrent users. Adding users beyond this point
 increases queuing latency without improving aggregate throughput.
### 3. Context & Limits
• Maximum Context Size: Strictly capped at 4096 tokens (as configured by --max-model-len ). Queries equal to or larger
 than 4096 tokens are rejected instantaneously (0% success rate).
 • KV-Cache Pool Size: Accommodates 18,187 concurrent tokens using FP8 KV cache quantization.
### 4. Hardware Efficiency (The QAT Advantage)
• Datacenter Accessibility: Quantization-Aware Training (QAT) fits the 12B parameter model into a cost-efficient 16 GB 
 Tesla T4 GPU while reserving 5.85 GB for the active cache.
 • Without QAT, this model would require an A10 or L4 GPU (24 GB VRAM) to start up successfully.
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>

One last cross-check: what are the trade-offs of the deployment?

> what are the trade-offs on the deployment?
Here are the primary technical and financial trade-offs in this deployment architecture:
### 1. Model Quantization: QAT 4-bit (INT4) vs. Full Precision (BF16)
• The Trade-off: VRAM Savings vs. Mathematical Precision
 • Pros: Quantization-Aware Training (QAT) reduces model size to ~6.5 GB, allowing a 12B model to be served on a cheap
 16 GB GPU while leaving room for the KV cache.
 • Cons: There is a minor quality degradation in reasoning and function-calling precision compared to the native BF16
 checkpoint (though QAT retains ~99% of baseline performance, far outperforming standard Post-Training Quantization).
### 2. GPU Hardware: Tesla T4 vs. NVIDIA L4
• The Trade-off: Infrastructure Cost vs. Generation Speed
 • Pros: The Tesla T4 (NC8as-T4 workload profile on Azure) is significantly cheaper and more widely available in cloud
 regions than modern NVIDIA L4 GPUs.
 • Cons: The T4 runs on the older Turing architecture and lacks native hardware Tensor Core support for bfloat16 
 execution. This results in slower token generation speeds (~25 tokens/s) than an L4 GPU.
### 3. KV Cache Format: FP8 vs. FP16 Cache
• The Trade-off: Concurrency Capacity vs. Long-Context Recall
 • Pros: Compressing the KV cache to 8-bit ( --kv-cache-dtype fp8 ) cuts memory requirements in half, increasing token
 capacity to 18,187 tokens.
 • Cons: Quantizing attention keys and values can introduce small rounding errors in attention calculations, slightly
 degrading retrieval accuracy in long-context prompts (approaching the 4K boundary).
### 4. Active Context Cap: 4,096 tokens vs. 32,768 tokens (Model Native)
• The Trade-off: Server Stability vs. Maximum Document Size
 • Pros: Capping the context at 4096 tokens ( --max-model-len 4096 ) guarantees the system will not exhaust VRAM or
 crash under massive concurrent requests.
 • Cons: The model cannot process long documents, large codebases, or extended chat histories that exceed 4K tokens,
 rejecting them immediately.
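Three of these trade-offs correspond to server flags quoted in the transcripts: `--max-model-len`, `--kv-cache-dtype`, and `--gpu-memory-utilization`. The sketch below only assembles them into an argument list so the knobs can be seen side by side. The `vllm serve` entry point and the helper name `vllm_serve_args` are assumptions; the article does not show the actual container command, which may carry additional flags.

```python
import shlex

MODEL_ID = "google/gemma-4-12B-it-qat-w4a16-ct"


def vllm_serve_args(
    model: str = MODEL_ID,
    max_model_len: int = 4096,
    kv_cache_dtype: str = "fp8",
    gpu_memory_utilization: float = 0.95,
) -> list[str]:
    """Build a launch argument list from the three flags named in the article."""
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--kv-cache-dtype", kv_cache_dtype,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


print(shlex.join(vllm_serve_args()))
```

Raising `max_model_len` on a larger GPU would be a one-argument change here, which is the practical meaning of the context-cap trade-off.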

So where is the best place to run Gemma? AWS or GCP? Azure?

Survey says...

> compare aws vs gcp vs azure performance
> summarize cloud_comparison_report.md
 Here is a quick summary of cloud_comparison_report.md:
 ### 📋 Key Metrics Comparison
 Metric | AWS EC2 (VM + L4) | GCP Cloud Run (Serverless + L4) | Azure ACA (Serverless + T4)
 ---------------------------|---------------------------|---------------------------------|-----------------------------
 Max Context Size | 16,384 tokens | 16,384 tokens | 4,096 tokens
 Peak Throughput | 93.3 req/s | 40.3 req/s | 47.7 req/s
 First Token Latency | 0.09s | 0.14s | 0.26s
 Max Stable Users | 1024 | 512 | 256
 ### 💡 Core Takeaways
 1. AWS EC2 (L4 GPU VM): Best raw performance and highest stability under extreme concurrency (1024+ users) since
 dedicated hardware avoids serverless hypervisor overhead.
 2. GCP Cloud Run (L4 GPU Serverless): Best cost-performance balance. Supports large context lengths (16K) and scales to
 zero when idle, but throughput is lower than dedicated VMs due to proxy networking.
 3. Azure Container Apps (T4 GPU Serverless): Good for serverless workloads but constrained by older GPU hardware (Tesla
 T4), restricting context length to 4K to prevent Out-of-Memory crashes.

What about costs?!?

The serving costs were summarized:

> summarize azure_cost_comparison.md
 Here is a summary of azure_cost_comparison.md:
 ### 📋 Cost Model Summary
 • Azure Container Apps (ACA - Serverless): Billed per second of active execution. Replicas scale down to zero when idle,
 incurring no compute charges.
 • Tesla T4 Profile: ~$2.20/hr. Cheapest option for intermittent use (costs ~$316/month at 20% active utilization).
 • A10G Profile: ~$3.90/hr. Best for serverless workloads that require a larger 24 GB VRAM GPU.
 • Dedicated GPU VM (Standard_NV36ads_A10_v5): Billed hourly at ~$3.20/hr regardless of utilization (costs
 ~$2,304/month). Offers the cheapest cost-per-hour for 24/7 continuous operations but cannot scale to zero.
 ### 💡 Decision Matrix
 • Choose ACA (T4 or A10G) if your SRE diagnostic workloads are trigger-based, intermittent, or idle for a majority of
 the day.
 • Choose Dedicated VM (A10G) if you are running continuous, high-concurrency production inference pipelines 24/7.
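The monthly figures in this summary are consistent with a 720-hour month (30 days), which is an inference from the numbers rather than something the report states. Under that assumption the arithmetic is:

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours; inferred from the reported totals


def monthly_cost(hourly_rate: float, utilization: float = 1.0) -> float:
    """Monthly cost for a resource billed only while active."""
    return hourly_rate * HOURS_PER_MONTH * utilization


aca_t4 = monthly_cost(2.20, utilization=0.20)  # serverless T4 at 20% active
dedicated_vm = monthly_cost(3.20)              # always-on A10 VM

print(f"ACA T4 at 20% utilization: ${aca_t4:,.2f}/month")
print(f"Dedicated A10 VM:          ${dedicated_vm:,.2f}/month")
```

The calculation ignores cold-start time, storage, and egress, so it is a lower bound on the real bill rather than a quote.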

What about other cloud costs?

> summarize multi_cloud_cost_comparison.md
 Here is a quick summary of multi_cloud_cost_comparison.md:
 ### 📋 Cost and VM Performance Characteristics
 • GCP Compute Engine ( g2-std-8 ): Cheapest VM option at ~$0.85/hr (~$612/month) for a dedicated NVIDIA L4 GPU.
 • AWS EC2 ( g6.2xlarge ): Highly competitive dedicated VM rate at ~$0.98/hr (~$705/month) with unmatched stability
 under high concurrency.
 • GCP Cloud Run (Serverless L4): Best serverless choice. Costs ~$1.48/hr (only ~$213/month at 20% active runtime) and
 hosts a modern 24 GB VRAM L4 GPU.
 • Azure Container Apps (Serverless T4): More expensive (~$2.20/hr) for an older 16 GB VRAM Tesla T4 GPU, restricting
 active context length to 4K.
 • Azure VM ( NV36ads_A10_v5 ): Most expensive VM option (~$3.20/hr) due to larger bundled CPU/RAM profiles.
 ### 💡 Recommendation
 • If migrating to save cost on scale-to-zero (intermittent load): Choose GCP Cloud Run.
 • If migrating for continuous, high-concurrency 24/7 runtimes: Choose a dedicated GCP Compute Engine VM or AWS EC2 VM.
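The recommendation turns on how busy the service is. One way to make that concrete is the break-even utilization: the fraction of the month a serverless GPU must be active before an always-on VM becomes cheaper. This is a simplified sketch using only the hourly rates quoted above; it ignores cold starts, minimum billing increments, and any per-request charges.

```python
def break_even_utilization(serverless_hourly: float, dedicated_hourly: float) -> float:
    """Utilization at which serverless and dedicated monthly costs are equal.

    A result of 1.0 or more means the serverless option is cheaper even
    when it runs around the clock.
    """
    return dedicated_hourly / serverless_hourly


# Rates as quoted in the summary above.
cloud_run_vs_gce = break_even_utilization(1.48, 0.85)
cloud_run_vs_ec2 = break_even_utilization(1.48, 0.98)

print(f"Cloud Run vs GCP Compute Engine: {cloud_run_vs_gce:.1%}")
print(f"Cloud Run vs AWS EC2:            {cloud_run_vs_ec2:.1%}")
```

Below the break-even point scale-to-zero wins; above it the dedicated VM does.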

Summary

The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI and Azure was validated with an incremental, step-by-step approach.

A minimal stdio transport MCP server was started from Python source code and validated with Antigravity CLI running as an MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot Azure model deployments.
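For readers who have not seen one, a stdio MCP server with a single tool can be very small. The sketch below is not the server used in this article, whose source is not shown here. It assumes the `mcp` Python SDK and its `FastMCP` helper, and the tool body simply returns the endpoint URL seen in the transcripts. The helper names `endpoint_message` and `build_server` are hypothetical.

```python
ENDPOINT = "https://gemma4-vllm-gpu.gentlefield-612b24aa.westus2.azurecontainerapps.io"


def endpoint_message(url: str = ENDPOINT) -> str:
    """Format the endpoint line returned to the MCP client."""
    return f"vLLM Endpoint: {url}"


def build_server():
    """Create an MCP server exposing one tool (requires `pip install mcp`)."""
    # Imported lazily so the helper above works without the SDK installed.
    from mcp.server.fastmcp import FastMCP

    server = FastMCP("gpu-devops-agent")

    @server.tool()
    def get_endpoint() -> str:
        """Return the active vLLM service URL."""
        return endpoint_message()

    return server


print(endpoint_message())
```

To serve over stdio, a launcher script would call `build_server().run(transport="stdio")`, and the MCP client would be configured to start that script as a subprocess.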