12B Gemma 4 QAT Deployment with GCE, NVIDIA L4, MCP, and Antigravity CLI

DEV Community

Deploy The Model

The MCP server tools are used to deploy the model to GCE with GPU:

deploy the model

Checking System status

The status can be checked with an MCP tool:

> get_system_status
 The system status has been successfully updated and is fully healthy:
 ### 🌀 GPU vLLM System Status
 • vLLM Health: 🟢 Online (http://34.82.63.29:8080)
 • Hosting Status: 🟢 Running ( gpu-12b-qat-l4-devops-agent )
 👉 Next Step: Use query_gemma4 to interact with the model.

Cross Check The Deployed Model

Once the model starts locally- the MCP tool allows for model verification:

> verify_model_health
 I will execute the verify_model_health tool function from server.py to run a model query benchmark and
 measure latency.
●くろまる Bash(python3 -c "import asyncio; from server import verify_model_health; print(asyncio.run(veri...) (ctrl+o to
expand)
 The model health check has successfully passed:
 • Model: google/gemma-4-12B-it-qat-w4a16-ct 
 • Response: "Yes, I am working and ready to help! How can I assist you today?" 
 • Latency: 0.84 seconds 
 • Status: 🟢 PASSED

and model details:

> get_active_models
 I will call the get_model_details tool function from server.py to fetch details on the active served models.
●くろまる Bash(python3 -c "import asyncio; from server import get_model_details; print(asyncio.run(get_mo...) (ctrl+o to
expand)
 The active model currently served by vLLM is:
 ### 🧩 Active Model Details
 • Model ID: google/gemma-4-12B-it-qat-w4a16-ct 
 • Object: model 
 • Owner: vllm 
 • Endpoint Status: Healthy ✅

and the endpoint:

> get_endpoint
 I will call the get_endpoint tool function in server.py to retrieve and verify the active API endpoint.
●くろまる Bash(python3 -c "import asyncio; from server import get_endpoint; print(asyncio.run(get_endpoin...) (ctrl+o to
expand)
 The active validated vLLM endpoint is:
 http://34.82.63.29:8080
 • Status: 🟢 Online (HTTP 200)

and status:

> get status
 I will fetch the current status of the GCE VM instance and the vLLM engine to provide a unified summary.
●くろまる Bash(make status ZONE=us-west1-a) (ctrl+o to expand)
 Here is the complete status of the serving stack:
 ### 🖥️ GCP GCE VM Instance Status
 • Instance Name: gpu-12b-qat-l4-devops-agent 
 • Instance State: 🟢 RUNNING 
 • Zone: us-west1-a 
 • Machine Type: g2-standard-4 (1x NVIDIA L4 GPU, 24GB VRAM)
 • Public IP Address: 34.82.63.29 
 ### 🌀 vLLM Serving Status
 • Endpoint: http://34.82.63.29:8080 
 • Health Endpoint ( /health ): 🟢 Healthy (HTTP 200)
 • Active Model: google/gemma-4-12B-it-qat-w4a16-ct 
 • Model State: Loaded and listening to requests.

Check the Cloud GCE VM

Once Deployed — the model is visible as a standard VM:

Review the Model

Check the model details:

> what are the model details?
 I will query the GCE VM vLLM server configuration details by pulling the model information and runtime settings.
●くろまる Bash(python3 -c "import asyncio; from server import get_model_details; print(asyncio.run(get_mo...) (ctrl+o to
expand)
 The active model details on the self-hosted GCE server are:
 ### 🧩 Served Model Details
 • Model Identifier: google/gemma-4-12B-it-qat-w4a16-ct 
 • Quantization Type: Quantization-Aware Training (QAT) with 4-bit weights and 16-bit activations ( w4a16 ) using
 the native compressed-tensors ( -ct ) serialization format.
 • Serving Backend: vllm (v0.22.1rc1 nightly build) utilizing the Marlin linear kernel ( MarlinLinearKernel for
 CompressedTensorsWNA16 ) for fast 4-bit weight decoding.
 • GPU Hardware: Single NVIDIA L4 GPU (24GB VRAM) hosted on a GCP Compute Engine (GCE) g2-standard-4 VM instance.
 • VRAM Configuration:
 • gpu-memory-utilization : 0.95 (allocates 95% of GPU memory to vLLM cache)
 • kv-cache-dtype : fp8 (8-bit quantized key-value cache to save memory and maximize concurrency)
 • API Server Health: 🟢 Healthy (HTTP 200) at http://34.82.63.29:8080

Once everything is online — you can send queries to the Model:

> query_gemma4_with_stats what is gemma?
 I will run the model query tool with performance stats to answer "what is gemma?" and retrieve latency and
 throughput metrics.
●くろまる Bash(python3 -c "import asyncio; from server import query_gemma4_with_stats; print(asyncio.run(...) (ctrl+o to
expand)
 The query is running in the background. I will await the results.
 Here are the performance metrics and response for the query:
 ### 📊 Performance Stats
 • Model: google/gemma-4-12B-it-qat-w4a16-ct 
 • Time to First Token (TTFT): 0.198s 
 • Total Generation Time: 23.326s 
 • Tokens per Second: 26.89 tokens/s 
 • Total Tokens generated: 622 
 ### 💬 Model Response
 The term Gemma most commonly refers to a family of lightweight, state-of-the-art open models developed by Google 
 DeepMind.
 To understand what Gemma is, it helps to look at it through three different lenses: its relationship to Gemini,
 its technical nature, and its purpose.
 #### 1. The Relationship to Gemini
 Gemma is built using the same technology, infrastructure, and research used to create Gemini, Google’s most
 capable AI model:
 • Gemini is a "closed" model (available via API or Google products).
 • Gemma is an "open" model (available for developers to download and run on their own hardware).
 #### 2. Key Characteristics
 What makes Gemma unique in the AI landscape are these three traits:
 • Open Weights: This means that the internal weights are shared with the public, allowing developers to build
 upon, fine-tune, and customize the model.
 • Lightweight & Efficient: Unlike massive models that require supercomputers to run, Gemma is designed to be
 small enough to run on a single GPU or even a high-end laptop.
 • Responsible AI: It includes safety filtering to ensure the models are less likely to generate harmful or toxic
 content.
 #### 3. Versions and Sizes
 Gemma comes in different "sizes" (parameter counts) to balance compute requirements with reasoning capabilities:
 • Gemma 2B: A tiny model capable of running on mobile devices or laptops.
 • Gemma 7B/9B/27B/12B/31B: Mid-to-large sized models that are highly capable at reasoning, coding, and creative
 writing while still being manageable for individual developers.
 #### 4. Who is it for?
 Gemma is designed primarily for:
 • Developers: Who want to build apps (like a personal assistant or a specialized chatbot) without paying fees to
 use the Gemini API.
 • Researchers: Who want to study how Large Language Models (LLMs) work and experiment with fine-tuning them on
 specific datasets.
 • Enterprises: Companies that want to run AI on their own private servers to ensure data privacy.
 #### Summary Table
 Feature | Gemini | Gemma
 -------------------------------------|-------------------------------------|-------------------------------------
 Accessibility | Closed (API/Web) | Open Weights (Downloadable)
 Size | Massive (Very Large) | Small to Medium
 Hardware | Google's Supercomputers | Local GPUs / Cloud / Laptops
 Best For | Consumer apps, complex reasoning | Development, Research, Local Use

Benchmark the Local Model

The local model was benchmarked:

The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
 window sizes and concurrency levels:

The analysis is reviewed:

> summarize benchmark results
 ### 1. Request Success Rate (Reliability)
 • 100% Success Rate: Maintained across all context lengths (4 to 8,162 tokens) for workloads up to 1,024 
 concurrent users.
 • 16K Context Window: Success rate remains at 99.7% under 1,024 concurrent users, dropping to 47.7% at 2,048
 users due to KV-cache memory pressure.
 ### 2. Latency (Prefill & Queue Waiting)
 • Low Concurrency (1-32 Users): Very responsive; average latency stays below 1.2 seconds even with large context
 sizes (4K).
 • High Concurrency (256-1024 Users): Queue wait times begin to dominate. Latencies scale linearly:
 • 128 Context: scales from 0.26s (1 user) to 5.05s (256 users) to 18.47s (1024 users).
 • 16K Context: scales from 5.61s (1 user) to 9.96s (256 users) to 33.26s (1024 users).
 ### 3. Throughput (Requests per Second)
 • Peak Throughput: Reaches 75.8 requests/sec under small context windows (4 tokens) at 128 users.
 • Large Contexts (16K): Throughput saturates at approximately 14 to 16 requests/sec for concurrencies of 128
 users and above.
 Detailed tables and the performance plot can be viewed in the generated gce_benchmark_results.md file.

This is visualized:

Compare to Other Deployments

The GCE deployment was reviewed:

 Here is a summary of the performance comparison for serving Gemma 4 QAT on an NVIDIA L4 GPU:
 ### 1. GCP GCE VM vs. AWS EC2 (VM vs. VM)
 • Performance: Almost identical. Average latency at 1024 users (1K context) is 19.91s on GCE and 19.90s on AWS.
 • Extreme Load: AWS EC2 holds a slight edge under 2048 concurrent users at 16K context window with a 61.8% 
 success rate compared to 47.7% on GCP GCE VM.
 ### 2. VMs (GCE/EC2) vs. GCP Cloud Run (VM vs. Serverless)
 • Throughput: Dedicated VMs scale to significantly higher peak throughput. GCE VM peaked at 75.8 RPS and AWS at
 93.3 RPS, whereas Cloud Run was capped at 40.3 RPS.
 • Scalability: Cloud Run degrades much earlier under high concurrency. At 2048 users (16K context), its success
 rate plummets to 22%, while GCE VM maintains 47.7%.
 • Overhead: Cloud Run's gVisor container sandboxing and routing layers add a 3–4 second latency penalty under
 load, in addition to cold start delays.

Cost Analysis

The current Google Cloud pricing is used to evaluate the cost per token:

> summarize gce_cost_analysis.md
 Here is a summary of the cost analysis for running the g2-standard-4 GCE VM instance with an NVIDIA L4 GPU:
 ### 1. Core Monthly Costs (24/7 Run-time)
 • On-Demand Compute: ~547ドル.50 / month (0ドル.75 / hour).
 • Balanced Storage (150 GB): 15ドル.00 / month (0ドル.10 / GB).
 • Total On-Demand Baseline: ~562ドル.50 / month.
 ### 2. Cost Optimization Options
 • Spot Instances (~70% savings): Reduces compute costs to ~167ドル.90 / month (0ドル.23 / hour).
 • Dynamic VM Scheduling (10h/day, weekday-only): Reduces compute costs to ~165ドル.00 / month by leveraging the
 agent's programmatic start_gce and stop_gce tools to power down the VM during idle hours.
 ### 3. Value Proposition
 At 0ドル.23 - 0ドル.75 / hour, a single L4 GPU supports up to 1024 concurrent users with 100% success rate for context
 lengths up to 8K. This makes GCE self-hosting highly cost-efficient compared to managed pay-per-token API
 endpoints.
 I have prepared a cost comparison between GCP GCE, AWS EC2, and Google Cloud Run. The detailed cost matrix and
 workload analysis are stored in platform_cost_comparison.md.
 Here is a summary of the cost comparison:
 ### 📊 Hourly & Monthly Cost Matrix (NVIDIA L4 GPU)
 Hosting Option | GCP GCE ( g2-standard-4 ) | AWS EC2 ( g6.2xlarge ) | GCP Cloud Run (Gen2 Serverless)
 --------------------------|---------------------------|------------------------|---------------------------------
 On-Demand Hourly | ~0ドル.75 / hr | ~0ドル.98 / hr | ~0ドル.946 / hr (Active)
 Spot Hourly | ~0ドル.23 / hr | ~0ドル.27 / hr | N/A (Not supported)
 Monthly (24/7 On-Demand) | ~547ドル.50 / mo | ~715ドル.40 / mo | ~690ドル.58 / mo (Warm min=1 )
 Monthly (24/7 Spot) | ~167ドル.90 / mo | ~197ドル.10 / mo | N/A
 Idle / Scale-to-Zero | No (Manual scheduling) | No (Manual scheduling) | Yes (0 active cost when idle)
 ### ⚖️ Cost Analysis by Workload Type
 1. Intermittent / Burst Use (GCP Cloud Run)
 • Best for: Occasional workloads (e.g., automated log audits triggered once daily).
 • Advantage: Scales down to 0 instances when idle, incurring 0ドル.00 compute charges.
 • Disadvantage: Idle scale-down triggers a cold start (~1–2 mins) on subsequent calls. Keeping the model warm
 ( min-instances=1 ) costs ~690ドル.58/mo.
 2. Non-Production / Batch Workloads (Spot Instances)
 • Best for: Development and testing.
 • Advantage: GCP GCE Spot VMs are the cheapest way to serve continuously (0ドル.23/hr vs AWS Spot's 0ドル.27/hr),
 bringing the total monthly bill to ~182ドル.90 (with storage).
 • Disadvantage: Preemptible; instances can be shut down with a 30-second warning.
 3. Steady-State Production Workloads (On-Demand VMs)
 • Best for: 24/7 SRE agents or alert response bots requiring sub-second response times.
 • Advantage: GCP GCE VMs are ~23% cheaper than AWS EC2 g6.2xlarge (547ドル.50/mo vs. 715ドル.40/mo) and ~20% 
 cheaper than keeping Cloud Run warm.

And Price/Performacne?

The winner is...

 ### 🏆 Summary: GCP GCE VM ( g2-standard-4 ) has the best overall price/performance for steady-state workloads.
 We can measure price/performance by calculating Peak Throughput (Requests Per Second) per Dollar of hourly cost:
 Metric | GCP GCE VM ( g2-standard-4 ) | AWS EC2 ( g6.2xlarge ) | GCP Cloud Run (Gen2 Serverless
 -------------------------|------------------------------|------------------------|-------------------------------
 Peak Throughput | 75.8 RPS | 93.3 RPS | 40.3 RPS
 Hourly Rate (On-Demand) | 0ドル.75 / hr | 0ドル.98 / hr | 0ドル.946 / hr (Warm)
 Steady-State Price/Perf | 101.1 RPS per $ | 95.2 RPS per $ | 42.6 RPS per $
 Hourly Rate (Spot) | 0ドル.23 / hr | 0ドル.27 / hr | N/A
 Spot Price/Perf | 329.6 RPS per $ | 345.6 RPS per $ | N/A
 ──────
 ### ⚖️ Recommendations
 1. For Steady-State Serving (24/7): GCP GCE VM
 • Winner: GCP GCE VM
 • Why: It yields the highest value for on-demand serving (101.1 RPS/$ vs AWS's 95.2 RPS/$). It avoids the
 gVisor sandbox and container virtualization overhead of Cloud Run, which cuts throughput in half (40.3 RPS)
 while cost remains high (0ドル.946/hr) if kept warm.
 2. For Development & Batch Serving: AWS Spot or GCP Spot
 • Winner: AWS EC2 Spot (by a small margin on throughput value), though GCP GCE Spot is cheaper in absolute
 monthly dollars (167ドル.90 vs 197ドル.10).
 • Why: Both spot options offer spectacular price/performance (over 320+ RPS per dollar), making them ideal
 for offline training, benchmarking, or non-critical diagnostic sweeps.
 3. For Highly Intermittent Auditing: GCP Cloud Run (Serverless)
 • Winner: GCP Cloud Run (with scale-to-zero active)
 • Why: If your SRE agent only runs a log diagnosis task 2 to 3 times a day and sits idle the rest of the time,
 Cloud Run's scale-to-zero feature brings active costs to 0ドル.00, making it the most cost-effective choice
 overall, despite lower peak throughput and the ~1–2 minute cold start.

Summary

The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with a incremental step by step approach.

A minimal stdio transport MCP Server was started from Python source code and validated with Antigravity CLI running as a MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot GCE Model deployments.