Copied to Clipboard
Deploy The Model
The MCP server tools are used to deploy the model to GCE with GPU:
deploy the model
Checking System status
The status can be checked with an MCP tool:
> get_system_status
The system status has been successfully updated and is fully healthy:
### 🌀 GPU vLLM System Status
• vLLM Health: 🟢 Online (http://34.82.63.29:8080)
• Hosting Status: 🟢 Running ( gpu-12b-qat-l4-devops-agent )
👉 Next Step: Use query_gemma4 to interact with the model.
Cross Check The Deployed Model
Once the model starts locally- the MCP tool allows for model verification:
> verify_model_health
I will execute the verify_model_health tool function from server.py to run a model query benchmark and
measure latency.
●くろまる Bash(python3 -c "import asyncio; from server import verify_model_health; print(asyncio.run(veri...) (ctrl+o to
expand)
The model health check has successfully passed:
• Model: google/gemma-4-12B-it-qat-w4a16-ct
• Response: "Yes, I am working and ready to help! How can I assist you today?"
• Latency: 0.84 seconds
• Status: 🟢 PASSED
and model details:
> get_active_models
I will call the get_model_details tool function from server.py to fetch details on the active served models.
●くろまる Bash(python3 -c "import asyncio; from server import get_model_details; print(asyncio.run(get_mo...) (ctrl+o to
expand)
The active model currently served by vLLM is:
### 🧩 Active Model Details
• Model ID: google/gemma-4-12B-it-qat-w4a16-ct
• Object: model
• Owner: vllm
• Endpoint Status: Healthy ✅
and the endpoint:
> get_endpoint
I will call the get_endpoint tool function in server.py to retrieve and verify the active API endpoint.
●くろまる Bash(python3 -c "import asyncio; from server import get_endpoint; print(asyncio.run(get_endpoin...) (ctrl+o to
expand)
The active validated vLLM endpoint is:
http://34.82.63.29:8080
• Status: 🟢 Online (HTTP 200)
and status:
> get status
I will fetch the current status of the GCE VM instance and the vLLM engine to provide a unified summary.
●くろまる Bash(make status ZONE=us-west1-a) (ctrl+o to expand)
Here is the complete status of the serving stack:
### 🖥️ GCP GCE VM Instance Status
• Instance Name: gpu-12b-qat-l4-devops-agent
• Instance State: 🟢 RUNNING
• Zone: us-west1-a
• Machine Type: g2-standard-4 (1x NVIDIA L4 GPU, 24GB VRAM)
• Public IP Address: 34.82.63.29
### 🌀 vLLM Serving Status
• Endpoint: http://34.82.63.29:8080
• Health Endpoint ( /health ): 🟢 Healthy (HTTP 200)
• Active Model: google/gemma-4-12B-it-qat-w4a16-ct
• Model State: Loaded and listening to requests.
Check the Cloud GCE VM
Once Deployed — the model is visible as a standard VM:
Review the Model
Check the model details:
> what are the model details?
I will query the GCE VM vLLM server configuration details by pulling the model information and runtime settings.
●くろまる Bash(python3 -c "import asyncio; from server import get_model_details; print(asyncio.run(get_mo...) (ctrl+o to
expand)
The active model details on the self-hosted GCE server are:
### 🧩 Served Model Details
• Model Identifier: google/gemma-4-12B-it-qat-w4a16-ct
• Quantization Type: Quantization-Aware Training (QAT) with 4-bit weights and 16-bit activations ( w4a16 ) using
the native compressed-tensors ( -ct ) serialization format.
• Serving Backend: vllm (v0.22.1rc1 nightly build) utilizing the Marlin linear kernel ( MarlinLinearKernel for
CompressedTensorsWNA16 ) for fast 4-bit weight decoding.
• GPU Hardware: Single NVIDIA L4 GPU (24GB VRAM) hosted on a GCP Compute Engine (GCE) g2-standard-4 VM instance.
• VRAM Configuration:
• gpu-memory-utilization : 0.95 (allocates 95% of GPU memory to vLLM cache)
• kv-cache-dtype : fp8 (8-bit quantized key-value cache to save memory and maximize concurrency)
• API Server Health: 🟢 Healthy (HTTP 200) at http://34.82.63.29:8080
Once everything is online — you can send queries to the Model:
> query_gemma4_with_stats what is gemma?
I will run the model query tool with performance stats to answer "what is gemma?" and retrieve latency and
throughput metrics.
●くろまる Bash(python3 -c "import asyncio; from server import query_gemma4_with_stats; print(asyncio.run(...) (ctrl+o to
expand)
The query is running in the background. I will await the results.
Here are the performance metrics and response for the query:
### 📊 Performance Stats
• Model: google/gemma-4-12B-it-qat-w4a16-ct
• Time to First Token (TTFT): 0.198s
• Total Generation Time: 23.326s
• Tokens per Second: 26.89 tokens/s
• Total Tokens generated: 622
### 💬 Model Response
The term Gemma most commonly refers to a family of lightweight, state-of-the-art open models developed by Google
DeepMind.
To understand what Gemma is, it helps to look at it through three different lenses: its relationship to Gemini,
its technical nature, and its purpose.
#### 1. The Relationship to Gemini
Gemma is built using the same technology, infrastructure, and research used to create Gemini, Google’s most
capable AI model:
• Gemini is a "closed" model (available via API or Google products).
• Gemma is an "open" model (available for developers to download and run on their own hardware).
#### 2. Key Characteristics
What makes Gemma unique in the AI landscape are these three traits:
• Open Weights: This means that the internal weights are shared with the public, allowing developers to build
upon, fine-tune, and customize the model.
• Lightweight & Efficient: Unlike massive models that require supercomputers to run, Gemma is designed to be
small enough to run on a single GPU or even a high-end laptop.
• Responsible AI: It includes safety filtering to ensure the models are less likely to generate harmful or toxic
content.
#### 3. Versions and Sizes
Gemma comes in different "sizes" (parameter counts) to balance compute requirements with reasoning capabilities:
• Gemma 2B: A tiny model capable of running on mobile devices or laptops.
• Gemma 7B/9B/27B/12B/31B: Mid-to-large sized models that are highly capable at reasoning, coding, and creative
writing while still being manageable for individual developers.
#### 4. Who is it for?
Gemma is designed primarily for:
• Developers: Who want to build apps (like a personal assistant or a specialized chatbot) without paying fees to
use the Gemini API.
• Researchers: Who want to study how Large Language Models (LLMs) work and experiment with fine-tuning them on
specific datasets.
• Enterprises: Companies that want to run AI on their own private servers to ensure data privacy.
#### Summary Table
Feature | Gemini | Gemma
-------------------------------------|-------------------------------------|-------------------------------------
Accessibility | Closed (API/Web) | Open Weights (Downloadable)
Size | Massive (Very Large) | Small to Medium
Hardware | Google's Supercomputers | Local GPUs / Cloud / Laptops
Best For | Consumer apps, complex reasoning | Development, Research, Local Use
Benchmark the Local Model
The local model was benchmarked:
The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
window sizes and concurrency levels:
The analysis is reviewed:
> summarize benchmark results
### 1. Request Success Rate (Reliability)
• 100% Success Rate: Maintained across all context lengths (4 to 8,162 tokens) for workloads up to 1,024
concurrent users.
• 16K Context Window: Success rate remains at 99.7% under 1,024 concurrent users, dropping to 47.7% at 2,048
users due to KV-cache memory pressure.
### 2. Latency (Prefill & Queue Waiting)
• Low Concurrency (1-32 Users): Very responsive; average latency stays below 1.2 seconds even with large context
sizes (4K).
• High Concurrency (256-1024 Users): Queue wait times begin to dominate. Latencies scale linearly:
• 128 Context: scales from 0.26s (1 user) to 5.05s (256 users) to 18.47s (1024 users).
• 16K Context: scales from 5.61s (1 user) to 9.96s (256 users) to 33.26s (1024 users).
### 3. Throughput (Requests per Second)
• Peak Throughput: Reaches 75.8 requests/sec under small context windows (4 tokens) at 128 users.
• Large Contexts (16K): Throughput saturates at approximately 14 to 16 requests/sec for concurrencies of 128
users and above.
Detailed tables and the performance plot can be viewed in the generated gce_benchmark_results.md file.
This is visualized:
Compare to Other Deployments
The GCE deployment was reviewed:
Here is a summary of the performance comparison for serving Gemma 4 QAT on an NVIDIA L4 GPU:
### 1. GCP GCE VM vs. AWS EC2 (VM vs. VM)
• Performance: Almost identical. Average latency at 1024 users (1K context) is 19.91s on GCE and 19.90s on AWS.
• Extreme Load: AWS EC2 holds a slight edge under 2048 concurrent users at 16K context window with a 61.8%
success rate compared to 47.7% on GCP GCE VM.
### 2. VMs (GCE/EC2) vs. GCP Cloud Run (VM vs. Serverless)
• Throughput: Dedicated VMs scale to significantly higher peak throughput. GCE VM peaked at 75.8 RPS and AWS at
93.3 RPS, whereas Cloud Run was capped at 40.3 RPS.
• Scalability: Cloud Run degrades much earlier under high concurrency. At 2048 users (16K context), its success
rate plummets to 22%, while GCE VM maintains 47.7%.
• Overhead: Cloud Run's gVisor container sandboxing and routing layers add a 3–4 second latency penalty under
load, in addition to cold start delays.
Cost Analysis
The current Google Cloud pricing is used to evaluate the cost per token:
> summarize gce_cost_analysis.md
Here is a summary of the cost analysis for running the g2-standard-4 GCE VM instance with an NVIDIA L4 GPU:
### 1. Core Monthly Costs (24/7 Run-time)
• On-Demand Compute: ~547ドル.50 / month (0ドル.75 / hour).
• Balanced Storage (150 GB): 15ドル.00 / month (0ドル.10 / GB).
• Total On-Demand Baseline: ~562ドル.50 / month.
### 2. Cost Optimization Options
• Spot Instances (~70% savings): Reduces compute costs to ~167ドル.90 / month (0ドル.23 / hour).
• Dynamic VM Scheduling (10h/day, weekday-only): Reduces compute costs to ~165ドル.00 / month by leveraging the
agent's programmatic start_gce and stop_gce tools to power down the VM during idle hours.
### 3. Value Proposition
At 0ドル.23 - 0ドル.75 / hour, a single L4 GPU supports up to 1024 concurrent users with 100% success rate for context
lengths up to 8K. This makes GCE self-hosting highly cost-efficient compared to managed pay-per-token API
endpoints.
I have prepared a cost comparison between GCP GCE, AWS EC2, and Google Cloud Run. The detailed cost matrix and
workload analysis are stored in platform_cost_comparison.md.
Here is a summary of the cost comparison:
### 📊 Hourly & Monthly Cost Matrix (NVIDIA L4 GPU)
Hosting Option | GCP GCE ( g2-standard-4 ) | AWS EC2 ( g6.2xlarge ) | GCP Cloud Run (Gen2 Serverless)
--------------------------|---------------------------|------------------------|---------------------------------
On-Demand Hourly | ~0ドル.75 / hr | ~0ドル.98 / hr | ~0ドル.946 / hr (Active)
Spot Hourly | ~0ドル.23 / hr | ~0ドル.27 / hr | N/A (Not supported)
Monthly (24/7 On-Demand) | ~547ドル.50 / mo | ~715ドル.40 / mo | ~690ドル.58 / mo (Warm min=1 )
Monthly (24/7 Spot) | ~167ドル.90 / mo | ~197ドル.10 / mo | N/A
Idle / Scale-to-Zero | No (Manual scheduling) | No (Manual scheduling) | Yes (0 active cost when idle)
### ⚖️ Cost Analysis by Workload Type
1. Intermittent / Burst Use (GCP Cloud Run)
• Best for: Occasional workloads (e.g., automated log audits triggered once daily).
• Advantage: Scales down to 0 instances when idle, incurring 0ドル.00 compute charges.
• Disadvantage: Idle scale-down triggers a cold start (~1–2 mins) on subsequent calls. Keeping the model warm
( min-instances=1 ) costs ~690ドル.58/mo.
2. Non-Production / Batch Workloads (Spot Instances)
• Best for: Development and testing.
• Advantage: GCP GCE Spot VMs are the cheapest way to serve continuously (0ドル.23/hr vs AWS Spot's 0ドル.27/hr),
bringing the total monthly bill to ~182ドル.90 (with storage).
• Disadvantage: Preemptible; instances can be shut down with a 30-second warning.
3. Steady-State Production Workloads (On-Demand VMs)
• Best for: 24/7 SRE agents or alert response bots requiring sub-second response times.
• Advantage: GCP GCE VMs are ~23% cheaper than AWS EC2 g6.2xlarge (547ドル.50/mo vs. 715ドル.40/mo) and ~20%
cheaper than keeping Cloud Run warm.
And Price/Performacne?
The winner is...
### 🏆 Summary: GCP GCE VM ( g2-standard-4 ) has the best overall price/performance for steady-state workloads.
We can measure price/performance by calculating Peak Throughput (Requests Per Second) per Dollar of hourly cost:
Metric | GCP GCE VM ( g2-standard-4 ) | AWS EC2 ( g6.2xlarge ) | GCP Cloud Run (Gen2 Serverless
-------------------------|------------------------------|------------------------|-------------------------------
Peak Throughput | 75.8 RPS | 93.3 RPS | 40.3 RPS
Hourly Rate (On-Demand) | 0ドル.75 / hr | 0ドル.98 / hr | 0ドル.946 / hr (Warm)
Steady-State Price/Perf | 101.1 RPS per $ | 95.2 RPS per $ | 42.6 RPS per $
Hourly Rate (Spot) | 0ドル.23 / hr | 0ドル.27 / hr | N/A
Spot Price/Perf | 329.6 RPS per $ | 345.6 RPS per $ | N/A
──────
### ⚖️ Recommendations
1. For Steady-State Serving (24/7): GCP GCE VM
• Winner: GCP GCE VM
• Why: It yields the highest value for on-demand serving (101.1 RPS/$ vs AWS's 95.2 RPS/$). It avoids the
gVisor sandbox and container virtualization overhead of Cloud Run, which cuts throughput in half (40.3 RPS)
while cost remains high (0ドル.946/hr) if kept warm.
2. For Development & Batch Serving: AWS Spot or GCP Spot
• Winner: AWS EC2 Spot (by a small margin on throughput value), though GCP GCE Spot is cheaper in absolute
monthly dollars (167ドル.90 vs 197ドル.10).
• Why: Both spot options offer spectacular price/performance (over 320+ RPS per dollar), making them ideal
for offline training, benchmarking, or non-critical diagnostic sweeps.
3. For Highly Intermittent Auditing: GCP Cloud Run (Serverless)
• Winner: GCP Cloud Run (with scale-to-zero active)
• Why: If your SRE agent only runs a log diagnosis task 2 to 3 times a day and sits idle the rest of the time,
Cloud Run's scale-to-zero feature brings active costs to 0ドル.00, making it the most cost-effective choice
overall, despite lower peak throughput and the ~1–2 minute cold start.
Summary
The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with a incremental step by step approach.
A minimal stdio transport MCP Server was started from Python source code and validated with Antigravity CLI running as a MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot GCE Model deployments.