Local Mac Gemma 4 Deployment with MCP and Antigravity CLI

Checking System Status

The status can be checked with an MCP tool:

> get_system_status
● local-devops-agent/get_system_status(Get system status) (ctrl+o to expand)
 ### 🌀 Local System Status
 • vLLM Health: 🟢 Online (http://localhost:8000)
 • Native Ollama Service Status: 🟢 Running (ollama daemon is active via
 Homebrew launch agent)
 Next Steps: You can now interact with the local model using the 
 query_gemma4 tool.
──────────────────────────
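
A status check like this can be approximated outside the MCP tool with a plain HTTP probe. The sketch below is not the tool's actual implementation: it assumes the endpoint at `http://localhost:8000` answers `GET /v1/models` with HTTP 200 when healthy (that path appears later in this post), and the function name `is_endpoint_online` and its `opener` parameter are hypothetical.

```python
import urllib.error
import urllib.request


def is_endpoint_online(base_url="http://localhost:8000", timeout=3.0,
                       opener=urllib.request.urlopen):
    """Return True if GET {base_url}/v1/models answers with HTTP 200.

    `opener` is injectable so the check can be exercised without a server.
    """
    try:
        with opener(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, etc.
        return False
```

Calling `is_endpoint_online()` with no arguments probes the local endpoint; anything other than a clean 200 response is reported as offline.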

Deploy The Model Locally

Once the base environment has been checked, the Gemma model can be deployed:

> deploy gemma4

Cross Check The Deployed Model

Once the model starts locally, the MCP tool allows for model verification:

> model health check
 • Status: ✅ Model health check PASSED.
 • Model Response: "Hello! Yes, I am working and ready to assist you..."
 • Inference Latency: 4.84 seconds.
 The initial empty response was likely due to a transient cold-start or model
 loading delay (the model was active in CPU/GPU memory). The local
 gemma4:e2b model is now fully healthy and ready for queries.

and model details:

> get_active_models
● local-devops-agent/get_active_models(Get active models) (ctrl+o to expand)
 ### 📊 Active Loaded Models:
 NAME        ID            SIZE    PROCESSOR        CONTEXT  UNTIL
 gemma4:e2b  bcec27ff2bea  7.7 GB  76%/24% CPU/GPU  2048     4 minutes from now

and the endpoint:

> get_endpoint
● local-devops-agent/get_endpoint(Get endpoint) (ctrl+o to expand)
 🟢 Local vLLM/Ollama is Online at: http://localhost:8000

and status:

> get_system_details
● local-devops-agent/get_system_details(Get system details) (ctrl+o to expand)
 ### 🧩 Model Details (http://localhost:8000)
 Model Information (/v1/models):
 {"object":"list","data":[{"id":"gemma4:e2b","object":"model","created":1780234616,"owned_by":"library"},{"id":"gemma4:e4b","object":"model","created":1780234234,"owned_by":"library"}]}
 • Health Status: Healthy ✅
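
The `/v1/models` payload in that output is plain JSON, so the available model IDs can be pulled out with the standard library. This is a small sketch over the exact payload shown in the transcript; the helper name `model_ids` is hypothetical.

```python
import json

# Payload copied from the get_system_details output.
payload = (
    '{"object":"list","data":['
    '{"id":"gemma4:e2b","object":"model","created":1780234616,"owned_by":"library"},'
    '{"id":"gemma4:e4b","object":"model","created":1780234234,"owned_by":"library"}]}'
)


def model_ids(raw):
    """Return the model IDs listed in a /v1/models style response body."""
    return [entry["id"] for entry in json.loads(raw)["data"]]


print(model_ids(payload))  # ['gemma4:e2b', 'gemma4:e4b']
```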

Review the Model

Check the model details:

> Here are the deep parameters and configurations for the loaded model
● local-devops-agent/get_model_show_details(Get model show ...) (ctrl+o to expand)
 Here are the deep parameters and configurations for the loaded model ( 
 gemma4:e2b ):
 ### 🧩 Model Details for gemma4:e2b 
 • Architecture: gemma4 
 • Parameters: 5.1B 
 • Context Length: 131,072 (Configured to 1024 for run)
 • Embedding Length: 1,536 
 • Quantization: Q4_K_M 
 • Capabilities: Completion, Vision, Audio, Tools, Thinking
 • Parameters Config:
 • temperature : 1 
 • top_k : 64 
 • top_p : 0.95 
 • License: Apache License 2.0

Once everything is online, you can send queries to the model:

> query_gemma4_with_stats what is gemma?
● local-devops-agent/query_gemma4_with_stats(Query Gemma 4 ...) (ctrl+o to expand)
 ### 📊 Inference Performance Stats
 • Time to First Token (TTFT): 10.912s 
 • Total Generation Time: 28.396s 
 • Throughput: 14.93 tokens/s 
 • Total Tokens (approx.): 261 
 ──────
 ### 💬 Model Response
 "Gemma" primarily refers to a family of lightweight, open-weights large 
 language models (LLMs) developed by Google DeepMind.
 Here is a breakdown of what that means:
 1. Large Language Models (LLMs): Like other LLMs, Gemma models are trained
 on massive amounts of text data, allowing them to understand, generate, and
 process human language.
 2. Open Weights: A key feature of the Gemma models is that they are
 released with open weights. This means researchers, developers, and the
 public have access to the model parameters, allowing them to inspect,
 modify, fine-tune, and build upon the models for specific applications.
 3. Lightweight and Efficient: The models are designed to be efficient,
 meaning they offer strong performance while being relatively lightweight
 compared to some of the largest models. This makes them accessible for
 deployment on a wider range of hardware.
 4. Family: The Gemma family includes various sizes (e.g., different
 parameter counts) to suit different computational needs, ranging from
 smaller, highly efficient models to more capable ones.
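
The reported throughput is consistent with dividing the token count by the generation time after the first token, rather than by the total time. The sketch below assumes that definition (the tool's actual formula is not shown in this post) and uses the hypothetical helper name `decode_throughput`.

```python
def decode_throughput(total_tokens, ttft_s, total_s):
    """Tokens per second over the decode phase (total time minus TTFT)."""
    decode_s = total_s - ttft_s
    if decode_s <= 0:
        raise ValueError("total time must exceed time to first token")
    return total_tokens / decode_s


# Figures from the stats block: 261 tokens, TTFT 10.912s, total 28.396s.
print(round(decode_throughput(261, 10.912, 28.396), 2))  # 14.93
```

Dividing by the full 28.396 seconds instead would give roughly 9.2 tokens/s, so the two definitions are worth keeping apart when comparing runs.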

Benchmark the Local Model

The local model was benchmarked:

● Create(/Users/xbill/.gemini/an...benchmark_comparison.md) (ctrl+o to expand)
 I have compiled a comparison report between the two benchmark results:
 • 4B Agent File: benchmark_results.csv
 • 2B Agent File: benchmark_results.csv
 The full analysis is saved in the artifact: benchmark_comparison.md.
 ### Summary Comparison
 Metric          | Concurrency | 4B Model (m... | 2B Model (ma... | Speedup / Di...
 ----------------|-------------|----------------|-----------------|-------------------
 Avg Latency (s) | 1           | 1.57s          | 0.59s           | 2.66x faster (2B)
                 | 2           | 2.68s          | 0.79s           | 3.39x faster (2B)
 P95 Latency (s) | 1           | 1.57s          | 0.61s           | 2.57x faster (2B)
                 | 2           | 3.02s          | 1.01s           | 2.99x faster (2B)
 Req/sec         | 1           | 0.64           | 1.70            | +166% (2B)
                 | 2           | 0.67           | 1.98            | +195% (2B)
 Tokens/sec      | 1           | 20.44          | 27.25           | +33.3% (2B)
                 | 2           | 21.57          | 31.70           | +47.0% (2B)
 ### Key Takeaways
 1. Performance: The 2B model is significantly faster across the board,
 processing requests with roughly 3x lower latency and achieving 47% more 
 tokens per second at concurrency 2.
 2. Scalability: While the 4B model shows a sharp latency increase from
 concurrency 1 to 2, the 2B model handles concurrency scaling up to 16 with
 very little performance degradation (latency remaining under ~0.83s).
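
The speedup column can be reproduced from the latency and tokens/sec figures in the table. This is a sketch of the arithmetic only, with hypothetical helper names; it assumes latency speedup is the ratio of 4B to 2B latency and the tokens/sec gain is the relative increase of 2B over 4B.

```python
def latency_speedup(slow_s, fast_s):
    """How many times faster the lower-latency model is."""
    return slow_s / fast_s


def percent_gain(baseline, improved):
    """Relative increase of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1) * 100


# Average latency, 4B vs 2B, at concurrency 1 and 2.
print(round(latency_speedup(1.57, 0.59), 2))  # 2.66
print(round(latency_speedup(2.68, 0.79), 2))  # 3.39

# Tokens/sec, 4B vs 2B, at concurrency 1 and 2.
print(round(percent_gain(20.44, 27.25), 1))   # 33.3
print(round(percent_gain(21.57, 31.70), 1))   # 47.0
```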

This is visualized:

Summary

The strategy of using MCP for local Gemma 4 deployment with Antigravity CLI was validated with an incremental, step-by-step approach.

A minimal stdio transport MCP server was started from Python source code and validated with Antigravity CLI running as an MCP client in the same local environment. This Python server provided all of the management tools needed to deploy and troubleshoot local model deployments.
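
To illustrate what a stdio MCP server does at the message level, here is a simplified standard-library sketch. It is not the server used in this post and not a complete MCP implementation: it skips the `initialize` handshake and notifications, and only shows how `tools/list` and `tools/call` JSON-RPC 2.0 requests could be answered over newline-delimited stdin/stdout. The tool table, `handle`, and `serve_stdio` are hypothetical names; in practice an MCP SDK would handle this plumbing.

```python
import json
import sys

ENDPOINT = "http://localhost:8000"  # endpoint reported in the transcripts

# Hypothetical tool table: name -> (description, zero-argument callable).
TOOLS = {
    "get_endpoint": ("Get endpoint", lambda: f"Local endpoint: {ENDPOINT}"),
}


def _error(request, code, message):
    """Build a JSON-RPC 2.0 error response for the given request."""
    return {"jsonrpc": "2.0", "id": request.get("id"),
            "error": {"code": code, "message": message}}


def handle(request):
    """Answer one JSON-RPC 2.0 request dict with a response dict."""
    method = request.get("method")
    if method == "tools/list":
        result = {"tools": [
            {"name": name, "description": desc,
             "inputSchema": {"type": "object", "properties": {}}}
            for name, (desc, _) in TOOLS.items()
        ]}
    elif method == "tools/call":
        name = request.get("params", {}).get("name")
        if name not in TOOLS:
            return _error(request, -32602, f"Unknown tool: {name}")
        result = {"content": [{"type": "text", "text": TOOLS[name][1]()}]}
    else:
        return _error(request, -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}


def serve_stdio(stdin=sys.stdin, stdout=sys.stdout):
    """Read one JSON request per line and write one JSON response per line."""
    for line in stdin:
        if line.strip():
            stdout.write(json.dumps(handle(json.loads(line))) + "\n")
            stdout.flush()
```

Running `serve_stdio()` as the script's entry point would let a client that launches the process exchange these messages over its stdin and stdout, which is the essence of the stdio transport.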