Checking System Status
The status can be checked with an MCP tool:
> get_system_status
● local-devops-agent/get_system_status(Get system status) (ctrl+o to expand)
### 🌀 Local System Status
• vLLM Health: 🟢 Online (http://localhost:8000)
• Native Ollama Service Status: 🟢 Running (ollama daemon is active via
Homebrew launch agent)
Next Steps: You can now interact with the local model using the
query_gemma4 tool.
──────────────────────────
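A status tool like this presumably comes down to probing the local endpoint over HTTP. The following is a minimal sketch, assuming the server answers on the `/v1/models` path shown later in this article. The helper name `endpoint_online` and its injectable `opener` parameter are hypothetical and are not taken from the author's server:

```python
import urllib.request


def endpoint_online(base_url, opener=urllib.request.urlopen, timeout=3.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200.

    `opener` is injectable (an assumption of this sketch) so the probe
    can be exercised without a running server.
    """
    url = base_url.rstrip("/") + "/v1/models"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts
        # all derive from OSError.
        return False
```

Calling `endpoint_online("http://localhost:8000")` would then report whether the local endpoint is reachable.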
Deploy The Model Locally
Once the base environment has been checked, the Gemma model can be deployed:
> deploy gemma4
Cross Check The Deployed Model
Once the model is running locally, the MCP tools can verify it:
> model health check
• Status: ✅ Model health check PASSED.
• Model Response: "Hello! Yes, I am working and ready to assist you..."
• Inference Latency: 4.84 seconds.
The initial empty response was likely due to a transient cold-start or model
loading delay (the model was active in CPU/GPU memory). The local
gemma4:e2b model is now fully healthy and ready for queries.
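The transcript notes that the first response came back empty, probably because the model was still loading. One way to make a health check tolerant of that is to retry on an empty reply and time each attempt. This is a sketch under assumptions: `health_check`, `query_fn`, and the retry settings are hypothetical names, and `query_fn` stands in for whatever call actually sends the prompt to the local model:

```python
import time


def health_check(query_fn, prompt="Are you working?", retries=3, delay=2.0):
    """Call query_fn(prompt) until it returns non-empty text.

    Returns (passed, response_text, latency_seconds_of_last_attempt).
    """
    text, latency = "", 0.0
    for attempt in range(retries):
        start = time.perf_counter()
        text = (query_fn(prompt) or "").strip()
        latency = time.perf_counter() - start
        if text:
            return True, text, latency
        if attempt < retries - 1:
            # Give a cold model time to finish loading before retrying.
            time.sleep(delay)
    return False, text, latency
```

The latency returned is that of the successful attempt only, so a slow cold start does not inflate the reported inference time.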
and model details:
> get_active_models
● local-devops-agent/get_active_models(Get active models) (ctrl+o to expand)
### 📊 Active Loaded Models:
NAME         ID             SIZE     PROCESSOR         CONTEXT   UNTIL
gemma4:e2b   bcec27ff2bea   7.7 GB   76%/24% CPU/GPU   2048      4 minutes from now
and the endpoint:
> get_endpoint
● local-devops-agent/get_endpoint(Get endpoint) (ctrl+o to expand)
🟢 Local vLLM/Ollama is Online at: http://localhost:8000
and status:
> get_system_details
● local-devops-agent/get_system_details(Get system details) (ctrl+o to expand)
### 🧩 Model Details (http://localhost:8000)
Model Information (/v1/models):
{"object":"list","data":[{"id":"gemma4:e2b","object":"model","created":1780234616,"owned_by":"library"},{"id":"gemma4:e4b","object":"model","created":1780234234,"owned_by":"library"}]}
• Health Status: Healthy ✅
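The `/v1/models` payload above is an OpenAI-style model list, so the available model names can be pulled out with a few lines of standard-library Python. A small sketch, where the helper name `model_ids` is hypothetical and the payload is copied from the output above:

```python
import json

# Payload copied from the /v1/models output shown above.
payload = (
    '{"object":"list","data":['
    '{"id":"gemma4:e2b","object":"model","created":1780234616,"owned_by":"library"},'
    '{"id":"gemma4:e4b","object":"model","created":1780234234,"owned_by":"library"}]}'
)


def model_ids(raw):
    """Return the model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in json.loads(raw)["data"]]


print(model_ids(payload))  # ['gemma4:e2b', 'gemma4:e4b']
```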
Review the Model
Check the model details:
> Here are the deep parameters and configurations for the loaded model
● local-devops-agent/get_model_show_details(Get model show ...) (ctrl+o to expand)
Here are the deep parameters and configurations for the loaded model (gemma4:e2b):
### 🧩 Model Details for gemma4:e2b
• Architecture: gemma4
• Parameters: 5.1B
• Context Length: 131,072 (Configured to 1024 for run)
• Embedding Length: 1,536
• Quantization: Q4_K_M
• Capabilities: Completion, Vision, Audio, Tools, Thinking
• Parameters Config:
• temperature : 1
• top_k : 64
• top_p : 0.95
• License: Apache License 2.0
Once everything is online, you can send queries to the model:
> query_gemma4_with_stats what is gemma?
● local-devops-agent/query_gemma4_with_stats(Query Gemma 4 ...) (ctrl+o to expand)
### 📊 Inference Performance Stats
• Time to First Token (TTFT): 10.912s
• Total Generation Time: 28.396s
• Throughput: 14.93 tokens/s
• Total Tokens (approx.): 261
──────
### 💬 Model Response
"Gemma" primarily refers to a family of lightweight, open-weights large
language models (LLMs) developed by Google DeepMind.
Here is a breakdown of what that means:
1. Large Language Models (LLMs): Like other LLMs, Gemma models are trained
on massive amounts of text data, allowing them to understand, generate, and
process human language.
2. Open Weights: A key feature of the Gemma models is that they are
released with open weights. This means researchers, developers, and the
public have access to the model parameters, allowing them to inspect,
modify, fine-tune, and build upon the models for specific applications.
3. Lightweight and Efficient: The models are designed to be efficient,
meaning they offer strong performance while being relatively lightweight
compared to some of the largest models. This makes them accessible for
deployment on a wider range of hardware.
4. Family: The Gemma family includes various sizes (e.g., different
parameter counts) to suit different computational needs, ranging from
smaller, highly efficient models to more capable ones.
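The reported throughput is consistent with dividing the approximate token count by the generation time after the first token, rather than by the total time. This is inferred from the numbers in the stats block, not confirmed from the tool's source, and the function name below is hypothetical:

```python
def decode_throughput(total_tokens, total_time_s, ttft_s):
    """Tokens per second over the decode phase (time after the first token)."""
    return total_tokens / (total_time_s - ttft_s)


# Values copied from the inference stats above.
rate = decode_throughput(total_tokens=261, total_time_s=28.396, ttft_s=10.912)
print(f"{rate:.2f} tokens/s")  # 14.93 tokens/s
```

Dividing by the full 28.396 s instead would give roughly 9.19 tokens/s, which is why the time to first token matters when comparing throughput figures.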
Benchmark the Local Model
The local model was benchmarked:
● Create(/Users/xbill/.gemini/an...benchmark_comparison.md) (ctrl+o to expand)
I have compiled a comparison report between the two benchmark results:
• 4B Agent File: benchmark_results.csv
• 2B Agent File: benchmark_results.csv
The full analysis is saved in the artifact: benchmark_comparison.md.
### Summary Comparison
Metric          | Concurrency | 4B Model (m... | 2B Model (ma... | Speedup / Di...
----------------|-------------|----------------|-----------------|------------------
Avg Latency (s) | 1           | 1.57s          | 0.59s           | 2.66x faster (2B)
                | 2           | 2.68s          | 0.79s           | 3.39x faster (2B)
P95 Latency (s) | 1           | 1.57s          | 0.61s           | 2.57x faster (2B)
                | 2           | 3.02s          | 1.01s           | 2.99x faster (2B)
Req/sec         | 1           | 0.64           | 1.70            | +166% (2B)
                | 2           | 0.67           | 1.98            | +195% (2B)
Tokens/sec      | 1           | 20.44          | 27.25           | +33.3% (2B)
                | 2           | 21.57          | 31.70           | +47.0% (2B)
### Key Takeaways
1. Performance: The 2B model is significantly faster across the board,
processing requests with roughly 3x lower latency and achieving 47% more
tokens per second at concurrency 2.
2. Scalability: While the 4B model shows a sharp latency increase from
concurrency 1 to 2, the 2B model handles concurrency scaling up to 16 with
very little performance degradation (latency remaining under ~0.83s).
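The speedup column in the summary table can be reproduced from the raw figures: latency speedup is the 4B latency divided by the 2B latency, and the tokens/sec gain is the relative increase in percent. A short sketch with hypothetical helper names, using values copied from the table:

```python
def speedup(slow_s, fast_s):
    """How many times faster the second latency is than the first."""
    return slow_s / fast_s


def percent_gain(base, new):
    """Relative increase of new over base, in percent."""
    return (new / base - 1) * 100


# (4B value, 2B value) pairs keyed by concurrency, copied from the table.
avg_latency = {1: (1.57, 0.59), 2: (2.68, 0.79)}
tokens_per_s = {1: (20.44, 27.25), 2: (21.57, 31.70)}

for conc, (four_b, two_b) in avg_latency.items():
    print(f"concurrency {conc}: {speedup(four_b, two_b):.2f}x faster")
for conc, (four_b, two_b) in tokens_per_s.items():
    print(f"concurrency {conc}: +{percent_gain(four_b, two_b):.1f}% tokens/s")
```

Because the table values are already rounded, recomputed percentages can differ from the report in the last digit.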
This is visualized:
Summary
The strategy of using MCP for local Gemma 4 deployment with Antigravity CLI was validated with an incremental, step-by-step approach.
A minimal stdio-transport MCP server was started from Python source code and validated with Antigravity CLI running as an MCP client in the same local environment. This Python server provided all of the management tools needed to deploy and troubleshoot local model deployments.