How to Unlock Local Inference in the Google Gemini SDK (Without Forking)

DEV Community

FallbackStrategy
OverrideStrategy (Triggered when a concrete model is provided)
ClassifierStrategy (The default cloud ping)

By passing a concrete model name (e.g., qwen-3b) instead of auto during initialization, we trigger the OverrideStrategy. This "amputates" the cloud router: the ClassifierStrategy never runs, so the SDK talks directly to our local bridge with no routing round trip and zero cloud pings.
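The selection logic described above can be sketched as a chain where the first strategy that returns a decision wins. This is only an illustration under stated assumptions: the interface, the `routeModel` helper, the `"auto"` sentinel, and the `cloud-default` model name are hypothetical, not the SDK's actual code, and the FallbackStrategy is left out.

```typescript
// Hypothetical strategy chain. Names echo the list above, but the bodies
// are illustrative stand-ins, not the SDK's real routing code.
interface RoutingDecision {
  model: string;
  source: string;
}

interface RoutingStrategy {
  // Returns null when the strategy does not apply to this request.
  route(requestedModel: string): RoutingDecision | null;
}

const AUTO = "auto"; // assumed sentinel meaning "let the router decide"

// Applies only when the caller supplied a concrete model name.
const overrideStrategy: RoutingStrategy = {
  route: (requested) =>
    requested !== AUTO ? { model: requested, source: "override" } : null,
};

// Stand-in for the cloud classifier; a real one would make a network call.
const classifierStrategy: RoutingStrategy = {
  route: () => ({ model: "cloud-default", source: "classifier" }),
};

// Walk the chain in order and return the first decision produced.
function routeModel(
  requested: string,
  strategies: RoutingStrategy[],
): RoutingDecision {
  for (const strategy of strategies) {
    const decision = strategy.route(requested);
    if (decision !== null) return decision;
  }
  throw new Error("no strategy produced a decision");
}

const chain: RoutingStrategy[] = [overrideStrategy, classifierStrategy];
console.log(routeModel("qwen-3b", chain)); // decided by the override strategy
console.log(routeModel("auto", chain)); // falls through to the classifier
```

Because the override strategy sits ahead of the classifier, a concrete model name short-circuits the chain before any cloud call would happen.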

2. The Implementation: `LlamaCppGenerator`

Tars implements the SDK's ContentGenerator interface. This allows us to intercept the SDK’s generateContent and streamGenerateContent calls. We then:

Map Gemini Parts to OpenAI: Translate the SDK’s complex multi-part messages (text + function calls) into flat OpenAI-compatible JSON.
Native Tool-Calling Bridge: To make the SDK recognize local tool calls, we convert the tool calls in the local model's reply into the shape the SDK reads through the response.functionCalls prototype getter. This allows local models (like Qwen 3.5) to participate in the exact same multi-turn tool-loops as Gemini 1.5 Pro.
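The two steps above can be sketched as a pair of small conversions. This is a minimal sketch under stated assumptions: the types are simplified local stand-ins (the real SDK and OpenAI types carry more fields), `toOpenAIMessage`, `LocalResponse`, and the `read_file` tool are hypothetical names, and function responses (tool results) are not handled.

```typescript
// Simplified local types; not imported from any SDK.
interface LocalFunctionCall {
  name: string;
  args: Record<string, unknown>;
}
interface GeminiPart {
  text?: string;
  functionCall?: LocalFunctionCall;
}
interface GeminiContent {
  role: "user" | "model";
  parts: GeminiPart[];
}
interface OpenAIToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
}
interface OpenAIMessage {
  role: "user" | "assistant";
  content: string | null;
  tool_calls?: OpenAIToolCall[];
}

// Outbound: flatten one multi-part Gemini entry into one OpenAI message.
function toOpenAIMessage(content: GeminiContent, index: number): OpenAIMessage {
  const textChunks: string[] = [];
  const toolCalls: OpenAIToolCall[] = [];
  content.parts.forEach((part, i) => {
    if (part.text !== undefined) textChunks.push(part.text);
    if (part.functionCall !== undefined) {
      toolCalls.push({
        id: `call_${index}_${i}`, // synthetic id for the OpenAI side
        type: "function",
        function: {
          name: part.functionCall.name,
          arguments: JSON.stringify(part.functionCall.args),
        },
      });
    }
  });
  const message: OpenAIMessage = {
    role: content.role === "model" ? "assistant" : "user",
    content: textChunks.length > 0 ? textChunks.join("") : null,
  };
  if (toolCalls.length > 0) message.tool_calls = toolCalls;
  return message;
}

// Inbound: expose the local model's tool calls through a getter, similar
// in spirit to the response.functionCalls accessor mentioned above.
class LocalResponse {
  private toolCalls: OpenAIToolCall[];
  constructor(toolCalls: OpenAIToolCall[]) {
    this.toolCalls = toolCalls;
  }
  get functionCalls(): LocalFunctionCall[] | undefined {
    if (this.toolCalls.length === 0) return undefined;
    return this.toolCalls.map((call) => ({
      name: call.function.name,
      args: JSON.parse(call.function.arguments) as Record<string, unknown>,
    }));
  }
}

// Round trip with a hypothetical read_file tool.
const outbound = toOpenAIMessage(
  {
    role: "model",
    parts: [
      { text: "Checking the file." },
      { functionCall: { name: "read_file", args: { path: "notes.txt" } } },
    ],
  },
  0,
);
const inbound = new LocalResponse(outbound.tool_calls ?? []);
console.log(outbound, inbound.functionCalls);
```

The key asymmetry is that the OpenAI format carries tool arguments as a JSON string, while the Gemini-style part carries them as an object, so each direction needs a `JSON.stringify` or `JSON.parse` step.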

3. Future-Proofing: Upgrading Core without Breaking

Because Tars depends only on the standard ContentGenerator interface, we can upgrade @google/gemini-cli-core to the latest version (e.g., for new Gemini 2.0 features) without touching our local inference logic, provided the interface itself stays stable. We aren't hacking the SDK; we are using it exactly as it was designed to be extended.
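To illustrate why depending only on an interface keeps upgrades cheap, here is a rough sketch of a generator that implements a small contract and forwards requests to an OpenAI-compatible endpoint. Everything here is an assumption for illustration: `LocalContentGenerator` is a cut-down stand-in (the SDK's real ContentGenerator has more methods and richer types), `LlamaCppGeneratorSketch` is not Tars's actual class, and the transport is injected so the example runs without a server.

```typescript
// Cut-down stand-in for the SDK interface; names and shapes are hypothetical.
interface SimpleRequest {
  model: string;
  prompt: string;
}
interface SimpleResponse {
  text: string;
}
interface LocalContentGenerator {
  generateContent(request: SimpleRequest): Promise<SimpleResponse>;
}

// Injected transport: in practice this would wrap an HTTP POST to the
// local server; here it is a parameter so the sketch stays testable.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

class LlamaCppGeneratorSketch implements LocalContentGenerator {
  private baseUrl: string;
  private postJson: PostJson;

  constructor(baseUrl: string, postJson: PostJson) {
    this.baseUrl = baseUrl;
    this.postJson = postJson;
  }

  async generateContent(request: SimpleRequest): Promise<SimpleResponse> {
    // llama.cpp's server exposes an OpenAI-compatible chat endpoint.
    const raw = await this.postJson(`${this.baseUrl}/v1/chat/completions`, {
      model: request.model,
      messages: [{ role: "user", content: request.prompt }],
    });
    const data = raw as {
      choices?: { message?: { content?: string | null } }[];
    };
    const first = data.choices && data.choices[0];
    const content = first && first.message ? first.message.content : null;
    return { text: content ? content : "" };
  }
}

// Usage with a stub transport that imitates a chat-completions reply.
const stubTransport: PostJson = async () => ({
  choices: [{ message: { content: "Hello from the local model." } }],
});
const sketchGenerator: LocalContentGenerator = new LlamaCppGeneratorSketch(
  "http://localhost:8080", // assumed local server address
  stubTransport,
);
sketchGenerator
  .generateContent({ model: "qwen-3b", prompt: "Say hello." })
  .then((response) => console.log(response.text));
```

Callers only see the `LocalContentGenerator` type, so swapping the cloud implementation for the local one (or upgrading the package around it) does not change calling code as long as the contract holds.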

The Verdict

The Gemini CLI doesn't need a "Local Mode" feature request—it needs an implementation that respects its modular architecture. Tars is that implementation.

Key Benefits:

100% Privacy: No telemetry or classifier pings to Google.
Agentic Power: Full MCP extension support (Gmail, Drive, Shell) on local hardware.
Local Usage Metrics: Captures usageMetadata (token counts) on your own machine for real-time dashboard tracking.

Recommended Model: Qwen 3.5 (35B or 80B) for the most reliable tool-calling and JSON output.

Tip: You can test this today by running tars setup and selecting the Llama.cpp backend.
Repository: github.com/agustinsacco/tars