Context Size: The number of relevant file chunks currently in the prompt window.
Tool Requirements: Whether the model needs to execute complex function calls or simply provide raw code.
Latency Sensitivity: Historical performance metrics for the agent type.
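The three signals above could be collected into a small feature record before routing. The following is a minimal sketch, not the router's actual extractor: the payload keys `file_chunks` and `agent_type`, the `RoutingFeatures` type, and the latency table are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class RoutingFeatures:
    context_size: int           # number of relevant file chunks in the prompt window
    needs_tools: bool           # True if the request declares tool definitions
    latency_sensitivity: float  # score derived from historical metrics per agent type


# Hypothetical lookup table; real values would come from historical metrics.
AGENT_LATENCY_SENSITIVITY = {
    "autocomplete": 0.9,
    "chat": 0.5,
    "batch-refactor": 0.1,
}


def extract_features(payload: Dict[str, Any]) -> RoutingFeatures:
    """Build routing features from a request payload (assumed keys, see lead-in)."""
    chunks = payload.get("file_chunks", [])
    tools = payload.get("tools", [])
    agent_type = payload.get("agent_type", "chat")
    return RoutingFeatures(
        context_size=len(chunks),
        needs_tools=bool(tools),
        latency_sensitivity=AGENT_LATENCY_SENSITIVITY.get(agent_type, 0.5),
    )
```

Keeping the record immutable makes it safe to use as part of a cache key or a task signature later in the pipeline.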
The Training Loop
We treat routing as a contextual multi-armed bandit problem: the context is the current conversation state, and each arm is a candidate model. The reward signal is derived from the final outcome of the coding agent. If a plan generated by a cheaper model results in a failed test, a negative reward is fed back to the router, discouraging the selection of that model for similar task signatures in the future.
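    # Conceptual implementation of the routing decision.
    # ModelEndpoint, feature_extractor, routing_policy, and endpoints are
    # assumed to be defined elsewhere in the router.
    from typing import Dict

    class Router:
        def route(self, request_payload: Dict) -> "ModelEndpoint":
            # Extract features from the prompt
            features = self.feature_extractor.get_features(request_payload)
            # Query the routing model
            model_choice = self.routing_policy.predict(features)
            # Look up the endpoint registered for the chosen model
            return self.endpoints.get(model_choice)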
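The reward feedback described in this section could be sketched with a tabular epsilon-greedy bandit keyed by task signature. This is a deliberately simplified stand-in for whatever policy the router actually trains. The class name, the string signatures, and the reward values of +1 and -1 are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Optional


class EpsilonGreedyRouter:
    """Tabular epsilon-greedy bandit keyed by a task signature (illustrative only)."""

    def __init__(self, models: List[str], epsilon: float = 0.1, seed: Optional[int] = None):
        self.models = list(models)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # value[signature][model] holds the running mean reward for that pair.
        self.value: Dict[Hashable, Dict[str, float]] = defaultdict(
            lambda: {m: 0.0 for m in self.models}
        )
        # count[signature][model] holds how many rewards were observed.
        self.count: Dict[Hashable, Dict[str, int]] = defaultdict(
            lambda: {m: 0 for m in self.models}
        )

    def choose(self, signature: Hashable) -> str:
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)
        values = self.value[signature]
        return max(self.models, key=lambda m: values[m])

    def update(self, signature: Hashable, model: str, reward: float) -> None:
        # Incremental mean: v <- v + (reward - v) / n
        self.count[signature][model] += 1
        n = self.count[signature][model]
        v = self.value[signature][model]
        self.value[signature][model] = v + (reward - v) / n


# A failed test after using the cheaper model lowers its estimate for that signature.
router = EpsilonGreedyRouter(["cheap", "frontier"], epsilon=0.0)
router.update("refactor", "cheap", -1.0)
next_choice = router.choose("refactor")
```

With exploration disabled (`epsilon=0.0`), the single negative reward is enough to steer the next "refactor" request away from the cheaper model. A production policy would generalize across similar signatures rather than treat each one as an independent table row.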
Protocol Translation and Normalization
A significant challenge in building a model router is the lack of a universal standard for LLM APIs. Anthropic and OpenAI, for instance, handle tool definitions, stop sequences, and streaming chunks differently. The Weave Router incorporates a normalization layer that performs an AST-like transformation on the incoming request body.
    // Example: Request normalization flow
    // Agent sends OpenAI-compatible request
    {
      "model": "gpt-4o",
      "messages": [{"role": "user", "content": "Refactor this module..."}],
      "tools": [...]
    }
    // Router determines the task is suitable for DeepSeek V4
    // Translation layer executes:
    {
      "model": "deepseek-v4",
      "messages": [...],
      "tools": [/* Translated to DeepSeek schema */]
    }
This ensures that the underlying agent, whether it is Cursor or a custom Claude Code implementation, remains agnostic of the fact that it is not communicating directly with its native provider.
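To make the normalization step concrete, here is a minimal sketch of one translation direction: an OpenAI-style chat request converted to an Anthropic-style messages request. It reflects the publicly documented request shapes as I understand them (function tools with `parameters` versus tools with `input_schema`, `stop` versus `stop_sequences`, system messages versus a top-level `system` field) and covers only plain-text messages. The function name, the `target_model` parameter, and the default `max_tokens` are assumptions, and this is not the Weave Router's actual translation layer.

```python
from typing import Any, Dict, List


def openai_to_anthropic(
    request: Dict[str, Any], target_model: str, max_tokens: int = 4096
) -> Dict[str, Any]:
    """Translate a subset of an OpenAI-style request into an Anthropic-style one."""
    system_parts: List[str] = []
    messages: List[Dict[str, Any]] = []
    for msg in request.get("messages", []):
        if msg["role"] == "system":
            # System prompts move from the message list to a top-level field.
            system_parts.append(msg["content"])
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})

    out: Dict[str, Any] = {
        "model": target_model,
        "max_tokens": max_tokens,
        "messages": messages,
    }
    if system_parts:
        out["system"] = "\n\n".join(system_parts)

    tools = []
    for tool in request.get("tools", []):
        if tool.get("type") != "function":
            continue  # other tool types are out of scope for this sketch
        fn = tool["function"]
        tools.append({
            "name": fn["name"],
            "description": fn.get("description", ""),
            # The JSON Schema itself carries over; only the key name changes.
            "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
        })
    if tools:
        out["tools"] = tools

    stop = request.get("stop")
    if stop is not None:
        # A single string or a list of strings becomes a list of stop sequences.
        out["stop_sequences"] = [stop] if isinstance(stop, str) else list(stop)
    return out
```

Tool-result messages, multimodal content, and streaming chunk formats would each need their own mapping and are intentionally left out here.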
Performance and Reliability
Introducing a proxy inevitably adds latency. To mitigate this, we have implemented:
- Asynchronous Routing Decisions: The routing model runs on a dedicated high-performance inference cluster.
- Decision Caching: If a sequence of requests shows high spatial correlation (e.g., iterative refactoring in the same file), the router caches the model assignment for a duration of $T$.
- Circuit Breaking: If a target provider experiences a spike in latency or 5xx errors, the router automatically fails over to a secondary model, ensuring the coding agent remains functional even if our primary optimization path is interrupted.
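Decision caching and circuit breaking from the list above can be sketched in a few lines. This is one possible shape under stated assumptions, not the production implementation: the class names, the use of a file path as the cache key, the failure threshold, and the cooldown are all hypothetical. The clock is injectable so the behavior can be tested without sleeping.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class DecisionCache:
    """Caches a model assignment per key for ttl_seconds (the duration T)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        model, stored_at = entry
        if self.clock() - stored_at >= self.ttl:
            del self._entries[key]  # expired: force a fresh routing decision
            return None
        return model

    def put(self, key: str, model: str) -> None:
        self._entries[key] = (model, self.clock())


class CircuitBreaker:
    """Opens after consecutive failures, then allows traffic again after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        # Call this on a 5xx response or a latency budget violation.
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let requests probe the primary provider again.
        return self.clock() - self.opened_at >= self.cooldown


def pick_endpoint(primary: str, secondary: str, breaker: CircuitBreaker) -> str:
    """Fail over to the secondary model while the primary's breaker is open."""
    return primary if breaker.allow() else secondary
```

A cache hit skips the routing model entirely, which is where the latency saving comes from. The breaker protects the request path independently of what the cache says.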
Measuring Cost-Efficiency
In our internal evaluation over the last month, we observed a 40% reduction in total token costs. The distribution of model usage shifted significantly:
- Frontier Models (Opus, GPT-5): Reduced from 100% usage to approximately 25%, strictly reserved for complex architectural changes and logic-heavy debugging.
- Mid-Tier Models (DeepSeek, GLM): Increased from 0% to 65%, handling the bulk of routine implementation and boilerplate code.
- Small Models (Flash/Lite): Used for approximately 10% of requests, specifically for trivial context gathering and chat responses.
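As a back-of-the-envelope consistency check, the usage shares can be related to the reported saving. Assume, purely for illustration, that every request consumes a similar number of tokens, and let $r_m$ and $r_s$ denote the per-token price of mid-tier and small models relative to the frontier model. Neither ratio is reported in the evaluation, so both symbols are assumptions.

```latex
C_{\text{rel}} = 0.25 \cdot 1 + 0.65\, r_m + 0.10\, r_s
\qquad\text{and}\qquad
C_{\text{rel}} = 1 - 0.40 = 0.60
\;\Longrightarrow\;
0.65\, r_m + 0.10\, r_s = 0.35
```

Any price pair on that line reproduces a 40% saving under the equal-token assumption. For example, the hypothetical values $r_m = 0.5$ and $r_s = 0.25$ give $0.325 + 0.025 = 0.35$. In practice token volume differs by task type, so the real relationship would weight each tier by tokens rather than by request count.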
The key to achieving these results without degradation in velocity is the strict thresholding in the RL model. A cheaper model is selected only when the routing model’s confidence score $\sigma$ for the task clears a predefined threshold ($\sigma > 0.95$); otherwise the router defaults to the frontier model as a safety measure.
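The fallback rule could be expressed as follows. This is a sketch under assumptions: it supposes the routing model exposes a per-model confidence score in a dictionary, and the names `select_model` and `FRONTIER_MODEL` are hypothetical. Only the 0.95 threshold comes from the text.

```python
from typing import Dict

CONFIDENCE_THRESHOLD = 0.95  # threshold stated in the text
FRONTIER_MODEL = "frontier"  # hypothetical identifier for the safe default


def select_model(
    scores: Dict[str, float],
    threshold: float = CONFIDENCE_THRESHOLD,
    fallback: str = FRONTIER_MODEL,
) -> str:
    """Return the most confident model, or the fallback if confidence is too low."""
    if not scores:
        return fallback
    best_model, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Strict inequality: a score of exactly 0.95 still falls back.
    return best_model if best_score > threshold else fallback
```

Defaulting to the most capable model on low confidence trades some cost for safety, which matches the stated goal of avoiding a drop in velocity.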
Challenges in Implementation
One of the primary difficulties encountered was the "State Leakage" issue. Coding agents often maintain stateful conversations. If the router switches models mid-conversation, the system prompt and the model’s internal behavior might change, leading to unexpected outputs.
To solve this, the router maintains lightweight session state: it stores the model assignment for the duration of a specific task-session. Every turn within a single coding task is therefore served by the same model, even if the next task is routed to a different model family.
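Session pinning of this kind might look like the sketch below. The `SessionPinnedRouter` class, the `session_id` argument, and the explicit `end_session` call are hypothetical; how the real router identifies and closes a task-session is not specified here.

```python
from typing import Any, Callable, Dict


class SessionPinnedRouter:
    """Routes the first request of a session, then reuses that model until it ends."""

    def __init__(self, route_fn: Callable[[Dict[str, Any]], str]):
        self._route_fn = route_fn
        self._assignments: Dict[str, str] = {}

    def route(self, session_id: str, payload: Dict[str, Any]) -> str:
        # Only the first request in a session triggers a routing decision.
        if session_id not in self._assignments:
            self._assignments[session_id] = self._route_fn(payload)
        return self._assignments[session_id]

    def end_session(self, session_id: str) -> None:
        # Forget the pin so the next task can be routed afresh.
        self._assignments.pop(session_id, None)
```

Because the pin is decided once per session, a later turn that looks "easy" in isolation cannot silently move the conversation to a different model family mid-task.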
Future Directions
The routing model is not a static artifact. It must evolve as new base models are released. The immediate roadmap includes:
- Adaptive Fine-tuning: Continuously updating the routing policy based on global usage patterns.
- Provider Multi-homing: Allowing the router to dynamically balance load across different API providers to avoid rate limits and minimize latency.
- Client-Side Hints: Adding metadata to the agent’s requests that provide the router with "hints" about task intent, enabling higher precision routing.
This architectural pattern allows organizations to benefit from the rapid innovation in the LLM landscape without being locked into the pricing structures of individual vendors. By decoupling the agent from the model, we turn AI-assisted development into a tiered, cost-optimized pipeline.
For further exploration of architectural patterns in AI engineering, custom LLM integration, or strategic infrastructure consulting for your organization's AI initiatives, please visit https://www.mgatc.com.
Originally published in Spanish at www.mgatc.com/blog/smart-model-routing-for-ai-coding-agents/