The uniform selection problem
Most multi-agent setups use the same model for everything. Every task — whether it's renaming a variable or designing an authentication system — gets routed to the same model at the same effort level. This is wasteful. A docs task that writes a docstring doesn't need the same model as a security task that implements credential scoping.
The cost difference is real. At current API pricing, routing a simple task to Haiku instead of Opus costs roughly 30x less. Over a session with 40-60 tasks, that adds up fast.
How the router works
Bernstein's routing pipeline has three layers:
Layer 1: Heuristic classification. Every task has a complexity field (low, medium, high) and a role (backend, frontend, qa, security, etc.). The router uses a rule-based classifier to make an initial model/effort assignment. Low-complexity tasks default to Haiku or Sonnet with standard effort. High-complexity tasks get Opus with max effort.
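A minimal sketch of what a rule-based first pass like this could look like. The rule table, the `Task` dataclass, and the `classify` function are illustrative names, not Bernstein's actual API; the medium-complexity row and the security override are assumptions added only to make the example complete.

```python
from dataclasses import dataclass

# Hypothetical rule table: complexity -> (model, effort).
# Only the "low" and "high" rows follow the description above;
# the "medium" row is an assumption.
DEFAULTS = {
    "low": ("haiku", "standard"),
    "medium": ("sonnet", "standard"),
    "high": ("opus", "max"),
}

# Assumption: roles that should never start on the cheapest model.
ROLES_NEEDING_SONNET = {"security"}


@dataclass(frozen=True)
class Task:
    role: str        # e.g. "backend", "frontend", "qa", "security"
    complexity: str  # "low", "medium", or "high"


def classify(task: Task) -> tuple[str, str]:
    """Return an initial (model, effort) assignment from static rules."""
    if task.complexity not in DEFAULTS:
        raise ValueError(f"unknown complexity: {task.complexity!r}")
    model, effort = DEFAULTS[task.complexity]
    # Bump sensitive roles off the cheapest model even at low complexity.
    if model == "haiku" and task.role in ROLES_NEEDING_SONNET:
        model = "sonnet"
    return model, effort
```

The point of keeping this layer dumb is that it only has to produce a sane starting assignment; the bandit in the next layer is what adapts it.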
Layer 2: Epsilon-greedy bandit. This is where it gets interesting. The bandit maintains per-role reward estimates for each model. When a task arrives, it exploits the best-known model 80% of the time and explores alternatives 20% of the time. Rewards come from task outcomes: did the agent complete the task? Did tests pass? How many retries were needed?
# Simplified selection logic
candidates = ["sonnet", "opus"] if task.complexity == "high" else CASCADE
selected = bandit.select(role=task.role, candidate_models=candidates)
The CASCADE list includes all available models from cheapest to most capable. For high-complexity tasks, the bandit only considers Sonnet and Opus — sending a hard architecture task to Haiku would waste the agent's time even if it's cheap.
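To make the selection step concrete, here is a self-contained sketch of a per-role epsilon-greedy bandit with the same `select(role=..., candidate_models=...)` shape as the snippet above. Treat it as an illustration of the technique, not Bernstein's implementation: the class name, the `update` method, the tie-breaking rule, and the `reward` weights are all assumptions.

```python
import random
from collections import defaultdict
from typing import Optional

CASCADE = ["haiku", "sonnet", "opus"]  # cheapest to most capable


class EpsilonGreedyBandit:
    """Per-role epsilon-greedy bandit over model names (illustrative only)."""

    def __init__(self, epsilon: float = 0.2, rng: Optional[random.Random] = None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        # (role, model) -> [number of observations, running mean reward]
        self.stats = defaultdict(lambda: [0, 0.0])

    def select(self, role: str, candidate_models: list[str]) -> str:
        if self.rng.random() < self.epsilon:
            # Explore: pick uniformly among the candidates.
            return self.rng.choice(candidate_models)
        # Exploit: highest mean reward; ties go to the earlier (cheaper) model.
        return max(candidate_models, key=lambda m: self.stats[(role, m)][1])

    def update(self, role: str, model: str, reward: float) -> None:
        stat = self.stats[(role, model)]
        stat[0] += 1
        stat[1] += (reward - stat[1]) / stat[0]  # incremental mean


def reward(completed: bool, tests_passed: bool, retries: int) -> float:
    """Hypothetical reward in [0, 1]; the weights are assumptions."""
    if not completed:
        return 0.0
    base = 1.0 if tests_passed else 0.5
    return max(0.0, base - 0.1 * retries)
```

After each task finishes, the caller would feed the outcome back with something like `bandit.update(task.role, selected, reward(...))`, which is what lets the estimates drift toward whichever model actually works for that role.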
Layer 3: Effectiveness seeding. The bandit warms up using historical effectiveness data from the .sdd/metrics/ directory. If a previous run showed that backend tasks succeed 95% of the time with Sonnet but only 70% with Haiku, the bandit starts with that prior. No cold-start problem after the first session.
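One common way to turn historical success rates into a prior is to treat them as pseudo-observations, so early real outcomes nudge the estimate instead of overwriting it. The sketch below assumes the history has already been loaded into a `{role: {model: success_rate}}` dict; the on-disk format of `.sdd/metrics/` is not covered here, and `seed_from_history`, `update`, and `prior_weight` are hypothetical names.

```python
def seed_from_history(history, prior_weight=10):
    """Turn historical success rates into initial bandit statistics.

    history: {role: {model: success_rate}} with rates in [0, 1].
    prior_weight: assumed number of pseudo-observations per arm.
    Returns {(role, model): [count, mean_reward]}.
    """
    stats = {}
    for role, by_model in history.items():
        for model, rate in by_model.items():
            if not 0.0 <= rate <= 1.0:
                raise ValueError(f"success rate out of range: {rate!r}")
            stats[(role, model)] = [prior_weight, rate]
    return stats


def update(stats, role, model, reward):
    """Fold one new outcome into the running mean for (role, model)."""
    entry = stats.setdefault((role, model), [0, 0.0])
    entry[0] += 1
    entry[1] += (reward - entry[1]) / entry[0]


# The example from the text: backend tasks, Sonnet at 95%, Haiku at 70%.
stats = seed_from_history({"backend": {"sonnet": 0.95, "haiku": 0.70}})
update(stats, "backend", "sonnet", 0.0)  # one failed task on Sonnet
```

With a prior weight of 10, a single failure moves Sonnet's estimate down but still leaves it ahead of Haiku, so one bad run doesn't flip the routing.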
What the router learns
After a few sessions, clear patterns emerge:
| Task type | Typical model | Why |
| --- | --- | --- |
| Docs, docstrings | Haiku | Templated output, low reasoning |
| Test writing | Sonnet | Needs code understanding, not creativity |
| Bug fixes | Sonnet | Pattern matching on error traces |
| Refactoring | Sonnet/Opus | Depends on scope |
| Architecture, security | Opus | Requires deep reasoning |
These aren't hardcoded rules — they're learned from outcomes. If your codebase has unusually complex tests, the bandit will learn to route test tasks to a stronger model.
Configuration
The bandit is enabled by default when a metrics directory exists. You can tune the exploration rate and the model cascade in your config:
# .sdd/config.yaml
routing:
  bandit_epsilon: 0.2        # 20% exploration
  cascade: [haiku, sonnet, opus]
  min_samples_per_arm: 5     # explore each option at least 5 times
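One plausible reading of `min_samples_per_arm` is a forced-exploration floor: before the epsilon-greedy rule kicks in, any model with too few observations gets tried first. The sketch below shows that idea only; `pick_undersampled` is a hypothetical helper, and whether Bernstein applies the setting this way is an assumption.

```python
def pick_undersampled(counts, candidates, min_samples=5):
    """Return the first candidate with fewer than min_samples observations.

    counts: {model: number of recorded outcomes} for one role.
    Returns None when every candidate has enough samples, in which case
    the caller would fall back to the normal epsilon-greedy choice.
    """
    for model in candidates:
        if counts.get(model, 0) < min_samples:
            return model
    return None
```

Checking candidates in cascade order means the cheapest under-sampled model is tried first, which keeps the forced exploration itself cheap.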
To disable bandit routing and use pure heuristics:
routing:
  bandit_enabled: false
The numbers
Across our internal runs (self-development sessions where Bernstein improves its own codebase), the bandit router cut per-session spend roughly in half compared to the baseline of Sonnet-for-everything. Task completion rates stayed within a couple of percentage points, so cheaper models handle their assigned tasks fine. Measure your own runs with bernstein cost.
The savings compound. A 10-agent session running 50 tasks might cost $15-20 with uniform Sonnet. With bandit routing, the same session runs $7-10. Over weeks of iterative development, that's the difference between a side project budget and a real expense.
Further reading