How Bernstein Routes Tasks to the Right Model

DEV Community

The uniform selection problem

Most multi-agent setups use the same model for everything. Every task — whether it's renaming a variable or designing an authentication system — gets routed to the same model at the same effort level. This is wasteful. A docs task that writes a docstring doesn't need the same model as a security task that implements credential scoping.

The cost difference is real. At current API pricing, routing a simple task to Haiku instead of Opus costs roughly 30x less. Over a session with 40-60 tasks, that adds up fast.

How the router works

Bernstein's routing pipeline has three layers:

Layer 1: Heuristic classification. Every task has a complexity field (low, medium, high) and a role (backend, frontend, qa, security, etc.). The router uses a rule-based classifier to make an initial model/effort assignment. Low-complexity tasks default to Haiku or Sonnet with standard effort. High-complexity tasks get Opus with max effort.
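As a rough illustration of Layer 1, here is a minimal rule-based classifier. It assumes a hypothetical `Task` record with `role` and `complexity` fields. The post only specifies the low and high rules, so the medium rule and the security bump below are illustrative assumptions, not Bernstein's actual table.

```python
from dataclasses import dataclass


# Hypothetical task record; Bernstein's real task schema may differ.
@dataclass
class Task:
    role: str        # e.g. "backend", "frontend", "qa", "security", "docs"
    complexity: str  # "low", "medium", or "high"


def heuristic_route(task: Task) -> tuple[str, str]:
    """Return an initial (model, effort) pair from fixed rules."""
    if task.complexity == "high":
        # High complexity: strongest model, maximum effort.
        return ("opus", "max")
    if task.complexity == "low":
        # Low complexity defaults to a cheap model at standard effort.
        # Assumption: security work is bumped to Sonnet as an example role rule.
        model = "sonnet" if task.role == "security" else "haiku"
        return (model, "standard")
    # Assumption: medium complexity goes to the mid-tier model.
    return ("sonnet", "standard")
```

For example, `heuristic_route(Task(role="docs", complexity="low"))` returns `("haiku", "standard")`. The output is only a starting point; the later layers can override it.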

Layer 2: Epsilon-greedy bandit. This is where it gets interesting. The bandit maintains per-role reward estimates for each model. When a task arrives, it exploits the best-known model 80% of the time and explores alternatives 20% of the time. Rewards come from task outcomes: did the agent complete the task? Did tests pass? How many retries were needed?
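The post lists the outcome signals (completion, passing tests, retry count) but does not say how they are combined. Here is one hypothetical way to fold them into a single reward in [0, 1]. The weights are assumptions chosen only for readability.

```python
def task_reward(completed: bool, tests_passed: bool, retries: int) -> float:
    """Map one task outcome to a reward in [0, 1] (hypothetical weighting)."""
    if not completed:
        return 0.0                 # an unfinished task earns nothing
    reward = 0.6                   # base credit for finishing the task
    if tests_passed:
        reward += 0.4              # full credit only when tests also pass
    # Each retry removes 10% of the earned reward, floored at zero.
    penalty = max(0.0, 1.0 - 0.1 * retries)
    return reward * penalty
```

A completed task with passing tests and two retries scores 0.8 under this scheme. Any monotone mapping would work, as long as better outcomes reliably earn higher rewards.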

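```python
# Simplified selection logic
CASCADE = ["haiku", "sonnet", "opus"]  # cheapest to most capable, as in the routing config
# Hard tasks skip the cheapest tier; everything else may use the full cascade.
candidates = ["sonnet", "opus"] if task.complexity == "high" else CASCADE
selected = bandit.select(role=task.role, candidate_models=candidates)
```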

The CASCADE list includes all available models from cheapest to most capable. For high-complexity tasks, the bandit only considers Sonnet and Opus — sending a hard architecture task to Haiku would waste the agent's time even if it's cheap.
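To make the bandit step concrete, here is a minimal epsilon-greedy selector with per-role reward estimates. It is a sketch, not Bernstein's implementation. The `select(role, candidate_models)` signature mirrors the snippet above. The `update()` method and the incremental-mean estimator are assumptions.

```python
import random
from collections import defaultdict
from typing import Optional


class EpsilonGreedyBandit:
    """Per-role epsilon-greedy model selector (illustrative sketch)."""

    def __init__(self, epsilon: float = 0.2, rng: Optional[random.Random] = None):
        self.epsilon = epsilon  # 0.2 means explore 20% of the time
        self.rng = rng if rng is not None else random.Random()
        # Estimates are kept separately for every (role, model) pair.
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def select(self, role: str, candidate_models: list[str]) -> str:
        if self.rng.random() < self.epsilon:
            # Explore: any candidate, uniformly at random.
            return self.rng.choice(candidate_models)
        # Exploit: the candidate with the best mean reward for this role.
        # Ties (including the all-zero start) go to the first candidate.
        return max(candidate_models, key=lambda m: self.means[(role, m)])

    def update(self, role: str, model: str, reward: float) -> None:
        key = (role, model)
        self.counts[key] += 1
        # Incremental mean, so past rewards need not be stored.
        self.means[key] += (reward - self.means[key]) / self.counts[key]
```

Because estimates are keyed by role, a model can win for `docs` tasks and lose for `security` tasks at the same time. Restricting `candidate_models` for high-complexity tasks keeps exploration from ever sending them to Haiku.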

Layer 3: Effectiveness seeding. The bandit warms up using historical effectiveness data from the .sdd/metrics/ directory. If a previous run showed that backend tasks succeed 95% of the time with Sonnet but only 70% with Haiku, the bandit starts with that prior. No cold-start problem after the first session.
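Seeding can be sketched as converting historical success counts into starting estimates. The on-disk format under .sdd/metrics/ is not shown in the post, so `history` below is a hypothetical in-memory summary mapping `(role, model)` to `(successes, attempts)`. Capping the pseudo-count is also an assumption. The cap keeps old data from drowning out new observations.

```python
def seed_priors(
    history: dict[tuple[str, str], tuple[int, int]],
    max_weight: int = 10,
) -> dict[tuple[str, str], tuple[int, float]]:
    """Turn historical (successes, attempts) into (pseudo-count, mean) priors."""
    priors = {}
    for key, (successes, attempts) in history.items():
        if attempts == 0:
            continue  # no evidence for this arm; leave it cold
        # The success rate becomes the starting mean reward. The capped
        # attempt count sets how strongly that prior resists new data.
        priors[key] = (min(attempts, max_weight), successes / attempts)
    return priors
```

With the article's example, 19 of 20 backend tasks succeeding on Sonnet and 14 of 20 on Haiku, the priors start at 0.95 and 0.70. Each prior could initialize the per-role count and mean reward of an epsilon-greedy bandit.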

What the router learns

After a few sessions, clear patterns emerge:

Task type	Typical model	Why
Docs, docstrings	Haiku	Templated output, low reasoning
Test writing	Sonnet	Needs code understanding, not creativity
Bug fixes	Sonnet	Pattern matching on error traces
Refactoring	Sonnet/Opus	Depends on scope
Architecture, security	Opus	Requires deep reasoning

These aren't hardcoded rules — they're learned from outcomes. If your codebase has unusually complex tests, the bandit will learn to route test tasks to a stronger model.

Configuration

The bandit is enabled by default when a metrics directory exists. You can tune the exploration rate, the model cascade, and the minimum samples per option in your config:

```yaml
# .sdd/config.yaml
routing:
 bandit_epsilon: 0.2 # 20% exploration
 cascade: [haiku, sonnet, opus]
 min_samples_per_arm: 5 # explore each option at least 5 times
```
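One plausible reading of how `min_samples_per_arm` interacts with `bandit_epsilon` is a warm-up phase. Every model in the cascade gets its minimum number of trials before ordinary epsilon-greedy selection begins. The sketch below makes two assumptions: it applies to a single role, and it tries under-sampled models cheapest first. The post does not spell out the exact semantics.

```python
import random
from typing import Optional


def select_model(
    counts: dict[str, int],
    means: dict[str, float],
    cascade: list[str],
    epsilon: float = 0.2,
    min_samples_per_arm: int = 5,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a model for one role using the routing config knobs (sketch)."""
    rng = rng if rng is not None else random.Random()
    # Warm-up (assumed): walk the cascade cheapest-first and return the
    # first model that has not yet reached its minimum sample count.
    for model in cascade:
        if counts.get(model, 0) < min_samples_per_arm:
            return model
    if rng.random() < epsilon:
        return rng.choice(cascade)  # explore
    return max(cascade, key=lambda m: means.get(m, 0.0))  # exploit
```

The warm-up guarantees every model has some data before the bandit starts trusting its estimates. After warm-up, `bandit_epsilon` alone controls how often alternatives get retried.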

To disable bandit routing and use pure heuristics:

```yaml
routing:
 bandit_enabled: false
```
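The two switches described in this section can be combined into one decision. Bandit routing is on by default when a metrics directory exists, and `bandit_enabled: false` forces pure heuristics. In this sketch, `config` is assumed to be the already-parsed YAML as a plain dict. Falling back to heuristics whenever the directory is missing is also an assumption.

```python
from pathlib import Path


def routing_mode(config: dict, metrics_dir: Path) -> str:
    """Return "bandit" or "heuristic" from parsed config and metrics dir."""
    routing = config.get("routing", {})
    # An explicit bandit_enabled: false always wins.
    if not routing.get("bandit_enabled", True):
        return "heuristic"
    # Otherwise the bandit runs only when there is a metrics directory.
    return "bandit" if metrics_dir.is_dir() else "heuristic"
```

For example, `routing_mode({}, Path(".sdd/metrics"))` returns `"bandit"` once that directory exists and `"heuristic"` before the first run creates it.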

The numbers

Across our internal runs (self-development sessions where Bernstein improves its own codebase), the bandit router cut per-session spend roughly in half compared to the baseline of Sonnet-for-everything. Task completion rates stayed within a couple of percentage points of that baseline, which suggests the cheaper models handle their assigned tasks fine. You can measure your own runs with bernstein cost.

The savings compound. A 10-agent session running 50 tasks might cost $15-20 with uniform Sonnet. With bandit routing, the same session runs $7-10. Over weeks of iterative development, that's the difference between a side project budget and a real expense.
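Here is a quick arithmetic check of the session figures. Dividing by 50 tasks gives about $0.30-0.40 per task with uniform Sonnet and $0.14-0.20 with routing. Comparing low end to low end and high end to high end gives savings of about 53% and 50%, which is consistent with "roughly in half". The helper name is illustrative only.

```python
def session_savings(
    baseline: tuple[float, float],
    routed: tuple[float, float],
    tasks: int,
) -> dict[str, tuple[float, float]]:
    """Per-task cost and fractional savings for (low, high) cost ranges."""
    return {
        "baseline_per_task": (baseline[0] / tasks, baseline[1] / tasks),
        "routed_per_task": (routed[0] / tasks, routed[1] / tasks),
        # Low end vs. low end, high end vs. high end.
        "savings": (1 - routed[0] / baseline[0], 1 - routed[1] / baseline[1]),
    }


result = session_savings((15.0, 20.0), (7.0, 10.0), tasks=50)
```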