forked from fengshao1227/ccg-workflow
Releases: mingrath/arena-workflow
v1.8.0 — Competitive Multi-Model Redesign
What Changed
This fork redesigns the entire model routing architecture based on actual benchmark evidence (March 2026).
Before (Original v1.7.74)
Frontend tasks → Gemini (hardcoded)
Backend tasks → Codex (hardcoded)
Claude → Orchestrator only
Problem: No published benchmarks support this static routing. SWE-bench scores are nearly tied (Claude 80.8%, Gemini 80.6%, Codex 80.0%).
After (v1.8.0 Competitive)
Every task → ALL models compete in parallel → Weighted evaluation → Best output wins
Key Changes (25 files, +1048/-592 lines)
🏗️ Architecture
- Competitive dispatch: Every significant task dispatched to Codex + Gemini + Claude (self) in parallel
- Weighted evaluation: Benchmark-informed criteria score each output per task type
- 3 dispatch modes: Competitive (all 3, default), Focused (best-match only), Quick (Claude only)
- Consensus scoring: Review findings tagged with confidence (3/3, 2/3, 1/3 model agreement)
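As a rough sketch of the dispatch modes and consensus tagging listed above: the function names (`models_for`, `consensus`) and the `best_match` parameter are hypothetical, and the sketch assumes that findings from different models have already been normalized to identical strings so they can be compared directly.

```python
from collections import Counter
from typing import Dict, List, Set

ALL_MODELS = ["codex", "gemini", "claude"]


def models_for(mode: str, best_match: str = "claude") -> List[str]:
    """Map a dispatch mode to the models that receive the task."""
    if mode == "competitive":  # default: all three compete
        return list(ALL_MODELS)
    if mode == "focused":      # only the best-matching model (chosen elsewhere)
        return [best_match]
    if mode == "quick":        # Claude only
        return ["claude"]
    raise ValueError(f"unknown dispatch mode: {mode}")


def consensus(findings: Dict[str, Set[str]]) -> Dict[str, str]:
    """Tag each review finding with how many models reported it, e.g. '2/3'.

    Assumes findings were normalized so equal issues compare equal.
    """
    total = len(findings)
    counts = Counter(f for reported in findings.values() for f in reported)
    return {finding: f"{n}/{total}" for finding, n in counts.items()}
```

For example, a finding reported by all three reviewers would be tagged `3/3`, while one raised by a single model would be tagged `1/3` and could be treated as lower confidence.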
📊 Benchmark Evidence
| Capability | Leader | Score / evidence |
|---|---|---|
| Code quality (SWE-bench) | Claude | 80.8% |
| Terminal workflows | Codex | 77.3% |
| Visual design (WebDev Arena) | Gemini | 1487 ELO |
| Code review quality | Claude | #1 (Milvus benchmark) |
| Edge case detection | Codex | Catches bugs others miss |
| Responsive/accessibility | Claude | Leader (Index.dev test) |
| Rapid prototyping | Codex | 1000+ tok/s |
| Large codebase context | Gemini | 1M token window |
📝 Files Changed
| Category | Files | Changes |
|---|---|---|
| Command templates | 14 | Competitive dispatch + weighted comparison |
| Model prompts | 7 | Evidence-based strengths + known limitations |
| New routing guide | 1 | routing-guide.md with full benchmark data |
| Documentation | 3 | README, CLAUDE.md, package.json |
🚫 What Was Removed
- All "frontend model = Gemini, backend model = Codex" static routing references
- All "backend authority / frontend authority" trust rules
- Domain-based routing tables in execute.md
✅ What Was Added
- Benchmark evidence table with sources
- Weighted evaluation criteria per task type (analysis, planning, implementation, review, debug)
- Model-specific strengths AND limitations in every prompt
- Consensus scoring with agreement levels in reviews
- Dispatch mode selection (competitive/focused/quick)
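To illustrate how weighted evaluation criteria per task type might work, here is a small sketch. The criteria names and weight values are placeholders invented for illustration; they are not taken from routing-guide.md, and `weighted_score` and `pick_best` are hypothetical helpers.

```python
from typing import Dict

# Hypothetical weights per task type (illustrative placeholders only).
# Each task type's weights sum to 1.0.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "implementation": {"correctness": 0.5, "readability": 0.3, "test_coverage": 0.2},
    "review": {"correctness": 0.4, "edge_cases": 0.4, "readability": 0.2},
}


def weighted_score(task_type: str, ratings: Dict[str, float]) -> float:
    """Weighted sum of per-criterion ratings (0.0 to 1.0) for one output.

    Criteria missing from `ratings` count as 0.0.
    """
    weights = WEIGHTS[task_type]
    return sum(w * ratings.get(criterion, 0.0) for criterion, w in weights.items())


def pick_best(task_type: str, candidates: Dict[str, Dict[str, float]]) -> str:
    """Return the model whose output has the highest weighted score."""
    return max(candidates, key=lambda model: weighted_score(task_type, candidates[model]))
```

Because the weights differ by task type, the same set of ratings can produce a different winner for a review than for an implementation task.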