Releases: mingrath/arena-workflow

v1.8.0 — Competitive Multi-Model Redesign

10 Mar 04:05

@mingrath

v1.8.0-competitive

0ec616b

What Changed

This fork redesigns the entire model routing architecture based on actual benchmark evidence (March 2026).

Before (Original v1.7.74)

Frontend tasks → Gemini (hardcoded)
Backend tasks → Codex (hardcoded)
Claude → Orchestrator only

Problem: No published benchmarks support this static routing. SWE-bench scores are nearly tied (Claude 80.8%, Gemini 80.6%, Codex 80.0%).

After (v1.8.0 Competitive)

Every task → ALL models compete in parallel → Weighted evaluation → Best output wins
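As a rough illustration of this flow, the sketch below fans one task out to three stand-in model functions concurrently, scores each output, and keeps the highest-scoring one. The names `run_model`, `score_output`, and `compete` are hypothetical and the scores are arbitrary placeholders, not values or code from this repository.

```python
import asyncio

MODELS = ["codex", "gemini", "claude"]


async def run_model(model: str, task: str) -> str:
    """Hypothetical stand-in for a real model call; returns a canned answer."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[{model}] answer to: {task}"


def score_output(model: str, output: str) -> float:
    """Placeholder scorer (assumption): arbitrary fixed scores per model."""
    placeholder_scores = {"codex": 0.7, "gemini": 0.6, "claude": 0.8}
    return placeholder_scores[model]


async def compete(task: str) -> tuple[str, str]:
    # Fan the task out to every model concurrently.
    outputs = await asyncio.gather(*(run_model(m, task) for m in MODELS))
    # Score each output and keep the one with the highest score.
    scored = [(score_output(m, out), m, out) for m, out in zip(MODELS, outputs)]
    _, winner, best = max(scored)
    return winner, best


winner, best_output = asyncio.run(compete("add input validation"))
print(winner, "->", best_output)
```

`asyncio.gather` returns results in the order the calls were passed in, which is why zipping `MODELS` with `outputs` pairs each answer with the model that produced it.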

Key Changes (25 files, +1048/-592 lines)

🏗️ Architecture

Competitive dispatch: Every significant task dispatched to Codex + Gemini + Claude (self) in parallel
Weighted evaluation: Benchmark-informed criteria score each output per task type
3 dispatch modes: Competitive (all 3, default), Focused (best-match only), Quick (Claude only)
Consensus scoring: Review findings tagged with confidence (3/3, 2/3, 1/3 model agreement)
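The dispatch modes and weighted evaluation described above could be modeled roughly as follows. This is a minimal sketch under stated assumptions: the criteria names, the weights, the ratings, and the helper names (`models_for`, `weighted_score`, `pick_winner`) are all invented for illustration and are not taken from the release's command templates.

```python
from enum import Enum


class DispatchMode(Enum):
    COMPETITIVE = "competitive"  # all three models (default)
    FOCUSED = "focused"          # best-match model only
    QUICK = "quick"              # Claude only


ALL_MODELS = ("codex", "gemini", "claude")

# Hypothetical weights per task type; each row sums to 1.0.
WEIGHTS = {
    "implementation": {"correctness": 0.5, "edge_cases": 0.3, "readability": 0.2},
    "review": {"correctness": 0.3, "edge_cases": 0.5, "readability": 0.2},
}


def models_for(mode: DispatchMode, best_match: str = "claude") -> tuple[str, ...]:
    """Return the models a task is sent to under the given mode."""
    if mode is DispatchMode.COMPETITIVE:
        return ALL_MODELS
    if mode is DispatchMode.FOCUSED:
        return (best_match,)
    return ("claude",)


def weighted_score(task_type: str, ratings: dict[str, float]) -> float:
    """Combine per-criterion ratings using the weights for this task type."""
    weights = WEIGHTS[task_type]
    return sum(weights[c] * ratings[c] for c in weights)


def pick_winner(task_type: str, ratings_by_model: dict[str, dict[str, float]]) -> str:
    """Return the model whose output has the highest weighted score."""
    return max(
        ratings_by_model,
        key=lambda m: weighted_score(task_type, ratings_by_model[m]),
    )


# Made-up ratings on a 0-10 scale, for illustration only.
ratings = {
    "codex": {"correctness": 8, "edge_cases": 9, "readability": 6},
    "gemini": {"correctness": 8, "edge_cases": 6, "readability": 8},
    "claude": {"correctness": 9, "edge_cases": 7, "readability": 8},
}
print(pick_winner("implementation", ratings))
print(pick_winner("review", ratings))
```

With these made-up numbers the same three outputs produce different winners depending on task type (`claude` for implementation, `codex` for review), which is the point of weighting criteria per task type rather than routing by domain.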

📊 Benchmark Evidence

| Capability | Leader | Score |
| --- | --- | --- |
| Code quality (SWE-bench) | Claude | 80.8% |
| Terminal workflows | Codex | 77.3% |
| Visual design (WebDev Arena) | Gemini | 1487 ELO |
| Code review quality | Claude | #1 (Milvus benchmark) |
| Edge case detection | Codex | Catches bugs others miss |
| Responsive/accessibility | Claude | Leader (Index.dev test) |
| Rapid prototyping | Codex | 1000+ tok/s |
| Large codebase context | Gemini | 1M token window |

📝 Files Changed

| Category | Files | Changes |
| --- | --- | --- |
| Command templates | 14 | Competitive dispatch + weighted comparison |
| Model prompts | 7 | Evidence-based strengths + known limitations |
| New routing guide | 1 | `routing-guide.md` with full benchmark data |
| Documentation | 3 | README, CLAUDE.md, package.json |

🚫 What Was Removed

All "frontend model = Gemini, backend model = Codex" (originally "前端模型=Gemini, 后端模型=Codex") static routing references
All "backend authority/frontend authority" (originally "后端权威/前端权威") trust rules
Domain-based routing tables in execute.md

✅ What Was Added

Benchmark evidence table with sources
Weighted evaluation criteria per task type (analysis, planning, implementation, review, debug)
Model-specific strengths AND limitations in every prompt
Consensus scoring with agreement levels in reviews
Dispatch mode selection (competitive/focused/quick)
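Consensus scoring with agreement levels can be pictured as counting how many of the three reviewers reported each finding. The sketch below assumes findings are matched by exact string, which is a simplification (real review output would likely need some normalization or fuzzy matching); the function name `consensus` and the sample findings are hypothetical.

```python
from collections import Counter


def consensus(findings_by_model: dict[str, set[str]]) -> dict[str, str]:
    """Tag each finding with how many models reported it, e.g. '2/3'."""
    total = len(findings_by_model)
    counts = Counter(
        finding
        for findings in findings_by_model.values()
        for finding in findings
    )
    # Highest agreement first; ties broken alphabetically for stable output.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {finding: f"{n}/{total}" for finding, n in ordered}


# Made-up review findings, for illustration only.
reviews = {
    "codex": {"unchecked null in parse()", "missing timeout"},
    "gemini": {"unchecked null in parse()"},
    "claude": {"unchecked null in parse()", "missing timeout", "unclear variable name"},
}
tagged = consensus(reviews)
for finding, agreement in tagged.items():
    print(agreement, finding)
```

A finding tagged 3/3 was raised independently by every model, while a 1/3 finding came from a single reviewer and may deserve a closer look before acting on it.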

Sources
