Releases: mingrath/arena-workflow

v1.8.0 — Competitive Multi-Model Redesign

10 Mar 04:05
@mingrath mingrath

What Changed

This fork redesigns the entire model routing architecture based on actual benchmark evidence (March 2026).

Before (Original v1.7.74)

Frontend tasks → Gemini (hardcoded)
Backend tasks → Codex (hardcoded)
Claude → Orchestrator only

Problem: No published benchmarks support this static routing. SWE-bench scores are nearly tied (Claude 80.8%, Gemini 80.6%, Codex 80.0%).

After (v1.8.0 Competitive)

Every task → ALL models compete in parallel → Weighted evaluation → Best output wins
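The flow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `run_model` is a hypothetical stub standing in for a real model call, and the criteria names, weights, and scores are made-up placeholders rather than benchmark data.

```python
from concurrent.futures import ThreadPoolExecutor

MODELS = ["codex", "gemini", "claude"]

# Hypothetical criterion weights for one task type (assumption, not from the project).
WEIGHTS = {"correctness": 0.5, "readability": 0.3, "coverage": 0.2}

# Placeholder per-criterion scores in [0, 1]; a real system would obtain these
# from an evaluation step, not from a hardcoded table.
FAKE_SCORES = {
    "codex": {"correctness": 0.8, "readability": 0.6, "coverage": 0.9},
    "gemini": {"correctness": 0.7, "readability": 0.9, "coverage": 0.6},
    "claude": {"correctness": 0.9, "readability": 0.8, "coverage": 0.7},
}


def run_model(model: str, task: str) -> dict:
    """Hypothetical stand-in for dispatching a task to one model."""
    return {
        "model": model,
        "output": f"{model} answer to: {task}",
        "scores": FAKE_SCORES[model],
    }


def weighted_score(scores: dict, weights: dict) -> float:
    """Combine per-criterion scores into one number using the weights."""
    return sum(weights[name] * scores[name] for name in weights)


def compete(task: str) -> dict:
    """Send the task to every model in parallel and keep the best-scoring result."""
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        results = list(pool.map(lambda m: run_model(m, task), MODELS))
    return max(results, key=lambda r: weighted_score(r["scores"], WEIGHTS))


if __name__ == "__main__":
    winner = compete("add input validation")
    print(winner["model"], round(weighted_score(winner["scores"], WEIGHTS), 2))
```

Threads are used here only because model calls would be I/O-bound; the release notes do not say how the parallel dispatch is actually performed.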

Key Changes (25 files, +1048/-592 lines)

🏗️ Architecture

  • Competitive dispatch: Every significant task dispatched to Codex + Gemini + Claude (self) in parallel
  • Weighted evaluation: Benchmark-informed criteria score each output per task type
  • 3 dispatch modes: Competitive (all 3, default), Focused (best-match only), Quick (Claude only)
  • Consensus scoring: Review findings tagged with confidence (3/3, 2/3, 1/3 model agreement)
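
As an illustration of the consensus tags described above, the sketch below counts how many models reported each review finding and labels it 3/3, 2/3, or 1/3. It assumes findings can be matched by an exact identifier string, which is a simplification; the function name `tag_consensus` and the sample findings are hypothetical.

```python
from collections import defaultdict


def tag_consensus(findings_by_model: dict) -> dict:
    """Map each finding to an 'agreeing/total' tag, e.g. '2/3'.

    findings_by_model maps a model name to the finding identifiers it reported.
    Assumption: identical identifiers mean the same finding.
    """
    total = len(findings_by_model)
    reporters = defaultdict(set)
    for model, findings in findings_by_model.items():
        for finding in findings:
            reporters[finding].add(model)
    return {finding: f"{len(models)}/{total}" for finding, models in reporters.items()}


if __name__ == "__main__":
    # Made-up review output for illustration only.
    review = {
        "codex": ["sql-injection", "off-by-one"],
        "gemini": ["sql-injection"],
        "claude": ["sql-injection", "off-by-one", "missing-null-check"],
    }
    for finding, tag in tag_consensus(review).items():
        print(f"{tag}  {finding}")
```

A finding tagged 3/3 would be treated as high confidence, while a 1/3 finding may still be valid but warrants a closer look before acting on it.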

📊 Benchmark Evidence

| Capability | Leader | Score / evidence |
|---|---|---|
| Code quality (SWE-bench) | Claude | 80.8% |
| Terminal workflows | Codex | 77.3% |
| Visual design (WebDev Arena) | Gemini | 1487 ELO |
| Code review quality | Claude | #1 (Milvus benchmark) |
| Edge case detection | Codex | Catches bugs others miss |
| Responsive/accessibility | Claude | Leader (Index.dev test) |
| Rapid prototyping | Codex | 1000+ tok/s |
| Large codebase context | Gemini | 1M token window |

📝 Files Changed

| Category | Files | Changes |
|---|---|---|
| Command templates | 14 | Competitive dispatch + weighted comparison |
| Model prompts | 7 | Evidence-based strengths + known limitations |
| New routing guide | 1 | routing-guide.md with full benchmark data |
| Documentation | 3 | README, CLAUDE.md, package.json |

🚫 What Was Removed

  • All "前端模型=Gemini, 后端模型=Codex" (frontend model = Gemini, backend model = Codex) static routing references
  • All "后端权威/前端权威" (backend authority/frontend authority) trust rules
  • Domain-based routing tables in execute.md

✅ What Was Added

  • Benchmark evidence table with sources
  • Weighted evaluation criteria per task type (analysis, planning, implementation, review, debug)
  • Model-specific strengths AND limitations in every prompt
  • Consensus scoring with agreement levels in reviews
  • Dispatch mode selection (competitive/focused/quick)
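
The three dispatch modes could be modeled along these lines. This is a hedged sketch: the `DispatchMode` enum, the `models_for` helper, and the `best_match` parameter are hypothetical names, and how the project actually determines the best-match model for Focused mode is not stated in these notes.

```python
from enum import Enum


class DispatchMode(Enum):
    COMPETITIVE = "competitive"  # all three models, the default
    FOCUSED = "focused"          # best-match model only
    QUICK = "quick"              # Claude only


ALL_MODELS = ("codex", "gemini", "claude")


def models_for(mode: DispatchMode = DispatchMode.COMPETITIVE,
               best_match: str = "claude") -> list:
    """Return the models a task is dispatched to under the given mode.

    best_match is a caller-supplied assumption; it is only used in FOCUSED mode.
    """
    if mode is DispatchMode.COMPETITIVE:
        return list(ALL_MODELS)
    if mode is DispatchMode.FOCUSED:
        if best_match not in ALL_MODELS:
            raise ValueError(f"unknown model: {best_match}")
        return [best_match]
    return ["claude"]  # QUICK


if __name__ == "__main__":
    print(models_for())
    print(models_for(DispatchMode.FOCUSED, best_match="gemini"))
    print(models_for(DispatchMode.QUICK))
```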

Sources
