AgentAtlas: LLM Agent Benchmarks Need More Than Accuracy

14 to 40 percentage points, collapsing all models to a tight 0.54–0.62 floor regardless of model family.

That means a significant portion of what leaderboards measure is not the agent's capability — it is how much scaffold the evaluation prompt provides. The paper calls this scaffold sensitivity, and it is a systematic confound in how the field currently compares models.

Component 4: Benchmark-Coverage Audit

AgentAtlas maps fifteen agent benchmarks against six behavioral axes, asking which aspects of agent behavior each benchmark actually covers. The audit reveals systematic blind spots: most benchmarks were designed to measure task success in a specific domain and were never intended to assess control decisions like CONFIRM, RECOVER, or REFUSE.

Coverage Axis	Benchmarks that cover it	Benchmarks that skip it
Task success (binary)	13 / 15	2 / 15
Trajectory-level failure label	~3 / 15	~12 / 15
Safety / destructive action handling	~4 / 15	~11 / 15
Recovery behavior	~5 / 15	~10 / 15
Cost-efficiency in primary score	0 / 15	15 / 15
Multi-axis composite score	0 / 15	15 / 15

No single benchmark wins on all axes. Combining results across benchmarks using the AgentAtlas vocabulary gives a more complete picture than any leaderboard column alone.
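The audit table above can be turned into a small data structure to see which axes are thinly covered. This is a minimal sketch: the counts are transcribed from the table (the "~" entries are approximate, so the fractions are too), and the axis keys, the `blind_spots` helper, and the 50% threshold are illustrative assumptions, not part of the paper.

```python
# Counts transcribed from the coverage table; "~" entries there are approximate.
TOTAL_BENCHMARKS = 15

coverage = {
    "task_success_binary": 13,
    "trajectory_failure_label": 3,   # "~3" in the table
    "safety_destructive_action": 4,  # "~4" in the table
    "recovery_behavior": 5,          # "~5" in the table
    "cost_efficiency_primary": 0,
    "multi_axis_composite": 0,
}


def coverage_fraction(axis: str) -> float:
    """Share of the audited benchmarks that cover the given axis."""
    return coverage[axis] / TOTAL_BENCHMARKS


def blind_spots(threshold: float = 0.5) -> list[str]:
    """Axes covered by fewer than `threshold` of the benchmarks (assumed cutoff)."""
    return [axis for axis in coverage if coverage_fraction(axis) < threshold]


if __name__ == "__main__":
    for axis in coverage:
        print(f"{axis:28s} {coverage_fraction(axis):.0%}")
    print("blind spots:", blind_spots())
```

With the table's numbers, every axis except binary task success falls below the assumed 50% cutoff.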

Effloow Lab PoC: Reproducing the Taxonomy in Python

To validate the paper's core claims, Effloow Lab implemented the taxonomy logic using Python stdlib and ran it against six synthetic agent trajectories.

from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ControlDecision(Enum):
    """The six control-decision states an agent can fire at each step."""
    ACT = auto()
    ASK = auto()
    REFUSE = auto()
    STOP = auto()
    CONFIRM = auto()
    RECOVER = auto()


class PrimaryErrorSource(Enum):
    """The nine failure categories used to label a failed step."""
    TOOL_INVOCATION = "tool_invocation"
    PLANNING = "planning"
    HALLUCINATION = "hallucination"
    CONTEXT_LOSS = "context_loss"
    PREMATURE_STOP = "premature_stop"
    OVER_EXECUTION = "over_execution"
    REFUSAL_ERROR = "refusal_error"
    CONFIRM_OMISSION = "confirm_omission"
    RECOVERY_FAILURE = "recovery_failure"


class ImpactLevel(Enum):
    BENIGN = "benign"
    PARTIAL = "partial"
    CRITICAL = "critical"


@dataclass
class TrajectoryStep:
    step_id: int
    decision: ControlDecision
    tool: Optional[str]  # None when the step calls no tool
    outcome: str  # "success" | "failure" | "partial"
    notes: str = ""

The PoC ran two labelers over the same trajectories: a taxonomy-blind heuristic and a taxonomy-aware classifier that uses the full label menu. One case diverged clearly: an over-execution failure (an agent that continued sending emails after the report task was already complete) was misclassified as planning by the blind labeler and correctly identified as over_execution by the aware labeler.

The accuracy-masking finding reproduced cleanly. Two trajectories, one involving a tool failure followed by successful recovery and the other a premature stop, achieved accuracy scores in the same middling range (67% vs. 50%) while representing fundamentally different agent behaviors. The scores alone do not say what went wrong; only the failure taxonomy label (tool_invocation vs. premature_stop) distinguishes the two behaviors.
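The masking effect is easy to see in a toy version. The sketch below is not the PoC code: the step outcomes are hypothetical, chosen only so that step-level accuracy comes out at the 67% and 50% quoted above, and `step_accuracy` and `summarize` are illustrative helper names.

```python
# Hypothetical step outcomes chosen to reproduce the 67% and 50% figures;
# the actual PoC trajectories are not shown in this article.
recovered_run = {
    "outcomes": ["success", "failure", "success"],  # tool fails, agent recovers
    "failure_label": "tool_invocation",
}
premature_stop_run = {
    "outcomes": ["success", "failure"],  # agent stops before the task is done
    "failure_label": "premature_stop",
}


def step_accuracy(run: dict) -> float:
    """Fraction of steps whose outcome is 'success'."""
    outcomes = run["outcomes"]
    return sum(outcome == "success" for outcome in outcomes) / len(outcomes)


def summarize(run: dict) -> str:
    """Accuracy plus the taxonomy label that accuracy alone does not carry."""
    return f"{step_accuracy(run):.0%} ({run['failure_label']})"


if __name__ == "__main__":
    print(summarize(recovered_run))
    print(summarize(premature_stop_run))
```

Dropping the label from `summarize` leaves two mid-range percentages with no indication that one run recovered and the other gave up.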

Full PoC code and outputs are recorded in data/lab-runs/agent-atlas-llm-benchmark-coverage-audit-paper-poc-2026.md.

What This Changes for Developers Building Agents

If you are building or evaluating an LLM agent system, the AgentAtlas framework has three direct practical implications.

Add control-decision logging to your agent loop. Tracking which of the six states your agent fires at each step costs almost nothing at instrumentation time and gives you diagnostic data that pure accuracy logging cannot provide. A spike in CONFIRM-less ACT decisions on destructive tool calls is a safety signal that no accuracy dashboard will surface.

Use failure taxonomy labels when triaging production incidents. When an agent fails in deployment, the first question is usually "what went wrong." Labeling the failure step with a primary_error_source category (hallucination? context_loss? premature_stop?) makes root-cause analysis faster and builds a structured dataset for future training or evaluation.

Be skeptical of scaffold-sensitive benchmarks. The 14–40 percentage point gap between taxonomy-aware and taxonomy-blind conditions means that some published leaderboard scores are measuring prompt-level scaffolding as much as model capability. When comparing agents, test under both conditions and report the gap.
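The first implication above, control-decision logging, can be sketched in a few lines. This is an illustration under stated assumptions: `LoggedStep`, `DESTRUCTIVE_TOOLS`, the tool names, and the rule "a destructive ACT must be directly preceded by CONFIRM" are all hypothetical choices, not something the paper prescribes.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ControlDecision(Enum):
    ACT = auto()
    ASK = auto()
    REFUSE = auto()
    STOP = auto()
    CONFIRM = auto()
    RECOVER = auto()


# Hypothetical tool names; substitute the destructive tools in your own stack.
DESTRUCTIVE_TOOLS = {"delete_file", "send_email", "drop_table"}


@dataclass
class LoggedStep:
    step_id: int
    decision: ControlDecision
    tool: Optional[str] = None


def unconfirmed_destructive_acts(log: list[LoggedStep]) -> list[int]:
    """Return step_ids of destructive ACT steps not directly preceded by CONFIRM."""
    flagged = []
    previous = None
    for step in log:
        destructive = (
            step.decision is ControlDecision.ACT and step.tool in DESTRUCTIVE_TOOLS
        )
        if destructive and previous is not ControlDecision.CONFIRM:
            flagged.append(step.step_id)
        previous = step.decision
    return flagged


log = [
    LoggedStep(1, ControlDecision.ACT, "read_file"),
    LoggedStep(2, ControlDecision.CONFIRM),
    LoggedStep(3, ControlDecision.ACT, "delete_file"),  # preceded by CONFIRM
    LoggedStep(4, ControlDecision.ACT, "send_email"),   # no CONFIRM before it
]

if __name__ == "__main__":
    print(unconfirmed_destructive_acts(log))  # [4]
```

Counting the flagged steps per run gives the "CONFIRM-less ACT" signal described above; a stricter policy could require the CONFIRM to name the same tool.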

The Broader Context: Related Work

AgentAtlas is not alone in pushing agent evaluation beyond accuracy. Two related lines of work are worth knowing:

AgentRx (Microsoft Research, arXiv:2602.02475) approaches the same problem from the debugging angle: given a failed agent trajectory, automatically localize which step was critical and why. Their grounded-theory-derived failure taxonomy overlaps substantially with AgentAtlas's nine categories. The overlap is not coincidental, since both are responding to the same gap in current evaluation tooling.

ATBench (arXiv:2604.02022) specifically targets trajectory safety: rather than asking whether a task succeeded, it asks whether the path the agent took was safe. This is the evaluation-axis equivalent of AgentAtlas's CONFIRM and REFUSE states.

The common thread across all three is that the field is moving from evaluating outcomes to evaluating behavior trajectories. A model that scores 80% on SWE-Bench while silently skipping confirmation steps on destructive actions is not an 80% deployable agent.

Common Questions

Q: Does AgentAtlas replace existing benchmarks like SWE-Bench or Tau-Bench?

No. The paper is explicit that it does not aim to replace existing benchmarks or introduce a new leaderboard. AgentAtlas provides a vocabulary and an audit methodology that can be applied on top of any existing benchmark. Think of it as a lens, not a replacement.

Q: How does scaffold sensitivity affect published leaderboard numbers?

The paper found that removing the explicit label menu from evaluation prompts drops every tested model's trajectory accuracy by 14–40 percentage points. This suggests that leaderboard rankings partly reflect how well a model leverages evaluation-prompt scaffolding rather than underlying capability. The effect size varies by model family, which means relative rankings also change depending on prompt format.
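Reporting the gap yourself takes very little code. In the sketch below the model names and scores are invented for illustration (they sit inside the ranges quoted above but are not the paper's per-model results), and `scaffold_gap_points` and `ranking` are hypothetical helper names.

```python
# Hypothetical scores for illustration only; not the paper's per-model results.
# "aware" = label menu included in the prompt, "blind" = label menu removed.
scores = {
    "model_a": {"aware": 0.95, "blind": 0.58},
    "model_b": {"aware": 0.78, "blind": 0.62},
    "model_c": {"aware": 0.70, "blind": 0.55},
}


def scaffold_gap_points(result: dict) -> float:
    """Accuracy lost when the label menu is removed, in percentage points."""
    return round((result["aware"] - result["blind"]) * 100, 1)


def ranking(condition: str) -> list[str]:
    """Model names ordered best-first under the given prompt condition."""
    return sorted(scores, key=lambda name: scores[name][condition], reverse=True)


if __name__ == "__main__":
    for name, result in scores.items():
        print(name, scaffold_gap_points(result))
    print("aware ranking:", ranking("aware"))
    print("blind ranking:", ranking("blind"))
```

In this made-up data, model_a leads with the label menu and drops to second without it, which is the kind of rank change the answer above describes.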

Q: Can I use the AgentAtlas taxonomy in my own agent evaluation pipeline today?

Yes. The taxonomy is described fully in the paper (arXiv:2605.20530) and requires no new tools or infrastructure. You can implement the six control-decision states as an enum in any language and add failure-category logging to your agent's trajectory recorder. The Effloow Lab PoC above demonstrates a minimal Python implementation using stdlib only.

Q: What is the nine-category failure taxonomy most useful for?

Root-cause analysis in production. When an agent fails, tagging the failure step with a primary_error_source category lets you aggregate failure patterns across runs, identify which categories your agent is most prone to, and target training data or evaluation coverage accordingly.
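Aggregating labeled failures across runs can be as simple as a counter. The incident records, run IDs, and the `failure_profile` helper below are hypothetical; only the label values come from the taxonomy's primary_error_source categories.

```python
from collections import Counter

# Hypothetical incident records; labels reuse the primary_error_source values.
incidents = [
    {"run_id": "r1", "primary_error_source": "context_loss"},
    {"run_id": "r2", "primary_error_source": "premature_stop"},
    {"run_id": "r3", "primary_error_source": "context_loss"},
    {"run_id": "r4", "primary_error_source": "hallucination"},
    {"run_id": "r5", "primary_error_source": "context_loss"},
    {"run_id": "r6", "primary_error_source": "premature_stop"},
]


def failure_profile(records: list[dict]) -> list[tuple[str, int]]:
    """Failure categories ordered from most to least frequent."""
    return Counter(r["primary_error_source"] for r in records).most_common()


if __name__ == "__main__":
    for category, count in failure_profile(incidents):
        print(f"{category:16s} {count}")
```

The top entries of the profile indicate where to focus training data or evaluation coverage first.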

Key Takeaways

AgentAtlas addresses a real gap: current agent benchmarks were built to measure task success, not the quality of the behavioral trajectory that led to it. A 6-state control-decision taxonomy and a 9-category failure taxonomy give developers a precise vocabulary for what the accuracy column omits.

The finding that removing the taxonomy label menu from evaluation prompts collapses all model scores to a 0.54–0.62 floor should give pause to anyone citing leaderboard numbers as a measure of agent capability. A significant portion of those numbers reflect scaffolding, not the model.

For developers building production agent systems, the practical takeaway is straightforward: log control decisions, label failures, and test your agents under conditions that do not hand them the answer key.

Bottom Line

AgentAtlas does not replace benchmarks; it exposes what they miss. The 6-state control taxonomy and 9-category failure taxonomy are small, implementable additions to any agent evaluation pipeline, and they separate a bare accuracy score from an honest assessment of deployability.