OpenInterpretability

When should we believe a mechanistic interpretability claim — and where, inside a model, does a decision actually live?

Mechanistic interpretability of long-horizon LLM agents, built on Qwen3.6-27B since April 2026: a protocol, a benchmark, a registry, and the WANDERING arc, a six-paper study of why agents fail to stop that ends in the arc's first positive result.

openinterp.org · decision-locator · pip install openinterp · pip install openinterp-mcp · Apache-2.0

⭐ Featured — the WANDERING arc + `decision-locator`

Long-horizon coding agents fail by WANDERING: they stay internally sure the task is solved but never emit the finish action, burning the whole turn budget. Across six papers (Qwen3.6-27B, SWE-bench Pro, all CC-BY-4.0) we showed that the agent's "task-done" verdict is linearly decodable (AUROC 0.81–0.91) yet causally inert: no residual injection rescues the stuck agent, and clamping the exact, named SAE "done" feature moves the probability of finishing by only −0.001. Then we found where control actually lives.

The law: in agents, the knowledge–action gap is a layer gap. The decision is known mid-stream (the verdict, at layer 23) but only writable late (layers 51–63, roughly 30 layers downstream). Patching that late, task-matched block makes a stuck agent emit a real finish call 42% of the time, up from a 0% baseline (exact McNemar p = 0.031).
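The significance figure quoted here comes from an exact McNemar test on paired outcomes: the same stuck episode, with and without the patch. The following is a minimal sketch of how such a p-value is computed, assuming the standard two-sided binomial form of the test. The function name `mcnemar_exact` and the counts are illustrative only; this README does not report the underlying episode counts.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes.

    b: pairs that fail at baseline but succeed under the intervention
    c: pairs that succeed at baseline but fail under the intervention
    Under the null hypothesis each discordant pair is equally likely to go
    either way, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, NOT taken from the papers.
print(mcnemar_exact(6, 0))  # 0.03125
print(mcnemar_exact(5, 0))  # 0.0625
```

With a 0% baseline no pair can move from success to failure, so c = 0 and the two-sided p-value depends only on the number of rescued episodes b: it equals 2 · 0.5^b.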

🛠 decision-locator packages the method — find & steer the commitment layer for any tool-calling decision on any open-weight model:

pip install git+https://github.com/OpenInterpretability/decision-locator
decision-locator demo --model gpt2 # locate → sweep → steer, on a laptop
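To make the locate → sweep idea concrete, here is a toy sketch of a per-layer patching sweep. It is not the `decision-locator` API and does not load a real model: the four-channel "residual stream", the layer functions, and every name in the block are hypothetical, chosen so the example runs with the standard library alone. The idea it illustrates is to cache each layer's output on a run that finishes, substitute those outputs one layer at a time into a stuck run, and see which layer moves the action logit.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy residual stream with four named channels. Hypothetical, for illustration only.
EVIDENCE, TRIGGER, VERDICT, ACTION = range(4)
State = List[float]
Layer = Callable[[State], State]  # a layer returns the delta it adds to the stream


def noop(state: State) -> State:
    return [0.0, 0.0, 0.0, 0.0]


def write_verdict(state: State) -> State:
    # Mid-stream layer: copies the evidence into the verdict channel.
    delta = [0.0] * 4
    delta[VERDICT] = state[EVIDENCE]
    return delta


def write_action(state: State) -> State:
    # Late layer: sets the action logit from the trigger; it never reads the verdict.
    delta = [0.0] * 4
    delta[ACTION] = 2.0 * state[TRIGGER]
    return delta


LAYERS: List[Layer] = [noop, noop, write_verdict, noop, noop, write_action, noop]


def run(x: State, patch: Optional[Dict[int, State]] = None) -> Tuple[State, List[State]]:
    """Run the toy model. `patch` maps a layer index to a delta used instead of that layer's own output."""
    patch = patch or {}
    state = list(x)
    deltas: List[State] = []
    for i, layer in enumerate(LAYERS):
        delta = patch[i] if i in patch else layer(state)
        deltas.append(delta)
        state = [s + d for s, d in zip(state, delta)]
    return state, deltas


donor = [1.0, 1.0, 0.0, 0.0]   # solved task, finish action emitted
stuck = [1.0, -1.0, 0.0, 0.0]  # same verdict evidence, but no finish action

_, donor_deltas = run(donor)
base_state, _ = run(stuck)

# Sweep: patch one layer at a time and record the change in the action logit.
effects = []
for i in range(len(LAYERS)):
    patched_state, _ = run(stuck, {i: donor_deltas[i]})
    effects.append(patched_state[ACTION] - base_state[ACTION])

commit_layer = max(range(len(LAYERS)), key=lambda i: effects[i])
print("verdict in stuck run:", base_state[VERDICT])
print("effect per layer:", effects)
print("commitment layer:", commit_layer)
```

In this toy the stuck run already carries the "done" verdict, patching the verdict layer changes nothing, and only the late layer moves the action logit. On a real transformer the same sweep would be run with forward hooks on the model's blocks rather than on hand-written layers.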

📄 The arc, permanent DOIs: #1 Tool-Entropy · #2 Right Locus · #3 Multi-Channel · #4 Modality Matters · #5 Verdict Circuit · #6 The Lever Is Late · companion note — read them at openinterp.org/research.

What's here

Core protocol

| Repo | What |
| --- | --- |
| registry | Six Diagnostics schemas + reference implementation. JSON cards for probes, causal reports, intervention traces. Failed-Replication Registry data. |

Research artifacts

| Repo | What |
| --- | --- |
| openinterp-swebench-harness | Instrumented agent harness capturing SAE feature trajectories during agent reasoning on SWE-bench Pro. Substrate for the six-paper WANDERING arc. |
| decision-locator | `pip install`-able, model-agnostic tool: find the layer where a model commits a decision, and steer it. The method behind WANDERING arc paper #6. CLI + Colab + CI. |
| inspect-tool-entropy-collapse | The tool-entropy-collapse WANDERING detector as an Inspect eval (UK AISI `inspect_evals` submission). |
| mechreward | Mechanistic interpretability as reward signal for RL training. SAE features + GRPO + anti-Goodhart framework. |

Developer tools

| Repo | What |
| --- | --- |
| cli | `pip install openinterp`. FabricationGuard probe + ProbeBench leaderboard + Atlas search + Trace generation. |
| openinterp-mcp | MCP server + Colab backend. Bring-your-own-agent infrastructure for mech-interp research. Claude Code · Cursor · Cline compatible. |
| notebooks | Train your first SAE in 30 min → paper-grade at 27B. Free Colab + Kaggle + cloud ladders. |

Web

| Repo | What |
| --- | --- |
| web | openinterp.org: the protocol, the registry, the publications. |

Why this exists

Probes that hit AUROC 0.95 at N=50 collapse at N=500. SAE features that "explain" a concept fail under matched-norm random controls. Steering vectors that flip outputs turn out to be softmax temperature shifts. CoT-redirect interventions that clear sabotage end up causing it. And a "task-done" feature that predicts finishing at AUROC 0.91 doesn't cause it.
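One of the controls named above, the matched-norm random direction, can be sketched in a few lines. This is a minimal illustration, assuming the control is a Gaussian random vector rescaled to the L2 norm of the vector under test. The helper names `l2_norm` and `matched_norm_random` are hypothetical and are not taken from the registry's reference implementation.

```python
import math
import random
from typing import List


def l2_norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def matched_norm_random(v: List[float], rng: random.Random) -> List[float]:
    """Return a random direction with the same L2 norm as `v`."""
    g = [rng.gauss(0.0, 1.0) for _ in v]
    scale = l2_norm(v) / l2_norm(g)
    return [x * scale for x in g]


rng = random.Random(0)  # fixed seed so the control is reproducible
feature_direction = [3.0, 4.0, 0.0, 0.0]  # stand-in for an SAE feature or steering vector
control = matched_norm_random(feature_direction, rng)

print(l2_norm(feature_direction))
print(round(l2_norm(control), 6))
```

If steering along `control` changes the output about as much as steering along the learned feature, the effect is better attributed to the size of the perturbation than to the feature's direction.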

We caught all of these in our own work, with documented walk-backs and pre-registered nulls. The protocol and the arc's first positive result are what came out of it.

→ Read the Six Diagnostics: openinterp.org/research

→ Browse the Registry: openinterp.org/atlas

→ Eval Standard schemas: github.com/OpenInterpretability/registry

Maintainer: Caio Vicentino · caio@openinterp.org · Fortaleza, Brazil

License: Apache-2.0 (code) · CC-BY-4.0 (documentation)

Collaborate: caio@openinterp.org
