OpenInterpretability

When should we believe a mechanistic interpretability claim — and where, inside a model, does a decision actually live?

Mechanistic interpretability of long-horizon LLM agents, built on Qwen3.6-27B since April 2026: a protocol, a benchmark, a registry, and the WANDERING arc, a six-paper study of why agents fail to stop that ends in the arc's first positive result.

openinterp.org · decision-locator · pip install openinterp · pip install openinterp-mcp · Apache-2.0


⭐ Featured — the WANDERING arc + decision-locator

Long-horizon coding agents fail by WANDERING: they stay internally sure the task is solved but never emit the finish action, burning the whole turn budget. Across six papers (Qwen3.6-27B, SWE-bench Pro, all CC-BY-4.0) we showed that the agent's "task-done" verdict is linearly decodable (AUROC 0.81–0.91) yet causally inert: no residual injection rescues the agent, and clamping the exact, named SAE "done" feature moves the probability of finishing by only −0.001. Then we found where control actually lives.
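To make the AUROC figures above concrete, here is a minimal sketch of how AUROC can be computed for a probe's scores. It is not the code used in the papers. The function name `auroc` and the toy scores and labels are assumptions made up for illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 (e.g. "task done") or 0.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical probe outputs, not data from the arc.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
print(f"toy AUROC = {toy_auroc:.3f}")
```

A high AUROC only says the verdict can be read off the activations. It says nothing about whether writing to the same direction changes behavior, which is the gap the arc documents.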

The law: the knowledge–action gap on agents is a layer gap. The decision is known mid-stream (the verdict, at layer 23) but only writable late (layers 51–63, ~30 layers downstream). Patching that late, task-matched block makes a stuck agent emit a real finish call 42% of the time, up from a 0% baseline (exact McNemar p = 0.031).
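For readers unfamiliar with the exact McNemar test cited above, the following is a minimal sketch of the two-sided exact version, which reduces to a binomial test on the discordant pairs. The function name `exact_mcnemar` and the counts in the example are hypothetical. The papers' actual pair counts, and whether their test was one- or two-sided, are not stated here.

```python
from math import comb


def exact_mcnemar(b, c):
    """Two-sided exact McNemar p-value.

    b: pairs that succeed only with the intervention
    c: pairs that succeed only without it
    Under the null, each discordant pair falls either way with probability 1/2.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 6 discordant pairs, all favoring the intervention.
p_value = exact_mcnemar(6, 0)
print(p_value)
```

With a 0% baseline every discordant pair can only favor the intervention, so the p-value depends solely on how many paired runs flipped.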

🛠 decision-locator packages the method — find & steer the commitment layer for any tool-calling decision on any open-weight model:

pip install git+https://github.com/OpenInterpretability/decision-locator
decision-locator demo --model gpt2 # locate → sweep → steer, on a laptop

📄 The arc, permanent DOIs: #1 Tool-Entropy · #2 Right Locus · #3 Multi-Channel · #4 Modality Matters · #5 Verdict Circuit · #6 The Lever Is Late · companion note — read them at openinterp.org/research.


What's here

Core protocol

| Repo | What |
|---|---|
| registry | Six Diagnostics schemas + reference implementation. JSON cards for probes, causal reports, intervention traces. Failed-Replication Registry data. |

Research artifacts

| Repo | What |
|---|---|
| openinterp-swebench-harness | Instrumented agent harness capturing SAE feature trajectories during agent reasoning on SWE-bench Pro. Substrate for the six-paper WANDERING arc. |
| decision-locator | pip install-able, model-agnostic tool: find the layer where a model commits a decision, and steer it. The method behind WANDERING arc paper #6. CLI + Colab + CI. |
| inspect-tool-entropy-collapse | The tool-entropy-collapse WANDERING detector as an Inspect eval (UK AISI inspect_evals submission). |
| mechreward | Mechanistic interpretability as reward signal for RL training. SAE features + GRPO + anti-Goodhart framework. |

Developer tools

| Repo | What |
|---|---|
| cli | pip install openinterp. FabricationGuard probe + ProbeBench leaderboard + Atlas search + Trace generation. |
| openinterp-mcp | MCP server + Colab backend. Bring-your-own-agent infrastructure for mech-interp research. Claude Code · Cursor · Cline compatible. |
| notebooks | Train your first SAE in 30 min → paper-grade at 27B. Free Colab + Kaggle + cloud ladders. |

Web

| Repo | What |
|---|---|
| web | openinterp.org: the protocol, the registry, the publications. |

Why this exists

Probes that hit AUROC 0.95 at N=50 collapse at N=500. SAE features that "explain" a concept fail under matched-norm random controls. Steering vectors that flip outputs turn out to be softmax temperature shifts. CoT-redirect interventions that clear sabotage end up causing it. And a "task-done" feature that predicts finishing at AUROC 0.91 doesn't cause it.

We caught these in our own work, with documented walk-backs and pre-registered nulls. The protocol, and the arc's first positive result, are the outcome.
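One of the controls named above, the matched-norm random control, can be sketched as follows. This is an illustration under stated assumptions, not the registry's reference implementation. `matched_norm_random`, `empirical_p`, and the toy `toy_effect` readout are hypothetical names introduced here.

```python
import math
import random


def matched_norm_random(direction, rng):
    """Random Gaussian direction rescaled to the L2 norm of `direction`."""
    target = math.sqrt(sum(x * x for x in direction))
    noise = [rng.gauss(0.0, 1.0) for _ in direction]
    noise_norm = math.sqrt(sum(x * x for x in noise))
    if noise_norm == 0.0:
        raise ValueError("degenerate random draw")
    return [x * target / noise_norm for x in noise]


def empirical_p(effect_fn, direction, n_controls=200, seed=0):
    """Fraction of matched-norm random directions whose effect is at least
    as large as the candidate direction's (with add-one smoothing)."""
    rng = random.Random(seed)
    real = abs(effect_fn(direction))
    hits = sum(
        abs(effect_fn(matched_norm_random(direction, rng))) >= real
        for _ in range(n_controls)
    )
    return (1 + hits) / (1 + n_controls)


# Toy stand-in for "steer along v and measure the behavior change":
# the effect is the projection of v onto a fixed readout axis.
def toy_effect(v):
    return v[0]


feature_direction = [2.0] + [0.0] * 15  # hypothetical feature direction
control = matched_norm_random(feature_direction, random.Random(1))
p_control = empirical_p(toy_effect, feature_direction)
print(p_control)
```

If random directions of the same norm move the output as much as the candidate feature does, the feature's apparent effect is explained by perturbation size rather than by the concept it is supposed to encode.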

→ Read the Six Diagnostics: openinterp.org/research
→ Browse the Registry: openinterp.org/atlas
→ Eval Standard schemas: github.com/OpenInterpretability/registry


Maintainer: Caio Vicentino · caio@openinterp.org · Fortaleza, Brazil
License: Apache-2.0 (code) · CC-BY-4.0 (documentation)
Collaborate: caio@openinterp.org

Popular repositories

  1. mechreward

    Mechanistic interpretability as reward signal for RL training of LLMs: SAE features + GRPO + anti-Goodhart framework

    Jupyter Notebook 5

  2. notebooks

    Train your first SAE in 30 min → paper-grade at 27B. Free Colab · free Kaggle · cloud ladders. Every scale covered.

    Jupyter Notebook 3

  3. openinterp-swebench-harness

    Instrumented agent harness for capturing SAE feature trajectories during SWE-bench Pro traces on Qwen3.6-27B (mech anatomy of agent reasoning failure)

    Python 2

  4. decision-locator

    Find the layer where a language model commits a decision, and steer it. Any open-weight HF model. (WANDERING arc paper #6)

    Python 2

  5. web

    Next.js site for OpenInterpretability, the umbrella org for mechreward and public hybrid-architecture SAEs

    TypeScript 1 1

  6. agentguard

    Defense-in-depth action firewall for tool-using agents, with a model-internal intent brake that closes the model-origin blind spot. Built on Zenodo 10.5281/zenodo.20679287.

    Python 1
