OpenInterpretability/openinterp-lab

openinterp-lab

An agent-operable mechanistic-interpretability lab on the new Google Colab CLI. Replicate real interpretability papers with one command. Teach your coding agent to do mech-interp.

uv tool install google-colab-cli && colab status # auth once
pip install git+https://github.com/OpenInterpretability/openinterp-lab
oilab replicate tool-entropy # FREE, CPU, ~1 min -> REPLICATION PASS
oilab replicate lever-is-late -y # rents an A100, replicates a causal-steering paper, tears down

What this is

The Colab CLI (June 2026) lets a terminal — or an AI agent — provision GPUs and run code on them. openinterp-lab builds the research layer on top:

  • oilab replicate <paper> — one-command, auto-verified replication of published interpretability experiments (fetch notebook → provision GPU → execute → pull results JSON → compare against published numbers → PASS / DIVERGENT verdict → tear down).
  • oilab run <notebook> --gpu A100 — run your experiment with the hardened flow (proper timeouts, ephemeral-disk-safe result capture, token injection without echoing, auto-teardown).
  • skills/openinterp-lab/SKILL.md — a skill file that teaches Claude Code / Codex / any agent to operate the whole stack: 5 research loops, the verified asset registry, and every operational gotcha we hit so your agent doesn't have to.
# give the skill to Claude Code:
cp -r skills/openinterp-lab ~/.claude/skills/
# then just ask: "replicate the lever-is-late paper and explain what it shows"
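
The replicate flow described above (provision, execute, compare against published numbers, verdict, teardown) can be sketched in a few lines. This is a minimal illustration, not oilab's actual implementation: the `provision`, `execute`, and `teardown` callables, the `replicate` and `verdict` names, and the 5% relative tolerance are all assumptions made for the example.

```python
import json
import math
from typing import Callable, Dict


def verdict(results: Dict[str, float], published: Dict[str, float],
            rel_tol: float = 0.05) -> str:
    """Return "PASS" if every published metric is reproduced within rel_tol
    (an assumed tolerance), otherwise "DIVERGENT"."""
    for name, expected in published.items():
        got = results.get(name)
        if got is None or not math.isclose(got, expected, rel_tol=rel_tol):
            return "DIVERGENT"
    return "PASS"


def replicate(provision: Callable[[], str],
              execute: Callable[[str], str],
              teardown: Callable[[str], None],
              published: Dict[str, float]) -> str:
    """Provision a runtime, run the experiment, compare, and always tear down.
    All three callables are hypothetical stand-ins for the real steps."""
    runtime = provision()
    try:
        results = json.loads(execute(runtime))  # execute returns results JSON text
        return verdict(results, published)
    finally:
        teardown(runtime)  # runs even if execution or JSON parsing raises


# Usage with fake steps; 0.887 is the published AUROC for tool-entropy.
torn_down = []
status = replicate(
    provision=lambda: "fake-runtime",
    execute=lambda runtime: '{"auroc": 0.89}',
    teardown=torn_down.append,
    published={"auroc": 0.887},
)
print(status, torn_down)
```

Putting the teardown in a `finally` clause is the design point: a rented GPU is released whether the notebook succeeds, fails, or produces unparseable output.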

Replicable experiments

| key | claim under test | hardware |
| --- | --- | --- |
| tool-entropy | tool-use entropy collapse separates WANDERING agents (AUROC 0.887); DOI 10.5281/zenodo.20368600 | CPU, free |
| lever-is-late | the termination decision of a 27B agent is causally writable only in a late action-commitment block; task-matched donor flips real generations 42%, p=0.031; DOI 10.5281/zenodo.20534219 | Colab A100 |
| commitment-lever (pre-registered, in flight) | does that late lever generalize to a second committal action? | Colab A100 |

Replication divergence is a finding, not a failure — open an issue with your results.json.
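
For readers unfamiliar with the tool-entropy row, here is one way a tool-use entropy signal could be computed. The exact definition used in the paper is not given here, so this is an assumption: Shannon entropy of tool names over a sliding window, with hypothetical function names, window size, and trajectory.

```python
import math
from collections import Counter
from typing import List, Sequence


def shannon_entropy(tool_calls: Sequence[str]) -> float:
    """Shannon entropy in bits of the empirical tool-name distribution."""
    if not tool_calls:
        return 0.0
    total = len(tool_calls)
    counts = Counter(tool_calls)
    # p * log2(1/p) summed over distinct tools
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def windowed_entropy(tool_calls: Sequence[str], window: int = 4) -> List[float]:
    """Entropy of each sliding window of `window` consecutive tool calls."""
    return [shannon_entropy(tool_calls[i:i + window])
            for i in range(len(tool_calls) - window + 1)]


# Hypothetical trajectory: varied tool use that narrows to a single tool.
trajectory = ["grep", "read", "edit", "test", "read", "read", "read", "read"]
print(windowed_entropy(trajectory))
```

In this toy trajectory the windowed entropy falls as the agent repeats one tool, which is the kind of collapse the claim refers to.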

The 5 loops (see SKILL.md for full recipes)

  1. Replicate a paper (above).
  2. Locate & steer a decision lever on any open model — decision-locator.
  3. Probe with the causal step enforced. The report always answers "predicts?" and "controls?" as separate questions (the arc's core lesson: an AUROC-0.91 feature can be causally inert).
  4. SAE features on Qwen3.6-27B with the pretrained 11-layer full-stack SAE.
  5. Honest-research pipeline: PREREG → run → adversarial EVAL → Zenodo DOI. Nulls included.
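
The "predicts?" half of loop 3 is usually summarized by AUROC. The sketch below computes it from the standard pairwise definition; the function name is illustrative and nothing here is taken from the lab's own code. A high value only shows that a feature separates the classes. Answering "controls?" requires intervening on the feature, which this code does not do.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative.
    Ties count as half a win (the Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical probe scores: positives mostly, but not always, score higher.
print(auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```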

Why trust this

It's extracted from a real research program — the WANDERING arc (6 papers + a tool, all open access with permanent DOIs, including the honest nulls and the corrected claims). The data is public (99 labeled SWE-bench Pro agent trajectories), the notebooks are public, and the wrapper exists because we lost a GPU run to every gotcha it now guards against.

Open problem, free to a good home

Early external detection of agent WANDERING is unsolved — we tested 6 cheap methods (tool-entropy variants, repeated actions, information gain, reasoning-text signals, fused classifiers, an LLM judge) and none beat chance early. The labeled dataset is public. If you can detect WANDERING by turn 15 at <5% FP, that's a paper. Baselines to beat are in the SKILL.md.
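
The target above (detect WANDERING by turn 15 at under 5% false positives) can be scored as recall at a capped false-positive rate. The sketch below is an assumed evaluation, not the baseline code from SKILL.md: it presumes each trajectory already has one detector score computed from its first 15 turns only, and the function name and the "score >= threshold" flagging rule are choices made for illustration.

```python
from typing import Sequence, Tuple


def recall_at_fpr(scores: Sequence[float], labels: Sequence[int],
                  max_fpr: float = 0.05) -> Tuple[float, float]:
    """Best recall over thresholds whose false-positive rate is <= max_fpr.
    Returns (recall, threshold); a trajectory is flagged when score >= threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    best = (0.0, float("inf"))  # flag nothing: zero recall, zero false positives
    # Lowering the threshold can only raise both recall and FPR,
    # so stop at the first threshold that breaks the cap.
    for t in sorted(set(scores), reverse=True):
        fpr = sum(s >= t for s in neg) / len(neg)
        if fpr > max_fpr:
            break
        best = (sum(s >= t for s in pos) / len(pos), t)
    return best


# Hypothetical turn-15 scores; label 1 marks a WANDERING trajectory.
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(recall_at_fpr(scores, labels))
```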

License

Apache-2.0. Built by OpenInterpretability · @0xCVYH.
