Releases: OpenInterpretability/openinterp-swebench-harness
Paper v0.8: Tool-Entropy Collapse (Cross-Architecture Signature of Agent WANDERING Failure)
Tool-Entropy Collapse: A Cross-Architecture Signature of Agent WANDERING Failure
📄 Zenodo (v0.9 calibration revision): 10.5281/zenodo.20368807
🔗 Concept DOI (always latest): 10.5281/zenodo.20368600
📌 Original v0.8: 10.5281/zenodo.20368601
💬 LessWrong post: https://www.lesswrong.com/posts/D23kWLmpcXhFAQ2RY/tool-entropy-collapse-a-cross-architecture-signature-of
PDF + LaTeX source + 5 figures.
v0.9 calibration changes vs v0.8: removed 'breakthrough' language throughout in favor of 'most promising candidate signal in this work'. Added explicit hedge in conclusion that W/S ≈ 0.41 ratio match between Qwen and Llama is the most suggestive pattern that merits independent replication before being treated as a discovery. Same empirical results, same scope, same numbers.
Headline numbers:
- 34% WANDERING blind spot in probe-only monitoring (95% CI [22.0%, 45.8%])
- 6 detector designs across 3 signal channels tested
- Tier 3 detector (v1 ∪ v5): 70% recall × 5% FP on Qwen primary dataset (N=20 WANDERING)
- Cross-architecture suggestive pattern: W/S median entropy ratio ≈ 0.41 in Qwen AND Llama, 0.71 in GPT-5
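The Tier 3 figure above combines two detectors by union. A minimal sketch of that combination and its scoring follows, assuming each detector emits one boolean flag per trajectory and that labels mark WANDERING trajectories as True. The function names and the toy data are hypothetical and are not taken from the repo's scripts.

```python
def union_flags(a, b):
    """Flag a trajectory if either detector flags it."""
    if len(a) != len(b):
        raise ValueError("detector outputs must cover the same trajectories")
    return [x or y for x, y in zip(a, b)]


def recall_and_fp_rate(flags, labels):
    """Return (recall on WANDERING, false-positive rate on the rest)."""
    if len(flags) != len(labels):
        raise ValueError("flags and labels must have the same length")
    positives = sum(1 for is_w in labels if is_w)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one trajectory of each class")
    true_pos = sum(1 for f, is_w in zip(flags, labels) if f and is_w)
    false_pos = sum(1 for f, is_w in zip(flags, labels) if f and not is_w)
    return true_pos / positives, false_pos / negatives


# Toy data only: True in labels means the trajectory is WANDERING.
labels = [True, True, True, False, False]
v1_flags = [True, False, False, False, False]
v5_flags = [False, True, False, True, False]

combined = union_flags(v1_flags, v5_flags)
recall, fp_rate = recall_and_fp_rate(combined, labels)
```

A union can only raise recall relative to either detector alone, and it can only raise the false-positive rate as well, so both numbers need to be reported together.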
Honest scope:
- Cross-task validation on METR MALT (15+ task families) is NULL
- Scoped claim: multi-turn code-execution agent tasks with rich action spaces
- W/S 0.41 ratio match merits independent replication on additional models
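For anyone attempting the replication, the ratio could be computed along the following lines. This is a sketch under stated assumptions: W and S are taken to mean WANDERING and successful trajectories, tool entropy is taken to be the Shannon entropy (in bits) of the tool names called within one trajectory, and the ratio is the median over W divided by the median over S. The paper's exact definition may differ, and the function names here are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def tool_entropy(tool_calls):
    """Shannon entropy in bits of the tool-name distribution in one trajectory."""
    if not tool_calls:
        return 0.0
    total = len(tool_calls)
    counts = Counter(tool_calls)
    # p * log2(1/p) summed over distinct tools; 0.0 when only one tool is used.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def ws_median_entropy_ratio(wandering, success):
    """Median tool entropy of WANDERING runs divided by that of successful runs."""
    w_median = median(tool_entropy(t) for t in wandering)
    s_median = median(tool_entropy(t) for t in success)
    if s_median == 0.0:
        raise ValueError("median entropy of successful runs is zero")
    return w_median / s_median


# Toy trajectories: each is the ordered list of tool names an agent called.
wandering_runs = [["bash", "edit"], ["bash", "bash"], ["bash", "edit"]]
success_runs = [["bash", "edit", "test", "search"]]
ratio = ws_median_entropy_ratio(wandering_runs, success_runs)
```

A ratio below 1 means WANDERING trajectories draw on a narrower mix of tools than successful ones, which is the direction the reported 0.41 points in.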
Citation:
Vicentino, C. (2026). Tool-Entropy Collapse: A Cross-Architecture Signature of
Agent WANDERING Failure (v0.9 calibration revision). Zenodo.
https://doi.org/10.5281/zenodo.20368807
Reproducibility: all scripts in this repo. Per-trajectory output JSONs in scripts/inflection_turn_out/. Apache-2.0.
v0.5 — paper-5 saturation-direction lever shipped
Paper-5 published 2026-05-09
Saturation-Direction Lever: A Five-Class Taxonomy of Probe Causality in Qwen3.6-27B
→ https://openinterp.org/research/papers/saturation-direction-probe-levers
Headline finding
α=−100 robustness theorem — the L31 pre_tool capability probe direction produces +33-40pp probe-vs-random pushdown gap across code distributions spanning Qwen3.6-27B pass-rate ~7-89% (HumanEval+MBPP, BigCodeBench, Codeforces ≥2000). The α=−100 locus is saturation-independent at moderate amplitude.
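To make the quantities in this finding concrete, here is a minimal sketch of steering along a direction at amplitude α and of the probe-vs-random gap. It assumes the intervention adds α times a unit-norm direction to a hidden-state vector, and that the gap is the pass rate under random-direction steering minus the pass rate under probe-direction steering, in percentage points. Both are assumptions about the method, and every name below is hypothetical.

```python
import math
import random


def unit(vec):
    """Return vec scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def steer(hidden, direction, alpha):
    """Add alpha times the unit direction to a hidden-state vector."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same size")
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]


def random_direction(dim, seed=0):
    """Unit-norm Gaussian direction used as the random control."""
    rng = random.Random(seed)
    return unit([rng.gauss(0.0, 1.0) for _ in range(dim)])


def pushdown_gap_pp(pass_rate_probe, pass_rate_random):
    """Extra pass-rate drop (percentage points) from probe vs random steering."""
    return 100.0 * (pass_rate_random - pass_rate_probe)


# Toy numbers only: a 2-d hidden state pushed down at alpha = -100.
steered = steer([1.0, 2.0], [3.0, 4.0], alpha=-100.0)
gap = pushdown_gap_pp(pass_rate_probe=0.20, pass_rate_random=0.55)
```

Normalizing both the probe and the random direction keeps the perturbation size equal across the two arms, so a nonzero gap reflects the direction rather than the amplitude.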
Phase coverage (this release)
- Phase 0 — Smoke G1+G4 PASS
- Phase 1 — N=20 stratified, 12k captures, G4 GREEN
- Phase 2 — Differential probes, AUROC up to 0.958
- Phase 5d/6/6c — N=99 + methodology sweep
- Phase 7 — L43 pre_tool epiphenomenal (softmax-temp artifact)
- Phase 8 — L55 thinking template-locked
- Phase 10 — RG L55 mid_think first lever (+30pp pushup at α=+200)
- Phase 11/11b — 4/4 capability sites pushdown-asymmetric (+30 to +60pp)
- Phase 12 — Persona falsifier #1 → motivates saturation-direction theory
- Phase 11c — Cross-distribution BCB +33pp at α=−100
- Phase 11d — Codeforces ≥2000 +40pp at α=−100; falsifier #2 walks back saturation-magnitude corollary
- Phase 11e — Multi-site Codeforces (in progress at release)
Methodology contributions (mandatory at every site)
- Random K-matched probe baseline (paper-3 §3.1)
- Control-token normalization for log-prob shifts (paper-3 §3.2)
- Structural-rigidity α-sweep at α >> ‖h‖ (paper-3 §3.3)
- Whitespace-stripped behavioral flip metric (paper-3 §3.4)
Three of these four checks caught a confident-but-wrong claim during the work.
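The last of these checks is simple enough to sketch. The version below assumes a "flip" means that the steered completion differs from the baseline completion once all whitespace is removed, so that pure formatting changes are not counted as behavioral changes. The function names are hypothetical and the paper-3 definition may be stricter.

```python
def strip_whitespace(text):
    """Remove every whitespace character, including newlines and tabs."""
    return "".join(text.split())


def is_flip(baseline, steered):
    """True when the outputs differ by more than whitespace."""
    return strip_whitespace(baseline) != strip_whitespace(steered)


def flip_rate(pairs):
    """Fraction of (baseline, steered) pairs that count as flips."""
    if not pairs:
        raise ValueError("need at least one output pair")
    return sum(1 for base, steered in pairs if is_flip(base, steered)) / len(pairs)


# Toy outputs: the first pair differs only in layout, the second in content.
pairs = [
    ("def f(x):\n    return x + 1\n", "def f(x):  return x + 1"),
    ("return x + 1", "return x - 1"),
]
rate = flip_rate(pairs)
```

Without the stripping step the first pair would register as a flip, which would inflate the measured effect of an intervention that only perturbs formatting.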
Two pre-registered falsification cycles
- #1 Phase 12 persona — predicted pushup-asymmetric (continuous-gradient class), observed pushdown → falsifies categorical-vs-continuous frame, motivates saturation-direction principle
- #2 Phase 11d Codeforces — predicted pushdown collapse + pushup emergence at lowest saturation (saturation-magnitude corollary), observed saturation-INDEPENDENT pushdown → walks back corollary, refines to α=−100 robustness theorem
Companion artifacts
- HF dataset: caiovicentino1/agent-probe-guard-qwen36-27b (probe weights for v0.1 SDK; v0.2 will swap L43 → L31 pre_tool from Phase 11)
- PyPI: pip install openinterp (v0.3.0+)
- Web: paper-5 + 4 prior papers at openinterp.org/research
- License: Apache-2.0 throughout
Reproducibility
12 standalone Colab notebooks in notebooks/. Build scripts in scripts/build_nb_swebench_v*.py regenerate every notebook from source. ~6.5h total compute on RTX 6000 Blackwell from cold start.