Releases: OpenInterpretability/openinterp-swebench-harness

Paper v0.8: Tool-Entropy Collapse (Cross-Architecture Signature of Agent WANDERING Failure)

24 May 17:57
@caiovicentino

Tool-Entropy Collapse: A Cross-Architecture Signature of Agent WANDERING Failure

📄 Zenodo (v0.9 calibration revision): 10.5281/zenodo.20368807
🔗 Concept DOI (always latest): 10.5281/zenodo.20368600
📌 Original v0.8: 10.5281/zenodo.20368601
💬 LessWrong post: https://www.lesswrong.com/posts/D23kWLmpcXhFAQ2RY/tool-entropy-collapse-a-cross-architecture-signature-of

PDF + LaTeX source + 5 figures.

v0.9 calibration changes vs v0.8: removed 'breakthrough' language throughout in favor of 'most promising candidate signal in this work'. Added explicit hedge in conclusion that W/S ≈ 0.41 ratio match between Qwen and Llama is the most suggestive pattern that merits independent replication before being treated as a discovery. Same empirical results, same scope, same numbers.

Headline numbers:

  • 34% WANDERING blind spot in probe-only monitoring (95% CI [22.0%, 45.8%])
  • 6 detector designs across 3 signal channels tested
  • Tier 3 detector (v1 ∪ v5): 70% recall × 5% FP on Qwen primary dataset (N=20 WANDERING)
  • Cross-architecture suggestive pattern: W/S median entropy ratio ≈ 0.41 in Qwen AND Llama, 0.71 in GPT-5
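
These notes do not define the entropy measure or expand "W/S". As a minimal sketch, assuming tool entropy means the Shannon entropy of the tool names called within one trajectory, and that W and S denote two groups of trajectories (WANDERING and a comparison group), a median entropy ratio could be computed as follows. The function names and toy trajectories are hypothetical and are not taken from the repo's scripts.

```python
import math
from collections import Counter
from statistics import median


def tool_entropy(tool_calls):
    """Shannon entropy, in bits, of the empirical tool-name distribution."""
    if not tool_calls:
        return 0.0
    counts = Counter(tool_calls)
    total = len(tool_calls)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def median_entropy_ratio(group_w, group_s):
    """Median per-trajectory entropy of group W divided by that of group S."""
    denom = median(tool_entropy(t) for t in group_s)
    if denom == 0:
        raise ValueError("comparison group has zero median entropy")
    return median(tool_entropy(t) for t in group_w) / denom


# Toy data (assumption): each trajectory is a list of tool names.
group_w = [["bash", "bash", "edit", "edit"]]    # two tools, used equally
group_s = [["bash", "edit", "search", "test"]]  # four tools, used equally
ratio = median_entropy_ratio(group_w, group_s)
```

A ratio below 1 means the W group spreads its calls over fewer tools, or less evenly, than the S group, which is the direction the term "collapse" suggests.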

Honest scope:

  • Cross-task validation on METR MALT (15+ task families) is NULL
  • Scoped claim: multi-turn code-execution agent tasks with rich action spaces
  • W/S 0.41 ratio match merits independent replication on additional models

Citation:

Vicentino, C. (2026). Tool-Entropy Collapse: A Cross-Architecture Signature of
Agent WANDERING Failure (v0.9 calibration revision). Zenodo.
https://doi.org/10.5281/zenodo.20368807

Reproducibility: all scripts in this repo. Per-trajectory output JSONs in scripts/inflection_turn_out/. Apache-2.0.

v0.5 — paper-5 saturation-direction lever shipped

09 May 15:37
@caiovicentino

Paper-5 published 2026-05-09

Saturation-Direction Lever: A Five-Class Taxonomy of Probe Causality in Qwen3.6-27B
https://openinterp.org/research/papers/saturation-direction-probe-levers

Headline finding

α=−100 robustness theorem — the L31 pre_tool capability probe direction produces +33-40pp probe-vs-random pushdown gap across code distributions spanning Qwen3.6-27B pass-rate ~7-89% (HumanEval+MBPP, BigCodeBench, Codeforces ≥2000). The α=−100 locus is saturation-independent at moderate amplitude.
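
To make the quantities concrete, here is a minimal sketch under stated assumptions: steering is taken to mean adding α times a unit-normalized direction to a hidden-state vector, and the probe-vs-random gap is taken to be the pass-rate drop under the probe direction minus the drop under a random direction, in percentage points. Neither definition is spelled out in these notes, and the helper names and numbers below are hypothetical.

```python
import math


def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction) for plain Python vectors."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


def pushdown_gap_pp(base_rate, probe_rate, random_rate):
    """Probe-direction drop minus random-direction drop, in percentage points."""
    probe_drop = base_rate - probe_rate
    random_drop = base_rate - random_rate
    return 100.0 * (probe_drop - random_drop)


# Toy values (assumption): pass rates as fractions in [0, 1].
steered = steer([1.0, 2.0], [3.0, 4.0], alpha=-100)
gap = pushdown_gap_pp(base_rate=0.80, probe_rate=0.40, random_rate=0.75)
```

Subtracting the random-direction drop is what separates an effect specific to the probe direction from the generic damage of a large perturbation.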

Phase coverage (this release)

  • Phase 0 — Smoke G1+G4 PASS
  • Phase 1 — N=20 stratified, 12k captures, G4 GREEN
  • Phase 2 — Differential probes, AUROC up to 0.958
  • Phase 5d/6/6c — N=99 + methodology sweep
  • Phase 7 — L43 pre_tool epiphenomenal (softmax-temp artifact)
  • Phase 8 — L55 thinking template-locked
  • Phase 10 — RG L55 mid_think first lever (+30pp pushup at α=+200)
  • Phase 11/11b — 4/4 capability sites pushdown-asymmetric (+30 to +60pp)
  • Phase 12 — Persona falsifier #1 → motivates saturation-direction theory
  • Phase 11c — Cross-distribution BCB +33pp at α=−100
  • Phase 11d — Codeforces ≥2000 +40pp at α=−100; falsifier #2 walks back saturation-magnitude corollary
  • Phase 11e — Multi-site Codeforces (in progress at release)

Methodology contributions (mandatory at every site)

  1. Random K-matched probe baseline (paper-3 §3.1)
  2. Control-token normalization for log-prob shifts (paper-3 §3.2)
  3. Structural-rigidity α-sweep at α >> ‖h‖ (paper-3 §3.3)
  4. Whitespace-stripped behavioral flip metric (paper-3 §3.4)

Three of these four checks caught a confident-but-wrong claim during the work.
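
As an illustration of the fourth check, a whitespace-stripped flip metric might look like the sketch below. It assumes a "flip" means the steered output differs from the baseline output once all whitespace is removed, so that pure formatting changes are not counted as behavioral changes. The function names are hypothetical, and the definition in paper-3 §3.4 may differ.

```python
def strip_whitespace(text):
    """Remove all whitespace so formatting-only differences compare equal."""
    return "".join(text.split())


def flip_rate(baseline_outputs, steered_outputs):
    """Fraction of paired outputs that differ after whitespace stripping."""
    if len(baseline_outputs) != len(steered_outputs):
        raise ValueError("output lists must be paired one-to-one")
    if not baseline_outputs:
        return 0.0
    flips = sum(
        strip_whitespace(b) != strip_whitespace(s)
        for b, s in zip(baseline_outputs, steered_outputs)
    )
    return flips / len(baseline_outputs)


# Toy pairs (assumption): the first differs only in spacing, the second in content.
rate = flip_rate(["x = 1\n", "return a"], ["x=1", "return b"])
```

Note that stripping every whitespace character is a blunt choice for whitespace-sensitive languages such as Python, where indentation carries meaning; a stricter variant could normalize only trailing whitespace and blank lines.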

Two pre-registered falsification cycles

  • #1 Phase 12 persona — predicted pushup-asymmetric (continuous-gradient class), observed pushdown → falsifies categorical-vs-continuous frame, motivates saturation-direction principle
  • #2 Phase 11d Codeforces — predicted pushdown collapse + pushup emergence at lowest saturation (saturation-magnitude corollary), observed saturation-INDEPENDENT pushdown → walks back corollary, refines to α=−100 robustness theorem

Companion artifacts

  • HF dataset — caiovicentino1/agent-probe-guard-qwen36-27b (probe weights for v0.1 SDK; v0.2 will swap L43 → L31 pre_tool from Phase 11)
  • PyPI — pip install openinterp (v0.3.0+)
  • Web — paper-5 + 4 prior papers at openinterp.org/research
  • License — Apache-2.0 throughout

Reproducibility

12 standalone Colab notebooks in notebooks/. Build scripts in scripts/build_nb_swebench_v*.py regenerate every notebook from source. ~6.5h total compute on RTX 6000 Blackwell from cold start.
