Open notebooks for training SAEs and reproducing 2024–2026 mech-interp papers on Gemma, Qwen, and Llama. Apache-2.0.
License Apache 2.0 · openinterp.org/train
| Repo | What's in it |
|---|---|
| .github | Org profile + shared CoC + SECURITY |
| web | Next.js site behind openinterp.org |
| notebooks (you are here) | 31 training + interpretability + product-reproducer notebooks |
| cli | pip install openinterp — Python SDK |
| mechreward | SAE features as dense RL reward |
| Tier | Notebook | Platform | VRAM | Cost | Model | Time |
|---|---|---|---|---|---|---|
| Hobbyist | 01_hobbyist_gemma2_2b_colab.ipynb | Colab Free T4 | 15 GB | $0 | Gemma-2-2B | 30–40 min |
| Explorer | 02_explorer_qwen35_4b_kaggle.ipynb | Kaggle 2× T4 | 32 GB | $0 | Qwen3.5-4B (hybrid GDN) | 4–5 h |
| Paper-grade | 03_papergrade_qwen36_27b_cloud.ipynb | Cloud RTX 6000 Pro | 96 GB | ~$30–60 | Qwen3.6-27B | 20–24 h |
| Notebook | What it does |
|---|---|
| 04_discover_features.ipynb | Auto-label your SAE's features with Claude or GPT-4, emit feature_catalog.json |
| 05_build_shareable_trace.ipynb | Your SAE + your prompt → trace.json in the Trace Theater format |
| 06_steer_your_model.ipynb | Live feature intervention: baseline vs α ∈ {−3, 0, 1, 3}. Q1 preview of the Q2 Sandbox. |
| Notebook | What it does |
|---|---|
| 07_pick_your_tier.ipynb | VRAM calculator + layer recommender. Zero GPU needed. |
| Notebook | Model | Platform |
|---|---|---|
| 08_explorer_llama3_8b_kaggle.ipynb | Llama-3.1-8B (Meta license) | Kaggle 2× T4 |
| 09_explorer_mistral_7b_kaggle.ipynb | Mistral-7B-v0.3 | Kaggle 2× T4 |
| 10_hobbyist_phi3_mini_colab.ipynb | Phi-3-mini-4k (Microsoft) | Colab Free T4 |
| Notebook | Paper / protocol |
|---|---|
| 11_stage_gate_g1.ipynb | Stage Gate 1 correlation pre-test (mechreward protocol) — ρ ≥ 0.30 on held-out GSM8K |
| 12_batchtopk_vs_topk.ipynb | BatchTopK vs TopK (Bussmann et al., arxiv:2412.06410) |
| Notebook | What it does |
|---|---|
| 13_watchtower_preview.ipynb | Monitor input prompts for anomalous feature activations. Q1 preview of Q4 Watchtower Enterprise. Forward-only, no generation. |
| Notebook | What it does |
|---|---|
| 14_attribution_patching.ipynb | AtP* (Kramár et al. 2024, arxiv:2403.00745) — QK-fix + GradDrop node attribution |
| 15_sparse_feature_circuits.ipynb | Marks et al. 2024 (arxiv:2403.19647) replication — node + edge + error-term DAG |
| 16_autocircuit_acdc.ipynb | ACDC slow-mode via AutoCircuit |
| 17_train_crosscoder.ipynb | Sparse Crosscoder (Lindsey et al. 2024) — shared dictionary across L11/L31/L55 |
All circuit notebooks emit JSON consumed directly by the Circuit Canvas on openinterp.org.
| Notebook | What it does |
|---|---|
| 18_interpscore_eval.ipynb | Composite SAE ranking — loss_recovered + alive + L0 + sparse probing + TPP. Emits interpscore.json → PR to web/lib/leaderboard.ts. |
| Notebook | Method |
|---|---|
| 19_logit_lens.ipynb | Logit Lens (nostalgebraist 2020). 5 lines of PyTorch, ~5 min on T4. |
| 20_tuned_lens.ipynb | Tuned Lens (Belrose et al. 2023, arxiv:2303.08112). Pretrained or fresh-fit. |
| Notebook | Method |
|---|---|
| 21_linear_probe.ipynb | sklearn LogisticRegression on residuals + diff-of-means baseline (Farquhar 2023 requires it) |
| 22_ccs_probe.ipynb | Contrast Consistent Search (Burns 2022) with honest critique baselines |
| 23_repe_reading_vector.ipynb | Representation Engineering LAT (Zou 2023) — extract + monitor + steer |
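The diff-of-means baseline named in the probing table can be illustrated in a few lines. This is a minimal plain-Python sketch, not the code in 21_linear_probe.ipynb; the function names and the list-of-lists input format are assumptions made for the example.

```python
def diff_of_means_direction(pos, neg):
    """Mean of positive-class activations minus mean of negative-class activations."""
    dim = len(pos[0])
    mean_pos = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    mean_neg = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]


def project(x, direction):
    """Score one activation vector by its dot product with the direction."""
    return sum(a * b for a, b in zip(x, direction))


# Toy 2-d "residuals": the classes differ mostly along the first axis.
direction = diff_of_means_direction(pos=[[2.0, 0.0], [4.0, 0.0]],
                                    neg=[[0.0, 0.0], [0.0, 2.0]])
score = project([1.0, 1.0], direction)
```

Because the direction has no fitted parameters beyond two class means, it gives a useful floor to compare a trained logistic-regression probe against.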
The full research arc behind the 2026-04-25 blog post on hallucination in 27B reasoning models. Notebooks 24 → 28b shipped 2026-04-25 → 26.
| Notebook | What it does |
|---|---|
| 24_hallucination_entity_separation_qwen36_27b.ipynb | v0.0.1 — fake AUROC=1.0 from a tokenization confound. The honest negative result. |
| 24b_hallucination_v002_ferrando_proper.ipynb | Ferrando 2024 replication on Qwen3.6-27B. AUROC 0.84 on 226 real Wikidata entities. |
| 25_steering_f61723_calibration.ipynb | Single-feature steering null result. Detection ≠ control. |
| 26_multi_feature_steering.ipynb | Multi-feature top-K (no controls). The version we almost shipped overclaimed. |
| 27_multi_feature_steering_with_controls.ipynb | The walk-back. 6 controls (random-K + Claude judge + permutation). It induces hallucination, not calibration. |
| 28_paper_baselines_qwen36_27b.ipynb | ICML MI Workshop 2026 paper-1 baselines. L31/f34957 0.81 vs LR ceiling 0.887 vs diff-of-means 0.859. Per-layer scan, bootstrap CI. |
| 28b_sensitivity_refusal_only.ipynb | Sensitivity ablation — same residual capture, two labelling rules. Reviewer-defence. |
The methodology behind paper-1's Pearson causal-equivalence (Pearson_CE) finding.
First per-feature causal-equivalence test in the crosscoder literature.
| Notebook | What it does | Pair |
|---|---|---|
| 17_train_crosscoder.ipynb | Cross-LAYER crosscoder (Lindsey 2024). Single model, multi-layer. | Gemma-2-2B L6/L12/L18 |
| 17b_crosscoder_model_diff_papergrade.ipynb | Cross-MODEL crosscoder + Pearson_CE. Median cosine 0.965 vs CE 0.616 — 38% gap. | Gemma-2-2B base/IT |
| 17c_crosscoder_rl_diffing_papergrade.ipynb | Cross-STAGE crosscoder. LoRA toggle pattern (single base + PEFT.disable_adapter). | Qwen3.5-4B base vs mechreward-G3 |
Each notebook reproduces an exact metric behind a shipped openinterp Guard
(SDK on PyPI, demo on HF, landing on openinterp.org/products/X).
Drop-in pip install openinterp and you have these probes.
| Notebook | Product | Headline number | Reproducer |
|---|---|---|---|
| 30_hallucinationguard_proof_qwen36_27b.ipynb | FabricationGuard PoC v1 | Single-feature failed cross-bench (0.50–0.60) | Open in Colab |
| 31_hallucinationguard_v2_linear_probe.ipynb | FabricationGuard v2 (production) | AUROC 0.88 cross-task · −88% confident-wrong | Open in Colab |
| 32_reasoningguard_proof_qwen36_27b.ipynb | ReasoningGuard PoC | TBD — passes 3/3 ships v0.3 | Open in Colab |
Each reproducer ships:
- probe.joblib + meta.json to HF dataset (drop-in for the SDK)
- verdict.json with raw numbers
- headline.png for landing pages / posts
- All artifacts pushed to caiovicentino1/<ProductName>-linearprobe-qwen36-27b (HF dataset)
All tiers use the same research-grade protocol; hyperparameters scale:
- TopK activation (Gao et al. 2024) — hard top-k, no L1 penalty
- AuxK auxiliary loss — dead-feature revival (α=1/32, k_aux=d/2, dead_threshold=10M tokens)
- Geometric-median b_dec init (Weiszfeld) — robust to heavy-tailed residuals
- Decoder column renorm every step — keeps features interpretable
- Cosine LR + warmup — non-zero floor for continued dead-feature revival
- HuggingFace streaming checkpoints — crash-safe, never lose more than 5-10 min
- sae_lens-compatible export — safetensors + cfg.json
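Three pieces of the protocol above are small enough to sketch in plain Python: the hard top-k activation, the Weiszfeld iteration behind a geometric-median init, and a cosine schedule with warmup and a non-zero floor. This is an illustration only, not the notebooks' implementation (which presumably operates on PyTorch tensors). The function names, default arguments, and the `floor_frac` parameter are assumptions made for the example.

```python
import math


def topk_activation(pre_acts, k):
    """Hard top-k: keep the k largest pre-activations, zero out the rest."""
    if k >= len(pre_acts):
        return list(pre_acts)
    # Indices of the k largest values; ties go to the lower index.
    keep = set(sorted(range(len(pre_acts)), key=lambda i: -pre_acts[i])[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def geometric_median(points, iters=100, eps=1e-8):
    """Weiszfeld iteration: re-weight each point by its inverse distance to the estimate."""
    dim = len(points[0])
    # Start from the arithmetic mean.
    est = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    for _ in range(iters):
        weights = [1.0 / max(math.dist(p, est), eps) for p in points]
        total = sum(weights)
        new = [sum(w * p[j] for w, p in zip(weights, points)) / total
               for j in range(dim)]
        if math.dist(new, est) < eps:
            return new
        est = new
    return est


def cosine_lr(step, total_steps, peak_lr, warmup_steps, floor_frac=0.1):
    """Linear warmup, then cosine decay down to floor_frac * peak_lr (never zero)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    floor = floor_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))
```

With three points at the origin and one outlier at (100, 100), the arithmetic mean lands at (25, 25) while the geometric median stays at the origin, which is why it is the more robust choice for initializing b_dec on heavy-tailed residuals.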
Use dtype=torch.bfloat16 (not the deprecated torch_dtype=) and attn_implementation='sdpa' (not flash-attn — reproducibility + install pain across Colab/Kaggle). HF_TOKEN goes through Colab/Kaggle secrets, never hard-coded. Stream checkpoints to HF every 5–10M tokens — Drive-only checkpoints die with the kernel. Use the multimodal layer-access fallback (getattr(model.model, 'layers', None) or model.model.language_model.layers), not a hard-coded .layers[N]. Report honest var_expl, L0, and dead-feature percentage — not cherry-picked seeds. CI checks all of these.
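The loading rules above can be sketched as two small helpers. This is a hedged example rather than the notebooks' code: `get_decoder_layers` and `load_model` are hypothetical names, and reading the token from an `HF_TOKEN` environment variable assumes the Colab/Kaggle secret has already been exported there.

```python
import os


def get_decoder_layers(model):
    """Return the transformer block list for text-only and multimodal wrappers."""
    inner = model.model
    layers = getattr(inner, "layers", None)
    if layers is None:
        # Multimodal checkpoints nest the text stack one level deeper.
        layers = inner.language_model.layers
    return layers


def load_model(model_id):
    """Load in bf16 with SDPA attention; the token is never hard-coded."""
    # Imported lazily so get_decoder_layers works without these packages.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        dtype=torch.bfloat16,              # not the deprecated torch_dtype=
        attn_implementation="sdpa",        # not flash-attn
        token=os.environ.get("HF_TOKEN"),  # assumed to be set from platform secrets
    )
```

A hook would then be attached to `get_decoder_layers(model)[LAYER]` instead of a hard-coded `model.model.layers[LAYER]`, so the same cell runs on both model families.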
Three common PR patterns, full rules in CONTRIBUTING.md:
- Port a notebook to a new model — pick an existing notebook at your tier and swap MODEL_ID, LAYER, D_MODEL. Name it NN_<tier>_<model>_<platform>.ipynb.
- Replicate a 2024–2026 paper — title cell with arxiv link, pinned install, paper hyperparameters, inline implementation, validation cell that matches the paper's headline metric within tolerance.
- Add a platform (TPU/ROCm/MPS) — write a _platform_<name>.py helper with pick_device() / get_dtype(), patch one notebook as PoC, open a draft PR and tag @caiovicentino for design review.
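For the third pattern, a platform helper might look like the sketch below, written for a hypothetical `_platform_mps.py`. Only the two function names come from the contribution rules; the string return values, the CPU fallback, and the float16 choice on MPS are assumptions to be settled in design review.

```python
"""_platform_mps.py: hypothetical helper for an Apple-silicon (MPS) port."""


def pick_device():
    """Return "mps" when PyTorch reports it as available, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on old PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def get_dtype(device):
    """Name of the dtype to run in; float16 on MPS is an untested assumption."""
    return "float16" if device == "mps" else "float32"
```

A notebook patched as a PoC would call `device = pick_device()` once at the top and resolve the dtype with `getattr(torch, get_dtype(device))`.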
Before opening a PR, validate JSON: python3 -c "import json; json.load(open('notebooks/YOUR.ipynb'))". CI runs nbformat.validate. If you have a GPU, dry-run with jupyter nbconvert --to notebook --execute --ExecutePreprocessor.timeout=300 — expect heavy training cells to time out; you're just catching import + dtype bugs.
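The two checks above (JSON parse, then nbformat.validate) can be combined into one pre-PR script. This is a sketch under assumptions: `check_notebook` is a hypothetical helper name, and the fallback used when nbformat is not installed is a deliberately minimal structural check, not what CI runs.

```python
import json


def check_notebook(path):
    """Return a list of problems found in the notebook at `path` (empty if none)."""
    try:
        with open(path, encoding="utf-8") as fh:
            raw = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"not readable as JSON: {exc}"]
    try:
        import nbformat
    except ImportError:
        # Without nbformat, only confirm the top-level shape.
        if isinstance(raw, dict) and "cells" in raw:
            return []
        return ["missing 'cells' key"]
    try:
        nbformat.validate(raw)
    except nbformat.ValidationError as exc:
        return [f"schema violation: {exc}"]
    return []
```

Calling `check_notebook('notebooks/YOUR.ipynb')` and confirming the result is an empty list catches malformed JSON and schema errors before CI does.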
If your notebook emits a JSON that the website consumes, match the schema:
| Tool | Schema (TypeScript source) |
|---|---|
| Trace Theater | web/lib/trace-data.ts · TraceScenario |
| Circuit Canvas | web/lib/circuit-data.ts · CircuitData |
| InterpScore leaderboard | web/lib/leaderboard.ts · LeaderboardEntry |
Your SAE is an asset. Put it to work:
- Trace Theater — 10 scenarios, view + share
- InterpScore — public leaderboard, submit your SAE
- Sandbox (Q2 2026) — drag-and-drop steering
- Expeditions (Q3 2026) — turn your run into a tutorial
- Discussions — "which notebook should I use for X?"
- Good-first-issues — start here
- Contributor guide — full workflow
- hi@openinterp.org
SAELens (checkpoint format) · Gemma Scope (reference at-scale SAE suite) · Gao et al. 2024 (TopK + AuxK) · Bussmann et al. 2024 (BatchTopK) · Neuronpedia.
Apache-2.0 · openinterp.org