Find the layer where a language model commits a decision — and steer it. Any open-weight HF model. (WANDERING arc paper #6)
Updated Jun 7, 2026 · Python
Open-source EU AI Act Annex IV documentation toolkit. Mechanistic interpretability + circuit discovery for transformers. One function call generates a structured, hash-chained evidence package.
OKI TRACE: Local LLM observability. See step-by-step, layer-by-layer what your AI thinks. Logit Lens & Attention for HuggingFace models.
Decoding the black box of LLMs: A comparative analysis of Logit Lens vs. Tuned Lens to interpret intermediate Transformer layers in GPT-2.
🏛️ Champollion cracked hieroglyphs in 1822. I applied the same logic to LLM internals. 95% accuracy, $0 cost, fully reproducible. Contributors welcome.
Mechanistic interpretability CLI for transformer models on Apple Silicon. Analyze per-layer predictions, monitor activation drift, compare models, discover circuits. MLX-based, no GPU needed.
Local Streamlit app for mechanistic interpretability of transformer models.
Sparse Readout Prism: a sparse LM-head basis for logit-lens readouts — companion code for the paper. Pretrained dictionaries: hf.co/hematteo/sparse-readout-prism
From-scratch PyTorch implementation of the Tuned Lens (Belrose et al., 2023) — learned per-layer affine probes that sharpen intermediate transformer predictions beyond the raw logit lens.
We optimize a compact latent state (with frozen weights) to force failed multi-hop chains to output the missing answer D. Five pre-registered controls show that it simply injects D: the state carries D without the code-fact, leaves intermediates invisible, is inert to hop corruption, and doesn't transfer. No latent composition at 3B (Llama-3.2-3B, Qwen2.5-3B).
Empirical evidence for predictive coding tendencies in the GPT-2 family: residual stream convergence, activation patching, MLP transform analysis, zero-ablation, and logit lens across 7 languages.
Logit Lens terminal visualizer (nostalgebraist, 2020) — decodes GPT-2's intermediate layer predictions using the unembedding matrix, built with TransformerLens and Rich.
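For readers new to the topic, the logit-lens readout these projects share can be sketched in a few lines: take each layer's hidden state, apply the final normalization, project it through the unembedding matrix, and read off the most likely token. The sketch below is a toy illustration in pure Python, not code from any repository listed here. The vocabulary, the unembedding rows, and the hidden states are hypothetical numbers, and the layer norm omits the learned scale and shift that a real model such as GPT-2 applies.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden_states, unembed, vocab):
    """Decode every layer's hidden state with the shared unembedding matrix.

    Returns one (top_token, probability) pair per layer.
    """
    readout = []
    for h in hidden_states:
        normed = layer_norm(h)
        # One logit per vocabulary entry: dot product of its unembedding row with the state.
        logits = [sum(w * v for w, v in zip(row, normed)) for row in unembed]
        probs = softmax(logits)
        best = max(range(len(vocab)), key=lambda i: probs[i])
        readout.append((vocab[best], probs[best]))
    return readout

# Hypothetical toy data: 3 layers, hidden size 4, vocabulary of 3 tokens.
vocab = ["cat", "dog", "fish"]
unembed = [
    [1.0, 0.0, -1.0, 0.0],   # row for "cat"
    [0.0, 1.0, 0.0, -1.0],   # row for "dog"
    [-1.0, -1.0, 1.0, 1.0],  # row for "fish"
]
hidden_states = [
    [2.0, 0.1, -1.0, 0.0],   # early layer
    [0.5, 1.5, 0.0, -1.0],   # middle layer
    [0.0, 3.0, 0.2, -2.0],   # final layer
]

readout = logit_lens(hidden_states, unembed, vocab)
for layer, (token, prob) in enumerate(readout):
    print(f"layer {layer}: {token} (p={prob:.2f})")
```

With a real model, the hidden states would come from the residual stream at each layer and the unembedding matrix from the LM head. A tuned lens would additionally insert a learned per-layer affine map before the projection.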