Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
[NeurIPS 2025 MechInterp Workshop - Spotlight] Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching"
Lightweight representation engineering dataflow operations for agent developers.
Real-time 3D visualisation of SAE feature activations inside GPT-2, token by token
Investigating whether language models encode anticipated social consequences in their activations. Uses a 2x2 factorial design crossing truth × social valence to show that models are more sensitive to expected approval/disapproval than to truth itself.
Training and exploration of linear probes into Othello-GPT by Li et al. (2022)
Open-source EU AI Act Annex IV documentation toolkit. Mechanistic interpretability + circuit discovery for transformers. One function call generates a structured, hash-chained evidence package.
Implementation and analysis of Sparse Autoencoders for neural network interpretability research. Features interactive visualization dashboard and W&B integration.
Does Quantization Kill Interpretability? Scaling study across 5 models (124M-2.8B): RTN destroys induction heads in small models, GPTQ preserves them at all scales.
Evaluating how a model's 'knowing what it knows' changes from the base model to its instruct-tuned variant
Testing role-based pathways on small LLMs
Knowledge Activation Mapping & Understanding Interface (KAMUI) — A Transformer Interpretability Framework Built From Scratch in PyTorch.
Mechanistic interpretability toolkit for comparing transformer activations, token shifts, and activation patching behaviour.
When does activation steering actually work? A reliability audit of steering vectors on GPT-2-small.
Reverse engineering the circuit responsible for the "greater than" capability in a language model
A Flax-based library for examining transformers, based on TransformerLens.
Probing where in Pythia's residual stream the decision to be sycophantic is already 'decided', using linear classifiers on per-layer activations against a small labeled sycophancy dataset.
Mechanistic study of the refusal direction across base, instruction-tuned, and reasoning-distilled Qwen2.5-1.5B variants: extraction, ablation, transplant, and phase-aware analysis.
Automated detection, visualization and suppression of hallucination-associated neurons in open-source LLMs — LLM mechanistic interpretability research tool
Hands-on exploration of GPT-2 and transformer internals for text generation using TransformerLens — attention, mechanistic interpretability and sampling, explained step by step.