Explore Interpreto docs >>
Check out our explanation gallery >>
Read our paper >>
The library is available on PyPI; install it with pip install interpreto.
Check out the tutorials to get started:
- Attributions walkthrough (both classification and generation)
- Classification concept-based explanations
- Generation concept-based explanations
Interpreto provides a modular framework encompassing Attribution Methods, Concept-Based Methods, and Evaluation Metrics.
Interpreto includes both inference-based and gradient-based attribution methods.
They all work seamlessly for both classification (...ForSequenceClassification) and generation (...ForCausalLM) models.
Inference-based methods:
- KernelShap: Lundberg and Lee, 2017
- LIME: Ribeiro et al., 2016
- Occlusion: Zeiler and Fergus, 2014
- Sobol: Fel et al., 2021
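As an illustration of the inference-based family, here is a minimal sketch of occlusion on a toy scorer. It does not use Interpreto's API: `toy_score`, `WEIGHTS`, and the `[MASK]` token are hypothetical stand-ins for a real model's forward pass and tokenizer.

```python
from typing import Callable, List

# Hypothetical toy "model": scores a token list by summing per-word weights.
WEIGHTS = {"great": 2.0, "not": -1.5, "movie": 0.25}
MASK = "[MASK]"  # assumed mask token; it contributes nothing to the score


def toy_score(tokens: List[str]) -> float:
    """Stand-in for a classifier logit; unknown words score 0."""
    return sum(WEIGHTS.get(tok, 0.0) for tok in tokens)


def occlusion(tokens: List[str], score: Callable[[List[str]], float]) -> List[float]:
    """Attribution of token i = score(original) - score(input with token i masked)."""
    base = score(tokens)
    attributions = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [MASK] + tokens[i + 1:]
        attributions.append(base - score(occluded))
    return attributions


tokens = ["a", "great", "movie"]
attributions = occlusion(tokens, toy_score)
```

Only forward passes are needed, which is why methods of this family apply to any model that can be queried, at the cost of one inference per perturbed input.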
Gradient-based methods:
- GradientShap: Lundberg and Lee, 2017
- InputxGradient: Simonyan et al., 2013
- Integrated Gradient: Sundararajan et al., 2017
- Saliency: Simonyan et al., 2013
- SmoothGrad: Smilkov et al., 2017
- SquareGrad: Hooker et al., 2019
- VarGrad: Richter et al., 2020
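To make the gradient-based family concrete, the sketch below approximates integrated gradients on a toy differentiable function. It is an illustration under stated assumptions, not Interpreto's implementation: the model `f`, its weights `W`, and the analytic gradient `grad_f` are hypothetical, and a real model would rely on automatic differentiation.

```python
from typing import Callable, List

Vector = List[float]

W = [1.0, -2.0, 0.5]  # hypothetical weights of a toy model


def f(x: Vector) -> float:
    """Toy differentiable model: f(x) = sum_i w_i * x_i**2."""
    return sum(w * xi * xi for w, xi in zip(W, x))


def grad_f(x: Vector) -> Vector:
    """Analytic gradient of f; a real model would use autograd instead."""
    return [2.0 * w * xi for w, xi in zip(W, x)]


def integrated_gradients(
    x: Vector, baseline: Vector, grad: Callable[[Vector], Vector], steps: int = 50
) -> Vector:
    """Midpoint Riemann approximation of the path integral of gradients
    along the straight line from the baseline to the input."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad(point)):
            total[i] += g / steps
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * t for xi, b, t in zip(x, baseline, total)]


x = [1.0, 2.0, -3.0]
baseline = [0.0, 0.0, 0.0]
ig = integrated_gradients(x, baseline, grad_f)
```

A useful sanity check is completeness: the attributions should sum to `f(x) - f(baseline)`, up to the approximation error of the Riemann sum.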
Concept-based explanations aim to provide high-level interpretations of latent model representations.
We propose both supervised (probes and CAVs) and unsupervised (dictionary learning) approaches.
Interpreto generalizes these methods through four core steps; the first two are common to both approaches:
- Split a model in two and obtain a dataset of activations
- Learn concepts (e.g., from latent embeddings)
- Interpret concepts (mapping discovered concepts to human-understandable elements)
- Estimate concept importance (assessing concept relevance to model outputs)
1. Split a model in two and obtain a dataset of activations (mainly via nnsight):
Choose any layer in any HuggingFace language model with our ModelWithSplitPoints based on nnsight. Then pass a dataset through it to obtain a dataset of activations.
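The idea of splitting a model can be sketched on a toy two-layer network. This is a conceptual illustration only and does not use ModelWithSplitPoints or nnsight: `first_half`, `second_half`, and the weights `W1` and `W2` are hypothetical.

```python
from typing import List

Vector = List[float]

# Hypothetical two-layer toy model; the weights are made up for illustration.
W1 = [[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]]  # 2 inputs -> 3 hidden units
W2 = [1.0, -1.0, 0.5]                        # 3 hidden units -> 1 output


def first_half(x: Vector) -> Vector:
    """Input -> latent activations at the split point (linear layer + ReLU)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def second_half(h: Vector) -> float:
    """Latent activations -> model output."""
    return sum(w * hi for w, hi in zip(W2, h))


def full_model(x: Vector) -> float:
    """The original model is the composition of the two halves."""
    return second_half(first_half(x))


# Passing a dataset through the first half yields a dataset of activations.
dataset = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
activations = [first_half(x) for x in dataset]
```

Concepts are then learned on `activations`, while the second half is kept to measure how concepts influence the output.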
2. (supervised) Train a probe with the ProbeExplainer
We differentiate two families of probes:
- Linear probes: LinearRegressionProbe, LogisticRegressionProbe, LinearSVMProbe, MeansDiffProbe
- Centroid-based probes: CosineCentroidProbe, DotProductCentroidProbe, SqL2CentroidProbe, SVDDCentroidProbe, DiagonalMahalanobisCentroidProbe
Both can be tuned with bias_calibrator and normalization parameters.
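As a rough illustration of what a linear probe computes, the sketch below fits a difference-of-means direction on labeled activations. It is not the MeansDiffProbe implementation: the function names are hypothetical, and placing the threshold at the midpoint of the two class means is an assumption made here for simplicity.

```python
from typing import List, Tuple

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of vectors."""
    count = len(vectors)
    return [sum(col) / count for col in zip(*vectors)]


def fit_means_diff(pos: List[Vector], neg: List[Vector]) -> Tuple[Vector, float]:
    """Return (direction, bias) with direction = mean(pos) - mean(neg).

    The bias puts the decision threshold at the midpoint between the two
    class means (an assumption of this sketch).
    """
    mu_pos, mu_neg = mean(pos), mean(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def probe_score(x: Vector, direction: Vector, bias: float) -> float:
    """Positive score: the concept is predicted present in activation x."""
    return sum(d * xi for d, xi in zip(direction, x)) + bias


# Toy activations labeled with and without the concept.
pos = [[2.0, 1.0], [3.0, 1.0]]
neg = [[-1.0, 1.0], [0.0, 1.0]]
direction, bias = fit_means_diff(pos, neg)
```

The second activation dimension is identical across classes, so the learned direction ignores it and only the first dimension carries the concept.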
2. (unsupervised) Dictionary Learning for Concept Discovery (mainly via overcomplete):
- Interpret neurons directly via NeuronsAsConcepts
- NMF, Semi-NMF, ConvexNMF
- ICA, SVD, PCA, KMeans
- SAE variants: Vanilla SAE, TopK SAE, JumpReLU SAE, BatchTopK SAE
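To give a feel for how sparse concept codes arise, here is a minimal sketch of the encoding step of a TopK-style sparse autoencoder: a ReLU encoder followed by keeping only the k largest activations. The encoder weights `W_ENC` are hypothetical and untrained, and nothing here reflects the overcomplete or Interpreto APIs.

```python
from typing import List

Vector = List[float]

# Hypothetical encoder: 4 concepts over a 2-dimensional activation space.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]


def encode(x: Vector) -> Vector:
    """Dense concept activations: ReLU(W_ENC @ x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_ENC]


def top_k(acts: Vector, k: int) -> Vector:
    """Keep the k largest activations and set the others to zero."""
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


x = [2.0, 1.0]
dense = encode(x)
sparse = top_k(dense, k=2)
```

In a trained autoencoder, a decoder maps the sparse code back to the activation space, and the encoder and decoder are optimized to minimize the reconstruction error.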
3. (unsupervised) Available Concept Interpretation Techniques:
- Top-k tokens from tokenizer vocabulary via TopKInputs and use_vocab=True
- Top-k tokens/words/sentences/samples from specific datasets via TopKInputs
- Label concepts via LLMs with LLMLabels (Bills et al. 2023)
- Input-to-concept attribution from dataset examples (Concept Attributions) (Jourdan et al. 2023)
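The top-k interpretation idea can be sketched as follows: for each concept, rank the inputs by how strongly they activate it. This is a simplified illustration, not the TopKInputs implementation; the function `top_k_inputs`, the tokens, and the activation values are all hypothetical.

```python
from typing import Dict, List, Tuple


def top_k_inputs(
    tokens: List[str], activations: List[List[float]], k: int
) -> Dict[int, List[Tuple[str, float]]]:
    """For each concept, return the k distinct tokens with the highest activation.

    activations[t][c] is the activation of concept c on tokens[t]; repeated
    tokens are merged by keeping their maximum activation.
    """
    n_concepts = len(activations[0])
    best: List[Dict[str, float]] = [dict() for _ in range(n_concepts)]
    for tok, row in zip(tokens, activations):
        for c, act in enumerate(row):
            best[c][tok] = max(act, best[c].get(tok, float("-inf")))
    return {
        c: sorted(best[c].items(), key=lambda kv: kv[1], reverse=True)[:k]
        for c in range(n_concepts)
    }


tokens = ["paris", "london", "cat", "dog", "paris"]
acts = [[0.9, 0.0], [0.8, 0.1], [0.0, 0.7], [0.1, 0.95], [0.6, 0.0]]
interpretations = top_k_inputs(tokens, acts, k=2)
```

Here concept 0 is best described by city names and concept 1 by animals, which is the kind of human-readable summary this interpretation step aims for.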
Concept interpretation techniques to be added in the future:
- Aligning concepts with human labels (Sajjad et al. 2022)
- Word cloud visualizations of concepts (Dalvi et al. 2022)
- VocabProj & TokenChange (Gur-Arieh et al. 2025)
4. (unsupervised) Concept-to-Output Attribution:
Estimate the contribution of each concept to the model output.
This can be obtained with any concept-based explainer via MethodConcepts.concept_output_gradient().
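Conceptually, this step measures how the output changes when a concept activation changes. The sketch below estimates that gradient by central finite differences on a toy linear decoder and output head. It is not the concept_output_gradient() implementation: the decoder `D`, the head `V`, and all function names are hypothetical, and a real pipeline would typically use automatic differentiation.

```python
from typing import List

Vector = List[float]

# Hypothetical decoder (2 concepts x 3 latent dims) and linear output head.
D = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0]]
V = [0.5, 1.0, 2.0]


def decode(z: Vector) -> Vector:
    """Concept activations -> reconstructed latent activations."""
    return [sum(z[j] * D[j][i] for j in range(len(z))) for i in range(len(D[0]))]


def second_half(h: Vector) -> float:
    """Latent activations -> model output (toy linear head)."""
    return sum(v * hi for v, hi in zip(V, h))


def concept_to_output(z: Vector) -> float:
    return second_half(decode(z))


def finite_difference_gradient(z: Vector, eps: float = 1e-4) -> Vector:
    """Central finite-difference estimate of d(output) / d(concept_j)."""
    grads = []
    for j in range(len(z)):
        up, down = list(z), list(z)
        up[j] += eps
        down[j] -= eps
        grads.append((concept_to_output(up) - concept_to_output(down)) / (2 * eps))
    return grads


z = [1.5, 0.5]
grads = finite_difference_gradient(z)
# One common importance score (an assumption here): gradient times activation.
importance = [g * zj for g, zj in zip(grads, z)]
```

In this toy setup the second concept has a zero gradient because its decoded contribution cancels out in the output head, so it is unimportant for the output even though it is active.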
Methods from the following papers will be available in the future.
Thanks to this generalization of concept-based methods and our flexible architecture, a large number of concept-based methods can be obtained easily:
- ConceptSHAP: Yeh et al. 2020, On Completeness-aware Concept-Based Explanations in Deep Neural Networks
- COCKATIEL: Jourdan et al. 2023, COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP
- Yun et al. 2021, Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors
- FFN values interpretation: Geva et al. 2022, Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
- SparseCoding: Cunningham et al. 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models
- Parameter Interpretation: Dar et al. 2023, Analyzing Transformers in Embedding Space
Evaluation Metrics for Attribution
To evaluate the faithfulness of attribution methods, Interpreto provides the Insertion and Deletion metrics.
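The intuition behind the Deletion metric can be sketched as follows: mask tokens from most to least important and track how fast the model score drops. This is a simplified illustration with a hypothetical toy scorer; Interpreto's perturbation strategy and normalization may differ.

```python
from typing import Callable, List

WEIGHTS = {"great": 2.0, "not": -1.5, "movie": 0.25}  # hypothetical toy model
MASK = "[MASK]"  # assumed mask token with zero contribution


def toy_score(tokens: List[str]) -> float:
    return sum(WEIGHTS.get(tok, 0.0) for tok in tokens)


def deletion_curve(
    tokens: List[str], attributions: List[float], score: Callable[[List[str]], float]
) -> List[float]:
    """Scores obtained while masking tokens from most to least important."""
    order = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    current = list(tokens)
    curve = [score(current)]
    for i in order:
        current[i] = MASK
        curve.append(score(current))
    return curve


def area_under_curve(curve: List[float]) -> float:
    """Trapezoidal rule with the x-axis normalized to [0, 1]."""
    n = len(curve) - 1
    return sum((curve[i] + curve[i + 1]) / 2 for i in range(n)) / n


tokens = ["a", "great", "movie"]
good_curve = deletion_curve(tokens, [0.0, 2.0, 0.25], toy_score)  # faithful ranking
bad_curve = deletion_curve(tokens, [2.0, 0.0, 0.25], toy_score)   # misleading ranking
good_auc = area_under_curve(good_curve)
bad_auc = area_under_curve(bad_curve)
```

For deletion, a lower area under the curve indicates a more faithful attribution, since removing the truly important tokens first makes the score collapse quickly. Insertion is the mirror image: tokens are added in order of importance and a higher area is better.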
Evaluation Metrics for Concepts
The steps of a concept-based method can be evaluated together via ConSim, or independently:
- Concept-space (dictionary learning evaluation)
  - faithfulness: MSE, FID, and ReconstructionError
  - complexity: Sparsity, SparsityRatio
  - stability: Stability
- Concepts interpretations
  - No metric yet, will be included soon.
- Concept-to-Output attribution
  - No metric yet, will be included soon.
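Two of the concept-space quantities listed above can be illustrated with a short sketch: a mean squared reconstruction error for faithfulness and a fraction of inactive concepts for complexity. The function names and exact definitions are assumptions made for illustration; Interpreto's MSE, Sparsity, and SparsityRatio metrics may be defined differently.

```python
from typing import List

Matrix = List[List[float]]


def mse(original: Matrix, reconstructed: Matrix) -> float:
    """Mean squared error between activations and their reconstruction."""
    total, count = 0.0, 0
    for x, x_hat in zip(original, reconstructed):
        for a, b in zip(x, x_hat):
            total += (a - b) ** 2
            count += 1
    return total / count


def zero_fraction(codes: Matrix) -> float:
    """Fraction of concept activations that are exactly zero (higher = sparser)."""
    values = [v for row in codes for v in row]
    return sum(1 for v in values if v == 0.0) / len(values)


# Toy data: two activations, their reconstructions, and their concept codes.
activations = [[1.0, 2.0], [0.0, -1.0]]
reconstructions = [[1.0, 1.0], [0.0, 0.0]]
codes = [[0.0, 3.0, 0.0, 0.0], [1.0, 0.0, 0.0, 2.0]]

reconstruction_error = mse(activations, reconstructions)
sparsity = zero_fraction(codes)
```

The two quantities usually trade off against each other: sparser codes are easier to interpret but tend to reconstruct the activations less faithfully.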
Feel free to propose your ideas or contribute to the Interpreto toolbox! A dedicated document explains, step by step, how to make your first pull request.
More from the DEEL project:
- Xplique: a Python library dedicated to explaining neural networks (Images, Time Series, Tabular data) on TensorFlow.
- Puncc: a Python library for predictive uncertainty quantification using conformal prediction.
- oodeel: a Python library that performs post-hoc deep Out-of-Distribution (OOD) detection on already trained neural network image classifiers.
- deel-lip: a Python library for training k-Lipschitz neural networks on TensorFlow.
- deel-torchlip: a Python library for training k-Lipschitz neural networks on PyTorch.
- Influenciae: a Python library dedicated to computing influence values for the discovery of potentially problematic samples in a dataset.
- DEEL White paper: a summary by the DEEL team of the challenges of certifiable AI and the role of data quality, representativity and explainability for this purpose.
This project received funding from the French "Investing for the Future - PIA3" program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL and the FOR projects.
Interpreto is a project of the FOR and the DEEL teams at the IRT Saint-Exupéry in Toulouse, France.
If you use Interpreto as part of your workflow in a scientific publication, please consider citing our paper:
@article{poche2025interpreto,
  title   = {Interpreto: An Explainability Library for Transformers},
  author  = {Poch{\'e}, Antonin and Mullor, Thomas and Sarti, Gabriele and Boisnard, Fr{\'e}d{\'e}ric and Friedrich, Corentin and Claye, Charlotte and Hoofd, Fran{\c{c}}ois and Bernas, Raphael and Hudelot, C{\'e}line and Jourdan, Fanny},
  journal = {arXiv preprint arXiv:2512.09730},
  year    = {2025}
}
The package is released under the MIT license.