KempnerInstitute/raptor
Block-Recurrent Dynamics in ViTs (Raptor)


Mozes Jacobs $^{\star1}$ Thomas Fel $^{\star1}$ Richard Hakim $^{\star1}$
Alessandra Brondetta $^{2}$ Demba Ba $^{1,3}$ T. Andy Keller $^{1}$

$^1$Kempner Institute, Harvard University $^2$Osnabrück University $^3$Harvard University


tl;dr: Our work introduces the Block-Recurrent Hypothesis (BRH), motivated by the observation that foundation models like DINOv2 can be rewritten using only two recurrent blocks while recovering 96% of the original accuracy. Building on this framework, we explore a Dynamical Interpretability approach: we interpret token evolution through layers as trajectories and show that they converge into class-dependent angular basins while late-stage updates collapse into low-rank attractors.

Ultimately, the study reveals that Vision Transformers seem to naturally converge toward compact, iterative programs instead of unique layer-by-layer transformations (indicating a lower algorithmic complexity / Kolmogorov complexity).
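To make the block-recurrent rewrite concrete, here is a minimal NumPy sketch of the idea. This is not the actual Raptor architecture: `block` is a toy residual update standing in for a full transformer block, and the iteration counts are illustrative (two shared blocks iterated 7 and 5 times, replacing 12 unique layers).

```python
import numpy as np

def block(x, W):
    # Toy residual update standing in for attention + MLP
    # (hypothetical stand-in, not the actual Raptor block).
    return x + np.tanh(x @ W)

def block_recurrent_forward(x, weights, repeats):
    # Block-recurrent forward pass: block k is applied repeats[k]
    # times in sequence, replacing one unique layer per depth step.
    for W, r in zip(weights, repeats):
        for _ in range(r):
            x = block(x, W)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))  # 4 tokens, feature dim 16
weights = [rng.normal(scale=0.1, size=(16, 16)) for _ in range(2)]
# Two recurrent blocks covering a 12-layer depth budget: 7 + 5 iterations.
out = block_recurrent_forward(x, weights, repeats=[7, 5])
```

The point of the rewrite is that the parameter count is tied to the number of distinct blocks (here 2), while the effective depth is set by the iteration counts.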


Setup

Environment

To run the code, create a mamba (or conda) environment from the environment.yml file, then activate it:

mamba env create -f environment.yml
mamba activate raptor

Paths

Edit src/paths.py so that it contains the correct absolute paths to the datasets used in the experiments.
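For orientation, a hypothetical layout of src/paths.py might look like the following; the actual variable names in the repository may differ, so treat this only as a sketch of what to edit:

```python
# Hypothetical sketch of src/paths.py -- the real variable names may
# differ. Point each entry at your local dataset copies.
IMAGENET_DIR = "/absolute/path/to/imagenet-1k"
ADE20K_DIR = "/absolute/path/to/ade20k"
NYUD_DIR = "/absolute/path/to/nyu-depth-v2"
ACTIVATIONS_DIR = "/absolute/path/to/precomputed_dinov2_activations"
```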

Extracting DINOv2 Activations for ImageNet-1k

For ImageNet, we precompute the DINOv2 activations so that Raptor trains faster. We provide a script in the data directory to extract the activations from the ImageNet-1k dataset. The script takes around 5 hours on a single H100 GPU, and storing the activations requires substantial disk space.

cd data
python precompute_dinov2_act.py
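Conceptually, precomputing per-layer activations amounts to registering forward hooks on each block and saving their outputs. The sketch below demonstrates the hook mechanism on a toy stack of Linear layers standing in for DINOv2 blocks; the real script loads DINOv2 and writes the cached tensors to disk, and the names here are illustrative:

```python
import torch
import torch.nn as nn

# Toy 12-layer stack standing in for DINOv2's 12 transformer blocks.
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(12)])

cache = {}

def save_hook(idx):
    # Return a hook that stores the output of block `idx`.
    def hook(module, inputs, output):
        cache[idx] = output.detach()
    return hook

for i, layer in enumerate(model):
    layer.register_forward_hook(save_hook(i))

x = torch.randn(4, 16)  # 4 tokens, feature dim 16
model(x)  # one forward pass fills the cache with all 12 layer outputs
```

After the forward pass, `cache` holds one tensor per layer; the actual script would serialize these per image batch.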

Download Pretrained Classifiers

Download the DINOv2 linear heads from Meta's repository. These are used during training of Raptor.

cd src
wget https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_reg4_linear_head.pth
wget https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_reg4_linear_head.pth
cp dinov2_vitb14_reg4_linear_head.pth imagenet_probes/dinov2_vitb14_reg4_linear_head.pth
cp dinov2_vits14_reg4_linear_head.pth imagenet_probes/dinov2_vits14_reg4_linear_head.pth

Usage Example

Raptor training follows four main steps. Here, we show example usage for a 3-block Raptor:

  1. Determine max-cut segmentations. This has been done for you in src/000_max_cut_dinov2_base.ipynb.
  2. Train each block independently.
cd src
python trainer.py --teacher_force --mse --sigma 0 --lr 3e-4 --wandb --t_scale --swiglu --start_layer 0 --end_layer 7 --seed 100
python trainer.py --teacher_force --mse --sigma 0 --lr 3e-4 --wandb --t_scale --swiglu --start_layer 7 --end_layer 10 --seed 101
python trainer.py --teacher_force --weighted --sigma 0 --lr 3e-4 --wandb --t_scale --swiglu --start_layer 10 --end_layer 12 --seed 104
  3. Train the full model with the pretrained blocks.
cd src
BP1="final_weighted_False_autoregressive_False_distillation_False_teacher_True_mse_True_cosine_False_t_scale_True_swiglu_True_sigma_0.0_start_0_end_7_lr_0.0003_cls_weight_0.34_reg_weight_0.33_patch_weight_0.33_seed_100_step_312500.pt"
BP2="final_weighted_False_autoregressive_False_distillation_False_teacher_True_mse_True_cosine_False_t_scale_True_swiglu_True_sigma_0.0_start_7_end_10_lr_0.0003_cls_weight_0.34_reg_weight_0.33_patch_weight_0.33_seed_101_step_312500.pt"
BP3="final_weighted_True_autoregressive_False_distillation_False_teacher_True_mse_False_cosine_False_t_scale_True_swiglu_True_sigma_0.0_start_10_end_12_lr_0.0003_cls_weight_0.34_reg_weight_0.33_patch_weight_0.33_seed_104_step_312500.pt"
python trainer.py --raptor3 --autoreg --weighted --sigma 0 --lr 3e-4 --wandb --t_scale --swiglu --start_layer 0 --end_layer 12 --cls_weight 0.45 --reg_weight 0.10 --patch_weight 0.45 --bp1 $BP1 --bp2 $BP2 --bp3 $BP3 --seed 1101
  4. Train linear probes on the frozen pretrained checkpoints.
cd src/imagenet_probes
python train_probe.py --variant raptor3 --model_seed 1101 --seed 4005
cd src/ade20k_probes
python train_probe.py --variant raptor3 --model_seed 1101 --seed 5005
cd src/nyud_probes
python train_probe.py --variant raptor3 --model_seed 1101 --seed 6005

Reproducing Foundation Models Results (Section 3)

To reproduce the results for the foundation models section (Table 1 and Figure 7), do the following:

  1. Determine max-cut segmentations. This has been done for you in src/max_cut_dinov2_base.ipynb.
  2. Train each block independently.
cd src/runs
sbatch blocks.sh
  3. Train the full model with the pretrained blocks.
cd src/runs
sbatch 002_raptor2_pretrained.sh
sbatch 003_raptor3_pretrained.sh
sbatch 004_raptor4_pretrained.sh
  4. Train linear probes on the frozen pretrained checkpoints.
cd src/ade20k_probes
sbatch run_all.sh
cd src/imagenet_probes
sbatch run_all.sh
cd src/nyud_probes
sbatch run_all.sh
  5. Reproduce Table 1.
cd src
python aggregate_results.py
  6. Reproduce Figure 7: run the notebook src/imagenet_probes/101_eval_error_bars.ipynb.
