Hello,
I want to train a ViT model with RoPE instead of absolute positional embedding. I noticed the following item in the list of supported models in the README, but I couldn't find out how to access such a model: "ROPE-ViT - https://arxiv.org/abs/2403.13298"
There doesn't seem to be an option for it in the vision_transformer.py and I couldn't find a specific .py file for this method either.
Could you please guide me on how to train such a model?
Thank you very much.
Update: I just noticed that there is a use_rot_pos_emb option in eva.py and models like vit_base_patch16_rope_mixed_ape_224 are defined in that file. So it seems like my problem is resolved, sorry for the premature question.
Just to make sure (I'm unfortunately not familiar with EVA), is training such a model (vit_base_patch16_rope_mixed_ape_224) the correct way to train ROPE-ViT?
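(For anyone landing here: a minimal pure-Python sketch of the mechanism behind use_rot_pos_emb, purely to illustrate what rotary position embedding does. This is not timm's implementation, which operates on 2D patch grids and batched tensors; it just shows the defining property that rotated query/key dot products depend only on relative position.)

```python
import math

def apply_rope(vec, pos, theta=10000.0):
    """Rotate consecutive (even, odd) dimension pairs of a query/key
    vector by position-dependent angles, as in rotary embeddings."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos / theta ** (i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, 0.2]
k = [0.3, 0.9, 0.1, 0.4]

# The key RoPE property: the q.k score after rotation depends only on
# the relative offset between positions, not on absolute position.
s1 = dot(apply_rope(q, 5), apply_rope(k, 3))    # offset 2
s2 = dot(apply_rope(q, 12), apply_rope(k, 10))  # offset 2
print(abs(s1 - s2) < 1e-9)  # True
```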
@sinahmr yes, currently all of the ViT models with ROPE embeddings in timm are based on the EVA model (eva.py). It was the first to include ROPE embeddings and it's been extended to support several variants. It's essentially a ViT with ROPE support (w/ abs pos embed option) and a SwiGLU option. The comments in the file have references for the model sources, papers, etc. There are also several timm definitions that I trained with registers, like
vit_base_patch16_rope_reg1_gap_256.sbb_in1k
vit_betwixt_patch16_rope_reg4_gap_256.sbb_in1k
vit_medium_patch16_rope_reg1_gap_256.sbb_in1k
vit_mediumd_patch16_rope_reg1_gap_256.sbb_in1k
All of the models are a bit different, even at the same 'size' range. So you'll want to dig into the details a bit and figure out what fits your goals... You'll also want to decide whether to use pretrained weights or train from scratch. The EVA and PE models were pretrained on huge datasets; many of the others just on ImageNet-22/12k or 1k.
Thank you very much for the detailed answer!
What is the difference between EvaAttention and AttentionRope?
The beauty of open source: you can read the code...
Sorry, I phrased my question poorly. I read the code and I see that they are different, but I don't understand why EvaAttention wasn't implemented as a special case of AttentionRope. They look very similar to me.
Please correct me if I'm wrong: the conceptual differences seem to be
- qkv bias: In EvaAttention there is an option to apply qkv_bias as a separate addition. I have no idea what this achieves though.
- qk bias: In EvaAttention there is no k bias.
- Attention mask: The attention mask is handled differently. AttentionRope uses maybe_add_mask, which assumes attn_mask contains additive values (0 or -inf). EvaAttention assumes attn_mask contains boolean values and converts them to 0 and -inf.
Everything else seems to be the same, ignoring variable/config names and default values.
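(To make the mask-convention difference concrete, a pure-Python sketch under the assumptions above; bool_to_additive and maybe_add_mask here are simplified stand-ins for the real code paths, not timm's functions.)

```python
NEG_INF = float("-inf")

def bool_to_additive(mask):
    """EvaAttention-style convention: boolean mask (True = attend)
    is converted to an additive mask (0 allowed, -inf blocked)."""
    return [0.0 if keep else NEG_INF for keep in mask]

def maybe_add_mask(scores, attn_mask=None):
    """AttentionRope-style convention: the mask is already additive
    and is simply added to the raw scores (no-op when None)."""
    if attn_mask is None:
        return scores
    return [s + m for s, m in zip(scores, attn_mask)]

scores = [0.2, 1.5, -0.3]
bool_mask = [True, False, True]  # block the middle token

additive = bool_to_additive(bool_mask)
masked = maybe_add_mask(scores, additive)
print(masked)  # [0.2, -inf, -0.3]
```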