Hello,
I want to train a ViT model with RoPE instead of absolute positional embedding. I noticed the following item in the list of supported models in the README, but I couldn't find out how to access such a model: "ROPE-ViT - https://arxiv.org/abs/2403.13298"
There doesn't seem to be an option for it in the vision_transformer.py and I couldn't find a specific .py file for this method either.
Could you please guide me on how to train such a model?
Thank you very much.
Update: I just noticed that there is a use_rot_pos_emb option in eva.py and models like vit_base_patch16_rope_mixed_ape_224 are defined in that file. So it seems like my problem is resolved, sorry for the premature question.
Just to make sure (I'm unfortunately not familiar with EVA), is training such a model (vit_base_patch16_rope_mixed_ape_224) the correct way to train ROPE-ViT?
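(For anyone landing here: a minimal pure-Python sketch of the mechanism behind use_rot_pos_emb, purely to illustrate what rotary position embedding does. This is not timm's implementation, which operates on 2D patch grids and batched tensors; it just shows the defining property that rotated query/key dot products depend only on relative position.)

```python
import math

def apply_rope(vec, pos, theta=10000.0):
    """Rotate consecutive (even, odd) dimension pairs of a query/key
    vector by position-dependent angles, as in rotary embeddings."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos / theta ** (i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, 0.2]
k = [0.3, 0.9, 0.1, 0.4]

# The key RoPE property: the q.k score after rotation depends only on
# the relative offset between positions, not on absolute position.
s1 = dot(apply_rope(q, 5), apply_rope(k, 3))    # offset 2
s2 = dot(apply_rope(q, 12), apply_rope(k, 10))  # offset 2
print(abs(s1 - s2) < 1e-9)  # True
```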
@sinahmr yes, currently all of the ViT models with ROPE embeddings in timm are based on the EVA model (eva.py). It was the first to include ROPE embeddings and it's been extended to support several variants. It's essentially a ViT with ROPE support (w/ abs pos embed option) and a SwiGLU option. The comments in the file have references for the model sources, papers, etc. There are also several timm definitions that I trained with registers, like
vit_base_patch16_rope_reg1_gap_256.sbb_in1k
vit_betwixt_patch16_rope_reg4_gap_256.sbb_in1k
vit_medium_patch16_rope_reg1_gap_256.sbb_in1k
vit_mediumd_patch16_rope_reg1_gap_256.sbb_in1k
All of the models are a bit different, even at the same 'size' range. So you'll want to dig into the details a bit and figure out what fits your goals... You'll also want to decide whether to use pretrained weights or train from scratch. The EVA and PE models were pretrained on huge datasets; many of the others just on ImageNet-22/12k or 1k.
Thank you very much for the detailed answer!
What is the difference between EvaAttention and AttentionRope?
The beauty of open source: you can read the code...
Sorry, I phrased my question poorly. I read the code and I see that they are different, but I don't understand why EvaAttention wasn't implemented as a special case of AttentionRope. They look very similar to me.
Please correct me if I'm wrong: the conceptual differences seem to be
- qkv bias: In EvaAttention there is an option to apply qkv_bias as a separate addition. I have no idea what this achieves though.
- qk bias: In EvaAttention there is no k bias.
- Attention mask: The attention mask is handled differently. AttentionRope uses maybe_add_mask, which assumes attn_mask contains additive values (0 or -inf). EvaAttention assumes attn_mask contains boolean values and converts them to 0 and -inf.
Everything else seems to be the same, ignoring variable/config names and default values.
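(To make the mask-convention difference concrete, a pure-Python sketch under the assumptions above; bool_to_additive and maybe_add_mask here are simplified stand-ins for the real code paths, not timm's functions.)

```python
NEG_INF = float("-inf")

def bool_to_additive(mask):
    """EvaAttention-style convention: boolean mask (True = attend)
    is converted to an additive mask (0 allowed, -inf blocked)."""
    return [0.0 if keep else NEG_INF for keep in mask]

def maybe_add_mask(scores, attn_mask=None):
    """AttentionRope-style convention: the mask is already additive
    and is simply added to the raw scores (no-op when None)."""
    if attn_mask is None:
        return scores
    return [s + m for s, m in zip(scores, attn_mask)]

scores = [0.2, 1.5, -0.3]
bool_mask = [True, False, True]  # block the middle token

additive = bool_to_additive(bool_mask)
masked = maybe_add_mask(scores, additive)
print(masked)  # [0.2, -inf, -0.3]
```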