Hi!
I have some questions regarding the dynamic_img_size and img_size parameters when creating a timm model and loading pre-trained weights. From my understanding, setting img_size and patch_size interpolates the position embeddings and the conv2d patch projection, respectively, to match the specified values if they differ from those used during the model's pretraining.
However, I’m a bit unclear on the specific role of enabling dynamic_img_size.
- Does this option allow the model to handle varying input sizes dynamically during training and inference?
- Is this achieved through interpolation, similar to what happens when specifying a different img_size?
- Would it be correct to say that a use case for setting both img_size and dynamic_img_size=True is to train the model on a fixed img_size while allowing inference on images of varying sizes?
- Alternatively, could another use case involve initializing the model with a different img_size (compared to pretraining) and then allowing flexibility to process various sizes during training?
- Lastly, if all processed images are of the same size, could enabling dynamic_img_size=True introduce any performance drawbacks?
Thank you in advance for your insights!
@vadori so, yeah, a bit confusing...
Changing img_size and patch_size is a step change that permanently alters the corresponding values for the model, interpolating the original pretrained weights once at load time. Those values change the model configuration such that the dimensions of some parameters differ; each combination of those values is essentially a different model architecture variant.
Just like the original model, once resized the model expects all inputs to match the new size.
To load and use the model correctly after it's been trained or fine-tuned with values different from the original pretrained weights, you have to keep using the model with those same values (or define a new model, or an existing defined model, with matching sizing). You can of course map those weights as 'pretrained' again and then load them into a model with yet another img and/or patch size, but I don't think I've ever found a need for that...
Now, dynamic_img_size (and the related dynamic_img_pad) do not alter the sizes of any parameters; they remain as they were, so the same model weights are compatible without any interpolation or adjustment needed. Aside from having the functionality enabled, the weights remain the same. Every input can be a different size, and the aspects of the model that need to match the input size (the position embedding) are scaled on the fly to match it. So for each batch at train or inference time you can use a different size.
There is a slight runtime hit for dynamic image size to do the interpolation, but for most models, especially larger ones, I feel it's negligible.
I do feel there is another tradeoff to consider. There is an optimal range over which a given absolute position embedding can be interpolated to a different resolution. With dynamic mode you are always starting from the original size. So say you have a patch16 224x224 ViT: its pos embed maps to a 14x14 grid (if you unflatten it), and any input size you throw at it will be interpolating those points. The pos embed will remain quite coarse compared to, say, the 32x32 feature grid if you pass in a 512x512 image.
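What that on-the-fly resampling amounts to is roughly the following (a plain-PyTorch sketch with made-up values, not timm's exact implementation):

```python
import torch
import torch.nn.functional as F

# Pretrained pos embed of a patch16 224x224 ViT: a 14x14 grid (embed dim 192 here).
pos_embed = torch.randn(1, 14 * 14, 192)

# Unflatten to a 2D grid, bicubic-interpolate up to the 32x32 grid that a
# 512x512 input needs, then flatten back. Only 14x14 "real" learned points
# exist; everything in between is interpolated, hence the coarseness.
grid = pos_embed.reshape(1, 14, 14, 192).permute(0, 3, 1, 2)   # (1, 192, 14, 14)
resized = F.interpolate(grid, size=(32, 32), mode='bicubic', align_corners=False)
pos_embed_512 = resized.permute(0, 2, 3, 1).reshape(1, 32 * 32, 192)
print(pos_embed_512.shape)  # torch.Size([1, 1024, 192])
```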
If you have considerable data to fine-tune or train a model on at a different resolution (range), I feel it's better to adjust img_size / patch_size to fit your range and fine-tune or train so you have a finer-grained pos embed, and then consider using dynamic_img_size if you need the flexibility of different sizes at train (or especially inference) time around that new size.
Great, thank you very much for your response, @rwightman!
Thank you very much for this great feature @rwightman! Could you please point us to any specific papers to cite when using it in scientific research?
@giuliomat95 there isn't a paper for these features specifically, so citing the main library would be appropriate: https://scholar.google.ca/citations?view_op=view_citation&hl=en&citation_for_view=cLfKCzoAAAAJ:IjCSPb-OGe4C
For augmentations and timm-enabled training schemes, the ResNet Strikes Back paper is a good reference (https://scholar.google.ca/citations?view_op=view_citation&hl=en&citation_for_view=cLfKCzoAAAAJ:WF5omc3nYNoC)