Clarification on dynamic_img_size and img_size parameters in timm models #2414

Answered by rwightman
vadori asked this question in Q&A

Hi!

I have some questions regarding the dynamic_img_size and img_size parameters when creating a timm model and loading pre-trained weights. From my understanding, setting patch_size and img_size interpolates the patch embeddings and the conv2d projection layer, respectively, to match the specified values if they differ from those used during the model's pretraining.

However, I’m a bit unclear on the specific role of enabling dynamic_img_size.

  1. Does this option allow the model to handle varying input sizes dynamically during training and inference?
  2. Is this achieved through interpolation, similar to what happens when specifying a different img_size?
  3. Would it be correct to say that a use case for setting both img_size and dynamic_img_size=True is to train the model on a fixed img_size while allowing inference on images of varying sizes?
  4. Alternatively, could another use case involve initializing the model with a different img_size (compared to pretraining) and then allowing flexibility to process various sizes during training?
  5. Lastly, if all processed images are of the same size, could enabling dynamic_img_size=True introduce any performance drawbacks?

Thank you in advance for your insights!


Replies: 3 comments 1 reply


@vadori so, yeah a bit confusing...


Changing img_size or patch_size is a step change that permanently alters the corresponding values for the model, interpolating the original pretrained values once at load time. Those values change the model configuration such that the dimensions of some parameters differ; it is essentially a different model architecture variation for each combination of those values.

Just like the original model, once resized the model expects all inputs to match the new size.

To load and use the model correctly after it's been trained or fine-tuned with values different from the original pretrained weights, you have to continue using the model with those values unchanged (or define a new model, or use an existing model definition with matching sizing). You can of course map those weights as 'pretrained' again and then load them into a model with yet another img and/or patch size, but I don't think I've ever found a need for that...
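To make the "different architecture per combination" point concrete, here is a small sketch (not timm's internal code; the helper names are hypothetical) of how the pos-embed parameter count follows from img_size and patch_size for a square-input ViT with a class token:

```python
def patch_grid(img_size: int, patch_size: int) -> int:
    """Side length of the patch grid for a square input."""
    assert img_size % patch_size == 0, "img_size must be divisible by patch_size"
    return img_size // patch_size

def pos_embed_tokens(img_size: int, patch_size: int, class_token: bool = True) -> int:
    """Number of position-embedding tokens a ViT of this config carries."""
    g = patch_grid(img_size, patch_size)
    return g * g + (1 if class_token else 0)

# A patch16 ViT at 224x224 has a 14x14 grid (197 tokens with the class token);
# recreating it with img_size=384 yields a 24x24 grid (577 tokens) -- a
# genuinely different parameter shape, hence the one-time interpolation at load.
print(patch_grid(224, 16), pos_embed_tokens(224, 16))  # 14 197
print(patch_grid(384, 16), pos_embed_tokens(384, 16))  # 24 577
```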


Now, dynamic_img_size (and the related dynamic_img_pad) do not alter the sizes of any parameters; they remain as they were, so the same model weights are compatible without any interpolation or adjustment needed. Aside from having the functionality enabled, the weights remain the same. Every input can be a different size, and the aspects of the model that need to match the input size (the position embedding) are scaled to match it. So for each batch, at train or inference time, you can use a different size.
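A rough sketch of the per-input grid that the stored pos-embed gets interpolated to under dynamic sizing (a hypothetical helper, not timm's implementation; pad=True mimics what dynamic_img_pad does by padding inputs up to the next patch multiple instead of requiring exact divisibility):

```python
import math

def dynamic_grid(h: int, w: int, patch: int, pad: bool = False):
    """Grid the pos-embed is interpolated to for an (h, w) input."""
    if pad:
        # dynamic_img_pad-style behavior: round each dim up to a patch multiple
        h = math.ceil(h / patch) * patch
        w = math.ceil(w / patch) * patch
    assert h % patch == 0 and w % patch == 0, "input must divide by patch size"
    return h // patch, w // patch

# A patch16 model pretrained at 224 keeps its 14x14 pos-embed parameters;
# each forward pass interpolates them to the current input's grid:
print(dynamic_grid(224, 224, 16))            # (14, 14)
print(dynamic_grid(512, 512, 16))            # (32, 32)
print(dynamic_grid(500, 333, 16, pad=True))  # (32, 21)
```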

There is a slight runtime hit for dynamic image size to do the interpolation, but for most models, especially larger ones, I feel it's negligible.

I do feel there is another tradeoff to consider: there is an optimal range over which a given absolute position embedding can be interpolated to a different resolution. With dynamic sizing you are always starting from the original size. So for a patch16 224x224 ViT, your pos-embed maps to a 14x14 grid (if you unflatten it), and any input size you throw at it adds interpolation of those points; the pos embed will remain quite coarse compared to, say, the 32x32 feature grid if you pass in a 512x512 image.

If you have considerable data to fine-tune or train a model on at a different resolution (range), I feel it's better to adjust img_size / patch_size to fit your range and fine-tune or train so you have a finer-grained pos embed, and then consider using dynamic_img_size if you need the flexibility of different sizes at train (or especially inference) time around that new size.
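A crude way to see the coarseness tradeoff above is the per-axis stretch factor applied to the original pos-embed grid (a hypothetical proxy metric, not anything timm computes):

```python
def interp_scale(pretrain_img: int, target_img: int, patch: int = 16) -> float:
    """Per-axis stretch applied to the pos-embed grid when a model
    pretrained at pretrain_img handles target_img inputs dynamically."""
    return (target_img // patch) / (pretrain_img // patch)

# Keeping the original 224 weights and feeding 512 inputs stretches the
# 14x14 pos-embed ~2.29x per axis; resizing the model to img_size=512 and
# fine-tuning first gives a native 32x32 grid, so later dynamic use around
# 512 stays close to 1x.
print(interp_scale(224, 512))  # ~2.29
print(interp_scale(512, 512))  # 1.0
```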

Answer selected by vadori

Great, thank you very much for your response, @rwightman!


Thank you very much for this great feature @rwightman! Could you please point us to any specific papers to cite when using it in scientific research?


@giuliomat95 there isn't a paper related to these features, so citing the main library would be appropriate https://scholar.google.ca/citations?view_op=view_citation&hl=en&citation_for_view=cLfKCzoAAAAJ:IjCSPb-OGe4C

For augmentations and timm-enabled training schemes, the ResNet Strikes Back paper is a good reference (https://scholar.google.ca/citations?view_op=view_citation&hl=en&citation_for_view=cLfKCzoAAAAJ:WF5omc3nYNoC)
