--resume error #2572

Unanswered
atti0127 asked this question in Q&A
Aug 20, 2025 · 4 comments · 4 replies

I trained cait_xxs24 up to epoch 225 and tried to resume training using --resume, but accuracy started to decrease as the epochs went on... Something is wrong, or I think I'm missing something when using resume. Below is the resume command I used.

Current checkpoints:
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-1.pth.tar', 71.604)
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-0.pth.tar', 70.6)
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-2.pth.tar', 69.976)

./distributed_train.sh 2 --data-dir imagenet --model cait_xxs24_224 --batch-size 128 --epochs 400 --aug-repeats 3 --lr 0.001 --drop-path 0.1 --grad-accum-steps 4 --output output_xxs24 --resume output_xxs24/~~


@atti0127 Are you sure it doesn't do that without resume? There is an LR warmup by default, and if the LR gets too high your performance will drop.
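For reference, a rough sketch of the default warmup-then-cosine LR shape (a pure-Python illustration, not timm's actual scheduler; the warmup length and LR values below are made-up example numbers):

import math

def lr_at(epoch, base_lr=1e-3, warmup_lr=1e-6, warmup_epochs=5, total_epochs=400):
    # linear warmup from warmup_lr to base_lr, then cosine decay toward zero
    if epoch < warmup_epochs:
        return warmup_lr + (base_lr - warmup_lr) * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))

print([round(lr_at(e), 6) for e in (0, 1, 5, 225)])

If training restarts at epoch 0 instead of continuing at 225, the LR ramps back up through warmup to the peak value rather than staying at the much smaller decayed value, which on its own can hurt accuracy.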


summary.csv

These are the results up to epoch 225, using the training command below (the rest of the settings were exactly the same as DeiT, as mentioned in the CaiT paper).

./distributed_train.sh 2 --data-dir imagenet --model cait_xxs24_224 --batch-size 128 --epochs 400 --aug-repeats 3 --lr 0.001 --drop-path 0.1 --grad-accum-steps 4 --output output_xxs24

I tried to start training again using the last.pth.tar file, only adding --resume, and accuracy started to drop (as I mentioned above).


It's not resuming from epoch 225 though, it's starting from 0, so you may have stomped over your last/best checkpoints by using the same output folder for both runs. Look for the highest-numbered checkpoint file.

Sorry, I meant you should try finding the highest-numbered checkpoint file to resume from, if you overwrote your last/best by mistake.
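Something along these lines (a hypothetical helper, not part of timm) can list the numbered checkpoint-*.pth.tar files in a run directory and pick the highest one:

import glob, os, re

def latest_numbered_checkpoint(run_dir):
    # scan for checkpoint-<N>.pth.tar files and return the one with the largest N
    best_epoch, best_path = -1, None
    for path in glob.glob(os.path.join(run_dir, 'checkpoint-*.pth.tar')):
        m = re.search(r'checkpoint-(\d+)\.pth\.tar$', os.path.basename(path))
        if m and int(m.group(1)) > best_epoch:
            best_epoch, best_path = int(m.group(1)), path
    return best_path

print(latest_numbered_checkpoint('output_xxs24/20250814-223557-cait_xxs24_224-224'))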

checkpoint-211.pth.tar checkpoint-212.pth.tar checkpoint-214.pth.tar checkpoint-219.pth.tar checkpoint-220.pth.tar checkpoint-221.pth.tar checkpoint-222.pth.tar checkpoint-223.pth.tar checkpoint-224.pth.tar checkpoint-225.pth.tar last.pth.tar model_best.pth.tar summary.csv

The output folder contains the files listed above. I tried the commands below, and in every case training started from Train: 0 and accuracy dropped; a quick way to check what epoch each file actually stores is sketched after the list.

--resume output_xxs24/checkpoint-225.pth.tar
--resume output_xxs24/last.pth.tar
--resume output_xxs24/model_best.pth.tar
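A minimal sketch for inspecting those files, assuming the checkpoints were written by timm's checkpoint saver and contain 'epoch' and 'state_dict' entries (key names can vary between timm versions; the run directory is taken from the log below):

import os
import torch

run_dir = 'output_xxs24/20250814-223557-cait_xxs24_224-224'
for name in ('checkpoint-225.pth.tar', 'last.pth.tar', 'model_best.pth.tar'):
    ckpt = torch.load(os.path.join(run_dir, name), map_location='cpu')
    # timm checkpoints are usually dicts with 'state_dict', 'optimizer' and 'epoch' entries
    epoch = ckpt.get('epoch') if isinstance(ckpt, dict) else None
    print(name, '-> stored epoch:', epoch)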


(timm) user1@user1-System-Product-Name:~/Desktop/pytorch-image-models$ ./distributed_train.sh 2 --data-dir imagenet --model cait_xxs24_224 --batch-size 128 --epochs 400 --aug-repeats 3 --lr 0.001 --drop-path 0.1 --grad-accum-steps 4 --output output_xxs24 --resume output_xxs24/20250814-223557-cait_xxs24_224-224/model_best.pth.tar
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 device per process.Process 1, total 2, device cuda:1.
Training in distributed mode with multiple processes, 1 device per process.Process 0, total 2, device cuda:0.
Model cait_xxs24_224 created, param count:11993328
Data processing configuration for current model + dataset:
input_size: (3, 224, 224)
interpolation: bicubic
mean: (0.485, 0.456, 0.406)
std: (0.229, 0.224, 0.225)
crop_pct: 0.875
crop_mode: center
Created AdamW (adamw) optimizer: lr: 0.001, betas: (0.9, 0.999), eps: 1e-08, weight_decay: 0.05, amsgrad: False, foreach: None, maximize: False, capturable: False
AMP not enabled. Training in torch.float32.
Restoring model state from checkpoint...
Restoring optimizer state from checkpoint...
Loaded checkpoint 'output_xxs24/20250814-223557-cait_xxs24_224-224/model_best.pth.tar' (epoch 225)
Using native Torch DistributedDataParallel.
Scheduled epochs: 410 (epochs + cooldown_epochs). Warmup within epochs when warmup_prefix=False. LR stepped per epoch.
Train: 0 [ 0/1251 ( 0%)] Loss: 3.27 (3.89) Time: 4.511s, 227.01/s (4.511s, 227.01/s) LR: 1.000e-06 Data: 0.031 (1.119)
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Train: 0 [ 50/1251 ( 4%)] Loss: 4.29 (3.83) Time: 1.043s, 981.32/s (1.103s, 928.77/s) LR: 1.000e-06 Data: 0.044 (0.058)
Train: 0 [ 100/1251 ( 8%)] Loss: 3.91 (3.83) Time: 1.042s, 982.44/s (1.072s, 955.04/s) LR: 1.000e-06 Data: 0.044 (0.049)
Train: 0 [ 150/1251 ( 12%)] Loss: 4.20 (3.81) Time: 1.048s, 976.83/s (1.062s, 964.31/s) LR: 1.000e-06 Data: 0.058 (0.046)


I'm really not sure what's going on here; I've never seen the scripts fail to resume like that. Have any modifications been made? What revision of the scripts is checked out? What PyTorch version? Are any arguments besides the ones pasted being used?


Instead of passing the arguments in the training command, I changed the arg parser defaults directly in the train.py file. For example, I changed '--opt', default='sgd' to '--opt', default='adamw' in train.py (the default in timm's train.py is sgd) instead of adding --opt adamw to the training command. Could this be the reason?

All I did was change the hyperparameter defaults in timm's train.py to match the official DeiT repo's main.py, and adjust the CaiT-specific hyperparameters that differ from the DeiT setting, as mentioned in the CaiT paper, in the training command (--epochs 400 --lr 0.001 --drop-path 0.1).

By the way, I'm using Python 3.9 with torch 1.13 in a conda environment.
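For what it's worth, a minimal illustration (not timm's actual parser) of why editing the defaults in train.py isn't necessary: a flag passed on the command line overrides the argparse default, so --opt adamw in the training command has the same effect without touching the file.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--opt', default='sgd')

print(parser.parse_args([]).opt)                  # 'sgd'   (default used)
print(parser.parse_args(['--opt', 'adamw']).opt)  # 'adamw' (default overridden on the CLI)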


group.add_argument('--start-epoch', default=None, type=int, metavar='N',

I found the reason. I had set the default of --start-epoch to 0. Sorry for the confusion.
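For anyone who hits the same thing: an explicitly set --start-epoch takes precedence over the epoch restored by --resume, which is why a default of 0 restarts the schedule (and its warmup) from scratch. A simplified sketch of that resolution (not verbatim timm code; details vary by version):

def resolve_start_epoch(start_epoch_arg, resume_epoch):
    # an explicit --start-epoch always wins over the epoch restored from the checkpoint
    if start_epoch_arg is not None:
        return start_epoch_arg
    if resume_epoch is not None:
        return resume_epoch
    return 0

print(resolve_start_epoch(0, 226))     # 0   -> schedule restarts, warmup kicks in again
print(resolve_start_epoch(None, 226))  # 226 -> training continues where it left off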
