I trained cait_xxs24 until epoch 225 and tried to resume training using --resume, but the accuracy started to decrease as the epochs went on... Either something is wrong or I'm missing something when using resume. Below is the resume command I used.
Current checkpoints:
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-1.pth.tar', 71.604)
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-0.pth.tar', 70.6)
('output_xxs24/20250820-120753-cait_xxs24_224-224/checkpoint-2.pth.tar', 69.976)
./distributed_train.sh 2 --data-dir imagenet --model cait_xxs24_224 --batch-size 128 --epochs 400 --aug-repeats 3 --lr 0.001 --drop-path 0.1 --grad-accum-steps 4 --output output_xxs24 --resume output_xxs24/~~
-
@atti0127 are you sure it doesn't do that without resume? There is an LR warmup by default, and if the LR gets too high your performance will drop.
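Roughly speaking (this is just a sketch, not timm's actual scheduler code, and the warmup_lr_init=1e-6 / 5 warmup epochs are assumed values), a warmup + cosine schedule looks like the following, so a schedule restarted from epoch 0 begins back at the tiny warmup LR:
import math

def lr_at_epoch(epoch, base_lr=1e-3, min_lr=1e-5, total_epochs=400,
                warmup_epochs=5, warmup_lr_init=1e-6):
    # linear warmup to base_lr, then cosine decay down to min_lr
    if epoch < warmup_epochs:
        return warmup_lr_init + (base_lr - warmup_lr_init) * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

print(lr_at_epoch(0))    # ~1e-06: what a schedule restarted at epoch 0 uses
print(lr_at_epoch(225))  # ~4e-04: what a correctly resumed run would continue at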
-
These are the results up to epoch 225, using the training script below (the rest of the settings were exactly the same as DeiT, as described in the CaiT paper):
./distributed_train.sh 2 --data-dir imagenet --model cait_xxs24_224 --batch-size 128 --epochs 400 --aug-repeats 3 --lr 0.001 --drop-path 0.1 --grad-accum-steps 4 --output output_xxs24
I tried to start training again using the last.pth.tar file, only adding --resume, and the accuracy started to drop (as mentioned above).
-
checkpoint-211.pth.tar checkpoint-212.pth.tar checkpoint-214.pth.tar checkpoint-219.pth.tar checkpoint-220.pth.tar checkpoint-221.pth.tar checkpoint-222.pth.tar checkpoint-223.pth.tar checkpoint-224.pth.tar checkpoint-225.pth.tar last.pth.tar model_best.pth.tar summary.csv
The output directory contains the files above. I tried the resume options below, and every run started from Train: 0 with the accuracy dropping:
--resume output_xxs24/checkpoint-225.pth.tar
--resume output_xxs24/last.pth.tar
--resume output_xxs24/model_best.pth.tar
(timm) user1@user1-System-Product-Name:~/Desktop/pytorch-image-models$ ./distributed_train.sh 2 --data-dir imagenet --model cait_xxs24_224 --batch-size 128 --epochs 400 --aug-repeats 3 --lr 0.001 --drop-path 0.1 --grad-accum-steps 4 --output output_xxs24 --resume output_xxs24/20250814-223557-cait_xxs24_224-224/model_best.pth.tar
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 device per process.Process 1, total 2, device cuda:1.
Training in distributed mode with multiple processes, 1 device per process.Process 0, total 2, device cuda:0.
Model cait_xxs24_224 created, param count:11993328
Data processing configuration for current model + dataset:
input_size: (3, 224, 224)
interpolation: bicubic
mean: (0.485, 0.456, 0.406)
std: (0.229, 0.224, 0.225)
crop_pct: 0.875
crop_mode: center
Created AdamW (adamw) optimizer: lr: 0.001, betas: (0.9, 0.999), eps: 1e-08, weight_decay: 0.05, amsgrad: False, foreach: None, maximize: False, capturable: False
AMP not enabled. Training in torch.float32.
Restoring model state from checkpoint...
Restoring optimizer state from checkpoint...
Loaded checkpoint 'output_xxs24/20250814-223557-cait_xxs24_224-224/model_best.pth.tar' (epoch 225)
Using native Torch DistributedDataParallel.
Scheduled epochs: 410 (epochs + cooldown_epochs). Warmup within epochs when warmup_prefix=False. LR stepped per epoch.
Train: 0 [ 0/1251 ( 0%)] Loss: 3.27 (3.89) Time: 4.511s, 227.01/s (4.511s, 227.01/s) LR: 1.000e-06 Data: 0.031 (1.119)
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Train: 0 [ 50/1251 ( 4%)] Loss: 4.29 (3.83) Time: 1.043s, 981.32/s (1.103s, 928.77/s) LR: 1.000e-06 Data: 0.044 (0.058)
Train: 0 [ 100/1251 ( 8%)] Loss: 3.91 (3.83) Time: 1.042s, 982.44/s (1.072s, 955.04/s) LR: 1.000e-06 Data: 0.044 (0.049)
Train: 0 [ 150/1251 ( 12%)] Loss: 4.20 (3.81) Time: 1.048s, 976.83/s (1.062s, 964.31/s) LR: 1.000e-06 Data: 0.058 (0.046)
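As a sanity check, the epoch stored in the checkpoint can be read directly (a quick sketch; the 'epoch' entry is what the log above reports as (epoch 225)):
import torch

# inspect the checkpoint that was passed to --resume
ckpt = torch.load(
    'output_xxs24/20250814-223557-cait_xxs24_224-224/model_best.pth.tar',
    map_location='cpu',
)
print(ckpt.get('epoch'))    # 225, matching the resume log
print(sorted(ckpt.keys()))  # should include e.g. 'state_dict' and 'optimizer'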
-
I'm really not sure what's going on here; I've never seen the scripts fail to resume like that. Were any modifications made? What revision of the scripts is checked out? What PyTorch version? Are any arguments besides the ones pasted being used?
-
Instead of passing the arguments via the training script, I changed the argparse defaults directly in the train.py file. For example, I changed '--opt', default='sgd' to '--opt', default='adamw' in train.py (timm's train.py defaults to sgd) instead of adding --opt adamw to the training script. Could this be the reason?
All I did was change the defaults in timm's train.py to match the official DeiT repo's main.py, and then set the hyperparameters that the CaiT paper lists as differing from the DeiT setup in the training script (--epochs 400, --lr 0.001, --drop-path 0.1).
By the way, I'm using Python 3.9 with torch 1.13 in a conda environment.
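To be concrete, the kind of in-place edit I made looks roughly like this (a minimal sketch with assumed surrounding code, not the exact lines from train.py):
import argparse

# sketch of editing a parser default in train.py instead of passing a flag
parser = argparse.ArgumentParser(description='Training sketch')
group = parser.add_argument_group('Optimizer parameters')
group.add_argument('--opt', default='adamw', type=str)  # upstream default is 'sgd'

args = parser.parse_args([])
print(args.opt)  # 'adamw' without ever passing --opt on the command line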
-
group.add_argument('--start-epoch', default=None, type=int, metavar='N',
I found the reason: I had set the default of --start-epoch to 0. Sorry for the confusion.
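For anyone else hitting this: the epoch recovered by --resume only takes effect when --start-epoch stays at None. A paraphrased sketch of the decision logic (details may differ by train.py revision):
def resolve_start_epoch(start_epoch_arg, resume_epoch):
    # an explicit --start-epoch always overrides the epoch recovered from --resume,
    # so a parser default of 0 silently restarts the schedule (and LR warmup) at epoch 0
    if start_epoch_arg is not None:
        return start_epoch_arg
    if resume_epoch is not None:
        return resume_epoch
    return 0

print(resolve_start_epoch(None, 226))  # default None: continues from the resumed epoch
print(resolve_start_epoch(0, 226))     # default 0: restarts at 0, LR falls back to warmup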