offload_to_disk=True is very slow for second initial forward and uses more VRAM #2174
@Qubitium
Hi, I noticed that with offload_to_disk=False the second initial forward is very fast (around 6 seconds for 1534 samples on GLM-4.5-Air).
This mode also uses less VRAM for the same dataset: with vram_strategy="balanced" on 8 x 3090, I was able to get past layers 0 and 1 and keep going.
With offload_to_disk=True, however, the same step takes 6+ minutes and uses more VRAM, so I could not get past layer 0, even with "balanced".
Another observation: on my setup, Minimax-M2 passes the first layer with 16 samples, but with 64+ samples it hits a CUDA OOM at line 347 of modeling_minimax_m2.py:
attn_weights = torch.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
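For a rough sense of why the sample count matters at this line, here is a back-of-the-envelope estimate of the memory tied up in the attention weights around the float32 softmax. It is only a sketch: it assumes all samples go through attention as one batch, that the input scores, the float32 result, and the cast-back copy are alive at the same time, and it ignores allocator overhead. The head count and sequence length below are hypothetical placeholders, not Minimax-M2's real configuration.

```python
def attn_softmax_bytes(batch, heads, q_len, kv_len, in_bytes=2):
    """Estimate bytes held by attention weights around an fp32 softmax.

    in_bytes is the size of the model dtype (2 for fp16/bf16).
    """
    elems = batch * heads * q_len * kv_len
    return {
        "input_scores": elems * in_bytes,  # attn_weights before softmax
        "fp32_output": elems * 4,          # softmax(..., dtype=torch.float32)
        "cast_back": elems * in_bytes,     # .to(query_states.dtype)
        "total": elems * (2 * in_bytes + 4),
    }


GIB = 2**30
# Hypothetical shapes: 48 heads, 1024 tokens per sample.
for samples in (16, 64):
    est = attn_softmax_bytes(samples, heads=48, q_len=1024, kv_len=1024)
    print(samples, "samples:", est["fp32_output"] / GIB, "GiB fp32,",
          est["total"] / GIB, "GiB total")
```

The estimate grows linearly with the number of samples, so going from 16 to 64 samples multiplies this footprint by four.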
I will look into moving inputs and outputs back and forth between VRAM and RAM (or another GPU) during the forward pass so that more samples can be used, and/or optimizing modeling_minimax_m2.py to release tensors that are no longer needed.
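A minimal sketch of the first idea, assuming the layer's inputs can be kept in host RAM and split along the sample dimension. `forward_in_chunks` and `chunk_size` are hypothetical names for illustration, not part of GPTQModel, and the `Linear` layer is just a stand-in for a real decoder layer.

```python
import torch


def forward_in_chunks(layer, inputs, device, chunk_size=8):
    """Run `layer` over CPU-resident inputs a few samples at a time.

    Only one chunk of inputs and outputs lives on `device` at any moment;
    results are moved back to host RAM right away.
    """
    outputs = []
    with torch.no_grad():
        for start in range(0, inputs.shape[0], chunk_size):
            chunk = inputs[start:start + chunk_size].to(device)
            out = layer(chunk)
            outputs.append(out.cpu())  # keep the result in host RAM
            del chunk, out             # drop device references before the next chunk
    return torch.cat(outputs, dim=0)


# Falls back to CPU when no GPU is present, so the sketch runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"
layer = torch.nn.Linear(64, 64).to(device)  # stand-in for a decoder layer
x = torch.randn(64, 16, 64)                 # 64 samples kept in host RAM
y = forward_in_chunks(layer, x, device, chunk_size=8)
print(y.shape, y.device)
```

The trade-off is extra host-to-device traffic per chunk in exchange for peak VRAM that depends on `chunk_size` rather than on the total number of samples.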