offload_to_disk=True is very slow for second initial forward and uses more VRAM · ModelCloud/GPTQModel · Discussion #2174

avtc
Nov 4, 2025

@Qubitium
Hi, I noticed that with offload_to_disk=False the second initial forward is very fast (around 6 seconds for 1534 samples on GLM-4.5-Air):

[image]

This mode also uses less VRAM for the same dataset: with vram_strategy="balanced" on 8 x 3090, I was able to get past layers 0 and 1 and keep going.

With offload_to_disk=True, by contrast, the same forward takes 6+ minutes and uses more VRAM, so I was not able to get past layer 0, even with "balanced":

[image] [image]

Another observation: on my setup, Minimax-M2 gets past the first layer with 16 samples, but with 64+ samples there is a CUDA OOM on line 347 of modeling_minimax_m2.py:

attn_weights = torch.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
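To get a feel for why that line is sensitive to the sample count, here is a rough back-of-the-envelope sketch. It only counts the float32 tensor that softmax produces with dtype=torch.float32, and it assumes the samples of one forward pass are batched together, which I have not verified. The head count and sequence length below are hypothetical placeholders, not the real Minimax-M2 values.

```python
def softmax_fp32_bytes(batch, heads, q_len, kv_len):
    """Bytes needed for a float32 attention-weights tensor of shape
    (batch, heads, q_len, kv_len)."""
    return batch * heads * q_len * kv_len * 4  # 4 bytes per float32 element


# Hypothetical shapes, chosen only for illustration.
HEADS, SEQ = 48, 2048

for batch in (16, 64):
    gib = softmax_fp32_bytes(batch, HEADS, SEQ, SEQ) / 2**30
    print(f"batch={batch}: about {gib:.1f} GiB for the float32 softmax output")
```

The estimate grows linearly with the number of samples and quadratically with the sequence length, so going from 16 to 64 samples multiplies this one temporary by four, before counting the input tensor and the copy cast back to query_states.dtype.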

I will think about moving inputs and outputs back and forth between VRAM and RAM (or to another GPU) during the forward pass, so that more samples can be used, and/or about optimizing modeling_minimax_m2.py to release unneeded tensors.
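A minimal sketch of the first idea, in plain PyTorch and not tied to GPTQModel's internals: keep all samples in RAM, move a few at a time to the device, run the layer, and move the outputs straight back. The function name forward_in_chunks, the chunk_size parameter, and the nn.Linear stand-in for a decoder layer are all hypothetical, and the sketch assumes every sample has the same shape.

```python
import torch
from torch import nn


@torch.no_grad()
def forward_in_chunks(layer, inputs, device, chunk_size=4):
    """Run `layer` over CPU-resident samples a few at a time.

    Only `chunk_size` samples (plus their outputs) live on `device`
    at any moment; everything else stays in host RAM.
    """
    outputs = []
    for start in range(0, len(inputs), chunk_size):
        # Move just this chunk to the device.
        chunk = torch.stack(inputs[start:start + chunk_size]).to(device)
        out = layer(chunk)
        # Bring the results back to RAM and split them per sample.
        outputs.extend(out.to("cpu").unbind(0))
        # Drop the device references so the allocator can reuse the memory.
        del chunk, out
    return outputs


device = "cuda" if torch.cuda.is_available() else "cpu"
layer = nn.Linear(8, 8).to(device)                 # stand-in for one decoder layer
samples = [torch.randn(5, 8) for _ in range(10)]   # 10 samples, seq len 5, hidden 8
outs = forward_in_chunks(layer, samples, device)
```

This trades speed for memory: each chunk pays a host-to-device and device-to-host copy, so chunk_size would need tuning against the available VRAM.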


Replies: 0 comments
