-
@Qubitium Hi!
I have been thinking about how to improve quantization of MoE models, since the calibration dataset size is limited by VRAM (usually that of a single GPU, which also has to hold the modules). I have several ideas, and I have tried to implement one of them with the help of Antigravity/Gemini 3 Pro/Claude 4.5.
The idea is to pass the whole dataset (and the subsequent inputs) to each expert, not only the routed data.
I have squashed the implementation into a single commit for easier review. However, it is based on the older pre-data-parallel branch rather than recent code, and I lack the knowledge to review its correctness, so I am not ready to propose it as a PR yet.
avtc@9d96430
I am still testing it, will share results later on.
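To make the difference between the two calibration modes concrete, here is a minimal toy sketch. It is not the code from the commit above: the function names, the random router scores, and the top-2 routing are all assumptions made for illustration. It only counts how many tokens each expert would see when statistics are collected from routed tokens versus from the full dataset.

```python
import random


def route_top_k(scores, k):
    """Return the indices of the k highest router scores for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def calibration_counts(num_tokens, num_experts, top_k, full_dataset, seed=0):
    """Count how many tokens each expert sees during calibration (toy model)."""
    rng = random.Random(seed)
    counts = [0] * num_experts
    for _ in range(num_tokens):
        if full_dataset:
            # "Full dataset" mode: every expert receives every token.
            chosen = range(num_experts)
        else:
            # "Routed" mode: random scores stand in for real router logits.
            scores = [rng.random() for _ in range(num_experts)]
            chosen = route_top_k(scores, top_k)
        for e in chosen:
            counts[e] += 1
    return counts


routed = calibration_counts(1000, 8, 2, full_dataset=False)
full = calibration_counts(1000, 8, 2, full_dataset=True)
print("routed:", routed)
print("full:  ", full)
```

In this toy setup each expert sees every token in full mode but only a fraction of them in routed mode, which matches the much larger per-expert sample count in the logs.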
What I see is that each expert receives more calibration data (just like a non-MoE module), the forward pass becomes 2.5-3 times longer, and the loss becomes ~10-100 times lower. For example, this is from the first layer of Minimax-M2, with 1536 samples and 0.05 damp:
{
  "process": "gptq",
  "layer": 0,
  "module": "block_sparse_moe.experts.1.w1",
  "loss": "0.0000032814",
  "samples": "496497",
  "damp": "0.05000",
  "time": "0.833",
  "fwd_time": "329.542",
  "(v)ram": "11187.40MB, 2837.37MB",
  "dynamic": null
}
vs
{
  "process": "gptq",
  "layer": 0,
  "module": "block_sparse_moe.experts.1.w1",
  "loss": "0.0001497157",
  "samples": "10186",
  "damp": "0.05000",
  "time": "3.729",
  "fwd_time": "105.122",
  "(v)ram": "11215.26MB, 2618.66MB",
  "dynamic": null
}
Can you suggest the best damp setting to use with this approach?
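For readers unfamiliar with the damp setting: in the reference GPTQ algorithm, as I understand it, damping adds a fraction of the mean Hessian diagonal to every diagonal entry before the Hessian is inverted, which keeps the inversion numerically stable. The sketch below shows that step on a plain nested list. The function name is hypothetical, and whether GPTQModel applies damping in exactly this form is an assumption.

```python
def apply_damping(hessian, damp_percent):
    """Return a copy of a square matrix with damp_percent * mean(diag) added to its diagonal."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = damp_percent * mean_diag
    damped = [row[:] for row in hessian]  # copy so the input is not modified
    for i in range(n):
        damped[i][i] += damp
    return damped


H = [[4.0, 1.0], [1.0, 2.0]]
print(apply_damping(H, 0.05))
```

A larger damp value regularizes more strongly; off-diagonal entries are left unchanged.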
-
@avtc Interesting work! Let me know how it works out. Some points:
- The error loss that you see for GPTQ quantization is a little misleading, in that lower does not always mean better. If the errors are so low that they approach zero, it may just mean there is too much calibration data and/or too many repeating values in the calibration data, i.e. overfitting. High error loss is obviously a sign of bad calibration, but at the opposite end, the goal is not to get error loss to 0, which may actually be just as bad.
- So vary the dataset, which you already do, and make sure there are not too many repeating patterns in the calibration dataset.
- Very interested to find out how brute forcing the moe modules with unrouted tokens will affect the final model. Thanks for working on this!
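One rough way to check a calibration set for repeating patterns is to measure how many word n-grams are duplicates. This is only a heuristic sketch: the function name, the whitespace tokenization, and the choice of n=3 are assumptions, and it is not part of GPTQModel.

```python
from collections import Counter


def repeated_ngram_ratio(texts, n=3):
    """Fraction of word n-grams in the dataset that duplicate an earlier n-gram."""
    counts = Counter()
    for text in texts:
        words = text.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every occurrence beyond the first of each distinct n-gram counts as a repeat.
    return (total - len(counts)) / total


samples = ["the quick brown fox jumps", "the quick brown fox sleeps"]
print(repeated_ngram_ratio(samples))
```

A value close to 0 means the samples share few n-grams, while a value close to 1 means the dataset is dominated by repeated text.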
-
@Qubitium
Few shots with GLM-4.6-Reap-268B, w4g128 + dynamic.
image
Full dataset passed to each expert, damp 0.01.
S(Shot)1 with OpenWebUI: Create a Playable Synth Keyboard using html, css, js in a single html file
image
https://jsfiddle.net/msfkLngj/
S3 with OpenWebUI
image
S1 with Kilo, Code mode: Write super mario bros clone using html,css,js
image
https://jsfiddle.net/j94ykfb5/
Compare with routed dataset, damp=0.025:
S1
image
S3
image
S2 with Kilo, Code mode, (S1 had error on collision with coin)
image
Sampling params: t=1.0 minp=0 topp=0.95 k=40
It looks like the "full dataset" variant is more consistent and makes fewer errors: it solved all of my few evals within 1-3 shots, while the "routed" variant with damp 0.025 solved only part of the evals, and with issues. Quantization took 20+ hours for "full dataset" and around 11 hours for "routed" on 8x3090 with 1808 samples (~530K tokens).
-
The shared experts use a group size of 64 so that the model can run with tp8 on vLLM, and the total model size allows using the whole available 200K context window with a q8 KV cache.
-
@avtc Based on your test, I think we should definitely add this as an option for gpt-qmodel users, now that MoE is more prevalent than ever and MoE routing is a thorn in quantization. Conceptually, it should not work as well as it does, but reality trumps theory. =) Do you want to work up a PR?
-
@Qubitium Hi, I tried to test my changes against the recent main branch on GLM-4.5-Air, with 10 samples from c4/en on 4x3090, with offload_to_disk=False (and True), and encountered a bug: layer 0 cannot be finalized and hangs forever. I reproduced the same issue on the latest main branch, 17b7bcc. It looks like this:
image
I have asked Sonnet 4.5 and GLM-4.6, but their suggestions did not resolve the issue. Can you check whether you can reproduce it? I even added a flag to wait for layer finalization before proceeding to the next layer, and it just hangs and does not move on.
The code:
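from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

SAMPLES = 10

# Take the first SAMPLES documents from one C4 English shard as calibration text.
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(SAMPLES))["text"]

quant_config = QuantizeConfig(
    bits=4,
    group_size=32,
    sym=True,
    desc_act=False,
    act_group_aware=True,
    dynamic={},
    damp_percent=0.05,
    damp_auto_increment=0.01,
    fail_safe=True,
    offload_to_disk=False,
)

# model_path is the local path of the model to quantize (defined elsewhere).
model = GPTQModel.load(model_path, quant_config)
model.quantize(
    calibration_dataset,
    batch_size=1,
)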
The issue does not reproduce with Qwen3-Coder-30B-A3B-Instruct...
I have tried the same model from two different folders, padded and not padded, and the issue reproduced for both. (I thought maybe the files had become corrupted, but it looks like they have not.)
Even if I exclude the self_attn modules, it still hangs on the first layer with quantized modules and does not proceed:
image
-
@Qubitium I have identified the issue: it was caused by the pack_block extension, maybe a wrong cached version or something else, I don't know yet. I will investigate how to fix it.
-
@avtc I hit this bug myself yesterday. I thought I had fixed it by deleting the extension and recompiling on load with locks, but it is not working. What's worse, the issue has randomness: some users get this error and others don't.
-
It would be nice to get an error with a suggestion for how to fix it when it hangs (until it is fixed), because debugging this was very long and hard.
For example: find ~/.cache/torch_extensions -name "*pack_block*" -type d -exec rm -rf {} +
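The same cleanup can be sketched in Python, for instance as a helper that an error message could point to. The function name is hypothetical and this is not an existing GPTQModel API. The sketch assumes the cache lives under the directory named by the TORCH_EXTENSIONS_DIR environment variable, falling back to ~/.cache/torch_extensions.

```python
import os
import shutil
from pathlib import Path


def clear_cached_extension(name_fragment, root=None):
    """Delete cached extension build directories whose name contains name_fragment."""
    if root is None:
        # Assumed default location of the PyTorch extension build cache.
        root = os.environ.get(
            "TORCH_EXTENSIONS_DIR", Path.home() / ".cache" / "torch_extensions"
        )
    root = Path(root)
    removed = []
    if not root.is_dir():
        return removed
    # Collect all matches first so we never walk into a directory being deleted.
    matches = [p for p in root.rglob(f"*{name_fragment}*") if p.is_dir()]
    for path in sorted(matches, key=lambda p: len(p.parts)):
        if path.exists():  # may already be gone if a matching parent was removed
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

Calling clear_cached_extension("pack_block") would then remove the stale build directories so that the extension is recompiled on the next load.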
-
Yes. The first time I tried to debug this, it wasted 4 hours of my time. And the bug is still not fixed. The problem is that we cannot detect it, because it enters the pack and never returns; it just stalls for no apparent reason. Unless we have an external thread/timer to watch the ops, which adds even more complexity, we cannot detect this hang correctly. Need to fix this bug asap.
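As a rough illustration of the external thread/timer idea, the sketch below runs an operation in a worker thread and raises an error with a hint if it does not return in time. The names are hypothetical and this is not GPTQModel code. It also has a real limitation: Python cannot kill the stuck thread, and if the stalled native call holds the GIL, this in-process watchdog would not fire at all, so a separate process might be needed.

```python
import threading


def run_with_watchdog(fn, timeout_s, hint):
    """Run fn() in a worker thread; raise TimeoutError with a hint if it does not finish in time."""
    result = {}

    def worker():
        try:
            result["value"] = fn()
        except BaseException as exc:  # forward any error to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        # The worker is still running; we can only report it, not stop it.
        raise TimeoutError(f"operation did not finish within {timeout_s}s. {hint}")
    if "error" in result:
        raise result["error"]
    return result["value"]
```

The hint could carry the cache cleanup command, so that a user who hits the stall sees a suggested fix instead of a silent hang.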