Pass whole dataset/and subsequent inputs/ to each expert, not only routed data. #2219

avtc started this conversation in Ideas

@Qubitium Hi!
I have been thinking about how to improve quantization of MoE models, since the calibration dataset size is limited by VRAM (usually that of a single GPU, which also has to hold the modules). I have several ideas, and I have tried to implement one of them with the help of Antigravity/Gemini 3 Pro/Claude 4.5.

The idea is to pass the whole dataset (and the subsequent inputs) to each expert, not only the routed data.
I have squashed the implementation into a single commit for easier review. However, it is based on the pre-data-parallel branch rather than on recent code, and I lack the knowledge to review the correctness of its behavior, so I am not ready to propose it as a PR yet.
avtc@9d96430

I am still testing it and will share results later.
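
To illustrate the idea in isolation, here is a toy sketch of how many calibration tokens each expert sees under top-k routing versus when every token is passed to every expert. It is not the code from the commit above: `calibration_counts` is a hypothetical helper, the router is replaced by random scores, and real routers are usually far less balanced than this.

```python
import random


def calibration_counts(num_tokens, num_experts, top_k, pass_all, seed=0):
    """Count how many calibration tokens each expert receives.

    pass_all=False mimics top-k routing (random scores stand in for the router);
    pass_all=True feeds every token to every expert.
    """
    rng = random.Random(seed)
    counts = [0] * num_experts
    for _ in range(num_tokens):
        if pass_all:
            chosen = range(num_experts)
        else:
            scores = [rng.random() for _ in range(num_experts)]
            ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
            chosen = ranked[:top_k]
        for expert in chosen:
            counts[expert] += 1
    return counts


routed = calibration_counts(num_tokens=1000, num_experts=8, top_k=2, pass_all=False)
full = calibration_counts(num_tokens=1000, num_experts=8, top_k=2, pass_all=True)
print("routed:", routed)
print("full:  ", full)
```

With routing, the per-expert counts sum to `num_tokens * top_k`, so each expert sees only a fraction of the data; with `pass_all=True` every expert sees all tokens, at the cost of a proportionally longer forward pass.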

What I see is that each expert receives more calibration data (just like a non-MoE module), the forward pass becomes 2.5-3 times longer, and the loss becomes ~10-100 times lower. For example, this is from the first layer of Minimax-M2, with 1536 samples and 0.05 damp:

{
 "process": "gptq",
 "layer": 0,
 "module": "block_sparse_moe.experts.1.w1",
 "loss": "0.0000032814",
 "samples": "496497",
 "damp": "0.05000",
 "time": "0.833",
 "fwd_time": "329.542",
 "(v)ram": "11187.40MB, 2837.37MB",
 "dynamic": null
}
vs. routed data only:
{
 "process": "gptq",
 "layer": 0,
 "module": "block_sparse_moe.experts.1.w1",
 "loss": "0.0001497157",
 "samples": "10186",
 "damp": "0.05000",
 "time": "3.729",
 "fwd_time": "105.122",
 "(v)ram": "11215.26MB, 2618.66MB",
 "dynamic": null
}
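
The two log entries above can be compared directly. The short sketch below only does arithmetic on the values shown in the logs; the dictionary and function names are made up for illustration.

```python
# Values copied from the two log entries above (first entry: whole dataset,
# second entry: routed data only).
full_dataset = {"loss": 0.0000032814, "samples": 496497, "fwd_time": 329.542}
routed_only = {"loss": 0.0001497157, "samples": 10186, "fwd_time": 105.122}


def compare(full, routed):
    """Return how the two runs relate on loss, sample count and forward time."""
    return {
        "loss_reduction": routed["loss"] / full["loss"],
        "sample_increase": full["samples"] / routed["samples"],
        "fwd_time_increase": full["fwd_time"] / routed["fwd_time"],
    }


ratios = compare(full_dataset, routed_only)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```

For this module the loss is roughly 46 times lower, the expert sees roughly 49 times more samples, and the forward pass takes roughly 3.1 times longer.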

Can you suggest the best damp setting to use with this approach?
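
For context on what the damp value controls: in the reference GPTQ algorithm, a fraction of the mean of the Hessian diagonal is added to the diagonal before the matrix is inverted. The sketch below shows that formula in plain Python. It is an illustration only; `apply_damping` is a hypothetical helper, and it is an assumption that `damp_percent` in GPTQModel is applied in exactly this way.

```python
def apply_damping(hessian, damp_percent):
    """Add damp_percent * mean(diag(H)) to the diagonal of a square matrix H."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = damp_percent * mean_diag
    return [
        [hessian[i][j] + (damp if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


# Tiny 2x2 example: mean(diag) = 3.0, so damp = 0.05 * 3.0 = 0.15.
H = [[4.0, 1.0], [1.0, 2.0]]
damped = apply_damping(H, 0.05)
print(damped)
```

Because the added term scales with the Hessian diagonal, the same damp value has the same relative effect whether an expert sees ten thousand samples or half a million, which is one reason the best value has to be found empirically.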


@avtc Interesting work! Let me know how it works out. Some points:

  1. The error loss that you see for GPTQ quantization is a little misleading, in that lower does not always mean better. If the errors are so low that they approach zero, it may just mean there is too much calibration data and/or too many repeating values in the calibration data, i.e. overfitting. A high error loss is obviously a sign of bad calibration, but at the opposite end, the goal is not to drive the error loss to 0, which may actually be just as bad.

  2. So vary the dataset, which you already do, and make sure there are not too many repeating patterns in the calibration dataset.

  3. Very interested to find out how brute-forcing the MoE modules with unrouted tokens will affect the final model. Thanks for working on this!


@Qubitium
A few shots with GLM-4.6-Reap-268B, w4g128 + dynamic.
image

Full dataset passed to each expert, damp 0.01.
S(Shot)1 with OpenWebUI: Create a Playable Synth Keyboard using html, css, js in a single html file
image
https://jsfiddle.net/msfkLngj/

S3 with OpenWebUI
image

S1 with Kilo, Code mode: Write super mario bros clone using html,css,js
image
https://jsfiddle.net/j94ykfb5/

Compare with the routed dataset, damp=0.025:
S1
image
S3
image
S2 with Kilo, Code mode (S1 had an error on collision with a coin)
image

Sampling params: t=1.0 minp=0 topp=0.95 k=40
It looks like the "full dataset" variant is more consistent and makes fewer errors: it solved all of my few evals within 1-3 shots, while the "routed" 0.025 variant solved only part of the evals, and with issues. Quantization took 20+ hours for "full dataset" and around 11 hours for "routed" on 8x3090 with 1808 samples (~530K tokens).


The shared experts use a group size of 64 so that the model can run with tp8 on vLLM, and the total model size allows using the full available context window of 200K with a q8 KV cache.


@avtc Based on your test, I think we should definitely add this as an option for GPT-QModel users, now that MoE is more prevalent than ever and MoE routing is a thorn in quantization. Conceptually, it should not work as well as it does, but reality trumps theory. =) Do you want to work up a PR?


@Qubitium Hi, I tried to test my changes against the recent main branch on GLM-4.5-Air, with 10 samples from c4/en on 4x3090, with offload_to_disk=False (and True), and encountered a bug where layer 0 cannot be finalized: it hangs forever. I reproduced the same issue on the latest main branch, 17b7bcc. It looks like this:
image

I asked Sonnet 4.5 and GLM-4.6, but their suggestions did not resolve the issue. Can you check whether you can reproduce it? I even added a flag to wait for layer finalization before proceeding to the next layer, and it just hangs and never moves on to the next layer.

The code:

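from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Placeholder: replace with the local path to the GLM-4.5-Air checkpoint.
model_path = "/path/to/GLM-4.5-Air"

# Take the first 10 documents from one c4/en shard as calibration text.
SAMPLES = 10
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(SAMPLES))["text"]

quant_config = QuantizeConfig(
    bits=4,
    group_size=32,
    sym=True,
    desc_act=False,
    act_group_aware=True,
    dynamic={},
    damp_percent=0.05,
    damp_auto_increment=0.01,
    fail_safe=True,
    offload_to_disk=False,  # the hang reproduces with True as well
)
model = GPTQModel.load(model_path, quant_config)
model.quantize(
    calibration_dataset,
    batch_size=1,
)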

The issue does not reproduce with Qwen3-Coder-30B-A3B-Instruct...
I have tried the same model from two different folders, padded and not padded, and the issue reproduced for both. (I thought the files might have become corrupted, but it looks like they have not.)

Even if I exclude the self_attn modules, it still hangs on the first layer that has quantized modules and does not proceed:
image


@Qubitium I have identified the issue: it was caused by the pack_block extension. Maybe a wrong version was cached, I am not sure. I will investigate how to fix it.


@avtc I hit this bug myself yesterday. I thought I had fixed it by deleting the extension and recompiling on load with locks, but that is not working. What's worse, the issue is random: some users get this error and others don't.


It would be nice to have an error with a suggestion on how to fix it when it hangs (until it is fixed), because debugging this was very long and hard.
For example: find ~/.cache/torch_extensions -name "*pack_block*" -type d -exec rm -rf {} +
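
The same cleanup can be sketched in Python, in case it is useful as a starting point for an automated fix or an error hint. The helper names are hypothetical and this is not part of GPTQModel; the only assumptions are that the extensions live under `TORCH_EXTENSIONS_DIR` if it is set, and under `~/.cache/torch_extensions` otherwise.

```python
import os
import shutil
from pathlib import Path


def default_extensions_root():
    """Directory where torch caches JIT-built extensions (assumed default on Linux)."""
    fallback = Path.home() / ".cache" / "torch_extensions"
    return Path(os.environ.get("TORCH_EXTENSIONS_DIR", fallback))


def remove_cached_extension(root, pattern="*pack_block*"):
    """Delete every directory under root whose name matches pattern.

    Returns the list of removed paths so the caller can log them.
    """
    root = Path(root)
    removed = []
    if not root.is_dir():
        return removed
    # Materialize the matches first so deleting does not disturb the traversal.
    for path in sorted(root.rglob(pattern)):
        if path.is_dir():  # False if a parent match was already removed
            shutil.rmtree(path)
            removed.append(path)
    return removed


if __name__ == "__main__":
    for path in remove_cached_extension(default_extensions_root()):
        print("removed", path)
```

The extension is rebuilt on the next load, so removing a stale build should be safe, although the first run afterwards will be slower.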


Yes. The first time I tried to debug this, it wasted 4 hours of my time, and the bug is still not fixed. The problem is that we cannot detect it, because execution enters the pack and never returns: it just stalls for no apparent reason. Unless we have an external thread/timer to watch the ops, which adds even more complexity, we cannot detect this hang correctly. We need to fix this bug ASAP.
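
A minimal sketch of the external thread/timer idea mentioned above, assuming the stalled call can be run in a worker thread. `run_with_watchdog` is a hypothetical helper, not an existing GPTQModel function, and it can only detect the stall: Python cannot kill the stuck thread, so the caller still has to abort or restart.

```python
import threading


def run_with_watchdog(fn, timeout_s, hint):
    """Run fn() in a daemon thread; raise TimeoutError with a hint if it stalls."""
    result = {}

    def target():
        try:
            result["value"] = fn()
        except BaseException as exc:  # forward any failure to the caller
            result["error"] = exc

    # daemon=True so a permanently stuck worker does not block interpreter exit.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        raise TimeoutError(f"operation still running after {timeout_s}s; {hint}")
    if "error" in result:
        raise result["error"]
    return result["value"]


# Usage: a fast call returns normally, a slow one triggers the hint.
print(run_with_watchdog(lambda: 42, 1.0, "no hint needed"))
try:
    run_with_watchdog(
        lambda: threading.Event().wait(0.5),  # stands in for the stalled pack
        0.05,
        "try clearing the cached pack_block extension",
    )
except TimeoutError as exc:
    print(exc)
```

The timeout would have to be generous, since a legitimate pack on a large layer can take a while, which is part of the added complexity.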
