[1/N] add fp8 fp32 scale support for custom RL model #368
yiakwy-xpu-ml-framework-team wants to merge 2 commits into
yiakwy-xpu-ml-framework-team
commented
Jun 9, 2026
@antirez could you have a look at it?
antirez
commented
Jun 9, 2026
Hi, the PR itself has a few quality issues, but above all it is not clear why it would be useful for the project as a whole, given that we convert from the DS4 Hugging Face formats.
Quantization is successful.
@antirez Thank you for the quick response, let me explain.
- Our SFT/RL model of DeepSeek V4 has an embedding layer stored as bf16 or int32, while the DeepSeek model has an embedding of type int64.
- Since we run on the Hopper platform, our expert weights are stored as E4M3 FP8 and the weight scales as FP32 for best performance (which can be verified in SGLang):
  [Image: Customer DSV4 SGLang FP8 serving on the Hopper platform with identity injection, private/public knowledge injection and enhanced security shield module]
- The Hugging Face model is not an SGLang-compatible version, while ours is; and Hugging Face does not cover converting an SFT/RL model from BF16 to FP8 variants.
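To make the storage layout above concrete, here is a minimal sketch of E4M3 weights with one FP32 scale per block. It is not the code in this PR: the helper names (`to_fp32`, `round_e4m3`, `quantize_block`, `dequantize_block`), the per-block `amax / 448` scale rule, and the sample values are all assumptions for illustration, and FP8 rounding is simulated in pure Python rather than done on the GPU.

```python
import math
import struct

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (the "FN" variant)


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def round_e4m3(x: float) -> float:
    """Simulate saturating round-to-nearest-even onto the E4M3 value grid."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)          # saturate instead of overflowing
    _, e = math.frexp(a)               # a == m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)               # -6 is the smallest normal exponent
    quantum = 2.0 ** (exp - 3)         # 3 mantissa bits -> 8 steps per binade
    q = round(a / quantum) * quantum   # Python's round() is ties-to-even
    return math.copysign(q, x)


def quantize_block(weights):
    """Hypothetical helper: one FP32 scale per block, values on the E4M3 grid."""
    amax = max(abs(w) for w in weights)
    scale = to_fp32(amax / E4M3_MAX) if amax > 0 else 1.0
    return [round_e4m3(w / scale) for w in weights], scale


def dequantize_block(codes, scale):
    """Recover approximate weights: value = code * scale."""
    return [c * scale for c in codes]


block = [0.5, -1.25, 3.0, 0.01, -7.0, 2.2]   # made-up sample weights
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print(scale, codes, restored)
```

With these sample values the largest weight (-7.0) maps to the E4M3 limit of -448, while 2.2 comes back as 2.25 and 0.01 as 0.009765625, which shows the rounding error that the 3-bit mantissa introduces.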
The model is tuned specifically to handle Cantonese, Mandarin Chinese and English efficiently.
Hi @antirez, I can confirm that the modification generates a correct model checkpoint for ds4. I would appreciate your attention.
GB10 (2-bit dsv4-sft-rl, 15 toks/sec):
[Screenshot 2026-06-12 17:54:44]
Serving with raw model (no system prompt):
[Image: 2135b13f7dc854cf793c7d7e849dc316]
Background
We added an FP8 RL+SFT version of DeepSeek V4 as part of our week-0 support, and it surpassed the DeepSeek V4 baseline in all major dimensions in our internal evaluation.
Hence we want to add 2-bit support for DeepSeek V4 with our Expert Pruning technology:
[Screenshot 2026-06-09 15:40:50]
Note: on H100/H800 we usually don't use E8M0 for the scale, since it introduces runtime overhead. An FP32 scale works best.
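For readers unfamiliar with the two scale formats, the sketch below contrasts them numerically. It only illustrates what each format can represent and does not measure the runtime overhead mentioned above. E8M0 stores an exponent only, so the scale must be a power of two, while an FP32 scale can take the value `amax / 448` directly. The function names and the round-up rule used for E8M0 are assumptions for this example; real converters may round differently.

```python
import math
import struct

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (the "FN" variant)


def fp32_scale(amax: float) -> float:
    """Scale stored as FP32: amax / 448, rounded to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", amax / E4M3_MAX))[0]


def e8m0_scale(amax: float) -> float:
    """Scale stored as E8M0: a pure power of two, 2**k.

    Assumption: k is rounded up so that amax / scale never exceeds 448.
    Requires amax > 0.
    """
    m, e = math.frexp(amax / E4M3_MAX)   # amax/448 == m * 2**e, 0.5 <= m < 1
    k = e - 1 if m == 0.5 else e         # exact ceil(log2(amax / 448))
    k = max(-127, min(127, k))           # exponent range representable in E8M0
    return 2.0 ** k


for amax in (7.0, 10.0):                 # made-up block maxima
    s32, s8 = fp32_scale(amax), e8m0_scale(amax)
    print(f"amax={amax}: fp32 scale={s32:.6g} (peak code {amax / s32:.1f}), "
          f"e8m0 scale={s8:.6g} (peak code {amax / s8:.1f})")
```

When `amax / 448` happens to be a power of two (as for 7.0), both formats give the same scale. For 10.0, the E8M0 scale is rounded up to 2^-5, so the largest weight only reaches code 320 and part of the E4M3 range goes unused.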