GGUF fix for unquantized types when using unquantize kernels #12498
Conversation
Even if the `qweight_type` is one of the `UNQUANTIZED_TYPES`, qweight still has to be "dequantized" because it is stored as an 8-bit tensor. Without doing so, there is a shape mismatch in the following matmul (see the sketch after the side notes).

Side notes:
- Why isn't `DIFFUSERS_GGUF_CUDA_KERNELS` on by default? It is significantly faster and is only used when the kernels are installed.
- https://huggingface.co/Isotr0py/ggml/tree/main/build has no build for torch 2.8 (or the upcoming 2.9). Who can we contact to make such a build?
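For readers skimming the diff, here is a minimal PyTorch sketch (not the PR's code) of what "dequantizing" an unquantized BF16 GGUF weight amounts to; the shapes below are made up for illustration:

```python
import torch

out_features, in_features = 4, 8
w_bf16 = torch.randn(out_features, in_features, dtype=torch.bfloat16)

# How the weight arrives from the GGUF reader: raw bytes, i.e. a uint8 tensor
# whose last dimension is twice the logical width.
qweight = w_bf16.view(torch.uint8)          # shape (4, 16)
x = torch.randn(2, in_features, dtype=torch.bfloat16)

# Passing qweight straight to the matmul fails: the weight looks like a
# (4, 16) uint8 tensor instead of a (4, 8) bfloat16 tensor.
# torch.nn.functional.linear(x, qweight)

# "Dequantizing" an unquantized type is just a byte-level reinterpretation.
weight = qweight.view(torch.bfloat16)       # back to shape (4, 8)
y = torch.nn.functional.linear(x, weight)
print(y.shape)                              # torch.Size([2, 4])
```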
> https://huggingface.co/Isotr0py/ggml/tree/main/build has no build for torch 2.8 (or the upcoming 2.9). Who can we contact to make such a build?
In fact, there is a pre-release build for torch 2.8 (https://huggingface.co/Isotr0py/ggml/tree/shmem-mmq/build), but I found some regression in kernel size and performance in these kernels.
Anyway, I have found the root cause of the regression and am fixing it; I will make a release with torch 2.8 and 2.9 support tonight.
It seems `dequantize_gguf_tensor` is missing an implementation for FP16 and FP32 qweight:
diffusers/src/diffusers/quantizers/gguf/utils.py
Lines 487 to 501 in dbe4136
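For context, a minimal sketch of the kind of branch this comment points at, assuming hypothetical helper names (`UNQUANTIZED_DTYPES`, `dequantize_unquantized`) rather than the actual code in `utils.py`: for unquantized GGUF types, "dequantization" reduces to a byte reinterpretation.

```python
import torch
from gguf import GGMLQuantizationType

# Hypothetical mapping from unquantized GGUF types to torch dtypes.
UNQUANTIZED_DTYPES = {
    GGMLQuantizationType.F32: torch.float32,
    GGMLQuantizationType.F16: torch.float16,
    GGMLQuantizationType.BF16: torch.bfloat16,
}

def dequantize_unquantized(qweight: torch.Tensor, qweight_type) -> torch.Tensor:
    """Reinterpret raw uint8 GGUF bytes as the stored floating-point dtype."""
    target_dtype = UNQUANTIZED_DTYPES[qweight_type]
    if qweight.dtype == torch.uint8:
        return qweight.view(target_dtype)  # byte-level reinterpretation, no copy
    return qweight.to(target_dtype)
```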
It was BF16 in my use case.
FP16 and FP32 would fail in any case, whether the native path or the dequant kernels are used.
This PR therefore currently only fixes the BF16 case for kernel dequant; for the native path, BF16 already works.
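For anyone who wants to reproduce the BF16 kernel-dequant case, a sketch using the documented diffusers GGUF loading API; the checkpoint path is a placeholder, and enabling `DIFFUSERS_GGUF_CUDA_KERNELS` assumes the optional CUDA kernels are installed.

```python
import os
os.environ["DIFFUSERS_GGUF_CUDA_KERNELS"] = "1"  # opt in to the CUDA dequant kernels

import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

transformer = FluxTransformer2DModel.from_single_file(
    "path/to/model-BF16.gguf",  # placeholder: any BF16 GGUF checkpoint
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
).to("cuda")

# Before this fix, the first forward pass through a BF16 ("unquantized") GGUF
# weight hit the shape mismatch in the kernel-dequant matmul path.
```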
Who can review?
@DN6 @Isotr0py