GGUF fix for unquantized types when using unquantize kernels #12498
Conversation
Even if the `qweight_type` is one of the `UNQUANTIZED_TYPES`, qweight still has to be "dequantized" because it is stored as an 8-bit tensor. Without doing so, there is a shape mismatch in the following matmul (see the sketch after the side notes).

Side notes:
- Why isn't `DIFFUSERS_GGUF_CUDA_KERNELS` on by default? It is significantly faster and is only used when the kernels are installed.
- https://huggingface.co/Isotr0py/ggml/tree/main/build has no build for torch 2.8 (or the upcoming 2.9). Who can we contact to make such a build?
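For readers skimming the diff, here is a minimal PyTorch sketch (not the PR's code) of what "dequantizing" an unquantized BF16 GGUF weight amounts to; the shapes below are made up for illustration:

```python
import torch

out_features, in_features = 4, 8
w_bf16 = torch.randn(out_features, in_features, dtype=torch.bfloat16)

# How the weight arrives from the GGUF reader: raw bytes, i.e. a uint8 tensor
# whose last dimension is twice the logical width.
qweight = w_bf16.view(torch.uint8)          # shape (4, 16)
x = torch.randn(2, in_features, dtype=torch.bfloat16)

# Passing qweight straight to the matmul fails: the weight looks like a
# (4, 16) uint8 tensor instead of a (4, 8) bfloat16 tensor.
# torch.nn.functional.linear(x, qweight)

# "Dequantizing" an unquantized type is just a byte-level reinterpretation.
weight = qweight.view(torch.bfloat16)       # back to shape (4, 8)
y = torch.nn.functional.linear(x, weight)
print(y.shape)                              # torch.Size([2, 4])
```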
> https://huggingface.co/Isotr0py/ggml/tree/main/build has no build for torch 2.8 (or the upcoming 2.9). Who can we contact to make such a build?
In fact, there is a pre-release build for torch 2.8 (https://huggingface.co/Isotr0py/ggml/tree/shmem-mmq/build), but I found some regression in kernel size and performance in these kernels.
Anyway, I have found the root cause of the regression and am fixing it; I will make a release with torch 2.8 and 2.9 support tonight.
It seems `dequantize_gguf_tensor` is missing an implementation for FP16 and FP32 qweight:
diffusers/src/diffusers/quantizers/gguf/utils.py
Lines 487 to 501 in dbe4136
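For context, a minimal sketch of the kind of branch this comment points at, assuming hypothetical helper names (`UNQUANTIZED_DTYPES`, `dequantize_unquantized`) rather than the actual code in `utils.py`: for unquantized GGUF types, "dequantization" reduces to a byte reinterpretation.

```python
import torch
from gguf import GGMLQuantizationType

# Hypothetical mapping from unquantized GGUF types to torch dtypes.
UNQUANTIZED_DTYPES = {
    GGMLQuantizationType.F32: torch.float32,
    GGMLQuantizationType.F16: torch.float16,
    GGMLQuantizationType.BF16: torch.bfloat16,
}

def dequantize_unquantized(qweight: torch.Tensor, qweight_type) -> torch.Tensor:
    """Reinterpret raw uint8 GGUF bytes as the stored floating-point dtype."""
    target_dtype = UNQUANTIZED_DTYPES[qweight_type]
    if qweight.dtype == torch.uint8:
        return qweight.view(target_dtype)  # byte-level reinterpretation, no copy
    return qweight.to(target_dtype)
```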
It was BF16 in my use case.
FP16 and FP32 would fail in any case, whether the native path or the dequant kernels are used.
This PR therefore currently only fixes the BF16 case for kernel dequant; for the native path, BF16 already works.
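For anyone who wants to reproduce the BF16 kernel-dequant case, a sketch using the documented diffusers GGUF loading API; the checkpoint path is a placeholder, and enabling `DIFFUSERS_GGUF_CUDA_KERNELS` assumes the optional CUDA kernels are installed.

```python
import os
os.environ["DIFFUSERS_GGUF_CUDA_KERNELS"] = "1"  # opt in to the CUDA dequant kernels

import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

transformer = FluxTransformer2DModel.from_single_file(
    "path/to/model-BF16.gguf",  # placeholder: any BF16 GGUF checkpoint
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
).to("cuda")

# Before this fix, the first forward pass through a BF16 ("unquantized") GGUF
# weight hit the shape mismatch in the kernel-dequant matmul path.
```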
Who can review?
@DN6 @Isotr0py