Add conversion from ModelCloud Quantizations (GPTQ, GPTQ-v2, QQQ + Rotation) to GGUF #1544
-
I really like what you guys have done with this project and your quantization schemes, and I am happy we can convert ModelCloud GPTQ to MLX. But is it at all possible to get a conversion from ModelCloud quants (GPTQ, GPTQ-v2, and QQQ + Rotation) to GGUF? Please 🙏 - thank you in advance.
-
@joseph777111 It is entirely possible, but the conversion is not native and, if the two formats are not aligned, it may incur a secondary loss. We only provided the feature for MLX for academic reasons and have not actually benchmarked whether there is a loss. The concept is simple: we dequantize the weights back to bfloat16 and let GGUF quantize from there. Some people might think this is naive, but those who know how quantization works know that this is exactly how it works. Quantization is not about going directly from bfloat16 to 4 bits. The primary pass is actually to make sure the bfloat16 weights are smoothed so that the group_size/block quantization can work effectively in the packing stage.
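To make the "dequantize back to floats" step concrete, here is a minimal sketch in plain Python. It assumes a toy per-group min/max quantizer; the helper names (`quantize_group`, `dequantize_group`, `fake_quantize`) are made up for illustration and are not the project's actual kernels or GGUF's block formats. It only shows that the dequantized weights are ordinary floats sitting on a small per-group grid, which a second quantizer can then pack.

```python
import random

def quantize_group(values, bits=4):
    """Map one group of floats to integer codes on a min/max grid."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # guard against a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Turn integer codes back into floats (the "fake quantized" weights)."""
    return [lo + c * scale for c in codes]

def fake_quantize(weights, group_size=128, bits=4):
    """Quantize then dequantize every group, returning floats again."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, lo = quantize_group(group, bits)
        out.extend(dequantize_group(codes, scale, lo))
    return out

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(512)]
dequantized = fake_quantize(weights, group_size=128, bits=4)

# Largest per-weight deviation introduced by the round trip.
max_err = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"max round-trip error: {max_err:.6f}")
```

Each group of 128 dequantized values takes at most 16 distinct values (one per 4-bit code), which is why a later packing stage with a matching layout has little left to lose.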
-
Thank you! I was thinking about the group_size/block... This is exactly what I was hoping for. If you don't mind having this as an experimental/academic feature, could it please be added? It would be very nice to be able to experiment with your quantizations in GGUF. Being able to export the de-quantized model weights that are already "set up" by GPTQ, GPTQ-v2, etc., and then convert and quantize them to GGUF, would make this a reality.
Back in the LLaMA-2 days, I was able to use OmniQuant's de-quantized ("fake") model weights successfully in conversion and quantization to GGUF. At that time, GGUF also had a built-in AWQ conversion script, so we were able to convert models to GGUF using AWQ scales (this unfortunately is no longer supported, as no one used the feature but me...). But I digress...
I know GGUF is compatible with other quantization schemes like GPTQ and AWQ as long as the group_size and block size match the ones GGUF uses. Being able to use GPTQ, GPTQ-v2, etc. with GGUF would open the door to much more accurate GGUF models. I really hope this can be added as an experimental/academic feature, because I really like llama.cpp's system-agnostic approach, but I crave more accurate quantizations like GPTQ, GPTQ-v2, etc., and I know I am not alone. 😋
-
Exactly. As long as the GGUF quantization (packing) is aligned with the GPTQ format, I would expect very low to no loss. For example, don't convert GPTQ 4-bit with group_size=128 to GGUF 3-bit with group_size=32. The bit width and group_size should be aligned as much as possible.
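A rough way to see the alignment effect, again assuming a toy min/max quantizer rather than GGUF's real block formats (the `fake_quantize` and `mean_abs_error` helpers are hypothetical): re-quantizing already dequantized weights with the same bit width and group size lands on the same grid, while a different bit width and group size adds a second rounding error.

```python
import random

def fake_quantize(weights, group_size, bits):
    """Quantize each group on a min/max grid, then dequantize back to floats."""
    levels = (1 << bits) - 1
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # guard against a constant group
        out.extend(lo + round((v - lo) / scale) * scale for v in group)
    return out

def mean_abs_error(a, b):
    """Average absolute difference between two equally long weight lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
original = [random.gauss(0.0, 0.02) for _ in range(1024)]

# First pass: stands in for the dequantized 4-bit, group_size=128 weights.
first_pass = fake_quantize(original, group_size=128, bits=4)

# Second pass: stands in for the target format's packing stage.
aligned = fake_quantize(first_pass, group_size=128, bits=4)
misaligned = fake_quantize(first_pass, group_size=32, bits=3)

aligned_loss = mean_abs_error(first_pass, aligned)
misaligned_loss = mean_abs_error(first_pass, misaligned)
print(f"aligned second-pass loss:    {aligned_loss:.3e}")
print(f"misaligned second-pass loss: {misaligned_loss:.3e}")
```

In this toy setting the aligned second pass only differs by floating-point noise. Real GGUF types use their own block sizes and scale encodings, so an actual conversion would need to be benchmarked rather than assumed lossless.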
-
So, just to clarify: can we export the de-quantized GPTQ model weights to BF16 via the MLX save/export path and use them with other compatible quantization frameworks, such as GGUF? If so, the academic feature request is, technically, already implicitly supported? 🤔👀
-
@joseph777111 If you are up to the task, please check the MLX export code. We already do full BF16 dequantization, so you just need to copy the MLX export code and replace the MLX part with the GGUF equivalent! This would be a great feature/PR to have. Let me know. I can assist you with the PR if you run into any problems.
-
Sorry, I never saw your message and was never notified that you had replied. Please forgive my late reply. That sounds like a great idea. Unfortunately, I don't have time right now, but if I ever find the time, I would love to pursue this. Thank you for suggesting it. 😋