On arm64 devices, Q4_0 delivers the best speed.
The mmproj can be converted to any of the datatypes available in the conversion script: {f32, f16, bf16, q8_0, tq1_0, tq2_0}.
However, when I run
bin/llama-quantize /data/playground/vlm2/LFM2-VL-1.6B/mmproj-LFM2-VL-1.6B-Q8_0.gguf /tmp/out.gguf Q4_0
the following error is thrown:
llama_model_quantize: failed to quantize: unknown model architecture: 'clip'
Is there a way to quantize mmproj into Q4_0?
Here are comparison results between ExecuTorch (mmproj in Q4_0) and llama.cpp on Raspberry Pi 5.
ExecuTorch
Prompt Tokens: 158 Generated Tokens: 33
Model Load Time: 2.11 (seconds)
Total inference time: 3.46 (seconds) Rate: 9.52 (tokens/second)
Prompt evaluation: 2.55 (seconds) Rate: 61.98 (tokens/second)
Generated 33 tokens: 0.92 (seconds) Rate: 36.04 (tokens/second)
Time to first generated token: 2.55 (seconds)
llama.cpp
llama_perf_context_print: load time = 234.01 ms
llama_perf_context_print: prompt eval time = 4146.94 ms / 158 tokens ( 26.25 ms per token, 38.10 tokens per second)
llama_perf_context_print: eval time = 646.18 ms / 27 runs ( 23.93 ms per token, 41.78 tokens per second)
llama_perf_context_print: total time = 4932.17 ms / 185 tokens
llama_perf_context_print: graphs reused = 0
Prompt evaluation time is 2.55 s (ExecuTorch) vs 4.15 s (llama.cpp).
cc: @ngxson
-
The problem is that llama_model_quantize_impl calls model.load_arch (line 604 in 657b8a7), which resolves the model architecture from the GGUF metadata and rejects anything it does not recognize (lines 447 to 451 in 657b8a7), and the architecture table it consults only contains text models (lines 7 to 97 in 657b8a7).
Support can be added by bypassing this check for clip, but you'd have to make sure you don't quantize tensors that should not be quantized.
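For context, here is a minimal, self-contained sketch of the check being described, under the assumption that it boils down to a string-to-enum lookup over text architectures; the names (arch_from_string, load_arch_or_throw) and the two-entry table are illustrative, not the actual code at 657b8a7:

```cpp
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

enum llm_arch_sketch { ARCH_LLAMA, ARCH_LFM2, ARCH_UNKNOWN };

static llm_arch_sketch arch_from_string(const std::string & name) {
    // Table of known architectures; in llama.cpp this covers text models
    // only, so "clip" (the mmproj graph) is absent.
    static const std::map<std::string, llm_arch_sketch> known = {
        { "llama", ARCH_LLAMA },
        { "lfm2",  ARCH_LFM2  },
    };
    const auto it = known.find(name);
    return it == known.end() ? ARCH_UNKNOWN : it->second;
}

static void load_arch_or_throw(const std::string & name) {
    // Mirrors the check the reply points at: an unknown architecture
    // aborts quantization before any tensor is touched.
    if (arch_from_string(name) == ARCH_UNKNOWN) {
        throw std::runtime_error("unknown model architecture: '" + name + "'");
    }
}

int main() {
    load_arch_or_throw("lfm2");      // fine: a known text architecture
    try {
        load_arch_or_throw("clip");  // reproduces the error from the question
    } catch (const std::exception & e) {
        std::cerr << e.what() << '\n';
    }
    return 0;
}
```

Bypassing it for clip would mean special-casing that architecture name before the check and then deciding per tensor what may safely be quantized, which is the caveat above.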
-
I added a PR for this: #16592
-
Unfortunately, the tensors of the LFM2-VL-1.6B vision tower have shapes that are not divisible by the block size of any of the Q_K or Q_0 quants, so we can't actually quantize it further than F16.
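As a rough illustration of that constraint: block quants store a fixed number of values per block (32 for the Q*_0 formats, 256 for the K-quants, matching the standard GGML block sizes), so a tensor whose row length is not a multiple of the block size cannot be stored in that format. A minimal sketch, with hypothetical row lengths rather than the actual vision-tower shapes:

```cpp
#include <cstdint>
#include <initializer_list>
#include <iostream>

// Standard GGML block sizes: 32 values per block for the *_0 formats
// (including Q8_0), 256 values per block for the K-quants.
static constexpr int64_t QK4_0 = 32;
static constexpr int64_t QK_K  = 256;

static bool quantizable(int64_t ne0, int64_t block_size) {
    // A tensor can only be block-quantized if its row length (ne[0])
    // is a whole number of blocks; otherwise the quantizer must fall
    // back to F16/F32 for that tensor.
    return ne0 % block_size == 0;
}

int main() {
    // Hypothetical row lengths; the real values come from the vision
    // tower's GGUF tensor shapes.
    for (const int64_t ne0 : { 4096, 1152, 1000 }) {
        std::cout << "ne[0]=" << ne0
                  << "  Q4_0: " << (quantizable(ne0, QK4_0) ? "ok" : "fallback")
                  << "  Q4_K: " << (quantizable(ne0, QK_K)  ? "ok" : "fallback")
                  << '\n';
    }
    return 0;
}
```

Anything that fails this check has to stay in F16 (or F32), which is why the vision tower cannot go below F16 here.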
-
I could convert them directly to q8_0 here: https://huggingface.co/LiquidAI/LFM2-VL-1.6B-GGUF/tree/main; I guess llama-quantize handles it differently.
-
Thank you for implementing it, @ngxson!
-
I could convert them directly to q8_0 here: https://huggingface.co/LiquidAI/LFM2-VL-1.6B-GGUF/tree/main; I guess llama-quantize handles it differently.
@tdakhran Most likely the convert script automatically fell back to F16 for the clip FFN. I created a hybrid quant for LFM2-VL-1.6B here: https://huggingface.co/steampunque/LFM2-VL-1.6B-Hybrid-GGUF. Also available are mmproj files in Q8_0 and Q4_0 with a padded clip FFN for use with the model.
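For anyone curious how a "padded clip FFN" could work, here is a rough sketch of one way to do it: zero-extend each weight row to the next multiple of the quant block size so the divisibility requirement from the earlier reply is met. This is only my reading of the comment above (the pad_rows helper is hypothetical), not a description of how those files were actually produced:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Zero-pad each row of a row-major [rows x cols] weight matrix so that the
// row length becomes a multiple of the quant block size. The padded columns
// hold zeros, so they add nothing to a matmul result, but the padded tensor
// now satisfies the block-size divisibility requirement.
static std::vector<float> pad_rows(const std::vector<float> & w,
                                   int64_t rows, int64_t cols, int64_t block) {
    const int64_t padded_cols = ((cols + block - 1) / block) * block;
    std::vector<float> out(static_cast<size_t>(rows * padded_cols), 0.0f);
    for (int64_t r = 0; r < rows; ++r) {
        for (int64_t c = 0; c < cols; ++c) {
            out[r * padded_cols + c] = w[r * cols + c];
        }
    }
    return out;
}

int main() {
    // Hypothetical 2x40 matrix padded for a block size of 32 -> 2x64.
    const std::vector<float> w(2 * 40, 1.0f);
    const auto padded = pad_rows(w, 2, 40, 32);
    std::cout << "padded row length: " << padded.size() / 2 << '\n';  // 64
    return 0;
}
```

Once the rows are a whole number of blocks, the tensor can be stored as Q8_0 or Q4_0 instead of falling back to F16.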