On arm64 devices, Q4_0 delivers the best speed.
The mmproj can be converted to any of the datatypes available in the conversion script: {f32, f16, bf16, q8_0, tq1_0, tq2_0}.
However, when I run
bin/llama-quantize /data/playground/vlm2/LFM2-VL-1.6B/mmproj-LFM2-VL-1.6B-Q8_0.gguf /tmp/out.gguf Q4_0
the following error is thrown:
llama_model_quantize: failed to quantize: unknown model architecture: 'clip'
Is there a way to quantize mmproj into Q4_0?
Here are comparison results between ExecuTorch (mmproj in Q4_0) and llama.cpp on Raspberry Pi 5.
ExecuTorch
Prompt Tokens: 158 Generated Tokens: 33
Model Load Time: 2.11 (seconds)
Total inference time: 3.46 (seconds) Rate: 9.52 (tokens/second)
Prompt evaluation: 2.55 (seconds) Rate: 61.98 (tokens/second)
Generated 33 tokens: 0.92 (seconds) Rate: 36.04 (tokens/second)
Time to first generated token: 2.55 (seconds)
llama.cpp
llama_perf_context_print: load time = 234.01 ms
llama_perf_context_print: prompt eval time = 4146.94 ms / 158 tokens ( 26.25 ms per token, 38.10 tokens per second)
llama_perf_context_print: eval time = 646.18 ms / 27 runs ( 23.93 ms per token, 41.78 tokens per second)
llama_perf_context_print: total time = 4932.17 ms / 185 tokens
llama_perf_context_print: graphs reused = 0
Prompt evaluation time is 2.55 s (ExecuTorch) vs 4.15 s (llama.cpp).
cc: @ngxson
-
The problem is that llama_model_quantize_impl calls model.load_arch (line 604 in 657b8a7), which resolves the model architecture from the GGUF metadata and rejects anything it does not recognize (lines 447 to 451 in 657b8a7), and the architecture table it consults only contains text models (lines 7 to 97 in 657b8a7).
Support can be added by bypassing this check for clip, but you'd have to make sure you don't quantize tensors that should not be quantized.
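For context, here is a minimal, self-contained sketch of the check being described, under the assumption that it boils down to a string-to-enum lookup over text architectures; the names (arch_from_string, load_arch_or_throw) and the two-entry table are illustrative, not the actual code at 657b8a7:

```cpp
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

enum llm_arch_sketch { ARCH_LLAMA, ARCH_LFM2, ARCH_UNKNOWN };

static llm_arch_sketch arch_from_string(const std::string & name) {
    // Table of known architectures; in llama.cpp this covers text models
    // only, so "clip" (the mmproj graph) is absent.
    static const std::map<std::string, llm_arch_sketch> known = {
        { "llama", ARCH_LLAMA },
        { "lfm2",  ARCH_LFM2  },
    };
    const auto it = known.find(name);
    return it == known.end() ? ARCH_UNKNOWN : it->second;
}

static void load_arch_or_throw(const std::string & name) {
    // Mirrors the check the reply points at: an unknown architecture
    // aborts quantization before any tensor is touched.
    if (arch_from_string(name) == ARCH_UNKNOWN) {
        throw std::runtime_error("unknown model architecture: '" + name + "'");
    }
}

int main() {
    load_arch_or_throw("lfm2");      // fine: a known text architecture
    try {
        load_arch_or_throw("clip");  // reproduces the error from the question
    } catch (const std::exception & e) {
        std::cerr << e.what() << '\n';
    }
    return 0;
}
```

Bypassing it for clip would mean special-casing that architecture name before the check and then deciding per tensor what may safely be quantized, which is the caveat above.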
-
I added a PR for this: #16592
-
Unfortunately, the tensors of the LFM2-VL-1.6B vision tower have shapes that are not divisible by the block size of any of the Q_K or Q_0 quants, so we can't actually quantize it further than F16.
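As a rough illustration of that constraint: block quants store a fixed number of values per block (32 for the Q*_0 formats, 256 for the K-quants, matching the standard GGML block sizes), so a tensor whose row length is not a multiple of the block size cannot be stored in that format. A minimal sketch, with hypothetical row lengths rather than the actual vision-tower shapes:

```cpp
#include <cstdint>
#include <initializer_list>
#include <iostream>

// Standard GGML block sizes: 32 values per block for the *_0 formats
// (including Q8_0), 256 values per block for the K-quants.
static constexpr int64_t QK4_0 = 32;
static constexpr int64_t QK_K  = 256;

static bool quantizable(int64_t ne0, int64_t block_size) {
    // A tensor can only be block-quantized if its row length (ne[0])
    // is a whole number of blocks; otherwise the quantizer must fall
    // back to F16/F32 for that tensor.
    return ne0 % block_size == 0;
}

int main() {
    // Hypothetical row lengths; the real values come from the vision
    // tower's GGUF tensor shapes.
    for (const int64_t ne0 : { 4096, 1152, 1000 }) {
        std::cout << "ne[0]=" << ne0
                  << "  Q4_0: " << (quantizable(ne0, QK4_0) ? "ok" : "fallback")
                  << "  Q4_K: " << (quantizable(ne0, QK_K)  ? "ok" : "fallback")
                  << '\n';
    }
    return 0;
}
```

Anything that fails this check has to stay in F16 (or F32), which is why the vision tower cannot go below F16 here.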
-
I could convert them directly to q8_0 here: https://huggingface.co/LiquidAI/LFM2-VL-1.6B-GGUF/tree/main; I guess llama-quantize handles it differently.
-
Thank you for implementing it, @ngxson!
-
I could convert them directly to q8_0 here: https://huggingface.co/LiquidAI/LFM2-VL-1.6B-GGUF/tree/main; I guess llama-quantize handles it differently.
@tdakhran Most likely the convert script automatically fell back to F16 for the clip FFN. I created a hybrid quant for LFM2-VL-1.6B here: https://huggingface.co/steampunque/LFM2-VL-1.6B-Hybrid-GGUF. Also available are mmproj files in Q8_0 and Q4_0 with a padded clip FFN for use with the model.
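For anyone curious how a "padded clip FFN" could work, here is a rough sketch of one way to do it: zero-extend each weight row to the next multiple of the quant block size so the divisibility requirement from the earlier reply is met. This is only my reading of the comment above (the pad_rows helper is hypothetical), not a description of how those files were actually produced:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Zero-pad each row of a row-major [rows x cols] weight matrix so that the
// row length becomes a multiple of the quant block size. The padded columns
// hold zeros, so they add nothing to a matmul result, but the padded tensor
// now satisfies the block-size divisibility requirement.
static std::vector<float> pad_rows(const std::vector<float> & w,
                                   int64_t rows, int64_t cols, int64_t block) {
    const int64_t padded_cols = ((cols + block - 1) / block) * block;
    std::vector<float> out(static_cast<size_t>(rows * padded_cols), 0.0f);
    for (int64_t r = 0; r < rows; ++r) {
        for (int64_t c = 0; c < cols; ++c) {
            out[r * padded_cols + c] = w[r * cols + c];
        }
    }
    return out;
}

int main() {
    // Hypothetical 2x40 matrix padded for a block size of 32 -> 2x64.
    const std::vector<float> w(2 * 40, 1.0f);
    const auto padded = pad_rows(w, 2, 40, 32);
    std::cout << "padded row length: " << padded.size() / 2 << '\n';  // 64
    return 0;
}
```

Once the rows are a whole number of blocks, the tensor can be stored as Q8_0 or Q4_0 instead of falling back to F16.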