
Q4_0 quantization support for mmproj #15453

tdakhran started this conversation in Ideas

On arm64 devices, Q4_0 delivers the best speed.

mmproj can be converted to any of the data types supported by the conversion script: {f32, f16, bf16, q8_0, tq1_0, tq2_0}.

However, when I try

bin/llama-quantize /data/playground/vlm2/LFM2-VL-1.6B/mmproj-LFM2-VL-1.6B-Q8_0.gguf /tmp/out.gguf Q4_0

the following error is thrown:

llama_model_quantize: failed to quantize: unknown model architecture: 'clip'

Is there a way to quantize mmproj into Q4_0?

Here are comparison results between ExecuTorch (mmproj in Q4_0) and llama.cpp on Raspberry Pi 5.

ExecuTorch
 Prompt Tokens: 158 Generated Tokens: 33
 Model Load Time: 2.11 (seconds)
 Total inference time: 3.46 (seconds) Rate: 9.52 (tokens/second)
 Prompt evaluation: 2.55 (seconds) Rate: 61.98 (tokens/second)
 Generated 33 tokens: 0.92 (seconds) Rate: 36.04 (tokens/second)
 Time to first generated token: 2.55 (seconds)
llama.cpp
llama_perf_context_print: load time = 234.01 ms
llama_perf_context_print: prompt eval time = 4146.94 ms / 158 tokens ( 26.25 ms per token, 38.10 tokens per second)
llama_perf_context_print: eval time = 646.18 ms / 27 runs ( 23.93 ms per token, 41.78 tokens per second)
llama_perf_context_print: total time = 4932.17 ms / 185 tokens
llama_perf_context_print: graphs reused = 0

Prompt evaluation time is 2.55 s for ExecuTorch vs 4.15 s for llama.cpp.

cc: @ngxson


Replies: 1 comment · 10 replies


The problem is that llama_model_quantize_impl calls model.load_arch:

model.load_arch (ml);

which does this:

void llama_model::load_arch(llama_model_loader & ml) {
    arch = ml.get_arch();
    if (arch == LLM_ARCH_UNKNOWN) {
        throw std::runtime_error("unknown model architecture: '" + ml.get_arch_name() + "'");
    }
}

which only recognizes these (text-model) architectures:

static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_LLAMA, "llama" },
{ LLM_ARCH_LLAMA4, "llama4" },
{ LLM_ARCH_DECI, "deci" },
{ LLM_ARCH_FALCON, "falcon" },
{ LLM_ARCH_GROK, "grok" },
{ LLM_ARCH_GPT2, "gpt2" },
{ LLM_ARCH_GPTJ, "gptj" },
{ LLM_ARCH_GPTNEOX, "gptneox" },
{ LLM_ARCH_MPT, "mpt" },
{ LLM_ARCH_BAICHUAN, "baichuan" },
{ LLM_ARCH_STARCODER, "starcoder" },
{ LLM_ARCH_REFACT, "refact" },
{ LLM_ARCH_BERT, "bert" },
{ LLM_ARCH_NOMIC_BERT, "nomic-bert" },
{ LLM_ARCH_NOMIC_BERT_MOE, "nomic-bert-moe" },
{ LLM_ARCH_NEO_BERT, "neo-bert" },
{ LLM_ARCH_JINA_BERT_V2, "jina-bert-v2" },
{ LLM_ARCH_BLOOM, "bloom" },
{ LLM_ARCH_STABLELM, "stablelm" },
{ LLM_ARCH_QWEN, "qwen" },
{ LLM_ARCH_QWEN2, "qwen2" },
{ LLM_ARCH_QWEN2MOE, "qwen2moe" },
{ LLM_ARCH_QWEN2VL, "qwen2vl" },
{ LLM_ARCH_QWEN3, "qwen3" },
{ LLM_ARCH_QWEN3MOE, "qwen3moe" },
{ LLM_ARCH_PHI2, "phi2" },
{ LLM_ARCH_PHI3, "phi3" },
{ LLM_ARCH_PHIMOE, "phimoe" },
{ LLM_ARCH_PLAMO, "plamo" },
{ LLM_ARCH_PLAMO2, "plamo2" },
{ LLM_ARCH_CODESHELL, "codeshell" },
{ LLM_ARCH_ORION, "orion" },
{ LLM_ARCH_INTERNLM2, "internlm2" },
{ LLM_ARCH_MINICPM, "minicpm" },
{ LLM_ARCH_MINICPM3, "minicpm3" },
{ LLM_ARCH_GEMMA, "gemma" },
{ LLM_ARCH_GEMMA2, "gemma2" },
{ LLM_ARCH_GEMMA3, "gemma3" },
{ LLM_ARCH_GEMMA3N, "gemma3n" },
{ LLM_ARCH_STARCODER2, "starcoder2" },
{ LLM_ARCH_MAMBA, "mamba" },
{ LLM_ARCH_MAMBA2, "mamba2" },
{ LLM_ARCH_JAMBA, "jamba" },
{ LLM_ARCH_FALCON_H1, "falcon-h1" },
{ LLM_ARCH_XVERSE, "xverse" },
{ LLM_ARCH_COMMAND_R, "command-r" },
{ LLM_ARCH_COHERE2, "cohere2" },
{ LLM_ARCH_DBRX, "dbrx" },
{ LLM_ARCH_OLMO, "olmo" },
{ LLM_ARCH_OLMO2, "olmo2" },
{ LLM_ARCH_OLMOE, "olmoe" },
{ LLM_ARCH_OPENELM, "openelm" },
{ LLM_ARCH_ARCTIC, "arctic" },
{ LLM_ARCH_DEEPSEEK, "deepseek" },
{ LLM_ARCH_DEEPSEEK2, "deepseek2" },
{ LLM_ARCH_CHATGLM, "chatglm" },
{ LLM_ARCH_GLM4, "glm4" },
{ LLM_ARCH_GLM4_MOE, "glm4moe" },
{ LLM_ARCH_BITNET, "bitnet" },
{ LLM_ARCH_T5, "t5" },
{ LLM_ARCH_T5ENCODER, "t5encoder" },
{ LLM_ARCH_JAIS, "jais" },
{ LLM_ARCH_NEMOTRON, "nemotron" },
{ LLM_ARCH_EXAONE, "exaone" },
{ LLM_ARCH_EXAONE4, "exaone4" },
{ LLM_ARCH_RWKV6, "rwkv6" },
{ LLM_ARCH_RWKV6QWEN2, "rwkv6qwen2" },
{ LLM_ARCH_RWKV7, "rwkv7" },
{ LLM_ARCH_ARWKV7, "arwkv7" },
{ LLM_ARCH_GRANITE, "granite" },
{ LLM_ARCH_GRANITE_MOE, "granitemoe" },
{ LLM_ARCH_GRANITE_HYBRID, "granitehybrid" },
{ LLM_ARCH_CHAMELEON, "chameleon" },
{ LLM_ARCH_WAVTOKENIZER_DEC, "wavtokenizer-dec" },
{ LLM_ARCH_PLM, "plm" },
{ LLM_ARCH_BAILINGMOE, "bailingmoe" },
{ LLM_ARCH_DOTS1, "dots1" },
{ LLM_ARCH_ARCEE, "arcee" },
{ LLM_ARCH_ERNIE4_5, "ernie4_5" },
{ LLM_ARCH_ERNIE4_5_MOE, "ernie4_5-moe" },
{ LLM_ARCH_HUNYUAN_MOE, "hunyuan-moe" },
{ LLM_ARCH_HUNYUAN_DENSE, "hunyuan-dense" },
{ LLM_ARCH_SMOLLM3, "smollm3" },
{ LLM_ARCH_OPENAI_MOE, "gpt-oss" },
{ LLM_ARCH_LFM2, "lfm2" },
{ LLM_ARCH_DREAM, "dream" },
{ LLM_ARCH_SMALLTHINKER, "smallthinker" },
{ LLM_ARCH_LLADA, "llada" },
{ LLM_ARCH_UNKNOWN, "(unknown)" },
};

Support can be added by bypassing this for clip, but you'd have to make sure you don't quantize tensors that should not be quantized.
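A minimal sketch of that idea (not llama.cpp's actual implementation; the tensor names and the helper function are made up for illustration): special-case the clip architecture and decide per tensor whether a block quant is safe, keeping biases, norm weights, and non-block-aligned rows in their original type.

// Sketch only: per-tensor policy for quantizing a clip/mmproj GGUF to Q4_0.
// Not llama.cpp code; tensor names and the helper are hypothetical.
#include <cstdint>
#include <cstdio>
#include <string>

static constexpr int64_t QK4_0 = 32; // ggml block size of Q4_0

// Quantize only large 2D weight matrices whose row length is a multiple of
// the block size; keep biases, norm weights, and small tensors untouched.
static bool should_quantize_clip_tensor(const std::string & name, int n_dims, int64_t ne0) {
    if (n_dims < 2)                             return false; // biases, 1D tensors
    if (ne0 % QK4_0 != 0)                       return false; // row not block-aligned
    if (name.find("norm") != std::string::npos) return false; // layer norms stay f16/f32
    return true;
}

int main() {
    // Hypothetical examples: a projector weight vs. a layer-norm weight.
    std::printf("%d\n", should_quantize_clip_tensor("mm.fc1.weight", 2, 2048));       // 1
    std::printf("%d\n", should_quantize_clip_tensor("v.blk.0.norm.weight", 1, 1024)); // 0
}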

10 replies

ngxson Oct 15, 2025
Collaborator

I added a PR for this: #16592


ngxson Oct 15, 2025
Collaborator

Unfortunately, the tensors of the LFM2-VL-1.6B vision tower have shapes that are not divisible by the block size of any of the Q_K or Q_0 quants, so we can't actually quantize it beyond f16
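For context, ggml's block quants pack weights in fixed-size blocks along each row (32 values for Q4_0/Q8_0, 256 for the K-quants), so a row has to be a multiple of the block size to be quantized at all. A small sketch of that constraint (the row length below is an arbitrary example, not taken from the model):

// Illustration of the divisibility constraint described above.
#include <cstdint>
#include <cstdio>

static constexpr int64_t QK4_0 = 32;  // block size of Q4_0 (Q8_0 uses 32 as well)
static constexpr int64_t QK_K  = 256; // block size of the K-quants (Q4_K, Q6_K, ...)

static bool fits_block_quant(int64_t row_len, int64_t block_size) {
    return row_len % block_size == 0;
}

int main() {
    const int64_t row_len = 1000; // hypothetical vision-tower row length
    std::printf("Q4_0: %s\n", fits_block_quant(row_len, QK4_0) ? "ok" : "must stay f16/f32");
    std::printf("Q_K:  %s\n", fits_block_quant(row_len, QK_K)  ? "ok" : "must stay f16/f32");
}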


I could convert them directly to q8_0 here: https://huggingface.co/LiquidAI/LFM2-VL-1.6B-GGUF/tree/main, so I guess llama-quantize does it differently.


Thank you for implementing it, @ngxson!


> I could convert them directly to q8_0 here: https://huggingface.co/LiquidAI/LFM2-VL-1.6B-GGUF/tree/main, so I guess llama-quantize does it differently.

@tdakhran Most likely convert automatically fell back to F16 for the clip FFN. I created a hybrid quant for LFM2-VL-1.6B here: https://huggingface.co/steampunque/LFM2-VL-1.6B-Hybrid-GGUF. Also available are mmproj files in Q8_0 and Q4_0 with a padded clip FFN, for use with the model.
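The padding mentioned above can be pictured as rounding each FFN row up to the next multiple of the quant block size and zero-filling the extra columns; a rough sketch of that idea (not the actual tool or values used for the published files):

// Rough sketch of padding a row so it becomes block-quantizable.
#include <cstdint>
#include <cstdio>
#include <vector>

static int64_t round_up(int64_t n, int64_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

int main() {
    constexpr int64_t QK4_0  = 32;   // Q4_0 block size in ggml
    const int64_t     n_orig = 1000; // hypothetical clip FFN width
    const int64_t     n_pad  = round_up(n_orig, QK4_0); // 1024

    std::vector<float> row(n_orig, 1.0f); // dummy row of weights
    row.resize(n_pad, 0.0f);              // zero-fill the padded columns

    std::printf("padded %lld -> %lld columns\n", (long long) n_orig, (long long) n_pad);
}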
