compute types per platform #16730
Unanswered
okuvshynov asked this question in Q&A
Suppose I take a model in GGUF format (say, glm-air-q8), use the same KV cache type (f16) everywhere, and run it:
- on M2 Ultra
- on CUDA
- on CPU only
Will the types used for compute and activations be the same across these backends? Will weights and the KV cache first be dequantized to the same type (f16/f32/whatever), and will the same types be used for operations (e.g. multiply in f16, accumulate in f32)?
Or does this depend on the platform and on the kernel implementation for that platform? If it differs, what's the right place (pointers to code welcome) to learn about these differences?
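To make the question concrete, here is a self-contained Python sketch of the pattern I'm asking about: Q8_0-style block quantization (one scale per block, int8 values) with multiplies rounded to f16 precision but a wide accumulator. This is purely illustrative; the block size, scale encoding, and rounding here are simplified and are not ggml's actual layout or kernels.

```python
import struct

def to_f16(x: float) -> float:
    """Round a Python float to IEEE f16 precision (simulating a half-precision
    register). Uses struct's 'e' half-float format."""
    return struct.unpack('e', struct.pack('e', x))[0]

def quantize_q8_block(block):
    """Q8_0-style quantization of one block: an f16 scale plus int8 values.
    Illustrative only -- not the exact ggml block layout."""
    amax = max(abs(v) for v in block)
    scale = to_f16(amax / 127.0) if amax else 0.0
    quants = [round(v / scale) if scale else 0 for v in block]
    return scale, quants

def dot_mixed(scale_a, qa, scale_b, qb):
    """Dot product that dequantizes and multiplies at (simulated) f16
    precision, but accumulates in a wide (f64 here, standing in for f32)
    accumulator -- the 'operate on f16, accumulate wider' pattern."""
    acc = 0.0
    for x, y in zip(qa, qb):
        prod = to_f16(to_f16(scale_a * x) * to_f16(scale_b * y))
        acc += prod  # accumulator stays wide; only the multiply is narrow
    return acc
```

For a small vector like `[1.0, -2.0, 3.0, 4.0]`, the mixed-precision dot product with itself lands within a percent or so of the exact value 30.0; the question is essentially whether each backend makes the same choices at each of these three points (dequant type, multiply type, accumulator type).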
Thank you!
Replies: 0 comments