compute types per platform #16730
Unanswered
okuvshynov asked this question in Q&A
Suppose I take a model in GGUF format (say, glm-air-q8), use the same KV cache type (f16) everywhere, and run it:
- on M2 Ultra
- on CUDA
- on CPU only
Will the types used for compute and activations be the same across these backends? Will weights and the KV cache first be dequantized to the same type (f16/f32/whatever), and will the same types be used for operations (e.g. multiply in f16, accumulate in f32)?
Or does this depend on the platform and on the kernel implementation for that platform? If it differs, what's the right place (pointers to code welcome) to learn about these differences?
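To make the question concrete, here is a self-contained Python sketch of the pattern I'm asking about: Q8_0-style block quantization (one scale per block, int8 values) with multiplies rounded to f16 precision but a wide accumulator. This is purely illustrative; the block size, scale encoding, and rounding here are simplified and are not ggml's actual layout or kernels.

```python
import struct

def to_f16(x: float) -> float:
    """Round a Python float to IEEE f16 precision (simulating a half-precision
    register). Uses struct's 'e' half-float format."""
    return struct.unpack('e', struct.pack('e', x))[0]

def quantize_q8_block(block):
    """Q8_0-style quantization of one block: an f16 scale plus int8 values.
    Illustrative only -- not the exact ggml block layout."""
    amax = max(abs(v) for v in block)
    scale = to_f16(amax / 127.0) if amax else 0.0
    quants = [round(v / scale) if scale else 0 for v in block]
    return scale, quants

def dot_mixed(scale_a, qa, scale_b, qb):
    """Dot product that dequantizes and multiplies at (simulated) f16
    precision, but accumulates in a wide (f64 here, standing in for f32)
    accumulator -- the 'operate on f16, accumulate wider' pattern."""
    acc = 0.0
    for x, y in zip(qa, qb):
        prod = to_f16(to_f16(scale_a * x) * to_f16(scale_b * y))
        acc += prod  # accumulator stays wide; only the multiply is narrow
    return acc
```

For a small vector like `[1.0, -2.0, 3.0, 4.0]`, the mixed-precision dot product with itself lands within a percent or so of the exact value 30.0; the question is essentially whether each backend makes the same choices at each of these three points (dequant type, multiply type, accumulator type).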
Thank you!
Replies: 0 comments