Llama.cpp: Bringing Power of Local AI to Everyday Consumer Setups #16713
Hi, I have a moderate setup without a dedicated GPU. My main purpose in buying it was to get something within my budget for experimentation while also keeping the running cost low (15 W to 35 W TDP). MoE models, together with llama.cpp and its Vulkan back-end (the only inference engine that makes this practical), are what bring AI inference to everyday users. I am sharing some benchmarks of models running at Q8 (almost full precision), which everyday consumers should be able to run on their setups. If you have more models to share, please go ahead and add them to raise awareness for other people. (A sketch of the build and benchmark commands follows the model list below.)

llama.cpp build: fb34984 (6812), Vulkan backend
My setup:
Operating System: Ubuntu 24.04.3 LTS

Conclusion thus far:
Details of benchmarks ran:
Model: Qwen3-Coder-30B-A3B (same for Qwen3-30B-A3B-Instruct-2507 and Qwen3-30B-A3B-Thinking-2507)
Model: gpt-oss-20b
Model: Granite-4.0-h-tiny
Model: Ling-mini-2.0
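In case anyone wants to reproduce numbers like these, below is a minimal sketch of building llama.cpp with the Vulkan back-end and running llama-bench, assuming the Vulkan SDK/drivers are already installed. The model filename is a placeholder; substitute whichever Q8_0 GGUF you downloaded.

```shell
# Build llama.cpp with the Vulkan back-end enabled
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Benchmark: 512-token prompt processing and 128-token generation,
# offloading all layers to the Vulkan device (-ngl 99).
# The model path is a placeholder for whichever GGUF you are testing.
./build/bin/llama-bench \
    -m models/Qwen3-Coder-30B-A3B-Q8_0.gguf \
    -p 512 -n 128 -ngl 99
```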
Sharing some of my understanding for newcomers. For most models, prompt processing speed decreases as the context grows, so keep that in mind while choosing your model. Similarly, generation speed also decreases for longer responses (see the sketch at the end of this reply for a way to measure this). Following are some benchmarks:

Qwen3-Coder-30B-A3B
Ling-mini-2.0
granite-4.0-h-tiny
Ling linear and Qwen3-Next are not supported in llama.cpp at the moment (support is in progress, I believe). They are supposed to be better at higher context and longer generations.
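For anyone who wants to see this drop-off on their own hardware, here is a minimal sketch using llama-bench with several prompt and generation lengths in one run (it accepts comma-separated values per parameter). The model path is a placeholder, not one of the exact files benchmarked above; the output table should show prompt-processing t/s falling as the prompt gets longer, and average generation t/s falling for longer outputs.

```shell
# Compare prompt-processing (pp) and generation (tg) speed as lengths grow.
# llama-bench runs one test per value, producing one table row each
# (pp512, pp4096, pp16384, tg64, tg512).
# The model path is a placeholder; point it at the Q8_0 GGUF you actually use.
./build/bin/llama-bench \
    -m models/Ling-mini-2.0-Q8_0.gguf \
    -p 512,4096,16384 \
    -n 64,512 \
    -ngl 99
```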