Llama.cpp: Bringing Power of Local AI to Everyday Consumer Setups #16713

engrtipusultan started this conversation in Show and tell

Hi, I have a modest setup without a dedicated GPU. My main goal when buying it was to get something within my budget for experimentation while keeping the running cost low (15 W to 35 W TDP).

Between MoE models and the Vulkan back-end, llama.cpp is the only inference engine that makes AI inference accessible to everyday users.

I am sharing some benchmarks of models running at Q8 (almost full precision) that everyday consumers might be able to run on their own setups. If you have more models to share, please go ahead and post them to raise awareness for other people.

llama.cpp build: fb34984 (6812) Vulkan Backend

My Setup:

- Operating System: Ubuntu 24.04.3 LTS
- Kernel: Linux 6.14.0-33-generic
- Vulkan: Mesa 25.2.5 (apiVersion = 1.4.318)
- Hardware: GMKtec M5 PLUS (mini PC)
- CPU: AMD Ryzen 7 5825U (8 cores, 16 threads)
- GPU: Radeon Vega 8 (gfx_target_version = gfx90c)
- RAM: 64 GB DDR4-3200 (32 GB x 2)
- Storage: 512 GB M.2 2280 PCIe Gen 3

Conclusions thus far:

| Model @ Q8 | pp512 (Prompt Processing) tokens/s | tg128 (Token Generation) tokens/s | Comments |
| --- | --- | --- | --- |
| Qwen3-Coder-30B-A3B | 95.76 | 12.97 | Maybe the best option for my setup |
| Qwen3-30B-A3B-Instruct-2507 | 95.76 | 12.97 | |
| Qwen3-30B-A3B-Thinking-2507 | 95.76 | 12.97 | |
| gpt-oss-20b | 131.74 | 11.55 | |
| Granite-4.0-h-tiny | 201.17 | 21.15 | Best option in terms of memory requirements and speed |
| Ling-mini-2.0 | 227.23 | 34.29 | Fastest option |
| Ring-mini-2.0 | 227.23 | 34.29 | |
Details of the benchmarks run:

Model: Qwen3-Coder-30B-A3B (same results for Qwen3-30B-A3B-Instruct-2507 and Qwen3-30B-A3B-Thinking-2507)

```
llama-bench -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
```

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 95.76 ± 0.78 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 12.97 ± 0.02 |

Model: gpt-oss-20b

```
llama-bench -m /home/tipu/AI/models/other/jinx-gpt-oss/jinx-gpt-oss-20b-mxfp4.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
```

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 131.74 ± 0.81 |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 11.55 ± 0.01 |

Model: Granite-4.0-h-tiny

```
llama-bench -m /home/tipu/AI/models/other/granite-4.0-h-tiny/granite-4.0-h-tiny-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
```

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 201.17 ± 1.52 |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 21.15 ± 0.04 |

Model: Ling-mini-2.0

```
llama-bench -m /home/tipu/AI/models/other/Huihui-Ling-mini-2.0/Huihui-Ling-mini-2.0-abliterated-q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
```

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 227.23 ± 2.13 |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 34.29 ± 0.04 |
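
For anyone who wants to add numbers for another model (as invited above), every run uses the same invocation; only the GGUF path changes. A minimal template with a placeholder path and the same batch, thread, and repetition settings as the runs above:

```
# Template of the runs above; only the GGUF path changes between models.
llama-bench \
  -m /path/to/your-model-Q8_0.gguf \
  --ubatch-size 4096 --batch-size 512 \
  --threads 4 --mmap 0 \
  -r 8   # 8 repetitions; llama-bench reports mean ± standard deviation
```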

Replies: 1 comment

Sharing some of my understanding for newcomers.
If you are loading a bigger model, you can decrease --batch-size to reduce RAM utilization (it shrinks the compute buffers that sit alongside the KV cache). Decreasing --batch-size may reduce prompt processing speed, but you will be able to fit a bigger context size for communicating with the LLM.
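
As a minimal sketch of that trade-off when actually serving a model: the command below assumes llama-server from the same llama.cpp Vulkan build, reuses the Qwen3-Coder Q8 file from the post above, and uses illustrative (not tuned) values for -c, -b, -ub, and --port.

```
# Minimal sketch, not tuned: trade a smaller batch size for a larger context window.
# -c, -b, -ub, and --port are illustrative values.
llama-server \
  -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf \
  -ngl 99 -c 32768 -b 256 -ub 256 \
  --threads 4 --no-mmap --port 8080
# Any OpenAI-compatible client can then talk to http://localhost:8080/v1
```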

For most models, prompt processing speed decreases as the context grows, so keep that in mind while choosing your model. Similarly, generation speed decreases for longer responses. Following are some benchmarks (a command sketch for reproducing these longer runs is included after the tables):

Qwen3-Coder 30B.A3B

| model | size | params | backend | ngl | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q8_0 | 24.64 GiB | 24.87 B | Vulkan | 99 | 512 | 4096 | 0 | pp512 | 105.52 ± 0.00 |
| qwen3moe 30B.A3B Q8_0 | 24.64 GiB | 24.87 B | Vulkan | 99 | 512 | 4096 | 0 | pp32768 | 27.93 ± 0.00 |
| qwen3moe 30B.A3B Q8_0 | 24.64 GiB | 24.87 B | Vulkan | 99 | 512 | 4096 | 0 | tg512 | 12.70 ± 0.00 |
| qwen3moe 30B.A3B Q8_0 | 24.64 GiB | 24.87 B | Vulkan | 99 | 512 | 4096 | 0 | tg16768 | 5.88 ± 0.00 |

Ling-mini-2.0

| model | size | params | backend | ngl | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 512 | 4096 | 0 | pp512 | 228.53 ± 2.52 |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 512 | 4096 | 0 | pp32768 | 101.08 ± 0.20 |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 512 | 4096 | 0 | tg512 | 33.82 ± 0.01 |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 512 | 4096 | 0 | tg32768 | 12.87 ± 0.01 |

granite-4.0-h-tiny
Granite 4.0 introduces a hybrid Mamba-2/transformer architecture. These models are said to have better throughput at higher contexts and for longer generations, and the benchmark below shows the same.

| model | size | params | backend | ngl | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 512 | 4096 | 0 | pp512 | 202.28 ± 0.00 |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 512 | 4096 | 0 | pp32768 | 171.45 ± 0.00 |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 512 | 4096 | 0 | tg512 | 21.16 ± 0.00 |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 512 | 4096 | 0 | tg16768 | 19.50 ± 0.00 |
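
These longer runs can be reproduced with a single llama-bench invocation by passing comma-separated prompt and generation sizes (llama-bench runs one pp test per -p value and one tg test per -n value). A minimal sketch, reusing the batch and thread settings from the runs above; the GGUF path is a placeholder:

```
# Placeholder path; substitute any of the Q8 GGUF files mentioned in this thread.
llama-bench \
  -m /path/to/model-Q8_0.gguf \
  -p 512,32768 -n 512,32768 \
  --batch-size 512 --ubatch-size 4096 \
  --threads 4 --mmap 0
```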

Ling linear and Qwen3-Next are not supported in llama.cpp at the moment (I believe support is in progress). They are supposed to be better at higher contexts and longer generations.
