Llama.cpp: Bringing Power of Local AI to Everyday Consumer Setups #16713

engrtipusultan started this conversation in Show and tell

Hi, I have a modest setup without a dedicated GPU. My main goal when buying it was to get something within my budget for experimentation while keeping the running cost low (15 W to 35 W TDP).

Between MoE models and the Vulkan back-end, llama.cpp is the only inference engine that makes AI inference accessible to everyday users.

I am sharing some benchmarks of models running at Q8 (almost full precision) that everyday consumers might be able to run on their own setups. If you have more models to share, please go ahead and post them to raise awareness for other people.

llama.cpp build: fb34984 (6812) Vulkan Backend

My Setup:

- Operating System: Ubuntu 24.04.3 LTS
- Kernel: Linux 6.14.0-33-generic
- Vulkan: Mesa 25.2.5 (apiVersion = 1.4.318)
- Hardware: GMKtec M5 PLUS (mini PC)
- CPU: AMD Ryzen 7 5825U (8 cores, 16 threads)
- GPU: Radeon Vega 8 (gfx_target_version = gfx90c)
- RAM: 64 GB DDR4-3200 (32 GB x 2)
- Storage: 512 GB M.2 2280 PCIe Gen 3

Conclusions thus far:

| Model @ Q8 | pp512 (Prompt Processing) tokens/s | tg128 (Token Generation) tokens/s | Comments |
| --- | --- | --- | --- |
| Qwen3-Coder-30B-A3B | 95.76 | 12.97 | Maybe the best option for my setup |
| Qwen3-30B-A3B-Instruct-2507 | 95.76 | 12.97 | |
| Qwen3-30B-A3B-Thinking-2507 | 95.76 | 12.97 | |
| gpt-oss-20b | 131.74 | 11.55 | |
| Granite-4.0-h-tiny | 201.17 | 21.15 | Best option in terms of memory requirements and speed |
| Ling-mini-2.0 | 227.23 | 34.29 | Fastest option |
| Ring-mini-2.0 | 227.23 | 34.29 | |
Details of the benchmarks run:

Model: Qwen3-Coder-30B-A3B (same results for Qwen3-30B-A3B-Instruct-2507 and Qwen3-30B-A3B-Thinking-2507)

```
llama-bench -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
```

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 95.76 ± 0.78 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 12.97 ± 0.02 |

Model: gpt-oss-20b

```
llama-bench -m /home/tipu/AI/models/other/jinx-gpt-oss/jinx-gpt-oss-20b-mxfp4.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
```

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 131.74 ± 0.81 |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 11.55 ± 0.01 |

Model: Granite-4.0-h-tiny

```
llama-bench -m /home/tipu/AI/models/other/granite-4.0-h-tiny/granite-4.0-h-tiny-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
```

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 201.17 ± 1.52 |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 21.15 ± 0.04 |

Model: Ling-mini-2.0

```
llama-bench -m /home/tipu/AI/models/other/Huihui-Ling-mini-2.0/Huihui-Ling-mini-2.0-abliterated-q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0 -r 8
```

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 227.23 ± 2.13 |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 34.29 ± 0.04 |
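
For anyone who wants to add numbers for another model (as invited above), every run uses the same invocation; only the GGUF path changes. A minimal template with a placeholder path and the same batch, thread, and repetition settings as the runs above:

```
# Template of the runs above; only the GGUF path changes between models.
llama-bench \
  -m /path/to/your-model-Q8_0.gguf \
  --ubatch-size 4096 --batch-size 512 \
  --threads 4 --mmap 0 \
  -r 8   # 8 repetitions; llama-bench reports mean ± standard deviation
```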

Replies: 1 comment

Sharing some of my understanding for newcomers.
If you are loading a bigger model, you can decrease --batch-size to reduce RAM utilization (it shrinks the compute buffers that sit alongside the KV cache). Decreasing --batch-size may reduce prompt processing speed, but you will be able to fit a bigger context size for communicating with the LLM.
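
As a minimal sketch of that trade-off when actually serving a model: the command below assumes llama-server from the same llama.cpp Vulkan build, reuses the Qwen3-Coder Q8 file from the post above, and uses illustrative (not tuned) values for -c, -b, -ub, and --port.

```
# Minimal sketch, not tuned: trade a smaller batch size for a larger context window.
# -c, -b, -ub, and --port are illustrative values.
llama-server \
  -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf \
  -ngl 99 -c 32768 -b 256 -ub 256 \
  --threads 4 --no-mmap --port 8080
# Any OpenAI-compatible client can then talk to http://localhost:8080/v1
```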

For most models, prompt processing speed decreases as the context grows, so keep that in mind while choosing your model. Similarly, generation speed decreases for longer responses. Following are some benchmarks (a command sketch for reproducing these longer runs is included after the tables):

Qwen3-Coder 30B.A3B

| model | size | params | backend | ngl | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q8_0 | 24.64 GiB | 24.87 B | Vulkan | 99 | 512 | 4096 | 0 | pp512 | 105.52 ± 0.00 |
| qwen3moe 30B.A3B Q8_0 | 24.64 GiB | 24.87 B | Vulkan | 99 | 512 | 4096 | 0 | pp32768 | 27.93 ± 0.00 |
| qwen3moe 30B.A3B Q8_0 | 24.64 GiB | 24.87 B | Vulkan | 99 | 512 | 4096 | 0 | tg512 | 12.70 ± 0.00 |
| qwen3moe 30B.A3B Q8_0 | 24.64 GiB | 24.87 B | Vulkan | 99 | 512 | 4096 | 0 | tg16768 | 5.88 ± 0.00 |

Ling-mini-2.0

| model | size | params | backend | ngl | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 512 | 4096 | 0 | pp512 | 228.53 ± 2.52 |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 512 | 4096 | 0 | pp32768 | 101.08 ± 0.20 |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 512 | 4096 | 0 | tg512 | 33.82 ± 0.01 |
| bailingmoe2 16B.A1B Q8_0 | 16.11 GiB | 16.26 B | Vulkan | 99 | 512 | 4096 | 0 | tg32768 | 12.87 ± 0.01 |

granite-4.0-h-tiny
Granite 4.0 introduces a hybrid Mamba-2/transformer architecture. These models are said to have better throughput at higher contexts and for longer generations, and the benchmark below shows the same.

| model | size | params | backend | ngl | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 512 | 4096 | 0 | pp512 | 202.28 ± 0.00 |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 512 | 4096 | 0 | pp32768 | 171.45 ± 0.00 |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 512 | 4096 | 0 | tg512 | 21.16 ± 0.00 |
| granitehybrid 7B.A1B Q8_0 | 6.88 GiB | 6.94 B | Vulkan | 99 | 512 | 4096 | 0 | tg16768 | 19.50 ± 0.00 |
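
These longer runs can be reproduced with a single llama-bench invocation by passing comma-separated prompt and generation sizes (llama-bench runs one pp test per -p value and one tg test per -n value). A minimal sketch, reusing the batch and thread settings from the runs above; the GGUF path is a placeholder:

```
# Placeholder path; substitute any of the Q8 GGUF files mentioned in this thread.
llama-bench \
  -m /path/to/model-Q8_0.gguf \
  -p 512,32768 -n 512,32768 \
  --batch-size 512 --ubatch-size 4096 \
  --threads 4 --mmap 0
```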

Ling linear and Qwen3-Next are not supported in llama.cpp at the moment (I believe support is in progress). They are supposed to be better at higher contexts and longer generations.
