
Performance of llama.cpp on NVIDIA DGX Spark #16578

ggerganov started this conversation in Show and tell

Overview

This document summarizes the performance of llama.cpp for various models on the new NVIDIA DGX Spark.

Benchmarks include:

  • Prefill (pp) and generation (tg) at various context depths (d)
  • Batch sizes of 1, 2, 4, 8, 16, and 32, typical for local environments

Models:

  • gpt-oss-20b
  • gpt-oss-120b
  • Qwen3 Coder 30B A3B
  • Qwen2.5 Coder 7B
  • Gemma 3 4B QAT
  • GLM 4.5 Air

Feel free to request additional benchmarks for models and use cases.

Benchmarks

Build with:

cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda -j

Using the following commands:

# sequential requests
llama-bench -m [model.gguf] -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
# parallel requests
llama-batched-bench -m [model.gguf] -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap
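
For anyone reproducing the numbers, a concrete invocation might look like the sketch below; the download step and the local paths are assumptions (any method of obtaining the GGUF works), while the flags and the build-cuda/bin binary location follow from the commands above:

# sketch: fetch one of the models listed above, then run the same sweep
huggingface-cli download ggml-org/gpt-oss-20b-GGUF --local-dir ./models/gpt-oss-20b
./build-cuda/bin/llama-bench -m ./models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
./build-cuda/bin/llama-batched-bench -m ./models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap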


gpt-oss-20b

Model: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | pp2048 | 3608.14 ± 9.33 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | tg32 | 77.85 ± 0.40 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | pp2048 @ d4096 | 3354.22 ± 16.76 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | tg32 @ d4096 | 72.21 ± 0.73 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | pp2048 @ d8192 | 3153.73 ± 17.53 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | tg32 @ d8192 | 68.56 ± 0.73 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | pp2048 @ d16384 | 2668.77 ± 9.73 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | tg32 @ d16384 | 63.91 ± 0.05 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | pp2048 @ d32768 | 2070.54 ± 3.55 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | tg32 @ d32768 | 55.79 ± 0.07 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 1.145 | 3578.71 | 0.438 | 73.03 | 1.583 | 2608.21 |
    | 4096 | 32 | 2 | 8256 | 2.288 | 3580.60 | 0.759 | 84.34 | 3.047 | 2709.81 |
    | 4096 | 32 | 4 | 16512 | 4.557 | 3595.50 | 0.952 | 134.46 | 5.509 | 2997.41 |
    | 4096 | 32 | 8 | 33024 | 9.120 | 3592.97 | 1.213 | 211.04 | 10.333 | 3195.96 |
    | 4096 | 32 | 16 | 66048 | 18.215 | 3597.90 | 1.682 | 304.33 | 19.897 | 3319.42 |
    | 4096 | 32 | 32 | 132096 | 36.423 | 3598.60 | 2.398 | 427.10 | 38.821 | 3402.72 |
    | 8192 | 32 | 1 | 8224 | 2.331 | 3514.61 | 0.467 | 68.50 | 2.798 | 2939.24 |
    | 8192 | 32 | 2 | 16448 | 4.639 | 3531.62 | 0.791 | 80.88 | 5.430 | 3028.83 |
    | 8192 | 32 | 4 | 32896 | 9.296 | 3524.86 | 0.997 | 128.43 | 10.293 | 3195.99 |
    | 8192 | 32 | 8 | 65792 | 18.577 | 3527.77 | 1.346 | 190.21 | 19.923 | 3302.31 |
    | 8192 | 32 | 16 | 131584 | 37.167 | 3526.54 | 1.942 | 263.69 | 39.109 | 3364.54 |
    | 8192 | 32 | 32 | 263168 | 74.256 | 3530.28 | 2.923 | 350.27 | 77.179 | 3409.83 |

gpt-oss-120b

Model: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | pp2048 | 1860.76 ± 4.22 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | tg32 | 55.33 ± 0.16 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | pp2048 @ d4096 | 1813.91 ± 6.94 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | tg32 @ d4096 | 51.73 ± 0.10 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | pp2048 @ d8192 | 1710.95 ± 3.51 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | tg32 @ d8192 | 48.86 ± 0.44 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | pp2048 @ d16384 | 1522.16 ± 5.37 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | tg32 @ d16384 | 45.31 ± 0.08 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | pp2048 @ d32768 | 1236.60 ± 3.44 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | tg32 @ d32768 | 39.36 ± 0.04 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 2.247 | 1823.26 | 0.617 | 51.84 | 2.864 | 1441.42 |
    | 4096 | 32 | 2 | 8256 | 4.417 | 1854.73 | 1.171 | 54.65 | 5.588 | 1477.47 |
    | 4096 | 32 | 4 | 16512 | 8.843 | 1852.81 | 1.518 | 84.32 | 10.361 | 1593.71 |
    | 4096 | 32 | 8 | 33024 | 17.684 | 1852.99 | 2.040 | 125.50 | 19.724 | 1674.33 |
    | 4096 | 32 | 16 | 66048 | 35.389 | 1851.85 | 2.943 | 173.95 | 38.333 | 1723.02 |
    | 4096 | 32 | 32 | 132096 | 70.731 | 1853.11 | 4.390 | 233.24 | 75.121 | 1758.45 |
    | 8192 | 32 | 1 | 8224 | 4.503 | 1819.34 | 0.657 | 48.73 | 5.159 | 1593.97 |
    | 8192 | 32 | 2 | 16448 | 9.055 | 1809.46 | 1.245 | 51.42 | 10.299 | 1596.99 |
    | 8192 | 32 | 4 | 32896 | 17.928 | 1827.79 | 1.603 | 79.84 | 19.531 | 1684.31 |
    | 8192 | 32 | 8 | 65792 | 35.949 | 1823.02 | 2.250 | 113.79 | 38.199 | 1722.35 |
    | 8192 | 32 | 16 | 131584 | 71.856 | 1824.10 | 3.329 | 153.82 | 75.184 | 1750.15 |
    | 8192 | 32 | 32 | 263168 | 143.542 | 1826.25 | 5.253 | 194.95 | 148.795 | 1768.67 |

Qwen3 Coder 30B A3B

Model: https://huggingface.co/ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | pp2048 | 2938.67 ± 22.27 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | tg32 | 60.30 ± 0.23 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | pp2048 @ d4096 | 2529.18 ± 9.55 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | tg32 @ d4096 | 53.18 ± 0.04 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | pp2048 @ d8192 | 2253.00 ± 13.67 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | tg32 @ d8192 | 45.19 ± 0.41 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | pp2048 @ d16384 | 1796.26 ± 5.98 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | tg32 @ d16384 | 37.99 ± 0.05 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | pp2048 @ d32768 | 1253.38 ± 4.29 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | tg32 @ d32768 | 28.35 ± 0.02 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 1.456 | 2813.86 | 0.602 | 53.14 | 2.058 | 2005.95 |
    | 4096 | 32 | 2 | 8256 | 2.886 | 2838.12 | 1.119 | 57.20 | 4.005 | 2061.29 |
    | 4096 | 32 | 4 | 16512 | 5.775 | 2836.82 | 1.547 | 82.72 | 7.323 | 2254.86 |
    | 4096 | 32 | 8 | 33024 | 11.519 | 2844.63 | 2.195 | 116.65 | 13.714 | 2408.06 |
    | 4096 | 32 | 16 | 66048 | 23.020 | 2846.94 | 3.204 | 159.81 | 26.224 | 2518.65 |
    | 4096 | 32 | 32 | 132096 | 46.073 | 2844.91 | 4.890 | 209.40 | 50.963 | 2592.02 |
    | 8192 | 32 | 1 | 8224 | 3.070 | 2668.09 | 0.713 | 44.91 | 3.783 | 2173.98 |
    | 8192 | 32 | 2 | 16448 | 6.124 | 2675.19 | 1.269 | 50.45 | 7.393 | 2224.80 |
    | 8192 | 32 | 4 | 32896 | 12.261 | 2672.53 | 1.801 | 71.08 | 14.062 | 2339.40 |
    | 8192 | 32 | 8 | 65792 | 24.495 | 2675.48 | 2.700 | 94.82 | 27.195 | 2419.26 |
    | 8192 | 32 | 16 | 131584 | 48.973 | 2676.42 | 4.278 | 119.68 | 53.251 | 2471.02 |
    | 8192 | 32 | 32 | 263168 | 97.905 | 2677.54 | 6.976 | 146.80 | 104.880 | 2509.22 |

Qwen2.5 Coder 7B

Model: https://huggingface.co/ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | pp2048 | 2277.32 ± 3.48 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | tg32 | 29.09 ± 0.02 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | pp2048 @ d4096 | 2091.33 ± 8.73 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | tg32 @ d4096 | 28.12 ± 0.03 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | pp2048 @ d8192 | 1905.85 ± 5.89 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | tg32 @ d8192 | 27.33 ± 0.01 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | pp2048 @ d16384 | 1591.53 ± 6.30 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | tg32 @ d16384 | 25.89 ± 0.01 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | pp2048 @ d32768 | 1295.05 ± 2.95 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | tg32 @ d32768 | 22.73 ± 0.04 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 1.845 | 2220.63 | 1.136 | 28.17 | 2.980 | 1385.06 |
    | 4096 | 32 | 2 | 8256 | 3.670 | 2231.93 | 1.250 | 51.20 | 4.920 | 1677.91 |
    | 4096 | 32 | 4 | 16512 | 7.334 | 2233.91 | 1.371 | 93.37 | 8.705 | 1896.82 |
    | 4096 | 32 | 8 | 33024 | 14.630 | 2239.86 | 1.580 | 161.98 | 16.210 | 2037.26 |
    | 4096 | 32 | 16 | 66048 | 29.266 | 2239.31 | 2.065 | 247.96 | 31.331 | 2108.07 |
    | 4096 | 32 | 32 | 132096 | 58.567 | 2237.98 | 2.752 | 372.16 | 61.319 | 2154.26 |
    | 8192 | 32 | 1 | 8224 | 3.778 | 2168.30 | 1.173 | 27.28 | 4.951 | 1661.08 |
    | 8192 | 32 | 2 | 16448 | 7.560 | 2167.25 | 1.340 | 47.77 | 8.899 | 1848.21 |
    | 8192 | 32 | 4 | 32896 | 15.114 | 2168.07 | 1.535 | 83.36 | 16.649 | 1975.82 |
    | 8192 | 32 | 8 | 65792 | 30.224 | 2168.32 | 1.863 | 137.38 | 32.088 | 2050.37 |
    | 8192 | 32 | 16 | 131584 | 60.552 | 2164.62 | 2.655 | 192.84 | 63.207 | 2081.80 |
    | 8192 | 32 | 32 | 263168 | 121.060 | 2165.41 | 3.867 | 264.84 | 124.927 | 2106.58 |

Gemma 3 4B QAT

Model: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | pp2048 | 5693.38 ± 9.40 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | tg32 | 80.58 ± 0.20 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | pp2048 @ d4096 | 5250.64 ± 14.32 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | tg32 @ d4096 | 68.99 ± 1.01 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | pp2048 @ d8192 | 4926.56 ± 39.56 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | tg32 @ d8192 | 67.82 ± 0.15 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | pp2048 @ d16384 | 4493.57 ± 42.72 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | tg32 @ d16384 | 64.30 ± 0.17 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | pp2048 @ d32768 | 3779.74 ± 35.74 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | tg32 @ d32768 | 58.23 ± 0.08 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 0.705 | 5807.78 | 0.462 | 69.25 | 1.167 | 3536.14 |
    | 4096 | 32 | 2 | 8256 | 1.395 | 5871.62 | 0.576 | 111.02 | 1.972 | 4187.31 |
    | 4096 | 32 | 4 | 16512 | 2.779 | 5896.15 | 0.665 | 192.52 | 3.444 | 4794.94 |
    | 4096 | 32 | 8 | 33024 | 5.549 | 5904.79 | 0.893 | 286.57 | 6.443 | 5125.79 |
    | 4096 | 32 | 16 | 66048 | 11.091 | 5908.83 | 1.340 | 381.98 | 12.432 | 5312.92 |
    | 4096 | 32 | 32 | 132096 | 22.149 | 5917.67 | 2.100 | 487.69 | 24.249 | 5447.50 |
    | 8192 | 32 | 1 | 8224 | 1.421 | 5764.46 | 0.472 | 67.75 | 1.893 | 4343.47 |
    | 8192 | 32 | 2 | 16448 | 2.826 | 5797.05 | 0.642 | 99.67 | 3.468 | 4742.27 |
    | 8192 | 32 | 4 | 32896 | 5.628 | 5821.92 | 0.799 | 160.14 | 6.428 | 5117.86 |
    | 8192 | 32 | 8 | 65792 | 11.250 | 5825.54 | 1.172 | 218.37 | 12.422 | 5296.38 |
    | 8192 | 32 | 16 | 131584 | 22.476 | 5831.69 | 1.902 | 269.20 | 24.378 | 5397.71 |
    | 8192 | 32 | 32 | 263168 | 44.913 | 5836.67 | 3.224 | 317.61 | 48.137 | 5467.02 |

GLM 4.5 Air

Model: https://huggingface.co/unsloth/GLM-4.5-Air-GGUF/tree/main

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | pp2048 | 854.99 ± 1.55 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | tg32 | 22.98 ± 0.03 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | pp2048 @ d4096 | 768.20 ± 0.64 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | tg32 @ d4096 | 20.44 ± 0.00 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | pp2048 @ d8192 | 684.72 ± 2.02 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | tg32 @ d8192 | 19.30 ± 0.02 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | pp2048 @ d16384 | 571.49 ± 0.93 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | tg32 @ d16384 | 16.83 ± 0.01 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | pp2048 @ d32768 | 419.47 ± 0.88 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | tg32 @ d32768 | 13.47 ± 0.01 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 5.051 | 810.96 | 1.597 | 20.03 | 6.648 | 620.92 |
    | 4096 | 32 | 2 | 8256 | 9.818 | 834.40 | 2.725 | 23.48 | 12.543 | 658.20 |
    | 4096 | 32 | 4 | 16512 | 19.677 | 832.65 | 3.853 | 33.22 | 23.530 | 701.74 |
    | 4096 | 32 | 8 | 33024 | 39.335 | 833.04 | 6.459 | 39.63 | 45.795 | 721.13 |
    | 4096 | 32 | 16 | 66048 | 78.663 | 833.12 | 12.209 | 41.94 | 90.872 | 726.82 |
    | 8192 | 32 | 1 | 8224 | 10.431 | 785.35 | 1.780 | 17.98 | 12.211 | 673.48 |
    | 8192 | 32 | 2 | 16448 | 20.863 | 785.30 | 3.198 | 20.01 | 24.062 | 683.58 |
    | 8192 | 32 | 4 | 32896 | 41.682 | 786.15 | 4.570 | 28.01 | 46.252 | 711.23 |
    | 8192 | 32 | 8 | 65792 | 83.441 | 785.42 | 8.505 | 30.10 | 91.945 | 715.56 |
    | 8192 | 32 | 16 | 131584 | 166.869 | 785.48 | 18.279 | 28.01 | 185.148 | 710.70 |



Replies: 15 comments 56 replies

Comment options

Thanks for the benchmark! I would like to request an additional benchmark for a very popular model, GLM-4.5-Air-FP8:
https://huggingface.co/zai-org/GLM-4.5-Air-FP8

and quants for it:

1 reply
Comment options

Saw the benchmark results. Thank you so much for the work! Appreciate it very much.

Comment options

Hi. It would be great to see a Qwen Next 80B benchmark for these two models:

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
(Has acceptable t/s even on CPU... I'm not sure if this one runs on llama.cpp)

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
(Official quants)

Thanks.

2 replies
Comment options

Not supported yet; there is an open PR for it currently.

Comment options

Hi. It would be great to see a Qwen Next 80B benchmark for these two models:

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (Has acceptable t/s even on CPU... I'm not sure if this one runs on llama.cpp)

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (Official quants)

Thanks.

Yeah I really want to see the performance of a specific model comparing full 16 bit precision, Q8, Q4, FP4 and FP8.

Nonetheless, thank you for the wonderful data!

Comment options

Getting similar performance with my Framework Desktop. Thanks for helping with my FOMO.

12 replies
Comment options

Can someone please help explain this to me? I am not trying to bash this machine; I am just trying to understand the justification for paying almost twice as much for the same performance with similar specs.

I'm sure the ConnectX-7 200Gb networking has something to do with the pricing difference :)

Comment options

btw, if you want GB10, it's most likely a much better choice to buy the ASUS GB10 system for $1k less (at least that's what I did) - the DGX Spark is more expensive, but it's not the only choice

Interesting, the ASUS GB10 seems to ship with a 240W power adapter, much higher than the DGX Spark's. I wonder if you will get more performance given the higher power intake.

Comment options

btw, if you want GB10, it's most likely a much better choice to buy the ASUS GB10 system for $1k less (at least that's what I did) - the DGX Spark is more expensive, but it's not the only choice

Interesting, the ASUS GB10 seems to ship with a 240W power adapter, much higher than the DGX Spark's. I wonder if you will get more performance given the higher power intake.

I haven't seen the specs, but it's possible ASUS just used a power adapter with a high enough rating for the device. For example, I can plug a compatible 90-watt power adapter into my 45-watt laptop; it will pull what it needs to.

Comment options

@bartlettroscoe i benched gpt-oss 120b on Framework Desktop a couple months ago: geerlingguy/ai-benchmarks#21 (comment)

Comment options

with "correct" rocm and build I get:

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | -: | ---: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp1 | 45.40 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2 | 57.58 ± 0.95 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp3 | 74.03 ± 2.34 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp4 | 90.93 ± 2.95 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp8 | 142.31 ± 5.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp12 | 173.14 ± 12.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp16 | 205.43 ± 6.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp24 | 235.43 ± 11.38 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp32 | 234.24 ± 10.83 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp48 | 216.49 ± 10.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp64 | 311.52 ± 7.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp96 | 386.08 ± 10.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp128 | 446.85 ± 6.77 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp192 | 509.42 ± 8.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp256 | 594.22 ± 9.46 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp384 | 698.31 ± 3.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp512 | 763.53 ± 4.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp768 | 845.23 ± 6.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp1024 | 927.17 ± 1.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp1536 | 987.73 ± 1.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 | 1017.17 ± 4.10 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp3072 | 939.48 ± 2.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp4096 | 953.72 ± 1.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg16 | 45.43 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp512+tg64 | 264.68 ± 0.82 |
Comment options

Can you run the classic Llama 2 7B Q4_0 so it can be compared on the chart?

0 replies
Comment options

Super interesting, thanks for sharing, Georgi!

llama-bench -m [model.gguf] -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048

Could you please help me understand: Does "-d" mean KV cache length before the "-p" prefill happens? What does "-ub" define, eg batch size?

1 reply
Comment options

ggerganov Oct 15, 2025
Maintainer Author

Does "-d" mean KV cache length before the "-p" prefill happens?

Yes.

What does "-ub" define, eg batch size?

Yes.
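
To make that concrete, an illustrative sketch (the depth and sizes below are arbitrary, not from the benchmarks above):

# -d pre-fills the KV cache to the given depth, then the -p (prefill) and -n (generation)
# measurements are taken on top of it; -ub sets the batch size used to submit prompt tokens
llama-bench -m [model.gguf] -fa 1 -d 8192 -p 2048 -n 32 -ub 2048
# reports pp2048 @ d8192 and tg32 @ d8192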

Comment options

Could you add llama2-7b result to #15013?

0 replies
Comment options

Awesome, thank you!
So for gpt-oss-120B, around 35 tokens/s on the DGX Spark.
On vLLM, with 131k context and at almost any length, I'm getting around 180 tokens/s on a 300W RTX 6000 96GB Max-Q edition.

So what's the point of a DGX Spark? Sure, it has 128GB of memory, but I can offload bigger models between 96GB of VRAM and the rest to normal RAM (CPU)...
So in the end I can run even bigger models, and even faster than the DGX could.

It's too expensive for what it offers. If the DGX Spark were around 2k, like the Ryzen Max 395+ mini PCs, it would be fine and okay.
But for 4k usd/eur it's absolutely senseless...

PS: And a Mac Mini/Studio is a much better option at 4k usd/eur, compared to a DGX Spark.

9 replies
Comment options

Guys, please don't take FP4 or FP8 as a win.

Let me explain:
I compare embedding models in different quantisations (for my project at work).

Comparing embedding models is actually great, because you can simply query the resulting vector database and see the quantisation impact.

From my tests, no matter which model, be it Qwen3-Embedding or BGE-M3 or anything else, the impact of quantisation is huge!

FP32 is amazing
BF16 is still amazing
int8/Q8 = you already see a degradation because the results start to differ, but only 5-10% of the results are different.
Q4 = 50% of the results are different, an almost unusable model

So you guys want to tell me that FP4 is a win?
In my opinion FP8 is fine and usable, but FP4 will be unusable crap.
No matter what the marketing says, 1% quality loss is a huge lie!!!

I didn't test FP4 though, not even FP8, so I can't say for sure.
But from my experience with all other quantisations, FP4 should be crap.

Cheers!

Comment options

It depends on the model. In many cases, in my experience, FP4 does a fantastic job. Also, NVFP4 has the potential to be amazing.

So is it situational? Sure, it can be. But I don't think it's something that can be ignored.

Also, FP8 is great; I have found little reason not to use it.

Comment options

I'd agree that everyone should eval for their particular downstream tasks rather than just trusting perplexity or KLD. When running quants on my 405B model I ran JA MT-Bench evals and was surprised to find a bigger difference with FP8-Dynamic than IQ3_M.

@icsy7867 I know you're just theory-crafting instead of running tests, but see my PRO 6000 TensorRT/NVFP4 benchmark below - there is zero throughput benefit from NVFP4. Maybe it's related to NVIDIA/TransformerEngine#2255 - I never use TensorRT and it's impossible to build, so I just used the latest docker image for my tests (tensorrt-llm/release:1.2.0rc0), but I've put my full scripts/details online so it's easy for anyone to rent any GPU they want to check any configuration/variation for themselves.

#16578 (reply in thread)

Comment options

Yes, it really depends on the model. For example, here is what I get for Mistral Small:

BF16 Q8_0_L Q8_0 Q8_0 Q8_0 Q6_K Q6_K Q8_0 Q5_K_M Q4_K_M Q3_K_M
Mean PPL 5.377047 5.417646 5.428002 5.429658 5.433468 5.432926 5.448926 5.521099 5.798507
Mean KLD 0.008340 0.010369 0.010459 0.012241 0.012291 0.014935 0.027426 0.079385
Maximum KLD 2.048998 3.975800 1.263743 5.553815 5.662407 3.943127 4.050639 7.999546
99.9% KLD 0.204782 0.223453 0.219347 0.247532 0.250371 0.367634 0.993010 2.745419
99.0% KLD 0.078322 0.087357 0.087095 0.099235 0.099381 0.123670 0.250287 0.844125
95.0% KLD 0.032427 0.037600 0.038312 0.043401 0.043684 0.050811 0.088569 0.267027
90.0% KLD 0.019813 0.023899 0.024312 0.027942 0.028040 0.032904 0.055239 0.157323
Median KLD 0.003369 0.005111 0.005167 0.006354 0.006390 0.007717 0.013581 0.036258
10.0% KLD 0.000082 0.000128 0.000131 0.000159 0.000163 0.000188 0.000353 0.001116
5.0% KLD 0.000016 0.000027 0.000028 0.000036 0.000037 0.000043 0.000087 0.000311
1.0% KLD -0.000000 0.000001 0.000001 0.000003 0.000003 0.000003 0.000010 0.000045
0.1% KLD -0.000016 -0.000011 -0.000010 -0.000007 -0.000007 -0.000007 -0.000001 0.000008
Minimum KLD -0.000157 -0.000188 -0.000198 -0.000248 -0.000164 -0.000149 -0.000273 -0.000017
Same top p 95.971 94.905 94.947 94.457 94.394 94.030 92.372 88.237
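
(For context, PPL/KLD tables of this kind can be produced with llama.cpp's llama-perplexity tool; a rough sketch with placeholder file names - first save the reference-model logits, then compare a quant against them:)

# save logits of the reference (BF16) model
llama-perplexity -m mistral-small-bf16.gguf -f wiki.test.raw --kl-divergence-base base-logits.dat
# compute PPL and KLD of a quantized model against the saved logits
llama-perplexity -m mistral-small-q4_k_m.gguf -f wiki.test.raw --kl-divergence-base base-logits.dat --kl-divergence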
Comment options

I'd agree that everyone should eval for their particular downstream tasks rather than just trusting perplexity or KLD. When running quants on my 405B model I ran JA MT-Bench evals and was surprised to find a bigger difference with FP8-Dynamic than IQ3_M.

@icsy7867 I know you're just theory-crafting instead of running tests, but see my PRO 6000 TensorRT/NVFP4 benchmark below - there is zero throughput benefit from NVFP4. Maybe it's related to NVIDIA/TransformerEngine#2255 - I never use TensorRT and it's impossible to build, so I just used the latest docker image for my tests (tensorrt-llm/release:1.2.0rc0), but I've put my full scripts/details online so it's easy for anyone to rent any GPU they want to check any configuration/variation for themselves.

#16578 (reply in thread)

I appreciate the edit you did there. But you aren't wrong, I wish I had a Blackwell GPU to test. I am surprised the 6000 Pro doesn't get a speedup there from the FP4 tensor cores. Your data is much appreciated though, thanks.

Comment options

@ggerganov Are there llama.cpp benchmarks for the AGX Thor? It seems like a similar offering, but NVIDIA markets it as twice as fast.

There is no official detailed spec sheet for the DGX Spark to make a comparison to the Thor (2560 CUDA cores and 92 tensor cores), but NVIDIA claims 2 PFLOPS (sparse FP4) for the Thor and 1 PFLOPS (sparse FP4) for the Spark.
I guess this might only affect batching, but it would be interesting to know given that the Thor is cheaper than the Spark.

5 replies
Comment options

ggerganov Oct 15, 2025
Maintainer Author

I'm not familiar with AGX Thor. But if you have one, you can easily run the same benchmarks on it.

Comment options

Quick tldr:

Thor is sm_110 (formerly sm_101) with the datacenter-style tensor cores - including tensor memory. And no raytracing cores. While Spark is sm_121 with the full consumer Blackwell feature set.

Thor and Spark have relatively similar memory bandwidth. The Thor CPU is much slower.

Vector throughput on Thor is 1/3rd of the one on DGX Spark but you get twice the matrix throughput.

Thor has 4 cursed Synopsys 25GbE NICs (set to 10GbE by default, see https://docs.nvidia.com/jetson/archives/r38.2/DeveloperGuide/SD/Kernel/Enable25GbEthernetOnQSFP.html as it doesn't have auto-negotiation of the link rate) exposed via a QSFP connector providing 4x25GbE, while Spark systems have a regular ConnectX-7.

Thor uses a downstream L4T stack instead of regular NVIDIA drivers, unlike Spark. But at least the CUDA SDK is the same, unlike prior Tegras. Oh, and you get less other I/O too.

Side note: might be better to also consider GB10 systems from OEMs. Those are available for cheaper than AGX Thor devkits too.

Comment options

I'm not familiar with AGX Thor. But if you have one, you can easily run the same benchmarks on it.

I don't have one unfortunately, hoping whoever does will run those benchmarks.

Vector throughput on Thor is 1/3rd of the one on DGX Spark but you get twice the matrix throughput.

This is a very weird and interesting tradeoff.

Comment options

Thor is sm_110 (formerly sm_101) with the datacenter-style tensor cores - including tensor memory

@woachk does "tensor memory" here refer to TMEM?

Comment options

Yes.

Comment options

For those curious about Thor performance
(All models are the same as linked in the original benchmark with the same command)
llama.cpp git commit: f9fb33f
Jetpack 7.0 [L4T 38.2.2]
Docker container: nvcr.io/nvidia/pytorch:25.09-py3
MAXN and jetson_clocks enabled

gpt-oss-20b-gguf

# ./bin/llama-bench -m /workspace/models/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 2008.85 ± 4.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 | 60.85 ± 0.17 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1862.13 ± 4.80 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 55.03 ± 0.06 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1740.90 ± 3.24 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 53.58 ± 0.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 1446.75 ± 3.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 52.49 ± 1.94 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 1193.93 ± 0.72 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 48.33 ± 0.04 |
build: f9fb33f2 (6771)

Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

# ./bin/llama-bench -m /workspace/models/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 1654.25 ± 1.80 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 | 44.26 ± 0.11 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1410.87 ± 2.22 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 39.46 ± 0.04 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1228.69 ± 1.78 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 36.88 ± 0.13 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 985.39 ± 7.04 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 33.55 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 686.45 ± 0.93 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 26.92 ± 0.05 |
build: f9fb33f2 (6771)

gpt-oss-120b

# ./bin/llama-bench -m /workspace/models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 | 967.20 ± 6.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 42.00 ± 0.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 932.85 ± 2.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 38.81 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 892.28 ± 2.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 39.22 ± 1.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 827.57 ± 1.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 37.77 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 677.70 ± 1.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 34.02 ± 0.02 |
build: f9fb33f2 (6771)
9 replies
Comment options

That commit only applies the change inside if (prop.major == 12 && prop.minor == 1); I wonder if also adding it for 11.0 changes things.

Comment options

I did a quick one-off build where I removed the conditional around the scheduling block to force spin, and I do see a consistent improvement. Just looking at power draw, there is probably at least another 10-20% of performance untapped on Thor beyond moving it to the spin scheduler. Currently it looks like we are mostly CPU bound.

Llama-bench Test Results (Qwen3moe 30B)

| test | Default | Spin | Improvement (%) |
| --- | ---: | ---: | ---: |
| pp2048 | 1654.25 | 1700.05 | 2.77 |
| pp2048 @ d16384 | 985.39 | 992.37 | 0.71 |
| pp2048 @ d32768 | 686.45 | 687.30 | 0.12 |
| pp2048 @ d4096 | 1410.87 | 1446.22 | 2.51 |
| pp2048 @ d8192 | 1228.69 | 1257.35 | 2.33 |
| tg32 | 44.26 | 45.67 | 3.19 |
| tg32 @ d16384 | 33.55 | 33.62 | 0.21 |
| tg32 @ d32768 | 26.92 | 27.05 | 0.48 |
| tg32 @ d4096 | 39.46 | 40.64 | 2.99 |
| tg32 @ d8192 | 36.88 | 38.09 | 3.28 |

Average improvement: 1.86%
Best improvement: 3.28% (tg32 @ d8192)
Worst improvement: 0.12% (pp2048 @ d32768)

Llama-batched-bench Test Results

PP=4096:
Average throughput improvement: 2.03%
Best batch size improvement: B2 (4.48%)
Worst batch size improvement: B16 (0.06%)

PP=8192:
Average throughput improvement: 0.05%
Best batch size improvement: B32 (0.07%)
Worst batch size improvement: B16 (0.03%)

Spin schedule
 Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
Test: llama-bench
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 1700.05 ± 2.02 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 | 45.67 ± 0.11 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1446.22 ± 3.54 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 40.64 ± 0.05 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1257.35 ± 0.75 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 38.09 ± 0.09 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 992.37 ± 1.89 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 33.62 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 687.30 ± 0.48 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 27.05 ± 0.03 |
Test: llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 4096 | 32 | 1 | 4128 | 2.537 | 1614.38 | 0.789 | 40.54 | 3.327 | 1240.92 |
| 4096 | 32 | 2 | 8256 | 4.949 | 1655.30 | 1.301 | 49.18 | 6.250 | 1320.87 |
| 4096 | 32 | 4 | 16512 | 9.887 | 1657.09 | 1.663 | 76.98 | 11.550 | 1429.62 |
| 4096 | 32 | 8 | 33024 | 19.739 | 1660.11 | 2.289 | 111.86 | 22.027 | 1499.25 |
| 4096 | 32 | 16 | 66048 | 39.464 | 1660.65 | 3.279 | 156.14 | 42.743 | 1545.23 |
| 4096 | 32 | 32 | 132096 | 78.936 | 1660.49 | 5.033 | 203.46 | 83.968 | 1573.16 |
| 8192 | 32 | 1 | 8224 | 5.314 | 1541.47 | 0.839 | 38.14 | 6.153 | 1336.50 |
| 8192 | 32 | 2 | 16448 | 10.614 | 1543.68 | 1.396 | 45.86 | 12.009 | 1369.61 |
| 8192 | 32 | 4 | 32896 | 21.220 | 1544.24 | 1.888 | 67.79 | 23.108 | 1423.59 |
| 8192 | 32 | 8 | 65792 | 42.394 | 1545.87 | 2.792 | 91.68 | 45.187 | 1456.01 |
| 8192 | 32 | 16 | 131584 | 84.800 | 1545.66 | 4.206 | 121.73 | 89.006 | 1478.37 |
| 8192 | 32 | 32 | 263168 | 169.577 | 1545.87 | 6.867 | 149.11 | 176.444 | 1491.51 |
Comment options

For prompt processing there's a lot more on the table, but that means switching to tcgen05 MMA instructions (which is a separate instruction set from the regular tensor core one).

And there's also the matter of using lower-precision MMAs in general.

Comment options

I believe that Thor doesn't support tcgen05 because it doesn't have tensor-memory

Comment options

Thor does have tensor memory - it uses the data centre tensor cores (it's sm_110[a]), Spark does not.

See https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma

Comment options

Would love to see accuracy of the same models on the main benchmarks running on the DGX, as it will vary on different HW & FW in addition to the speed.

As is clearly seen here: https://artificialanalysis.ai/models/gpt-oss-120b/providers

0 replies
Comment options

Please bench the full Qwen3 coder model

2 replies
Comment options

ggerganov Oct 17, 2025
Maintainer Author

There aren't any measurable benefits in terms of quality compared to Q8_0, so I don't think there is any point in benching that, as it is most likely going to perform worse in terms of speed.

Comment options

I am just impressed that it might run at all. Is there any bench on fine-tuning?

Comment options

Would love to see this cluster setup in the comparison table too:
EXO Labs cluster with 2x DGX + Mac Studio
https://blog.exolabs.net/nvidia-dgx-spark/

1 reply
Comment options

ggerganov Oct 17, 2025
Maintainer Author

AFAICT this is vaporware.

Comment options

On the subject of Spark and Thor, I have been looking for an alternative to TensorRT in the form of a Python-free and community-driven inference engine. I'm looking to leverage the NVFP4 tensor cores, and wonder if there's any project or folks working to support those in llama.cpp?

6 replies
Comment options

The whole Blackwell product range, from the RTX 5050 onwards to the B200/300 through iGPUs

Comment options

Just as an FYI, I don't have a Spark but I tested NVFP4 on an RTX PRO 6000 (Llama 3.1 8B Instruct). NVFP4 w/ TensorRT does not perform better than llama.cpp at bs=1, and at higher concurrency doesn't take the lead until c=32.

I didn't test quality loss, but from a pure throughput perspective, I don't think the current NVFP4 implementation is particularly good. Certainly not worth all the custom quanting and other hassles...

| Config | Req/s | Prefill Tok/s | Decode Tok/s | Total Tok/s | Max Out Tok/s | TTFT mean | TTFT med | TTFT p99 | TPOT mean | TPOT med | TPOT p99 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| llama.cpp.q4_k_m | 1.65 | 1683.45 | 207.16 | 1890.61 | 223.00 | 74.17 | 75.75 | 85.71 | 4.36 | 4.22 | 8.40 |
| sglang.fp8-auto | 1.15 | 1173.85 | 142.83 | 1316.68 | 146.00 | 54.88 | 55.31 | 55.79 | 6.61 | 6.62 | 6.62 |
| sglang.fp8-dynamic | 1.04 | 1065.99 | 130.29 | 1196.28 | 132.00 | 55.91 | 56.30 | 57.13 | 7.28 | 7.29 | 7.29 |
| sglang.w4a16 | 1.56 | 1590.93 | 194.85 | 1785.78 | 204.00 | 53.69 | 54.10 | 54.79 | 4.74 | 4.75 | 4.76 |
| trt.fp8 | 0.59 | 605.67 | 74.33 | 680.01 | 76.00 | 39.94 | 40.24 | 40.76 | 13.24 | 13.24 | 13.27 |
| trt.nvfp4 | 0.60 | 608.22 | 74.38 | 682.61 | 76.00 | 30.91 | 31.05 | 31.31 | 13.30 | 13.30 | 13.34 |
| vllm.fp8-dynamic | 0.77 | 789.55 | 94.90 | 884.45 | 98.00 | 34.94 | 35.12 | 36.43 | 10.34 | 10.34 | 10.36 |
| vllm.w4a16 | 1.52 | 1549.83 | 189.81 | 1739.64 | 196.00 | 49.09 | 49.39 | 50.30 | 4.92 | 4.92 | 4.96 |
Comment options

@lhl what's the prefill sequence length in the profiles above?
My use case is prefill-only at seqlen > 300.

Comment options

This is using a standard vLLM bench - ShareGPT w/ prefill 1024 and decode 128, I believe. If you have a specific use case, it's probably best to just try the device directly - I think they're available for a buck or two on Vast or Runpod.

I think the compute is particularly strong for a client card. For example, the PRO 6000 actually beats an H100 on our Whisper inference sweeps. (Still trains much slower though)

Here's my LLM sweep scripts (and raw results) btw: https://github.com/AUGMXNT/speed-benchmarking/tree/main/nvfp4

Comment options

@ggerganov - what flags did you use to compile for the DGX Spark? Also, did you set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1?
I've just got the Spark, and I'm not getting the same performance numbers as you. Also, the model loading is super slow. Not sure what's going on; I'm probably missing something.

It does seem to offload layers to the GPU properly, but nvtop/nvidia-smi shows host memory utilization growing to quite large numbers (more than 100GB, and then it all goes to GPU memory). In comparison, my Strix Halo PC loads the same model 5x faster.

My numbers:

Without GGML_CUDA_ENABLE_UNIFIED_MEMORY=1:

Model loading time - 1 minute 44 seconds using this command:

build/bin/llama-server -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -ngl 999 -ub 2048

Benchmarks:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | -: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 | 1737.17 ± 81.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 45.87 ± 0.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1777.81 ± 5.92 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 43.41 ± 0.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1720.17 ± 8.49 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 41.52 ± 0.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 1512.23 ± 11.81 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 38.39 ± 0.15 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 1231.86 ± 6.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 34.29 ± 0.07 |

build: 03792ad (6816)

With GGML_CUDA_ENABLE_UNIFIED_MEMORY=1:

Model loading time: 49 seconds
Benchmarks:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | -: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 | 1672.33 ± 65.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 40.61 ± 0.38 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1661.97 ± 8.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 38.29 ± 0.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1587.22 ± 12.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 36.85 ± 0.42 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 1384.96 ± 6.77 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 34.62 ± 0.22 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 1124.23 ± 4.65 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 30.47 ± 0.08 |

For comparison, from my GMKTek Evo X2 (AMD AI MAX+ 395), same llama.cpp build, compiled with HIP:

Model loading time: 25 seconds (8 seconds if still in caches!!!)

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | -: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 | 999.59 ± 4.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 | 47.49 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 824.37 ± 1.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 44.23 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 703.42 ± 1.54 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 42.52 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d16384 | 514.89 ± 3.86 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d16384 | 39.71 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d32768 | 348.59 ± 2.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d32768 | 35.39 ± 0.01 |

Any ideas? Your benchmarks look closer to what I'd expect from this device. And the long loading time makes me think that it is doing some extra mallocs/copying.
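
(A rough way to separate load time from compute, as a sketch only - the page-cache drop and env-var toggle are things to test, not a known fix; the model path is the one from the commands above:)

# time a cold load without, then with, the unified-memory allocation path
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
time build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -p 512 -n 16 -ub 2048
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
time GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -p 512 -n 16 -ub 2048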

8 replies
Comment options

Re: GDS - not sure what's going on there, but:

eugr@spark:~$ /usr/local/cuda/gds/tools/gdscheck.py -p
 GDS release version: 1.15.1.6
 libcufile version: 2.12
 Platform: aarch64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA : Unsupported
 NVMe : Unsupported
 NVMeOF : Unsupported
 SCSI : Unsupported
 ScaleFlux CSD : Unsupported
 NVMesh : Unsupported
 DDN EXAScaler : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS : Unsupported
 BeeGFS : Unsupported
 ScaTeFS : Unsupported
 WekaFS : Unsupported
 Userspace RDMA : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library : Not Loaded (libcufile_rdma.so)
 --rdma devices : Not configured
 --rdma_device_status : Up: 0 Down: 0
Comment options

Hmm, something that I wonder about.

You should be able to rely on HMM (cudaDevAttrPageableMemoryAccess) on GB10 by "just" using the host memory mapping (even for mmap'd files) and not dealing with any CUDA memory allocation APIs. The perf overhead will be there because of 4KB pages though, but I wonder if that alleviates the loading times...

Comment options

I used to test directly registering the mmap with HIP on an AMD APU, and it used to work with no more penalty than what I get with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, but for AMD/HIP there is a special config for the allocation. I don't have an NVIDIA APU to look at what is needed for CUDA.
On AMD the gain/loss is not because of 4K pages, but because of CPU/GPU cache coherency by default.

Comment options

@ggerganov so, I got really curious and decided to test the kernel theory.

Installed Fedora 43 beta on the DGX Spark, with the nvidia-open drivers and CUDA 13 (used the RHEL 10 package). I needed to patch CUDA's math-operations.h as the rsqrt/rsqrtf signature wasn't matching the one in the GCC 15 toolchain that comes with Fedora 43, but other than that I was able to compile llama.cpp (and it was able to detect ARM features properly - something that didn't work on stock DGX OS!!!).

And lo and behold, loading gpt-oss-120b from cold takes 19.5 seconds - slightly faster than Strix Halo!!!! A big improvement compared to 56 seconds on DGX OS!

On the flip side, I'm getting worse performance on token generation:

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | -: | ---: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 1864.44 ± 3.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 41.79 ± 0.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 1730.84 ± 4.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 37.90 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 1628.49 ± 7.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 36.38 ± 0.10 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1395.37 ± 8.78 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 34.23 ± 0.01 |
Comment options

Just in case - the Strix Halo can read even faster:

$ sudo hdparm -t --direct /dev/nvme0n1
/dev/nvme0n1:
 Timing O_DIRECT disk reads: 14690 MB in 3.00 seconds = 4896.16 MB/sec
Comment options

Throughput is not the only metric.

We need to take into account that different HW/FW produce different accuracy for the same model,
and it can vary from a small to a drastic difference.

Can someone test popular LLMs like gpt-oss?

0 replies