Performance of llama.cpp on NVIDIA DGX Spark #16578
-
Overview

This document summarizes the performance of llama.cpp on the NVIDIA DGX Spark. Benchmarks include sequential requests (llama-bench) and parallel requests (llama-batched-bench).

Models:
- gpt-oss-20b
- gpt-oss-120b
- Qwen3 Coder 30B A3B
- Qwen2.5 Coder
- Gemma 3 4B QAT
- GLM 4.5 Air

Feel free to request additional benchmarks for models and use cases.

Benchmarks

Build with:

cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda -j

Using the following commands:

# sequential requests
llama-bench -m [model.gguf] -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0

# parallel requests
llama-batched-bench -m [model.gguf] -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap
gpt-oss-20b
Model: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF
gpt-oss-120b
Model: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF
Qwen3 Coder 30B A3B
Model: https://huggingface.co/ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF
Qwen2.5 Coder
Model: https://huggingface.co/ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF
Gemma 3 4B QAT
Model: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF
GLM 4.5 Air
Model: https://huggingface.co/unsloth/GLM-4.5-Air-GGUF/tree/main
-
Thanks for the benchmark! I would like to request an additional benchmark for a very popular model, GLM-4.5-Air-FP8:
https://huggingface.co/zai-org/GLM-4.5-Air-FP8
and quants for it:
- Q4_K_M
- Q6_K
- Q8 (if possible)
https://huggingface.co/unsloth/GLM-4.5-Air-GGUF/tree/main
-
Saw the benchmark results. Thank you so much for the work! Much appreciated.
-
Hi. It would be great to see a Qwen Next 80B benchmark for these two models:
https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
(Has acceptable t/s even on CPU... I'm not sure if this one runs on llama.cpp)
https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
(Official quants)
Thanks.
-
Not supported yet; there is currently an open PR.
-
> Hi. It would be great to see a Qwen Next 80B benchmark for these two models:
> https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (Has acceptable t/s even on CPU... I'm not sure if this one runs on llama.cpp)
> https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (Official quants)
> Thanks.
Yeah, I really want to see the performance of a specific model compared across full 16-bit precision, Q8, Q4, FP4, and FP8.
Nonetheless, thank you for the wonderful data!
-
Getting similar performance with my Framework Desktop. Thanks for helping my FOMO.
-
Can someone please help explain this to me? I am not trying to bash this machine, I am just trying to understand the justification for paying almost twice as much for the same performance with similar specs.
I'm sure the ConnectX-7 200Gb networking has something to do with the pricing difference :)
-
> btw, if you want a GB10, it's most likely a much better choice to buy the ASUS GB10 system for 1ドルk less (at least that's what I did) - the DGX Spark is more expensive but it's not the only choice

Interesting, the ASUS GB10 seems to ship with a 240W power adapter, much higher than the DGX Spark's. I wonder if you get more performance given the higher power intake.
-
> btw, if you want a GB10, it's most likely a much better choice to buy the ASUS GB10 system for 1ドルk less (at least that's what I did) - the DGX Spark is more expensive but it's not the only choice
> Interesting, the ASUS GB10 seems to ship with a 240W power adapter, much higher than the DGX Spark's. I wonder if you get more performance given the higher power intake.

I haven't seen the specs, but it's possible ASUS just used a power adapter with a high enough rating for the device. For example, I can plug a compatible 90W power adapter into my 45W laptop; it will only pull what it needs.
-
@bartlettroscoe I benched gpt-oss-120b on a Framework Desktop a couple of months ago: geerlingguy/ai-benchmarks#21 (comment)
-
With the "correct" ROCm and build I get:
-
Can you run the classic llama 2 7B Q4_0 so it can be compared on the chart?
-
Super interesting, thanks for sharing, Georgi!
llama-bench -m [model.gguf] -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
Could you please help me understand: does "-d" mean the KV cache length before the "-p" prefill happens? And what does "-ub" define, e.g. the batch size?
-
> Does "-d" mean the KV cache length before the "-p" prefill happens?

Yes.

> What does "-ub" define, e.g. the batch size?

Yes.
-
Could you add llama2-7b result to #15013?
-
Awesome, thank you!
So for gpt-oss-120B, that's around 35 tokens/s on the DGX Spark.
On vLLM, with 131k context and at almost any length, I'm getting around 180 tokens/s on a 300W RTX 6000 96GB Max-Q edition.
So what's the point of a DGX Spark? Sure, it has 128GB of memory, but I can offload bigger models by splitting them between 96GB of VRAM and normal RAM (CPU)...
So in the end I can run even bigger models, and faster than the DGX could.
It's too expensive for what it offers. If the DGX Spark were around 2k, like the Ryzen AI Max+ 395 mini PCs, it would be fine.
But at 4k USD/EUR it makes absolutely no sense...
PS: A Mac Mini/Studio is a much better option at 4k USD/EUR, compared to a DGX Spark.
-
Guys, please don't take FP4 or FP8 as a win.
Let me explain:
I compare embedding models in different quantizations (for my project at work).
Comparing embedding models is actually great for this, because you can simply query the resulting vector database and see the impact of quantization.
From my tests, no matter which model, be it Qwen3-Embedding or BGE-M3 or anything else, the impact of quantization is huge:
FP32 is amazing.
BF16 is still amazing.
int8/Q8: you already see degradation because the results start to differ, but only 5-10% of the results are different.
Q4: 50% of the results are different; the model is almost unusable.
So you guys want to tell me that FP4 is a win?
In my opinion FP8 is fine and usable, but FP4 will be unusable crap.
No matter what the marketing says, "1% quality loss" is a huge lie!
I haven't tested FP4 though, not even FP8, so I can't say for sure.
But from my experience with all other quantizations, FP4 should be crap.
Cheers!
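For context, the kind of divergence described above can be made concrete with a small top-k overlap check. Below is a minimal sketch (not the commenter's actual tooling; the data layout, the k=5 cutoff, and the toy document IDs are assumptions) that measures what fraction of the baseline top-k retrieval results are lost when switching to a quantized embedding model.

```cpp
// Sketch: quantify how much the top-k retrieval results of a quantized
// embedding model diverge from a full-precision baseline.
// Assumes we already have, per query, the ranked document IDs returned by
// both index variants (e.g. FP32 vs Q4).
#include <algorithm>
#include <cstdio>
#include <unordered_set>
#include <vector>

// Fraction of baseline top-k hits that are missing from the quantized top-k.
static double topk_divergence(const std::vector<int>& baseline,
 const std::vector<int>& quantized,
 size_t k) {
 k = std::min({k, baseline.size(), quantized.size()});
 if (k == 0) return 0.0;
 std::unordered_set<int> base(baseline.begin(), baseline.begin() + k);
 size_t missing = 0;
 for (size_t i = 0; i < k; ++i) {
 if (base.count(quantized[i]) == 0) ++missing;
 }
 return static_cast<double>(missing) / static_cast<double>(k);
}

int main() {
 // Toy example: two queries, top-5 document IDs from each index variant.
 std::vector<std::vector<int>> fp32_results = {{1, 2, 3, 4, 5}, {7, 8, 9, 10, 11}};
 std::vector<std::vector<int>> q4_results = {{1, 2, 6, 4, 12}, {7, 8, 9, 13, 14}};

 double total = 0.0;
 for (size_t q = 0; q < fp32_results.size(); ++q) {
 total += topk_divergence(fp32_results[q], q4_results[q], 5);
 }
 std::printf("average top-k divergence: %.1f%%\n",
 100.0 * total / fp32_results.size());
 return 0;
}
```

Averaged over many real queries, numbers like "5-10% different" vs "50% different" are what this kind of check would report.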
-
It depends on the model. In many cases, in my experience, FP4 does a fantastic job. NVFP4 also has the potential to be amazing.
So is it situational? Sure, it can be. But I don't think it's something that can be ignored.
FP8 is also great; I have found little reason not to use it.
-
I'd agree that everyone should eval for their particular downstream tasks rather than just trusting perplexity or KLD. When running quants of my 405B model I ran JA MT-Bench evals and was surprised to find a bigger difference with FP8-Dynamic than with IQ3_M.
@icsy7867 I know you're just theory-crafting instead of running tests, but see my PRO 6000 TensorRT/NVFP4 benchmark below: there is zero throughput benefit from NVFP4. Maybe related to NVIDIA/TransformerEngine#2255 - I never use TensorRT and it's impossible to build, so I just used the latest Docker image for my tests (tensorrt-llm/release:1.2.0rc0), but I've put my full scripts/details online so it's easy for anyone to rent any GPU they want and check any configuration/variation for themselves.
-
Yes, it really depends on the model. For example, for Mistral Small I get:
-
> I'd agree that everyone should eval for their particular downstream tasks rather than just trusting perplexity or KLD. When running quants of my 405B model I ran JA MT-Bench evals and was surprised to find a bigger difference with FP8-Dynamic than with IQ3_M.
> @icsy7867 I know you're just theory-crafting instead of running tests, but see my PRO 6000 TensorRT/NVFP4 benchmark below: there is zero throughput benefit from NVFP4. Maybe related to NVIDIA/TransformerEngine#2255 - I never use TensorRT and it's impossible to build, so I just used the latest Docker image for my tests (tensorrt-llm/release:1.2.0rc0), but I've put my full scripts/details online so it's easy for anyone to rent any GPU they want and check any configuration/variation for themselves.
I appreciate the edit you made there. You aren't wrong; I wish I had a Blackwell GPU to test. But I am surprised the PRO 6000 doesn't get a speedup there from the FP4 tensor cores. Your data is much appreciated though, thanks.
-
@ggerganov Are there llama.cpp benchmarks for the AGX Thor? It seems to be a similar offering, but NVIDIA markets it as twice as fast.
There is no official detailed spec sheet for the DGX Spark to make a comparison to the Thor (2560 CUDA cores and 92 tensor cores), but NVIDIA claims 2 PFLOPS (sparse FP4) for the Thor and 1 PFLOPS (sparse FP4) for the Spark.
I guess this might only affect batching, but it would be interesting to know given that the Thor is cheaper than the Spark.
-
I'm not familiar with AGX Thor. But if you have one, you can easily run the same benchmarks on it.
-
Quick TLDR:
Thor is sm_110 (formerly sm_101) with the datacenter-style tensor cores, including tensor memory, and no ray tracing cores, while Spark is sm_121 with the full consumer Blackwell feature set.
Thor and Spark have relatively similar memory bandwidth. The Thor CPU is much slower.
Vector throughput on Thor is 1/3 of the DGX Spark's, but you get twice the matrix throughput.
Thor has 4 cursed Synopsys 25GbE NICs (set to 10GbE by default, see https://docs.nvidia.com/jetson/archives/r38.2/DeveloperGuide/SD/Kernel/Enable25GbEthernetOnQSFP.html as it doesn't auto-negotiate the link rate) exposed via a QSFP connector providing 4x25GbE, while Spark systems have a regular ConnectX-7.
Thor uses a downstream L4T software stack instead of the regular NVIDIA drivers, unlike Spark. But at least the CUDA SDK is the same, unlike prior Tegras. Oh, and you get less other I/O too.
Side note: it might be better to also consider GB10 systems from OEMs. Those are available for less than AGX Thor devkits too.
-
> I'm not familiar with AGX Thor. But if you have one, you can easily run the same benchmarks on it.

I don't have one, unfortunately; hoping whoever does will run those benchmarks.

> Vector throughput on Thor is 1/3 of the DGX Spark's, but you get twice the matrix throughput.

This is a very weird and interesting tradeoff.
-
> Thor is sm_110 (formerly sm_101) with the datacenter-style tensor cores, including tensor memory
@woachk does "tensor memory" here refer to TMEM?
-
Yes.
-
For those curious about Thor performance
(All models are the same as linked in the original benchmark with the same command)
llama.cpp git commit: f9fb33f
Jetpack 7.0 [L4T 38.2.2]
Docker container: nvcr.io/nvidia/pytorch:25.09-py3
MAXN and jetson_clocks enabled
gpt-oss-20b-gguf
# ./bin/llama-bench -m /workspace/models/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 2008.85 ± 4.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 | 60.85 ± 0.17 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1862.13 ± 4.80 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 55.03 ± 0.06 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1740.90 ± 3.24 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 53.58 ± 0.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 1446.75 ± 3.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 52.49 ± 1.94 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 1193.93 ± 0.72 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 48.33 ± 0.04 |

build: f9fb33f2 (6771)
Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF
# ./bin/llama-bench -m /workspace/models/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 1654.25 ± 1.80 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 | 44.26 ± 0.11 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1410.87 ± 2.22 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 39.46 ± 0.04 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1228.69 ± 1.78 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 36.88 ± 0.13 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 985.39 ± 7.04 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 33.55 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 686.45 ± 0.93 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 26.92 ± 0.05 |

build: f9fb33f2 (6771)
gpt-oss-120b
# ./bin/llama-bench -m /workspace/models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 | 967.20 ± 6.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 42.00 ± 0.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 932.85 ± 2.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 38.81 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 892.28 ± 2.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 39.22 ± 1.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 827.57 ± 1.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 37.77 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 677.70 ± 1.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 34.02 ± 0.02 |

build: f9fb33f2 (6771)
-
That commit only applies the change to if (prop.major == 12 && prop.minor == 1) {; I wonder if also adding it for compute capability 11.0 changes things.
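For illustration, here is a minimal sketch of what such an experiment could look like. It is not the actual ggml-cuda code: the compute-capability check is quoted from the comment above, but using cudaSetDeviceFlags(cudaDeviceScheduleSpin) as the way to force a spin-wait policy is an assumption about one possible mechanism, not necessarily what the upstream commit does.

```cpp
// Hypothetical sketch: extend the GB10-only check so Thor (compute 11.0)
// also opts into a spin-wait scheduling policy. The
// cudaSetDeviceFlags(cudaDeviceScheduleSpin) call is one assumed way to make
// the host spin instead of yielding while waiting on the GPU.
#include <cuda_runtime.h>
#include <cstdio>

static void enable_spin_schedule_if_supported(int device) {
 cudaDeviceProp prop{};
 if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
 return;
 }
 const bool is_gb10 = (prop.major == 12 && prop.minor == 1); // DGX Spark / GB10
 const bool is_thor = (prop.major == 11 && prop.minor == 0); // AGX Thor (added for the experiment)
 if (is_gb10 || is_thor) {
 cudaSetDevice(device);
 // Scheduling flags are typically set before any other CUDA work on the device.
 cudaSetDeviceFlags(cudaDeviceScheduleSpin);
 std::printf("spin schedule enabled for compute %d.%d\n", prop.major, prop.minor);
 }
}

int main() {
 enable_spin_schedule_if_supported(0);
 return 0;
}
```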
-
I did a quick one-off build where I removed the conditional around the scheduling block to force the spin schedule, and I do see a consistent improvement. Just looking at power draw, there is probably at least another 10-20% of untapped performance on Thor beyond moving it to the spin scheduler. Currently it looks like we are mostly CPU bound.
Llama-bench Test Results (Qwen3moe 30B)

| test | Default (t/s) | Spin (t/s) | Improvement (%) |
| --------------- | ------------: | ---------: | --------------: |
| pp2048 | 1654.25 | 1700.05 | 2.77 |
| pp2048 @ d4096 | 1410.87 | 1446.22 | 2.51 |
| pp2048 @ d8192 | 1228.69 | 1257.35 | 2.33 |
| pp2048 @ d16384 | 985.39 | 992.37 | 0.71 |
| pp2048 @ d32768 | 686.45 | 687.30 | 0.12 |
| tg32 | 44.26 | 45.67 | 3.19 |
| tg32 @ d4096 | 39.46 | 40.64 | 2.99 |
| tg32 @ d8192 | 36.88 | 38.09 | 3.28 |
| tg32 @ d16384 | 33.55 | 33.62 | 0.21 |
| tg32 @ d32768 | 26.92 | 27.05 | 0.48 |
Average improvement: 1.86%
Best improvement: 3.28% (tg32 @ d8192)
Worst improvement: 0.12% (pp2048 @ d32768)
Llama-batched-bench Test Results
PP=4096:
Average throughput improvement: 2.03%
Best batch size improvement: B2 (4.48%)
Worst batch size improvement: B16 (0.06%)
PP=8192:
Average throughput improvement: 0.05%
Best batch size improvement: B32 (0.07%)
Worst batch size improvement: B16 (0.03%)
Spin schedule

Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes

Test: llama-bench

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 1700.05 ± 2.02 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 | 45.67 ± 0.11 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1446.22 ± 3.54 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 40.64 ± 0.05 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1257.35 ± 0.75 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 38.09 ± 0.09 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 992.37 ± 1.89 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 33.62 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 687.30 ± 0.48 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 27.05 ± 0.03 |

Test: llama-batched-bench

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 4096 | 32 | 1 | 4128 | 2.537 | 1614.38 | 0.789 | 40.54 | 3.327 | 1240.92 |
| 4096 | 32 | 2 | 8256 | 4.949 | 1655.30 | 1.301 | 49.18 | 6.250 | 1320.87 |
| 4096 | 32 | 4 | 16512 | 9.887 | 1657.09 | 1.663 | 76.98 | 11.550 | 1429.62 |
| 4096 | 32 | 8 | 33024 | 19.739 | 1660.11 | 2.289 | 111.86 | 22.027 | 1499.25 |
| 4096 | 32 | 16 | 66048 | 39.464 | 1660.65 | 3.279 | 156.14 | 42.743 | 1545.23 |
| 4096 | 32 | 32 | 132096 | 78.936 | 1660.49 | 5.033 | 203.46 | 83.968 | 1573.16 |
| 8192 | 32 | 1 | 8224 | 5.314 | 1541.47 | 0.839 | 38.14 | 6.153 | 1336.50 |
| 8192 | 32 | 2 | 16448 | 10.614 | 1543.68 | 1.396 | 45.86 | 12.009 | 1369.61 |
| 8192 | 32 | 4 | 32896 | 21.220 | 1544.24 | 1.888 | 67.79 | 23.108 | 1423.59 |
| 8192 | 32 | 8 | 65792 | 42.394 | 1545.87 | 2.792 | 91.68 | 45.187 | 1456.01 |
| 8192 | 32 | 16 | 131584 | 84.800 | 1545.66 | 4.206 | 121.73 | 89.006 | 1478.37 |
| 8192 | 32 | 32 | 263168 | 169.577 | 1545.87 | 6.867 | 149.11 | 176.444 | 1491.51 |
-
For prompt processing there's a lot more on the table, but that means switching to the tcgen05 MMA instructions (which are a separate instruction set from the regular tensor core one).
And there's also the matter of using lower-precision MMAs in general.
-
I believe Thor doesn't support tcgen05 because it doesn't have tensor memory.
-
Thor does have tensor memory; it uses the data centre tensor cores (it's sm_110[a]). Spark does not.
See https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma
-
Would love to see the accuracy of the same models on the main benchmarks when running on the DGX, as accuracy will vary across different HW and FW in addition to the speed.
As is clearly seen here: https://artificialanalysis.ai/models/gpt-oss-120b/providers
-
Please bench the full Qwen3 coder model
-
There aren't any measurable quality benefits compared to Q8_0, so I don't think there is any point in benching it, as it is most likely going to perform worse in terms of speed.
-
I am just impressed that it might run at all. Is there any bench on fine-tuning?
-
Would love to see this cluster setup in the comparison table too: the EXO Labs cluster with 2x DGX + Mac Studio.
https://blog.exolabs.net/nvidia-dgx-spark/
-
AFAICT this is vaporware.
-
On the subject of Spark and Thor, I have been looking for alternatives to TensorRT in the form of a Python-free, community-driven inference engine. I'm looking to leverage the NVFP4 tensor cores, and wonder if there are any projects or folks working to support those in llama.cpp?
-
The whole Blackwell product range, from the RTX 5050 up to the B200/B300, including the iGPUs.
-
That said: NVIDIA/TransformerEngine#2255
-
Just as an FYI, I don't have a Spark, but I tested NVFP4 on an RTX PRO 6000 (Llama 3.1 8B Instruct). NVFP4 w/ TensorRT does not perform better than llama.cpp at bs=1, and at higher concurrency it doesn't take the lead until c=32. I didn't test quality loss, but from a pure throughput perspective, I don't think the current NVFP4 implementation is particularly good. Certainly not worth all the custom quanting and other hassles...
-
@lhl what's the prefill sequence length in the profiles above?
My use case is prefill-only at seqlen > 300.
-
This is using a standard vLLM bench - ShareGPT w/ prefill 1024 and decode 128, I believe. If you have a specific use case, it's probably best to just try the device directly - I think they're available for a buck or two on Vast or RunPod.
I think the compute is particularly strong for a client card. For example, the PRO 6000 actually beats an H100 on our Whisper inference sweeps. (Still trains much slower though)
Here's my LLM sweep scripts (and raw results) btw: https://github.com/AUGMXNT/speed-benchmarking/tree/main/nvfp4
-
@ggerganov - what flags did you use to compile for DGX Spark? Also, did you set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1? It does seem to offload layers to the GPU properly, but nvtop/nvidia-smi shows host memory utilization growing to quite large numbers (more than 100GB, and then it all goes to GPU memory). In comparison, my Strix Halo PC loads the same model 5x faster.

My numbers:

Without GGML_CUDA_ENABLE_UNIFIED_MEMORY=1: model loading time 1 minute 44 seconds, using this command:

build/bin/llama-server -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -ngl 999 -ub 2048

Benchmarks:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048

build: 03792ad (6816)

With GGML_CUDA_ENABLE_UNIFIED_MEMORY=1: model loading time 49 seconds.

For comparison, on my GMKtec EVO-X2 (AMD AI Max+ 395), same llama.cpp build compiled with HIP: model loading time 25 seconds (8 seconds if still in caches!!!).

Any ideas? Your benchmarks look closer to what I'd expect from this device, and the long loading time makes me think it is doing some extra mallocs/copying.
-
Re: GDS - not sure what's going on there, but:
eugr@spark:~$ /usr/local/cuda/gds/tools/gdscheck.py -p
GDS release version: 1.15.1.6
libcufile version: 2.12
Platform: aarch64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe P2PDMA : Unsupported
NVMe : Unsupported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
ScaTeFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
-
Hmm, something I wonder about:
You should be able to rely on HMM (cudaDevAttrPageableMemoryAccess) on GB10 by "just" using the host memory mapping (even for mmap'd files) and not dealing with any CUDA memory allocation APIs at all. There will be some perf overhead because of 4KB pages, but I wonder if that alleviates the loading times...
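As a point of reference, here is a minimal sketch (an assumption-laden illustration, not the llama.cpp loading path) of how one could query whether the device can directly access pageable host memory, which is the capability the comment above relies on.

```cpp
// Sketch: check whether the GPU can access pageable host memory directly
// (HMM), i.e. whether a plain malloc()/mmap() pointer could in principle be
// handed to a kernel without cudaMalloc/cudaMemcpy.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
 int pageable = 0, uses_host_pt = 0;
 cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0);
 cudaDeviceGetAttribute(&uses_host_pt,
 cudaDevAttrPageableMemoryAccessUsesHostPageTables, 0);
 std::printf("pageable memory access: %d\n", pageable);
 std::printf("uses host page tables: %d\n", uses_host_pt);
 // If both report 1 (expected on unified-memory systems like GB10), mmap'd
 // model weights could be read by the GPU directly; whether that actually
 // improves loading time is the open question discussed above.
 return 0;
}
```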
-
I used to test directly registering the mmap'd buffer with HIP on AMD APUs, and it worked with no more penalty than what I get with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, but for AMD/HIP there is a special config on alloc. I don't have an NVIDIA APU to check what is needed for CUDA.
On AMD the gain/loss is not because of 4K pages, but because of the default CPU/GPU cache coherency behavior.
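For what it's worth, a rough sketch of what the CUDA-side equivalent of that HIP experiment might look like is below. This is untested on GB10, and the choice of flags (cudaHostRegisterMapped plus cudaHostRegisterReadOnly for a read-only file mapping) is an assumption, not what ggml currently does.

```cpp
// Sketch: register an existing mmap'd (read-only) region with the CUDA
// runtime so the GPU can address it, analogous to the HIP test described
// above. Hypothetical usage, error handling kept minimal.
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
 if (argc < 2) { std::printf("usage: %s model.gguf\n", argv[0]); return 1; }
 int fd = open(argv[1], O_RDONLY);
 struct stat st{};
 fstat(fd, &st);
 void* host = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

 // cudaHostRegisterReadOnly (recent CUDA versions) matches the read-only
 // file mapping above; whether this is a net win on GB10 is untested.
 cudaError_t err = cudaHostRegister(host, st.st_size,
 cudaHostRegisterMapped | cudaHostRegisterReadOnly);
 std::printf("cudaHostRegister: %s\n", cudaGetErrorString(err));

 if (err == cudaSuccess) {
 void* dev = nullptr;
 cudaHostGetDevicePointer(&dev, host, 0); // device-visible alias
 std::printf("device pointer: %p\n", dev);
 cudaHostUnregister(host);
 }
 munmap(host, st.st_size);
 close(fd);
 return 0;
}
```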
-
@ggerganov so, I got really curious and decided to test the kernel theory. I installed Fedora 43 beta on the DGX Spark, nvidia-open drivers, and CUDA 13 (using the RHEL 10 package). I needed to patch CUDA's math-operations.h, as the rsqrt/rsqrtf signature didn't match the one in GCC 15 that ships with Fedora 43, but other than that I was able to compile llama.cpp (and it was able to detect ARM features properly, something that didn't work on stock DGX OS!!!). And lo and behold, loading gpt-oss-120b from cold takes 19.5 seconds, slightly faster than Strix Halo!!!! A big improvement compared to 56 seconds on DGX OS! On the flip side, I'm getting worse performance on token generation:
-
Just in case: Strix Halo can run even faster:
$ sudo hdparm -t --direct /dev/nvme0n1
/dev/nvme0n1:
Timing O_DIRECT disk reads: 14690 MB in 3.00 seconds = 4896.16 MB/sec
-
Throughput is not the only metric.
We need to take into account that different HW/FW produce different accuracy for the same model.
And it can vary from a slight to a drastic difference.
Can someone test popular LLMs like gpt-oss?