
Performance of llama.cpp on NVIDIA DGX Spark #16578

ggerganov started this conversation in Show and tell

Overview

This document summarizes the performance of llama.cpp for various models on the new NVIDIA DGX Spark.

Benchmarks include:

  • Prefill (pp) and generation (tg) at various context depths (d)
  • Batch sizes of 1, 2, 4, 8, 16, and 32, typical for local environments

Models:

  • gpt-oss-20b
  • gpt-oss-120b
  • Qwen3 Coder 30B A3B
  • Qwen2.5 Coder 7B
  • Gemma 3 4B QAT
  • GLM 4.5 Air

Feel free to request additional benchmarks for models and use cases.

Benchmarks

Build with:

cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda -j

Using the following commands:

# sequential requests
llama-bench -m [model.gguf] -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
# parallel requests
llama-batched-bench -m [model.gguf] -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap
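
For anyone reproducing the numbers, a concrete invocation might look like the sketch below; the download step and the local paths are assumptions (any method of obtaining the GGUF works), while the flags and the build-cuda/bin binary location follow from the commands above:

# sketch: fetch one of the models listed above, then run the same sweep
huggingface-cli download ggml-org/gpt-oss-20b-GGUF --local-dir ./models/gpt-oss-20b
./build-cuda/bin/llama-bench -m ./models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
./build-cuda/bin/llama-batched-bench -m ./models/gpt-oss-20b/gpt-oss-20b-mxfp4.gguf -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap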


gpt-oss-20b

Model: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | pp2048 | 3608.14 ± 9.33 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | tg32 | 77.85 ± 0.40 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | pp2048 @ d4096 | 3354.22 ± 16.76 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | tg32 @ d4096 | 72.21 ± 0.73 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | pp2048 @ d8192 | 3153.73 ± 17.53 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | tg32 @ d8192 | 68.56 ± 0.73 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | pp2048 @ d16384 | 2668.77 ± 9.73 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | tg32 @ d16384 | 63.91 ± 0.05 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | pp2048 @ d32768 | 2070.54 ± 3.55 |
    | gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | 1 | 0 | tg32 @ d32768 | 55.79 ± 0.07 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 1.145 | 3578.71 | 0.438 | 73.03 | 1.583 | 2608.21 |
    | 4096 | 32 | 2 | 8256 | 2.288 | 3580.60 | 0.759 | 84.34 | 3.047 | 2709.81 |
    | 4096 | 32 | 4 | 16512 | 4.557 | 3595.50 | 0.952 | 134.46 | 5.509 | 2997.41 |
    | 4096 | 32 | 8 | 33024 | 9.120 | 3592.97 | 1.213 | 211.04 | 10.333 | 3195.96 |
    | 4096 | 32 | 16 | 66048 | 18.215 | 3597.90 | 1.682 | 304.33 | 19.897 | 3319.42 |
    | 4096 | 32 | 32 | 132096 | 36.423 | 3598.60 | 2.398 | 427.10 | 38.821 | 3402.72 |
    | 8192 | 32 | 1 | 8224 | 2.331 | 3514.61 | 0.467 | 68.50 | 2.798 | 2939.24 |
    | 8192 | 32 | 2 | 16448 | 4.639 | 3531.62 | 0.791 | 80.88 | 5.430 | 3028.83 |
    | 8192 | 32 | 4 | 32896 | 9.296 | 3524.86 | 0.997 | 128.43 | 10.293 | 3195.99 |
    | 8192 | 32 | 8 | 65792 | 18.577 | 3527.77 | 1.346 | 190.21 | 19.923 | 3302.31 |
    | 8192 | 32 | 16 | 131584 | 37.167 | 3526.54 | 1.942 | 263.69 | 39.109 | 3364.54 |
    | 8192 | 32 | 32 | 263168 | 74.256 | 3530.28 | 2.923 | 350.27 | 77.179 | 3409.83 |

gpt-oss-120b

Model: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | pp2048 | 1860.76 ± 4.22 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | tg32 | 55.33 ± 0.16 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | pp2048 @ d4096 | 1813.91 ± 6.94 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | tg32 @ d4096 | 51.73 ± 0.10 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | pp2048 @ d8192 | 1710.95 ± 3.51 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | tg32 @ d8192 | 48.86 ± 0.44 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | pp2048 @ d16384 | 1522.16 ± 5.37 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | tg32 @ d16384 | 45.31 ± 0.08 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | pp2048 @ d32768 | 1236.60 ± 3.44 |
    | gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | 1 | 0 | tg32 @ d32768 | 39.36 ± 0.04 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 2.247 | 1823.26 | 0.617 | 51.84 | 2.864 | 1441.42 |
    | 4096 | 32 | 2 | 8256 | 4.417 | 1854.73 | 1.171 | 54.65 | 5.588 | 1477.47 |
    | 4096 | 32 | 4 | 16512 | 8.843 | 1852.81 | 1.518 | 84.32 | 10.361 | 1593.71 |
    | 4096 | 32 | 8 | 33024 | 17.684 | 1852.99 | 2.040 | 125.50 | 19.724 | 1674.33 |
    | 4096 | 32 | 16 | 66048 | 35.389 | 1851.85 | 2.943 | 173.95 | 38.333 | 1723.02 |
    | 4096 | 32 | 32 | 132096 | 70.731 | 1853.11 | 4.390 | 233.24 | 75.121 | 1758.45 |
    | 8192 | 32 | 1 | 8224 | 4.503 | 1819.34 | 0.657 | 48.73 | 5.159 | 1593.97 |
    | 8192 | 32 | 2 | 16448 | 9.055 | 1809.46 | 1.245 | 51.42 | 10.299 | 1596.99 |
    | 8192 | 32 | 4 | 32896 | 17.928 | 1827.79 | 1.603 | 79.84 | 19.531 | 1684.31 |
    | 8192 | 32 | 8 | 65792 | 35.949 | 1823.02 | 2.250 | 113.79 | 38.199 | 1722.35 |
    | 8192 | 32 | 16 | 131584 | 71.856 | 1824.10 | 3.329 | 153.82 | 75.184 | 1750.15 |
    | 8192 | 32 | 32 | 263168 | 143.542 | 1826.25 | 5.253 | 194.95 | 148.795 | 1768.67 |

Qwen3 Coder 30B A3B

Model: https://huggingface.co/ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | pp2048 | 2938.67 ± 22.27 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | tg32 | 60.30 ± 0.23 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | pp2048 @ d4096 | 2529.18 ± 9.55 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | tg32 @ d4096 | 53.18 ± 0.04 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | pp2048 @ d8192 | 2253.00 ± 13.67 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | tg32 @ d8192 | 45.19 ± 0.41 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | pp2048 @ d16384 | 1796.26 ± 5.98 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | tg32 @ d16384 | 37.99 ± 0.05 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | pp2048 @ d32768 | 1253.38 ± 4.29 |
    | qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | 1 | 0 | tg32 @ d32768 | 28.35 ± 0.02 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 1.456 | 2813.86 | 0.602 | 53.14 | 2.058 | 2005.95 |
    | 4096 | 32 | 2 | 8256 | 2.886 | 2838.12 | 1.119 | 57.20 | 4.005 | 2061.29 |
    | 4096 | 32 | 4 | 16512 | 5.775 | 2836.82 | 1.547 | 82.72 | 7.323 | 2254.86 |
    | 4096 | 32 | 8 | 33024 | 11.519 | 2844.63 | 2.195 | 116.65 | 13.714 | 2408.06 |
    | 4096 | 32 | 16 | 66048 | 23.020 | 2846.94 | 3.204 | 159.81 | 26.224 | 2518.65 |
    | 4096 | 32 | 32 | 132096 | 46.073 | 2844.91 | 4.890 | 209.40 | 50.963 | 2592.02 |
    | 8192 | 32 | 1 | 8224 | 3.070 | 2668.09 | 0.713 | 44.91 | 3.783 | 2173.98 |
    | 8192 | 32 | 2 | 16448 | 6.124 | 2675.19 | 1.269 | 50.45 | 7.393 | 2224.80 |
    | 8192 | 32 | 4 | 32896 | 12.261 | 2672.53 | 1.801 | 71.08 | 14.062 | 2339.40 |
    | 8192 | 32 | 8 | 65792 | 24.495 | 2675.48 | 2.700 | 94.82 | 27.195 | 2419.26 |
    | 8192 | 32 | 16 | 131584 | 48.973 | 2676.42 | 4.278 | 119.68 | 53.251 | 2471.02 |
    | 8192 | 32 | 32 | 263168 | 97.905 | 2677.54 | 6.976 | 146.80 | 104.880 | 2509.22 |

Qwen2.5 Coder 7B

Model: https://huggingface.co/ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | pp2048 | 2277.32 ± 3.48 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | tg32 | 29.09 ± 0.02 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | pp2048 @ d4096 | 2091.33 ± 8.73 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | tg32 @ d4096 | 28.12 ± 0.03 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | pp2048 @ d8192 | 1905.85 ± 5.89 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | tg32 @ d8192 | 27.33 ± 0.01 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | pp2048 @ d16384 | 1591.53 ± 6.30 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | tg32 @ d16384 | 25.89 ± 0.01 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | pp2048 @ d32768 | 1295.05 ± 2.95 |
    | qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | 1 | 0 | tg32 @ d32768 | 22.73 ± 0.04 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 1.845 | 2220.63 | 1.136 | 28.17 | 2.980 | 1385.06 |
    | 4096 | 32 | 2 | 8256 | 3.670 | 2231.93 | 1.250 | 51.20 | 4.920 | 1677.91 |
    | 4096 | 32 | 4 | 16512 | 7.334 | 2233.91 | 1.371 | 93.37 | 8.705 | 1896.82 |
    | 4096 | 32 | 8 | 33024 | 14.630 | 2239.86 | 1.580 | 161.98 | 16.210 | 2037.26 |
    | 4096 | 32 | 16 | 66048 | 29.266 | 2239.31 | 2.065 | 247.96 | 31.331 | 2108.07 |
    | 4096 | 32 | 32 | 132096 | 58.567 | 2237.98 | 2.752 | 372.16 | 61.319 | 2154.26 |
    | 8192 | 32 | 1 | 8224 | 3.778 | 2168.30 | 1.173 | 27.28 | 4.951 | 1661.08 |
    | 8192 | 32 | 2 | 16448 | 7.560 | 2167.25 | 1.340 | 47.77 | 8.899 | 1848.21 |
    | 8192 | 32 | 4 | 32896 | 15.114 | 2168.07 | 1.535 | 83.36 | 16.649 | 1975.82 |
    | 8192 | 32 | 8 | 65792 | 30.224 | 2168.32 | 1.863 | 137.38 | 32.088 | 2050.37 |
    | 8192 | 32 | 16 | 131584 | 60.552 | 2164.62 | 2.655 | 192.84 | 63.207 | 2081.80 |
    | 8192 | 32 | 32 | 263168 | 121.060 | 2165.41 | 3.867 | 264.84 | 124.927 | 2106.58 |

Gemma 3 4B QAT

Model: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | pp2048 | 5693.38 ± 9.40 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | tg32 | 80.58 ± 0.20 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | pp2048 @ d4096 | 5250.64 ± 14.32 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | tg32 @ d4096 | 68.99 ± 1.01 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | pp2048 @ d8192 | 4926.56 ± 39.56 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | tg32 @ d8192 | 67.82 ± 0.15 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | pp2048 @ d16384 | 4493.57 ± 42.72 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | tg32 @ d16384 | 64.30 ± 0.17 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | pp2048 @ d32768 | 3779.74 ± 35.74 |
    | gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | 1 | 0 | tg32 @ d32768 | 58.23 ± 0.08 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 0.705 | 5807.78 | 0.462 | 69.25 | 1.167 | 3536.14 |
    | 4096 | 32 | 2 | 8256 | 1.395 | 5871.62 | 0.576 | 111.02 | 1.972 | 4187.31 |
    | 4096 | 32 | 4 | 16512 | 2.779 | 5896.15 | 0.665 | 192.52 | 3.444 | 4794.94 |
    | 4096 | 32 | 8 | 33024 | 5.549 | 5904.79 | 0.893 | 286.57 | 6.443 | 5125.79 |
    | 4096 | 32 | 16 | 66048 | 11.091 | 5908.83 | 1.340 | 381.98 | 12.432 | 5312.92 |
    | 4096 | 32 | 32 | 132096 | 22.149 | 5917.67 | 2.100 | 487.69 | 24.249 | 5447.50 |
    | 8192 | 32 | 1 | 8224 | 1.421 | 5764.46 | 0.472 | 67.75 | 1.893 | 4343.47 |
    | 8192 | 32 | 2 | 16448 | 2.826 | 5797.05 | 0.642 | 99.67 | 3.468 | 4742.27 |
    | 8192 | 32 | 4 | 32896 | 5.628 | 5821.92 | 0.799 | 160.14 | 6.428 | 5117.86 |
    | 8192 | 32 | 8 | 65792 | 11.250 | 5825.54 | 1.172 | 218.37 | 12.422 | 5296.38 |
    | 8192 | 32 | 16 | 131584 | 22.476 | 5831.69 | 1.902 | 269.20 | 24.378 | 5397.71 |
    | 8192 | 32 | 32 | 263168 | 44.913 | 5836.67 | 3.224 | 317.61 | 48.137 | 5467.02 |

GLM 4.5 Air

Model: https://huggingface.co/unsloth/GLM-4.5-Air-GGUF/tree/main

  • llama-bench

    | model | size | params | fa | mmap | test | t/s |
    | --- | ---: | ---: | -: | ---: | ---: | ---: |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | pp2048 | 854.99 ± 1.55 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | tg32 | 22.98 ± 0.03 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | pp2048 @ d4096 | 768.20 ± 0.64 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | tg32 @ d4096 | 20.44 ± 0.00 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | pp2048 @ d8192 | 684.72 ± 2.02 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | tg32 @ d8192 | 19.30 ± 0.02 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | pp2048 @ d16384 | 571.49 ± 0.93 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | tg32 @ d16384 | 16.83 ± 0.01 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | pp2048 @ d32768 | 419.47 ± 0.88 |
    | glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | 1 | 0 | tg32 @ d32768 | 13.47 ± 0.01 |

    build: 73a48c9 (6845)

  • llama-batched-bench

    | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 4096 | 32 | 1 | 4128 | 5.051 | 810.96 | 1.597 | 20.03 | 6.648 | 620.92 |
    | 4096 | 32 | 2 | 8256 | 9.818 | 834.40 | 2.725 | 23.48 | 12.543 | 658.20 |
    | 4096 | 32 | 4 | 16512 | 19.677 | 832.65 | 3.853 | 33.22 | 23.530 | 701.74 |
    | 4096 | 32 | 8 | 33024 | 39.335 | 833.04 | 6.459 | 39.63 | 45.795 | 721.13 |
    | 4096 | 32 | 16 | 66048 | 78.663 | 833.12 | 12.209 | 41.94 | 90.872 | 726.82 |
    | 8192 | 32 | 1 | 8224 | 10.431 | 785.35 | 1.780 | 17.98 | 12.211 | 673.48 |
    | 8192 | 32 | 2 | 16448 | 20.863 | 785.30 | 3.198 | 20.01 | 24.062 | 683.58 |
    | 8192 | 32 | 4 | 32896 | 41.682 | 786.15 | 4.570 | 28.01 | 46.252 | 711.23 |
    | 8192 | 32 | 8 | 65792 | 83.441 | 785.42 | 8.505 | 30.10 | 91.945 | 715.56 |
    | 8192 | 32 | 16 | 131584 | 166.869 | 785.48 | 18.279 | 28.01 | 185.148 | 710.70 |



Replies: 15 comments 56 replies

Comment options

Thanks for the benchmark! I would like to request an additional benchmark for a very popular model, GLM-4.5-Air-FP8:
https://huggingface.co/zai-org/GLM-4.5-Air-FP8

and quants for it:

1 reply
Comment options

Saw the benchmark results. Thank you so much for the work! Appreciate it very much.

Comment options

Hi. It would be great to see a Qwen Next 80B benchmark for these two models:

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
(Has acceptable t/s even on CPU... I'm not sure if this one runs on llama.cpp)

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
(Official quants)

Thanks.

2 replies
Comment options

Not supported yet; there is an open PR for it currently.

Comment options

Hi. It would be great to see a Qwen Next 80B benchmark for these two models:

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (Has acceptable t/s even on CPU... I'm not sure if this one runs on llama.cpp)

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (Official quants)

Thanks.

Yeah I really want to see the performance of a specific model comparing full 16 bit precision, Q8, Q4, FP4 and FP8.

Nonetheless, thank you for the wonderful data!

Comment options

Getting similar performance with my Framework Desktop. Thanks for helping with my FOMO.

12 replies
Comment options

Can someone please help explain this to me? I am not trying to bash this machine; I am just trying to understand the justification for paying almost twice as much for the same performance with similar specs.

I'm sure the ConnectX-7 200Gb networking has something to do with the pricing difference :)

Comment options

btw, if you want GB10, it's most likely a much better choice to buy the ASUS GB10 system for $1k less (at least that's what I did) - the DGX Spark is more expensive, but it's not the only choice

Interesting, the ASUS GB10 seems to ship with a 240W power adapter, much higher than the DGX Spark's. I wonder if you will get more performance given the higher power intake.

Comment options

btw, if you want GB10, it's most likely a much better choice to buy the ASUS GB10 system for $1k less (at least that's what I did) - the DGX Spark is more expensive, but it's not the only choice

Interesting, the ASUS GB10 seems to ship with a 240W power adapter, much higher than the DGX Spark's. I wonder if you will get more performance given the higher power intake.

I haven't seen the specs, but it's possible ASUS just used a power adapter with a high enough rating for the device. For example, I can plug a compatible 90-watt power adapter into my 45-watt laptop; it will pull what it needs to.

Comment options

@bartlettroscoe i benched gpt-oss 120b on Framework Desktop a couple months ago: geerlingguy/ai-benchmarks#21 (comment)

Comment options

with "correct" rocm and build I get:

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | -: | ---: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp1 | 45.40 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2 | 57.58 ± 0.95 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp3 | 74.03 ± 2.34 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp4 | 90.93 ± 2.95 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp8 | 142.31 ± 5.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp12 | 173.14 ± 12.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp16 | 205.43 ± 6.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp24 | 235.43 ± 11.38 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp32 | 234.24 ± 10.83 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp48 | 216.49 ± 10.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp64 | 311.52 ± 7.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp96 | 386.08 ± 10.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp128 | 446.85 ± 6.77 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp192 | 509.42 ± 8.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp256 | 594.22 ± 9.46 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp384 | 698.31 ± 3.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp512 | 763.53 ± 4.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp768 | 845.23 ± 6.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp1024 | 927.17 ± 1.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp1536 | 987.73 ± 1.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp2048 | 1017.17 ± 4.10 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp3072 | 939.48 ± 2.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp4096 | 953.72 ± 1.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | tg16 | 45.43 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 999 | 2048 | 1 | 0 | pp512+tg64 | 264.68 ± 0.82 |
Comment options

Can you run the classic Llama 2 7B Q4_0 so it can be compared on the chart?

0 replies
Comment options

Super interesting, thanks for sharing, Georgi!

llama-bench -m [model.gguf] -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048

Could you please help me understand: Does "-d" mean KV cache length before the "-p" prefill happens? What does "-ub" define, eg batch size?

1 reply
Comment options

ggerganov Oct 15, 2025
Maintainer Author

Does "-d" mean KV cache length before the "-p" prefill happens?

Yes.

What does "-ub" define, eg batch size?

Yes.
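
To make that concrete, an illustrative sketch (the depth and sizes below are arbitrary, not from the benchmarks above):

# -d pre-fills the KV cache to the given depth, then the -p (prefill) and -n (generation)
# measurements are taken on top of it; -ub sets the batch size used to submit prompt tokens
llama-bench -m [model.gguf] -fa 1 -d 8192 -p 2048 -n 32 -ub 2048
# reports pp2048 @ d8192 and tg32 @ d8192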

Comment options

Could you add llama2-7b result to #15013?

0 replies
Comment options

Awesome, thank you!
So for gpt-oss-120B, around 35 tokens/s on the DGX Spark.
On vLLM, with 131k context and at almost any length, I'm getting around 180 tokens/s on a 300W RTX 6000 96GB Max-Q edition.

So what's the point of a DGX Spark? Sure, it has 128GB of memory, but I can offload bigger models between 96GB of VRAM and the rest to normal RAM (CPU)...
So in the end I can run even bigger models, and even faster than the DGX could.

It's too expensive for what it offers. If the DGX Spark were around 2k, like the Ryzen Max 395+ mini PCs, it would be fine and okay.
But for 4k usd/eur it's absolutely senseless...

PS: And a Mac Mini/Studio is a much better option at 4k usd/eur, compared to a DGX Spark.

9 replies
Comment options

Guys, please don't take FP4 or FP8 as a win.

Let me explain:
I compare embedding models in different quantisations (for my project at work).

Comparing embedding models is actually great, because you can simply query the resulting vector database and see the quantisation impact.

From my tests, no matter which model, be it Qwen3-Embedding or BGE-M3 or anything else, the impact of quantisation is huge!

FP32 is amazing
BF16 is still amazing
int8/Q8 = you already see a degradation because the results start to differ, but only 5-10% of the results are different.
Q4 = 50% of the results are different, an almost unusable model

So you guys want to tell me that FP4 is a win?
In my opinion FP8 is fine and usable, but FP4 will be unusable crap.
No matter what the marketing says, 1% quality loss is a huge lie!!!

I didn't test FP4 though, not even FP8, so I can't say for sure.
But from my experience with all other quantisations, FP4 should be crap.

Cheers!

Comment options

It depends on the model. In many cases, in my experience, FP4 does a fantastic job. Also, NVFP4 has the potential to be amazing.

So is it situational? Sure, it can be. But I don't think it's something that can be ignored.

Also, FP8 is great; I have found little reason not to use it.

Comment options

I'd agree that everyone should eval for their particular downstream tasks rather than just trusting perplexity or KLD. When running quants on my 405B model I ran JA MT-Bench evals and was surprised to find a bigger difference with FP8-Dynamic than IQ3_M.

@icsy7867 I know you're just theory-crafting instead of running tests, but see my PRO 6000 TensorRT/NVFP4 benchmark below - there is zero throughput benefit from NVFP4. Maybe it's related to NVIDIA/TransformerEngine#2255 - I never use TensorRT and it's impossible to build, so I just used the latest docker image for my tests (tensorrt-llm/release:1.2.0rc0), but I've put my full scripts/details online so it's easy for anyone to rent any GPU they want to check any configuration/variation for themselves.

#16578 (reply in thread)

Comment options

Yes, it really depends on the model. For example, here is what I get for Mistral Small:

BF16 Q8_0_L Q8_0 Q8_0 Q8_0 Q6_K Q6_K Q8_0 Q5_K_M Q4_K_M Q3_K_M
Mean PPL 5.377047 5.417646 5.428002 5.429658 5.433468 5.432926 5.448926 5.521099 5.798507
Mean KLD 0.008340 0.010369 0.010459 0.012241 0.012291 0.014935 0.027426 0.079385
Maximum KLD 2.048998 3.975800 1.263743 5.553815 5.662407 3.943127 4.050639 7.999546
99.9% KLD 0.204782 0.223453 0.219347 0.247532 0.250371 0.367634 0.993010 2.745419
99.0% KLD 0.078322 0.087357 0.087095 0.099235 0.099381 0.123670 0.250287 0.844125
95.0% KLD 0.032427 0.037600 0.038312 0.043401 0.043684 0.050811 0.088569 0.267027
90.0% KLD 0.019813 0.023899 0.024312 0.027942 0.028040 0.032904 0.055239 0.157323
Median KLD 0.003369 0.005111 0.005167 0.006354 0.006390 0.007717 0.013581 0.036258
10.0% KLD 0.000082 0.000128 0.000131 0.000159 0.000163 0.000188 0.000353 0.001116
5.0% KLD 0.000016 0.000027 0.000028 0.000036 0.000037 0.000043 0.000087 0.000311
1.0% KLD -0.000000 0.000001 0.000001 0.000003 0.000003 0.000003 0.000010 0.000045
0.1% KLD -0.000016 -0.000011 -0.000010 -0.000007 -0.000007 -0.000007 -0.000001 0.000008
Minimum KLD -0.000157 -0.000188 -0.000198 -0.000248 -0.000164 -0.000149 -0.000273 -0.000017
Same top p 95.971 94.905 94.947 94.457 94.394 94.030 92.372 88.237
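
(For context, PPL/KLD tables of this kind can be produced with llama.cpp's llama-perplexity tool; a rough sketch with placeholder file names - first save the reference-model logits, then compare a quant against them:)

# save logits of the reference (BF16) model
llama-perplexity -m mistral-small-bf16.gguf -f wiki.test.raw --kl-divergence-base base-logits.dat
# compute PPL and KLD of a quantized model against the saved logits
llama-perplexity -m mistral-small-q4_k_m.gguf -f wiki.test.raw --kl-divergence-base base-logits.dat --kl-divergence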
Comment options

I'd agree that everyone should eval for their particular downstream tasks rather than just trusting perplexity or KLD. When running quants on my 405B model I ran JA MT-Bench evals and was surprised to find a bigger difference with FP8-Dynamic than IQ3_M.

@icsy7867 I know you're just theory-crafting instead of running tests, but see my PRO 6000 TensorRT/NVFP4 benchmark below - there is zero throughput benefit from NVFP4. Maybe it's related to NVIDIA/TransformerEngine#2255 - I never use TensorRT and it's impossible to build, so I just used the latest docker image for my tests (tensorrt-llm/release:1.2.0rc0), but I've put my full scripts/details online so it's easy for anyone to rent any GPU they want to check any configuration/variation for themselves.

#16578 (reply in thread)

I appreciate the edit you did there. But you aren't wrong, I wish I had a Blackwell GPU to test. I am surprised the 6000 Pro doesn't get a speedup there from the FP4 tensor cores. Your data is much appreciated though, thanks.

Comment options

@ggerganov Are there llama.cpp benchmarks for the AGX Thor? It seems like a similar offering, but NVIDIA markets it as twice as fast.

There is no official detailed spec sheet for the DGX Spark to make a comparison to the Thor (2560 CUDA cores and 92 tensor cores), but NVIDIA claims 2 PFLOPS (sparse FP4) for the Thor and 1 PFLOPS (sparse FP4) for the Spark.
I guess this might only affect batching, but it would be interesting to know given that the Thor is cheaper than the Spark.

5 replies
Comment options

ggerganov Oct 15, 2025
Maintainer Author

I'm not familiar with AGX Thor. But if you have one, you can easily run the same benchmarks on it.

Comment options

Quick tldr:

Thor is sm_110 (formerly sm_101) with the datacenter-style tensor cores - including tensor memory. And no raytracing cores. While Spark is sm_121 with the full consumer Blackwell feature set.

Thor and Spark have relatively similar memory bandwidth. The Thor CPU is much slower.

Vector throughput on Thor is 1/3rd of the one on DGX Spark but you get twice the matrix throughput.

Thor has 4 cursed Synopsys 25GbE NICs (set to 10GbE by default, see https://docs.nvidia.com/jetson/archives/r38.2/DeveloperGuide/SD/Kernel/Enable25GbEthernetOnQSFP.html as it doesn't have auto-negotiation of the link rate) exposed via a QSFP connector providing 4x25GbE, while Spark systems have a regular ConnectX-7.

Thor uses a downstream L4T stack instead of regular NVIDIA drivers, unlike Spark. But at least the CUDA SDK is the same, unlike prior Tegras. Oh, and you get less other I/O too.

Side note: might be better to also consider GB10 systems from OEMs. Those are available for cheaper than AGX Thor devkits too.

Comment options

I'm not familiar with AGX Thor. But if you have one, you can easily run the same benchmarks on it.

I don't have one unfortunately, hoping whoever does will run those benchmarks.

Vector throughput on Thor is 1/3rd of the one on DGX Spark but you get twice the matrix throughput.

This is a very weird and interesting tradeoff.

Comment options

Thor is sm_110 (formerly sm_101) with the datacenter-style tensor cores - including tensor memory

@woachk does "tensor memory" here refer to TMEM?

Comment options

Yes.

Comment options

For those curious about Thor performance
(All models are the same as linked in the original benchmark with the same command)
llama.cpp git commit: f9fb33f
Jetpack 7.0 [L4T 38.2.2]
Docker container: nvcr.io/nvidia/pytorch:25.09-py3
MAXN and jetson_clocks enabled

gpt-oss-20b-gguf

# ./bin/llama-bench -m /workspace/models/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 2008.85 ± 4.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 | 60.85 ± 0.17 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1862.13 ± 4.80 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 55.03 ± 0.06 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1740.90 ± 3.24 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 53.58 ± 0.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 1446.75 ± 3.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 52.49 ± 1.94 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 1193.93 ± 0.72 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 48.33 ± 0.04 |
build: f9fb33f2 (6771)

Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

# ./bin/llama-bench -m /workspace/models/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 1654.25 ± 1.80 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 | 44.26 ± 0.11 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1410.87 ± 2.22 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 39.46 ± 0.04 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1228.69 ± 1.78 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 36.88 ± 0.13 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 985.39 ± 7.04 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 33.55 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 686.45 ± 0.93 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 26.92 ± 0.05 |
build: f9fb33f2 (6771)

gpt-oss-120b

# ./bin/llama-bench -m /workspace/models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 | 967.20 ± 6.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 42.00 ± 0.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 932.85 ± 2.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 38.81 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 892.28 ± 2.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 39.22 ± 1.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 827.57 ± 1.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 37.77 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 677.70 ± 1.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 34.02 ± 0.02 |
build: f9fb33f2 (6771)
9 replies
Comment options

That commit only applies the change inside if (prop.major == 12 && prop.minor == 1); I wonder if also adding it for 11.0 changes things.

Comment options

I did a quick one-off build where I removed the conditional around the scheduling block to force spin, and I do see a consistent improvement. Just looking at power draw, there is probably at least another 10-20% of performance untapped on Thor beyond moving it to the spin scheduler. Currently it looks like we are mostly CPU bound.

Llama-bench Test Results (Qwen3moe 30B)

| test | Default | Spin | Improvement (%) |
| --- | ---: | ---: | ---: |
| pp2048 | 1654.25 | 1700.05 | 2.77 |
| pp2048 @ d16384 | 985.39 | 992.37 | 0.71 |
| pp2048 @ d32768 | 686.45 | 687.30 | 0.12 |
| pp2048 @ d4096 | 1410.87 | 1446.22 | 2.51 |
| pp2048 @ d8192 | 1228.69 | 1257.35 | 2.33 |
| tg32 | 44.26 | 45.67 | 3.19 |
| tg32 @ d16384 | 33.55 | 33.62 | 0.21 |
| tg32 @ d32768 | 26.92 | 27.05 | 0.48 |
| tg32 @ d4096 | 39.46 | 40.64 | 2.99 |
| tg32 @ d8192 | 36.88 | 38.09 | 3.28 |

Average improvement: 1.86%
Best improvement: 3.28% (tg32 @ d8192)
Worst improvement: 0.12% (pp2048 @ d32768)

Llama-batched-bench Test Results

PP=4096:
Average throughput improvement: 2.03%
Best batch size improvement: B2 (4.48%)
Worst batch size improvement: B16 (0.06%)

PP=8192:
Average throughput improvement: 0.05%
Best batch size improvement: B32 (0.07%)
Worst batch size improvement: B16 (0.03%)

Spin schedule
 Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
Test: llama-bench
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 1700.05 ± 2.02 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 | 45.67 ± 0.11 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1446.22 ± 3.54 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 40.64 ± 0.05 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1257.35 ± 0.75 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 38.09 ± 0.09 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 992.37 ± 1.89 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 33.62 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 687.30 ± 0.48 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 27.05 ± 0.03 |
Test: llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 4096 | 32 | 1 | 4128 | 2.537 | 1614.38 | 0.789 | 40.54 | 3.327 | 1240.92 |
| 4096 | 32 | 2 | 8256 | 4.949 | 1655.30 | 1.301 | 49.18 | 6.250 | 1320.87 |
| 4096 | 32 | 4 | 16512 | 9.887 | 1657.09 | 1.663 | 76.98 | 11.550 | 1429.62 |
| 4096 | 32 | 8 | 33024 | 19.739 | 1660.11 | 2.289 | 111.86 | 22.027 | 1499.25 |
| 4096 | 32 | 16 | 66048 | 39.464 | 1660.65 | 3.279 | 156.14 | 42.743 | 1545.23 |
| 4096 | 32 | 32 | 132096 | 78.936 | 1660.49 | 5.033 | 203.46 | 83.968 | 1573.16 |
| 8192 | 32 | 1 | 8224 | 5.314 | 1541.47 | 0.839 | 38.14 | 6.153 | 1336.50 |
| 8192 | 32 | 2 | 16448 | 10.614 | 1543.68 | 1.396 | 45.86 | 12.009 | 1369.61 |
| 8192 | 32 | 4 | 32896 | 21.220 | 1544.24 | 1.888 | 67.79 | 23.108 | 1423.59 |
| 8192 | 32 | 8 | 65792 | 42.394 | 1545.87 | 2.792 | 91.68 | 45.187 | 1456.01 |
| 8192 | 32 | 16 | 131584 | 84.800 | 1545.66 | 4.206 | 121.73 | 89.006 | 1478.37 |
| 8192 | 32 | 32 | 263168 | 169.577 | 1545.87 | 6.867 | 149.11 | 176.444 | 1491.51 |
Comment options

For prompt processing there's a lot more on the table, but that means switching to tcgen05 MMA instructions (which is a separate instruction set from the regular tensor core one).

And there's also the matter of using lower-precision MMAs in general.

Comment options

I believe that Thor doesn't support tcgen05 because it doesn't have tensor-memory

Comment options

Thor does have tensor memory - it uses the data centre tensor cores (it's sm_110[a]), Spark does not.

See https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma

Comment options

Would love to see accuracy of the same models on the main benchmarks running on the DGX, as it will vary on different HW & FW in addition to the speed.

As is clearly seen here: https://artificialanalysis.ai/models/gpt-oss-120b/providers

0 replies
Comment options

Please bench the full Qwen3 coder model

2 replies
Comment options

ggerganov Oct 17, 2025
Maintainer Author

There aren't any measurable benefits in terms of quality compared to Q8_0, so I don't think there is any point in benching that, as it is most likely going to perform worse in terms of speed.

Comment options

I am just impressed that it might run at all. Is there any bench on fine-tuning?

Comment options

Would love to see this cluster setup in the comparison table too:
EXO Labs cluster with 2x DGX + Mac Studio
https://blog.exolabs.net/nvidia-dgx-spark/

1 reply
Comment options

ggerganov Oct 17, 2025
Maintainer Author

AFAICT this is vaporware.

Comment options

On the subject of Spark and Thor, I have been looking for an alternative to TensorRT in the form of a Python-free and community-driven inference engine. I'm looking to leverage the NVFP4 tensor cores, and wonder if there's any project or folks working to support those in llama.cpp?

6 replies
Comment options

The whole Blackwell product range, from the RTX 5050 onwards to the B200/300 through iGPUs

Comment options

Just as an FYI, I don't have a Spark but I tested NVFP4 on an RTX PRO 6000 (Llama 3.1 8B Instruct). NVFP4 w/ TensorRT does not perform better than llama.cpp at bs=1, and at higher concurrency doesn't take the lead until c=32.

I didn't test quality loss, but from a pure throughput perspective, I don't think the current NVFP4 implementation is particularly good. Certainly not worth all the custom quanting and other hassles...

| Config | Req/s | Prefill Tok/s | Decode Tok/s | Total Tok/s | Max Out Tok/s | TTFT mean | TTFT med | TTFT p99 | TPOT mean | TPOT med | TPOT p99 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| llama.cpp.q4_k_m | 1.65 | 1683.45 | 207.16 | 1890.61 | 223.00 | 74.17 | 75.75 | 85.71 | 4.36 | 4.22 | 8.40 |
| sglang.fp8-auto | 1.15 | 1173.85 | 142.83 | 1316.68 | 146.00 | 54.88 | 55.31 | 55.79 | 6.61 | 6.62 | 6.62 |
| sglang.fp8-dynamic | 1.04 | 1065.99 | 130.29 | 1196.28 | 132.00 | 55.91 | 56.30 | 57.13 | 7.28 | 7.29 | 7.29 |
| sglang.w4a16 | 1.56 | 1590.93 | 194.85 | 1785.78 | 204.00 | 53.69 | 54.10 | 54.79 | 4.74 | 4.75 | 4.76 |
| trt.fp8 | 0.59 | 605.67 | 74.33 | 680.01 | 76.00 | 39.94 | 40.24 | 40.76 | 13.24 | 13.24 | 13.27 |
| trt.nvfp4 | 0.60 | 608.22 | 74.38 | 682.61 | 76.00 | 30.91 | 31.05 | 31.31 | 13.30 | 13.30 | 13.34 |
| vllm.fp8-dynamic | 0.77 | 789.55 | 94.90 | 884.45 | 98.00 | 34.94 | 35.12 | 36.43 | 10.34 | 10.34 | 10.36 |
| vllm.w4a16 | 1.52 | 1549.83 | 189.81 | 1739.64 | 196.00 | 49.09 | 49.39 | 50.30 | 4.92 | 4.92 | 4.96 |
Comment options

@lhl what's the prefill sequence length in the profiles above?
My use case is prefill-only at seqlen > 300.

Comment options

This is using a standard vLLM bench - ShareGPT w/ prefill 1024 and decode 128, I believe. If you have a specific use case, it's probably best to just try the device directly - I think they're available for a buck or two on Vast or Runpod.

I think the compute is particularly strong for a client card. For example, the PRO 6000 actually beats an H100 on our Whisper inference sweeps. (Still trains much slower though)

Here's my LLM sweep scripts (and raw results) btw: https://github.com/AUGMXNT/speed-benchmarking/tree/main/nvfp4

Comment options

@ggerganov - what flags did you use to compile for the DGX Spark? Also, did you set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1?
I've just got the Spark, and I'm not getting the same performance numbers as you. Also, the model loading is super slow. Not sure what's going on; I'm probably missing something.

It does seem to offload layers to the GPU properly, but nvtop/nvidia-smi shows host memory utilization growing to quite large numbers (more than 100GB, and then it all goes to GPU memory). In comparison, my Strix Halo PC loads the same model 5x faster.

My numbers:

Without GGML_CUDA_ENABLE_UNIFIED_MEMORY=1:

Model loading time - 1 minute 44 seconds using this command:

build/bin/llama-server -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -ngl 999 -ub 2048

Benchmarks:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | -: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 | 1737.17 ± 81.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 45.87 ± 0.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1777.81 ± 5.92 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 43.41 ± 0.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1720.17 ± 8.49 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 41.52 ± 0.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 1512.23 ± 11.81 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 38.39 ± 0.15 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 1231.86 ± 6.14 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 34.29 ± 0.07 |

build: 03792ad (6816)

With GGML_CUDA_ENABLE_UNIFIED_MEMORY=1:

Model loading time: 49 seconds
Benchmarks:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | -: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 | 1672.33 ± 65.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 40.61 ± 0.38 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1661.97 ± 8.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 38.29 ± 0.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1587.22 ± 12.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 36.85 ± 0.42 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 1384.96 ± 6.77 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 34.62 ± 0.22 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 1124.23 ± 4.65 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 30.47 ± 0.08 |

For comparison, from my GMKTek Evo X2 (AMD AI MAX+ 395), same llama.cpp build, compiled with HIP:

Model loading time: 25 seconds (8 seconds if still in caches!!!)

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | -: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 | 999.59 ± 4.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 | 47.49 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 824.37 ± 1.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 44.23 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 703.42 ± 1.54 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 42.52 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d16384 | 514.89 ± 3.86 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d16384 | 39.71 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | pp2048 @ d32768 | 348.59 ± 2.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | tg32 @ d32768 | 35.39 ± 0.01 |

Any ideas? Your benchmarks look closer to what I'd expect from this device. And the long loading time makes me think that it is doing some extra mallocs/copying.
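
(A rough way to separate load time from compute, as a sketch only - the page-cache drop and env-var toggle are things to test, not a known fix; the model path is the one from the commands above:)

# time a cold load without, then with, the unified-memory allocation path
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
time build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -p 512 -n 16 -ub 2048
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
time GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -p 512 -n 16 -ub 2048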

8 replies
Comment options

Re: GDS - not sure what's going on there, but:

eugr@spark:~$ /usr/local/cuda/gds/tools/gdscheck.py -p
 GDS release version: 1.15.1.6
 libcufile version: 2.12
 Platform: aarch64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe P2PDMA : Unsupported
 NVMe : Unsupported
 NVMeOF : Unsupported
 SCSI : Unsupported
 ScaleFlux CSD : Unsupported
 NVMesh : Unsupported
 DDN EXAScaler : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS : Unsupported
 BeeGFS : Unsupported
 ScaTeFS : Unsupported
 WekaFS : Unsupported
 Userspace RDMA : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library : Not Loaded (libcufile_rdma.so)
 --rdma devices : Not configured
 --rdma_device_status : Up: 0 Down: 0
Comment options

Hmm, something that I wonder about.

You should be able to rely on HMM (cudaDevAttrPageableMemoryAccess) on GB10 by "just" using the host memory mapping (even for mmap'd files) and not dealing with any CUDA memory allocation APIs. The perf overhead will be there because of 4KB pages though, but I wonder if that alleviates the loading times...

Comment options

I used to test directly registering the mmap with HIP on an AMD APU, and it used to work with no more penalty than what I get with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, but for AMD/HIP there is a special config for the allocation. I don't have an NVIDIA APU to look at what is needed for CUDA.
On AMD the gain/loss is not because of 4K pages, but because of CPU/GPU cache coherency by default.

Comment options

@ggerganov so, I got really curious and decided to test the kernel theory.

Installed Fedora 43 beta on the DGX Spark, with the nvidia-open drivers and CUDA 13 (used the RHEL 10 package). I needed to patch CUDA's math-operations.h as the rsqrt/rsqrtf signature wasn't matching the one in the GCC 15 toolchain that comes with Fedora 43, but other than that I was able to compile llama.cpp (and it was able to detect ARM features properly - something that didn't work on stock DGX OS!!!).

And lo and behold, loading gpt-oss-120b from cold takes 19.5 seconds - slightly faster than Strix Halo!!!! A big improvement compared to 56 seconds on DGX OS!

On the flip side, I'm getting worse performance on token generation:

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | -: | ---: | ---: | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 1864.44 ± 3.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 41.79 ± 0.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 1730.84 ± 4.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 37.90 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 1628.49 ± 7.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 36.38 ± 0.10 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1395.37 ± 8.78 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 34.23 ± 0.01 |
Comment options

Just in case - the Strix Halo can read even faster:

$ sudo hdparm -t --direct /dev/nvme0n1
/dev/nvme0n1:
 Timing O_DIRECT disk reads: 14690 MB in 3.00 seconds = 4896.16 MB/sec
Comment options

Throughput is not the only metric.

We need to take into account that different HW/FW produce different accuracy for the same model,
and it can vary from a small to a drastic difference.

Can someone test popular LLMs like gpt-oss?

0 replies