Performance of llama.cpp on Nvidia CUDA #15013

olegshulyakov started this conversation in Show and tell

This is similar to the Performance of llama.cpp on Apple Silicon M-series, Performance of llama.cpp on AMD ROCm (HIP), and Performance of llama.cpp with Vulkan discussions, but for CUDA! I think it's good to consolidate and discuss our results here.

We'll be testing the Llama 2 7B model like the other threads to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4 GB GPU. You can download it here.

Instructions

Either run the commands below or download one of our CUDA releases. If you have multiple GPUs, please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER, unless the model is too big to fit in VRAM.

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
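If your machine has more than one GPU, the same command can be pinned to a single card with the flags mentioned above; a minimal sketch, assuming you want GPU 0:

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0

If you build llama.cpp from source instead of downloading a release, a CUDA build is typically configured with CMake along these lines (exact options may vary with your toolchain and llama.cpp version):

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j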

Share your llama-bench results in the comments, along with the git hash and the CUDA info string (the "Device 0: ..." line that llama-bench prints at startup). Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.

If multiple entries are posted for the same device, I'll prioritize newer commits with substantial CUDA updates; otherwise I'll pick the one with the highest overall score, at my discretion. Performance may vary with driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that memory speed and the number of memory channels will greatly affect inference speed!

CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)

| Chip | Memory | pp512 t/s | tg128 t/s | Commit | Thanks to |
| --- | --- | --- | --- | --- | --- |
| RTX 5090 | 32 GB / GDDR7 / 512 bit | 14073.41 ± 115.16 | 290.02 ± 1.10 | 8cf6b42 | @totaldev |
| RTX PRO 6000 Blackwell | 96 GB / GDDR7 / 512 bit | 14854.63 ± 22.73 | 274.20 ± 0.14 | 79c1160 | @Tom94 |
| H100 80 GB | 80 GB / HBM3 / 5120 bit | 9918.34 ± 176.97 | 267.81 ± 1.54 | 5143fa8 | @Hedede |
| A100 80 GB | 80 GB / HBM2e / 5120 bit | 4849.53 ± 8.94 | 190.88 ± 0.33 | 5143fa8 | @Hedede |
| RTX 4090 D | 24 GB / GDDR6X / 384 bit | 10293.86 ± 134.72 | 189.33 ± 0.19 | 79c1160 | @autonomous-AI-lab |
| RTX 4090 | 24 GB / GDDR6X / 384 bit | 11992.70 ± 107.99 | 186.21 ± 0.13 | 2241453 | @lhl |
| RTX 5080 | 16 GB / GDDR7 / 256 bit | 8297.36 ± 9.50 | 181.99 ± 0.42 | 8a4280c | @Hedede |
| RTX 6000 Ada | 48 GB / GDDR6 / 384 bit | 9229.23 ± 101.78 | 176.07 ± 0.26 | b8e09f0 | @Hedede |
| RTX 3090 Ti | 24 GB / GDDR6X / 384 bit | 6567.49 ± 20.30 | 171.19 ± 3.98 | 9c35706 | @slaren |
| RTX 3090 | 24 GB / GDDR6X / 384 bit | 5174.69 ± 21.83 | 158.16 ± 0.21 | c76b420 | @m18coppola |
| L40 | 48 GB / GDDR6 / 384 bit | 8870.49 ± 378.76 | 152.01 ± 0.28 | ee09828 | @Hedede |
| RTX 4080 SUPER | 16 GB / GDDR6X / 256 bit | 8125.15 ± 41.05 | 148.33 ± 0.20 | 81086cd | @zacharyarnaise |
| RTX 4080 | 16 GB / GDDR6X / 256 bit | 8031.64 ± 26.49 | 142.49 ± 0.16 | 20638e4 | @Ristovski |
| RTX 3080 | 10 GB / GDDR6X / 320 bit | 5013.86 ± 24.80 | 139.65 ± 0.99 | 9c35706 | @slaren |
| RTX A6000 | 48 GB / GDDR6 / 384 bit | 4913.93 ± 6.79 | 138.73 ± 2.75 | 4795c91 | @Hedede |
| RTX 4070 Ti SUPER | 16 GB / GDDR6X / 256 bit | 6924.53 ± 13.87 | 132.26 ± 0.16 | 9c35706 | @Ristovski |
| RTX A5000 | 24 GB / GDDR6 / 384 bit | 4028.16 ± 19.14 | 130.07 ± 2.74 | e5155e6 | @Hedede |
| Tesla V100 | 32 GB / HBM2 / 4096 bit | 3042.64 ± 40.71 | 129.08 ± 0.05 | 51f5a45 | @Hedede |
| RTX 5070 | 12 GB / GDDR7 / 192 bit | 5184.75 ± 18.70 | 127.54 ± 0.46 | | @Spyro000 |
| Titan V | 12 GB / HBM2 / 3072 bit | 2617.46 ± 2.10 | 108.79 ± 0.05 | e56abd2 | @Hedede |
| RTX 2080 Ti | 11 GB / GDDR6 / 352 bit | 2890.66 ± 2.42 | 107.51 ± 0.21 | 9c35706 | @ariya |
| Quadro RTX 6000 | 24 GB / GDDR6 / 384 bit | 2751.18 ± 19.43 | 102.77 ± 0.04 | b8e09f0 | @Hedede |
| Quadro RTX 8000 | 48 GB / GDDR6 / 384 bit | 2709.95 ± 3.35 | 102.68 ± 0.03 | b8e09f0 | @Hedede |
| RTX A4500 | 20 GB / GDDR6 / 320 bit | 2827.20 ± 66.43 | 97.32 ± 2.80 | 5cdb27e | @aleksyx |
| RTX 5060 Ti | 16 GB / GDDR7 / 128 bit | 3737.25 ± 6.79 | 90.94 ± 0.02 | 89d1029 | @mike-llamacpp |
| RTX 2070 SUPER | 8 GB / GDDR6 / 256 bit | 2088.34 ± 1.94 | 88.06 ± 0.28 | bc07349 | @phstudy |
| RTX A4000 | 16 GB / GDDR6 / 256 bit | 2684.06 ± 15.28 | 83.77 ± 0.37 | 65349f2 | @TinyServal |
| Titan Xp | 12 GB / GDDR5X / 384 bit | 1154.96 ± 1.46 | 76.08 ± 0.08 | c4510dc | @Hedede |
| RTX 3060 | 12 GB / GDDR6 / 192 bit | 2137.50 ± 10.12 | 75.57 ± 0.07 | baa9255 | @QuantiusBenignus |
| RTX 4060 Ti | 8 GB / GDDR6 / 128 bit | 3394.63 ± 7.44 | 63.86 ± 0.01 | 89d1029 | @mike-llamacpp |
| GTX 1080 Ti | 11 GB / GDDR5X / 352 bit | 1084.41 ± 3.01 | 62.49 ± 0.06 | 9c35706 | @ariya |
| RTX A4000 Ada | 20 GB / GDDR6 / 160 bit | 2779.77 ± 9.91 | 61.83 ± 0.04 | a74a0d6 | @sdwolfz |
| RTX 2060 SUPER | 8 GB / GDDR6 / 256 bit | 1420.24 ± 1.95 | 60.04 ± 0.01 | 5c0eb5e | @ggerganov |
| DGX Spark | 128 GB / LPDDR5x | 3062.31 ± 11.02 | 57.21 ± 0.06 | 5acd455 | @ggerganov |
| Tesla P40 | 24 GB / GDDR5 / 384 bit | 1007.42 ± 1.23 | 54.74 ± 0.07 | c76b420 | @m18coppola |
| RTX 2000 Ada | 16 GB / GDDR6 / 128 bit | 1956.22 ± 7.74 | 50.62 ± 0.04 | 756cfea | @DigitalRudeness |
| Tesla P100 | 16 GB / HBM2 / 4096 bit | 703.27 ± 3.21 | 50.20 ± 0.01 | 9ef5369 | @VinnyG9 |
| GTX 1660 Ti Mobile | 6 GB / GDDR5 / 192 bit | 520.25 ± 2.00 | 46.46 ± 0.21 | 912ff8c | @pt13762104 |
| Tesla T4 | 16 GB / GDDR6 / 256 bit | 1219.06 ± 4.18 | 46.38 ± 0.73 | d32e03f | @pt13762104 |
| RTX 4050 Laptop | 6 GB / GDDR6 / 96 bit | 1725.85 ± 17.85 | 43.72 ± 0.41 | d79d8f3 | @TimCabbage |
| GTX 1660 | 6 GB / GDDR5 / 192 bit | 148.91 ± 0.01 | 41.35 ± 0.02 | 9515c61 | @ariya |
| GTX 1070 Ti | 8 GB / GDDR5 / 256 bit | 714.44 ± 2.04 | 37.82 ± 0.02 | 79c1160 | @pebaryan |
| Tesla P4 | 8 GB / GDDR5 / 256 bit | 514.53 ± 3.06 | 33.29 ± 0.00 | c76b420 | @m18coppola |
| P106-100 | 6 GB / GDDR5 / 192 bit | 406.94 ± 0.25 | 30.40 ± 0.02 | 5fd160b | @pebaryan |
| RTX 3500 Mobile Ada | 12 GB / GDDR6 / 192 bit | 1406.43 ± 52.64 | 30.23 ± 0.23 | 1062205 | @luisaforozco |
| GTX 1060 | 6 GB / GDDR5 / 192 bit | 416.85 ± 1.75 | 27.79 ± 0.02 | 5fd160b | @pebaryan |
| Quadro T1000 | 4 GB / GDDR5 / 128 bit | 79.44 ± 0.01 | 27.82 ± 0.18 | f6da8cb | @hanabu |
| Quadro P2000 | 5 GB / GDDR5 / 160 bit | 309.30 ± 0.05 | 23.63 ± 0.00 | baa9255 | @TinyServal |
| Quadro P1000 | 4 GB / GDDR5 / 128 bit | 183.40 ± 0.11 | 13.99 ± 0.13 | 1e74897 | @aleksyx |
| Tesla K80 | 12 GB / GDDR5 / 384 bit | 133.14 ± 0.55 | 13.80 ± 0.02 | 32732f2 | @pebaryan |

CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)

| Chip | Memory | pp512 t/s | tg128 t/s | Commit | Thanks to |
| --- | --- | --- | --- | --- | --- |
| RTX 5090 | 32 GB / GDDR7 / 512 bit | 14970.15 ± 381.06 | 300.40 ± 0.28 | 8cf6b42 | @totaldev |
| RTX PRO 6000 Blackwell | 96 GB / GDDR7 / 512 bit | 16618.98 ± 20.66 | 281.11 ± 0.41 | 5143fa8 | @Tom94 |
| H100 80 GB | 80 GB / HBM3 / 5120 bit | 11263.29 ± 98.34 | 280.74 ± 1.17 | 5143fa8 | @Hedede |
| A100 80 GB | 80 GB / HBM2e / 5120 bit | 5285.96 ± 6.58 | 200.90 ± 0.12 | 5143fa8 | @Hedede |
| RTX 4090 D | 24 GB / GDDR6X / 384 bit | 12506.97 ± 11.51 | 191.57 ± 0.03 | 79c1160 | @autonomous-AI-lab |
| RTX 4090 | 24 GB / GDDR6X / 384 bit | 14770.63 ± 102.93 | 188.96 ± 0.05 | 2241453 | @lhl |
| RTX 5080 | 16 GB / GDDR7 / 256 bit | 9487.70 ± 21.89 | 184.68 ± 0.05 | 8a4280c | @Hedede |
| RTX 6000 Ada | 48 GB / GDDR6 / 384 bit | 10576.85 ± 530.21 | 179.47 ± 0.32 | b8e09f0 | @Hedede |
| RTX 3090 Ti | 24 GB / GDDR6X / 384 bit | 6924.01 ± 10.76 | 172.26 ± 1.31 | 9c35706 | @slaren |
| RTX 3090 | 24 GB / GDDR6X / 384 bit | 5560.06 ± 16.28 | 161.89 ± 0.18 | c76b420 | @m18coppola |
| L40 | 48 GB / GDDR6 / 384 bit | 10097.64 ± 671.22 | 153.76 ± 0.12 | ee09828 | @Hedede |
| RTX 4080 SUPER | 16 GB / GDDR6X / 256 bit | 9439.01 ± 56.75 | 147.48 ± 1.41 | 81086cd | @zacharyarnaise |
| RTX 4080 | 16 GB / GDDR6X / 256 bit | 9205.93 ± 22.31 | 143.47 ± 0.02 | 20638e4 | @Ristovski |
| RTX A6000 | 48 GB / GDDR6 / 384 bit | 5662.39 ± 13.87 | 144.87 ± 0.18 | 4795c91 | @Hedede |
| RTX 3080 | 10 GB / GDDR6X / 320 bit | 5569.56 ± 14.04 | 139.95 ± 0.95 | 9c35706 | @slaren |
| RTX A5000 | 24 GB / GDDR6 / 384 bit | 4552.15 ± 9.68 | 135.83 ± 0.11 | e5155e6 | @Hedede |
| Tesla V100 | 32 GB / HBM2 / 4096 bit | 2973.78 ± 3.62 | 134.76 ± 0.02 | 51f5a45 | @Hedede |
| RTX 4070 Ti SUPER | 16 GB / GDDR6X / 256 bit | 7612.32 ± 37.35 | 132.85 ± 0.31 | 9c35706 | @Ristovski |
| RTX 5070 | 12 GB / GDDR7 / 192 bit | 5783.44 ± 36.95 | 128.21 ± 2.52 | | @Spyro000 |
| Titan V | 12 GB / HBM2 / 3072 bit | 2481.25 ± 1.31 | 112.17 ± 0.01 | e56abd2 | @Hedede |
| RTX 2080 Ti | 11 GB / GDDR6 / 352 bit | 3107.61 ± 4.34 | 109.17 ± 0.07 | 9c35706 | @ariya |
| Quadro RTX 6000 | 24 GB / GDDR6 / 384 bit | 3053.96 ± 1.37 | 104.38 ± 0.04 | b8e09f0 | @Hedede |
| Quadro RTX 8000 | 48 GB / GDDR6 / 384 bit | 3052.35 ± 5.64 | 103.63 ± 0.02 | b8e09f0 | @Hedede |
| RTX A4500 | 20 GB / GDDR6 / 320 bit | 3453.10 ± 49.19 | 103.00 ± 0.25 | 5cdb27e | @aleksyx |
| RTX 5060 Ti | 16 GB / GDDR7 / 128 bit | 4195.53 ± 1.98 | 93.46 ± 0.01 | 89d1029 | @mike-llamacpp |
| RTX 2070 SUPER | 8 GB / GDDR6 / 256 bit | 2293.29 ± 5.91 | 87.71 ± 0.29 | bc07349 | @phstudy |
| RTX A4000 | 16 GB / GDDR6 / 256 bit | 2807.83 ± 52.44 | 85.17 ± 0.66 | 65349f2 | @TinyServal |
| RTX 3060 | 12 GB / GDDR6 / 192 bit | 2407.67 ± 3.73 | 76.92 ± 0.03 | baa9255 | @QuantiusBenignus |
| Titan Xp | 12 GB / GDDR5X / 384 bit | 1218.12 ± 1.82 | 73.84 ± 0.04 | c4510dc | @Hedede |
| RTX 4060 Ti | 8 GB / GDDR6 / 128 bit | 3803.45 ± 70.80 | 64.03 ± 0.53 | 89d1029 | @mike-llamacpp |
| RTX A4000 Ada | 20 GB / GDDR6 / 160 bit | 3171.86 ± 4.34 | 61.37 ± 0.01 | a74a0d6 | @sdwolfz |
| GTX 1080 Ti | 11 GB / GDDR5X / 352 bit | 1138.14 ± 2.02 | 61.38 ± 0.03 | 9c35706 | @ariya |
| RTX 2060 SUPER | 8 GB / GDDR6 / 256 bit | 1563.77 ± 0.51 | 61.13 ± 0.05 | 5c0eb5e | @ggerganov |
| DGX Spark | 128 GB / LPDDR5x | 3661.37 ± 38.66 | 56.74 ± 0.03 | 5acd455 | @ggerganov |
| Tesla P40 | 24 GB / GDDR5 / 384 bit | 1079.66 ± 0.18 | 53.73 ± 0.05 | c76b420 | @m18coppola |
| RTX 2000 Ada | 16 GB / GDDR6 / 128 bit | 2250.14 ± 5.91 | 50.71 ± 0.01 | 756cfea | @DigitalRudeness |
| Tesla P100 | 16 GB / HBM2 / 4096 bit | 735.19 ± 3.72 | 51.08 ± 0.00 | 9ef5369 | @VinnyG9 |
| GTX 1660 Ti Mobile | 6 GB / GDDR5 / 192 bit | 635.21 ± 0.27 | 46.37 ± 0.07 | 912ff8c | @pt13762104 |
| Tesla T4 | 16 GB / GDDR6 / 256 bit | 1309.73 ± 1.02 | 44.03 ± 0.57 | d32e03f | @pt13762104 |
| GTX 1660 | 6 GB / GDDR5 / 192 bit | 154.45 ± 0.52 | 41.43 ± 0.01 | 9515c61 | @ariya |
| GTX 1070 Ti | 8 GB / GDDR5 / 256 bit | 790.52 ± 2.39 | 37.87 ± 0.00 | 79c1160 | @pebaryan |
| Tesla P4 | 8 GB / GDDR5 / 256 bit | 529.53 ± 2.12 | 33.12 ± 0.03 | c76b420 | @m18coppola |
| P106-100 | 6 GB / GDDR5 / 192 bit | 438.49 ± 0.38 | 30.64 ± 0.06 | 5fd160b | @pebaryan |
| RTX 3500 Mobile Ada | 12 GB / GDDR6 / 192 bit | 1610.14 ± 32.13 | 28.75 ± 0.21 | 1062205 | @luisaforozco |
| GTX 1060 | 6 GB / GDDR5 / 192 bit | 446.19 ± 0.81 | 28.18 ± 0.01 | 5fd160b | @pebaryan |
| Quadro T1000 | 4 GB / GDDR5 / 128 bit | 27.46 ± 0.23 | 27.46 ± 0.23 | f6da8cb | @hanabu |
| Quadro P2000 | 5 GB / GDDR5 / 160 bit | 311.55 ± 0.19 | 23.76 ± 0.01 | baa9255 | @TinyServal |
| Tesla K80 | 12 GB / GDDR5 / 384 bit | 133.36 ± 0.60 | 14.27 ± 0.32 | 32732f2 | @pebaryan |
| Quadro P1000 | 4 GB / GDDR5 / 128 bit | 173.82 ± 0.02 | 13.65 ± 0.14 | 1e74897 | @aleksyx |

More detailed test

The main idea of this test is to show how performance decreases as the prompt and generation sizes increase.

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048
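To compare the two FA settings across machines, it can help to run the sweep once per setting and keep each report; a small sketch, assuming llama-bench is on your PATH and the model file is in the current directory (llama-bench prints a Markdown table by default, hence the .md files):

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048 | tee results-no-fa.md
llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048 | tee results-fa.md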

Replies: 60 comments 40 replies

Here are the results for my devices. Not sure how to get a "CUDA info string" though.

CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)

Chip pp512 t/s tg128 t/s Commit
Tesla P4 514.53 ± 3.06 33.29 ± 0.00 c76b420
Tesla P40 1007.42 ± 1.23 54.74 ± 0.07 c76b420
RTX 3090 5174.69 ± 21.83 158.16 ± 0.21 c76b420

CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)

Chip pp512 t/s tg128 t/s Commit
Tesla P4 529.53 ± 2.12 33.12 ± 0.03 c76b420
Tesla P40 1079.66 ± 0.18 53.73 ± 0.05 c76b420
RTX 3090 5560.06 ± 16.28 161.89 ± 0.18 c76b420

While technically not directly related, there may also be value in comparing AMD ROCm builds here too, as ROCm acts as a replacement (sometimes a directly compatible layer) for most CUDA calls.

I admit there's a risk of confusion for Nvidia users in the thread if this path is taken.


As far as I know, you cannot run ROCm on an Nvidia GPU. If you would like to see compared results, check the Vulkan thread; you can find results there for Vulkan/CUDA and Vulkan/ROCm.

UPD: Created ROCm discussion.


Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 6567.49 ± 20.30
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 171.19 ± 3.98
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 6924.01 ± 10.76
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 172.26 ± 1.31

build: 9c35706 (6060)

Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 5013.86 ± 24.80
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 139.65 ± 0.99
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 5569.56 ± 14.04
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 139.95 ± 0.95

build: 9c35706 (6060)


Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 6924.53 ± 13.87
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 132.26 ± 0.16
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 7612.32 ± 37.35
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 132.85 ± 0.31

build: 9c35706 (647)


@olegshulyakov One more benchmark for RTX 4080:

Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 8031.64 ± 26.49
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 142.49 ± 0.16
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 9205.93 ± 22.31
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 143.47 ± 0.02

build: 20638e4 (2)


@Ristovski why so slow? Have you undervolted it? It's about the same as an RTX 3080; I expected it somewhere between the RTX 3090 and 3080 Ti =(


@Ristovski why so slow? Have you undervolted it? It's about the same as an RTX 3080; I expected it somewhere between the RTX 3090 and 3080 Ti =(

Hmm indeed, I didn't give much thought to the score at first. It should be stock but not completely sure as that is one of our work machines. I didn't have much time to investigate today, will check again tomorrow!


Device 0: 3090, power limited to 250 W

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 4175.47 ± 27.79
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 137.72 ± 0.46
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 4377.03 ± 89.10
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 138.34 ± 0.96

build: 9c35706 (6060)

Device 2: 5090, power limited to 400 W

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 12706.26 ± 13.30
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 236.73 ± 1.29
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 13823.36 ± 20.99
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 245.02 ± 1.08

build: 9c35706 (6060)


Can you please run them at full power, without the limit?


Sure, results with the default power limits:

3090 at 390W
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 5405.83 ± 5.80
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 151.04 ± 0.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 5932.44 ± 10.87
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 155.36 ± 0.09

build: 9c35706 (6060)

5090 at 600W
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 14751.98 ± 136.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 239.62 ± 0.37
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 16041.54 ± 85.27
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 248.57 ± 0.05

build: 9c35706 (6060)


Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 100 pp512 1084.41 ± 3.01
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 100 tg128 62.49 ± 0.06

Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 100 1 pp512 1138.14 ± 2.02
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 100 1 tg128 61.38 ± 0.03

build: 9c35706 (6060)


@olegshulyakov To help users quickly understand the approximate largest models that can run on each GPU, I suggest adding a VRAM column next to the GPU name on the main scoreboard.

Example:

| Chip | VRAM | pp512 t/s | tg128 t/s | Commit |
| --- | --- | --- | --- | --- |
| RTX 3090 Ti | 24 GB | 6567.49 ± 20.30 | 171.19 ± 3.98 | 9c35706 |
| RTX 3090 | 24 GB | 5174.69 ± 21.83 | 158.16 ± 0.21 | c76b420 |
| RTX 3080 | 10 GB | 5013.86 ± 24.80 | 139.65 ± 0.99 | 9c35706 |

Made it a little bit better 🙂


Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 1420.24 ± 1.95
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 60.04 ± 0.01
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 1563.77 ± 0.51
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 61.13 ± 0.05

build: 5c0eb5e (6075)


@ggerganov Can you please add "performance" label?


@olegshulyakov I see you grabbed some of my numbers from the Vulkan thread. However, I flooded that post with a bunch of data that probably came across as noise. While you quoted my correct numbers for non-FA, the FA results you grabbed were actually from a run on two GPUs instead of one. To make things easier, here are the numbers from a single card:

RTX 5060 Ti 16 GB

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
 Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
 Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
 Device 2: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
 Device 3: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
model size params backend ngl sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 none 0 pp512 3737.25 ± 6.79
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 none 0 tg128 90.94 ± 0.02
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 none 1 pp512 4195.53 ± 1.98
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 none 1 tg128 93.46 ± 0.01

build: 89d10295 (6002)

And here's another GPU for the collection:

RTX 4060 Ti 8 GB

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 3394.63 ± 7.44
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 63.86 ± 0.01
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 3803.45 ± 70.80
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 64.03 ± 0.53

build: 89d10295 (6002)


Nice 64GB VRAM setup you got there!

And here's another GPU for the collection:

We all be here showing off our GPU collections 😅


Thanks. It isn't the fastest setup around, especially when working with 70B+ models, but it is completely usable for inference. There are also some benefits I like about these particular cards (Gigabyte Windforce):

  • Two slots thick and only ~200 mm in length makes them easy to fit in a wide variety of cases
  • Physical x8 PCI-e connector lets them fit in either x8 or x16 slots without modification (5060 Tis only use 8 lanes anyhow)
  • Quiet (Silent when idle)
  • Low idle power consumption (~5 watts per card)
  • Relatively low power draw under full load (<180W each), so easy to power all four with an inexpensive PSU

Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 100 0 pp512 2890.66 ± 2.42
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 100 0 tg128 107.51 ± 0.21
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 100 1 pp512 3107.61 ± 4.34
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 100 1 tg128 109.17 ± 0.07

build: 9c35706 (6060)


Yeah, I also saw numbers for my 4090 taken from the Vulkan thread. I re-ran the CUDA benchmarks so you can get the latest FA and non-FA results from the same build:

FA:

❯ CUDA_VISIBLE_DEVICES=0 build/bin/llama-bench -m /models/llm/gguf/llama-2-7b.Q4_0.gguf -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 14770.63 ± 102.93
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 188.96 ± 0.05

Non-FA:

❯ CUDA_VISIBLE_DEVICES=0 build/bin/llama-bench -m /models/llm/gguf/llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 pp512 11992.70 ± 107.99
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 tg128 186.21 ± 0.13

build: 224145325 (6098)

nvidia-dkms 575.64.03-1

❯ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0


NVIDIA P106-100
6GB VRAM
Win 11
Driver Version: 566.36 CUDA Version: 12.7

I ran it twice, on two different builds, and took the best result.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA P106-100, compute capability 6.1, VMM: no
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 0 pp512 406.94 ± 0.25
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 0 tg128 30.40 ± 0.02
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 1 pp512 438.49 ± 0.38
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 1 tg128 30.64 ± 0.06

build: 5fd160b (6106)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA P106-100, compute capability 6.1, VMM: no
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 0 pp512 425.73 ± 0.82
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 0 tg128 29.42 ± 0.03
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 1 pp512 436.90 ± 0.88
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 1 tg128 29.94 ± 0.03

build: 860a9e4 (5688)

Sadly, Nvidia does not support this device in its Vulkan driver.


I just bricked my GTX 1070 Ti :( so I won't be able to reproduce the result with a newer build


@pebaryan I've taken the one from the later build.


Would like to participate with a slightly exotic one from my cute server cube... :-) (RTX 2000 Ada, 16 GB, 75 W)

I did two runs:

  1. pull/compilation of llama.cpp from yesterday:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation, compute capability 8.9, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 1956.22 ± 7.74
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 50.62 ± 0.04
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 2250.14 ± 5.91
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 50.71 ± 0.01

build: 756cfea (6105)

  2. fresh pull/compilation of llama.cpp ~5 min ago:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation, compute capability 8.9, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 1952.82 ± 7.35
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 50.59 ± 0.04
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 2237.16 ± 6.18
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 50.67 ± 0.01

build: 1d72c84 (6109)

Seems to make no big difference... ^^


I finally got my hands on a similar card to the one before (P106-100), but with a display output

NVIDIA GTX 1060
6GB GDDR5 192-bit
Driver 566.36

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 0 pp512 416.85 ± 1.75
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 0 tg128 27.79 ± 0.02
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 1 pp512 446.19 ± 0.81
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 1 tg128 28.18 ± 0.01

build: 5fd160b (6106)


Just realized I didn't use the latest build; not much of a difference though.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 0 pp512 413.59 ± 2.04
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 0 tg128 27.74 ± 0.06
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 1 pp512 443.66 ± 0.25
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,RPC 99 1 tg128 28.08 ± 0.04

build: 79c1160 (6123)


Quadro RTX 6000 (24GB / 384 bit)

Driver Version: 570.86.10
CUDA Version: 12.8

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:

Device 0: Quadro RTX 6000, compute capability 7.5, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 2751.18 ± 19.43
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 102.77 ± 0.04
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 3053.96 ± 1.37
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 104.38 ± 0.04

build: b8e09f0 (6475)


Quadro RTX 8000 (48GB / 384 bit)

Pretty much the same.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Quadro RTX 8000, compute capability 7.5, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 2709.95 ± 3.35
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 102.68 ± 0.03
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 3052.35 ± 5.64
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 103.63 ± 0.02

build: b8e09f0 (6475)


No surprise, chip seems to be the same.


NVIDIA RTX 3500 Ada Generation Laptop GPU (12 GB)
Driver Version: 576.57
CUDA Version: 12.9
Architecture: Ada Lovelace

GPU capped at 40W

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA RTX 3500 Ada Generation Laptop GPU, compute capability 8.9, VMM: yes
 
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 1406.43 ± 52.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 30.23 ± 0.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 1610.14 ± 32.13 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 28.75 ± 0.21 |
build: 10622056 (6477)

EDIT: Added a benchmark with Power mode set to "Best performance"
GPU capped at 105W

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA RTX 3500 Ada Generation Laptop GPU, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 3353.26 ± 20.38 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 81.17 ± 3.62 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 3757.32 ± 37.79 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 80.18 ± 1.34 |
build: 10622056 (6477)

Tesla V100 (32GB / HBM2 / 4096 bit)

Driver Version: 580.65.06
CUDA Version: 12.9 (nvidia-smi reports 13.0 but I had to use an older nvcc version because CUDA 13 dropped support for anything older than Turing)

Tested with a few different models. Quite respectable for such an old chip.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:

Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 3042.64 ± 40.71
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 129.08 ± 0.05
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 2973.78 ± 3.62
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 134.76 ± 0.02
gemma3 12B Q4_K - Medium 6.79 GiB 11.77 B CUDA 99 0 pp512 1857.73 ± 4.03
gemma3 12B Q4_K - Medium 6.79 GiB 11.77 B CUDA 99 0 tg128 67.34 ± 0.03
gemma3 12B Q4_K - Medium 6.79 GiB 11.77 B CUDA 99 1 pp512 1771.04 ± 1.70
gemma3 12B Q4_K - Medium 6.79 GiB 11.77 B CUDA 99 1 tg128 68.94 ± 0.07
qwen3 14B Q4_K - Medium 8.38 GiB 14.77 B CUDA 99 0 pp512 1536.27 ± 2.28
qwen3 14B Q4_K - Medium 8.38 GiB 14.77 B CUDA 99 0 tg128 64.87 ± 0.03
qwen3 14B Q4_K - Medium 8.38 GiB 14.77 B CUDA 99 1 pp512 1474.29 ± 2.45
qwen3 14B Q4_K - Medium 8.38 GiB 14.77 B CUDA 99 1 tg128 67.03 ± 0.02
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B CUDA 99 0 pp512 642.41 ± 0.47
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B CUDA 99 0 tg128 30.74 ± 0.02
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B CUDA 99 1 pp512 613.13 ± 1.09
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B CUDA 99 1 tg128 31.80 ± 0.01

build: 51f5a45 (6533)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes
Device 1: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes

model size params backend ngl fa test t/s
qwen2 70B Q4_0 38.53 GiB 72.96 B CUDA 99 0 pp512 335.74 ± 0.45
qwen2 70B Q4_0 38.53 GiB 72.96 B CUDA 99 0 tg128 17.22 ± 0.00
qwen2 70B Q4_0 38.53 GiB 72.96 B CUDA 99 1 pp512 324.02 ± 0.38
qwen2 70B Q4_0 38.53 GiB 72.96 B CUDA 99 1 tg128 17.72 ± 0.00

build: 51f5a45 (6533)


@Hedede Have you tried running the benchmark on Vulkan?


I tried but I couldn't get Vulkan working on the V100.


Titan Xp (12GB / GDDR5X / 384 bit)

Driver Version: 570.172.08
CUDA Version: 12.8

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA TITAN Xp, compute capability 6.1, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 1154.96 ± 1.46
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 76.08 ± 0.08
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 1218.12 ± 1.82
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 73.84 ± 0.04

build: c4510dc (6532)


RTX 6000 Ada Generation (48 GB / GDDR6 / 384 bit)

Driver Version: 575.64.03
CUDA Version: 12.9

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 9229.23 ± 101.78
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 176.07 ± 0.26
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 10576.85 ± 530.21
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 179.47 ± 0.32

build: b8e09f0 (6475)


These llama models are not really that useful. What about the gpt-oss models? Has anyone been able to get those models running on H100s using llama.cpp? See:


5090 has 15% more TG performance in newer builds.

Driver Version: 575.64.05
CUDA Version: 12.9

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 14142.08 ± 52.87
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 277.21 ± 0.96
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 16195.14 ± 141.11
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 286.51 ± 0.13

build: 54dbc37 (6594)


Any chance that's with overclocking? I don't observe this 15% perf difference on my RTX 6000 Blackwell, which should be the same chip as the 5090, just with more cores active -- I can't even match your numbers.


@Tom94 What does nvidia-smi -q -d CLOCK report for RTX 6000? And which driver version are you running?

 Max Clocks
 Graphics : 3090 MHz
 SM : 3090 MHz
 Memory : 14001 MHz
 Video : 3090 MHz

I've got the same max clocks as you, but those aren't reached in practice. Maximum boost is 2820 MHz. Even if I try to pin the clocks using nvidia-smi -lgc 3090,3090 && nvidia-smi -lmc 14001,14001, it's stuck at 2820. Raising the application clocks via nvidia-smi -ac 14001,3090 doesn't help either.

That said, what I read on the internet about the RTX 5090 is that it has similar restrictions and that it's supposedly only possible to reach as high as 3000 MHz by overclocking. But maybe that's only true for some models and you've got one that has more headroom.

Driver Version : 580.95.05
CUDA Version : 13.0
Attached GPUs : 1
GPU 00000000:01:00.0
 Clocks
 Graphics : 285 MHz
 SM : 285 MHz
 Memory : 405 MHz
 Video : 667 MHz
 Applications Clocks
 Graphics : 2617 MHz
 Memory : 14001 MHz
 Default Applications Clocks
 Graphics : 2617 MHz
 Memory : 14001 MHz
 Deferred Clocks
 Memory : N/A
 Max Clocks
 Graphics : 3090 MHz
 SM : 3090 MHz
 Memory : 14001 MHz
 Video : 3090 MHz

Yes, I also get the maximum boost of 2820 MHz. And the memory clocks of 13801 MHz. According to nvidia-smi dmon, it consistently boosts to ~2800 MHz during shorter runs like -p 512, and during longer runs, e.g. -p 16384, it drops to 2200-2400 MHz.
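For reference, one way to watch this while a benchmark runs is to sample power and clocks once per second from a second terminal; a rough sketch (exact dmon flags may vary by driver version):

nvidia-smi dmon -s pc -d 1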


Hardware: Hetzner GEX44
OS: Ubuntu SMP PREEMPT_DYNAMIC x86_64 GNU/Linux
GPU: NVIDIA RTX 4000 SFF Ada Generation
Driver: 580.65.06 CUDA 13.0
Runtime: Docker

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 2779.77 ± 9.91
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 61.83 ± 0.04
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 3171.86 ± 4.34
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 61.37 ± 0.01

build: a74a0d6 (6638)


RTX 2070 SUPER (8 GB / GDDR6 / 256-bit)

Driver Version: 580.65.06
CUDA Version: 13.0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 2088.34 ± 1.94
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 88.06 ± 0.28
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 2293.29 ± 5.91
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 87.71 ± 0.29

build: bc07349 (6756)


DGX Spark (128 GB / LPDDR5x / Unified)

Driver Version: 580.95.05
CUDA Version: 13.0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 3062.31 ± 11.02
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 57.21 ± 0.06
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 3661.37 ± 38.66
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 56.74 ± 0.03

build: 5acd455 (6767)


@ggerganov Hi mate, just curious about the result.
Is the reduced performance—approximately half that of the 5070—caused by the bandwidth limitation of LPDDR5x? I thought it had similar computing power to the 5070.


The text-generation (tg) performance is mostly dependent on the memory bandwidth. So for a 5070 with a bandwidth of 672 GB/s we can expect ~2.4x higher tg compared to DGX Spark with 273 GB/s.
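(A rough back-of-the-envelope check, treating tg as purely bound by re-reading the 3.56 GiB of Q4_0 weights for each token: 273 GB/s ÷ ~3.8 GB ≈ 71 t/s as an upper bound for the DGX Spark, against the ~57 t/s measured above, and 672 / 273 ≈ 2.46, consistent with the ~2.4x figure.)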


RTX 3070 Laptop GPU (8 GB / GDDR6 / 256 bit)

Driver Version: 580.76.05
CUDA Version: 13.0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Laptop GPU, compute capability 8.6, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 1729.14 ± 38.71
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 59.72 ± 0.78
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 1904.62 ± 16.19
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 57.36 ± 0.98

build: ceff6bb (6783)

Edit: re-ran the benchmark with the laptop sitting on a table instead of my lap... slightly better results.

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 1735.28 ± 36.61
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 62.17 ± 0.98
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 1918.79 ± 24.64
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 59.52 ± 0.34

Titan V (12 GB / HBM2 / 3072 bit)

Driver Version: 550.127.05
CUDA Version: 12.4

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA TITAN V, compute capability 7.0, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 2617.46 ± 2.10
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 108.79 ± 0.05
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 2481.25 ± 1.31
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 112.17 ± 0.01

build: e56abd2 (6794)


NVIDIA GeForce RTX 4080 SUPER

OS: NixOS / Linux 6.16.11-xanmod1
Driver Version: 580.95.05
CUDA Version: 13.0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes
model size params backend ngl fa dev test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,Vulkan 99 0 CUDA0 pp512 8125.15 ± 41.05
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,Vulkan 99 0 CUDA0 tg128 148.33 ± 0.20
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,Vulkan 99 1 CUDA0 pp512 9439.01 ± 56.75
llama 7B Q4_0 3.56 GiB 6.74 B CUDA,Vulkan 99 1 CUDA0 tg128 147.48 ± 1.41

build: 81086cd (6729)


L40 (48 GB / GDDR6 / 384 bit)

Driver Version: 570.153.02
CUDA Version: 12.8

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA L40, compute capability 8.9, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 8870.49 ± 378.76
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 152.01 ± 0.28
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 10097.64 ± 671.22
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 153.76 ± 0.12

build: ee09828 (6795)


Zotac RTX 5090 Arctic Storm (450 W)

./build/bin/llama-bench -t 12 -m ~/Downloads/llama-2-7b.Q4_0.gguf -fa 0,1 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 pp512 14073.41 ± 115.16
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 0 tg128 290.02 ± 1.10
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 pp512 14970.15 ± 381.06
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 tg128 300.40 ± 0.28

build: 8cf6b42 (6824)
