Performance of llama.cpp on Nvidia CUDA #15013
This is similar to the Performance of llama.cpp on Apple Silicon M-series, Performance of llama.cpp on AMD ROCm (HIP), and Performance of llama.cpp with Vulkan discussions, but for CUDA! I think it's good to consolidate and discuss our results here. We'll be testing the Llama 2 7B model like the other threads to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4 GB GPU. You can download it here.

Instructions

Either run the commands below or download one of our CUDA releases. If you have multiple GPUs, please run the test on a single GPU using

Share your llama-bench results along with the git hash and CUDA info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard. If multiple entries are posted for the same device, I'll prioritize newer commits with substantial CUDA updates; otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that your memory speed and number of channels will greatly affect your inference speed!

CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)

CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)

More detailed test

The main idea of this test is to show the decrease in performance with increasing size.
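For reference, here is a minimal sketch of what a run could look like. The `CUDA_VISIBLE_DEVICES` variable, the model file path, and the prompt sizes in the sweep are my assumptions (the original post's wording after "using" was truncated), not the exact commands from the post:

```sh
# Build llama.cpp with the CUDA backend (standard cmake flags; adjust to taste).
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Benchmark Llama 2 7B Q4_0 with and without flash attention on a single GPU.
# CUDA_VISIBLE_DEVICES pins the run to one device on multi-GPU systems (assumed).
CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1

# "More detailed test": sweep the prompt size to show throughput dropping as it
# grows (the sizes here are illustrative).
CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 \
  -p 512,1024,2048,4096,8192
```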
-
Here are the results for my devices. Not sure how to get a "cuda info string" though.

CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)

CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)
-
While technically not directly related, there may also be value in comparing AMD ROCm builds here too, as ROCm acts as a replacement (sometimes a directly compatible layer) for most CUDA calls.
I admit there is a risk of confusing Nvidia users in the thread if this path is taken.
-
As far as I know, you cannot run ROCm on an Nvidia GPU. If you would like to see comparative results, check the Vulkan thread. You can find results there for Vulkan/CUDA and Vulkan/ROCm.
UPD: Created a ROCm discussion.
-
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: 9c35706 (6060)

Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
build: 9c35706 (6060)
-
Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
build: 9c35706 (647)
-
@olegshulyakov One more benchmark for RTX 4080:

Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
build: 20638e4 (2)
-
@Ristovski why so slow? Have you undervolted it? It's pretty much the same as an RTX 3080, I expected somewhere between the RTX 3090 and 3080 Ti =(
-
@Ristovski why so slow? Have you undervolted it? It's pretty much the same as an RTX 3080, I expected somewhere between the RTX 3090 and 3080 Ti =(

Hmm indeed, I didn't give much thought to the score at first. It should be stock, but I'm not completely sure as that is one of our work machines. I didn't have much time to investigate today, will check again tomorrow!
-
Device 0: 3090. Power limited to 250 W
build: 9c35706 (6060)

Device 2: 5090. Power limited to 400 W
build: 9c35706 (6060)
-
Can you please run them without a power limit, at full power?
-
Sure, results with default power limits:

3090 at 390 W
build: 9c35706 (6060)

5090 at 600 W
build: 9c35706 (6060)
-
Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
build: 9c35706 (6060)
-
@olegshulyakov To help users quickly understand the approximate largest models that can run on each GPU, I suggest adding a VRAM column next to the GPU name on the main scoreboard. Example:
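Something along these lines, with hypothetical rows and the throughput cells left as placeholders (illustrative only, not actual scoreboard entries):

| GPU | VRAM | pp512 t/s | tg128 t/s |
| --- | ---: | --------: | --------: |
| RTX 4090 | 24 GB | … | … |
| RTX 3060 | 12 GB | … | … |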
-
Made it a little bit better 🙂
-
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
build: 5c0eb5e (6075)
-
@ggerganov Can you please add a "performance" label?
-
@olegshulyakov I see you grabbed some of my numbers from the Vulkan thread. However, I flooded that post with a bunch of data that probably came across as noise. While you quoted my correct numbers for non-FA, the FA results you grabbed were actually from a run on two GPUs instead of one. To make things easier, here are the numbers from a single card:

RTX 5060 Ti 16 GB

And here's another GPU for the collection:

RTX 4060 Ti 8 GB
-
Nice 64GB VRAM setup you got there!
And here's another GPU for the collection:
We all be here showing off our GPU collections 😅
-
Thanks. It isn't the fastest setup around, especially when working with 70B+ models, but it is completely usable for inference. There are also some benefits I like about these particular cards (Gigabyte Windforce):
- Two slots thick and only ~200 mm in length makes them easy to fit in a wide variety of cases
- Physical x8 PCIe connector lets them fit in either x8 or x16 slots without modification (5060 Tis only use 8 lanes anyhow)
- Quiet (Silent when idle)
- Low idle power consumption (~5 watts per card)
- Relatively low power draw under full load (<180W each), so easy to power all four with an inexpensive PSU
-
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
build: 9c35706 (6060)
-
Yeah, I also saw numbers for my 4090 taken from the Vulkan thread. I re-ran the CUDA benchmarks so you can get the latest FA and non-FA results from the same build:

FA:

Non-FA:

nvidia-dkms 575.64.03-1
❯ nvcc --version
-
NVIDIA P106-100. I ran it two times and took the best result, on 2 different builds:

build: 5fd160b (6106)

build: 860a9e4 (5688)

Sadly, Nvidia does not support this device in the Vulkan driver.
-
I just bricked my GTX 1070 Ti :( so I won't be able to reproduce the result with a newer build.
-
@pebaryan I've taken the one from the latest build.
-
Would like to participate with a slightly exotic one from my cute server cube.. :-) (RTX 2000 Ada, 16 GB, 75 W). I did two runs:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 756cfea (6105)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 1d72c84 (6109)

Seems to make no big difference... ^^
-
I finally got my hands on a similar card as before (P106), but with a display output: NVIDIA GTX 1060

build: 5fd160b (6106)
-
Just realized I didn't use the latest build; not that much of a difference though.

build: 79c1160 (6123)
-
Quadro RTX 6000 (24 GB / 384 bit)
Driver Version: 570.86.10

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Device 0: Quadro RTX 6000, compute capability 7.5, VMM: yes
build: b8e09f0 (6475)
-
Quadro RTX 8000 (48 GB / 384 bit)

Pretty much the same.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: b8e09f0 (6475)
-
No surprise, chip seems to be the same.
-
NVIDIA RTX 3500 Ada Generation Laptop GPU (12 GB)
Driver Version: 576.57
CUDA Version: 12.9
Architecture: Ada Lovelace
GPU capped at 40W
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 3500 Ada Generation Laptop GPU, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 1406.43 ± 52.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 30.23 ± 0.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 1610.14 ± 32.13 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 28.75 ± 0.21 |
build: 10622056 (6477)
EDIT: Added a benchmark with Power mode set to "Best performance"
GPU capped at 105W
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 3500 Ada Generation Laptop GPU, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 3353.26 ± 20.38 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 81.17 ± 3.62 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 3757.32 ± 37.79 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 80.18 ± 1.34 |
build: 10622056 (6477)
-
Tesla V100 (32 GB / HBM2 / 4096 bit)
Driver Version: 580.65.06

Tested with a few different models. Quite respectable for such an old chip.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes
build: 51f5a45 (6533)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 51f5a45 (6533)
-
@Hedede Have you tried running the benchmark on Vulkan?
-
I tried but I couldn't get Vulkan working on the V100.
-
Titan Xp (12 GB / GDDR5X / 384 bit)
Driver Version: 570.172.08

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: c4510dc (6532)
-
RTX 6000 Ada Generation (48 GB / GDDR6 / 384 bit)
Driver Version: 575.64.03

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: b8e09f0 (6475)
-
These llama models are not really that useful. What about the gpt-oss models? Has anyone been able to get those models running on H100s using llama.cpp? See:
-
5090 has 15% more TG performance in newer builds.

Driver Version: 575.64.05
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 54dbc37 (6594)
-
Any chance that's with overclocking? I don't observe this 15% perf difference on my RTX 6000 Blackwell, which should be the same chip as the 5090, just with more cores active -- I can't even match your numbers.
-
@Tom94 What does nvidia-smi -q -d CLOCK report for RTX 6000? And which driver version are you running?
Max Clocks
Graphics : 3090 MHz
SM : 3090 MHz
Memory : 14001 MHz
Video : 3090 MHz
-
I've got the same max clocks as you, but those aren't reached in practice. Maximum boost is 2820 MHz. Even if I try to pin the clocks using nvidia-smi -lgc 3090,3090 && nvidia-smi -lmc 14001,14001, it's stuck at 2820. Raising the application clocks via nvidia-smi -ac 14001,3090 doesn't help either.
That said, what I read on the internet about the RTX 5090 is that it has similar restrictions and that it's supposedly only possible to reach as high as 3000 MHz by overclocking. But maybe that's only true for some models and you've got one that has more headroom.
Driver Version : 580.95.05
CUDA Version : 13.0
Attached GPUs : 1
GPU 00000000:01:00.0
Clocks
Graphics : 285 MHz
SM : 285 MHz
Memory : 405 MHz
Video : 667 MHz
Applications Clocks
Graphics : 2617 MHz
Memory : 14001 MHz
Default Applications Clocks
Graphics : 2617 MHz
Memory : 14001 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 3090 MHz
SM : 3090 MHz
Memory : 14001 MHz
Video : 3090 MHz
-
Yes, I also get the maximum boost of 2820 MHz. And the memory clocks of 13801 MHz. According to nvidia-smi dmon, it consistently boosts to ~2800 MHz during shorter runs like -p 512, and during longer runs, e.g. -p 16384, it drops to 2200-2400 MHz.
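For anyone who wants to reproduce this kind of observation, a rough sketch (model path and flags assumed, same as elsewhere in the thread):

```sh
# Stream power and clock readings while a long prompt-processing run is in flight,
# to see the boost clock sag under sustained load.
nvidia-smi dmon -s pc &    # p = power/temperature, c = processor/memory clocks
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 16384
kill $!                    # stop the background monitor
```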
-
Hardware:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: a74a0d6 (6638)
-
RTX 2070 SUPER (8 GB / GDDR6 / 256-bit)
Driver Version: 580.65.06

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: bc07349 (6756)
-
DGX Spark (128 GB / LPDDR5x / Unified)
Driver Version: 580.95.05

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 5acd455 (6767)
-
@ggerganov Hi mate, just curious about the result.
Is the reduced performance—approximately half that of the 5070—caused by the bandwidth limitation of LPDDR5x? I thought it had similar computing power to the 5070.
-
The text-generation (tg) performance is mostly dependent on the memory bandwidth. So for a 5070 with a bandwidth of 672 GB/s we can expect ~2.4x higher tg compared to DGX Spark with 273 GB/s.
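As a rough back-of-envelope check (my arithmetic, not part of the reply above; it assumes tg is fully bandwidth-bound and that each generated token requires reading the full ~3.56 GiB ≈ 3.8 GB of Q4_0 weights once):

```
5070:      672 GB/s ÷ 3.8 GB ≈ 176 t/s theoretical ceiling
DGX Spark: 273 GB/s ÷ 3.8 GB ≈  72 t/s theoretical ceiling
Ratio:     672 ÷ 273 ≈ 2.46, i.e. the ~2.4x figure above
```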
-
RTX 3070 Laptop GPU (8 GB / GDDR6 / 256 bit)
Driver Version: 580.76.05

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: ceff6bb (6783)

Edit: re-ran the benchmark with the laptop sitting on a table instead of my lap... slightly better results.
-
Titan V (12 GB / HBM2 / 3072 bit)
Driver Version: 550.127.05

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: e56abd2 (6794)
-
NVIDIA GeForce RTX 4080 SUPER
OS: NixOS / Linux 6.16.11-xanmod1

build: 81086cd (6729)
-
L40 (48 GB / GDDR6 / 384 bit)
Driver Version: 570.153.02

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: ee09828 (6795)
-
Zotac RTX 5090 Arctic Storm (450 W)

./build/bin/llama-bench -t 12 -m ~/Downloads/llama-2-7b.Q4_0.gguf -fa 0,1 -ngl 99

build: 8cf6b42 (6824)