Performance of llama.cpp on AMD ROCm (HIP) #15021
This is similar to the Performance of llama.cpp on Apple Silicon M-series, Performance of llama.cpp on Nvidia CUDA, and Performance of llama.cpp with Vulkan threads, but for ROCm! I think it's good to consolidate and discuss our results here.

We'll be testing the Llama 2 7B model like the other threads to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4 GB GPU. You can download it here.

Instructions
Either run the commands below or download one of our ROCm (HIP) releases. If you have multiple GPUs, please run the test on a single GPU using -sm none -mg 0. Share your llama-bench results along with the git hash and ROCm info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard. If multiple entries are posted for the same device, I'll prioritize newer commits with substantial ROCm updates; otherwise I'll pick the one with the highest overall score at my discretion.

Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that your memory speed and number of channels will greatly affect your inference speed!

ROCm Scoreboard for Llama 2 7B, Q4_0 (no FA)
ROCm Scoreboard for Llama 2 7B, Q4_0 (with FA)

More detailed test
The main idea of this test is to show the decrease in performance with increasing context size.
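For reference, a typical single-GPU run looks like the sketch below (the model filename and flags follow the commands posted in the replies; adjust the path to wherever you downloaded the GGUF):

```bash
# Benchmark Llama 2 7B Q4_0 with and without flash attention,
# pinned to a single GPU via -sm none -mg 0
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0
```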
-
RX 7800 XT (Sapphire Pulse 280W)
ggml_cuda_init: found 1 ROCm devices:
build: 00131d6 (6031)
ggml_vulkan: Found 1 Vulkan devices:
build: baad948 (6056)
Notes:
-
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: e725a1a (6104)
-
Happy to replicate:
ggml_cuda_init: found 1 ROCm devices:
build: 9c35706 (6060)
On Linux
-
RX 7600 XT
ggml_cuda_init: found 1 ROCm devices:
build: 9c35706 (6060)
Running on Linux 6.12.32, mainline amdgpu, ROCm 6.4.1.
ggml_vulkan: Found 1 Vulkan devices:
build: 9c35706 (6060)
-
@olegshulyakov, the 7600 XT actually has a 128-bit memory bus.
-
AMD MI60. Happy to contribute.
I will post FA=1 and Vulkan results once I have time over the weekend.
-
MI100
Using ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0
build: 9c35706 (6060)
I'm running Ubuntu 24.04.2 and ROCm 6.4.1.
-
I expected it to be faster than the RX 7800 XT because of HBM2... Have you tried launching with a single device only?
-
Bandwidth utilization is still fairly low on the GCN/CDNA parts (GCN and CDNA are effectively the same thing for tg).
GCN/CDNA is quite difficult to get decent utilization on, as these parts are very register-starved and have very small caches.
The MI100 also doesn't really have 1.2 TB/s of bandwidth; it is limited to a sustained 1024 GB/s by its fabric bandwidth.
-
AMD Instinct MI300X
root@0-4-9-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: 2bf3fbf (6069)
Ref: #14640
-
I'm just referring to the rocWMMA flag from the build instructions: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#hip
To enhance flash attention performance on RDNA3+ or CDNA architectures, you can utilize the rocWMMA library by enabling the -DGGML_HIP_ROCWMMA_FATTN=ON option. This requires rocWMMA headers to be installed on the build system.
It should work for CDNA too, but we have only tested with our RDNA3 cards (7900 XTX) and saw huge performance jumps in PP with FA on: #10879 (reply in thread)
Please try it out, because getting one-third of the PP performance with FA on is just... strange at best.
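For reference, a build along these lines should enable it; this is a sketch based on the HIP build instructions in docs/build.md, and the AMDGPU_TARGETS value (gfx1100 here, for RDNA3) is an assumption you should adjust for your card:

```bash
# Build the HIP backend with rocWMMA-based flash attention enabled
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build \
        -DGGML_HIP=ON \
        -DGGML_HIP_ROCWMMA_FATTN=ON \
        -DAMDGPU_TARGETS=gfx1100 \
        -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 16
```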
-
With rocWMMA enabled:
root@0-4-9-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: ee3a9fc (6090)
-
Performance sits between the RTX 4090 and the RTX 5090.
-
So rocWMMA does work for CDNA in FA :)
-
not very well
-
Pro V620
Why does FA slow down the V620 so much? It's been a question I've been trying to answer for a while now.
build: 03d4698 (6074)
Linux, ROCm 6.4.1 (will try upgrading soon)
-
@samteezy Can you please run it for each device (PRO V620 / Pro WX 3200) separately, and with the ROCm-only backend?
-
@olegshulyakov The numbers come out the same. Forcing a single GPU:
root@llama:~# /root/llama-builds/llama.cpp/bin/llama-bench -m /mnt/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -mg 0 -sm none
build: 5c0eb5e (6075)
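An alternative way to pin the benchmark to one card, used later in this thread, is to hide the other GPUs from the HIP runtime. A sketch, assuming device 0 is the V620 (the device index is an assumption; check rocm-smi for the actual ordering):

```bash
# Expose only GPU 0 to the ROCm backend, then benchmark it on its own
HIP_VISIBLE_DEVICES=0 /root/llama-builds/llama.cpp/bin/llama-bench \
  -m /mnt/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
```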
-
Powercolor Hellhound RX 7900 XTX (400W power limit)
openSUSE Tumbleweed system with rocm packages from
build: 5c0eb5e (6075)
Sapphire Nitro 7900 XTX (400W power limit)
In a different PC, unfortunately, because these GPUs are too chonky to fit in a regular case.
build: 9c35706 (6060)
-
So what are the recommended settings / what should I do to get the best performance on the 7900 XTX? I have a SAPPHIRE NITRO+ AMD Radeon RX 7900 XTX Vapor-X 24GB and, without changing anything, I get WAY worse results.
build: b9be58d (1005)
And this is what LACT showed while running it: [image]
-
At these high speeds with fast GPUs, the CPU becomes important for results; your results and his would be within the expected variance for differing CPU performance.
Not that it really matters much, as the CPU becomes almost irrelevant once a model large enough to fill the device is used.
Benchmarking Llama 7B Q4_0 is not really that great, as it doesn't reflect actual usage much; this hurts the most on CDNA devices, which scale better than you would expect performance-wise when increasing the number of parameters.
-
Go for ROCm 7.0; it has officially been released now. And compile llama.cpp with rocWMMA enabled, see #15021 (comment). You should get much better results.
-
ROCm 7 is released? That's great news! I'll try it out as soon as it lands in my distro's package manager.
That said, I don't think it'll give any performance improvements for the 7900 XTX. The reason the 9070 XT gets a boost with ROCm 7 is that pre-ROCm-7 WMMA is not implemented for RDNA4. But we'll see.
BTW, FYI: in my benchmarks the Powercolor 7900 XTX was paired with a Ryzen 2700 with 64GB RAM, and the Sapphire 7900 XTX was paired with a 5700X3D, also with 64GB RAM. I did not see an appreciable performance difference between the two setups in LLM inference.
-
I see. To me, 3573 vs 3053 seems like a big difference. I'm running this on an AMD Threadripper 1920X (12 cores) with PCIe 3.0 x16. What benchmark could I use to compare real-world inference performance more accurately?
As for ROCm 7, it's not yet in the Arch repos, so I'll have to wait a bit. It also wouldn't be an accurate comparison against this older result.
Now that I've booted with the amdgpu.ppfeaturemask=0xffffffff kernel parameter, I can increase the TDP up to 402W, but I don't see any performance impact at all - a 305W limit vs 402W gives the same benchmark result, so it looks like it doesn't matter.
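For anyone wanting to reproduce the power-limit test, the parameter is usually added through the bootloader; a sketch for a GRUB-based system (file paths are the common defaults and may differ per distro):

```bash
# 1. Append the override mask to the kernel command line in /etc/default/grub:
#      GRUB_CMDLINE_LINUX_DEFAULT="... amdgpu.ppfeaturemask=0xffffffff"
# 2. Regenerate the GRUB config and reboot
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot
# 3. After reboot, confirm the parameter is active
grep -o 'amdgpu.ppfeaturemask=[^ ]*' /proc/cmdline
```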
-
Powercolor Red Devil 7900 XTX
Adrenalin 25.8.1 just came out, so time to test again.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
build: 2572689 (6099)
Still lower than the historical highs from May 26th (3599 and 3743), and a loss and a win against July 22nd (3529 and 3598).
-
RX 7900 XTX (ASUS TUF)
build: 6c7e9a5 (6118)
-
RX 6800 (16GB, 203W)
ROCm 6.3.4 on Ubuntu 24.04 in a Docker container
build: 79c1160 (6123)
Bonus benchmarks
I ran these to compare ROCm versions on various models. Obviously the results are specific to my RX 6800 and shouldn't be used to make any judgments about ROCm performance in general, especially on RDNA3 and later GPUs. I use 6.3.4 because I don't care about Llama 3 8B. Note how fast the new MoE models are - gpt-oss-20B, even at Q6_K_XL, is faster than this 7B Q4_0 model. (Do make sure that you have a fixed version, because the original gpt-oss releases had some issues - I used https://huggingface.co/unsloth/gpt-oss-20b-GGUF.)
ROCm 6.3.4
ROCm 6.4.3
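For anyone who wants to reproduce the containerized setup, the usual pattern is to pass the KFD and DRI devices through to a ROCm image; a sketch, where the image tag and the model directory are assumptions rather than the exact setup used above:

```bash
# Start a ROCm development container with GPU access and a host model directory mounted
docker run --rm -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  -v /path/to/models:/models \
  rocm/dev-ubuntu-24.04:6.3.4 bash
# Inside the container: build llama.cpp with -DGGML_HIP=ON and run llama-bench as usual
```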
-
RX 7900 XTX (ASUS TUF, overclocked by 100 MHz on core and VRAM)
./build/bin/llama-bench -m /home/vk/Downloads/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: 648ebcd (6146)
-
RX 6900 XT AMD Reference Card (stock clocks)
Debian Testing
llama.cpp version: gguf-v0.17.1-386-gfd1234cb
-
@tdjb The results are pretty low; can you re-test using llama.cpp standalone, without Docker?
-
Just a quick test: I installed the llama.cpp build from Debian sid (was surprised to even find a build, to be honest), which appears to be b5882, and the results came in quite similar. I tried the benchmark on both of my devices, as one is on a slower PCIe x4 slot; the results below are from the faster run. Why do you think the 6900 XT should perform better?
Happy to run further tests.
-
It should be about 10% better, to my understanding, according to the specs: RX 6900 XT and RX 6800 XT.
-
The RDNA2 results are mostly surprisingly high given the hardware capabilities, not low.
-
GigaByte R9700
-
So the 7900 XTX has better performance; that's sad. It's also weird that Vulkan performs way worse than ROCm on pp but better on tg.
-
We'll see when someone a bit more experienced gives it a shot. My benchmarks are about as vanilla as it gets: I threw it in an Unraid server (12700K and 128GB DDR4-2133), made Docker images, and ran the benchmarks. Many of the 7900 XTX results are bare-metal, have a factory overclock or are manually overclocked, use additional drivers, and/or have raised power limits. I bet someone will beat my benchmarks shortly.
-
> So 7900 XTX has better performance, that's sad. Also weird that Vulkan performs way worse on pp than ROCm but better on tg.

There is zero reason to expect the 9070 (XT) to perform better than the XTX.
-
Radeon RX 9070 (non-XT)
build: 65349f2 (6183)
I tried to enable the use of rocwmma with [...]. Still surprising that these numbers are better than the 9070 XT.
-
EDIT: see my comment below.
I bought 2x MI50 32GB VRAM from Alibaba and for some reason I'm getting really poor performance on them... No idea why; even an RX Vega 64 beats them, and they're way slower than someone else's MI50 16GB.
build: 21c17b5 (3)
build: 21c17b5 (3)
-
Found the issue: my RX Vega 64 was crashing/locking up, and that's why I had disabled power management.
build: 21c17b5 (3)
build: 21c17b5 (3)
But the weird thing is that now my RX Vega 64 with power management enabled performs worse...
-
Can you please run a longer one?
llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048
It would be interesting to see whether the large memory helps with the degradation at longer contexts.
-
I have disassembled my system, so I won't be able to test it for a while. But someone else from the MI50 Discord ran it:
# HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m ~/.cache/huggingface/llama/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp512 | 1048.27 ± 3.41 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp1024 | 927.94 ± 0.70 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp2048 | 681.32 ± 0.43 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp4096 | 470.72 ± 0.42 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp8192 | 365.29 ± 0.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp16384 | 236.79 ± 0.10 |
/workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:85: ROCm error
/workspace/llama.cpp/build/bin/libggml-base.so(+0x16ccb)[0x7d951ac1dccb]
/workspace/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x21f)[0x7d951ac1e12f]
/workspace/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x152)[0x7d951ac1e302]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f90612)[0x7d951a1a9612]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f9f8c2)[0x7d951a1b88c2]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f98886)[0x7d951a1b1886]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f97fc2)[0x7d951a1b0fc2]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f9602f)[0x7d951a1af02f]
/workspace/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x3fd)[0x7d951ac376cd]
/workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0x99)[0x7d951ad4b899]
/workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x105)[0x7d951ad4bc45]
/workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x2d4)[0x7d951ad52314]
/workspace/llama.cpp/build/bin/libllama.so(llama_decode+0x10)[0x7d951ad53260]
./build/bin/llama-bench(+0x1a92a)[0x60e3f404692a]
./build/bin/llama-bench(+0x14cb4)[0x60e3f4040cb4]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7d951a6cad90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7d951a6cae40]
./build/bin/llama-bench(+0x19945)[0x60e3f4045945]
Aborted (core dumped)
-
It is interesting how Vulkan outperforms ROCm by a large margin when flash attention is on. FA is a big gain, so I guess I'll be running Vulkan over ROCm lol.
I was looking to buy 2 (maybe 3) MI50s because the hardware specs seem really good on paper for such a low price, but from all of the benchmarks I've seen it seems like the software just isn't there yet to really squeeze out all the performance they have to offer. It seems like 20% is still left on the table. Hopefully the support only gets better, but they are old cards, so I doubt it.
-
RX 9070 XT (Powercolor Red Devil)
OS: Ubuntu 24.04, ROCm 6.4.3
build: 043fb27 (6264)
Vulkan
build: c9a24fb (6262)
-
Vulkan with FA enabled has impressive pp and tg gains. What version of Vulkan are you using? I see it is RADV, but is this the latest Vulkan driver? I haven't seen such a speedup on MI50 cards with Vulkan yet. Thanks!
-
This is because WMMA flash attention is currently disabled on gfx12 until ROCm 7.0 (or when force-enabled).
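If memory serves, there is a separate CMake switch to force it on gfx12 before ROCm 7; the option name below is recalled from docs/build.md, so treat it as an assumption and verify against the current docs:

```bash
# Force-enable rocWMMA flash attention on an RDNA4 (gfx12) build
# (GGML_HIP_ROCWMMA_FATTN_GFX12 is the assumed option name - check docs/build.md)
cmake -S . -B build \
      -DGGML_HIP=ON \
      -DGGML_HIP_ROCWMMA_FATTN=ON \
      -DGGML_HIP_ROCWMMA_FATTN_GFX12=ON \
      -DAMDGPU_TARGETS=gfx1201
cmake --build build --config Release -- -j
```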
-
$ vulkaninfo
WARNING: radv is not a conformant Vulkan implementation, testing use only.
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.275
...
I installed the AMD drivers at the same time I installed ROCm 6.4.3
-
With rocwmma_fattn, thermals at 50C
build: d82f6aa (6321)
-
With rocwmma_fattn, thermals at 51.9C
build: 9c979f865 (6248)
-
RX 9070 XT (XFX Mercury)
OS: Ubuntu 24.04, ROCm 7.0 RC1
build: 2c8dac72 (6367)
I should mention this uses a llama.cpp compiled from source to enable rocWMMA, following the updated instructions in the manuel_instructions.md from this pull request.
-
RX 6900 XT
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: a972fae (6428)
ggml_vulkan: Found 1 Vulkan devices:
build: a972fae (6428)
Vulkan's 25% faster token generation makes it the only viable option; ROCm isn't there yet.
-
Tested MI300X with the latest build. Looks like there's a performance regression with FA compared to previous results?
build: ae355f6 (6432)
I also tested with ROCm 7.0.0-rc1:
build: ae355f6 (6432)
-
AMD Radeon RX 9060 XT (16 GB)
Operating System: Microsoft Windows 11 24H2 (Build 26100.4061)
ROCm (HIP)
build: a0e13dc (6470)
Vulkan
build: a0e13dc (6470)
-
Ryzen AI Max+ 395 (128GB memory)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 1d0125b (6552)
-
I took a build of ROCm 7 and tried again (with LD_PRELOAD, since Arch hasn't packaged ROCm 7 yet). Haven't been able to get [...]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: f505bd8 (6560)
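For context, pointing an existing build at a ROCm 7 runtime unpacked outside the package manager could look roughly like this; the /opt/rocm-7.0.0 path and the choice of preloaded library are assumptions about that setup, not something stated above:

```bash
# Run llama-bench against a manually unpacked ROCm 7 runtime
ROCM7=/opt/rocm-7.0.0
LD_LIBRARY_PATH="$ROCM7/lib:$LD_LIBRARY_PATH" \
LD_PRELOAD="$ROCM7/lib/libamdhip64.so" \
  ./build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
```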
-
version [...]
llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -fa 1,0
build: f505bd8 (6560)
-
Did a second build with the latest ROCm version, 7.0.1 from Sep 4th, HIP version 7.0.51831-a3e329ad8.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: ca71fb9 (6692)
-
AI Max+ 395
Using ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: e60f241 (6755)
I'm running Ubuntu 24.04.3 and ROCm 7.0.2.
-
AMD Radeon RX 7900 GRE (Sapphire Pulse 7900 GRE)
System details
ROCm results
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 6fa3b55 (1027)
Vulkan results
ggml_vulkan: Found 2 Vulkan devices:
build: 5bed329 (1022)
-
RX 9070 (XFX Quicksilver 9070 OC)
System details
lemonade llama.cpp-rocm with custom build patch, against ROCm 7.10.0a (therock-dist-windows-gfx120X-all-7.10.0a20251022)
llama-bench.exe -m ../llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: dd62dcf (1)
Official HIP build (6827), HIP 6.4
llama-bench.exe -m ../llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: d0660f2 (6827)