Performance of llama.cpp on AMD ROCm (HIP) #15021
This is similar to the Performance of llama.cpp on Apple Silicon M-series, Performance of llama.cpp on Nvidia CUDA, and Performance of llama.cpp with Vulkan threads, but for ROCm! I think it's good to consolidate and discuss our results here.

We'll be testing the Llama 2 7B model like the other threads to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4 GB GPU. You can download it here.

Instructions
Either run the commands below or download one of our ROCm (HIP) releases. If you have multiple GPUs, please run the test on a single GPU using -sm none -mg 0. Share your llama-bench results along with the git hash and ROCm info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard. If multiple entries are posted for the same device, I'll prioritize newer commits with substantial ROCm updates; otherwise I'll pick the one with the highest overall score at my discretion.

Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that your memory speed and number of channels will greatly affect your inference speed!

ROCm Scoreboard for Llama 2 7B, Q4_0 (no FA)
ROCm Scoreboard for Llama 2 7B, Q4_0 (with FA)

More detailed test
The main idea of this test is to show the decrease in performance with increasing context size.
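For reference, a typical single-GPU run looks like the sketch below (the model filename and flags follow the commands posted in the replies; adjust the path to wherever you downloaded the GGUF):

```bash
# Benchmark Llama 2 7B Q4_0 with and without flash attention,
# pinned to a single GPU via -sm none -mg 0
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0
```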
-
RX 7800 XT (Sapphire Pulse 280W)
ggml_cuda_init: found 1 ROCm devices:
build: 00131d6 (6031)
ggml_vulkan: Found 1 Vulkan devices:
build: baad948 (6056)
Notes:
-
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: e725a1a (6104)
-
Happy to replicate:
ggml_cuda_init: found 1 ROCm devices:
build: 9c35706 (6060)
On Linux
-
RX 7600 XT
ggml_cuda_init: found 1 ROCm devices:
build: 9c35706 (6060)
Running on Linux 6.12.32, mainline amdgpu, ROCm 6.4.1.
ggml_vulkan: Found 1 Vulkan devices:
build: 9c35706 (6060)
-
@olegshulyakov, the 7600 XT actually has a 128-bit memory bus.
-
AMD MI60. Happy to contribute.
I will post FA=1 and Vulkan results once I have time over the weekend.
-
MI100
Using ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0
build: 9c35706 (6060)
I'm running Ubuntu 24.04.2 and ROCm 6.4.1.
-
I expected it to be faster than the RX 7800 XT because of HBM2... Have you tried launching with a single device only?
-
Bandwidth utilization is still fairly low on the GCN/CDNA parts (GCN and CDNA are effectively the same thing for tg).
GCN/CDNA is quite difficult to get decent utilization on, as these parts are very register-starved and have very small caches.
The MI100 also doesn't really have 1.2 TB/s of bandwidth; it is limited to a sustained 1024 GB/s by its fabric bandwidth.
-
AMD Instinct MI300X
root@0-4-9-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: 2bf3fbf (6069)
Ref: #14640
-
I'm just referring to the rocWMMA flag from the build instructions: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#hip
To enhance flash attention performance on RDNA3+ or CDNA architectures, you can utilize the rocWMMA library by enabling the -DGGML_HIP_ROCWMMA_FATTN=ON option. This requires rocWMMA headers to be installed on the build system.
It should work for CDNA too, but we have only tested with our RDNA3 cards (7900 XTX) and saw huge performance jumps in PP with FA on: #10879 (reply in thread)
Please try it out, because getting one-third of the PP performance with FA on is just... strange at best.
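For reference, a build along these lines should enable it; this is a sketch based on the HIP build instructions in docs/build.md, and the AMDGPU_TARGETS value (gfx1100 here, for RDNA3) is an assumption you should adjust for your card:

```bash
# Build the HIP backend with rocWMMA-based flash attention enabled
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build \
        -DGGML_HIP=ON \
        -DGGML_HIP_ROCWMMA_FATTN=ON \
        -DAMDGPU_TARGETS=gfx1100 \
        -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 16
```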
-
With rocWMMA enabled:
root@0-4-9-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: ee3a9fc (6090)
-
Performance sits between the RTX 4090 and the RTX 5090.
-
So rocWMMA does work for CDNA in FA :)
-
not very well
-
Pro V620
Why does FA slow down the V620 so much? It's been a question I've been trying to answer for a while now.
build: 03d4698 (6074)
Linux, ROCm 6.4.1 (will try upgrading soon)
-
@samteezy Can you please run it for each device (PRO V620 / Pro WX 3200) separately, and with the ROCm-only backend?
-
@olegshulyakov The numbers come out the same. Forcing a single GPU:
root@llama:~# /root/llama-builds/llama.cpp/bin/llama-bench -m /mnt/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -mg 0 -sm none
build: 5c0eb5e (6075)
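An alternative way to pin the benchmark to one card, used later in this thread, is to hide the other GPUs from the HIP runtime. A sketch, assuming device 0 is the V620 (the device index is an assumption; check rocm-smi for the actual ordering):

```bash
# Expose only GPU 0 to the ROCm backend, then benchmark it on its own
HIP_VISIBLE_DEVICES=0 /root/llama-builds/llama.cpp/bin/llama-bench \
  -m /mnt/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
```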
-
Powercolor Hellhound RX 7900 XTX (400W power limit)
openSUSE Tumbleweed system with rocm packages from
build: 5c0eb5e (6075)
Sapphire Nitro 7900 XTX (400W power limit)
In a different PC, unfortunately, because these GPUs are too chonky to fit in a regular case.
build: 9c35706 (6060)
-
So what are the recommended settings / what should I do to get the best performance on the 7900 XTX? I have a SAPPHIRE NITRO+ AMD Radeon RX 7900 XTX Vapor-X 24GB and, without changing anything, I get WAY worse results.
build: b9be58d (1005)
And this is what LACT showed while running it: [image]
-
At these high speeds with fast GPUs, the CPU becomes important for results; your results and his would be within the expected variance for differing CPU performance.
Not that it really matters much, as the CPU becomes almost irrelevant once a model large enough to fill the device is used.
Benchmarking Llama 7B Q4_0 is not really that great, as it doesn't reflect actual usage much; this hurts the most on CDNA devices, which scale better than you would expect performance-wise when increasing the number of parameters.
-
Go for ROCm 7.0; it has officially been released now. And compile llama.cpp with rocWMMA enabled, see #15021 (comment). You should get much better results.
-
ROCm 7 is released? That's great news! I'll try it out as soon as it lands in my distro's package manager.
That said, I don't think it'll give any performance improvements for the 7900 XTX. The reason the 9070 XT gets a boost with ROCm 7 is that pre-ROCm-7 WMMA is not implemented for RDNA4. But we'll see.
BTW, FYI: in my benchmarks the Powercolor 7900 XTX was paired with a Ryzen 2700 with 64GB RAM, and the Sapphire 7900 XTX was paired with a 5700X3D, also with 64GB RAM. I did not see an appreciable performance difference between the two setups in LLM inference.
-
I see. To me, 3573 vs 3053 seems like a big difference. I'm running this on an AMD Threadripper 1920X (12 cores) with PCIe 3.0 x16. What benchmark could I use to compare real-world inference performance more accurately?
As for ROCm 7, it's not yet in the Arch repos, so I'll have to wait a bit. It also wouldn't be an accurate comparison against this older result.
Now that I've booted with the amdgpu.ppfeaturemask=0xffffffff kernel parameter, I can increase the TDP up to 402W, but I don't see any performance impact at all - a 305W limit vs 402W gives the same benchmark result, so it looks like it doesn't matter.
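For anyone wanting to reproduce the power-limit test, the parameter is usually added through the bootloader; a sketch for a GRUB-based system (file paths are the common defaults and may differ per distro):

```bash
# 1. Append the override mask to the kernel command line in /etc/default/grub:
#      GRUB_CMDLINE_LINUX_DEFAULT="... amdgpu.ppfeaturemask=0xffffffff"
# 2. Regenerate the GRUB config and reboot
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot
# 3. After reboot, confirm the parameter is active
grep -o 'amdgpu.ppfeaturemask=[^ ]*' /proc/cmdline
```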
-
Powercolor Red Devil 7900 XTX
Adrenalin 25.8.1 just came out, so time to test again.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
build: 2572689 (6099)
Still lower than the historical highs from May 26th (3599 and 3743), and a loss and a win against July 22nd (3529 and 3598).
-
RX 7900 XTX (ASUS TUF)
build: 6c7e9a5 (6118)
-
RX 6800 (16GB, 203W)
ROCm 6.3.4 on Ubuntu 24.04 in a Docker container
build: 79c1160 (6123)
Bonus benchmarks
I ran these to compare ROCm versions on various models. Obviously the results are specific to my RX 6800 and shouldn't be used to make any judgments about ROCm performance in general, especially on RDNA3 and later GPUs. I use 6.3.4 because I don't care about Llama 3 8B. Note how fast the new MoE models are - gpt-oss-20B, even at Q6_K_XL, is faster than this 7B Q4_0 model. (Do make sure that you have a fixed version, because the original gpt-oss releases had some issues - I used https://huggingface.co/unsloth/gpt-oss-20b-GGUF.)
ROCm 6.3.4
ROCm 6.4.3
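For anyone who wants to reproduce the containerized setup, the usual pattern is to pass the KFD and DRI devices through to a ROCm image; a sketch, where the image tag and the model directory are assumptions rather than the exact setup used above:

```bash
# Start a ROCm development container with GPU access and a host model directory mounted
docker run --rm -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  -v /path/to/models:/models \
  rocm/dev-ubuntu-24.04:6.3.4 bash
# Inside the container: build llama.cpp with -DGGML_HIP=ON and run llama-bench as usual
```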
-
RX 7900 XTX (ASUS TUF, overclocked by 100 MHz on core and VRAM)
./build/bin/llama-bench -m /home/vk/Downloads/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: 648ebcd (6146)
-
RX 6900 XT AMD Reference Card (stock clocks)
Debian Testing
llama.cpp version: gguf-v0.17.1-386-gfd1234cb
-
@tdjb The results are pretty low; can you re-test using llama.cpp standalone, without Docker?
-
Just a quick test: I installed the llama.cpp build from Debian sid (was surprised to even find a build, to be honest), which appears to be b5882, and the results came in quite similar. I tried the benchmark on both of my devices, as one is on a slower PCIe x4 slot; the results below are from the faster run. Why do you think the 6900 XT should perform better?
Happy to run further tests.
-
It should be about 10% better, to my understanding, according to the specs: RX 6900 XT and RX 6800 XT.
-
The RDNA2 results are mostly surprisingly high given the hardware capabilities, not low.
-
GigaByte R9700
-
So the 7900 XTX has better performance; that's sad. It's also weird that Vulkan performs way worse than ROCm on pp but better on tg.
-
We'll see when someone a bit more experienced gives it a shot. My benchmarks are about as vanilla as it gets: I threw it in an Unraid server (12700K and 128GB DDR4-2133), made Docker images, and ran the benchmarks. Many of the 7900 XTX results are bare-metal, have a factory overclock or are manually overclocked, use additional drivers, and/or have raised power limits. I bet someone will beat my benchmarks shortly.
-
> So 7900 XTX has better performance, that's sad. Also weird that Vulkan performs way worse on pp than ROCm but better on tg.

There is zero reason to expect the 9070 (XT) to perform better than the XTX.
-
Radeon RX 9070 (non-XT)
build: 65349f2 (6183)
I tried to enable the use of rocwmma with [...]. Still surprising that these numbers are better than the 9070 XT.
-
EDIT: see my comment below.
I bought 2x MI50 32GB VRAM from Alibaba and for some reason I'm getting really poor performance on them... No idea why; even an RX Vega 64 beats them, and they're way slower than someone else's MI50 16GB.
build: 21c17b5 (3)
build: 21c17b5 (3)
-
Found the issue: my RX Vega 64 was crashing/locking up, and that's why I had disabled power management.
build: 21c17b5 (3)
build: 21c17b5 (3)
But the weird thing is that now my RX Vega 64 with power management enabled performs worse...
-
Can you please run a longer one?
llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048
It would be interesting to see whether the large memory helps with the degradation at longer contexts.
-
I have disassembled my system, so I won't be able to test it for a while. But someone else from the MI50 Discord ran it:
# HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m ~/.cache/huggingface/llama/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp512 | 1048.27 ± 3.41 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp1024 | 927.94 ± 0.70 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp2048 | 681.32 ± 0.43 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp4096 | 470.72 ± 0.42 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp8192 | 365.29 ± 0.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp16384 | 236.79 ± 0.10 |
/workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:85: ROCm error
/workspace/llama.cpp/build/bin/libggml-base.so(+0x16ccb)[0x7d951ac1dccb]
/workspace/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x21f)[0x7d951ac1e12f]
/workspace/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x152)[0x7d951ac1e302]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f90612)[0x7d951a1a9612]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f9f8c2)[0x7d951a1b88c2]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f98886)[0x7d951a1b1886]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f97fc2)[0x7d951a1b0fc2]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f9602f)[0x7d951a1af02f]
/workspace/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x3fd)[0x7d951ac376cd]
/workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0x99)[0x7d951ad4b899]
/workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x105)[0x7d951ad4bc45]
/workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x2d4)[0x7d951ad52314]
/workspace/llama.cpp/build/bin/libllama.so(llama_decode+0x10)[0x7d951ad53260]
./build/bin/llama-bench(+0x1a92a)[0x60e3f404692a]
./build/bin/llama-bench(+0x14cb4)[0x60e3f4040cb4]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7d951a6cad90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7d951a6cae40]
./build/bin/llama-bench(+0x19945)[0x60e3f4045945]
Aborted (core dumped)
-
It is interesting how Vulkan outperforms ROCm by a large margin when flash attention is on. FA is a big gain, so I guess I'll be running Vulkan over ROCm lol.
I was looking to buy 2 (maybe 3) MI50s because the hardware specs seem really good on paper for such a low price, but from all of the benchmarks I've seen it seems like the software just isn't there yet to really squeeze out all the performance they have to offer. It seems like 20% is still left on the table. Hopefully the support only gets better, but they are old cards, so I doubt it.
-
RX 9070 XT (Powercolor Red Devil)
OS: Ubuntu 24.04, ROCm 6.4.3
build: 043fb27 (6264)
Vulkan
build: c9a24fb (6262)
-
Vulkan with FA enabled has impressive pp and tg gains. What version of Vulkan are you using? I see it is RADV, but is this the latest Vulkan driver? I haven't seen such a speedup on MI50 cards with Vulkan yet. Thanks!
-
This is because WMMA flash attention is currently disabled on gfx12 until ROCm 7.0 (or when force-enabled).
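If memory serves, there is a separate CMake switch to force it on gfx12 before ROCm 7; the option name below is recalled from docs/build.md, so treat it as an assumption and verify against the current docs:

```bash
# Force-enable rocWMMA flash attention on an RDNA4 (gfx12) build
# (GGML_HIP_ROCWMMA_FATTN_GFX12 is the assumed option name - check docs/build.md)
cmake -S . -B build \
      -DGGML_HIP=ON \
      -DGGML_HIP_ROCWMMA_FATTN=ON \
      -DGGML_HIP_ROCWMMA_FATTN_GFX12=ON \
      -DAMDGPU_TARGETS=gfx1201
cmake --build build --config Release -- -j
```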
-
$ vulkaninfo
WARNING: radv is not a conformant Vulkan implementation, testing use only.
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.275
...
I installed the AMD drivers at the same time I installed ROCm 6.4.3
-
With rocwmma_fattn, thermals at 50C
build: d82f6aa (6321)
-
With rocwmma_fattn, thermals at 51.9C
build: 9c979f865 (6248)
-
RX 9070 XT (XFX Mercury)
OS: Ubuntu 24.04, ROCm 7.0 RC1
build: 2c8dac72 (6367)
I should mention this uses a llama.cpp compiled from source to enable rocWMMA, following the updated instructions in the manuel_instructions.md from this pull request.
-
RX 6900 XT
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: a972fae (6428)
ggml_vulkan: Found 1 Vulkan devices:
build: a972fae (6428)
Vulkan's 25% faster token generation makes it the only viable option; ROCm isn't there yet.
-
Tested MI300X with the latest build. Looks like there's a performance regression with FA compared to previous results?
build: ae355f6 (6432)
I also tested with ROCm 7.0.0-rc1:
build: ae355f6 (6432)
-
AMD Radeon RX 9060 XT (16 GB)
Operating System: Microsoft Windows 11 24H2 (Build 26100.4061)
ROCm (HIP)
build: a0e13dc (6470)
Vulkan
build: a0e13dc (6470)
-
Ryzen AI Max+ 395 (128GB memory)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 1d0125b (6552)
-
I took a build of ROCm 7 and tried again (with LD_PRELOAD, since Arch hasn't packaged ROCm 7 yet). Haven't been able to get [...]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: f505bd8 (6560)
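For context, pointing an existing build at a ROCm 7 runtime unpacked outside the package manager could look roughly like this; the /opt/rocm-7.0.0 path and the choice of preloaded library are assumptions about that setup, not something stated above:

```bash
# Run llama-bench against a manually unpacked ROCm 7 runtime
ROCM7=/opt/rocm-7.0.0
LD_LIBRARY_PATH="$ROCM7/lib:$LD_LIBRARY_PATH" \
LD_PRELOAD="$ROCM7/lib/libamdhip64.so" \
  ./build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
```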
-
version [...]
llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -fa 1,0
build: f505bd8 (6560)
-
Did a second build with the latest ROCm version, 7.0.1 from Sep 4th, HIP version 7.0.51831-a3e329ad8.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: ca71fb9 (6692)
-
AI Max+ 395
Using ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: e60f241 (6755)
I'm running Ubuntu 24.04.3 and ROCm 7.0.2.
-
AMD Radeon RX 7900 GRE (Sapphire Pulse 7900 GRE)
System details
ROCm results
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 6fa3b55 (1027)
Vulkan results
ggml_vulkan: Found 2 Vulkan devices:
build: 5bed329 (1022)
-
RX 9070 (XFX Quicksilver 9070 OC)
System details
lemonade llama.cpp-rocm with custom build patch, against ROCm 7.10.0a (therock-dist-windows-gfx120X-all-7.10.0a20251022)
llama-bench.exe -m ../llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: dd62dcf (1)
Official HIP build (6827), HIP 6.4
llama-bench.exe -m ../llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: d0660f2 (6827)