
Performance of llama.cpp on AMD ROCm (HIP) #15021

olegshulyakov started this conversation in Show and tell

This is similar to the Performance of llama.cpp on Apple Silicon M-series, Performance of llama.cpp on Nvidia CUDA and Performance of llama.cpp with Vulkan, but for ROCm! I think it's good to consolidate and discuss our results here.

We'll be testing the Llama 2 7B model like the other threads to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4 GB GPU. You can download it here.

Instructions

Either run the commands below or download one of our ROCm (HIP) releases. If you have multiple GPUs, please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER unless the model is too big to fit in VRAM.

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
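For example, on a multi-GPU machine you can pin the benchmark to a single device like this (the 0 is just a placeholder; substitute your own GPU number):

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0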

Share your llama-bench results along with the git hash and ROCm info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.

If multiple entries are posted for the same device, I'll prioritize newer commits with substantial ROCm updates; otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary with driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that memory speed and the number of memory channels will greatly affect your inference speed!

ROCm Scoreboard for Llama 2 7B, Q4_0 (no FA)

| Chip | Memory | pp512 t/s | tg128 t/s | Commit | Thanks to |
| --- | --- | ---: | ---: | --- | --- |
| Instinct MI300X | 192 GB / HBM3 / 8192 bit | 11476.40 ± 72.79 | 232.92 ± 0.53 | ee3a9fc | @yeahdongcn |
| RX 7900 XTX | 24 GB / GDDR6 / 384 bit | 3529.10 ± 56.47 | 153.91 ± 0.18 | 2572689 | @Diablo-D3 |
| Pro W7900 | 48 GB / GDDR6 / 384 bit | 2824.22 ± 47.55 | 117.65 ± 0.21 | 9c35706 | @65a |
| RX 7900 XT | 20 GB / GDDR6 / 320 bit | 3098.38 ± 24.02 | 116.15 ± 0.06 | 1e15bfd | @AdamNiederer |
| Instinct MI210 | 64 GB / HBM2e / 4096 bit | 2468.25 ± 9.11 | 113.71 ± 0.05 | d82f6aa | @65a |
| Instinct MI100 | 32 GB / HBM2 / 4096 bit | 2732.83 ± 1.98 | 110.48 ± 0.14 | 9c35706 | @firefox42 |
| RX 7800 XT | 16 GB / GDDR6 / 256 bit | 2151.81 ± 17.94 | 100.94 ± 0.10 | 00131d6 | @olegshulyakov |
| RX 9070 | 16 GB / GDDR6 / 256 bit | 2361.10 ± 0.88 | 99.39 ± 0.57 | 65349f2 | @prototypicall |
| RX 7900 GRE | 16 GB / GDDR6 / 256 bit | 1456.98 ± 12.39 | 96.07 ± 0.10 | 6fa3b55 | @MihaiBojescu |
| AI PRO R9700 | 32 GB / GDDR6 / 256 bit | 2746.39 ± 57.09 | 92.29 ± 0.25 | e2c1bff | @TheyreEatingTheGeese |
| Instinct MI60 | 32 GB / HBM2 / 4096 bit | 1289.11 ± 0.62 | 91.46 ± 0.13 | 504af20 | @Said-Akbar |
| RX 9070 XT | 16 GB / GDDR6 / 256 bit | 4065.43 ± 5.08 | 89.90 ± 0.22 | f505bd8 | @Hadrianneue |
| RX 6900 XT | 16 GB / GDDR6 / 256 bit | 1889.84 ± 31.21 | 88.49 ± 0.00 | a972fae | @notgood |
| Pro VII | 16 GB / HBM2 / 4096 bit | 1064.99 ± 1.18 | 87.45 ± 0.04 | 2739a71 | @8XXD8 |
| Instinct MI50 | 32 GB / HBM2 / 4096 bit | 1040.64 ± 1.20 | 87.40 ± 0.48 | 21c17b5 | @davispuh |
| RX 6800 XT | 16 GB / GDDR6 / 256 bit | 1447.07 ± 1.36 | 83.92 ± 0.03 | 79c1160 | @MrLavender |
| Instinct MI50 | 16 GB / HBM2 / 4096 bit | 1086.77 ± 0.28 | 77.12 ± 0.17 | 89d604f | Reddit |
| Pro V620 | 32 GB / GDDR6 / 256 bit | 1803.65 ± 2.54 | 74.66 ± 0.01 | 5c0eb5e | @samteezy |
| RX 9060 XT | 16 GB / GDDR6 / 256 bit | 1419.67 ± 3.64 | 67.58 ± 0.24 | a0e13dc | @lcy0321 |
| RX 5700 XT | 8 GB / GDDR6 / 256 bit | 354.17 ± 0.18 | 67.55 ± 0.04 | c05e8c9 | @daniandtheweb |
| Instinct MI25 | 16 GB / HBM2 / 2048 bit | 409.83 ± 0.23 | 63.94 ± 0.06 | 2739a71 | @8XXD8 |
| AI Max+ 395 | 128 GB / LPDDR5 | 911.36 ± 1.79 | 50.01 ± 0.07 | e60f241 | @firefox42 |
| RX 7600 XT | 16 GB / GDDR6 / 128 bit | 1099.64 ± 2.05 | 48.58 ± 0.06 | 9c35706 | @wbruna |
| RX Vega 64 | 8 GB / HBM2 / 2048 bit | 240.68 ± 0.09 | 48.46 ± 0.09 | ec428b0 | @davispuh |
| Radeon 8060S | System Shared / DDR5 | 351.36 ± 0.67 | 47.97 ± 0.33 | 1d0125b | @hspak |

ROCm Scoreboard for Llama 2 7B, Q4_0 (with FA)

| Chip | Memory | pp512 t/s | tg128 t/s | Commit | Thanks to |
| --- | --- | ---: | ---: | --- | --- |
| Instinct MI300X | 192 GB / HBM3 / 8192 bit | 11945.97 ± 54.29 | 218.53 ± 0.09 | ee3a9fc | @yeahdongcn |
| RX 7900 XTX | 24 GB / GDDR6 / 384 bit | 3633.86 ± 10.29 | 145.23 ± 0.10 | 2572689 | @Diablo-D3 |
| Pro W7900 | 48 GB / GDDR6 / 384 bit | 3062.83 ± 13.54 | 115.23 ± 0.19 | 9c35706 | @65a |
| RX 7900 XT | 20 GB / GDDR6 / 320 bit | 3261.75 ± 9.09 | 112.30 ± 0.06 | 1e15bfd | @AdamNiederer |
| Instinct MI100 | 32 GB / HBM2 / 4096 bit | 2755.00 ± 3.68 | 104.71 ± 0.10 | 9c35706 | @firefox42 |
| Instinct MI210 | 64 GB / HBM2e / 4096 bit | 2540.90 ± 3.05 | 98.23 ± 0.04 | d82f6aa | @65a |
| RX 7900 GRE | 16 GB / GDDR6 / 256 bit | 1598.79 ± 11.48 | 97.53 ± 0.06 | 6fa3b55 | @MihaiBojescu |
| RX 9070 | 16 GB / GDDR6 / 256 bit | 1147.66 ± 1.06 | 97.10 ± 0.33 | 65349f2 | @prototypicall |
| RX 7800 XT | 16 GB / GDDR6 / 256 bit | 2304.63 ± 2.85 | 95.99 ± 0.21 | 00131d6 | @olegshulyakov |
| AI PRO R9700 | 32 GB / GDDR6 / 256 bit | 1300.26 ± 3.04 | 93.28 ± 0.45 | e2c1bff | @TheyreEatingTheGeese |
| RX 9070 XT | 16 GB / GDDR6 / 256 bit | 4027.14 ± 4.65 | 89.17 ± 0.17 | fd1234c | @Hadrianneue |
| RX 6900 XT | 16 GB / GDDR6 / 256 bit | 1948.31 ± 13.51 | 85.04 ± 0.02 | a972fae | @notgood |
| Instinct MI50 | 32 GB / HBM2 / 4096 bit | 446.60 ± 1.25 | 76.44 ± 0.04 | 21c17b5 | @davispuh |
| Pro V620 | 32 GB / GDDR6 / 256 bit | 1256.86 ± 0.55 | 70.83 ± 0.02 | 5c0eb5e | @samteezy |
| Instinct MI50 | 16 GB / HBM2 / 4096 bit | 769.74 ± 0.56 | 68.41 ± 0.01 | 89d604f | Reddit |
| RX 9060 XT | 16 GB / GDDR6 / 256 bit | 1479.27 ± 0.71 | 65.42 ± 0.19 | a0e13dc | @lcy0321 |
| RX 5700 XT | 8 GB / GDDR6 / 256 bit | 314.17 ± 0.29 | 62.02 ± 0.05 | c05e8c9 | @daniandtheweb |
| AI Max+ 395 | 128 GB / LPDDR5 | 1003.53 ± 2.91 | 49.87 ± 0.02 | e60f241 | @firefox42 |
| Radeon 8060S | System Shared / DDR5 | 366.08 ± 1.44 | 48.97 ± 0.15 | 1d0125b | @hspak |
| RX 7600 XT | 16 GB / GDDR6 / 128 bit | 1199.16 ± 1.07 | 47.65 ± 0.06 | 9c35706 | @wbruna |
| RX Vega 64 | 8 GB / HBM2 / 2048 bit | 153.17 ± 0.72 | 42.46 ± 0.40 | ec428b0 | @davispuh |

More detailed test

The main idea of this test is to show how performance decreases as the prompt and generation lengths increase.

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048

RX 7800 XT (Sapphire Pulse 280W)

ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7800 XT, gfx1101 (0x1101), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 0 pp512 2151.81 ± 17.94
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 0 tg128 100.94 ± 0.10
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 1 pp512 2304.63 ± 2.85
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 1 tg128 95.99 ± 0.21

build: 00131d6 (6031)


ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 0 pp512 2145.60 ± 23.14
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 0 tg128 96.89 ± 0.22
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 1 pp512 2063.66 ± 2.92
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 1 tg128 96.03 ± 0.09

build: baad948 (6056)

Notes:

  • Sapphire RX 7800 XT Pulse (Power Limit +15% - 280W)
  • Windows 10.
  • Drivers - Radeon Pro.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7800 XT, gfx1101 (0x1101), VMM: no, Wave Size: 32
load_backend: loaded ROCm backend from C:\Users\oleg.llama.cpp\llama-b6104-bin-win-hip-radeon-x64\ggml-hip.dll
load_backend: loaded RPC backend from C:\Users\oleg.llama.cpp\llama-b6104-bin-win-hip-radeon-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\oleg.llama.cpp\llama-b6104-bin-win-hip-radeon-x64\ggml-cpu-icelake.dll

model size params backend ngl sm fa mmap test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 0 pp512 2109.38 ± 15.79
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 0 pp1024 1749.56 ± 12.69
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 0 pp2048 1165.15 ± 1.02
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 0 pp4096 997.83 ± 0.53
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 0 pp8192 789.89 ± 0.46
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 0 pp16384 196.02 ± 0.96
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 0 tg128 99.55 ± 0.09
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 0 tg256 98.16 ± 0.15
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 0 tg512 90.29 ± 0.07
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 0 tg1024 80.56 ± 0.11
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 0 tg2048 62.58 ± 0.18
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 0 pp512 2296.72 ± 3.40
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 0 pp1024 2225.68 ± 2.84
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 0 pp2048 2069.86 ± 2.06
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 0 pp4096 1814.41 ± 2.23
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 0 pp8192 1423.62 ± 0.94
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 0 pp16384 992.13 ± 0.81
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 0 tg128 96.80 ± 1.30
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 0 tg256 95.98 ± 0.60
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 0 tg512 95.92 ± 0.27
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 0 tg1024 91.30 ± 0.74
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 0 tg2048 85.44 ± 0.29

build: e725a1a (6104)


Happy to replicate:

ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 2967.12 ± 31.25
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 116.00 ± 0.10
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 3163.24 ± 4.06
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 112.75 ± 0.04

build: 9c35706 (6060)

On Linux


RX 7600 XT

ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7600 XT, gfx1102 (0x1102), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 1099.64 ± 2.05
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 48.58 ± 0.06
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 1199.16 ± 1.07
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 47.65 ± 0.06

build: 9c35706 (6060)

Running on Linux 6.12.32, mainline amdgpu, ROCm 6.4.1.


ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 XT (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 pp512 606.24 ± 0.31
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 tg128 52.84 ± 0.44
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 pp512 612.33 ± 0.53
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 tg128 53.70 ± 0.33

build: 9c35706 (6060)


@olegshulyakov, the 7600 XT actually has a 128-bit memory bus.


AMD MI60.

Happy to contribute.
I am on Ubuntu 24.04 and ROCm 6.3.4. The GPU is connected at PCIe 4.0 x8. AMD Ryzen 9 5950X CPU with 96 GB RAM at 3200 MHz. Flash attention is disabled (FA=0).

model size params backend ngl sm test t/s build
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 none pp512 1289.11 ± 0.62 504af20 (4476)
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 none tg128 91.46 ± 0.13 504af20 (4476)

I will post FA=1 and Vulkan results once I have time during the weekend.


MI100

Using ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64
Device 1: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64

model size params backend ngl sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 0 pp512 2732.83 ± 1.98
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 0 tg128 110.48 ± 0.14
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 1 pp512 2755.00 ± 3.68
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 1 tg128 104.71 ± 0.10

build: 9c35706 (6060)

I'm running Ubuntu 24.04.2 and ROCm 6.4.1


I expected it to be faster than the RX 7800 XT because of HBM2... Have you tried launching with a single device only?

IMbackK (Collaborator), Aug 18, 2025:

Bandwidth utilization is still fairly low on the GCN/CDNA parts (GCN and CDNA behave the same for tg).
GCN/CDNA is quite difficult to get decent utilization on, as those parts are very register-starved and have very small caches.
The MI100 also doesn't really have 1.2 TB/s of bandwidth; it is limited to a sustained 1024 GB/s by its fabric bandwidth.


AMD Instinct MI300X

root@0-4-9-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Instinct MI300X VF, gfx942:sramecc+:xnack- (0x942), VMM: no, Wave Size: 64

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 11476.40 ± 72.79
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 218.87 ± 0.61
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 4037.07 ± 8.61
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 158.12 ± 0.21

build: 2bf3fbf (6069)

Ref: #14640


I'm just referring to the rocWMMA flag from the build instructions: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#hip

To enhance flash attention performance on RDNA3+ or CDNA architectures, you can utilize the rocWMMA library by enabling the -DGGML_HIP_ROCWMMA_FATTN=ON option. This requires rocWMMA headers to be installed on the build system.

It should work for CDNA too but we have only tested with our RDNA3 cards (7900 XTX) and saw huge performance jumps in PP with FA on: #10879 (reply in thread)

Please try it out, because 1/3rd of the performance in PP with FA on is just... strange at best.
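For reference, a build with that flag enabled looks roughly like this (a sketch based on the linked build docs; the gfx target below is only an example, swap in your GPU's architecture, e.g. gfx942 for the MI300X or gfx1100 for a 7900 XTX):

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DAMDGPU_TARGETS=gfx942 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 16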


With -DGGML_HIP_ROCWMMA_FATTN=ON:

root@0-4-9-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Instinct MI300X VF, gfx942:sramecc+:xnack- (0x942), VMM: no, Wave Size: 64

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 11021.13 ± 210.87
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 232.92 ± 0.53
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 11945.97 ± 54.29
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 218.53 ± 0.09

build: ee3a9fc (6090)


Performance sits between the RTX 4090 and the RTX 5090.


So rocWMMA does work for CDNA in FA :)

IMbackK (Collaborator), Aug 7, 2025:

Not very well.


Pro V620

Why does FA slow down the V620 so much? That's a question I've been trying to answer for a while now.

root@llama:/mnt/models# /root/llama-builds/llama.cpp/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -mg 0 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
 Device 0: AMD Radeon PRO V620, gfx1030 (0x1030), VMM: no, Wave Size: 32
 Device 1: AMD Radeon (TM) Pro WX 3200 Series, gfx803 (0x803), VMM: no, Wave Size: 64
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon (TM) Pro WX 3200 Series (RADV POLARIS12) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon PRO V620 (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none 
model size params backend threads sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,Vulkan,BLAS 10 none 0 pp512 1801.16 ± 3.33
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,Vulkan,BLAS 10 none 0 tg128 74.48 ± 0.02
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,Vulkan,BLAS 10 none 1 pp512 1258.12 ± 0.69
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,Vulkan,BLAS 10 none 1 tg128 70.74 ± 0.02

build: 03d4698 (6074)

Linux, ROCm 6.4.1 (will try upgrading soon)


@samteezy Can you please run it on each device (Pro V620 / Pro WX 3200) with the ROCm-only backend?


@olegshulyakov The numbers come out the same. Forcing -sm none -mg 0 ensures only the V620 is running. I don't benchmark the WX 3200.

root@llama:~# /root/llama-builds/llama.cpp/bin/llama-bench -m /mnt/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -mg 0 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon PRO V620, gfx1030 (0x1030), VMM: no, Wave Size: 32
Device 1: AMD Radeon (TM) Pro WX 3200 Series, gfx803 (0x803), VMM: no, Wave Size: 64

model size params backend ngl sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 0 pp512 1803.65 ± 2.54
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 0 tg128 74.66 ± 0.01
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 1 pp512 1256.86 ± 0.55
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 1 tg128 70.83 ± 0.02

build: 5c0eb5e (6075)


Powercolor Hellhound RX 7900 XTX (400W power limit)

openSUSE Tumbleweed system with ROCm packages installed from the AMD ROCm repository

Information for package rocm-hip:
---------------------------------
Repository : AMD ROCm (openSUSE_Factory)
Name : rocm-hip
Version : 6.4.1-6.5
Arch : x86_64
Vendor : obs://build.opensuse.org/science
Installed Size : 25.5 MiB
Installed : Yes
Status : up-to-date
Source package : rocclr-6.4.1-6.5.src
Upstream URL : https://github.com/ROCm/clr
Summary : ROCm HIP platform and device tool
Description : 
 HIP is a C++ Runtime API and Kernel Language that allows developers to create
 portable applications for AMD and NVIDIA GPUs from the same source code.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
 Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 3243.15 ± 10.32
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 125.84 ± 0.11
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 3557.68 ± 13.45
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 122.71 ± 0.11

build: 5c0eb5e (6075)

Sapphire Nitro 7900 XTX (400W power limit)

This one is in a different PC, unfortunately, because these GPUs are too chonky to fit two in a regular case.
So no tensor parallelism for now, but it serves my use case of running an LLM on one card and STT/TTS on the other to get a fully local voice-to-voice chatbot (just tried it with Amica and it works great! Very entertaining!).

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
 Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 3369.65 ± 10.61
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 122.06 ± 0.14
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 3573.30 ± 14.31
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 118.71 ± 0.14

build: 9c35706 (6060)


So what are the recommended settings to get the best performance on a 7900 XTX? I have a SAPPHIRE NITRO+ AMD Radeon RX 7900 XTX Vapor-X 24GB and, without changing anything, I get WAY worse results.
I'm using Arch Linux with everything updated to the latest (ROCm 6.4.3) and a freshly compiled llama.cpp.

$ llama-bench -ngl 99 -fa 0,1 -m ~/.cache/llama.cpp/TheBloke_Llama-2-7B-GGUF_llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
 Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 0 pp512 2817.81 ± 26.69
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 0 tg128 112.18 ± 0.45
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 1 pp512 3053.77 ± 17.38
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 1 tg128 110.70 ± 0.08

build: b9be58d (1005)

And this is what LACT showed while running it:

[LACT screenshot]
IMbackK (Collaborator), Sep 17, 2025:

At these high speeds with fast GPUs, the CPU becomes important for the results; your numbers and his would be within the expected variance for differing CPU performance.

Not that it really matters much, since the CPU becomes almost irrelevant once you use a model large enough to fill the device.

Benchmarking Llama 7B Q4_0 is not really that great, as it doesn't reflect actual usage much; this hurts the most on CDNA devices, which scale better than you would expect performance-wise as the parameter count increases.


Go for ROCm 7.0; it has officially been released now. And compile llama.cpp with rocWMMA enabled, see #15021 (comment). You should get much better results.


ROCm 7 is released? That's great news! I'll try it out as soon as it lands in my distro's package manager.

That said, I don't think it'll give any performance improvements for the 7900 XTX. The reason the 9070 XT gets a boost with ROCm 7 is that, before ROCm 7, WMMA is not implemented for RDNA4. But we'll see.

BTW, FYI: in my benchmarks the Powercolor 7900 XTX was paired with a Ryzen 2700 with 64 GB RAM, and the Sapphire 7900 XTX with a 5700X3D, also with 64 GB RAM. I did not see an appreciable performance difference between the two setups in LLM inference.


I see. To me, 3573 vs 3053 seems like a big difference. I'm running this on an AMD Threadripper 1920X (12 cores) with PCIe 3.0 x16. What benchmark could I use to compare real-world inference performance more accurately?
ROCm 7 is not yet in the Arch repos, so I'll have to wait a bit. It also wouldn't be an accurate comparison with this older result.

Now that I've booted with the amdgpu.ppfeaturemask=0xffffffff kernel parameter I can raise the TDP up to 402 W, but I see no performance impact at all: a 305 W limit vs 402 W gives the same benchmark results, so it looks like the power limit doesn't matter here.
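For anyone else who wants to try unlocking the power controls, the parameter goes on the kernel command line; a rough sketch for GRUB (file paths assumed, other bootloaders differ):

# append to the existing line in /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... amdgpu.ppfeaturemask=0xffffffff"
# regenerate the config and reboot
sudo grub-mkconfig -o /boot/grub/grub.cfg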


Powercolor Red Devil 7900 XTX

Adrenalin 25.8.1 just came out, so time to test again.
Ryzen 7 9800X3D
Windows 11 24H2 26100.4652

llama-win-hip/llama-bench.exe -m ./models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -r 100 -fa 0,1

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
 Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 100 0 pp512 3434.01 ± 38.33
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 100 0 tg128 153.91 ± 0.18
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 100 1 pp512 3633.86 ± 10.29
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 100 1 tg128 145.23 ± 0.10

build: 2572689 (6099)

Still lower than the historical highs on May 26th (3599 and 3743), and a loss and a win against July 22nd (3529 and 3598).


RX 7900 XTX (ASUS TUF)
Ubuntu 24.04.2
ROCm 6.4.2

./build/bin/llama-bench -m /home/vk/Downloads/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
 Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 3386.75 ± 5.33
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 128.25 ± 0.24
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 3674.25 ± 11.35
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 124.61 ± 0.06

build: 6c7e9a5 (6118)


RX 6800 (16GB 203W)

ROCm 6.3.4 on Ubuntu 24.04 in a Docker container

llama-bench --prio 1 -m /llama-cpp/models/local/llama-2-7b-Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
 Device 0: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32
model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 pp512 1447.07 ± 1.36
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 tg128 83.92 ± 0.03

build: 79c1160 (6123)

Bonus benchmarks

I ran these to compare ROCm versions on various models. Obviously the results are specific to my RX 6800 and shouldn't be used to make any judgments about ROCm performance in general, especially on RDNA3 and later GPUs. I use 6.3.4 because I don't care about Llama 3 8B.

Note how fast the new MoE models are: gpt-oss-20B, even at Q6_K_XL, is faster than this 7B Q4_0 model. (Do make sure you have a fixed version, because the original gpt-oss releases had some issues; I used https://huggingface.co/unsloth/gpt-oss-20b-GGUF).

ROCm 6.3.4

Lower than expected performance may be observed while running Llama 3 8B inference workloads with Llama.cpp

ROCm 6.4.3

  • The Llama 3 8B issue still exists
  • ~8% performance regression in qwen2 14B Q6_K and qwen3 14B Q6_K prompt processing

RX 7900 XTX (ASUS TUF, core and VRAM overclocked by 100 MHz)
Ubuntu 24.04.2
ROCm 7.0-rc1

./build/bin/llama-bench -m /home/vk/Downloads/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 3473.24 ± 12.30
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 132.17 ± 0.03
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 3698.73 ± 17.60
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 127.43 ± 0.04

build: 648ebcd (6146)


RX 6900 XT AMD Reference Card (Stock clocks)
Ryzen 7 5800X3D with 32 GB 3600 MHz CL18 RAM

Debian Testing
Using Docker image rocm/rocm-terminal with additions.

llama.cpp version: gguf-v0.17.1-386-gfd1234cb

./src/llama.cpp/build/bin/llama-bench -m models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
 Device 0: AMD Radeon RX 6900 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
 Device 1: AMD Radeon RX 6900 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32

model size params backend ngl sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 0 pp512 1824.47 ± 1.02
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 0 tg128 83.02 ± 0.01
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 1 pp512 1250.68 ± 0.73
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 1 tg128 80.45 ± 0.02

@tdjb The results are pretty low; can you re-test with llama.cpp standalone, without Docker?


Just a quick test: I installed the llama.cpp build from Debian sid (was surprised to even find a build, to be honest), which appears to be b5882, and the results came in quite similar. I tried the benchmark on both of my devices, as one sits in a slower PCIe 4x slot; the results below are from the faster run.

Why do you think the 6900 XT should perform better?
Seeing the 6800 XT results above being a little slower made mine seem reasonable.
While reading the post again, I saw those were also run using Docker.

model size params backend ngl sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 0 pp512 1835.54 ± 2.20
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 0 tg128 74.90 ± 0.08
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 1 pp512 1314.84 ± 0.77
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 1 tg128 68.28 ± 0.07

Happy to run further tests.


It should be about 10% faster as I understand it, according to the specs: RX 6900 XT and RX 6800 XT

IMbackK (Collaborator), Aug 18, 2025:

The RDNA2 results are mostly surprisingly high given the hardware capabilities, not low.


Gigabyte R9700
build: e2c1bff (6177) | llama.cpp Vulkan and ROCm Docker containers

Note: both Vulkan and ROCm results are below.
The Vulkan benchmarks showed the warning "radv is not a conformant Vulkan implementation, testing use only."

llama-cli --bench --model /models/Qwen3-32B-Q4_K_M.gguf -ngl 100 -fa 0 -p 512,1024,2048,4096,8192,16384,30720 -n 128,256,512,1024

  • The Vulkan 32K prompt ran out of memory, so I changed it to 30K
  • ROCm: the 16K+ prompts also had errors (though not out of memory)
model size params backend ngl test t/s
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 pp512 196.90 ± 0.43
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 pp1024 193.73 ± 0.22
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 pp2048 191.62 ± 0.36
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 pp4096 184.77 ± 0.14
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 pp8192 171.50 ± 0.08
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 pp16384 149.20 ± 0.11
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 pp30720 118.38 ± 1.08
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 pp512 203.35 ± 0.47
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 tg128 28.20 ± 0.03
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 tg256 28.14 ± 0.01
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 tg512 27.96 ± 0.01
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B Vulkan 100 tg1024 27.67 ± 0.01
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B ROCm 100 pp512 498.66 ± 0.59
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B ROCm 100 pp1024 473.24 ± 0.84
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B ROCm 100 pp2048 435.33 ± 0.62
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B ROCm 100 pp4096 380.48 ± 0.39
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B ROCm 100 pp8192 304.56 ± 0.15
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B ROCm 100 pp512 501.91 ± 0.66
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B ROCm 100 tg128 24.03 ± 0.04
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B ROCm 100 tg256 24.06 ± 0.02
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B ROCm 100 tg512 23.67 ± 0.02
qwen3 32B Q4_K - Medium 18.40 GiB 32.76 B ROCm 100 tg1024 22.88 ± 0.01

llama-cli --bench --model /models/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024

  • ROCm: the 32K prompt had errors
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 0 pp512 1943.56 ± 6.92
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 0 pp1024 1879.03 ± 6.97
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 0 pp2048 1758.15 ± 2.78
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 0 pp4096 1507.73 ± 2.83
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 0 pp8192 1078.38 ± 0.53
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 0 pp16384 832.26 ± 0.67
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 0 pp32768 466.09 ± 0.19
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 0 tg128 124.13 ± 0.95
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 0 tg256 123.30 ± 0.19
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 0 tg512 119.96 ± 0.13
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 0 tg1024 114.71 ± 0.08
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 pp512 1863.64 ± 6.66
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 pp1024 1780.54 ± 7.25
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 pp2048 1640.52 ± 3.72
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 pp4096 1417.17 ± 4.65
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 pp8192 1119.76 ± 0.41
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 pp16384 786.26 ± 0.83
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 pp32768 490.12 ± 0.47
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 tg128 124.65 ± 0.13
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 tg256 124.72 ± 0.08
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 tg512 122.66 ± 0.09
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 tg1024 119.27 ± 0.11
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 pp512 2746.39 ± 57.09
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 pp1024 2672.60 ± 7.19
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 pp2048 2475.62 ± 9.50
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 pp4096 2059.84 ± 0.94
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 pp8192 1333.60 ± 0.25
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 pp16384 1014.06 ± 0.35
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 pp24576 769.31 ± 0.37
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 tg128 92.29 ± 0.25
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 tg256 92.34 ± 0.25
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 tg512 90.28 ± 0.13
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 tg1024 86.91 ± 0.10
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 pp512 1300.26 ± 3.04
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 pp1024 1009.69 ± 1.54
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 pp2048 695.68 ± 0.34
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 pp4096 428.36 ± 0.04
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 pp8192 242.06 ± 0.03
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 pp16384 129.46 ± 0.01
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 pp24576 88.34 ± 0.02
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 tg128 93.28 ± 0.45
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 tg256 93.22 ± 0.12
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 tg512 91.31 ± 0.09
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 tg1024 88.87 ± 0.35

So the 7900 XTX has better performance; that's sad. Also weird that Vulkan performs way worse than ROCm on pp but better on tg.


We'll see when someone a bit more experienced gives it a shot. My benchmarks are about as vanilla as it gets: I threw it in an Unraid server (12700K and 128 GB DDR4-2133), made Docker images, and ran the benchmarks. Many of the 7900 XTX results are bare metal, have a factory overclock or are manually overclocked, have additional drivers installed, and/or have raised power limits. I bet someone will beat my numbers shortly.

IMbackK (Collaborator), Aug 18, 2025:

> So the 7900 XTX has better performance; that's sad. Also weird that Vulkan performs way worse than ROCm on pp but better on tg.

There is zero reason to expect the 9070 (XT) to perform better than the XTX.


Radeon RX 9070 (non-XT)

ggml_cuda_init: found 1 ROCm devices:
 Device 0: AMD Radeon RX 9070, gfx1201 (0x1201), VMM: no, Wave Size: 32
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 pp512 2361.10 ± 0.88
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 0 tg128 99.39 ± 0.57
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 pp512 1147.66 ± 1.06
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 100 1 tg128 97.10 ± 0.33

build: 65349f2 (6183)

I tried to enable rocWMMA with -DGGML_HIP_ROCWMMA_FATTN=ON, but I don't think it worked. CMake complained that it couldn't find the header, so I provided the include path, but it didn't check whether the compiler could actually use it.
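In case it helps: rocWMMA is header-only, so the build only needs to find rocwmma/rocwmma.hpp. A quick sanity check looks like this (the /opt/rocm path is an assumption; adjust to your install):

ls /opt/rocm/include/rocwmma/rocwmma.hpp

If the header is missing, it comes from the rocWMMA package of your ROCm install (the package name varies by distro) or can be cloned from https://github.com/ROCm/rocWMMA.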

Still surprising that these numbers are better than the 9070 XT.


EDIT: see my comment below.

I bought 2x MI50 32GB VRAM from Alibaba and for some reason I'm getting really poor performance on them... No idea why; even the Vega 64 beats them, and they're way slower than someone else's MI50 16GB.

$ llama-bench -sm none -mg 0 -ngl 99 -fa 0,1 -m ~/.cache/llama.cpp/TheBloke_Llama-2-7B-GGUF_llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
 Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
 Device 1: AMD Radeon RX Vega, gfx900:xnack- (0x900), VMM: no, Wave Size: 64
model size params backend ngl sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 pp512 193.40 ± 0.05
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 tg128 18.81 ± 0.01
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 pp512 108.82 ± 0.04
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 tg128 16.80 ± 0.01

build: 21c17b5 (3)

$ llama-bench -sm none -mg 1 -ngl 99 -fa 0,1 -m ~/.cache/llama.cpp/TheBloke_Llama-2-7B-GGUF_llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX Vega (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
model size params backend ngl main_gpu sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,RPC 99 1 none 0 pp512 155.42 ± 0.01
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,RPC 99 1 none 0 tg128 16.43 ± 0.02
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,RPC 99 1 none 1 pp512 133.19 ± 0.16
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,RPC 99 1 none 1 tg128 16.86 ± 0.00

build: 21c17b5 (3)


Found the issue: my RX Vega 64 was crashing/locking up, which is why I had disabled power management with amdgpu.dpm=0 amdgpu.runpm=0. It turns out the MI50 really doesn't like that and was running at very low wattage, hence the poor performance. After removing those parameters I get the same results as the other MI50s.

$ llama-bench -sm none -mg 0 -ngl 99 -fa 0,1 -m ~/.cache/llama.cpp/TheBloke_Llama-2-7B-GGUF_llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
 Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
 Device 1: AMD Radeon RX Vega, gfx900:xnack- (0x900), VMM: no, Wave Size: 64
model size params backend ngl sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 pp512 1040.64 ± 1.20
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 0 tg128 87.40 ± 0.48
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 pp512 446.60 ± 1.25
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 none 1 tg128 76.44 ± 0.04

build: 21c17b5 (3)

$ llama-bench -sm none -mg 1 -ngl 99 -fa 0,1 -m ~/.cache/llama.cpp/TheBloke_Llama-2-7B-GGUF_llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX Vega (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
model size params backend ngl main_gpu sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,RPC 99 1 none 0 pp512 829.87 ± 3.43
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,RPC 99 1 none 0 tg128 80.44 ± 0.14
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,RPC 99 1 none 1 pp512 724.31 ± 0.68
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,RPC 99 1 none 1 tg128 82.21 ± 0.23

build: 21c17b5 (3)

But the weird thing is that now my RX Vega 64, with power management no longer disabled, performs worse...


Can you please run the long one?

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048

It would be interesting to see whether the large memory helps with the degradation.


I have disassembled my system, so I won't be able to test it for a while. But someone else from the MI50 Discord ran it:

# HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m ~/.cache/huggingface/llama/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
 Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp512 | 1048.27 ± 3.41 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp1024 | 927.94 ± 0.70 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp2048 | 681.32 ± 0.43 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp4096 | 470.72 ± 0.42 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp8192 | 365.29 ± 0.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | pp16384 | 236.79 ± 0.10 |
/workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:85: ROCm error
/workspace/llama.cpp/build/bin/libggml-base.so(+0x16ccb)[0x7d951ac1dccb]
/workspace/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x21f)[0x7d951ac1e12f]
/workspace/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x152)[0x7d951ac1e302]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f90612)[0x7d951a1a9612]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f9f8c2)[0x7d951a1b88c2]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f98886)[0x7d951a1b1886]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f97fc2)[0x7d951a1b0fc2]
/workspace/llama.cpp/build/bin/libggml-hip.so(+0x1f9602f)[0x7d951a1af02f]
/workspace/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x3fd)[0x7d951ac376cd]
/workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0x99)[0x7d951ad4b899]
/workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x105)[0x7d951ad4bc45]
/workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x2d4)[0x7d951ad52314]
/workspace/llama.cpp/build/bin/libllama.so(llama_decode+0x10)[0x7d951ad53260]
./build/bin/llama-bench(+0x1a92a)[0x60e3f404692a]
./build/bin/llama-bench(+0x14cb4)[0x60e3f4040cb4]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7d951a6cad90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7d951a6cae40]
./build/bin/llama-bench(+0x19945)[0x60e3f4045945]
Aborted (core dumped)

It is interesting how Vulkan outperforms ROCm by a large margin when flash attention is on. FA is a big gain, so I guess I'll be running Vulkan over ROCm.

I was looking to buy 2 (maybe 3) MI50s because the hardware specs seem really good on paper for such a low price, but from all the benchmarks I've seen the software just isn't there yet to really squeeze out all the performance they have to offer; it seems like 20% is still left on the table. Hopefully the support only gets better, but they are old cards, so I doubt it.


RX 9070 XT (Powercolor Red Devil)

OS: Ubuntu 24.04
CPU: Ryzen 7 5700+
PCI-E: 16x 3.0
RAM: 32GB

ROCm 6.4.3
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 2908.66 ± 8.00
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 83.63 ± 0.36
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 1341.03 ± 0.28
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 85.43 ± 0.03

build: 043fb27 (6264)

Vulkan
WARNING: radv is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 9070 XT (RADV GFX1201) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 0 pp512 2055.15 ± 11.24
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 0 tg128 128.80 ± 0.22
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 1 pp512 1964.92 ± 2.13
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 1 tg128 130.12 ± 0.24

build: c9a24fb (6262)


Vulkan with FA enabled has impressive pp and tg gains. What version of Vulkan are you using? I see it is RADV, but is this the latest Vulkan driver? I haven't seen such a speedup on MI50 cards with Vulkan yet. Thanks!

IMbackK (Collaborator), Aug 25, 2025:

This is because WMMA flash attention is currently disabled on gfx12 until ROCm 7.0 (or unless force-enabled).


@Said-Akbar

$ vulkaninfo 
WARNING: radv is not a conformant Vulkan implementation, testing use only.
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.275
...

I installed the AMD drivers at the same time I installed ROCm 6.4.3


With rocWMMA FAttn, thermals at 50 °C
./bin/llama-bench -m /tmp/llama.gguf -ngl 99 -fa 0,1 -sm none -mg 0
Device 0: AMD Instinct MI210, gfx90a:sramecc+:xnack- (0x90a), VMM: no, Wave Size: 64

model size params backend ngl sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 0 pp512 2468.25 ± 9.11
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 0 tg128 113.71 ± 0.05
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 1 pp512 2540.90 ± 3.05
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 none 1 tg128 98.23 ± 0.04

build: d82f6aa (6321)


With rocWMMA FAttn, thermals at 51.9 °C
./bin/llama-bench -m /tmp/llama.gguf -ngl 99 -fa 0,1
Device 0: AMD Radeon Pro W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 2824.22 ± 47.55
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 117.65 ± 0.21
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 3062.83 ± 13.54
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 115.23 ± 0.19

build: 9c979f865 (6248)


RX 9070 XT (XFX Mercury)

OS: Ubuntu 24.04
CPU: Ryzen 9 5900X
PCI-E: 16x 4.0
RAM: 32GB DDR4

ROCm 7.0 RC1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 3538.39 ± 513.67
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 87.43 ± 0.30
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 4131.22 ± 6.50
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 87.57 ± 0.99

build: 2c8dac72 (6367)

I should mention this uses llama.cpp compiled from source to enable rocWMMA, following the updated instructions in manuel_instructions.md from this pull request.


RX 6900 XT
#SKU#: 11308-03-20G SAPPHIRE NITRO+ AMD Radeon RX 6900 XT SE Gaming Graphics Card with 16GB GDDR6, AMD RDNA 2
ROCm build with GGML_HIP_ROCWMMA_FATTN=ON

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6900 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 1889.84 ± 31.21
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 88.49 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 1948.31 ± 13.51
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 85.04 ± 0.02

build: a972fae (6428)

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 pp512 1901.20 ± 36.70
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 tg128 108.00 ± 0.03
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 pp512 1761.93 ± 4.75
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 tg128 106.15 ± 0.04

build: a972fae (6428)

Vulkan's 25% faster token generation makes it the only viable option; ROCm isn't there yet.


Tested MI300X with the latest build. Looks like there's a performance regression with FA compared to previous results?

root@6-4-0-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# hipconfig --version
6.4.43483-a187df25croot@6-4-0-gpu-mi300x1-192gb-devcloud-atl1
root@6-4-0-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
 Device 0: AMD Instinct MI300X VF, gfx942:sramecc+:xnack- (0x942), VMM: no, Wave Size: 64
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 10893.18 ± 227.48
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 241.98 ± 1.45
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 12019.43 ± 55.59
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 169.73 ± 0.08

build: ae355f6 (6432)

I also tested with ROCm 7.0.0-rc1:

root@ubuntu-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# hipconfig --version
7.0.51830-2e4b99775
root@ubuntu-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
 Device 0: AMD Instinct MI300X VF, gfx942:sramecc+:xnack- (0x942), VMM: no, Wave Size: 64
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 11055.31 ± 168.03
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 242.25 ± 0.29
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 11882.32 ± 68.27
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 168.40 ± 0.06

build: ae355f6 (6432)


AMD Radeon RX 9060 XT (16 GB)

Operating System: Microsoft Windows 11 24H2 (Build 26100.4061)
CPU: AMD Ryzen 5 5600


ROCm (HIP)

...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
 Device 0: AMD Radeon RX 9060 XT, gfx1200 (0x1200), VMM: no, Wave Size: 32
...
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 0 pp512 1419.67 ± 3.64
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 0 tg128 67.58 ± 0.24
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 1 pp512 1479.27 ± 0.71
llama 7B Q4_0 3.56 GiB 6.74 B ROCm,RPC 99 1 tg128 65.42 ± 0.19

build: a0e13dc (6470)


Vulkan

...
ggml_vulkan: 0 = AMD Radeon RX 9060 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | 
...
model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 0 pp512 2056.43 ± 2.23
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 0 tg128 71.95 ± 1.32
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 1 pp512 1840.64 ± 3.76
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 1 tg128 72.24 ± 0.13

build: a0e13dc (6470)


Ryzen AI Max+ 395 (128 GB memory)
linux kernel: 6.16.8

$ hipconfig --version
6.4.43484-123eb5128

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 351.36 ± 0.67
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 47.97 ± 0.33
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 366.08 ± 1.44
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 48.97 ± 0.15

build: 1d0125b (6552)

(My Vulkan results)


I took a build of ROCm 7 and tried again (with LD_PRELOAD, since Arch hasn't packaged ROCm 7 yet). I haven't been able to get -DGGML_HIP_ROCWMMA_FATTN=ON working yet.

$ bin/hipconfig --version
7.1.25381-434739eb3f

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 870.00 ± 4.69
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 49.83 ± 0.04
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 964.00 ± 2.41
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 49.29 ± 0.02

build: f505bd8 (6560)
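Roughly, pointing the loader at a side-installed ROCm build without system packages looks something like this (the prefix is an assumption; LD_PRELOAD of individual libraries, as mentioned above, is the more surgical variant):

# assumed extraction prefix for the unpacked ROCm 7 build; adjust to wherever it lives
export ROCM7=/opt/rocm-7.1.0
LD_LIBRARY_PATH="$ROCM7/lib:$LD_LIBRARY_PATH" ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1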


Version 7.0.51831-a3e329ad8, built with -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1201

llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -fa 1,0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 4065.43 ± 5.08
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 89.90 ± 0.22
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 4027.14 ± 4.65
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 89.17 ± 0.17

build: f505bd8 (6560)


I did a second build with the latest ROCm version, 7.0.1 from Sep 4th, HIP version 7.0.51831-a3e329ad8.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 4055.36 ± 11.83
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 93.26 ± 0.22
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 4029.45 ± 10.03
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 88.96 ± 0.10

build: ca71fb9 (6692)


AI Max+ 395

Using ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 911.36 ± 1.79
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 50.01 ± 0.07
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 1003.53 ± 2.91
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 49.87 ± 0.02

build: e60f241 (6755)

I'm running Ubuntu 24.04.3 and ROCm 7.0.2


AMD Radeon RX 7900 GRE (Sapphire Pulse 7900 GRE)

System details

  • CPU: AMD Ryzen 7 7700x
  • RAM: 64GB DDR5
  • OS: Arch Linux x86-64
  • ROCm: 6.4.3-1

ROCm results

$ HIP_VISIBLE_DEVICES=0 llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 GRE, gfx1100 (0x1100), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 1456.98 ± 12.39
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 96.07 ± 0.10
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 1598.79 ± 11.48
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 97.53 ± 0.06

build: 6fa3b55 (1027)

Vulkan results

$ llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RAPHAEL_MENDOCINO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon RX 7900 GRE (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 pp512 1604.81 ± 7.74
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 0 tg128 111.31 ± 1.00
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 pp512 1632.95 ± 6.15
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 1 tg128 112.59 ± 0.45

build: 5bed329 (1022)


RX 9070 (XFX Quicksilver 9070 OC)

System details
CPU: AMD Ryzen 5700X3D
OS: Windows 10
GPU: slight memory OC (2640 MHz)

lemonade llama.cpp-rocm with a custom build patch, against ROCm 7.10.0a (therock-dist-windows-gfx120X-all-7.10.0a20251022)

llama-bench.exe -m ../llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 9070, gfx1201 (0x1201), VMM: no, Wave Size: 32

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 3901.21 ± 44.65
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 115.11 ± 0.51
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 4074.98 ± 3.27
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 115.84 ± 0.71

build: dd62dcf (1)

Official HIP build (6827), HIP 6.4

llama-bench.exe -m ../llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 9070, gfx1201 (0x1201), VMM: no, Wave Size: 32
load_backend: loaded ROCm backend from k:\llm\llamacpp\ggml-hip.dll
load_backend: loaded RPC backend from k:\llm\llamacpp\ggml-rpc.dll
load_backend: loaded CPU backend from k:\llm\llamacpp\ggml-cpu-haswell.dll

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 pp512 2381.77 ± 3.68
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 0 tg128 114.48 ± 0.60
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 pp512 2452.68 ± 1.33
llama 7B Q4_0 3.56 GiB 6.74 B ROCm 99 1 tg128 115.32 ± 0.52

build: d0660f2 (6827)
