Performance of llama.cpp with Vulkan #10879

netrunnereve started this conversation in General

This is similar to the Apple Silicon benchmark thread, but for Vulkan! We'll test the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4 GB GPU. You can download it with the wget command below.
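As a sanity check that Q4_0 really fits in 4 GB: Q4_0 stores each block of 32 weights as 16 bytes of packed 4-bit values plus a 2-byte fp16 scale, i.e. 18 bytes per 32 weights (4.5 bits per weight). The estimate below lands slightly under the 3.56 GiB that llama-bench reports, since a few tensors are kept in other formats:

```python
# Back-of-the-envelope size of a Q4_0-quantized Llama 2 7B.
params = 6.74e9               # parameter count reported by llama-bench
bytes_per_weight = 18 / 32    # Q4_0: 18-byte block per 32 weights
size_gib = params * bytes_per_weight / 2**30
print(f"{size_gib:.2f} GiB")  # ≈ 3.53 GiB, comfortably under 4 GiB
```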

Instructions

Either run the commands below or download one of our Vulkan releases. If you have multiple GPUs please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER unless the model is too big to fit in VRAM.

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DGGML_VULKAN=on -DCMAKE_BUILD_TYPE=Release
make
./bin/llama-bench -m ../../llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1  # add any extra options here

Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.

If multiple entries are posted for the same setup I'll prioritize newer commits with substantial Vulkan updates; otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary with the driver, operating system, board manufacturer, and so on, even if the chip is the same. For integrated graphics, note that memory speed and the number of channels greatly affect inference speed!
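To see why channel count matters so much: token generation streams essentially the whole weight file once per token, so memory bandwidth sets a hard ceiling on tg t/s. A rough sketch, assuming a hypothetical dual-channel DDR5-5600 system (real iGPU results land somewhat below this bound):

```python
# Bandwidth ceiling on tg t/s for a memory-bound 3.56 GiB model.
model_gb = 3.56 * 2**30 / 1e9       # model size in GB, ≈ 3.82
bw_gb_s = 2 * 8 * 5.6e9 / 1e9       # 2 channels x 8 bytes x 5600 MT/s = 89.6 GB/s
print(f"upper bound ≈ {bw_gb_s / model_gb:.1f} t/s")  # ≈ 23.4 t/s
```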

Vulkan Scoreboard

Llama 2 7B, Q4_0, no FA
| Chip | pp512 t/s | tg128 t/s | Commit | Comments |
| --- | --- | --- | --- | --- |
| Nvidia RTX 5090 | 10381.64 ± 508.84 | 263.63 ± 0.91 | ca71fb9 | coopmat2 |
| Nvidia RTX 4090 | 9452.03 ± 187.70 | 187.97 ± 0.21 | 4ae88d0 | coopmat2 |
| AMD Radeon RX 7900 XTX | 2254.00 ± 9.48 | 162.16 ± 0.11 | 7a50cf3 | |
| Nvidia RTX 3090 | 4298.97 ± 10.59 | 160.13 ± 0.25 | 4ae88d0 | coopmat2 |
| Nvidia RTX 4080 Super | 7101.18 ± 269.79 | 147.13 ± 5.64 | 81086cd | coopmat2 |
| Nvidia RTX A5000 | 3641.55 ± 9.05 | 139.89 ± 0.69 | 4ae88d0 | coopmat2 |
| AMD Radeon RX 9070 XT | 3895.62 ± 3.17 | 138.84 ± 0.18 | 3ecb2f6 | |
| Nvidia RTX 5070 Ti | 6213.63 ± 27.72 | 135.63 ± 0.18 | d13d0f6 | coopmat2 |
| Nvidia RTX 4070 Ti Super | 6099.18 ± 154.30 | 129.45 ± 0.18 | 4ae88d0 | coopmat2 |
| AMD Radeon RX 7900 XT | 2941.58 ± 17.17 | 123.18 ± 0.40 | 71e74a3 | |
| Nvidia A100 (80GB) | 3103.32 ± 4.21 | 121.83 ± 0.54 | d394a9a | |
| AMD Radeon RX 9070 | 3164.10 ± 66.84 | 119.71 ± 3.40 | 21c17b5 | |
| AMD Radeon RX 7800 XT | 2053.11 ± 6.42 | 116.97 ± 0.34 | 0889589 | |
| Apple M3 Ultra Mac Studio | 1116.83 ± 0.55 | 115.54 ± 0.78 | 2d451c8 | MoltenVK |
| AMD Radeon RX 7900 GRE | 1420.20 ± 8.85 | 111.32 ± 0.05 | 2b3efea | |
| AMD Radeon RX 6900 XT | 1901.20 ± 36.70 | 108.00 ± 0.03 | a972fae | |
| AMD Radeon Pro VII | 912.47 ± 1.06 | 106.03 ± 0.89 | N/A | |
| Nvidia Titan V | 796.29 ± 5.84 | 105.06 ± 0.27 | e56abd2 | |
| AMD Radeon RX 6800 XT | 1752.92 ± 1.71 | 100.32 ± 0.97 | N/A | |
| Nvidia RTX 2080 Ti | 1888.24 ± 9.20 | 97.58 ± 6.60 | N/A | |
| Nvidia RTX 4070 | 3179.37 ± 46.16 | 92.29 ± 0.28 | 9a48399 | |
| AMD Radeon PRO W6800X Duo | 519.14 ± 0.13 | 87.56 ± 0.19 | 13b4548 | MoltenVK |
| AMD Radeon PRO W6800X | 510.80 ± 0.13 | 86.47 ± 0.46 | 13b4548 | MoltenVK |
| AMD Radeon RX 6700 XT | 1051.20 ± 0.98 | 83.88 ± 0.08 | 6d75883 | |
| AMD Radeon RX 6750 XT | 1040.58 ± 0.35 | 81.98 ± 0.03 | 228f34c | |
| AMD Radeon Pro V620 | 1595.32 ± 1.59 | 81.78 ± 0.06 | 03d4698 | |
| Nvidia RTX 5060 Ti | 3211.73 ± 24.44 | 81.48 ± 3.50 | 658987c | coopmat2 |
| Nvidia RTX 3070 | 2113.02 ± 7.38 | 78.71 ± 0.13 | 1b8fb81 | |
| AMD Radeon Instinct MI60 | 369.26 ± 2.48 | 78.16 ± 1.40 | 504af20 | |
| Apple M4 Max MacBook Pro | 724.77 ± 20.93 | 75.02 ± 0.14 | 1ece0cb6 | |
| Nvidia Tesla T10 | 1692.70 ± 2.05 | 75.01 ± 0.21 | 7f76692 | coopmat2 |
| AMD Radeon RX 5700 XT | 492.13 ± 0.35 | 71.58 ± 0.24 | 0889589 | |
| AMD Radeon Instinct MI50 | 387.37 ± 0.33 | 71.46 ± 0.10 | d5fe4e8 | |
| AMD Radeon RX 9060 XT | 2141.67 ± 6.87 | 70.54 ± 0.74 | ed52f36 | |
| Intel Arc B580 | 620.94 ± 15.33 | 70.14 ± 0.28 | 7f76692 | |
| AMD Radeon Pro W5700 | 504.20 ± 0.14 | 67.18 ± 0.08 | 4265a87 | |
| Nvidia RTX 3060 | 1730.33 ± 1.68 | 65.63 ± 2.47 | e288693 | coopmat2 |
| Nvidia GTX 1080 Ti | 540.69 ± 0.71 | 64.99 ± 0.08 | 360d653 | |
| Nvidia RTX 2070 Super | 1199.13 ± 7.70 | 64.64 ± 0.20 | b7552cf | |
| Nvidia RTX 3070 Mobile | 1689.40 ± 19.57 | 63.64 ± 0.39 | ceff6bb | coopmat2 |
| Nvidia Tesla P40 | 488.06 ± 0.27 | 59.36 ± 0.16 | N/A | |
| AMD Radeon RX 6650 XT | 735.64 ± 3.12 | 59.22 ± 0.11 | 228f34c | |
| AMD Radeon RX 7600 XT | 632.88 ± 0.70 | 58.44 ± 0.01 | 3b24d26 | |
| Nvidia GTX 1660 Ti Mobile | 511.67 ± 2.85 | 56.60 ± 0.07 | b43556e | |
| AMD Radeon Instinct MI25 | 439.42 ± 0.34 | 54.69 ± 0.03 | 2739a71 | |
| AMD Radeon RX 6600 XT | 574.65 ± 0.86 | 53.92 ± 0.11 | 091592d | |
| AMD Ryzen AI Max+ 395 | 1288.96 ± 6.49 | 53.59 ± 0.38 | 7f76692 | |
| Intel Arc A770 | 903.49 ± 3.16 | 50.90 ± 0.06 | 5bb4a3e | |
| AMD BC-250 | 331.58 ± 0.06 | 49.76 ± 0.06 | cf2270e | |
| Intel Arc B570 | 913.95 ± 0.90 | 49.64 ± 0.03 | 7f76692 | |
| Nvidia RTX 3060 Mobile | 1059.76 ± 3.54 | 49.03 ± 0.13 | dbb3a47 | |
| AMD Radeon RX 6800M | 861.99 ± 7.67 | 48.71 ± 0.71 | 8e6f8bc | |
| AMD Radeon RX 6600 | 617.85 ± 0.28 | 48.52 ± 0.06 | 4227c9b | |
| AMD Radeon RX 6600M | 605.59 ± 0.65 | 48.21 ± 0.07 | fe5b78c | |
| AMD Radeon RX Vega 64 | 356.08 ± 0.09 | 45.73 ± 0.18 | ec428b0 | |
| Nvidia RTX A2000 | 1245.19 ± 8.76 | 45.52 ± 0.54 | b1afcab | coopmat2 |
| AMD Radeon RX 7600M XT | 459.39 ± 2.34 | 45.28 ± 0.10 | b9ab0a4 | eGPU |
| Nvidia GTX 1070 Ti | 297.50 ± 0.54 | 42.86 ± 1.20 | 860a9e4 | eGPU |
| Nvidia RTX 4050 Mobile | 1154.28 ± 15.76 | 41.89 ± 0.10 | d79d8f3 | |
| Nvidia GTX 1070 | 317.07 ± 0.26 | 41.61 ± 0.16 | 360d653 | |
| Intel Arc A750 | 665.38 ± 5.49 | 41.43 ± 0.03 | 21c17b5 | |
| AMD Radeon RX 580 | 258.03 ± 0.71 | 39.32 ± 0.03 | de4c07f | |
| AMD Radeon RX 470 | 218.07 ± 0.56 | 38.63 ± 0.21 | e288693 | |
| AMD Radeon Pro W5500 | 315.39 ± 3.76 | 36.82 ± 0.38 | 860a9e4 | |
| AMD Radeon RX 480 | 248.66 ± 0.28 | 34.71 ± 0.14 | 3b15924 | |
| Nvidia GTX 980 | 186.24 ± 0.09 | 33.90 ± 0.51 | 860a9e4 | |
| AMD FirePro W8100 | 155.22 ± 0.17 | 29.52 ± 0.05 | 4536363 | |
| AMD Radeon RX 6500 XT | 255.25 ± 0.35 | 27.81 ± 0.10 | g9fdfcd | |
| Apple M3 MacBook Pro | 263.70 ± 0.02 | 26.39 ± 0.14 | b9ab0a4 | MoltenVK |
| AMD FirePro S10000 | 94.78 ± 0.02 | 25.32 ± 0.02 | 914a82d | Split across two GPUs |
| AMD Ryzen AI 9 300 Series | 479.07 ± 0.41 | 22.41 ± 0.18 | N/A | |
| AMD Ryzen 5 6000 Series | 240.89 ± 0.52 | 21.26 ± 0.08 | ee09828 | |
| Apple M2 Pro Mac Mini | 62.70 ± 0.03 | 20.95 ± 0.11 | 1fe0029 | Asahi Linux |
| Intel Core Ultra 7 258V | 418.08 ± 6.02 | 20.53 ± 0.53 | d1d8241 | |
| AMD Ryzen 7 8000 Series | 245.79 ± 2.97 | 20.10 ± 0.07 | 19d3c82 | |
| AMD Ryzen 7 7000 Series | 281.62 ± 1.56 | 19.91 ± 0.07 | ebce03e | |
| AMD Ryzen Z1 Extreme | 199.36 ± 7.02 | 18.77 ± 0.02 | 53ff6b9 | |
| AMD Ryzen 5 8000 Series | 183.35 ± 1.73 | 16.99 ± 0.02 | 9ecf3e6 | |
| AMD FirePro D700 | 69.95 ± 0.04 | 16.62 ± 0.01 | d3bd719 | MoltenVK, running in FP16 mode on FP32 only chip |
| AMD Radeon Pro WX 4100 | 78.79 ± 0.10 | 16.05 ± 0.07 | 860a9e4 | |
| Apple M1 Mac Mini | 31.31 ± 0.01 | 12.41 ± 0.05 | 1fe0029 | Asahi Linux |
| Apple M2 MacBook Air | 38.67 ± 0.03 | 11.07 ± 0.04 | 017cc5f | Asahi Linux |
| AMD Ryzen 7 5000 Series | 90.55 ± 0.08 | 10.98 ± 0.07 | d84635b | |
| AMD Ryzen 5 5000 Series | 83.02 ± 0.01 | 10.87 ± 0.01 | 5d195f1 | |
| Nvidia Tesla K80 | 89.46 ± 0.10 | 9.39 ± 0.06 | 5d46bab | Running on single GPU |
| MediaTek Dimensity 9400 | 38.36 ± 15.15 | 8.92 ± 0.06 | b9ab0a4 | GPU supports coopmat but pp512 is faster with it turned off |
| AMD Ryzen 5 3000 Series | 48.63 ± 0.10 | 8.49 ± 0.01 | 1fe0029 | |
| Intel Core Ultra 7 100 Series | 185.51 ± 0.22 | 8.21 ± 0.07 | 1d72c84 | |
| AMD Ryzen 5 4000 Series | 52.11 ± 0.11 | 7.35 ± 0.30 | N/A | |
| Intel Core i7 1100 Series | 42.02 ± 0.07 | 7.28 ± 0.24 | ff3fcab | |
| Intel Core i7 1000 Series | 25.58 ± 0.00 | 4.25 ± 0.18 | N/A | |
| Intel Core i7 8000 Series | 25.43 ± 0.17 | 3.35 ± 0.03 | c4df49a | |
| Intel Core i5 8000 Series | 25.28 ± 0.00 | 3.23 ± 0.00 | f26c874 | |
| Intel N150 | 28.84 ± 0.02 | 2.93 ± 0.00 | 4f63cd7 | |
Llama 2 7B, Q4_0, FA enabled
| Chip | pp512 t/s | tg128 t/s | Commit | Comments |
| --- | --- | --- | --- | --- |
| Nvidia RTX 5090 | 11796.38 ± 601.36 | 273.68 ± 0.52 | ca71fb9 | coopmat2 |
| Nvidia RTX 4090 | 10830.41 ± 36.25 | 190.10 ± 0.31 | 4ae88d0 | coopmat2 |
| AMD Radeon RX 7900 XTX | 2281.62 ± 25.37 | 165.08 ± 0.27 | 7a50cf3 | |
| Nvidia RTX 3090 | 4732.33 ± 4.80 | 162.28 ± 0.21 | 4ae88d0 | coopmat2 |
| Nvidia RTX 4080 Super | 8007.37 ± 46.03 | 150.20 ± 0.26 | 81086cd | coopmat2 |
| Nvidia RTX A5000 | 4071.22 ± 13.13 | 140.43 ± 0.22 | 4ae88d0 | coopmat2 |
| AMD Radeon RX 9070 XT | 3525.21 ± 29.21 | 138.45 ± 0.27 | 3ecb2f6 | |
| Nvidia RTX 4070 Ti Super | 6801.18 ± 40.12 | 135.81 ± 4.29 | 4ae88d0 | coopmat2 |
| Nvidia RTX 5070 Ti | 6614.86 ± 8.32 | 133.94 ± 0.02 | d13d0f6 | coopmat2 |
| AMD Radeon RX 7900 XT | 2701.13 ± 8.75 | 120.62 ± 0.36 | 71e74a3 | |
| Nvidia A100 (80GB) | 3164.55 ± 5.00 | 120.53 ± 0.41 | d394a9a | |
| AMD Radeon RX 9070 | 2859.98 ± 31.53 | 119.51 ± 0.13 | 21c17b5 | |
| AMD Radeon RX 7800 XT | 1967.78 ± 8.20 | 115.71 ± 0.25 | 0889589 | |
| Nvidia Titan V | 792.74 ± 4.30 | 109.21 ± 0.72 | e56abd2 | |
| AMD Radeon Pro VII | 783.94 ± 0.77 | 108.45 ± 0.48 | N/A | |
| AMD Radeon RX 6900 XT | 1761.93 ± 4.75 | 106.15 ± 0.04 | a972fae | |
| AMD Radeon RX 7900 GRE | 1465.45 ± 13.53 | 101.73 ± 0.02 | 2b3efea | |
| Nvidia RTX 2080 Ti | 1936.25 ± 32.08 | 100.99 ± 0.24 | N/A | |
| AMD Radeon RX 6800 XT | 1704.79 ± 0.71 | 100.50 ± 0.06 | N/A | |
| Nvidia RTX 4070 | 4293.57 ± 27.70 | 91.49 ± 0.89 | 9a48399 | coopmat2 |
| Nvidia RTX 5060 Ti | 3492.22 ± 15.73 | 83.26 ± 2.03 | 658987c | coopmat2 |
| AMD Radeon RX 6750 XT | 997.05 ± 0.45 | 82.29 ± 0.06 | 228f34c | |
| AMD Radeon RX 6700 XT | 1010.90 ± 12.89 | 81.86 ± 0.19 | 6d75883 | |
| AMD Radeon RX 6800 | 662.87 ± 0.74 | 80.17 ± 0.12 | 97340b4 | |
| AMD Radeon Pro V620 | 1556.31 ± 2.82 | 79.24 ± 0.09 | 03d4698 | |
| Nvidia Tesla T10 | 1840.14 ± 1.22 | 76.05 ± 0.13 | 7f76692 | coopmat2 |
| Intel Arc B580 | 419.49 ± 3.37 | 72.00 ± 0.24 | 7f76692 | |
| Apple M4 Max MacBook Pro | 557.46 ± 26.87 | 71.79 ± 4.16 | 1ece0cb6 | |
| AMD Radeon RX 5700 XT | 474.89 ± 0.23 | 71.66 ± 0.05 | 0889589 | |
| AMD Radeon RX 9060 XT | 1915.41 ± 7.90 | 70.52 ± 0.16 | ed52f36 | |
| Nvidia RTX 3060 | 1715.17 ± 21.13 | 66.12 ± 1.74 | e288693 | coopmat2 |
| Nvidia GTX 1080 Ti | 529.96 ± 0.38 | 64.63 ± 0.10 | 360d653 | |
| Nvidia RTX 3070 Mobile | 1832.07 ± 57.14 | 62.92 ± 0.37 | ceff6bb | coopmat2 |
| Nvidia Tesla P40 | 484.37 ± 0.27 | 59.22 ± 0.15 | N/A | |
| AMD Radeon RX 6650 XT | 730.64 ± 0.25 | 59.18 ± 0.44 | 228f34c | |
| AMD Radeon RX 7600 XT | 586.16 ± 2.43 | 59.02 ± 0.03 | 3b24d26 | |
| Nvidia GTX 1660 Ti Mobile | 514.34 ± 0.88 | 57.30 ± 0.42 | b43556e | |
| AMD Ryzen AI Max+ 395 | 1357.07 ± 10.94 | 53.00 ± 0.13 | 7f76692 | |
| Intel Arc B570 | 288.51 ± 0.09 | 50.49 ± 0.05 | 7f76692 | |
| AMD Radeon RX 6800M | 784.16 ± 2.76 | 49.06 ± 0.34 | 8e6f8bc | |
| AMD Radeon RX 6600 | 622.72 ± 0.20 | 48.31 ± 0.04 | 4227c9b | |
| Intel Arc A770 | 363.82 ± 0.95 | 48.30 ± 0.09 | 5bb4a3e | |
| AMD Radeon RX Vega 64 | 320.12 ± 0.22 | 47.06 ± 0.01 | ec428b0 | |
| Nvidia RTX A2000 | 1361.85 ± 3.26 | 45.69 ± 0.20 | b1afcab | coopmat2 |
| Nvidia GTX 1070 Ti | 292.85 ± 0.23 | 43.42 ± 0.34 | 860a9e4 | eGPU |
| Intel Arc A750 | 312.51 ± 1.74 | 42.15 ± 0.08 | 21c17b5 | |
| Nvidia GTX 1070 | 321.81 ± 0.16 | 40.82 ± 0.86 | 360d653 | |
| AMD Radeon RX 480 | 194.52 ± 0.61 | 37.23 ± 0.09 | 0bcb40b | |
| AMD Radeon RX 470 | 197.94 ± 2.94 | 35.14 ± 1.93 | e288693 | |
| Nvidia GTX 980 | 180.97 ± 0.74 | 34.16 ± 0.10 | 860a9e4 | |
| AMD FirePro W8100 | 140.52 ± 0.34 | 29.28 ± 0.14 | 4536363 | |
| AMD Ryzen AI 9 300 Series | 532.59 ± 3.55 | 22.31 ± 0.06 | N/A | |
| AMD Ryzen 5 6000 Series | 277.91 ± 0.37 | 21.15 ± 0.09 | ee09828 | |
| Apple M2 Pro Mac Mini | 58.86 ± 0.02 | 20.97 ± 0.03 | 1fe0029 | Asahi Linux |
| AMD Ryzen 7 7000 Series | 312.85 ± 2.51 | 20.09 ± 0.35 | 835b2b9 | |
| AMD Ryzen 5 8000 Series | 188.84 ± 0.73 | 16.57 ± 0.26 | 9ecf3e6 | |
| AMD Radeon Pro WX 4100 | 75.59 ± 0.19 | 16.56 ± 0.04 | 860a9e4 | |
| Apple M1 Mac Mini | 28.65 ± 0.00 | 12.38 ± 0.03 | 1fe0029 | Asahi Linux |
| AMD Ryzen 5 5000 Series | 79.06 ± 0.01 | 10.75 ± 0.00 | 5d195f1 | |
| AMD Ryzen 7 5000 Series | 76.53 ± 0.12 | 10.09 ± 0.01 | 860a9e4 | |
| Nvidia Tesla K80 | 88.26 ± 0.19 | 9.49 ± 0.01 | 5d46bab | Running on single GPU |
| AMD Ryzen 5 3000 Series | 47.41 ± 0.14 | 8.47 ± 0.01 | 1fe0029 | |
| Intel Core Ultra 7 100 Series | 77.66 ± 2.75 | 7.75 ± 0.05 | 2e89f76 | |
| AMD Ryzen 5 4000 Series | 41.54 ± 0.05 | 7.41 ± 0.06 | N/A | |
| Intel Core i7 8000 Series | 25.55 ± 0.04 | 3.35 ± 0.02 | c4df49a | |
| Intel N150 | 25.59 ± 0.00 | 2.91 ± 0.00 | 4f63cd7 | |
| Intel Core i7 1100 Series | 84.19 ± 3.31 | 2.87 ± 0.01 | 860a9e4 | Slow memory |

Replies: 189 comments 288 replies

netrunnereve
Dec 18, 2024
Collaborator Author

AMD FirePro W8100

ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
build: 4da69d1a (4351)
model size params backend ngl threads sm test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 none pp512 137.10 ± 0.44
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 none tg128 28.51 ± 0.12
2 replies

netrunnereve May 1, 2025
Collaborator Author

With the latest updates:

ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: d7a14c42 (5252)
model size params backend ngl threads sm test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 none pp512 154.96 ± 0.60
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 none tg128 28.55 ± 0.17

netrunnereve Aug 22, 2025
Collaborator Author

With FA:

ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 45363632c (6249)
model size params backend ngl threads sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 8 none 0 pp512 155.22 ± 0.17
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 8 none 0 tg128 29.52 ± 0.05
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 8 none 1 pp512 140.52 ± 0.34
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 8 none 1 tg128 29.28 ± 0.14

netrunnereve
Dec 18, 2024
Collaborator Author

AMD RX 470

ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
build: 4da69d1a (4351)
model size params backend ngl threads main_gpu sm test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 1 none pp512 161.47 ± 0.43
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 1 none tg128 33.45 ± 0.04
3 replies

netrunnereve May 1, 2025
Collaborator Author

With the latest updates:

ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: d7a14c42 (5252)
model size params backend ngl threads main_gpu sm test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 1 none pp512 185.48 ± 1.17
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 1 none tg128 33.94 ± 0.06

netrunnereve Aug 22, 2025
Collaborator Author

With FA:

ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 45363632c (6249)
model size params backend ngl threads main_gpu sm fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 8 1 none 0 pp512 185.73 ± 0.69
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 8 1 none 0 tg128 34.89 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 8 1 none 1 pp512 179.01 ± 0.65
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 8 1 none 1 tg128 34.65 ± 0.17

I got the mining edition; the 8 GB card ran slightly better.

ggml_vulkan: 0 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

model size params backend ngl fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 0 pp512 218.07 ± 0.56
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 0 tg128 38.63 ± 0.21
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 1 pp512 197.94 ± 2.94
llama 7B Q4_0 3.56 GiB 6.74 B RPC,Vulkan 99 1 tg128 35.14 ± 1.93

build: e288693 (6242)


ubuntu 24.04, vulkan and cuda installed from official APT packages.

ggml_vulkan: 0 = NVIDIA GeForce RTX 3080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: KHR_coopmat
model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 pp512 1706.07 ± 139.33
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 tg128 62.16 ± 1.98

build: 4da69d1 (4351)

vs CUDA on the same build/setup

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 pp512 4499.47 ± 60.66
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 tg128 131.01 ± 0.43

build: 4da69d1 (4351)


Macbook Air M2 on Asahi Linux

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Apple M2 (G14G B0) (Honeykrisp) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 38.67 ± 0.03
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 11.07 ± 0.04

build: 017cc5f

3 replies

For the record, I think this is slow on the Honeykrisp side rather than in llama.cpp.


Can you share how you got vulkan to build on Asahi? I can't seem to get cmake to notice it.

cmake -B build -DGGML_CPU_AARCH64=OFF -DGGML_VULKAN=1
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- Including CPU backend
-- ARM detected
-- ARM -mcpu not found, -mcpu=native will be used
-- ARM feature DOTPROD enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+i8mm+nosve+nosme 
CMake Error at /usr/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
 Could NOT find Vulkan (missing: Vulkan_LIBRARY) (found version "1.3.296")
Call Stack (most recent call first):
 /usr/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:603 (_FPHSA_FAILURE_MESSAGE)
 /usr/share/cmake-3.30/Modules/FindVulkan.cmake:595 (find_package_handle_standard_args)
 ggml/src/ggml-vulkan/CMakeLists.txt:4 (find_package)
-- Configuring incomplete, errors occurred!

Spoke too soon, got it working! cmake -B build -DGGML_CPU_AARCH64=OFF -DGGML_VULKAN=1 -DVulkan_LIBRARY=/usr/lib64/libvulkan.so.1


Gentoo Linux on ROG Ally (2023) Ryzen Z1 Extreme

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 199.36 ± 7.02
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 18.77 ± 0.02

build: 53ff6b9


ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 3 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 1545.39 ± 6.58
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 88.12 ± 1.06

build: 53ff6b9

4 replies

0cc4m Jan 8, 2025
Collaborator

Cool setup! Could you also post the result of 1, 2 and 3 7900 XTX GPUs? You can use only the first GPU with export GGML_VK_VISIBLE_DEVICES=0, the first two with export GGML_VK_VISIBLE_DEVICES=0,1 and so on.


env GGML_VK_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 2022.59 ± 10.08
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 136.24 ± 0.30

env GGML_VK_VISIBLE_DEVICES=1 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 2039.24 ± 18.08
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 140.68 ± 2.09

env GGML_VK_VISIBLE_DEVICES=2 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 2062.17 ± 5.36
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 143.99 ± 0.23

env GGML_VK_VISIBLE_DEVICES=3 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 1997.04 ± 5.78
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 136.98 ± 1.73

env GGML_VK_VISIBLE_DEVICES=0,1 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 1668.19 ± 12.78
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 100.62 ± 0.66

env GGML_VK_VISIBLE_DEVICES=0,1,2 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 1566.38 ± 8.01
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 97.96 ± 1.13

env GGML_VK_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 3 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 1484.04 ± 6.01
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 91.48 ± 0.63

netrunnereve Jan 8, 2025
Collaborator Author

For this multi GPU case getting Vulkan to support #6017 pipeline parallelism might help improve the prompt processing speed.


@netrunnereve I updated the commit id in all my results.


build: 0d52a69 (4439)

NVIDIA GeForce RTX 3090 (NVIDIA)

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 3301.47 ± 33.76
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 123.72 ± 0.14

AMD Radeon RX 6800 XT (RADV NAVI21) (radv)

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 863.03 ± 0.70
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 91.59 ± 0.40

AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 312.02 ± 0.97
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 70.17 ± 0.25

Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 95.52 ± 0.12
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 44.49 ± 0.03

@netrunnereve Some of the tg results here are a little low, I think they might be debug builds. The cmake step (at least on Linux) might require cmake .. -DGGML_VULKAN=on -DCMAKE_BUILD_TYPE=Release

2 replies

netrunnereve Jan 8, 2025
Collaborator Author

I've added -DCMAKE_BUILD_TYPE=Release to the post, but honestly I've always built without this flag for both Vulkan and CPU backends and never noticed a difference in performance. Having Release set might strip the debug symbols but it shouldn't affect the compiler optimizations.

My release numbers for the RX 470 are basically identical to the ones I posted earlier without the flag.

model size params backend ngl threads main_gpu sm test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 1 none pp512 160.08 ± 0.38
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 1 none tg128 33.41 ± 0.15

0cc4m Jan 8, 2025
Collaborator

Maybe not in your case, but some other results are suspiciously low in tg (for example the RTX 3080)


Build: 8d59d91 (4450)
ggml_vulkan: 0 = Intel(R) Arc(tm) A750 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: none

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 pp512 88.86 ± 0.14
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 tg128 27.57 ± 0.03

Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
Compared to SYCL:

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B SYCL 99 pp512 1616.11 ± 5.28
llama 7B Q4_0 3.56 GiB 6.74 B SYCL 99 tg128 36.64 ± 0.05

edit: retested both with the default batch size.

8 replies

0cc4m Jan 10, 2025
Collaborator

> They do have vtune but it needs a third party kernel module to run which I don't like tbh.
>
> Also, I don't know whether it supports Vulkan apps or not. But it does seem to support opencl.

I put my A770 into a Windows PC and gave Intel GPA and vtune a shot: GPA just crashes most of the time, I couldn't get it to trace anything useful. vtune works, but does not support Vulkan. It just shows some high-level metrics in that case, not really useful sadly.


> Your Vulkan tg result is lower than expected, can you retry with the cmake build type set like in the updated instructions? It might be due to a debug build.

I did build it with cmake with build type Release.


0cc4m Jan 11, 2025
Collaborator

In that case it's something else, cause it should be performing similarly to my A770. I suspect the mesa version, there was something in newer mesa versions that slowed down tg on Intel.


A750 has 448 CUs, A770 has 512 CUs I think. Personally, I am not worried about tg. I am worried about pp here. The gemm batch quickly saturates my GPU.


Here's something exotic: An AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5.

build: 914a82d (4452)

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD FirePro W8000 (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD FirePro W8000 (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
model size params backend ngl threads test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 8 pp512 94.78 ± 0.02
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 99 8 tg128 25.32 ± 0.02
1 reply

netrunnereve Jan 9, 2025
Collaborator Author

Very interesting, and looks like it's pretty close to the W8100 in tg despite being a dual GPU card. Your backend scales pretty well with layer splitting which is why I find it worthwhile to run my RX470 and W8100 together (I end up getting results that are close to the average of both cards).

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
model size params backend ngl threads main_gpu test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 1 pp512 147.84 ± 0.38
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 1 tg128 30.77 ± 0.00
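A quick check of that "average of both cards" observation: with the layers split roughly evenly, each token passes through both GPUs in series, so the combined tg rate should land near the harmonic mean of the solo results. A sketch that assumes an even split and ignores splitting overhead, using the solo tg128 numbers posted above for these two cards:

```python
# Harmonic-mean estimate of tg t/s for two GPUs with layers split evenly.
tg_w8100, tg_rx470 = 28.51, 33.45      # solo tg128 t/s from the build 4351 posts
combined = 2 / (1 / tg_w8100 + 1 / tg_rx470)
print(f"{combined:.2f} t/s")           # ≈ 30.78, close to the measured 30.77
```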

Latest Arch Linux with Vulkan instance version 1.4.303 on an i7-1185G7 laptop. The config is not completely stock: I dealt with the thermals ages ago to boost performance, so it doesn't throttle.

For the sake of consistency I run every benchmark from a script and also build every target from scratch (for some reason cmake doesn't want to clean everything):

kill -STOP -1          # pause all other processes owned by this user
timeout 240s $COMMAND  # run the benchmark with a 4-minute cap
kill -CONT -1          # resume them

Vulkan only:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 42.02 ± 0.07
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 7.28 ± 0.24

build: ff3fcab (4459)

Vulkan and OpenBLAS w/ default 4 threads:

model size params backend threads test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 4 pp512 42.05 ± 0.04
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 4 tg128 7.35 ± 0.26

This bit seems to underutilise both GPU and CPU in real conditions based on top activities.

Vulkan and OpenBLAS w/ default 8 threads:

model size params backend threads test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 8 pp512 41.89 ± 0.06
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 8 tg128 7.22 ± 0.20
4 replies

0cc4m Jan 10, 2025
Collaborator

Unless you reduce the number of GPU layers, threads and openblas/non-openblas is not gonna make any difference. Try it with ngl 0, then only prompt processing is accelerated using Vulkan, the rest runs on CPU. This is often a good setting for integrated GPUs.


That's something I didn't think about, with -ngl 0 it goes like this:

model size params backend threads test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 4 pp512 30.51 ± 0.25
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 4 tg128 9.87 ± 0.05

build: ba8a1f9 (4460)

model size params backend threads test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 8 pp512 32.11 ± 0.45
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 8 tg128 9.49 ± 0.18

It seems latest patches has improved the results a bit:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model size params backend ngl threads test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 pp512 50.86 ± 0.03
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 1 tg128 8.30 ± 0.05
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 2 pp512 50.90 ± 0.01
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 2 tg128 8.11 ± 0.25
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 4 pp512 50.91 ± 0.02
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 4 tg128 7.99 ± 0.25
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 pp512 50.89 ± 0.04
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 8 tg128 7.92 ± 0.24

A few months later and I get:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 106.19 ± 0.40
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 5.89 ± 0.20
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 0 pp512 73.26 ± 1.55
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 0 tg128 5.24 ± 0.02

build: f3a4b16 (5568)

I run it on Linux (Arch with the llama.cpp-vulkan-git package compiled by GCC 15). From my tests, only the Vulkan backend (1.4.313) provides visible gains on the i7-1185G7 processor compared with other methods (I tried different combinations of GCC and Intel DPC++ compilers and the BLIS, OpenBLAS, oneMKL, SYCL and Vulkan backends).

I'm curious why I cannot go over 6 t/s. Is this an issue with the newer llama.cpp version or with my OS configuration?
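For context, tg on a UMA iGPU like this is typically memory-bandwidth bound: every generated token has to stream the full set of model weights from system RAM once. A back-of-the-envelope sketch (the ~45 GB/s usable bandwidth figure is an assumption for dual-channel LPDDR4x shared with the CPU, not a measurement):

```python
# Rough ceiling on token generation (tg) speed for a memory-bound run:
# each token requires reading all model weights from memory once.
def tg_ceiling(bandwidth_gb_s: float, model_size_gib: float) -> float:
    """Upper bound on tokens/s given usable memory bandwidth in GB/s."""
    model_bytes = model_size_gib * 1024 ** 3
    return bandwidth_gb_s * 1e9 / model_bytes

# Assumed ~45 GB/s usable bandwidth and the 3.56 GiB Q4_0 model
# used throughout this thread.
print(f"{tg_ceiling(45, 3.56):.1f} t/s ceiling")  # prints "11.8 t/s ceiling"
```

Under that assumption, ~6 t/s means roughly half the nominal bandwidth is being realized, which is plausible when the GPU shares memory with the CPU, so the bottleneck is more likely the platform than the llama.cpp version.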


Intel ARC A770 on Windows:

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 314.24 ± 1.04 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 45.22 ± 0.25 |

build: ba8a1f9 (4460)


Single GPU Vulkan

Radeon Instinct MI25

ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | pp512 | 439.42 ± 0.34 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | tg128 | 54.69 ± 0.03 |

build: 2739a71 (4461)

Radeon PRO VII

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | pp512 | 329.86 ± 0.80 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | tg128 | 75.22 ± 0.05 |

build: 2739a71 (4461)

Multi GPU Vulkan

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | pp512 | 324.55 ± 0.55 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | tg128 | 38.39 ± 0.09 |

build: 2739a71 (4461)

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 3 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 4 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | Vulkan | 100 | pp512 | 32.29 ± 0.04 |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | Vulkan | 100 | tg128 | 4.75 ± 0.00 |

build: 2739a71 (4461)

Single GPU ROCm

Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 100 | pp512 | 409.83 ± 0.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 100 | tg128 | 63.94 ± 0.06 |

build: 2739a71 (4461)

Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 100 | pp512 | 1064.99 ± 1.18 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 100 | tg128 | 87.45 ± 0.04 |

build: 2739a71 (4461)

Multi GPU ROCm

Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 100 | pp512 | 1061.87 ± 0.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 100 | tg128 | 81.49 ± 0.41 |

build: 2739a71 (4461)

Layer split
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 3: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
Device 4: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | ROCm | 100 | pp512 | 16.36 ± 0.02 |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | ROCm | 100 | tg128 | 6.43 ± 0.01 |

build: 2739a71 (4461)

Row split
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 3: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
Device 4: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

| model | size | params | backend | ngl | sm | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: | ---: |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | ROCm | 100 | row | pp512 | 30.86 ± 0.03 |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | ROCm | 100 | row | tg128 | 12.52 ± 0.21 |

build: 2739a71 (4461)

Single GPU speed is decent, but multi-GPU Vulkan trails ROCm by a wide margin, especially with large models, due to the lack of row split support.
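The layer vs. row split gap is easy to model if you assume tg is memory-bound: with layer split the GPUs take turns per token, so the whole model is still read serially, while row split slices each weight matrix so the GPUs stream their shards in parallel. A toy sketch (all bandwidth and efficiency numbers here are illustrative assumptions, not measurements):

```python
# Toy memory-bound model of tg under the two multi-GPU split modes.
def tg_layer_split(model_gb: float, bw_gb_s: float, n_gpus: int) -> float:
    # Layer split: GPUs process their layers one after another per token,
    # so adding GPUs adds VRAM capacity but not tg speed.
    return bw_gb_s / model_gb

def tg_row_split(model_gb: float, bw_gb_s: float, n_gpus: int,
                 efficiency: float = 0.8) -> float:
    # Row split: each GPU reads only its slice of every layer, in
    # parallel; 'efficiency' is an assumed sync/interconnect penalty.
    return (n_gpus * bw_gb_s / model_gb) * efficiency
```

For the 70B model on five GPUs this toy model predicts a multi-fold tg advantage for row split; the measured ROCm gap above (12.52 vs 6.43 t/s) is smaller but points the same way.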


What is the power profile for this MI25? Mine is 110 W, but it's running slower than yours on today's git.


Mine defaults to 220 W.
You can increase the power limit with rocm-smi --setpoweroverdrive 220


AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
build: c05e8c9 (4462)

Vulkan:

ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | pp512 | 439.42 ± 0.28 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | tg128 | 70.13 ± 0.05 |

HIP:

 Device 0: AMD Radeon RX 5700 XT, compute capability 10.1, VMM: no
| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 100 | pp512 | 354.17 ± 0.18 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 100 | tg128 | 67.55 ± 0.04 |

I also think it would be interesting to add the flash attention results to the scoreboard (even if its support still isn't as mature as CUDA's).

Vulkan FA:

ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 214.48 ± 2.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 23.21 ± 0.08 |

HIP FA:

 Device 0: AMD Radeon RX 5700 XT, compute capability 10.1, VMM: no
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 100 | 1 | pp512 | 314.17 ± 0.29 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 100 | 1 | tg128 | 62.02 ± 0.05 |

0cc4m Jan 12, 2025
Collaborator

There is no Vulkan flash attention support (except with coopmat2 on very new nvidia drivers). What you're measuring here is a CPU fallback.


I see. I suspected the CPU fallback, but didn't know there was no flash attention support at all.


I tried, but there was no output after an hour (ok, maybe 40 minutes)...

Anyway, I ran llama-cli for a sample eval...

build: 4419 (46e3556e)

./llama-cli -m ~/storage/llama-2-7b.Q4_0.gguf -p "can u" -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G57 (Mali-G57) | uma: 1 | fp16: 1 | warp size: 16 | matrix cores: none
build: 4419 (46e3556e) with clang version 19.1.6 for aarch64-unknown-linux-android24
llama_perf_sampler_print: sampling time = 3.31 ms / 24 runs (0.14 ms per token, 7242.00 tokens per second)
llama_perf_context_print: load time = 28544.85 ms
llama_perf_context_print: prompt eval time = 3788.63 ms / 3 tokens (1262.88 ms per token, 0.79 tokens per second)
llama_perf_context_print: eval time = 23248.44 ms / 20 runs (1162.42 ms per token, 0.86 tokens per second)
llama_perf_context_print: total time = 27591.65 ms / 23 tokens

Meanwhile OpenBLAS

llama_perf_sampler_print: sampling time = 5.00 ms / 43 runs (0.12 ms per token, 8608.61 tokens per second)
llama_perf_context_print: load time = 10871.74 ms
llama_perf_context_print: prompt eval time = 1228.38 ms / 3 tokens (409.46 ms per token, 2.44 tokens per second)
llama_perf_context_print: eval time = 17010.39 ms / 39 runs (436.16 ms per token, 2.29 tokens per second)
llama_perf_context_print: total time = 18639.62 ms / 42 tokens

netrunnereve Jan 12, 2025
Collaborator Author

Even at below 1 t/s, llama-bench shouldn't take an hour to run. Vulkan support on Android just isn't there at the moment.


Truth is, at 0.79 tokens per second (3788.63 ms / 3 tokens) it does finish... it's just slower...


@ ~/git/llama.cpp/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
lsfg-vk: Configuration entry disappeared, disabling.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 1997.19 ± 7.76 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | tg128 | 163.00 ± 0.76 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 2044.99 ± 14.06 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 166.25 ± 0.19 |
build: 3c3635d2 (6400)
@ ~/git/llama.cpp/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
lsfg-vk: Configuration entry disappeared, disabling.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 2254.00 ± 9.48 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | tg128 | 162.16 ± 0.11 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 2281.62 ± 25.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 165.08 ± 0.27 |
build: 7a50cf388 (6779)

Newer builds seem to improve tg performance more than pp performance, or I'm confused about what I'm doing wrong to get lower pp performance on my XTX.

This mesa radv merge request massively improves prompt processing

I'll have to do an upgrade later.


Titan V (12 GB / HBM2 / 3072 bit)

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA TITAN V (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 796.29 ± 5.84 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 105.06 ± 0.27 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 792.74 ± 4.30 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 109.21 ± 0.72 |

build: e56abd2 (6794)

I wonder if there's a way to actually utilize the matrix cores.


NVIDIA GeForce RTX 4080 SUPER

OS: NixOS / Linux 6.16.11-xanmod1
Mesa: 25.2.4 (ab462ae6)

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 SUPER (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 7101.18 ± 269.79 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | tg128 | 147.13 ± 5.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 8007.37 ± 46.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 150.20 ± 0.26 |

build: 81086cd (6729)


Radeon 660M (Ryzen 5 6600H), dual channel DDR5 4800MT/s
Debian 13, kernel 6.12.48, mesa 25.0.7

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV REMBRANDT) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 138.61 ± 0.15 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 16.64 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 148.05 ± 0.05 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 16.56 ± 0.00 |

build: ee09828 (6795)


Radeon 680M (Ryzen 5 6800H), LPDDR5 6400 MT/s
Ubuntu 25.04, 6.17.3-061703-generic, amd mesa 25.0.7-0ubuntu0.25.04.2

./llama-bench -m ~/llm/models/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
load_backend: loaded RPC backend from /home/kii/llama.cpp/b6795/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV REMBRANDT) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /home/kii/llama.cpp/b6795/libggml-vulkan.so
load_backend: loaded CPU backend from /home/kii/llama.cpp/b6795/libggml-cpu-haswell.so
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 240.89 ± 0.52 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | tg128 | 21.26 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 277.91 ± 0.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 21.15 ± 0.09 |

build: ee09828 (6795)


RTX 3070 Laptop GPU (8 GB / GDDR6 / 256 bit)

Driver Version: 580.76.05

./build-vk/bin/llama-bench -m llama-2-7b.Q4_0.gguf -fa ,1 -sm none -mg 0 
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3070 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Intel(R) Graphics (ADL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
| model | size | params | backend | ngl | sm | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --- | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | none | 0 | pp512 | 1689.40 ± 19.57 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | none | 0 | tg128 | 63.64 ± 0.39 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | none | 1 | pp512 | 1832.07 ± 57.14 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | none | 1 | tg128 | 62.92 ± 0.37 |

build: ceff6bb (6783)


RX Vega 56

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX Vega (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 427.01 ± 2.94 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 55.84 ± 0.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 379.04 ± 0.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 56.66 ± 0.06 |

build: 66b0dbc (6791)


netrunnereve Oct 25, 2025
Collaborator Author

I know that's an old run but interestingly this is beating the Vega 64 by quite a bit.


Strix Halo 395+ Debian GNU/Linux @ 6.16.12+deb14+1-amd64

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 972.06 ± 3.92 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | tg128 | 52.28 ± 0.14 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 1054.53 ± 8.67 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 51.62 ± 0.04 |

build: 0bf47a1db (6829)


AMD AI HX 370 + 890m iGPU Running Windows 11 Pro

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) 890M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 479.07 ± 0.41 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | tg128 | 22.41 ± 0.18 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 532.59 ± 3.55 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 22.31 ± 0.06 |

Another Strix Halo 395+ (128GB). Also with the optimizations alluded to in the Strix Halo benchmarking guide.

Distributor ID: Ubuntu
Description: Ubuntu 24.04.3 LTS
Release: 24.04
Codename: noble
Linux framework 6.17.1-061701-generic #202510060945 SMP PREEMPT_DYNAMIC Mon Oct 6 12:03:14 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

radv, mesa 26.0.0


ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | pp512 | 1120.83 ± 2.57 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | tg128 | 53.21 ± 0.38 |
build: f8f071fad (6830)
framework@framework:~/workshop/llm/llama.cpp$ AMD_VULKAN_ICD=RADV /home/framework/workshop/llm/llama.cpp/build/bin/llama-bench -m /home/framework/workshop/llm/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | 1 | pp512 | 1243.31 ± 2.12 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | 1 | tg128 | 52.69 ± 0.10 |
build: f8f071fad (6830)

AMDVLK

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | pp512 | 1264.60 ± 2.36 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | tg128 | 52.50 ± 0.13 |
build: f8f071fad (6830)
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | 1 | pp512 | 1305.74 ± 2.58 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | 1 | tg128 | 52.03 ± 0.05 |
build: f8f071fad (6830)

AMD Radeon RX 480

$ llama-bench --device vulkan2 -ngl 100 -m llama-2-7b.Q4_0.gguf -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA P104-100, compute capability 6.1, VMM: yes
Device 1: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 1070 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = NVIDIA P104-100 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon RX 480 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | fa | dev | test | t/s |
| --- | ---: | ---: | --- | --: | -: | --- | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,Vulkan | 100 | 0 | Vulkan2 | pp512 | 201.82 ± 0.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,Vulkan | 100 | 0 | Vulkan2 | tg128 | 36.49 ± 0.06 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,Vulkan | 100 | 1 | Vulkan2 | pp512 | 194.52 ± 0.61 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,Vulkan | 100 | 1 | Vulkan2 | tg128 | 37.23 ± 0.09 |

build: 0bcb40b (6833)


AMD Ryzen 5 5600H with Vega 7 iGPU. The package is capped at 35W. 2x 32GB DDR4 3200MHz (generic).
mesa-vulkan-drivers:amd64 25.2.3-1ubuntu1.

llama.cpp/build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 83.02 ± 0.01 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | tg128 | 10.87 ± 0.01 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 79.06 ± 0.01 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 10.75 ± 0.00 |

Slightly more surprising is this:

llama.cpp/build/bin/llama-bench -ngl 100 -fa 0,1 -m /mnt/models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 0 | pp512 | 110.79 ± 1.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 0 | tg128 | 15.42 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 1 | pp512 | 106.54 ± 0.99 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 1 | tg128 | 15.49 ± 0.01 |
build: 5d195f17b (6839)

@pt13762104 Yes, most likely. Memory bandwidth is the limiting factor here, I think. Still, these models make local LLMs feasible on platforms that were never thought sufficient for them.

Here's another example, 7B MoE, running completely on the iGPU and staying relatively cool.

$ llama-bench -ngl 100 -m /mnt/models/unsloth/granite-4.0-h-tiny-GGUF/granite-4.0-h-tiny-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid 7B.A1B Q4_0 | 3.73 GiB | 6.94 B | Vulkan | 100 | pp512 | 193.02 ± 0.95 |
| granitehybrid 7B.A1B Q4_0 | 3.73 GiB | 6.94 B | Vulkan | 100 | tg128 | 29.40 ± 0.07 |
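The same memory-bandwidth arithmetic explains why MoE models punch above their weight here: per token, only the active experts' weights have to be streamed, not the full parameter set. A rough sketch (the bandwidth figure and the ~0.6 GiB active-weight size are illustrative assumptions, not measurements):

```python
# Memory-bound tg estimate: tokens/s = bandwidth / bytes read per token.
def tg_estimate(bandwidth_gb_s: float, bytes_per_token_gib: float) -> float:
    return bandwidth_gb_s * 1e9 / (bytes_per_token_gib * 1024 ** 3)

# Dense Llama 2 7B Q4_0 reads all 3.56 GiB of weights per token; a
# 7B.A1B MoE like granite-4.0-h-tiny reads only the active experts'
# weights, assumed here to be ~0.6 GiB at Q4_0.
dense_tg = tg_estimate(40, 3.56)  # roughly matches the ~11 t/s dense result
moe_tg = tg_estimate(40, 0.6)
```

The measured 29 t/s is below the naive MoE estimate, since attention and shared layers aren't expert-gated and routing adds overhead, but the direction of the gap matches.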

AMD Ryzen 5 5600H with Vega 7 iGPU. The package is capped at 35W. 2x 32GB DDR4 3200MHz (generic). mesa-vulkan-drivers:amd64 25.2.3-1ubuntu1.

You can uncap TDP for package / iGPU.

https://github.com/JamesCJ60/Universal-x86-Tuning-Utility/releases

I use this, and on my humble R5 4650U I GPU runs so fast you can play BG3 in full HD with decent FPS.


You can uncap TDP for package / iGPU.

I know, but (a) this is a tiny mini PC that fits in my palm - it simply doesn't provide the cooling or voltage regulators; (b) the wall wart delivers 55W max; and (c) this isn't Windows (but POR can be edited in the BIOS).


(b) the wall wart delivers 55W max

Wall wart hahahahahaha made me laugh aloud


I get ~80 t/s on an MX150 with 13 layers offloaded on my old laptop, which is quite similar to this, but tg is only 3 t/s since the laptop has only one channel of DDR4.


b6840, macOS Sequoia

AMD Radeon RX 6900 XT, eGPU, TB3, iMac Pro

./llama.cpp/build/bin/llama-bench -m ./GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -sm none -mg 0
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro Vega 64 (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model | size | params | backend | threads | sm | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --- | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | none | 0 | pp512 | 116.28 ± 0.36 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | none | 0 | tg128 | 71.96 ± 0.29 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | none | 1 | pp512 | 111.82 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | none | 1 | tg128 | 72.39 ± 0.20 |

AMD Radeon Pro Vega 64, internal, iMac Pro

./llama.cpp/build/bin/llama-bench -m ./GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -sm none -mg 1
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro Vega 64 (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model | size | params | backend | threads | main_gpu | sm | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --- | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | 1 | none | 0 | pp512 | 23.90 ± 0.32 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | 1 | none | 0 | tg128 | 36.72 ± 2.79 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | 1 | none | 1 | pp512 | 21.66 ± 0.21 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | 1 | none | 1 | tg128 | 37.33 ± 0.08 |

FA ALL:
Results and cards in same order as above:

./llama/llama-bench -m ./GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -fa all -sm none -mg 0
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro Vega 64 (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model | size | params | backend | threads | sm | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | none | pp512 | 118.09 ± 0.39 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | none | tg128 | 72.31 ± 0.17 |

./llama/llama-bench -m ./GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -fa all -sm none -mg 1
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro Vega 64 (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model | size | params | backend | threads | main_gpu | sm | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --- | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | 1 | none | pp512 | 23.61 ± 0.61 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 10 | 1 | none | tg128 | 37.59 ± 0.41 |

Radeon Vega 6 APU 
(AMD Ryzen 5 PRO 4650U)
Win 11, latest build, 64 GB system RAM

.\llama-bench.exe -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 52.11 ± 0.11 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | tg128 | 7.35 ± 0.30 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 41.54 ± 0.05 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 7.41 ± 0.06 |

FA ALL

.\llama-bench.exe -m llama-2-7b.Q4_0.gguf -ngl 100 -fa all
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | pp512 | 52.20 ± 0.04 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | tg128 | 7.49 ± 0.02 |

AMD Radeon 880M (Ryzen AI 9 365)

Tested with RADV and AMDVLK.
Mesa 25.2.4

% AMD_VULKAN_ICD=AMDVLK ./build-vk/bin/llama-bench -m llama-2-7b.Q4_0.gguf -fa 0,1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 293.41 ± 3.16 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 8.31 ± 0.05 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 315.24 ± 3.80 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 8.30 ± 0.04 |

% AMD_VULKAN_ICD=RADV ./build-vk/bin/llama-bench -m llama-2-7b.Q4_0.gguf -fa 0,1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | pp512 | 201.59 ± 6.84 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 0 | tg128 | 14.94 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 246.65 ± 2.12 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 15.11 ± 0.46 |

build: c55d53a (6854)

Edit: it's reported as 890M but it's 880M.
