Performance of llama.cpp with Vulkan #10879
This is similar to the Apple Silicon benchmark thread, but for Vulkan! We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.

Instructions

Either run the commands below or download one of our Vulkan releases. If you have multiple GPUs, please run the test on a single GPU using `GGML_VK_VISIBLE_DEVICES`. Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard. If multiple entries are posted for the same setup, I'll prioritize newer commits with substantial Vulkan updates; otherwise I'll pick the one with the highest overall score at my discretion.

Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that memory speed and the number of channels will greatly affect your inference speed!
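A typical build and benchmark run looks something like this (a sketch assembled from the steps discussed later in this thread; the model path is an assumption):

```sh
# Configure with the Vulkan backend; make sure it's a Release build,
# since a debug build will tank the tg numbers
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

# Benchmark Llama 2 7B Q4_0 fully offloaded, with and without flash attention
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1

# With multiple GPUs, restrict the run to a single one, e.g. the first:
GGML_VK_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
```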
Vulkan Scoreboard (click on the headings to expand each section)

Llama 2 7B, Q4_0, no FA

Llama 2 7B, Q4_0, FA enabled
AMD FirePro W8100
With the latest updates:
With FA:
AMD RX 470
With the latest updates:
With FA:
I got the 8 GB mining edition; it ran slightly better.
build: e288693 (6242)
Ubuntu 24.04, Vulkan and CUDA installed from official APT packages.
build: 4da69d1 (4351)
vs. CUDA on the same build/setup:
build: 4da69d1 (4351)
MacBook Air M2 on Asahi Linux
ggml_vulkan: Found 1 Vulkan devices:
For the record, I think this is slow on the HoneyKrisp side rather than on the llama.cpp side.
Can you share how you got Vulkan to build on Asahi? I can't seem to get cmake to notice it.
```
cmake -B build -DGGML_CPU_AARCH64=OFF -DGGML_VULKAN=1
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- Including CPU backend
-- ARM detected
-- ARM -mcpu not found, -mcpu=native will be used
-- ARM feature DOTPROD enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+i8mm+nosve+nosme
CMake Error at /usr/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
  Could NOT find Vulkan (missing: Vulkan_LIBRARY) (found version "1.3.296")
Call Stack (most recent call first):
  /usr/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:603 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.30/Modules/FindVulkan.cmake:595 (find_package_handle_standard_args)
  ggml/src/ggml-vulkan/CMakeLists.txt:4 (find_package)
-- Configuring incomplete, errors occurred!
```
Spoke too soon, got it working!
`cmake -B build -DGGML_CPU_AARCH64=OFF -DGGML_VULKAN=1 -DVulkan_LIBRARY=/usr/lib64/libvulkan.so.1`
Gentoo Linux on ROG Ally (2023), Ryzen Z1 Extreme
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: Found 4 Vulkan devices:
Cool setup! Could you also post the results for 1, 2 and 3 7900 XTX GPUs? You can use only the first GPU with `export GGML_VK_VISIBLE_DEVICES=0`, the first two with `export GGML_VK_VISIBLE_DEVICES=0,1`, and so on.
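A minimal sketch of that sweep (binary and model paths are assumptions):

```sh
# Benchmark with one, two, then three GPUs visible to the Vulkan backend
for devs in 0 0,1 0,1,2; do
  GGML_VK_VISIBLE_DEVICES=$devs ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
done
```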
For this multi-GPU case, getting Vulkan to support pipeline parallelism (#6017) might help improve the prompt processing speed.
@netrunnereve I updated the commit id in all my results.
build: 0d52a69 (4439)
NVIDIA GeForce RTX 3090 (NVIDIA)
AMD Radeon RX 6800 XT (RADV NAVI21) (radv)
AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)
Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)
@netrunnereve Some of the tg results here are a little low; I think they might be debug builds. The cmake step (at least on Linux) might require `cmake .. -DGGML_VULKAN=on -DCMAKE_BUILD_TYPE=Release`.
I've added it. My Release-build numbers for the RX 470 are basically identical to the ones I posted earlier without the flag.
Maybe not in your case, but some other results are suspiciously low in tg (for example, the RTX 3080).
Build: 8d59d91 (4450)
Lack of proper Xe coopmat support in the ANV driver is honestly a setback.
Edit: retested both with the default batch size.
They do have VTune, but it needs a third-party kernel module to run, which I don't like, to be honest. Also, I don't know whether it supports Vulkan apps or not, but it does seem to support OpenCL.

I put my A770 into a Windows PC and gave Intel GPA and VTune a shot: GPA just crashes most of the time; I couldn't get it to trace anything useful. VTune works but does not support Vulkan, so it just shows some high-level metrics in that case. Not really useful, sadly.
> Your Vulkan tg result is lower than expected, can you retry with the cmake build type set like in the updated instructions? It might be due to a debug build.

I did build it with cmake with the build type set to Release.
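One quick way to confirm what was actually configured (a sketch, assuming an out-of-tree build directory named `build`):

```sh
# Prints the cached build type, e.g. CMAKE_BUILD_TYPE:STRING=Release
grep CMAKE_BUILD_TYPE build/CMakeCache.txt
```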
In that case it's something else, because it should be performing similarly to my A770. I suspect the Mesa version; there was something in newer Mesa versions that slowed down tg on Intel.
The A750 has 448 CUs and the A770 has 512, I think. Personally, I'm not worried about tg; I'm worried about pp here. The GEMM batch quickly saturates my GPU.
Here's something exotic: an AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5.
build: 914a82d (4452)
Very interesting, and it looks like it's pretty close to the W8100 in tg despite being a dual-GPU card. Your backend scales pretty well with layer splitting, which is why I find it worthwhile to run my RX 470 and W8100 together (I end up getting results that are close to the average of both cards).
Latest Arch. For the sake of consistency I run every bit in a script and also build every target from scratch. The run is wrapped like this:

```sh
kill -STOP -1
timeout 240s $COMMAND
kill -CONT -1
```

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: ff3fcab (4459)

This bit seems to underutilise both GPU and CPU in real conditions.
Unless you reduce the number of GPU layers, threads and OpenBLAS vs. non-OpenBLAS are not going to make any difference. Try it with `-ngl 0`: then only prompt processing is accelerated using Vulkan and the rest runs on the CPU. This is often a good setting for integrated GPUs.
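For example (a sketch; the model path is an assumption):

```sh
# Offload zero layers: Vulkan still accelerates prompt processing,
# while token generation stays on the CPU
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 0
```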
That's something I didn't think about.
build: ba8a1f9 (4460)
It seems the latest patches have improved the results a bit:
ggml_vulkan: Found 1 Vulkan devices:
A few months later and I get:
build: f3a4b16 (5568)
I run it on Linux (Arch, with the llama.cpp-vulkan-git package compiled by GCC 15). From my tests, only the Vulkan backend (...). I'm curious why I cannot go over 6 t/s. Is this an issue with the newer llama.cpp version or with my OS configuration?
Intel Arc A770 on Windows:
build: ba8a1f9 (4460)
Single GPU Vulkan

Radeon Instinct MI25
ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)

Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)

Multi GPU Vulkan

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)

Single GPU ROCm

Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)

Multi GPU ROCm

Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)

Layer split
build: 2739a71 (4461)

Row split
build: 2739a71 (4461)

Single GPU speed is decent, but multi-GPU trails ROCm by a wide margin, especially with large models, due to the lack of row split.
What is the power profile for this MI25? Mine is 110W, but it's running slower than yours on git from today.
Mine defaults to 220W.
You can increase the power with `rocm-smi --setpoweroverdrive 220`.
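Something like this (a sketch; rocm-smi usually needs root, and the right wattage is card-specific):

```sh
sudo rocm-smi                          # the summary table shows the current power cap and draw
sudo rocm-smi --setpoweroverdrive 220  # raise the cap to 220 W, as suggested above
```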
AMD Radeon RX 5700 XT on Arch, using mesa-git and setting a higher GPU power limit compared to the stock card.
I also think it could be interesting to add the flash attention results to the scoreboard (even if the support for it still isn't as mature as CUDA's).
There is no Vulkan flash attention support (except with coopmat2 on very new NVIDIA drivers). What you're measuring here is a CPU fallback.
I see, I was sure about the CPU fallback but didn't know there was no flash attention support at all.
I tried, but there's nothing after 1 hour, OK, maybe 40 minutes...
Anyway, I ran llama-cli for a sample eval:

```
./llama-cli -m ~/storage/llama-2-7b.Q4_0.gguf -p "can u" -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G57 (Mali-G57) | uma: 1 | fp16: 1 | warp size: 16 | matrix cores: none
build: 4419 (46e3556e) with clang version 19.1.6 for aarch64-unknown-linux-android24

llama_perf_sampler_print: sampling time = 3.31 ms / 24 runs ( 0.14 ms per token, 7242.00 tokens per second)
llama_perf_context_print: load time = 28544.85 ms
llama_perf_context_print: prompt eval time = 3788.63 ms / 3 tokens ( 1262.88 ms per token, 0.79 tokens per second)
llama_perf_context_print: eval time = 23248.44 ms / 20 runs ( 1162.42 ms per token, 0.86 tokens per second)
llama_perf_context_print: total time = 27591.65 ms / 23 tokens
```

Meanwhile, OpenBLAS:

```
llama_perf_sampler_print: sampling time = 5.00 ms / 43 runs ( 0.12 ms per token, 8608.61 tokens per second)
llama_perf_context_print: load time = 10871.74 ms
llama_perf_context_print: prompt eval time = 1228.38 ms / 3 tokens ( 409.46 ms per token, 2.44 tokens per second)
llama_perf_context_print: eval time = 17010.39 ms / 39 runs ( 436.16 ms per token, 2.29 tokens per second)
llama_perf_context_print: total time = 18639.62 ms / 42 tokens
```
Even at below 1 t/s, llama-bench shouldn't run for an hour. The support just isn't there at the moment for Vulkan on Android.
Truth is...

> (0.79 tokens per second), 3788.63 ms / 3 tokens

So it's not even... it's just slower...
@ ~/git/llama.cpp/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
lsfg-vk: Configuration entry disappeared, disabling.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 1997.19 ± 7.76 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | tg128 | 163.00 ± 0.76 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 2044.99 ± 14.06 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 166.25 ± 0.19 |
build: 3c3635d2 (6400)
@ ~/git/llama.cpp/build/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
lsfg-vk: Configuration entry disappeared, disabling.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 2254.00 ± 9.48 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | tg128 | 162.16 ± 0.11 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 2281.62 ± 25.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 165.08 ± 0.27 |
build: 7a50cf388 (6779)
Newer builds seem to have better tg performance than pp performance, or I'm confused about what I'm doing wrong to get lower pp performance on my XTX.

This Mesa RADV merge request massively improves prompt processing.

I'll have to do an upgrade later.
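For anyone comparing numbers across driver versions, a quick way to see which driver and Mesa version the Vulkan backend will pick up (a sketch; vulkaninfo comes from the vulkan-tools package):

```sh
# Lists each Vulkan device with its driver name and version string
vulkaninfo --summary | grep -iE 'deviceName|driverInfo'
```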
Titan V (12 GB / HBM2 / 3072-bit)
ggml_vulkan: Found 1 Vulkan devices:
build: e56abd2 (6794)
I wonder if there's a way to actually utilize the matrix cores.
NVIDIA GeForce RTX 4080 SUPER
OS: NixOS / Linux 6.16.11-xanmod1
build: 81086cd (6729)
Radeon 660M (Ryzen 5 6600H), dual-channel DDR5 4800 MT/s
build: ee09828 (6795)
Radeon 680M (Ryzen 5 6800H), LPDDR5 6400 MT/s
build: ee09828 (6795)
RTX 3070 Laptop GPU (8 GB / GDDR6 / 256-bit), Driver Version: 580.76.05
build: ceff6bb (6783)
RX Vega 56
build: 66b0dbc (6791)
I know that's an old run, but interestingly this is beating the Vega 64 by quite a bit.
Another Strix Halo 395+ (128GB). Also with the optimizations alluded to in the Strix Halo benchmarking guide.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.3 LTS
Release: 24.04
Codename: noble
Linux framework 6.17.1-061701-generic #202510060945 SMP PREEMPT_DYNAMIC Mon Oct 6 12:03:14 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
radv, mesa 26.0.0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | pp512 | 1120.83 ± 2.57 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | tg128 | 53.21 ± 0.38 |
build: f8f071fad (6830)
framework@framework:~/workshop/llm/llama.cpp$ AMD_VULKAN_ICD=RADV /home/framework/workshop/llm/llama.cpp/build/bin/llama-bench -m /home/framework/workshop/llm/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | 1 | pp512 | 1243.31 ± 2.12 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | 1 | tg128 | 52.69 ± 0.10 |
build: f8f071fad (6830)
AMDVLK:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | pp512 | 1264.60 ± 2.36 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | tg128 | 52.50 ± 0.13 |
build: f8f071fad (6830)
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD RYZEN AI MAX+ 395 w/ Radeon 8060S (AMD open-source driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | 1 | pp512 | 1305.74 ± 2.58 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan,BLAS | 16 | 1 | tg128 | 52.03 ± 0.05 |
build: f8f071fad (6830)
AMD Radeon RX 480

$ llama-bench --device vulkan2 -ngl 100 -m llama-2-7b.Q4_0.gguf -fa 0,1
build: 0bcb40b (6833)
AMD Ryzen 5 5600H with Vega 7 iGPU. The package is capped at 35W. 2x 32GB DDR4 3200MHz (generic).
mesa-vulkan-drivers:amd64 25.2.3-1ubuntu1.
llama.cpp/build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | pp512 | 83.02 ± 0.01 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 0 | tg128 | 10.87 ± 0.01 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | pp512 | 79.06 ± 0.01 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | 1 | tg128 | 10.75 ± 0.00 |
Slightly more surprising is this:
llama.cpp/build/bin/llama-bench -ngl 100 -fa 0,1 -m /mnt/models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 0 | pp512 | 110.79 ± 1.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 0 | tg128 | 15.42 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 1 | pp512 | 106.54 ± 0.99 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 1 | tg128 | 15.49 ± 0.01 |
build: 5d195f17b (6839)
@pt13762104 Yes, most likely. Memory bandwidth is a limiting factor here, I think. However, those models make local LLMs feasible on platforms that were never thought sufficient for it.
Here's another example, 7B MoE, running completely on the iGPU and staying relatively cool.
$ llama-bench -ngl 100 -m /mnt/models/unsloth/granite-4.0-h-tiny-GGUF/granite-4.0-h-tiny-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid 7B.A1B Q4_0 | 3.73 GiB | 6.94 B | Vulkan | 100 | pp512 | 193.02 ± 0.95 |
| granitehybrid 7B.A1B Q4_0 | 3.73 GiB | 6.94 B | Vulkan | 100 | tg128 | 29.40 ± 0.07 |
> AMD Ryzen 5 5600H with Vega 7 iGPU. The package is capped at 35W. 2x 32GB DDR4 3200MHz (generic).
> mesa-vulkan-drivers:amd64 25.2.3-1ubuntu1.

You can uncap the TDP for the package / iGPU:
https://github.com/JamesCJ60/Universal-x86-Tuning-Utility/releases
I use this, and on my humble R5 4650U the iGPU runs so fast you can play BG3 in full HD with decent FPS.
> You can uncap the TDP for the package / iGPU.

I know, but (a) this is a tiny mini PC that fits in my palm; it simply doesn't provide the cooling or voltage regulators; (b) the wall wart delivers 55W max; and (c) this isn't Windows (but the POR can be edited in the BIOS).
> (b) the wall wart delivers 55W max

"Wall wart" hahahahahaha, made me laugh aloud.
I get ~80 t/s on an MX150 with 13 layers offloaded on my old laptop; it's quite similar to this, but tg is only 3 t/s due to the laptop having only one channel of DDR4.
b6840, macOS Sequoia

AMD Radeon RX 6900 XT, eGPU, TB3, iMac Pro
./llama.cpp/build/bin/llama-bench -m ./GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -sm none -mg 0

AMD Radeon Pro Vega 64, internal, iMac Pro
./llama.cpp/build/bin/llama-bench -m ./GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -sm none -mg 1

FA ALL:
./llama/llama-bench -m ./GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -fa all -sm none -mg 0
./llama/llama-bench -m ./GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -fa all -sm none -mg 1
Radeon Vega 6 APU (AMD Ryzen 5 PRO 4650U)

.\llama-bench.exe -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1

FA ALL:
.\llama-bench.exe -m llama-2-7b.Q4_0.gguf -ngl 100 -fa all
Beta Was this translation helpful? Give feedback.
All reactions
-
|
AMD Radeon 880M (Ryzen AI 9 365). Tested with RADV and AMDVLK.

% AMD_VULKAN_ICD=AMDVLK ./build-vk/bin/llama-bench -m llama-2-7b.Q4_0.gguf -fa 0,1
% AMD_VULKAN_ICD=RADV ./build-vk/bin/llama-bench -m llama-2-7b.Q4_0.gguf -fa 0,1
build: c55d53a (6854)

Edit: it's reported as 890M but it's 880M.