Something I'm observing; I'm not sure whether this is expected behavior. I'm using b6719, built for Vulkan, currently with a Radeon RX 6600 XT. I don't know whether this is worth opening an issue.

When I'm using partial GPU offloading (such as for MoE models, or with -nkvo 1), I notice that CPU usage is often lower during prompt processing than during token generation. It seems to use only 3-4 threads for PP, while it respects my -t 9 setting during TG. I'm observing this simply by watching amdgpu_top while llama.cpp is running.
```shell
/root/llama-builds/llama.cpp/bin/llama-bench \
  -m /mnt/models/unsloth/LFM2-8B-A1B-UD-Q6_K_XL.gguf \
  -ngl 999 -sm none -mg 1 -nkvo 1 \
  --n-cpu-moe 0,999 \
  -p 512,2048 \
  -t 9
```
An example of what I'm seeing (screenshot omitted): this is while it's running the 2048-token prompt size. I'd expect CPU usage to be in the 800-900% range, but it's hovering between 200-300%.
It also doesn't seem to be consistent. Sometimes it will use the full CPU threads for PP.
Is there a better way to test/observe what's happening?
Replies: 1 comment
My guess would be that it is still running the large prompt processing operators on the GPU, so the CPU doesn't need all of its threads. I wouldn't worry too much about these usage indicators as long as the performance you are getting is in line with what you expect.
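If you want to double-check what the threads are actually doing rather than relying on an aggregate CPU% figure, per-thread inspection is more informative. A minimal sketch (Linux only, reading `/proc/<pid>/task`; the `llama-bench` target PID is substituted by you, and the helper name `thread_count` is just illustrative):

```shell
# Count the kernel threads a process currently has (Linux /proc).
# Usage: thread_count <pid>
thread_count() {
  ls "/proc/$1/task" | wc -l
}

# Example with a hypothetical target: substitute the llama-bench PID,
# e.g. thread_count "$(pgrep -x llama-bench)".
# Here we just demonstrate on the current shell's own PID.
thread_count "$$"
```

For live per-thread CPU usage, `pidstat -t -p <pid> 1` (from the sysstat package) or `top -H` will show whether the PP threads are pinned at 100% or mostly idle waiting on the GPU, which distinguishes "fewer threads spawned" from "threads spawned but blocked".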