Llama.cpp RPC over Ethernet strangely slow #9136
-
Hey everyone,
I was hoping to get some help with the RPC service in Llama.cpp. I'm running a pair of systems with the latest Llama.cpp, which I compiled myself (with the '-DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_CUDA_FORCE_CUBLAS=ON' flags on cmake). Each system has two GPUs, all recent discrete GeForce cards (4090 and 4060 Tis).
The trouble is that I recently upgraded that segment of my network to 2.5 Gb Ethernet, since I understood from Reddit posts that network speed would be the limiting factor for Llama.cpp inference over RPC. Someone on one Reddit thread was talking about using USB4/Thunderbolt 4 to achieve a theoretical 40 Gb/s. The strange thing is that I'm only seeing about 30-50 Mb/s (as in megabit, not gigabit) of traffic on my Ethernet link while running an inference task. This is nowhere near the maximum speed I've seen when doing other tasks, such as loading the model (about 1.9-2.1 Gb/s). As a result, the token rate is much lower than I expected, at around 3 t/s.
If I disable one of the cards from being involved in the RPC cluster, the inference speeds up a little, but the network still doesn't transfer any faster than around 30-50 Mb/s. It makes sense that the inference is a little faster, but the amount of VRAM is lower, so the models can't be as large.
(I've also tried several models, with no notable difference in that network speed or tokens per second.)
Given that the GPUs' internal buses are faster (18 Gb/s per pin for the memory on a 4060 Ti) and PCIe is faster (7.877 GB/s on PCIe 4.0 x4), I don't understand why the network isn't being saturated, or why inference is running so much slower than expected. I can't tell what's causing this bottleneck.
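To put the observed traffic in perspective, here is a rough sketch of the link utilization implied by the figures quoted above. The function name is illustrative only, and the link is taken at its nominal 2.5 Gb/s rate.

```python
def link_utilization(observed_mbit_s: float, link_gbit_s: float) -> float:
    """Fraction of the link's nominal capacity that is in use."""
    return observed_mbit_s / (link_gbit_s * 1000.0)


LINK_GBIT_S = 2.5  # nominal 2.5 Gb/s Ethernet, as described in the post

# Traffic figures quoted in the post, in megabits per second.
observations = [
    ("inference (low)", 30),
    ("inference (high)", 50),
    ("model load (low)", 1900),
    ("model load (high)", 2100),
]

for label, mbit_s in observations:
    util = link_utilization(mbit_s, LINK_GBIT_S)
    # Divide by 8 to convert megabits to megabytes.
    print(f"{label:>18}: {mbit_s:>5} Mb/s = {mbit_s / 8:7.2f} MB/s ({util:.1%} of link)")
```

By this measure, inference uses only a few percent of the link, while model loading comes close to its nominal rate.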
Are there any obvious things I can check to try to fix this please? Any help is greatly, greatly appreciated, thank you!
Replies: 4 comments 2 replies
-
Could you provide some more information about the commands and models that you are using? It would be useful to run llama-bench with and without RPC to see how the numbers compare. Some iperf numbers might also help.
During inference, only the hidden state is transferred across the network after each layer. The state is very small (a few kB), so it's normal not to see much network traffic. Whether the "3t/s" speed is expected is hard to say without additional information.
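As a back-of-the-envelope illustration of why the traffic stays small, the sketch below estimates the activation payload per token and a latency-bound token rate. All parameter values (embedding width, 32-bit activations, number of network transfers per token, compute time, round-trip time) are assumptions chosen for illustration, not measurements from this thread, and the estimate ignores any RPC protocol overhead.

```python
def hidden_state_bytes(n_embd: int, bytes_per_value: int = 4) -> int:
    """Size of one token's hidden state (a single activation vector)."""
    return n_embd * bytes_per_value


def required_bandwidth_mbit_s(n_embd: int, transfers_per_token: int,
                              tokens_per_s: float, bytes_per_value: int = 4) -> float:
    """Payload bandwidth needed to move hidden states at a given token rate."""
    payload = hidden_state_bytes(n_embd, bytes_per_value) * transfers_per_token
    return payload * tokens_per_s * 8 / 1e6  # bytes/s -> megabits/s


def latency_bound_tokens_per_s(compute_s_per_token: float,
                               transfers_per_token: int,
                               round_trip_s: float) -> float:
    """Token rate when every transfer adds one blocking round trip."""
    return 1.0 / (compute_s_per_token + transfers_per_token * round_trip_s)


# Hypothetical values, for illustration only.
N_EMBD = 4096        # assumed embedding width
TRANSFERS = 4        # assumed network hand-offs per generated token
TOKENS_PER_S = 3.0   # token rate reported in the original post

print(hidden_state_bytes(N_EMBD), "bytes per hidden state")
print(required_bandwidth_mbit_s(N_EMBD, TRANSFERS, TOKENS_PER_S), "Mb/s payload")
print(latency_bound_tokens_per_s(0.2, TRANSFERS, 0.001), "t/s if latency-bound")
```

Under these assumptions the payload needs only a small fraction of the link, so per-transfer latency and waiting, rather than bandwidth, would be the first things to examine.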
There is some pending work on optimizing the network overhead (#8032), but it is not yet ready for testing.
-
I just ran a lot of tests on RPC between 2 identical Xeon servers, and I think that Reddit post reached the wrong conclusion.
The limiting factor is NOT the Ethernet speed, by far.
The limiting factor is the way parallelism currently works.
It needs some optimization.
The nodes should continue processing a parallel inference instead of waiting for the other node.
The current method always waits, no matter how many parallel threads are configured.
You can see this clearly if you split the model in half between 2 RPC nodes and run parallel inference on them.
The CPU utilization will be 1/2 of what it should be.
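The 1/2 figure follows from simple arithmetic if only one node works at a time. Here is a minimal sketch of that reasoning; it assumes strictly sequential hand-off between nodes and ignores network time, and the function is a hypothetical helper, not part of llama.cpp.

```python
def node_utilization(node_times: list[float]) -> list[float]:
    """Busy fraction of each node when only one node works at a time.

    node_times[i] is the time node i spends on its share of the layers
    for one token; the token period is the sum of all node times.
    """
    period = sum(node_times)
    return [t / period for t in node_times]


# Model split in half between 2 nodes: each node is busy half the time.
print(node_utilization([1.0, 1.0]))  # → [0.5, 0.5]
```

With an uneven split the fractions differ per node, but they still sum to 1, so the nodes combined never do more than one node's worth of work at any moment.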
I have no time to go into more detail now, but I am interested in this topic because I see a lot of room for improvement.
I also had to add thread configuration and NUMA configuration to the rpc-server, because they were completely missing from the code.
Update: I also noticed the slow speed between the RPC nodes. It fluctuates between 10 and 100 MBytes/s on a stable 1G connection. But as I mentioned, this is a secondary issue.
-
I agree with @Zorg33. If there were a way to combine tensor parallelism with pipeline parallelism, the RPC servers would be more effective and the overall RPC system more efficient. The bottleneck between RPC nodes seems to come from the communication methods between the RPC backends.
[Image: picture of pipeline parallelism]
[Image: picture of pipeline parallelism (llama.cpp row split?)]
If there is any other testing you'd like to see, I have access to 16 GPUs (ROCm) across 2 nodes (I can only run up to 15 GPUs using RPC because the local host is considered a device). Let me know and I can run the tests and post the results, @ggerganov.
-
Exactly!
That C picture illustrates the exact problem with the handling of parallelism!
Let me try to put it into words:
The offset between parallel pipelines should be ONLY 1 LAYER instead of the whole set of layers handled by a node.
There are at least 2 levels of granularity in this problem.
- Node level: When a node has finished processing a sequence, it should immediately start processing the next input and not wait for the other nodes.
- Layer level: When a node has finished a single layer, it should start processing the next input.
- ...it can be split further and further, for less and less performance gain.
The first level does not seem hard to implement and would give up to an N× throughput boost, where N is the number of RPC nodes.
The second level is a bit more complicated.
Keep in mind that the performance gain only applies to parallel inference and not to a single inference, because inference is strictly sequential with respect to the order of layers!
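The node-level claim can be checked with a toy timing model. The sketch below compares a blocking schedule (every node waits until a sequence has passed through all nodes) with a pipelined one (a node starts the next sequence as soon as it is free). It assumes equal per-node stage times and zero communication cost, and the function names are hypothetical.

```python
def blocking_makespan(n_nodes: int, n_sequences: int, stage_time: float = 1.0) -> float:
    """Total time when only one node is active at any moment."""
    return n_sequences * n_nodes * stage_time


def pipelined_makespan(n_nodes: int, n_sequences: int, stage_time: float = 1.0) -> float:
    """Total time when each node starts the next sequence as soon as it is free.

    The first sequence needs n_nodes stages to fill the pipeline; after that,
    one sequence completes per stage.
    """
    return (n_sequences + n_nodes - 1) * stage_time


def pipeline_speedup(n_nodes: int, n_sequences: int) -> float:
    """Ratio of blocking to pipelined total time."""
    return blocking_makespan(n_nodes, n_sequences) / pipelined_makespan(n_nodes, n_sequences)


for n_seq in (1, 4, 100):
    print(f"2 nodes, {n_seq:>3} sequences: speedup {pipeline_speedup(2, n_seq):.2f}x")
```

In this model a single sequence gains nothing, and the speedup approaches N only as the number of parallel sequences grows, which matches the caveat about single inference.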
-
I'm working on optimizing RPC and multi-GPU as well, and no, it's not the network. I upgraded to a 2.5 Gb network switch to make sure of that.
It's the way llama.cpp passes from GPU to GPU and then spits out the answer.
The more GPUs you have, the more it has to go from GPU to GPU.
In sequence, not in parallel.
As long as you have enough CPU to control the GPUs, it doesn't slow things down. The GPUs are maxed out (4070 and 4060 Ti).
So there is some improvement with, say, a 4080-5090 main GPU setup (most of the work is done by the main GPU), but it's still not worth it.
It doesn't seem to be memory bandwidth on the GPUs either, so we are not maxing out the LAN or the VRAM bus.
Is it the time taken for the main GPU to access the slave GPUs? I don't know; that's something for the llama.cpp developers.
-
I got RPC working between my two machines, but it turns out that running the model on a single machine with more CPU offloading actually performs almost twice as fast. I have one machine with 2x3090 and 1x4090, and another machine with a 5090.
If I run on the single machine with 3 GPUs, GLM-4.6 at IQ3_XXS with --n-cpu-moe 67 gets a bit over 5 t/s. Adding the 5090 machine over RPC adds ~30 GB of VRAM, so I need to offload less to RAM. I figured it would be faster, but I only get 2 t/s. I see about 500 Mbps over the network during inference.
I guess there's also more overhead because I am still offloading to CPU as well. I would have thought that with enough network bandwidth it shouldn't be too big of a deal, though, considering you can run inference over PCIe x1 and it's still fine bandwidth-wise, even when CPU offloading.
Edit: OK, after some more testing, it must be due to CPU offloading. If I use GLM-4.5-Air and can load all layers into the GPUs, it's a pretty big improvement. With GLM-4.5-Air on the 3 GPUs with --n-cpu-moe 11 I get ~17 t/s, but if I use RPC and fit the model across the 4 GPUs I get ~53 t/s.
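For reference, the relative speeds implied by these figures can be worked out as below. The numbers are the approximate rates quoted above ("a bit over 5" is taken as 5.0), so the ratios are rough, and the helper name is illustrative.

```python
def relative_speed(rpc_tps: float, local_tps: float) -> float:
    """Token rate with RPC relative to the single-machine rate."""
    return rpc_tps / local_tps


# (single machine t/s, with RPC t/s), approximate figures from the post.
results = {
    "GLM-4.6 IQ3_XXS, still offloading to CPU": (5.0, 2.0),
    "GLM-4.5-Air, fully on GPUs with RPC": (17.0, 53.0),
}

for label, (local_tps, rpc_tps) in results.items():
    print(f"{label}: {relative_speed(rpc_tps, local_tps):.2f}x the single-machine rate")
```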