Hello there, thanks for the great work.
I'm wondering how to set the device order when using a multi-GPU system + RPC.
Here is my example.
I have a consumer motherboard, running Linux (Fedora), with:
- X8/X8 PCIe 5.0 from the CPU on the top 2 PCIe slots (5090/5090).
- X4/X4 PCIe 4.0 from the CPU on the top 2 M.2 slots, via M.2-to-PCIe adapters (4090/4090; both slots and adapters support 5.0, but the 4090s are 4.0).
- X4 PCIe 4.0 from the chipset on the bottom PCIe slot (A6000).
- X4/X4 PCIe 4.0 from the chipset on the bottom M.2 slots, via M.2-to-PCIe adapters (3090/3090).
And a Windows PC with an RTX 5090.
Both PCs have a 10 Gbps NIC.
This complex example uses GLM 4.6 IQ4_XS.
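For reference, the Windows PC is the RPC server side. A minimal sketch of how it is started there (a CUDA build of llama.cpp; the binary path and GPU selection are just examples, only the port is the one actually used below):
# on the Windows PC: expose the 5090 over RPC on port 50052
# (CUDA_VISIBLE_DEVICES shown in POSIX form; on Windows set it via `set` / `$env:` instead)
CUDA_VISIBLE_DEVICES=0 ./rpc-server -p 50052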
When running fully on GPU on the Linux PC, with this command:
LLAMA_SET_ROWS=1 ./llama-server \
-m '/models/GLM-4.6-IQ4_XS.gguf' \
-c 32768 \
--no-mmap \
-ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15).ffn.=CUDA0" \
-ot "blk.(16|17|18|19|20|21|22|23|24|25).ffn.=CUDA1" \
-ot "blk.(27|28|29|30|31|32|33|34|35|36).ffn.=CUDA2" \
-ot "blk.(38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.=CUDA3" \
-ot "blk.(51|52|53|54|55|56|57|58|59).ffn.=CUDA4" \
-ot "blk.(61|62|63|64|65|66|67|68|69|70).ffn.=CUDA5" \
-ot "blk.(72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91).ffn.=CUDA6" \
-ot "blk.26.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA1" \
-ot "blk.26.ffn_gate_exps.weight=CUDA1" \
-ot "blk.26.ffn_(down_exps|up_exps).weight=CUDA0" \
-ot "blk.37.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA2" \
-ot "blk.37.ffn_gate_exps.weight=CUDA2" \
-ot "blk.37.ffn_(down_exps|up_exps).weight=CUDA3" \
-ot "blk.60.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA4" \
-ot "blk.60.ffn_gate_exps.weight=CUDA4" \
-ot "blk.60.ffn_(down_exps|up_exps).weight=CUDA6" \
-ot "blk.71.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA5" \
-ot "blk.71.ffn_gate_exps.weight=CUDA5" \
-ot "blk.71.ffn_(down_exps|up_exps).weight=CUDA6" \
-fa on \
-mg 0 \
-ub 1792
I get:
prompt eval time = 5781.87 ms / 4410 tokens ( 1.31 ms per token, 762.73 tokens per second)
eval time = 64378.63 ms / 1700 tokens ( 37.87 ms per token, 26.41 tokens per second)
But when removing a 3090 from this PC and using the 40 Gbps NIC, running it with:
LLAMA_SET_ROWS=1 ./llama-server -m '/models/GLM-4.6-IQ4_XS.gguf' -c 32768 --no-mmap --rpc 192.168.50.2:50052 -ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15).ffn.=CUDA0" \
-ot "blk.(16|17|18|19|20|21|22|23|24|25).ffn.=CUDA1" \
-ot "blk.(27|28|29|30|31|32|33|34|35|36).ffn.=CUDA2" \
-ot "blk.(38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.=CUDA3" \
-ot "blk.(51|52|53|54|55|56|57|58|59).ffn.=CUDA4" \
-ot "blk.(61|62|63|64|65|66|67|68|69|70).ffn.=RPC0[192.168.50.2:50052]" \
-ot "blk.(72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91).ffn.=CUDA5" \
-ot "blk.26.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA1" \
-ot "blk.26.ffn_gate_exps.weight=CUDA1" \
-ot "blk.26.ffn_(down_exps|up_exps).weight=CUDA0" \
-ot "blk.37.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA2" \
-ot "blk.37.ffn_gate_exps.weight=CUDA2" \
-ot "blk.37.ffn_(down_exps|up_exps).weight=CUDA3" \
-ot "blk.60.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA4" \
-ot "blk.60.ffn_gate_exps.weight=CUDA4" \
-ot "blk.60.ffn_(down_exps|up_exps).weight=CUDA5" \
-ot "blk.71.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=RPC0[192.168.50.2:50052]" \
-ot "blk.71.ffn_gate_exps.weight=RPC0[192.168.50.2:50052]" \
-ot "blk.71.ffn_(down_exps|up_exps).weight=CUDA5" \
-fa on -mg 1 -ub 1792
I get about 240 t/s PP and 16 t/s TG.
Note that using -mg 0 or -mg 1 makes no difference.
When using the 40 Gbps NIC at X1 3.0 (so about 9 Gbps), I get:
prompt eval time = 27661.35 ms / 4410 tokens ( 6.27 ms per token, 159.43 tokens per second)
eval time = 140832.84 ms / 1784 tokens ( 78.94 ms per token, 12.67 tokens per second)
I noticed this when loading the model:
load_tensors: offloading 93 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 94/94 layers to GPU
load_tensors: RPC0[192.168.50.2:50052] model buffer size = 20957.13 MiB
load_tensors: CPU model buffer size = 416.25 MiB
load_tensors: CUDA0 model buffer size = 27658.19 MiB
load_tensors: CUDA1 model buffer size = 20677.38 MiB
load_tensors: CUDA2 model buffer size = 20747.32 MiB
load_tensors: CUDA3 model buffer size = 27371.32 MiB
load_tensors: CUDA4 model buffer size = 18745.29 MiB
load_tensors: CUDA5 model buffer size = 43127.69 MiB
where the RPC device seems to come first, and the compute buffers also seem to follow that pattern:
llama_context: n_ctx_per_seq (32768) < n_ctx_train (202752) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: RPC0[192.168.50.2:50052] KV buffer size = 1792.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 1792.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 1280.00 MiB
llama_kv_cache: CUDA2 KV buffer size = 1408.00 MiB
llama_kv_cache: CUDA3 KV buffer size = 1792.00 MiB
llama_kv_cache: CUDA4 KV buffer size = 1280.00 MiB
llama_kv_cache: CUDA5 KV buffer size = 2432.00 MiB
llama_kv_cache: size = 11776.00 MiB ( 32768 cells, 92 layers, 1/1 seqs), K (f16): 5888.00 MiB, V (f16): 5888.00 MiB
llama_context: RPC0[192.168.50.2:50052] compute buffer size = 819.03 MiB
llama_context: CUDA0 compute buffer size = 750.18 MiB
llama_context: CUDA1 compute buffer size = 638.15 MiB
llama_context: CUDA2 compute buffer size = 638.15 MiB
llama_context: CUDA3 compute buffer size = 750.18 MiB
llama_context: CUDA4 compute buffer size = 638.15 MiB
llama_context: CUDA5 compute buffer size = 1141.00 MiB
llama_context: CPU compute buffer size = 259.05 MiB
llama_context: graph nodes = 6529
llama_context: graph splits = 276
For reference, GPU order is this (I manually set a 5090 first):
ggml_cuda_init: found 6 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 5: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
It seems the biggest compute buffer is on RPC, despite specifying -mg 1.
So I think it first runs on the RPC device and sends the data over the network (at about 4-5 Gbps) to the other PC, and only then starts working locally.
Is there a way to reorder devices, like CUDA_VISIBLE_DEVICES but for any device? Something like:
GGML_VISIBLE_DEVICES=CUDA0,RPC0[192.168.50.2:50052],CUDA1,CUDA2,CUDA3,CUDA4,CUDA5
Thanks in advance!
-
Is there a way to reorder devices, like CUDA_VISIBLE_DEVICES but for any device? Something like:
GGML_VISIBLE_DEVICES=CUDA0,RPC0[192.168.50.2:50052],CUDA1,CUDA2,CUDA3,CUDA4,CUDA5
From tools/rpc/README.md:
You can control the set of exposed CUDA devices with the CUDA_VISIBLE_DEVICES environment variable or the --device command line option.
In your case and example, it would be:
~~--device CUDA0,RPC0[192.168.50.2:50052],CUDA1,CUDA2,CUDA3,CUDA4,CUDA5~~
Edit: --device CUDA0,RPC0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5
I think you will have to move the RPC device further down the list.
RPC is not without overhead. Even if the RPC device points at the same machine, you will lose performance compared to no RPC. There is no free lunch.
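If in doubt about the exact device names your build accepts, you can also list them first (a quick sketch; llama-bench supports --list-devices, and passing --rpc should register the RPC endpoint so it shows up in the list too):
# list the device names llama.cpp sees, including the RPC endpoint
./llama-bench --rpc 192.168.50.2:50052 --list-devices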
-
I just tried it, but got an error:
LLAMA_SET_ROWS=1 ./llama-server -m '/models/GLM-4.6-IQ4_XS.gguf' -c 32768 --no-mmap --rpc 192.168.50.2:50052 -ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15).ffn.=CUDA0" \
-ot "blk.(16|17|18|19|20|21|22|23|24|25).ffn.=CUDA1" \
-ot "blk.(27|28|29|30|31|32|33|34|35|36).ffn.=CUDA2" \
-ot "blk.(38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.=CUDA3" \
-ot "blk.(51|52|53|54|55|56|57|58|59).ffn.=CUDA4" \
-ot "blk.(61|62|63|64|65|66|67|68|69|70).ffn.=RPC0[192.168.50.2:50052]" \
-ot "blk.(72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91).ffn.=CUDA5" \
-ot "blk.26.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA1" \
-ot "blk.26.ffn_gate_exps.weight=CUDA1" \
-ot "blk.26.ffn_(down_exps|up_exps).weight=CUDA0" \
-ot "blk.37.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA2" \
-ot "blk.37.ffn_gate_exps.weight=CUDA2" \
-ot "blk.37.ffn_(down_exps|up_exps).weight=CUDA3" \
-ot "blk.60.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA4" \
-ot "blk.60.ffn_gate_exps.weight=CUDA4" \
-ot "blk.60.ffn_(down_exps|up_exps).weight=CUDA5" \
-ot "blk.71.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=RPC0[192.168.50.2:50052]" \
-ot "blk.71.ffn_gate_exps.weight=RPC0[192.168.50.2:50052]" \
-ot "blk.71.ffn_(down_exps|up_exps).weight=CUDA5" \
-fa on -mg 0 -ub 1792 --device CUDA0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5,RPC0[192.168.50.2:50052
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 5: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
error while handling argument "--device": invalid device: RPC0[192.168.50.2:50052]
Is there something wrong with the syntax?
-
Sorry, I am not on my main machine right now and cannot check my llama-swap file. Try RPC without the device number: RPC[192.168.50.2:50052]. I think there was a change to how the device is named, but I cannot find it now.
Once I am back home (or I find the PR where the change is explained) I will confirm.
-
I tried, but sadly no luck either:
ggml_cuda_init: found 6 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 5: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
error while handling argument "--device": invalid device: RPC[192.168.50.2:50052]
Sure no problem, many thanks!
-
I am 90% sure: PR #16276 (comment) made it easier to set the device name (thanks, @rgerganov). I had forgotten this.
It should now be:
--device CUDA0,RPC0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5
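As a sketch, with the setup from the first post that just means appending the flag to the existing RPC command (same model and other flags as before; only the relevant parts are shown here):
LLAMA_SET_ROWS=1 ./llama-server -m '/models/GLM-4.6-IQ4_XS.gguf' -c 32768 --no-mmap \
  --rpc 192.168.50.2:50052 -ngl 999 -fa on -ub 1792 \
  --device CUDA0,RPC0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5 # ...plus the same -ot overrides as in the earlier command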
-
Okay, perfect, that helped a lot!
With the default order it was:
prompt eval time = 27661.35 ms / 4410 tokens ( 6.27 ms per token, 159.43 tokens per second)
eval time = 140832.84 ms / 1784 tokens ( 78.94 ms per token, 12.67 tokens per second)
By setting the RPC device at the end of the device list, it went to:
prompt eval time = 16727.94 ms / 4410 tokens ( 3.79 ms per token, 263.63 tokens per second)
eval time = 78875.15 ms / 1396 tokens ( 56.50 ms per token, 17.70 tokens per second)
And when putting it in the middle:
prompt eval time = 6483.46 ms / 4410 tokens ( 1.47 ms per token, 680.19 tokens per second)
eval time = 78029.06 ms / 1757 tokens ( 44.41 ms per token, 22.52 tokens per second)
Which is absolutely insane! So I will want to try installing Linux (Fedora) on this PC to see if it makes it faster, but this is absolutely usable.
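(For clarity, "in the middle" means a placement roughly like the following; I am not reproducing the exact split here, this is just an illustration of the ordering:)
--device CUDA0,CUDA1,CUDA2,RPC0,CUDA3,CUDA4,CUDA5   # hypothetical middle placement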
-
I am 90% sure: PR #16276 (comment) made it easier to set the device name (thanks, @rgerganov). I had forgotten this.
It should now be:
--device CUDA0,RPC0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5
When I try to specify --device ROCm0,RPC0,RPC1,RPC2 like you mention, it always ends up trying to offload the entire model to ROCm0. Any suggestions?
command:
./llama-bench -m DeepSeek-R1-0528-UD-Q2_K_XL-00001-of-00006.gguf -p 4096 -ngl 999 -fa 0,1 --mmap 0 --rpc 10.6.204.167:50053,10.6.207.147:50053,10.6.207.181:50053 --device ROCm0,RPC0,RPC1,RPC2 --verbose
output:
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 238967.39 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate ROCm0 buffer of size 250575469056
llama_model_load: error loading model: unable to allocate ROCm0 buffer
llama_model_load_from_file_impl: failed to load model
llama-bench --list-devices:
Available devices:
ROCm0: AMD Radeon Graphics (122880 MiB, 122724 MiB free)
RPC0: 10.6.204.167:50053 (122880 MiB, 122584 MiB free)
RPC1: 10.6.207.147:50053 (122880 MiB, 122584 MiB free)
RPC2: 10.6.207.181:50053 (122880 MiB, 122579 MiB free)
-
With llama-bench, use / as the device separator, e.g. --device ROCm0/RPC0/RPC1/RPC2.
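For example, the llama-bench command above would become (same arguments, only the --device separator changes):
./llama-bench -m DeepSeek-R1-0528-UD-Q2_K_XL-00001-of-00006.gguf -p 4096 -ngl 999 -fa 0,1 --mmap 0 \
  --rpc 10.6.204.167:50053,10.6.207.147:50053,10.6.207.181:50053 \
  --device ROCm0/RPC0/RPC1/RPC2 --verbose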
-
With llama-bench, use / as the device separator, e.g. --device ROCm0/RPC0/RPC1/RPC2.
That works! Thank you!
-
We are always putting RPC devices first in the device chain because we want to make sure we don't copy logits over the network (see PR #9296). ~~This is done here and unfortunately cannot be overridden with --device as @abc-nix suggests. Feel free to hack the device order in the code and let me know if you get better results if RPC devices are in the middle of the device chain. Ideally, we should respect what the user specified with --device and try to "optimize" only if device order is not explicitly set.~~
-
Also, your first example was missing the closing square bracket (RPC0[192.168.50.2:50052), so perhaps it will work with the IP if you add that?
-
@jukofyork In the end I did what @abc-nix mentioned in #16625 (reply in thread), which is also what you suggest: just use "RPC0", and it worked. It is basically 4x+ faster on PP and ~2x faster on TG, it's insane haha.
-
Can you run a test that is exactly the same as your "When running fully on GPU on the Linux PC" settings, but offload 1 small tensor to the other machine, to see how much the PP and TG drop due to the latency costs?
-
@jukofyork sure I will give it a go when I get home.
-
@jukofyork Sorry for the delay, but here are the results.
Running fully on GPU with the command mentioned in the first post, I get:
prompt eval time = 5868.68 ms / 4458 tokens ( 1.32 ms per token, 759.63 tokens per second)
eval time = 67975.59 ms / 1760 tokens ( 38.62 ms per token, 25.89 tokens per second)
And then, offloading 1 layer (~2 GB) via RPC with:
LLAMA_SET_ROWS=1 ./llama-server \
-m '/models/GLM-4.6-IQ4_XS.gguf' \
-c 32768 \
--no-mmap \
-ngl 999 \
--rpc 192.168.50.2:50052 \
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15).ffn.=CUDA0" \
-ot "blk.(16|17|18|19|20|21|22|23|24|25).ffn.=CUDA1" \
-ot "blk.(27|28|29|30|31|32|33|34|35|36).ffn.=CUDA2" \
-ot "blk.(38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.=CUDA3" \
-ot "blk.(51|52|53|54|55|56|57|58|59).ffn.=CUDA4" \
-ot "blk.61.ffn.=RPC0[192.168.50.2:50052]" \
-ot "blk.(62|63|64|65|66|67|68|69|70).ffn.=CUDA5" \
-ot "blk.(72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91).ffn.=CUDA6" \
-ot "blk.26.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA1" \
-ot "blk.26.ffn_gate_exps.weight=CUDA1" \
-ot "blk.26.ffn_(down_exps|up_exps).weight=CUDA0" \
-ot "blk.37.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA2" \
-ot "blk.37.ffn_gate_exps.weight=CUDA2" \
-ot "blk.37.ffn_(down_exps|up_exps).weight=CUDA3" \
-ot "blk.60.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA4" \
-ot "blk.60.ffn_gate_exps.weight=CUDA4" \
-ot "blk.60.ffn_(down_exps|up_exps).weight=CUDA6" \
-ot "blk.71.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA5" \
-ot "blk.71.ffn_gate_exps.weight=CUDA5" \
-ot "blk.71.ffn_(down_exps|up_exps).weight=CUDA6" \
-fa on \
-mg 0 \
-ub 1792 --device CUDA0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5,RPC0,CUDA6
I get:
prompt eval time = 13757.30 ms / 4458 tokens ( 3.09 ms per token, 324.05 tokens per second)
eval time = 74831.85 ms / 1391 tokens ( 53.80 ms per token, 18.59 tokens per second)
For some reason, using the extra CUDA6 device tanks the speed. I have confirmed that after removing that extra device, I get:
prompt eval time = 6472.06 ms / 4458 tokens ( 1.45 ms per token, 688.81 tokens per second)
eval time = 86154.20 ms / 1967 tokens ( 43.80 ms per token, 22.83 tokens per second)
Not sure if I'm doing something wrong; I will have to do more tests.