Boost performance by using idle GPU periods #16621

Unanswered
karambaso asked this question in Q&A

When a model is shared across a GPU's memory and the PC's main memory (layer parallelism), there are idle periods for both the processor and the GPU. The same happens in a multi-GPU setup. Another option is tensor parallelism, but it requires a very fast interconnect between the computing devices and in many cases is actually much slower than layer parallelism. So there is no faster option for inference, yet the idle periods remain and could be put to use if computations were scheduled during them.

The actual question, or suggestion, concerns the complexity of using these idle periods for computation. Is it too hard to keep the GPU (and CPU) working during such periods? Why isn't it implemented? It would be a really attractive option, considering, for example, llama-server, which can already process many requests in parallel. But it uses a different technique, which still leaves the idle periods untouched.

In a very schematic scenario, if the llama.cpp backend supported it, the parallel work could be split into two parts: the existing batch-processing option, and idle-period usage in the form of another request being processed while, in the current setup, a GPU or the CPU would be idle (waiting for the other GPUs or the CPU to process their layers). A rough sketch of such interleaving follows.
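
A minimal scheduling sketch of the idea, not an actual llama.cpp API: it assumes an even two-stage layer split across two devices, ignores data transfer, and uses a hypothetical `run_stage()` as a stand-in for one device's forward pass. Two requests offset by one pipeline stage keep both devices busy at every step.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for running one device's share of the layers for one request.
static void run_stage(int device, const std::string & req) {
    std::printf("  device %d runs its layers for request %s\n", device, req.c_str());
}

int main() {
    const int n_devices = 2;                      // pipeline stages == devices
    std::vector<std::string> reqs = {"A", "B"};   // two independent requests

    // With requests staggered by one stage, every scheduling step keeps every
    // device busy: at step t, device d works on request (t + d) mod n_devices,
    // using the layers it holds (stage d). Without the second request, each
    // device would sit idle while its peer processes the other stage.
    for (int step = 0; step < 4; ++step) {
        std::printf("step %d:\n", step);
        for (int d = 0; d < n_devices; ++d) {
            const std::string & r = reqs[(step + d) % n_devices];
            run_stage(d, r);
        }
        // In a real scheduler the activations of each request would now be
        // handed from device d to device (d + 1) before the next step.
    }
    return 0;
}
```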

Can such a feature be implemented? Is it prohibitively complex?

If the feature were implemented, there would be a great performance boost, increasing linearly with additional GPUs. Of course, there would be some performance loss for scheduling, but it would be negligible compared with the boost. In a 2-GPU setup a GPU is idle for roughly (the other GPU's share of the layers) / (tokens per second) per token; that is 10 milliseconds for 50 t/s inference on two fast GPUs, and 10 ms is a huge amount of time for any scheduling.

If the layers are split across the computing units (GPUs, CPU) in proportion to each unit's performance, throughput doubles for two GPUs, triples for three, and so on, compared with the current behaviour. With a less optimal layer split the boost is smaller, but on GPUs with equal performance and memory capacity a two-fold (three-fold, even four-fold, as many as there are GPUs) increase is obvious, unless the scheduler is considered too hard to implement. If we run models bigger than a single GPU's memory, the idle periods are unavoidable in the current setup, but with such a scheduling option splitting a model this way would no longer be a way to waste your GPUs.
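
A back-of-the-envelope check of those numbers (assuming an even layer split across two identical GPUs, ignoring transfer and scheduling overhead; the 50 t/s figure is the example above):

```cpp
#include <cstdio>

int main() {
    const double tokens_per_second = 50.0;  // single-request decode speed from the example
    const int    n_gpus            = 2;

    const double token_ms = 1000.0 / tokens_per_second;  // 20 ms per token
    const double busy_ms  = token_ms / n_gpus;            // 10 ms of work per GPU per token
    const double idle_ms  = token_ms - busy_ms;            // 10 ms of idle per GPU per token

    std::printf("per-token latency : %.1f ms\n", token_ms);
    std::printf("busy per GPU      : %.1f ms\n", busy_ms);
    std::printf("idle per GPU      : %.1f ms\n", idle_ms);

    // If n_gpus independent requests are interleaved so that every GPU is
    // always busy, aggregate throughput scales with the number of GPUs while
    // per-request latency stays the same.
    std::printf("interleaved throughput: %.0f t/s aggregate (%.0f t/s per request)\n",
                tokens_per_second * n_gpus, tokens_per_second);
    return 0;
}
```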


Replies: 0 comments

Category: Q&A
Labels: None yet
1 participant
