Boost performance by using idle GPU periods #16621

Unanswered
karambaso asked this question in Q&A

When a model is shared across a GPU's memory and the PC's main memory (layer parallelism), there are idle periods for both the processor and the GPU. The same happens in a multi-GPU setup. Another option is tensor parallelism, but it requires a very fast interconnect between the computing devices and in many cases is actually much slower than layer parallelism. So there is no faster option for inference, yet the idle periods remain and could be put to use if computations were scheduled during them.

The actual question, or suggestion, concerns the complexity of using these idle periods for computation. Is it too hard to keep the GPU (and CPU) working during such periods? Why isn't it implemented? It would be a really attractive option, considering, for example, llama-server, which can already process many requests in parallel. But it uses a different technique, which still leaves the idle periods untouched.

In a very schematic scenario, if the llama.cpp backend supported it, the parallel work could be split into two parts: the existing batch-processing option, and idle-period usage in the form of another request being processed while, in the current setup, a GPU or the CPU would be idle (waiting for the other GPUs or the CPU to process their layers). A rough sketch of such interleaving follows.
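
A minimal scheduling sketch of the idea, not an actual llama.cpp API: it assumes an even two-stage layer split across two devices, ignores data transfer, and uses a hypothetical `run_stage()` as a stand-in for one device's forward pass. Two requests offset by one pipeline stage keep both devices busy at every step.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for running one device's share of the layers for one request.
static void run_stage(int device, const std::string & req) {
    std::printf("  device %d runs its layers for request %s\n", device, req.c_str());
}

int main() {
    const int n_devices = 2;                      // pipeline stages == devices
    std::vector<std::string> reqs = {"A", "B"};   // two independent requests

    // With requests staggered by one stage, every scheduling step keeps every
    // device busy: at step t, device d works on request (t + d) mod n_devices,
    // using the layers it holds (stage d). Without the second request, each
    // device would sit idle while its peer processes the other stage.
    for (int step = 0; step < 4; ++step) {
        std::printf("step %d:\n", step);
        for (int d = 0; d < n_devices; ++d) {
            const std::string & r = reqs[(step + d) % n_devices];
            run_stage(d, r);
        }
        // In a real scheduler the activations of each request would now be
        // handed from device d to device (d + 1) before the next step.
    }
    return 0;
}
```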

Can such a feature be implemented? Is it prohibitively complex?

If the feature were implemented, there would be a great performance boost, increasing linearly with additional GPUs. Of course, there would be some performance loss for scheduling, but it would be negligible compared with the boost. In a 2-GPU setup a GPU is idle for roughly (the other GPU's share of the layers) / (tokens per second) per token; that is 10 milliseconds for 50 t/s inference on two fast GPUs, and 10 ms is a huge amount of time for any scheduling.

If the layers are split across the computing units (GPUs, CPU) in proportion to each unit's performance, throughput doubles for two GPUs, triples for three, and so on, compared with the current behaviour. With a less optimal layer split the boost is smaller, but on GPUs with equal performance and memory capacity a two-fold (three-fold, even four-fold, as many as there are GPUs) increase is obvious, unless the scheduler is considered too hard to implement. If we run models bigger than a single GPU's memory, the idle periods are unavoidable in the current setup, but with such a scheduling option splitting a model this way would no longer be a way to waste your GPUs.
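
A back-of-the-envelope check of those numbers (assuming an even layer split across two identical GPUs, ignoring transfer and scheduling overhead; the 50 t/s figure is the example above):

```cpp
#include <cstdio>

int main() {
    const double tokens_per_second = 50.0;  // single-request decode speed from the example
    const int    n_gpus            = 2;

    const double token_ms = 1000.0 / tokens_per_second;  // 20 ms per token
    const double busy_ms  = token_ms / n_gpus;            // 10 ms of work per GPU per token
    const double idle_ms  = token_ms - busy_ms;            // 10 ms of idle per GPU per token

    std::printf("per-token latency : %.1f ms\n", token_ms);
    std::printf("busy per GPU      : %.1f ms\n", busy_ms);
    std::printf("idle per GPU      : %.1f ms\n", idle_ms);

    // If n_gpus independent requests are interleaved so that every GPU is
    // always busy, aggregate throughput scales with the number of GPUs while
    // per-request latency stays the same.
    std::printf("interleaved throughput: %.0f t/s aggregate (%.0f t/s per request)\n",
                tokens_per_second * n_gpus, tokens_per_second);
    return 0;
}
```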


Replies: 0 comments

Category: Q&A
Labels: None yet
1 participant
