I searched to see if this was already requested, but didn't find any previous requests for the same feature. Apologies if I missed something.
Recently, there was a very interesting Exo blog post where prefill was performed on a DGX Spark while decoding was performed on a Mac Studio in order to leverage the strengths of both devices.
As I understand the current state of things, the DGX Spark would still need to load the entire model into its memory to perform the prefill. Yet, based on the information in the Exo blog post, prefill appears to proceed one layer at a time.
If a node in a cluster were dedicated to prefill only (as in the Exo post), it seems we could use much cheaper GPUs without large amounts of VRAM to speed up prefill. To do this, I propose a sort of round-robin loading of the model's layers into the VRAM/RAM of the prefill node. An example would be as follows:
Suppose we have a GPU with 16 GB of VRAM but a model too large to fit into that 16 GB. We can, however, fit a subset of the layers, say 5 at a time (layers 1-5 to start). When the prefill for layer 1 completes, its weights are freed, and the next layer (6, in this example) is loaded into the newly freed memory. If there isn't yet enough free memory to load layer 6, it could either be loaded partially, or we would wait until layer 2 finishes and is freed before loading layer 6. Repeat until the entire prefill has been completed.
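To make the idea concrete, here is a minimal sketch of the scheduling I have in mind. It is not based on any existing llama.cpp API; the callables (`load_layer`, `free_layer`, `run_layer`, `embed`, `finalize`) and the overall structure are purely hypothetical placeholders for whatever backend actually manages the weights and computes the prefill.

```python
# Hypothetical sketch of round-robin layer streaming for prefill on a
# VRAM-limited node. None of these functions correspond to real llama.cpp
# APIs; they only illustrate the proposed scheduling. Layers are 0-indexed.

from collections import deque

def streamed_prefill(prompt_tokens, n_layers, vram_budget_layers,
                     load_layer, free_layer, run_layer, embed, finalize):
    """Run prefill over all layers while keeping at most
    `vram_budget_layers` layers' weights resident in VRAM at once."""
    resident = deque()          # layers currently loaded, oldest first
    next_to_load = 0

    # Pre-load as many layers as the budget allows (e.g. layers 0..4).
    while next_to_load < n_layers and len(resident) < vram_budget_layers:
        load_layer(next_to_load)
        resident.append(next_to_load)
        next_to_load += 1

    hidden = embed(prompt_tokens)   # embeddings for the whole prompt

    for layer in range(n_layers):
        assert resident and resident[0] == layer, "layer must be resident"
        hidden = run_layer(layer, hidden)   # prefill this layer for all tokens

        # This layer is done: free its weights and load the next pending
        # layer into the space it occupied.
        resident.popleft()
        free_layer(layer)
        if next_to_load < n_layers:
            load_layer(next_to_load)
            resident.append(next_to_load)
            next_to_load += 1

    return finalize(hidden)   # e.g. hand the resulting KV cache to the decode node
```

In practice, loading layer N+K could presumably be overlapped with computing layer N so that the PCIe transfer hides behind compute, but the basic bookkeeping would be the same.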
This would allow for purchasing two different pieces of hardware: one with higher memory bandwidth and capacity for the decode stage, and one with an inexpensive GPU for the prefill stage. Instead of needing the highest-end hardware with large amounts of VRAM, one could buy, for instance, a Mac Studio or a used server with high memory bandwidth for decode, plus an inexpensive machine with a modest Nvidia GPU for prefill. Or, potentially, run a container on the same Mac/server with an external GPU passed through to it. Obviously, it would still be faster if everything happened within a single GPU, but large models and contexts require expensive GPUs with enough VRAM, and especially large models still won't fit on one, requiring the purchase of multiple expensive GPUs to obtain enough VRAM.