I searched to see if this was already requested, but didn't find any previous requests for the same feature. Apologies if I missed something.
Recently, there was a very interesting Exo blog post where prefill was performed on a DGX Spark while decoding was performed on a Mac Studio in order to leverage the strengths of both devices.
As I understand the current state of things, the DGX Spark would still need to load the entire model into its memory to perform the prefill. Yet, based on the information in the Exo blog post, prefill appears to proceed one layer at a time.
If a node in a cluster were dedicated to prefill only (as in the Exo post), it seems we could use much cheaper GPUs without large amounts of VRAM to speed up prefill. To do this, I propose a sort of round-robin loading of the model's layers into the VRAM/RAM of the prefill node. An example would be as follows:
Suppose we have a GPU with 16 GB of VRAM but a model too large to fit into that 16 GB. We can, however, fit a subset of the layers, say 5 at a time (layers 1-5 to start). When the prefill for layer 1 completes, its weights are freed, and the next layer (6, in this example) is loaded into the newly freed memory. If there isn't yet enough free memory to load layer 6, it could either be loaded partially, or we would wait until layer 2 finishes and is freed before loading layer 6. Repeat until the entire prefill has been completed.
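To make the idea concrete, here is a minimal sketch of the scheduling I have in mind. It is not based on any existing llama.cpp API; the callables (`load_layer`, `free_layer`, `run_layer`, `embed`, `finalize`) and the overall structure are purely hypothetical placeholders for whatever backend actually manages the weights and computes the prefill.

```python
# Hypothetical sketch of round-robin layer streaming for prefill on a
# VRAM-limited node. None of these functions correspond to real llama.cpp
# APIs; they only illustrate the proposed scheduling. Layers are 0-indexed.

from collections import deque

def streamed_prefill(prompt_tokens, n_layers, vram_budget_layers,
                     load_layer, free_layer, run_layer, embed, finalize):
    """Run prefill over all layers while keeping at most
    `vram_budget_layers` layers' weights resident in VRAM at once."""
    resident = deque()          # layers currently loaded, oldest first
    next_to_load = 0

    # Pre-load as many layers as the budget allows (e.g. layers 0..4).
    while next_to_load < n_layers and len(resident) < vram_budget_layers:
        load_layer(next_to_load)
        resident.append(next_to_load)
        next_to_load += 1

    hidden = embed(prompt_tokens)   # embeddings for the whole prompt

    for layer in range(n_layers):
        assert resident and resident[0] == layer, "layer must be resident"
        hidden = run_layer(layer, hidden)   # prefill this layer for all tokens

        # This layer is done: free its weights and load the next pending
        # layer into the space it occupied.
        resident.popleft()
        free_layer(layer)
        if next_to_load < n_layers:
            load_layer(next_to_load)
            resident.append(next_to_load)
            next_to_load += 1

    return finalize(hidden)   # e.g. hand the resulting KV cache to the decode node
```

In practice, loading layer N+K could presumably be overlapped with computing layer N so that the PCIe transfer hides behind compute, but the basic bookkeeping would be the same.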
This would allow for purchasing two different pieces of hardware: one with higher memory bandwidth and capacity for the decode stage, and one with an inexpensive GPU for the prefill stage. Instead of needing the highest-end hardware with large amounts of VRAM, one could buy, for instance, a Mac Studio or a used server with high memory bandwidth for decode, plus an inexpensive machine with a modest Nvidia GPU for prefill. Or, potentially, run a container on the same Mac/server with an external GPU passed through to it. Obviously, it would still be faster if everything happened within a single GPU, but large models and contexts require expensive GPUs with enough VRAM, and especially large models still won't fit on one, requiring the purchase of multiple expensive GPUs to obtain enough VRAM.