Following up on the discussion in #772, I took a look at the code, and did a few tests to identify potential optimization opportunities.
All tests were done on an AMD Ryzen 5 3400G, RX 7600 XT, and SSD storage, running Linux, with display output routed to the iGPU. Mostly on Vulkan, though ROCm behaves similarly.
During model loading, each tensor is processed sequentially in three steps:
a) read from the model file;
b) converted (dequantized), depending on the tensor’s weight type;
c) loaded into VRAM (on devices where VRAM is separate from system RAM).
Step (a) is, as expected, I/O-bound and varies significantly between cold and hot cache scenarios. The file cache appears effective at minimizing this cost for hot cache.
Step (b) is CPU-bound and currently single-threaded. The overhead depends on the model's structure: in the SDXL .safetensors file I tested, many small tensors required conversion (about half by count, but only around 10% by size), while the rest were loaded as-is.
Step (c) is more unpredictable. Its duration can vary widely depending on what else is happening on the GPU. For example, in one instance, simply loading a small LLM model before sd.cpp caused image model loading to take 2 seconds longer (for a total of 12 seconds). In another test, after a system sleep and resume cycle, loading time was roughly halved. I ruled out thermal throttling, though I can’t be 100% certain no ~~PEBKAC~~ other external factor was involved.
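Putting the three steps together, the current flow is roughly the following (an illustrative C++ sketch only; `TensorInfo` and the helper functions are placeholders, not the actual sd.cpp code):

```cpp
// Illustrative sketch of the current per-tensor flow: (a) read, (b) convert,
// (c) upload. read_tensor_data(), needs_conversion(), convert_weights() and
// upload_to_vram() are placeholders for the real sd.cpp/ggml code paths.
for (const TensorInfo & t : model_tensors) {
    std::vector<uint8_t> raw  = read_tensor_data(file, t);        // (a) I/O-bound
    std::vector<uint8_t> data = needs_conversion(t)
                                    ? convert_weights(raw, t)      // (b) CPU-bound, single-threaded
                                    : std::move(raw);
    upload_to_vram(t, data.data(), data.size());                   // (c) GPU/driver-dependent
}
```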
For #772, a low-hanging fruit could be decoupling (a+b) from (c): we could add support in the model context for persisting the model in RAM, and only load/unload it into VRAM when needed. This would allow a single process to cache a few models and switch between them with just the cost of (c).
I made a quick test converting the models fully into RAM before loading into VRAM:
wbruna@579972e. @JustMaier, could you please run a few tests in your setup with this change? The "phase 2" logs should give us a good estimate of the potential performance gains from keeping models cached in a persistent sd.cpp process.
Another option could be allowing overlap between (a), (b) and (c) for different tensors. I noticed ggml has support for loading weights asynchronously into VRAM, although it doesn't look like my hardware has support for it.
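For reference, the kind of overlap I have in mind would look roughly like this, reusing the placeholders from the sketch above (hedged: `ggml_backend_tensor_set_async()` and `ggml_backend_synchronize()` are the ggml-backend calls I was referring to, but whether the copy actually overlaps depends on the backend):

```cpp
// Hedged sketch: queue asynchronous uploads while the CPU keeps reading and
// converting the next tensors, then synchronize once at the end.
// Whether the copy really overlaps depends on the backend (CUDA, for instance,
// also wants pinned host memory for true async copies).
// `tensors[i].dst` pairs each entry with its destination ggml_tensor; the
// read/convert helpers are the placeholders from the previous snippet.
std::vector<std::vector<uint8_t>> staging(tensors.size()); // host copies stay alive until the sync
for (size_t i = 0; i < tensors.size(); ++i) {
    staging[i] = convert_weights(read_tensor_data(file, tensors[i]), tensors[i]); // (a)+(b)
    ggml_backend_tensor_set_async(backend, tensors[i].dst,
                                  staging[i].data(), 0, staging[i].size());       // (c) queued
}
ggml_backend_synchronize(backend); // wait for all uploads before the staging buffers go away
```

The trade-off in this naive form is that all converted weights stay in RAM until the final sync; a small ring of staging buffers with per-buffer synchronization would avoid that, at the cost of more bookkeeping.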
-
The `--offload-to-cpu` parameter already provides a similar function: all weights are kept in RAM and only loaded into VRAM when needed. You can check out PR #778.
-
> `--offload-to-cpu`
Nice. Unloading from VRAM is currently only controlled by the `free_params_immediately` context flag, right? So setting that flag to false (to avoid paying the RAM->VRAM cost on each inference), together with a new `unload_weights` function, would pretty much work as the 'low-hanging fruit' I suggested above. The process could keep an `sd_ctx_t` for each "pre-loaded" model, and one (or a few) active, and call `unload_weights` when switching the active context.
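Roughly what I have in mind, as a sketch (`sd_ctx_t` and `free_sd_ctx()` exist today; `create_ctx_keep_params()` stands for `new_sd_ctx(...)` with `free_params_immediately = false`, and `sd_ctx_unload_weights()` is the proposed, not-yet-existing call):

```cpp
#include <map>
#include <string>

// Hypothetical sketch only. sd_ctx_t and free_sd_ctx() are the existing
// context type and destructor; create_ctx_keep_params() stands for
// new_sd_ctx(...) called with free_params_immediately = false, and
// sd_ctx_unload_weights() is the proposed, not-yet-existing call that frees
// the VRAM copy while keeping the weights resident in RAM.
struct ModelCache {
    std::map<std::string, sd_ctx_t *> contexts; // pre-loaded models, weights kept in RAM
    std::string active;                         // the model currently resident in VRAM

    sd_ctx_t * activate(const std::string & path) {
        if (!active.empty() && active != path) {
            sd_ctx_unload_weights(contexts[active]); // drop the VRAM copy, keep the RAM copy
        }
        auto it = contexts.find(path);
        if (it == contexts.end()) {
            // first use: pay steps (a)+(b) once, keep the context around
            it = contexts.emplace(path, create_ctx_keep_params(path)).first;
        }
        active = path;
        return it->second; // only step (c) is paid when this model runs again
    }

    ~ModelCache() {
        for (auto & kv : contexts) {
            free_sd_ctx(kv.second);
        }
    }
};
```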
-
The --offload-to-cpu option will automatically offload the weights from the GPU once the computation is completed.
-
I managed to accelerate model loading by using multithreading, achieving significantly faster loading compared to master
https://github.com/rmatif/stable-diffusion.cpp/blob/ref-tensor-loading/model.cpp
@leejet I can open a PR if you think this can be merged
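To give an idea of the approach, a simplified illustration (not the code in the branch above; `TensorInfo` and `convert_weights()` are placeholders, error handling omitted):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Simplified illustration of multithreaded tensor loading: each worker grabs
// the next tensor index, reads its bytes at the known file offset through its
// own file handle, and converts it if needed. TensorInfo (with file_offset,
// nbytes, needs_conversion, data) and convert_weights() are placeholders.
void load_tensors_mt(const char * path, std::vector<TensorInfo> & tensors, int n_threads) {
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            std::FILE * f = std::fopen(path, "rb"); // per-thread handle, independent seek position
            for (size_t i; (i = next.fetch_add(1)) < tensors.size(); ) {
                TensorInfo & ti = tensors[i];
                ti.data.resize(ti.nbytes);
                std::fseek(f, (long) ti.file_offset, SEEK_SET);
                std::fread(ti.data.data(), 1, ti.nbytes, f); // (a) reads now overlap across threads
                if (ti.needs_conversion) {
                    convert_weights(ti);                     // (b) conversion is parallel too
                }
            }
            std::fclose(f);
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```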
Regarding the SDXL use case: what the civitai folks are doing is keeping the model in RAM and loading it directly from RAM → VRAM. What I noticed is that we are severely CPU-bound in this process.
I tried using mmap from ggml here: https://github.com/rmatif/stable-diffusion.cpp/tree/add-mmap
Storing the model in a ramdisk and loading it into RAM takes ~4.3s, while a warm load with mmap takes ~1.1s.
The warm load is essentially memory-bound, which means the CPU overhead is actually huge. From my measurements with multithreading, we can achieve ramdisk → RAM in ~1.3s, meaning the CPU overhead will be around ~200ms. With further optimizations, such as avoiding regex, I think we could reduce this even further.
> I noticed ggml has support for loading weights asynchronously into VRAM, although it doesn't look like my hardware has support for it
CUDA does indeed support asynchronous loading and it's actually doing great: RAM -> VRAM (once the CPU overhead is reduced) is not a bottleneck imo.
-
> I managed to accelerate model loading by using multithreading, achieving significantly faster loading compared to master
> https://github.com/rmatif/stable-diffusion.cpp/blob/ref-tensor-loading/model.cpp
> @leejet I can open a PR if you think this can be merged
Cool. Ping me when you do, I can help review it.
> Regarding the SDXL use case: what the civitai folks are doing is keeping the model in RAM and loading it directly from RAM → VRAM. What I noticed is that we are severely CPU-bound in this process.
Note the bottleneck will likely depend on the CPU, backend and weight types. If the model's weights already have their final types, most of that CPU cost can vanish (even a straight conversion to gguf already speeds things up).
> I tried using mmap from ggml here: https://github.com/rmatif/stable-diffusion.cpp/tree/add-mmap
> Storing the model in a ramdisk and loading it into RAM takes ~4.3s, while a warm load with mmap takes ~1.1s.
> The warm load is essentially memory-bound, which means the CPU overhead is actually huge. From my measurements with multithreading, we can achieve ramdisk → RAM in ~1.3s, meaning the CPU overhead will be around ~200ms. With further optimizations, such as avoiding regex, I think we could reduce this even further.
Interesting. I saw you're using `memcpy` between the mapped file and the local buffers. It'd likely be a bigger, more invasive change, but for weights that don't need conversion, you could instead use the mapped area directly as the tensor buffer. This would help reduce memory pressure in cases where the same file may be loaded more than once - like Civitai's :-) And together with @leejet 's new `offload-to-cpu` code path, some models could then run directly from the page cache.
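For reference, a rough idea of what "use the mapped area directly" could look like with the ggml buffer API (a hedged sketch, assuming `ggml_backend_cpu_buffer_from_ptr()` and `ggml_backend_tensor_alloc()` behave as in current ggml; `tensor_file_offset()` is a placeholder, error handling omitted):

```cpp
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include "ggml-alloc.h"
#include "ggml-backend.h"

// Hedged sketch: map the whole file and point the CPU-side tensors straight
// into the mapping, so weights that need no conversion are backed by the page
// cache instead of a private copy. tensor_file_offset() is a placeholder for
// "offset of this tensor's data within the file".
void map_model(const char * path, struct ggml_tensor ** tensors, size_t n_tensors) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    uint8_t * base = (uint8_t *) mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after closing the descriptor

    ggml_backend_buffer_t buf = ggml_backend_cpu_buffer_from_ptr(base, st.st_size);
    for (size_t i = 0; i < n_tensors; ++i) {
        // place each no-conversion tensor directly at its location in the mapping
        ggml_backend_tensor_alloc(buf, tensors[i], base + tensor_file_offset(tensors[i]));
    }
    // tensors that do need conversion would still go through a separate, copied path
}
```

For a cold cache, an `madvise(MADV_WILLNEED)` on the mapped region (or explicit prefetch threads) could still warm the data up early.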
> CUDA does indeed support asynchronous loading and it's actually doing great: RAM -> VRAM (once the CPU overhead is reduced) is not a bottleneck imo.
Maybe not for plain CUDA :-) The RAM -> VRAM path on my end can take around 60% of the loading time.
-
> Note the bottleneck will likely depend on the CPU, backend and weight types. If the model's weights already have their final types, most of that CPU cost can vanish (even a straight conversion to gguf already speeds things up).
Even with a good CPU and no conversion, it still takes around 4s to load from ramdisk → RAM
> Interesting. I saw you're using `memcpy` between the mapped file and the local buffers. It'd likely be a bigger, more invasive change, but for weights that don't need conversion, you could instead use the mapped area directly as the tensor buffer. This would help reduce memory pressure in cases where the same file may be loaded more than once - like Civitai's :-)
Dropping the `memcpy` would indeed be more effective. I also thought about adding a "pinned" option, where we explicitly request the model to be pinned and avoid it being paged out or moved. I didn’t pursue it further, since they mainly rely on safetensors and prefer having more explicit control over memory management, using a ramdisk directly rather than relying on mmap.
> And together with @leejet 's new `offload-to-cpu` code path, some models could then run directly from the page cache.
I had already considered this, but I’m skeptical in the case of SDXL. Its model arch isn’t linear, so I doubt it can be done easily; it would probably require a significant amount of work imo.
> Maybe not for plain CUDA :-) The RAM -> VRAM path on my end can take around 60% of the loading time.
In my testing it's about ~20%, which I think is reasonable.
-
Great! PR is welcome. Where does this speed improvement come from? Is it due to multithreaded memcpy?
-
Using more (f)reads in parallel, I presume - the same rules that explain why reading mmapped files is faster.
-
> Great! PR is welcome. Where does this speed improvement come from? Is it due to multithreaded memcpy?
It's a combination of parallelizing metadata processing, executing concurrent disk reads, and hiding latency by overlapping I/O and CPU work.
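Conceptually, the overlap part boils down to a small producer/consumer pipeline, something like this generic sketch (not the actual implementation; `read_raw_bytes()` and `convert_and_place()` are placeholders):

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Generic sketch of hiding I/O latency behind CPU work: a reader thread
// streams raw tensor blobs into a queue while worker threads convert/place
// them. read_raw_bytes() and convert_and_place() are placeholders.
struct Item {
    size_t index;
    std::vector<uint8_t> raw;
};

std::queue<Item> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

void reader(size_t n_tensors) {
    for (size_t i = 0; i < n_tensors; ++i) {
        Item it{i, read_raw_bytes(i)};            // disk -> RAM (producer)
        {
            std::lock_guard<std::mutex> lk(m);
            q.push(std::move(it));
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_all();
}

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !q.empty() || done; });
        if (q.empty()) {
            return;                               // reader finished and queue drained
        }
        Item it = std::move(q.front());
        q.pop();
        lk.unlock();
        convert_and_place(it);                    // CPU work overlaps with further reads
    }
}
```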