With models like Lumina 2.0 and HiDream I1, the future of diffusion models seems to be to use autoregressive LLMs (GPTs) as text encoders: Google's Gemma 2 for Lumina, for example, and Meta's Llama 3 for HiDream. These models are already very well supported in llama.cpp, so I'm wondering what the right way to support them is.
Should llama.cpp be included as a submodule (this could maybe help T5 run better on GPU too)? Or should sdcpp re-implement these models from scratch?
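For illustration, here is a rough sketch of what using llama.cpp as a text encoder through its C API could look like if it were vendored as a submodule. A diffusion conditioner needs per-token hidden states rather than a pooled sentence embedding, hence `LLAMA_POOLING_TYPE_NONE`. Function names follow recent versions of llama.h and may drift between releases; this is a sketch, not a proposal for the actual integration:

```cpp
#include "llama.h"
#include <string>
#include <vector>

// Sketch: pull per-token hidden states out of a llama.cpp model so they can
// be fed to a diffusion model as conditioning. API names per recent llama.h.
std::vector<float> encode_prompt(llama_model * model, const std::string & prompt) {
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                    // expose hidden states
    cparams.pooling_type = LLAMA_POOLING_TYPE_NONE; // per-token, not pooled

    llama_context * ctx = llama_init_from_model(model, cparams);

    // Two-pass tokenization: the first call reports the required count.
    const llama_vocab * vocab = llama_model_get_vocab(model);
    int n = -llama_tokenize(vocab, prompt.c_str(), (int) prompt.size(),
                            nullptr, 0, /*add_special=*/true, /*parse_special=*/true);
    std::vector<llama_token> tokens(n);
    llama_tokenize(vocab, prompt.c_str(), (int) prompt.size(),
                   tokens.data(), n, true, true);

    // Build a batch requesting an output for every token, not just the last.
    llama_batch batch = llama_batch_init(n, 0, 1);
    for (int i = 0; i < n; i++) {
        batch.token [i]    = tokens[i];
        batch.pos   [i]    = i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = true;
    }
    batch.n_tokens = n;
    llama_decode(ctx, batch);

    // Copy out the [n_tokens, n_embd] hidden states for the diffusion model.
    const int n_embd = llama_model_n_embd(model);
    std::vector<float> hidden((size_t) n * n_embd);
    for (int i = 0; i < n; i++) {
        const float * e = llama_get_embeddings_ith(ctx, i);
        std::copy(e, e + n_embd, hidden.begin() + (size_t) i * n_embd);
    }

    llama_batch_free(batch);
    llama_free(ctx);
    return hidden;
}
```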
-
Also, given the way llama.cpp is moving, it keeps accumulating other features: CLIP and other embedding models, TTS with audio encoders/decoders... it's just a matter of time before VAEs and diffusion sampling become desired features of llama.cpp too.
-
If diffusion were to become part of llama.cpp, would that make this project obsolete?
-
To a degree. This codebase is different from what would fit into llama.cpp.
Look at how clip.cpp got integrated: the original repo is mostly unmaintained now, while the clip.cpp contained within llama.cpp is more feature-complete.
-
> Or should sdcpp re-implement these models from scratch?

Reimplementing from scratch should definitely be avoided, in my opinion. I think koboldcpp is well positioned here.
-
The new Qwen-Image will use Qwen2.5-VL as its text/image encoder. The field seems to be moving towards vision-language models, and more models will natively support instruction-based image editing. We should definitely rely on llama.cpp to keep up with this trend.
-
There are a couple of interesting models that have been trained on a diffusion objective, either solely or in conjunction with an autoregressive training objective.
See Transfusion, MoT, LMFusion, and similar work.
I also saw X-Omni, which does "normal" token prediction autoregressively and then projects the image-output part into the conditioner of flux.1-dev.
Model files are available.
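To make that X-Omni-style coupling concrete, here is a minimal ggml sketch of such a projection step. The names (`proj_w`, the dimension labels) are illustrative assumptions, not taken from any of these repos:

```cpp
#include "ggml.h"

// Hypothetical sketch of an X-Omni-style bridge: project per-token LLM
// hidden states into the diffusion model's conditioning dimension.
struct ggml_tensor * project_condition(
        struct ggml_context * ctx,
        struct ggml_tensor  * hidden,   // [n_embd_llm,  n_tokens]    f32
        struct ggml_tensor  * proj_w) { // [n_embd_llm,  n_embd_cond] f32
    // ggml_mul_mat contracts over ne[0], yielding [n_embd_cond, n_tokens]:
    // the shape a flux-style conditioner expects as its context.
    return ggml_mul_mat(ctx, proj_w, hidden);
}
```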
-
I see now that you meant "Qwen-Image". So yes, it goes both ways. But it's still very close to Flux in principle.
-
Yes, that's why I mentioned this model in particular. It's not an exotic architecture; it's MMDiT, something we can support once we have a text encoder implementation.
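Schematically, the only new piece on the conditioning side is the text token stream; the rest is the joint attention MMDiT already does. A rough ggml sketch (illustrative, not sdcpp's actual code):

```cpp
#include "ggml.h"

// Illustrative sketch of the MMDiT idea: text and image tokens share one
// attention over the concatenated sequence, so a per-token text encoder
// (like the llama.cpp sketch above) is all the conditioning it needs.
struct ggml_tensor * mmdit_joint_tokens(
        struct ggml_context * ctx,
        struct ggml_tensor  * txt,   // [n_embd, n_txt_tokens] f32
        struct ggml_tensor  * img) { // [n_embd, n_img_tokens] f32
    // Concatenate along the token dimension (dim 1); attention then runs
    // jointly over n_txt_tokens + n_img_tokens positions.
    return ggml_concat(ctx, txt, img, 1);
}
```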