With models like Lumina 2.0 and HiDream I1, the future of diffusion models seems to be to use autoregressive LLMs (GPTs) as text encoders: Google's Gemma 2 for Lumina, for example, and Meta's Llama 3 for HiDream. These models are already very well supported in llama.cpp, so I'm wondering what the right way to support them is.
Should llama.cpp be included as a submodule (this could maybe help T5 run better on GPU too)? Or should sdcpp re-implement these models from scratch?
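For illustration, here is a rough sketch of what using llama.cpp as a text encoder through its C API could look like if it were vendored as a submodule. A diffusion conditioner needs per-token hidden states rather than a pooled sentence embedding, hence `LLAMA_POOLING_TYPE_NONE`. Function names follow recent versions of llama.h and may drift between releases; this is a sketch, not a proposal for the actual integration:

```cpp
#include "llama.h"
#include <string>
#include <vector>

// Sketch: pull per-token hidden states out of a llama.cpp model so they can
// be fed to a diffusion model as conditioning. API names per recent llama.h.
std::vector<float> encode_prompt(llama_model * model, const std::string & prompt) {
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                    // expose hidden states
    cparams.pooling_type = LLAMA_POOLING_TYPE_NONE; // per-token, not pooled

    llama_context * ctx = llama_init_from_model(model, cparams);

    // Two-pass tokenization: the first call reports the required count.
    const llama_vocab * vocab = llama_model_get_vocab(model);
    int n = -llama_tokenize(vocab, prompt.c_str(), (int) prompt.size(),
                            nullptr, 0, /*add_special=*/true, /*parse_special=*/true);
    std::vector<llama_token> tokens(n);
    llama_tokenize(vocab, prompt.c_str(), (int) prompt.size(),
                   tokens.data(), n, true, true);

    // Build a batch requesting an output for every token, not just the last.
    llama_batch batch = llama_batch_init(n, 0, 1);
    for (int i = 0; i < n; i++) {
        batch.token [i]    = tokens[i];
        batch.pos   [i]    = i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = true;
    }
    batch.n_tokens = n;
    llama_decode(ctx, batch);

    // Copy out the [n_tokens, n_embd] hidden states for the diffusion model.
    const int n_embd = llama_model_n_embd(model);
    std::vector<float> hidden((size_t) n * n_embd);
    for (int i = 0; i < n; i++) {
        const float * e = llama_get_embeddings_ith(ctx, i);
        std::copy(e, e + n_embd, hidden.begin() + (size_t) i * n_embd);
    }

    llama_batch_free(batch);
    llama_free(ctx);
    return hidden;
}
```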
-
Also, given the way llama.cpp is moving, it keeps accumulating other features: CLIP and other embedding models, TTS with audio encoders/decoders... it's just a matter of time before VAEs and diffusion sampling become desired features of llama.cpp too.
-
If diffusion were to become part of llama.cpp, would that make this project obsolete?
-
To a degree. This codebase is different from what would fit into llama.cpp.
Look at how clip.cpp got integrated: the original repo is mostly unmaintained now, while the clip.cpp contained within llama.cpp is more feature-complete.
-
> Or should sdcpp re-implement these models from scratch?

Reimplementing from scratch should definitely be avoided, in my opinion. I think koboldcpp is well positioned here.
-
The new Qwen-Image will use Qwen2.5-VL as its text/image encoder. The field seems to be moving towards vision-language models, and more models will natively support instruction-based image editing. We should definitely rely on llama.cpp to keep up with this trend.
-
There are a couple of interesting models that have been trained on a diffusion objective, either solely or in conjunction with an autoregressive training objective.
See Transfusion, MoT, LMFusion, and similar work.
I also saw X-Omni, which does "normal" token prediction autoregressively and then projects the image-output part into the conditioner of flux.1-dev.
Model files are available.
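To make that X-Omni-style coupling concrete, here is a minimal ggml sketch of such a projection step. The names (`proj_w`, the dimension labels) are illustrative assumptions, not taken from any of these repos:

```cpp
#include "ggml.h"

// Hypothetical sketch of an X-Omni-style bridge: project per-token LLM
// hidden states into the diffusion model's conditioning dimension.
struct ggml_tensor * project_condition(
        struct ggml_context * ctx,
        struct ggml_tensor  * hidden,   // [n_embd_llm,  n_tokens]    f32
        struct ggml_tensor  * proj_w) { // [n_embd_llm,  n_embd_cond] f32
    // ggml_mul_mat contracts over ne[0], yielding [n_embd_cond, n_tokens]:
    // the shape a flux-style conditioner expects as its context.
    return ggml_mul_mat(ctx, proj_w, hidden);
}
```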
-
I see now that you meant "Qwen-Image". So yes, it goes both ways. But it's still very close to Flux in principle.
-
Yes, that's why I mentioned this model in particular. It's not an exotic architecture; it's MMDiT, something we can support once we have a text encoder implementation.
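Schematically, the only new piece on the conditioning side is the text token stream; the rest is the joint attention MMDiT already does. A rough ggml sketch (illustrative, not sdcpp's actual code):

```cpp
#include "ggml.h"

// Illustrative sketch of the MMDiT idea: text and image tokens share one
// attention over the concatenated sequence, so a per-token text encoder
// (like the llama.cpp sketch above) is all the conditioning it needs.
struct ggml_tensor * mmdit_joint_tokens(
        struct ggml_context * ctx,
        struct ggml_tensor  * txt,   // [n_embd, n_txt_tokens] f32
        struct ggml_tensor  * img) { // [n_embd, n_img_tokens] f32
    // Concatenate along the token dimension (dim 1); attention then runs
    // jointly over n_txt_tokens + n_img_tokens positions.
    return ggml_concat(ctx, txt, img, 1);
}
```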