
Autoregressive LLMs as text encoders? #653

stduhpf started this conversation in General
With models like Lumina 2.0 and HiDream I1, the future of diffusion models seems to be using autoregressive LLMs (GPT-style decoders) as text encoders: for example, Google's Gemma 2 for Lumina and Meta's Llama 3 for HiDream. These models are already very well supported in llama.cpp, so I'm wondering what the right way to support them here would be.
Should llama.cpp be included as a submodule? (That could perhaps help T5 run better on GPU too.) Or should sdcpp re-implement these models from scratch?

Replies: 3 comments 5 replies

Also, given the way llama.cpp is moving, it keeps accumulating other features: CLIP and other embeddings, TTS with audio encoders/decoders... It's just a matter of time before VAEs and diffusion sampling become desired features of llama.cpp.

2 replies
If diffusion were to become part of llama.cpp, would that make this project obsolete?

To a degree. This code base is different from what would fit into llama.cpp.
See how clip.cpp got integrated: the original repo is mostly unmaintained now, and the clip.cpp contained within llama.cpp is more fully featured.

Or should sdcpp re-implement these models from scratch?

Reimplementing from scratch should definitely be avoided, in my opinion. I think koboldcpp is well positioned here.

0 replies
The new Qwen-Image will use Qwen2.5-VL as its text/image encoder. The future seems to be moving towards vision-language models, and more models will natively support instruction-based image editing. We should definitely rely on llama.cpp to catch up with this trend.

3 replies
There are a couple of interesting models that have been trained on a diffusion objective, either solely or in conjunction with autoregressive training objectives.

See Transfusion, MoT, LMFusion and similar.

I also saw X-Omni, which does "normal token prediction" autoregressively and then projects the image output part into the conditioner of flux.1-dev.
Model files are available.

I see now that you meant Qwen-Image. So yeah, it goes both ways. But it is still very close to Flux in principle.

Yes, that's why I mentioned this model in particular. It's not an exotic architecture; it's MMDiT, something we can support once we have a text encoder implementation.
