Qwen3-VL #1063
Open
Description
Is your feature request related to a problem? Please describe.
Currently, Tunix supports text-only Qwen3 but not the multimodal Qwen3-VL. This makes it harder to compare the performance of different VLMs on vision-language tasks.
Describe the solution you'd like
According to the technical report, Qwen3-VL adopts a three-module architecture comprising a vision encoder, an MLP-based vision–language merger, and a large language model (LLM). The vision encoder is SigLIP2, which we already have as WIP in #511.
It's also worth mentioning:
- Interleaved MRoPE, which we don't seem to support yet
- DeepStack for the vision-language merger, which is a bit more complicated than what we have for Gemma 3
- Video timestamps
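
To make the interleaved MRoPE item more concrete, here is a minimal JAX sketch of how the rotary angles could be computed. It assumes each token carries a `(t, h, w)` position triple and that rotary frequency index `i` takes its position from axis `i % 3`. The function name, the `theta` default, and the plain round-robin layout are assumptions for illustration; the actual section sizes and layout should be checked against the reference implementation.

```python
import jax.numpy as jnp


def interleaved_mrope_angles(position_ids, head_dim, theta=10000.0):
    """Rotary angles with (t, h, w) positions interleaved across frequencies.

    position_ids: int array [3, seq_len] with rows (t, h, w).
    Returns: float array [seq_len, head_dim // 2].
    Hypothetical helper, not an existing Tunix API.
    """
    half = head_dim // 2
    # Standard RoPE inverse frequencies: theta^(-i / half) for i in [0, half).
    inv_freq = 1.0 / (theta ** (jnp.arange(half, dtype=jnp.float32) / half))
    # Angles each axis would produce on its own: [3, seq_len, half].
    per_axis = position_ids[..., None].astype(jnp.float32) * inv_freq
    # Round-robin assignment (assumed): frequency i reads from axis i % 3.
    axis_for_freq = jnp.arange(half) % 3
    mask = axis_for_freq[None, :] == jnp.arange(3)[:, None]  # [3, half]
    # Keep exactly one axis per frequency and collapse the axis dimension.
    return jnp.sum(per_axis * mask[:, None, :], axis=0)
```

For text tokens, where `t == h == w`, this reduces to ordinary 1D RoPE, so the text-only path would not need special handling.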
Additional context
A couple of design questions:
- Should we wait for #511 ("Uiuc vlm pr compressed fixed") to be merged, or start work on Qwen3-VL in parallel? Calling @abheesht17 for an opinion.
- Should we extend the (text-only) Qwen3 or create an entirely new model? I'm not sure it will be easy to integrate DeepStack without changing how the text-only version works.
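
On the second question, one possible shape is an optional hook in the decoder loop that does nothing when no visual features are passed. The sketch below assumes DeepStack adds one set of visual features to the hidden states at the visual-token positions after each of the first few decoder layers. `inject_deepstack`, `run_decoder`, and the argument names are hypothetical and are not taken from the Tunix codebase.

```python
import jax.numpy as jnp


def inject_deepstack(hidden, visual_mask, visual_feats):
    """Add visual features to hidden states at visual-token positions.

    hidden:       [seq_len, d_model] hidden states after a decoder layer.
    visual_mask:  [seq_len] bool, True at visual placeholder tokens.
    visual_feats: [num_visual, d_model], one row per True entry, in order.
    """
    # Map each visual token to its row in visual_feats; other tokens get a
    # clipped dummy index whose contribution is masked out below.
    idx = jnp.cumsum(visual_mask.astype(jnp.int32)) - 1
    idx = jnp.clip(idx, 0, visual_feats.shape[0] - 1)
    gathered = visual_feats[idx]
    return hidden + jnp.where(visual_mask[:, None], gathered, 0.0)


def run_decoder(layers, hidden, visual_mask=None, deepstack_feats=()):
    """layers: callables hidden -> hidden, standing in for decoder blocks."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        # Text-only path: nothing is supplied, so the loop is unchanged.
        if visual_mask is not None and i < len(deepstack_feats):
            hidden = inject_deepstack(hidden, visual_mask, deepstack_feats[i])
    return hidden
```

If a hook like this is acceptable, the text-only Qwen3 could stay as it is and Qwen3-VL would only pass the two extra arguments. Whether that fits the existing Qwen3 module structure is an open question.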
Checklist
- I have searched the existing issues for similar feature requests.
- This is not a support question (please use the "bug template" for that).