Qwen3-VL #1063

Open

#1177

Labels

type:feature/enhancement

Description

ridcl opened on Feb 7, 2026

Is your feature request related to a problem? Please describe.

Currently, Tunix supports text-only Qwen3, but not multimodal Qwen3-VL. This makes it harder to compare performance of different VLMs on vision-language tasks.

Describe the solution you'd like

According to the technical report, Qwen3-VL adopts a three-module architecture comprising a vision encoder, an MLP-based vision–language merger, and a large language model (LLM). The vision encoder is SigLIP2, which we already have as WIP in #511.
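To make the data flow between the three modules concrete, here is a minimal plain-Python sketch. Everything in it is hypothetical: the function name, the `IMAGE_PAD` placeholder id, and the toy 2-dimensional embeddings are illustrative assumptions, not Tunix's or Qwen3-VL's actual API. It only shows the usual pattern of splicing the merger's output into the text embedding sequence at placeholder positions before the LLM runs.

```python
# Hypothetical sketch: names, ids and shapes are illustrative assumptions.
IMAGE_PAD = -1  # assumed placeholder token id marking one visual token slot


def splice_visual_tokens(token_ids, text_embeds, visual_embeds):
    """Replace the embedding at each IMAGE_PAD position with the next visual embedding."""
    n_slots = sum(1 for tok in token_ids if tok == IMAGE_PAD)
    if n_slots != len(visual_embeds):
        raise ValueError(
            f"{n_slots} placeholder slots but {len(visual_embeds)} visual tokens"
        )
    visual_iter = iter(visual_embeds)
    return [
        next(visual_iter) if tok == IMAGE_PAD else emb
        for tok, emb in zip(token_ids, text_embeds)
    ]


# Toy example: 5 tokens, two of which are image placeholders.
token_ids = [10, IMAGE_PAD, IMAGE_PAD, 11, 12]
text_embeds = [[0.0, 0.0] for _ in token_ids]  # stand-in for the LLM embedding lookup
visual_embeds = [[1.0, 1.0], [2.0, 2.0]]       # stand-in for encoder + merger output
llm_inputs = splice_visual_tokens(token_ids, text_embeds, visual_embeds)
```

In this toy run, `llm_inputs` keeps the text embeddings at positions 0, 3 and 4 and carries the two visual embeddings at positions 1 and 2.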

It's also worth mentioning:

- Interleaved MRoPE, which we don't seem to support yet
- DeepStack for the vision-language merger, which is a bit more complicated than what we have for Gemma 3
- Video timestamps
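For reference, the difference between a chunked and an interleaved MRoPE layout can be sketched as an assignment of rotary frequency indices to the temporal (t), height (h) and width (w) position axes. The plain-Python sketch below reflects one reading of the technical report. The function names and the section sizes in the example are illustrative assumptions, not values taken from the Qwen3-VL config.

```python
def chunked_axes(sections):
    """Chunked layout: all t frequencies, then all h, then all w."""
    t, h, w = sections
    return ["t"] * t + ["h"] * h + ["w"] * w


def interleaved_axes(sections):
    """Interleaved layout: cycle t, h, w over the frequency indices.

    Index i goes to h if i % 3 == 1 and to w if i % 3 == 2, while that axis
    still has budget left; every remaining index goes to t.
    """
    t, h, w = sections
    total = t + h + w
    if 3 * max(h, w) > total:
        raise ValueError("h and w sections are too large to interleave")
    axes = ["t"] * total
    for offset, name, count in ((1, "h", h), (2, "w", w)):
        for i in range(offset, count * 3, 3):
            axes[i] = name
    return axes


def rotary_angles(position, axes, base=10000.0):
    """Angle per frequency index: position on the assigned axis times base**(-i/n)."""
    n = len(axes)
    return [position[axis] * base ** (-i / n) for i, axis in enumerate(axes)]


sections = (4, 2, 2)  # toy (t, h, w) split, for illustration only
print("".join(chunked_axes(sections)))      # tttthhww
print("".join(interleaved_axes(sections)))  # thwthwtt
```

In the chunked layout the highest frequencies all belong to t and the lowest to w, while interleaving spreads each axis across the whole spectrum. A text token, whose t, h and w positions are equal, gets the same angles under both layouts, so the change only matters for image and video tokens.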

Additional context

A couple of design questions:

- Should we wait for #511 ("Uiuc vlm pr compressed fixed") to be merged, or start work on Qwen3-VL in parallel? Calling @abheesht17 for an opinion.
- Should we extend the text-only Qwen3 or create an entirely new model? I'm not sure it will be easy to integrate DeepStack without changing how the text-only version works.
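As a rough illustration of the second question, one option is to keep a single decoder loop and make the DeepStack inputs optional, so that the text-only path is untouched when they are absent. The sketch below is plain Python with hypothetical names (`run_decoder`, `visual_mask`, `deepstack_feats`) and assumes that DeepStack adds per-level visual features to the hidden states at visual token positions after each of the first few decoder layers. It is not the existing Tunix Qwen3 code.

```python
def run_decoder(hidden, layers, visual_mask=None, deepstack_feats=None):
    """Run decoder layers, optionally adding DeepStack features after early layers.

    hidden:          list of per-token vectors (lists of floats)
    layers:          list of callables mapping hidden -> hidden
    visual_mask:     per-token booleans, True at visual token positions
    deepstack_feats: one list of visual-token vectors per injected layer
    With both optional arguments left as None this is a plain text-only loop.
    """
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        if deepstack_feats is not None and idx < len(deepstack_feats):
            feats = iter(deepstack_feats[idx])
            # Add the next visual feature vector at each visual position.
            hidden = [
                [h + f for h, f in zip(vec, next(feats))] if is_visual else vec
                for vec, is_visual in zip(hidden, visual_mask)
            ]
    return hidden


identity = lambda h: h                          # stand-in for a real decoder layer
hidden = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # one text token, two visual tokens
mask = [False, True, True]
feats = [[[0.5, 0.5], [0.25, 0.25]]]            # features for the first layer only
text_only = run_decoder(hidden, [identity, identity])
multimodal = run_decoder(hidden, [identity, identity], mask, feats)
```

Here `text_only` equals the input, and `multimodal` differs only at the two visual positions. Whether this is cleaner than a separate model class depends on how much else (position ids, attention masks) has to change.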

Checklist

I have searched the existing issues for similar feature requests.
This is not a support question (please use the "bug template" for that).

