Qwen3-VL #1063

Open

Description

Is your feature request related to a problem? Please describe.

Currently, Tunix supports text-only Qwen3 but not the multimodal Qwen3-VL. This makes it harder to compare the performance of different VLMs on vision-language tasks.

Describe the solution you'd like

According to the technical report, Qwen3-VL adopts a three-module architecture comprising a vision encoder, an MLP-based vision–language merger, and a large language model (LLM). The vision encoder is SigLIP2, which we already have as WIP in #511.
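To make the data flow between the three modules concrete, here is a minimal, framework-free sketch. Everything in it is hypothetical: `VisionLanguageModel`, its field names, and `image_token_id` are illustrative and are not Tunix or Qwen3-VL APIs. It assumes the common VLM convention that placeholder image tokens in the prompt are overwritten by the merger's output before the LLM runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class VisionLanguageModel:
    """Hypothetical container for the three modules (not a Tunix class)."""

    vision_encoder: Callable[[Sequence[Vector]], List[Vector]]  # patches -> patch features
    merger: Callable[[List[Vector]], List[Vector]]  # patch features -> LLM-width embeddings
    embed_tokens: Callable[[Sequence[int]], List[Vector]]  # token ids -> text embeddings
    language_model: Callable[[List[Vector]], List[Vector]]  # embeddings -> hidden states
    image_token_id: int  # assumed placeholder id marking image slots in the prompt

    def __call__(self, token_ids: Sequence[int], patches: Sequence[Vector]) -> List[Vector]:
        # 1. Encode the image patches, then project them to the LLM width.
        visual = self.merger(self.vision_encoder(patches))
        # 2. Embed the text tokens (copy so the caller's data is not mutated).
        embeddings = [list(vec) for vec in self.embed_tokens(token_ids)]
        # 3. Overwrite each image placeholder with one merged visual embedding.
        slots = [i for i, tok in enumerate(token_ids) if tok == self.image_token_id]
        if len(slots) != len(visual):
            raise ValueError(
                f"{len(slots)} image placeholders but {len(visual)} visual embeddings"
            )
        for slot, vec in zip(slots, visual):
            embeddings[slot] = list(vec)
        # 4. Run the language model on the mixed sequence.
        return self.language_model(embeddings)
```

In a real port each callable would be a JAX module, but the wiring between encoder, merger, and LLM would look much the same.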

It's also worth mentioning:

  • Interleaved MRoPE, which we don't seem to support yet
  • DeepStack for the vision-language merger, which is a bit more complicated than what we have for Gemma 3
  • Video timestamp
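For reference, the difference between chunked and interleaved MRoPE comes down to which position axis (time, height, width) drives each rotary frequency. The sketch below is an illustration based on my reading of the technical report, not the reference implementation. The function names are hypothetical, and both the `[24, 20, 20]` split mentioned afterwards and the default `base` are assumptions that should be checked against the released config.

```python
from typing import List, Sequence

AXES = "THW"  # time, height, width


def chunked_layout(sections: Sequence[int]) -> List[str]:
    """Chunked MRoPE: all T frequencies first, then all H, then all W."""
    layout: List[str] = []
    for axis, count in zip(AXES, sections):
        layout.extend(axis * count)
    return layout


def interleaved_layout(sections: Sequence[int]) -> List[str]:
    """Interleaved MRoPE: cycle T, H, W; leftover slots at the end stay on T."""
    total = sum(sections)
    if 3 * max(sections[1], sections[2]) > total:
        raise ValueError("H/W sections are too large to interleave with stride 3")
    layout = ["T"] * total
    for axis_index in (1, 2):  # H takes slots 1, 4, 7, ...; W takes 2, 5, 8, ...
        for i in range(axis_index, sections[axis_index] * 3, 3):
            layout[i] = AXES[axis_index]
    return layout


def rotary_angles(position: Sequence[float], layout: Sequence[str],
                  base: float = 10000.0) -> List[float]:
    """Angle for each frequency slot: position on the slot's axis times its inverse frequency."""
    n = len(layout)
    return [position[AXES.index(axis)] * base ** (-i / n)
            for i, axis in enumerate(layout)]
```

With a toy split of `[4, 2, 2]`, the chunked layout is `TTTTHHWW` and the interleaved one is `THWTHWTT`, so each axis sees both high and low frequencies rather than one contiguous band. A split like `[24, 20, 20]` would give the same per-axis counts in either layout.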

Additional context

A couple of design questions:

  1. Should we wait for #511 ("Uiuc vlm pr compressed fixed") to be merged, or start work on Qwen3-VL in parallel? Calling on @abheesht17 for an opinion.
  2. Should we extend the (text-only) Qwen3 model or create an entirely new one? I'm not sure DeepStack can be integrated easily without changing how the text-only version works.
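To illustrate the second question, one possible shape for the "extend Qwen3" option is an optional argument on the decoder loop that defaults to a no-op. This is only a sketch under assumptions. `run_decoder` and its parameters are hypothetical, not existing Tunix code. It assumes DeepStack adds extra visual features to the hidden states at the visual token positions for some early layers. Whether that happens before or after a given layer would need to be checked against the reference implementation.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]


def run_decoder(
    layers: Sequence[Layer],
    hidden: Sequence[Vector],
    visual_positions: Optional[Sequence[int]] = None,
    deepstack_features: Optional[Dict[int, Sequence[Vector]]] = None,
) -> List[Vector]:
    """Run decoder layers, optionally adding per-layer visual features.

    With the two optional arguments left as None this is a plain
    text-only decoder loop, so existing callers are unaffected.
    """
    if deepstack_features and visual_positions is None:
        raise ValueError("deepstack_features requires visual_positions")
    state = list(hidden)  # copy the outer list so the caller's input is untouched
    for index, layer in enumerate(layers):
        state = list(layer(state))
        extra = (deepstack_features or {}).get(index)
        if extra is None:
            continue
        # Assumed DeepStack-style step: add visual features at the image slots.
        for pos, feat in zip(visual_positions, extra):
            state[pos] = [h + f for h, f in zip(state[pos], feat)]
    return state
```

If the extra branch turns out to need more than this (for example, changes inside the attention or normalization path), that would be an argument for a separate model class instead.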

Checklist

  • I have searched the existing issues for similar feature requests.
  • This is not a support question (please use the "bug template" for that).
