@bhaktatejas922
Your current environment
vLLM v0.11.1rc0 on an NVIDIA GH200 (arm64)
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 190, in forward
browser-1 | (EngineCore_DP0 pid=546) x = x + self.attn(self.norm1(x),
browser-1 | (EngineCore_DP0 pid=546) ^^^^^^^^^^^^^^^^^^^^^^^^
browser-1 | (EngineCore_DP0 pid=546) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
browser-1 | (EngineCore_DP0 pid=546) return self._call_impl(*args, **kwargs)
browser-1 | (EngineCore_DP0 pid=546) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
browser-1 | (EngineCore_DP0 pid=546) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
browser-1 | (EngineCore_DP0 pid=546) return forward_call(*args, **kwargs)
browser-1 | (EngineCore_DP0 pid=546) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
browser-1 | (EngineCore_DP0 pid=546) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 384, in forward
browser-1 | (EngineCore_DP0 pid=546) output = flash_attn_varlen_func(q,
browser-1 | (EngineCore_DP0 pid=546) ^^^^^^^^^^^^^^^^^^^^^^^^^
browser-1 | (EngineCore_DP0 pid=546) File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 233, in flash_attn_varlen_func
browser-1 | (EngineCore_DP0 pid=546) out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(
browser-1 | (EngineCore_DP0 pid=546) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
browser-1 | (EngineCore_DP0 pid=546) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
browser-1 | (EngineCore_DP0 pid=546) return self._op(*args, **(kwargs or {}))
browser-1 | (EngineCore_DP0 pid=546) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
browser-1 | (EngineCore_DP0 pid=546) RuntimeError: This flash attention build does not support headdim not being a multiple of 32.
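For context, the RuntimeError comes from vLLM's vendored FA2 varlen kernel, which only accepts head dims that are multiples of 32. Below is a minimal sketch for checking the head dim the Qwen3-VL vision tower actually uses, assuming a Qwen-VL-style HF config; the checkpoint ID and config field names are assumptions, not taken from this issue:

```python
# Hedged sketch: read the vision tower's attention head dim off the HF config.
# The checkpoint ID below is an assumption (the issue doesn't name one);
# substitute whatever model you are serving.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-VL-30B-A3B-Instruct")  # assumed ID
vision = cfg.vision_config
# Qwen-VL-family vision configs have exposed either `num_heads` or
# `num_attention_heads` across releases, so probe both.
num_heads = getattr(vision, "num_heads", None) or getattr(
    vision, "num_attention_heads", None
)
head_dim = vision.hidden_size // num_heads
print(f"head_dim={head_dim} (%32 == {head_dim % 32}, %8 == {head_dim % 8})")
# A head_dim that is a multiple of 8 but not of 32 is exactly what the
# vendored FA2 kernel in the traceback rejects.
```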
🐛 Describe the bug
Using vLLM with either FA2 or FA3 results in this error.
For FA3, I used the latest FA3 built from source, which should support head dims that are multiples of 8, but it looks like the vendored vLLM FA2 functions get pulled in instead; see the sketch below.
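To make the mismatch concrete, here is a minimal sketch of the two build constraints in play; `head_dim = 72` is an assumed example of a ViT head dim that is 8-aligned but not 32-aligned, not a value read from the logs:

```python
# Hedged sketch of the constraint mismatch described above. head_dim = 72 is
# an assumed example value, not one taken from the issue.
head_dim = 72
fa2_ok = head_dim % 32 == 0  # the vendored FA2 build requires 32-alignment
fa3_ok = head_dim % 8 == 0   # an FA3 source build is claimed to accept 8-alignment
print(f"head_dim={head_dim}: vendored FA2 ok={fa2_ok}, FA3-from-source ok={fa3_ok}")
# -> head_dim=72: vendored FA2 ok=False, FA3-from-source ok=True
# The traceback dispatching to torch.ops._vllm_fa2_C.varlen_fwd suggests the
# vision path never reaches the FA3 build at all.
```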