[Bug]: torch dynamo is not compatible with triton autotune #26993

Status: Closed

Labels: bug, rocm

Opened by Rus-P on Oct 16, 2025

Your current environment

The output of python collect_env.py

==============================
 System Info
==============================
OS : Ubuntu 24.04.2 LTS (x86_64)
GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version : 19.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.4.3 25224 d366fa84f3fdcbd4b10847ebd5db572ae12a34fb)
CMake version : version 4.1.0
Libc version : glibc-2.39
==============================
 PyTorch Info
==============================
PyTorch version : 2.8.0+rocm6.4
Is debug build : False
CUDA used to build PyTorch : N/A
ROCM used to build PyTorch : 6.4.43482-0f2d60242
==============================
 Python Environment
==============================
Python version : 3.12.11 | packaged by conda-forge | (main, Jun 4 2025, 14:45:31) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-6.8.0-79-generic-x86_64-with-glibc2.39
==============================
 CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : AMD Radeon PRO W7900 Dual Slot (gfx1100)
Nvidia driver version : Could not collect
cuDNN version : Could not collect
HIP runtime version : 6.4.43482
MIOpen runtime version : 3.4.0
Is XNNPACK available : True
==============================
 CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
BIOS Vendor ID: Advanced Micro Devices, Inc.
Model name: AMD EPYC 9554 64-Core Processor
BIOS Model name: AMD EPYC 9554 64-Core Processor Unknown CPU @ 3.1GHz
BIOS CPU family: 107
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU(s) scaling MHz: 49%
CPU max MHz: 3100.0000
CPU min MHz: 1500.0000
BogoMIPS: 6200.45
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc amd_ibpb_ret arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
Virtualization: AMD-V
L1d cache: 4 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 128 MiB (128 instances)
L3 cache: 512 MiB (16 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] conch-triton-kernels==1.2.1
[pip3] mypy==1.13.0
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] onnx==1.16.1
[pip3] onnxscript==0.1.0.dev20240817
[pip3] optree==0.13.0
[pip3] pytorch-triton-rocm==3.4.0
[pip3] pyzmq==27.1.0
[pip3] torch==2.8.0+rocm6.4
[pip3] torchvision==0.23.0+rocm6.4
[pip3] transformers==4.57.0
[pip3] triton==3.2.0+gite5be006a
[conda] No relevant packages
==============================
 vLLM Info
==============================
ROCM Version : 6.4.43484-123eb5128
Neuron SDK Version : N/A
vLLM Version : 0.10.1.dev1+gd3eea28b8.d20251016 (git sha: d3eea28b8, date: 20251016)
vLLM Build Flags:
 CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
 ============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
 GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 40 40 40 72 72 72 72
GPU1 40 0 40 40 72 72 72 72
GPU2 40 40 0 40 72 72 72 72
GPU3 40 40 40 0 72 72 72 72
GPU4 72 72 72 72 0 40 40 40
GPU5 72 72 72 72 40 0 40 40
GPU6 72 72 72 72 40 40 0 40
GPU7 72 72 72 72 40 40 40 0
================================= Hops between two GPUs ==================================
 GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 2 2 2 3 3 3 3
GPU1 2 0 2 2 3 3 3 3
GPU2 2 2 0 2 3 3 3 3
GPU3 2 2 2 0 3 3 3 3
GPU4 3 3 3 3 0 2 2 2
GPU5 3 3 3 3 2 0 2 2
GPU6 3 3 3 3 2 2 0 2
GPU7 3 3 3 3 2 2 2 0
=============================== Link Type between two GPUs ===============================
 GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 PCIE PCIE PCIE PCIE PCIE PCIE PCIE
GPU1 PCIE 0 PCIE PCIE PCIE PCIE PCIE PCIE
GPU2 PCIE PCIE 0 PCIE PCIE PCIE PCIE PCIE
GPU3 PCIE PCIE PCIE 0 PCIE PCIE PCIE PCIE
GPU4 PCIE PCIE PCIE PCIE 0 PCIE PCIE PCIE
GPU5 PCIE PCIE PCIE PCIE PCIE 0 PCIE PCIE
GPU6 PCIE PCIE PCIE PCIE PCIE PCIE 0 PCIE
GPU7 PCIE PCIE PCIE PCIE PCIE PCIE PCIE 0
======================================= Numa Nodes =======================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 0
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: 0
GPU[2] : (Topology) Numa Node: 0
GPU[2] : (Topology) Numa Affinity: 0
GPU[3] : (Topology) Numa Node: 0
GPU[3] : (Topology) Numa Affinity: 0
GPU[4] : (Topology) Numa Node: 1
GPU[4] : (Topology) Numa Affinity: 1
GPU[5] : (Topology) Numa Node: 1
GPU[5] : (Topology) Numa Affinity: 1
GPU[6] : (Topology) Numa Node: 1
GPU[6] : (Topology) Numa Affinity: 1
GPU[7] : (Topology) Numa Node: 1
GPU[7] : (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================
==============================
 Environment Variables
==============================
PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda
NCCL_IB_GDR_LEVEL=0
NCCL_NET_GDR_LEVEL=0
NCCL_DEBUG=INFO
NCCL_IB_HCA=mlx5_0
NCCL_IB_GID_INDEX=3
NCCL_PROTO=Simple
PYTORCH_TEST_WITH_ROCM=1
NCCL_DMABUF_ENABLE=0
PYTORCH_ROCM_ARCH=gfx1100
MAX_JOBS=32
LD_LIBRARY_PATH=/opt/ompi/lib:/opt/rocm/lib:/usr/local/lib:
NCCL_IB_DISABLE=0
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

When I ran DeepSeek-V3 on vLLM 0.10.0 with the V1 engine, it crashed while capturing the CUDA graph. At inference time, the optimal config for the block-wise int8 matmul kernel is selected based on the size of M (see the last frame of the traceback below). This conflicts with Dynamo's static-graph requirement. How should this problem be solved?

command:

VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_MLA_DISABLE=1 \
VLLM_USE_TRITON_FLASH_ATTN=1 \
vllm serve DeepSeek-V3-0324-BF16-Cast-To-Blockwise-Int8 \
--block-size 16 \
--enable-chunked-prefill \
--max-num-seqs 256 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.97 \
--max-model-len 8192 \
--trust-remote-code \
-tp 8 \
-pp 4 \
--enable-expert-parallel \
--distributed-executor-backend ray

error:

2025-10-16 05:51:05,844 ERROR worker.py:420 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method() (pid=1990, ip=192.168.23.10, actor_id=4cd629363da9989a8b12942002000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7ae31af94d40>)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 620, in execute_method
 raise e
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 611, in execute_method
 return run_method(self, method, args, kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2985, in run_method
 return func(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
 return func(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 233, in determine_available_memory
 self.model_runner.profile_run()
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2422, in profile_run
 = self._dummy_run(self.max_num_tokens, is_profile=True)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
 return func(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2174, in _dummy_run
 outputs = model(
 ^^^^^^
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 return self._call_impl(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 return forward_call(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 823, in forward
 hidden_states = self.model(input_ids, positions, intermediate_tensors,
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 272, in __call__
 output = self.compiled_callable(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 745, in compile_wrapper
 raise e.with_traceback(None) from e.__cause__ # User compiler error
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: invalid call to builtin op handler
 Explanation: Encountered TypeError when trying to handle op min
 Hint: This graph break may be difficult to debug. Please report an issue to PyTorch for assistance.
 Developer debug context: invalid args to <bound method BuiltinVariable._call_min_max of BuiltinVariable(min)>: [DictKeysVariable()] {'key': NestedUserFunctionVariable()}
from user code:
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 719, in forward
 hidden_states, residual = layer(positions, hidden_states, residual)
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 619, in forward
 hidden_states = self.self_attn(
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 330, in forward
 q = self.q_a_proj(hidden_states)[0]
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 367, in forward
 output = self.quant_method.apply(self, x, bias)
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/blockwise_int8.py", line 236, in apply
 return apply_w8a8_block_int8_linear(
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/int8_utils.py", line 430, in apply_w8a8_block_int8_linear
 output = w8a8_block_int8_matmul(
 File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/int8_utils.py", line 369, in w8a8_block_int8_matmul
 config = configs[min(configs.keys(), key=lambda x: abs(x - M))]
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
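
For reference, the failing frame is the nearest-key lookup `min(configs.keys(), key=lambda x: abs(x - M))`, which Dynamo rejects because it cannot handle `min` over a dict-keys view with a `key=` function. Below is a minimal sketch of an equivalent lookup written as an explicit loop. The helper name `nearest_config_key` and the example table are hypothetical, not part of vLLM. Whether this form traces cleanly on torch 2.8.0 is an untested assumption, and with a dynamic M the comparison may still add guards or specialize the graph.

```python
def nearest_config_key(configs, m):
    """Return the key of `configs` numerically closest to `m`.

    Gives the same result as min(configs.keys(), key=lambda x: abs(x - m)),
    including the tie-break: the first key in insertion order wins.
    """
    best_key = None
    best_dist = None
    for key in configs:
        dist = abs(key - m)
        # Strict "<" keeps the earliest key when two keys are equally close.
        if best_dist is None or dist < best_dist:
            best_key, best_dist = key, dist
    if best_key is None:
        raise ValueError("configs must not be empty")
    return best_key


# Hypothetical tuned-config table keyed by M (values are made up for illustration).
example_configs = {
    1: {"BLOCK_SIZE_M": 16},
    64: {"BLOCK_SIZE_M": 64},
    4096: {"BLOCK_SIZE_M": 128},
}
selected = example_configs[nearest_config_key(example_configs, 48)]
```

If the loop still causes a graph break, another option to consider is keeping the config selection outside the compiled region, for example by wrapping the matmul in a custom op so that Dynamo treats it as opaque.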
