Describe the bug
When running Wan 2.2 TI2V 5B with VAE tiling enabled, the VAE decode step fails with:

```
RuntimeError: The size of tensor a (2) must match the size of tensor b (4) at non-singleton dimension 2
```
Reproduction
```python
import torch
import numpy as np
from diffusers import WanPipeline, AutoencoderKLWan
from diffusers.utils import export_to_video

dtype = torch.bfloat16
device = "cuda"
model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"

# Load the VAE in float32 and enable spatial tiling.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
vae.enable_tiling(
    tile_sample_min_height=256,
    tile_sample_min_width=256,
    tile_sample_stride_height=64,
    tile_sample_stride_width=64,
)

pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=dtype)
pipe.to(device)

height = 704
width = 1280
num_frames = 121
num_inference_steps = 50
guidance_scale = 5.0

prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
# Stock Chinese negative prompt from the Wan examples (overexposure, blur,
# extra fingers, deformed limbs, cluttered background, etc.).
negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
).frames[0]

export_to_video(output, "5bit2v_output.mp4", fps=24)
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```
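Since the traceback below points entirely inside the VAE, the crash can presumably be reproduced without running the full denoising loop. Below is a minimal, untested sketch that decodes a random latent through the tiled path directly; the latent shape `(1, 48, 31, 44, 80)` is an assumption based on the Wan 2.2 TI2V VAE's 48 latent channels and 4×16×16 temporal/spatial compression, and random values should suffice because the error appears to depend only on shape:

```python
import torch
from diffusers import AutoencoderKLWan

# Isolation sketch (untested): exercise only the tiled VAE decode,
# skipping the transformer entirely.
vae = AutoencoderKLWan.from_pretrained(
    "Wan-AI/Wan2.2-TI2V-5B-Diffusers", subfolder="vae", torch_dtype=torch.float32
).to("cuda")
vae.enable_tiling(
    tile_sample_min_height=256,
    tile_sample_min_width=256,
    tile_sample_stride_height=64,
    tile_sample_stride_width=64,
)

# Assumed latent shape for 121 frames at 704x1280:
# T = (121 - 1) / 4 + 1 = 31, H = 704 / 16 = 44, W = 1280 / 16 = 80,
# with 48 latent channels.
z = torch.randn(1, 48, 31, 44, 80, dtype=torch.float32, device="cuda")

with torch.no_grad():
    vae.decode(z)  # should raise the same RuntimeError if the bug is in tiled_decode
```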
Logs
```
Loading checkpoint shards: 100%|██████████| 3/3 [00:01<00:00, 2.07it/s]
The config attributes {'clip_output': False} were passed to AutoencoderKLWan, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading checkpoint shards: 100%|██████████| 5/5 [00:28<00:00, 5.60s/it]
Loading pipeline components...: 100%|██████████| 5/5 [00:00<00:00, 7.77it/s]
100%|██████████| 10/10 [01:26<00:00, 8.66s/it]
Traceback (most recent call last):
  File "/workspace/Wan2.2/wan22tiledvae.py", line 75, in <module>
    output = pipe(
             ^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/diffusers/pipelines/wan/pipeline_wan.py", line 645, in __call__
    video = self.vae.decode(latents, return_dict=False)[0]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/diffusers/models/autoencoders/autoencoder_kl_wan.py", line 1248, in decode
    decoded = self._decode(z).sample
              ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/diffusers/models/autoencoders/autoencoder_kl_wan.py", line 1204, in _decode
    return self.tiled_decode(z, return_dict=return_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/diffusers/models/autoencoders/autoencoder_kl_wan.py", line 1374, in tiled_decode
    decoded = self.decoder(tile, feat_cache=self._feat_map, feat_idx=self._conv_idx)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/diffusers/models/autoencoders/autoencoder_kl_wan.py", line 893, in forward
    x = up_block(x, feat_cache, feat_idx, first_chunk=first_chunk)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/diffusers/models/autoencoders/autoencoder_kl_wan.py", line 709, in forward
    x = x + self.avg_shortcut(x_copy, first_chunk=first_chunk)
        ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (2) must match the size of tensor b (4) at non-singleton dimension 2
```
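For what it's worth: dimension 2 of the 5D `(batch, channels, frames, height, width)` tensor is presumably the temporal axis, and the failing line adds an up block's `avg_shortcut` branch onto the main branch. A frame-count mismatch of 2 vs. 4 there would suggest the per-tile `feat_cache` / `first_chunk` bookkeeping in `tiled_decode` drifts out of sync after the first tile, so the two branches end up with different numbers of frames. This is only a guess from the traceback, not a confirmed diagnosis. Until it is fixed, one possible (untested) workaround sketch is to fall back to non-tiled decoding, at the cost of peak VRAM:

```python
# Workaround sketch, assuming the bug is specific to tiled_decode:
# disable_tiling() routes decode() through the non-tiled path.
pipe.vae.disable_tiling()
```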
System Info
Tested in multiple environments; no difference observed. A representative setup:
- 🤗 Diffusers version: 0.35.2
- Platform: Linux-6.8.0-65-generic-x86_64-with-glibc2.39
- Running on Google Colab?: No
- Python version: 3.12.3
- PyTorch version (GPU?): 2.8.0+cu128 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.35.3
- Transformers version: 4.51.3
- Accelerate version: 1.11.0
- PEFT version: not installed
- Bitsandbytes version: not installed
- Safetensors version: 0.6.2
- xFormers version: not installed
- Accelerator: NVIDIA RTX A5000, 24564 MiB
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no