i think there is something wrong with new/latest scripts. RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072) #12494

Open
Labels: bug (Something isn't working)
@gerylavin

Description

Describe the bug

I get "RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072)" when I run "train_dreambooth_lora_flux_advanced.py" from the latest version of diffusers or from v0.35.1. The problem goes away when I downgrade to v0.31.0, including all the dependencies. I ran the scripts on Modal (a serverless GPU cloud), using an L40S 48GB GPU and the same training parameters/arguments in both cases.
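To make the reported shapes easier to read, here is a small sketch that parses the error message and compares the feature width the layer received with the width it expects. The helper names (`parse_matmul_error`, `describe_mismatch`) are hypothetical and only for illustration. The sketch relies on the fact that, for a linear layer, PyTorch reports `mat2` as the transposed weight, i.e. (in_features, out_features). It only shows that the input is exactly twice as wide as expected; it does not confirm where the extra width comes from.

```python
import re


def parse_matmul_error(message):
    """Extract (mat1_shape, mat2_shape) from a PyTorch matmul shape error."""
    match = re.search(r"\((\d+)x(\d+) and (\d+)x(\d+)\)", message)
    if match is None:
        raise ValueError("no shape pair found in message")
    a, b, c, d = map(int, match.groups())
    return (a, b), (c, d)


def describe_mismatch(message):
    """Summarise how the input width differs from what the layer expects."""
    (rows, got), (expected, out) = parse_matmul_error(message)
    # For nn.Linear, mat2 is weight.T, so its shape is (in_features, out_features).
    return {
        "batch": rows,
        "got_features": got,
        "expected_features": expected,
        "out_features": out,
        "ratio": got / expected,
    }


msg = "RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072)"
info = describe_mismatch(msg)
print(info)
```

With the message from this report, the layer expects 768 input features and receives 1536, a ratio of 2. One possible reading (an assumption, not a verified diagnosis) is that the pooled text embedding reaches the transformer with a doubled feature dimension.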

Reproduction

I ran "accelerate env" while the image was still being built, which I think is why there is no GPU description in the output below.

For the latest version of diffusers, I used this config:

  • Accelerate version: 1.10.0
  • Platform: Linux-4.4.0-x86_64-with-glibc2.39
  • accelerate bash location: /usr/local/bin/accelerate
  • Python version: 3.11.5
  • Numpy version: 2.3.4
  • PyTorch version: 2.8.0+cu129
  • PyTorch accelerator: N/A
  • System RAM: 167.58 GB
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: NO
    • mixed_precision: bf16
    • use_cpu: False
    • debug: False
    • num_processes: 1
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: False
    • main_training_function: main
    • enable_cpu_affinity: False
    • downcast_bf16: False
    • tpu_use_cluster: False
    • tpu_use_sudo: False

For the older version of diffusers (v0.31.0), I used this config:

  • Accelerate version: 1.2.1
  • Platform: Linux-4.4.0-x86_64-with-glibc2.35
  • accelerate bash location: /usr/local/bin/accelerate
  • Python version: 3.11.5
  • Numpy version: 2.3.3
  • PyTorch version (GPU?): 2.5.1+cu124 (False)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • PyTorch MLU available: False
  • PyTorch MUSA available: False
  • System RAM: 167.58 GB
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: NO
    • mixed_precision: bf16
    • use_cpu: False
    • debug: False
    • num_processes: 1
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: False
    • main_training_function: main
    • enable_cpu_affinity: False
    • downcast_bf16: False
    • tpu_use_cluster: False
    • tpu_use_sudo: False
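Since both reports above were captured at image build time, they do not show the GPU or the package versions that were actually resolved from the version ranges. As a rough sketch, something like the following could be called at the start of the training function inside the running container to log the installed versions. The function name `collect_versions` and the package list are assumptions for illustration; only the standard library is used.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Return a dict of installed versions for the given distribution names."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep going so one missing package does not hide the rest.
            report[name] = "not installed"
    return report


# Hypothetical list: the packages most likely to matter for this script.
PACKAGES = ["diffusers", "accelerate", "transformers", "peft", "torch"]

report = collect_versions(PACKAGES)
for name, version in report.items():
    print(f"{name}: {version}")
```

Printing this from the same process that launches training avoids the build-time limitation and makes the two environments directly comparable.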

Logs

The latest version of diffusers:
10/15/2025 16:56:04 - INFO - __main__ - ***** Running training *****
10/15/2025 16:56:04 - INFO - __main__ - Num examples = 338
10/15/2025 16:56:04 - INFO - __main__ - Num batches each epoch = 338
10/15/2025 16:56:04 - INFO - __main__ - Num Epochs = 10
10/15/2025 16:56:04 - INFO - __main__ - Instantaneous batch size per device = 1
10/15/2025 16:56:04 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
10/15/2025 16:56:04 - INFO - __main__ - Gradient Accumulation steps = 1
10/15/2025 16:56:04 - INFO - __main__ - Total optimization steps = 3380
Steps: 0%| | 0/3380 [00:00<?, ?it/s]
[2025-10-15 16:56:05,548] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
[2025-10-15 16:56:09,437] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
Traceback (most recent call last):
 File "/root/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py", line 2470, in <module>
 main(args)
 File "/root/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py", line 2224, in main
 model_pred = transformer(
 ^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 return self._call_impl(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 return forward_call(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 818, in forward
 return model_forward(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 806, in __call__
 return convert_to_fp32(self.model_forward(*args, **kwargs))
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
 return func(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^
 File "/root/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 696, in forward
 else self.time_text_embed(timestep, guidance, pooled_projections)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 return self._call_impl(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 return forward_call(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/root/diffusers/src/diffusers/models/embeddings.py", line 1614, in forward
 pooled_projections = self.text_embedder(pooled_projection)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 return self._call_impl(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 return forward_call(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/root/diffusers/src/diffusers/models/embeddings.py", line 2207, in forward
 hidden_states = self.linear_1(caption)
 ^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 return self._call_impl(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 return forward_call(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
 return F.linear(input, self.weight, self.bias)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x1536 and 768x3072)
Steps: 0%| | 0/3380 [00:07<?, ?it/s]
Traceback (most recent call last):
 File "/usr/local/bin/accelerate", line 10, in <module>
 sys.exit(main())
 ^^^^^^
 File "/usr/local/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
 args.func(args)
 File "/usr/local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1235, in launch_command
 simple_launcher(args)
 File "/usr/local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 823, in simple_launcher
 raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', 'train_dreambooth_lora_flux_advanced.py', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--instance_data_dir=/root/r41s4', '--token_abstraction=r4is4', '--instance_prompt=a photo of a r4is4 woman', '--class_data_dir=/root/class_images', '--class_prompt=a photo of a woman', '--with_prior_preservation', '--prior_loss_weight=0.3', '--num_class_images=338', '--output_dir=/root/output_lora', '--lora_layers=attn.to_k,attn.to_q,attn.to_v,attn.to_out.0', '--mixed_precision=bf16', '--optimizer=prodigy', '--train_transformer_frac=1', '--train_text_encoder_ti', '--train_text_encoder_ti_frac=.25', '--weighting_scheme=none', '--resolution=1024', '--train_batch_size=1', '--guidance_scale=1', '--repeats=10', '--learning_rate=1.0', '--gradient_accumulation_steps=1', '--rank=16', '--num_train_epochs=10', '--checkpointing_steps=100', '--cache_latents', '--mixed_precision=bf16', '--gradient_checkpointing']' returned non-zero exit status 1.
Traceback (most recent call last):
 File "/pkg/modal/_runtime/container_io_manager.py", line 778, in handle_input_exception
 yield
 File "/pkg/modal/_container_entrypoint.py", line 243, in run_input_sync
 res = io_context.call_finalized_function()
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/pkg/modal/_runtime/container_io_manager.py", line 197, in call_finalized_function
 res = self.finalized_function.callable(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/root/training_baru11_flux_full.py", line 159, in mulai_training
 subprocess.run(jalankan_training, cwd="/root/diffusers/examples/advanced_diffusion_training", check=True)
 File "/usr/local/lib/python3.11/subprocess.py", line 571, in run
 raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['accelerate', 'launch', '--config_file', '/root/accelerate_config.yaml', 'train_dreambooth_lora_flux_advanced.py', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--instance_data_dir=/root/r41s4', '--token_abstraction=r4is4', '--instance_prompt=a photo of a r4is4 woman', '--class_data_dir=/root/class_images', '--class_prompt=a photo of a woman', '--with_prior_preservation', '--prior_loss_weight=0.3', '--num_class_images=338', '--output_dir=/root/output_lora', '--lora_layers=attn.to_k,attn.to_q,attn.to_v,attn.to_out.0', '--mixed_precision=bf16', '--optimizer=prodigy', '--train_transformer_frac=1', '--train_text_encoder_ti', '--train_text_encoder_ti_frac=.25', '--weighting_scheme=none', '--resolution=1024', '--train_batch_size=1', '--guidance_scale=1', '--repeats=10', '--learning_rate=1.0', '--gradient_accumulation_steps=1', '--rank=16', '--num_train_epochs=10', '--checkpointing_steps=100', '--cache_latents', '--mixed_precision=bf16', '--gradient_checkpointing']' returned non-zero exit status 1.
The older version of diffusers (v0.31.0):
10/15/2025 17:00:37 - INFO - __main__ - ***** Running training *****
10/15/2025 17:00:37 - INFO - __main__ - Num examples = 338
10/15/2025 17:00:37 - INFO - __main__ - Num batches each epoch = 338
10/15/2025 17:00:37 - INFO - __main__ - Num Epochs = 10
10/15/2025 17:00:37 - INFO - __main__ - Instantaneous batch size per device = 1
10/15/2025 17:00:37 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
10/15/2025 17:00:37 - INFO - __main__ - Gradient Accumulation steps = 1
10/15/2025 17:00:37 - INFO - __main__ - Total optimization steps = 3380
Steps: 0%| | 0/3380 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 1/3380 [00:05<5:08:01, 5.47s/it, loss=0.582, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 2/3380 [00:10<4:53:58, 5.22s/it, loss=0.737, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 3/3380 [00:15<4:49:15, 5.14s/it, loss=0.694, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 4/3380 [00:20<4:47:07, 5.10s/it, loss=0.518, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 5/3380 [00:25<4:45:53, 5.08s/it, loss=0.536, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 6/3380 [00:30<4:45:17, 5.07s/it, loss=0.381, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 7/3380 [00:35<4:44:57, 5.07s/it, loss=0.692, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 8/3380 [00:40<4:44:49, 5.07s/it, loss=1.08, lr=1] Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 9/3380 [00:45<4:44:47, 5.07s/it, loss=0.59, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 10/3380 [00:50<4:44:35, 5.07s/it, loss=0.687, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 11/3380 [00:56<4:44:33, 5.07s/it, loss=0.7, lr=1] Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 12/3380 [01:01<4:44:29, 5.07s/it, loss=0.747, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 13/3380 [01:06<4:44:25, 5.07s/it, loss=0.48, lr=1] Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 14/3380 [01:11<4:44:16, 5.07s/it, loss=0.448, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 15/3380 [01:16<4:44:13, 5.07s/it, loss=0.722, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 0%| | 16/3380 [01:21<4:44:16, 5.07s/it, loss=0.578, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps: 1%| | 17/3380 [01:26<4:44:09, 5.07s/it, loss=0.683, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
..............................................

System Info

Since it is fairly complicated (for me) to run the CLI inside a Modal container, I am providing the system info as the image definitions used for each run:

For the latest version of diffusers, I used this image:
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.9.1-devel-ubuntu24.04", add_python="3.11"
    )
    .apt_install("git")
    .pip_install("uv==0.8.12", "ninja<=1.13.0")  # ==0.5.5
    .run_commands("git clone https://github.com/huggingface/diffusers.git /root/diffusers && cd /root/diffusers && uv pip install --system -e .")
    .uv_pip_install(
        "huggingface_hub[hf_transfer]==0.34.4",  # 0.1.8
        "accelerate>=0.31.0,<=1.10.0",
        "transformers>=4.41.2,<=4.55.2",
        "ftfy<=6.2.3",
        "tensorboard<=2.20.0",
        "Jinja2<=3.1.6",
        "peft>=0.11.1,<=0.17.0",
        "sentencepiece<=0.2.1",
        "wheel<=0.41.1",
        "wandb<=0.21.1",
        "bitsandbytes<=0.47.0",
        "datasets<=4.0.0",
        "pyarrow<=21.0.0",
        "prodigyopt<=1.1.2",
        "deepspeed<=0.17.4",
        "xformers<=0.0.32.post2",
        "triton<=3.4.0",
        "torch==2.8.0",
        "torchaudio==2.8.0",
        "torchvision==0.23.0",
        extra_index_url="https://download.pytorch.org/whl/cu129",
    )
    .uv_pip_install("https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.14/flash_attn-2.8.2+cu129torch2.8-cp311-cp311-linux_x86_64.whl")
    .run_function(setup_accelerate, gpu="L40S")
    .env({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"})
    .add_local_dir(DATASET_LOCAL_PATH, DATASET_DIR)
    .add_local_dir(CLASS_LOCAL_PATH, CLASS_DIR)
)

For the older version of diffusers (v0.31.0), I used this image:
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04", add_python="3.11"
    )
    .apt_install("git")
    .pip_install("uv==0.5.5")
    .run_commands("git clone -b v0.31.0 https://github.com/huggingface/diffusers.git /root/diffusers && cd /root/diffusers && uv pip install --system -e .")
    .uv_pip_install(
        "huggingface_hub[hf_transfer]==0.26.0",
        "accelerate>=0.31.0,<=1.2.1",
        "transformers>=4.41.2,<=4.47.0",
        "ftfy==6.3.1",
        "tensorboard==2.18.0",
        "Jinja2==3.1.4",
        "peft>=0.11.1,<=0.14.0",
        "sentencepiece<=0.2.0",
        "wheel<=0.44.0",
        "bitsandbytes<=0.44.1",
        "datasets<=3.0.1",
        "pyarrow<=20.0.0",
        "prodigyopt<=1.0",
        "deepspeed<=0.15.3",
        "xformers<=0.0.28.post3",
        "triton<=3.1.0",
        "torch==2.5.1",
        "torchaudio==2.5.1",
        "torchvision==0.20.1",
        extra_index_url="https://download.pytorch.org/whl/cu124",
    )
    .uv_pip_install("flash-attn<=2.7.2.post1", extra_options="--no-build-isolation")
    .run_function(setup_accelerate, gpu="L40S")
    .env({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"})
    .add_local_dir(DATASET_LOCAL_PATH, DATASET_DIR)
    .add_local_dir(CLASS_LOCAL_PATH, CLASS_DIR)
)

Who can help?

@sayakpaul
