[Auto] Fix DreamBooth LoRA fp16 validation unscale crash (cluster-1246-28): merged 2 of 5 PRs #1

Open

evalstate wants to merge 4 commits into main from merge-cluster-cluster-1246-28-20260427125125

evalstate commented Apr 27, 2026

Owner

Cluster: cluster-1246-28
Base: origin/main
Branch: merge-cluster-cluster-1246-28-20260427125125

Summary:

Merged 2 of 5 PRs from the cluster around DreamBooth LoRA fp16 validation unscale failures (huggingface/diffusers#13124: train_dreambooth_lora.py -- ValueError: Attempting to unscale FP16 gradients caused by "--validation_prompt" param.).

Merged PRs:

huggingface/diffusers#13510 (Fix DreamBooth LoRA fp16 training crash after validation): merged cleanly; adds a targeted fp32 re-upcast after DreamBooth LoRA validation to prevent the fp16 GradScaler unscale failure.
huggingface/diffusers#13273 (fix: wrap validation inference with torch.no_grad() in dreambooth examples): merged cleanly; wraps validation inference in torch.no_grad() across 17 DreamBooth example scripts.

Skipped PRs:

huggingface/diffusers#6566 (Fixes training resuming: Advanced Dreambooth LoRa Training): already merged upstream; the historical advanced DreamBooth LoRA resume fix is integrated.
huggingface/diffusers#10783 (Fix to fp16 unscaling bug): closed unmerged as stale/duplicate; maintainer comments indicate the change was already present on main.
huggingface/diffusers#10889 ([Fix] fp16 unscaling in train_dreambooth_lora_sdxl): already merged upstream; the SDXL LoRA fp16 unscale fix is integrated.

Failed PRs:

None.

Validation:

git diff --check origin/main...HEAD: passed.
python -m py_compile on all 17 DreamBooth files touched by huggingface/diffusers#13273: passed.
No GPU reproduction was run.
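
The syntax-only check above can be sketched in Python as follows. This is a minimal illustration, not the script used for this run: `syntax_check` is a hypothetical helper, and the two temporary files stand in for the 17 DreamBooth scripts. It relies only on the standard-library `py_compile` module.

```python
import py_compile
import tempfile
from pathlib import Path


def syntax_check(paths):
    """Byte-compile each file and return {path: error message} for failures."""
    failures = {}
    for path in paths:
        try:
            # doraise=True turns compile errors into PyCompileError
            # instead of printing them to stderr.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures[str(path)] = exc.msg
    return failures


# Demo on two throwaway files (stand-ins for the real example scripts).
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.py"
    bad = Path(tmp) / "bad.py"
    good.write_text("x = 1\n")
    bad.write_text("def broken(:\n")
    failures = syntax_check([good, bad])
    failed_names = sorted(Path(p).name for p in failures)

print(failed_names)
```

A check like this only confirms that the files parse. It does not exercise the fp16 training path, which is why the missing GPU reproduction is called out separately.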

Notes / next steps:

No merge conflicts occurred and no manual conflict resolution was required.
The combined branch keeps both the targeted dtype restoration from huggingface/diffusers#13510 and the broader no-grad validation hygiene from huggingface/diffusers#13273.
Human review should decide whether to take both open PRs together, or to prefer huggingface/diffusers#13510 as the minimal bug fix and treat huggingface/diffusers#13273 as additional validation hygiene.
Coordinate on issue huggingface/diffusers#13124 and the overlapping PR threads before proposing a final upstream PR.

gambletan and others added 4 commits

March 16, 2026 22:13


fix: wrap validation inference with torch.no_grad() in dreambooth examples

2daab54

When `log_validation()` runs pipeline inference during training, it uses the
same UNet/transformer that is being trained. Without `torch.no_grad()`, PyTorch
computes and stores gradients during validation. With `--mixed_precision="fp16"`,
this causes the gradient scaler to encounter FP16 gradients from the validation
pass when training resumes, resulting in:

 ValueError: Attempting to unscale FP16 gradients.

This adds `torch.no_grad()` around all pipeline inference calls in
`log_validation()` across all dreambooth training scripts to prevent gradient
computation during validation.

Fixes huggingface#13124
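
As a rough illustration of what wrapping inference in `torch.no_grad()` does, the sketch below runs the same forward pass with and without the context manager. The toy `nn.Linear` is an assumption standing in for the UNet/transformer; this is not code from the training scripts.

```python
import torch
from torch import nn

# Toy stand-in for the model being trained (assumption, not the real UNet).
model = nn.Linear(4, 2)
x = torch.randn(3, 4)

# Ordinary forward pass: the output is attached to the autograd graph.
tracked = model(x)

# Validation-style forward pass: no graph is recorded.
with torch.no_grad():
    untracked = model(x)

print(tracked.requires_grad, untracked.requires_grad)
```

Note that `torch.no_grad()` only controls graph recording. It does not change or restore the dtype of any parameter, so a cast applied to the module during validation is unaffected by it.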

@Ricardo-M-L


 Fix train_dreambooth_lora.py fp16 unscale error after validation

b0b0e02

When `--mixed_precision=fp16` and `--validation_prompt` are both set,
training aborts on the first step after the first validation with:

 ValueError: Attempting to unscale FP16 gradients.

Root cause:

* The LoRA trainable params are upcast to fp32 once, before training,
 via `cast_training_params(models, dtype=torch.float32)`.
* Validation builds `DiffusionPipeline.from_pretrained(unet=unwrap_model(unet),
 torch_dtype=weight_dtype, ...)` and hands the pipeline to `log_validation`.
* `log_validation` calls `pipeline.to(accelerator.device, dtype=torch_dtype)`,
 which casts the *shared* `unet` module (including the LoRA adapter weights
 registered as trainable) back down to fp16.
* The next backward then produces fp16 grads, and the grad scaler refuses to
 unscale them.

Re-run `cast_training_params(..., dtype=torch.float32)` immediately after
`log_validation` returns (only when `mixed_precision == "fp16"`), mirroring
the pre-training upcast. bf16 mixed-precision is unaffected since no grad
scaler is in play there.

Fixes huggingface#13124
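
The dtype round trip described in this commit message can be reproduced on CPU with a toy module. Everything below is a sketch under stated assumptions: `ToyLoRA` and `upcast_trainable_params` are hypothetical names, the latter only mimicking what `cast_training_params` is described as doing, and `model.to(dtype=torch.float16)` stands in for the cast that `pipeline.to(..., dtype=torch_dtype)` applies to the shared module.

```python
import torch
from torch import nn


class ToyLoRA(nn.Module):
    """Hypothetical stand-in: a frozen base layer plus a trainable adapter."""

    def __init__(self):
        super().__init__()
        self.base = nn.Linear(4, 4)
        self.lora = nn.Linear(4, 4, bias=False)
        self.base.requires_grad_(False)  # only the adapter is trainable


def upcast_trainable_params(module, dtype=torch.float32):
    """Hypothetical helper: cast only trainable params to `dtype`."""
    for param in module.parameters():
        if param.requires_grad:
            param.data = param.data.to(dtype)


# Before training: everything in the fp16 weight dtype, adapter upcast to fp32.
model = ToyLoRA().to(torch.float16)
upcast_trainable_params(model)
dtype_before = model.lora.weight.dtype

# During validation: casting the shared module drags the adapter back to fp16.
model.to(dtype=torch.float16)
dtype_after_validation = model.lora.weight.dtype

# The fix: re-run the upcast once validation returns.
upcast_trainable_params(model)
dtype_restored = model.lora.weight.dtype

print(dtype_before, dtype_after_validation, dtype_restored)
```

The frozen base weights stay in fp16 throughout; only the trainable adapter is restored to fp32, which is the state the grad scaler expects at the next backward pass.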

@evalstate


Merge remote-tracking branch 'refs/remotes/pr/13510' into merge-cluster-cluster-1246-28-20260427125125

@evalstate


Merge remote-tracking branch 'refs/remotes/pr/13273' into merge-cluster-cluster-1246-28-20260427125125

d483eb7

evalstate commented Apr 27, 2026

Owner Author

Trace for this mergeability run: https://huggingface.co/datasets/evalstate/transformers-merge-experiments/blob/main/2604271351-ZB9XUw__dev__codex.jsonl
