[Auto] Fix DreamBooth LoRA fp16 validation unscale crash (cluster-1246-28): merged 2 of 5 PRs #1

Open: evalstate wants to merge 4 commits into main from merge-cluster-cluster-1246-28-20260427125125

@evalstate commented Apr 27, 2026

Cluster: cluster-1246-28
Base: origin/main
Branch: merge-cluster-cluster-1246-28-20260427125125

Summary:

Merged PRs:

Skipped PRs:

Failed PRs:

  • None.

Validation:

Notes / next steps:

gambletan and others added 4 commits March 16, 2026 22:13
...mples
When `log_validation()` runs pipeline inference during training, it uses the
same UNet/transformer that is being trained. Without `torch.no_grad()`, PyTorch
keeps tracking gradients (recording the autograd graph) during validation. With
`--mixed_precision="fp16"`, this causes the gradient scaler to encounter FP16
gradients from the validation pass when training resumes, resulting in:

    ValueError: Attempting to unscale FP16 gradients.

This adds `torch.no_grad()` around all pipeline inference calls in
`log_validation()` across all DreamBooth training scripts to disable gradient
tracking during validation.

Fixes huggingface#13124

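
As a rough illustration of what wrapping inference in `torch.no_grad()` does, the sketch below uses a small `torch.nn.Linear` as a stand-in for the UNet/transformer. The function `run_validation` is hypothetical and is not the body of `log_validation()`. Note that `torch.no_grad()` only disables graph recording; it does not change parameter dtypes.

```python
import torch

# Stand-in for the UNet/transformer being trained (assumption for illustration).
model = torch.nn.Linear(4, 2)


def run_validation(module, batch):
    """Hypothetical stand-in for a pipeline inference call inside log_validation()."""
    with torch.no_grad():
        # No autograd graph is recorded for anything computed in this block.
        return module(batch)


validation_out = run_validation(model, torch.randn(3, 4))

# For comparison: the same forward pass outside no_grad() is tracked by autograd.
tracked_out = model(torch.randn(3, 4))

print(validation_out.requires_grad, tracked_out.requires_grad)
```

The parameters themselves still have `requires_grad=True` afterwards, so training can continue normally once validation returns.
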
When `--mixed_precision=fp16` and `--validation_prompt` are both set,
training aborts on the first step after the first validation with:

    ValueError: Attempting to unscale FP16 gradients.

Root cause:

* The LoRA trainable params are upcast to fp32 once, before training,
  via `cast_training_params(models, dtype=torch.float32)`.
* Validation builds `DiffusionPipeline.from_pretrained(unet=unwrap_model(unet),
  torch_dtype=weight_dtype, ...)` and hands the pipeline to `log_validation`.
* `log_validation` calls `pipeline.to(accelerator.device, dtype=torch_dtype)`,
  which casts the *shared* `unet` module, including the LoRA adapter weights
  registered as trainable, back down to fp16.
* The next backward then produces fp16 grads, and the grad scaler refuses to
  unscale them.

Re-run `cast_training_params(..., dtype=torch.float32)` immediately after
`log_validation` returns (only when `mixed_precision == "fp16"`), mirroring
the pre-training upcast. bf16 mixed-precision is unaffected since no grad
scaler is in play there.

Fixes huggingface#13124
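
The downcast-then-upcast sequence described in this commit can be reproduced in miniature. This is a minimal sketch, not the training script: a `torch.nn.Linear` stands in for the LoRA adapter weights, and `upcast_trainable_params` is a hypothetical helper that plays the role of `cast_training_params`.

```python
import torch


def upcast_trainable_params(module, dtype=torch.float32):
    """Hypothetical stand-in for cast_training_params: recast trainable params only."""
    for param in module.parameters():
        if param.requires_grad:
            param.data = param.data.to(dtype)


# Stand-in for the trainable LoRA adapter weights inside the shared unet.
adapter = torch.nn.Linear(4, 4)

# Pre-training upcast: trainable params are held in fp32.
upcast_trainable_params(adapter)
dtype_before = adapter.weight.dtype

# Validation: moving the pipeline with dtype=fp16 casts the shared module in place.
adapter.to(dtype=torch.float16)
dtype_during = adapter.weight.dtype

# The fix: repeat the upcast once validation returns.
upcast_trainable_params(adapter)
dtype_after = adapter.weight.dtype

print(dtype_before, dtype_during, dtype_after)
```

Because `Module.to` modifies the module in place, the trainable parameters stay in fp16 until something explicitly recasts them, which is why the upcast has to be repeated after each validation run.
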
