Add Soft-Muon softening knob for finite-Schatten-p updates #92

Open
JohnLangford wants to merge 2 commits into
main from
jcl/soft-muon-schatten

@JohnLangford JohnLangford commented Jun 17, 2026

Contributor

Summary

Adds an optional softening factor s ∈ [0, 1] to Muon, NorMuon, Dion2, and NorDion2 that turns the orthogonalized (Schatten-∞) update into a tunable, heavier-tailed finite-Schatten-p update — the "Soft-Muon" lever.

The selected Newton-Schulz function is wrapped to return:

(1 - s) * NS(X) + s * X / ||X||_F

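Because Newton-Schulz preserves the singular vectors of X, the blend maps the i-th singular value of the update to (1 - s) * f(σ_i) + s * σ_i / ||X||_F, where f is the singular-value map of the selected finite-step NS backend. For an exact polar factor, f ≡ 1 and this reduces to (1 - s) + s * σ_i / ||X||_F, a monotone function of σ_i; the default backends keep f in roughly [0.85, 1.15], so the closed form is approximate (measured deviation about 6.5%). So s = 0 is plain orthogonalization (all singular values → 1), and s > 0 retains spectral decay, downweighting noise-dominated directions and producing a heavier-tailed update.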
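
The singular-value claim follows in one line from the SVD. As a sketch, write $X = U \Sigma V^\top$ and assume the chosen backend acts on singular values only, $\mathrm{NS}(X) \approx U f(\Sigma) V^\top$ (the symbol $f$ is notation introduced for this note; $f \equiv 1$ for an exact polar factor):

$$
(1-s)\,\mathrm{NS}(X) + s\,\frac{X}{\lVert X\rVert_F}
\;\approx\;
U\left[(1-s)\,f(\Sigma) + \frac{s}{\lVert X\rVert_F}\,\Sigma\right]V^\top,
\qquad
\lVert X\rVert_F = \Big(\sum_j \sigma_j^2\Big)^{1/2}.
$$

The bracketed factor is diagonal, so the blend keeps the singular vectors of X and only rescales the spectrum. Since $\sigma_i / \lVert X\rVert_F \le 1$, the second term never exceeds $s$.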

  • s = 0 is the default and is byte-identical to current behavior (no wrapping).
  • The blend runs on the gathered (unsharded) matrix inside the NS call, so it needs no extra communication and composes with the standard, polar-express, triton, and gram Newton-Schulz backends.
  • Cost is one Frobenius norm + one axpy on the already-gathered tile — negligible vs. the NS iteration itself (true Tier-0 / matmul-free; no SVD, eigh, or rational iteration).
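
A minimal NumPy sketch of the wrapping described above, not the PR's actual code: `soften` and `exact_polar` are hypothetical names, an exact SVD polar factor stands in for a real Newton-Schulz backend, and no epsilon guard is applied to the norm (an all-zero X would divide by zero here).

```python
import math
import numpy as np

def exact_polar(X):
    """Stand-in for Newton-Schulz: the exact polar factor U @ Vt from an SVD."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def soften(ns_fn, s):
    """Return a function computing (1 - s) * ns_fn(X) + s * X / ||X||_F."""
    if math.isnan(s) or not (0.0 <= s <= 1.0):
        raise ValueError(f"softening must be in [0, 1], got {s}")
    if s == 0.0:
        return ns_fn  # default: hand back the original function, unwrapped

    def wrapped(X):
        fro = np.linalg.norm(X)  # Frobenius norm for a 2-D array
        return (1.0 - s) * ns_fn(X) + s * X / fro

    return wrapped

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
s = 0.3

update = soften(exact_polar, s)(X)

# With an exact polar factor the closed form holds to rounding error.
sigma = np.linalg.svd(X, compute_uv=False)
expected = (1.0 - s) + s * sigma / np.linalg.norm(X)
observed = np.linalg.svd(update, compute_uv=False)
print(np.allclose(observed, expected))
```

Returning `ns_fn` itself at `s == 0` mirrors the "no wrapping" default: callers that compare function identity see no change.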

Motivation

Recent work argues the optimal Schatten-p geometry is regime-dependent, and that in the token-rich / low-dimensional (Chinchilla) regime a smaller Schatten-p than ∞ is preferable:

  • Pethick, When to use what Schatten-p norm in deep learning? (arXiv:2606.15268) — smaller p favored once tokens N ≳ effective rank d; Chinchilla scaling sits in this regime.
  • HTMuon (arXiv:2603.10067): heavier-tailed updates U Σ^p V^T; notes finite-step NS already beats exact SVD orthogonalization because it under-flattens the spectrum.
  • Freon (arXiv:2605.11181): (G G^T)^{-c} G family; shows NS-polynomial iterations are unstable for the exact fractional power, motivating a cheap blend instead.

This PR is the minimal, numerically-safe realization of that idea: a softening blend rather than an exact σ^{q-1} power (which would require SVD/QDWH/eigh).

Testing

  • New tests/test_softening.py (17 cases, CPU): closed-form match of softened singular values to (1-s)+sσ/||X||_F; monotone spectral-spread increase with s; singular-vector preservation; constructor validation; wiring through all classes; a real optimizer step.
  • Verified s = 0 leaves the selected NS function object unwrapped/identical for all four optimizers (no behavior change at the default).
  • Full suite collects cleanly (236 tests, no import changes).
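
Two of the listed properties can be checked in a few lines. The sketch below is illustrative only and is not taken from tests/test_softening.py: `softened_update` and `spectral_spread` are hypothetical helpers, and an exact SVD polar factor is assumed in place of a finite-step NS backend.

```python
import numpy as np

def softened_update(X, s):
    """(1 - s) * polar(X) + s * X / ||X||_F, using an exact SVD polar factor."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return (1.0 - s) * (U @ Vt) + s * X / np.linalg.norm(X)

def spectral_spread(M):
    """Ratio of largest to smallest singular value."""
    sv = np.linalg.svd(M, compute_uv=False)
    return sv[0] / sv[-1]

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 5))

# Property 1: the spread sigma_max / sigma_min grows monotonically with s.
grid = (0.0, 0.25, 0.5, 0.75, 1.0)
spreads = [spectral_spread(softened_update(X, s)) for s in grid]
monotone = all(b > a for a, b in zip(spreads, spreads[1:]))

# Property 2: singular vectors are preserved, i.e. U^T M V is diagonal.
U, _, Vt = np.linalg.svd(X, full_matrices=False)
core = U.T @ softened_update(X, 0.4) @ Vt.T
diagonal = np.allclose(core, np.diag(np.diag(core)))

print(monotone, diagonal)
```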

Behavioral (loss) validation should be done in a token-rich run (the regime where smaller-p is expected to help, e.g. ≥100 tokens/param); a short smoke run sits in the high-dimensional regime where s = 0 is expected to be optimal and softening neutral.

Notes

  • softening is currently a constant, construction-time hyperparameter. A training schedule (softening more later in training, as Soft-Muon does empirically) is a natural follow-up.
  • At larger s the update's spectral norm is < 1, so the effective step shrinks slightly; this composes with adjust_lr="spectral_norm" and may want LR retuning.
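
To gauge the size of that shrinkage, here is a short worked bound, assuming an exact polar factor (a finite-step backend would perturb it by the factor discussed in the summary). The largest singular value of the softened update is

$$
\lVert \text{update} \rVert_2 = (1-s) + s\,\frac{\sigma_1}{\lVert X\rVert_F} \;\le\; 1,
$$

with equality only when $s = 0$ or X has rank one. As an illustrative extreme, if X has $r$ equal singular values then $\sigma_1 / \lVert X\rVert_F = 1/\sqrt{r}$, so the norm is $1 - s\,(1 - 1/\sqrt{r})$; for $s = 0.5$ and $r = 16$ that is $0.625$. The flatter the momentum spectrum, the closer the shrinkage gets to a factor of $1 - s$.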

John Langford added 2 commits June 17, 2026 12:44
Add an optional softening factor s in [0,1] to Muon, NorMuon, Dion2, and
NorDion2. The selected Newton-Schulz orthogonalization is wrapped to return
(1 - s) * NS(X) + s * X / ||X||_F, blending the orthogonalized (Schatten-inf)
update toward the spectrally-normalized momentum. Since NS preserves singular
vectors, this scales the i-th update singular value to (1 - s) + s * sigma_i /
||X||_F, a monotone function of sigma_i, yielding a heavier-tailed, finite-
Schatten-p style update (cf. Soft-Muon; Pethick 2606.15268, HTMuon 2603.10067,
Freon 2605.11181).
The blend runs on the gathered (unsharded) matrix inside the NS call, so it
needs no extra communication and composes with the standard, polar-express,
triton, and gram Newton-Schulz backends. s=0 is the default and is byte-
identical to the current behavior (no wrapping).
...cs, test real NS backend
- Use torch.lerp for the (1-s)*NS(X)+s*X/||X||_F blend: one fused kernel,
 fewer temporaries (bit-equivalent to the manual blend in fp64).
- Correct the docstring: under a finite-step NS backend the i-th softened
 singular value is (1-s)*f(sigma_i)+s*sigma_i/||X||_F, not the exact
 (1-s)+s*sigma_i/||X||_F (the default backends keep f in ~[0.85,1.15], so
 the closed form is approximate). Verified measured deviation ~6.5%.
- Fix 'spectrally-normalized' -> 'Frobenius-normalized' in the four optimizer
 docstrings: the code divides by ||X||_F, not the spectral norm.
- Reject NaN softening (already covered by the bound check; now asserted).
- Add a test exercising the real polar_express backend through the wrapper
 (the others use exact SVD), checking blend identity, dtype/shape, finiteness,
 and heavier-tail endpoint relation.