Add Soft-Muon softening knob for finite-Schatten-p updates#92

Open

JohnLangford wants to merge 2 commits into

main from

jcl/soft-muon-schatten



@JohnLangford JohnLangford commented Jun 17, 2026


Contributor

Summary

Adds an optional softening factor s ∈ [0, 1] to Muon, NorMuon, Dion2, and NorDion2 that turns the orthogonalized (Schatten-∞) update into a tunable, heavier-tailed finite-Schatten-p update — the "Soft-Muon" lever.

The selected Newton-Schulz function is wrapped to return:

(1 - s) * NS(X) + s * X / ||X||_F

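Because Newton-Schulz preserves the singular vectors of X, the blend rescales only the singular values. Under exact orthogonalization the i-th update singular value becomes (1 - s) + s * σ_i / ||X||_F, a monotone increasing function of σ_i; under a finite-step backend the leading 1 is replaced by the backend's scalar map f(σ_i), so the closed form is approximate. So s = 0 is plain orthogonalization (all singular values → 1), and s > 0 retains spectral decay, downweighting noise-dominated directions and producing a heavier-tailed update.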
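The singular-value claim follows in one step, assuming the backend acts as U f(Σ) Vᵀ for some scalar map f (the symbol f is introduced here for illustration; f ≡ 1 corresponds to exact orthogonalization):

```latex
X = U \Sigma V^\top, \qquad \mathrm{NS}(X) \approx U f(\Sigma) V^\top
\;\Longrightarrow\;
(1-s)\,\mathrm{NS}(X) + s\,\frac{X}{\|X\|_F}
  = U \left[ (1-s)\, f(\Sigma) + s\, \frac{\Sigma}{\|X\|_F} \right] V^\top .
```

With f ≡ 1 the i-th diagonal entry is (1 - s) + s σ_i / ||X||_F, and since 0 ≤ σ_i ≤ ||X||_F every entry lies in [1 - s, 1].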

s = 0 is the default and is byte-identical to current behavior (no wrapping).
The blend runs on the gathered (unsharded) matrix inside the NS call, so it needs no extra communication and composes with the standard, polar-express, triton, and gram Newton-Schulz backends.
Cost is one Frobenius norm + one axpy on the already-gathered tile — negligible vs. the NS iteration itself (true Tier-0 / matmul-free; no SVD, eigh, or rational iteration).
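The wrapping described above can be sketched roughly as follows. This is a NumPy illustration rather than the PR's PyTorch code: `soften` and `exact_polar` are hypothetical names, and the exact SVD polar factor stands in for a real Newton-Schulz backend so that the closed form holds exactly.

```python
import numpy as np


def exact_polar(x):
    """Stand-in for a Newton-Schulz backend: the exact polar factor U V^T via SVD."""
    u, _, vt = np.linalg.svd(x, full_matrices=False)
    return u @ vt


def soften(ns_fn, s):
    """Wrap ns_fn so it returns (1 - s) * ns_fn(X) + s * X / ||X||_F."""
    if s == 0.0:
        return ns_fn  # default path: hand back the same function object, unwrapped

    def softened(x):
        frob = np.linalg.norm(x)  # Frobenius norm for a 2-D array
        return (1.0 - s) * ns_fn(x) + s * x / frob

    return softened


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
s = 0.3

update = soften(exact_polar, s)(x)
sigma = np.linalg.svd(x, compute_uv=False)
predicted = (1.0 - s) + s * sigma / np.linalg.norm(x)
measured = np.linalg.svd(update, compute_uv=False)

# Compare the measured singular values of the update with the closed form.
print(np.allclose(measured, predicted))
```

Returning `ns_fn` itself when `s == 0` mirrors the "no wrapping" default: callers get the identical function object, so nothing on the existing path changes.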

Motivation

Recent work argues the optimal Schatten-p geometry is regime-dependent, and that in the token-rich / low-dimensional (Chinchilla) regime a smaller Schatten-p than ∞ is preferable:

Pethick, When to use what Schatten-p norm in deep learning? (arXiv:2606.15268): smaller p favored once tokens N ≳ effective rank d; Chinchilla scaling sits in this regime.
HTMuon (arXiv:2603.10067): heavier-tailed updates U Σ^p V^T; notes finite-step NS already beats exact SVD orthogonalization because it under-flattens the spectrum.
Freon (arXiv:2605.11181): the (G G^T)^{-c} G family; shows NS-polynomial iterations are unstable for the exact fractional power, motivating a cheap blend instead.

This PR is the minimal, numerically-safe realization of that idea: a softening blend rather than an exact σ^{q-1} power (which would require SVD/QDWH/eigh).

Testing

New tests/test_softening.py (17 cases, CPU): closed-form match of softened singular values to (1-s)+sσ/||X||_F; monotone spectral-spread increase with s; singular-vector preservation; constructor validation; wiring through all classes; a real optimizer step.
Verified s = 0 leaves the selected NS function object unwrapped/identical for all four optimizers (no behavior change at the default).
Full suite collects cleanly (236 tests, no import changes).
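For illustration, the constructor validation exercised by these tests might look like the sketch below. `validate_softening` and `rejects` are hypothetical names, not the PR's actual helpers, and the error message is invented.

```python
import math


def validate_softening(s):
    """Return s as a float if it is a valid softening factor in [0, 1]."""
    s = float(s)
    # NaN fails every comparison, so the range test alone would reject it;
    # the explicit isnan check just makes the intent visible.
    if math.isnan(s) or not (0.0 <= s <= 1.0):
        raise ValueError(f"softening must be in [0, 1], got {s}")
    return s


def rejects(value):
    """True if validate_softening raises ValueError for value."""
    try:
        validate_softening(value)
    except ValueError:
        return True
    return False


print([rejects(v) for v in (0.0, 0.5, 1.0, -0.1, 1.5, float("nan"))])
# -> [False, False, False, True, True, True]
```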

Behavioral (loss) validation should be done in a token-rich run (the regime where smaller-p is expected to help, e.g. ≥100 tokens/param); a short smoke run sits in the high-dimensional regime where s = 0 is expected to be optimal and softening neutral.

Notes

softening is currently a constant, construction-time hyperparameter. A training schedule (softening more later in training, as Soft-Muon does empirically) is a natural follow-up.
At larger s the update's spectral norm is < 1, so the effective step shrinks slightly; this composes with adjust_lr="spectral_norm" and may want LR retuning.
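To get a feel for how much the step shrinks, the sketch below evaluates the top softened singular value (1 - s) + s * σ_1 / ||X||_F, assuming exact orthogonalization. The function name and the geometric spectrum are illustrative choices, not values from the PR.

```python
import math


def softened_spectral_norm(sigmas, s):
    """Largest singular value of the softened update, assuming exact orthogonalization."""
    frob = math.sqrt(sum(v * v for v in sigmas))  # ||X||_F from the singular values
    return (1.0 - s) + s * max(sigmas) / frob


# Hypothetical momentum spectrum with geometric decay (illustrative values only).
sigmas = [0.5 ** k for k in range(8)]

for s in (0.0, 0.25, 0.5, 1.0):
    print(f"s={s:.2f}  spectral norm={softened_spectral_norm(sigmas, s):.3f}")
```

The norm falls linearly in s from 1 down to σ_1 / ||X||_F, so the shrink is larger for flatter spectra (where σ_1 is a small fraction of the Frobenius norm) and vanishes only for rank-one X.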

John Langford added 2 commits

June 17, 2026 12:44


 Add Soft-Muon softening knob for finite-Schatten-p updates

6a203a1

Add an optional softening factor s in [0,1] to Muon, NorMuon, Dion2, and
NorDion2. The selected Newton-Schulz orthogonalization is wrapped to return
(1 - s) * NS(X) + s * X / ||X||_F, blending the orthogonalized (Schatten-inf)
update toward the spectrally-normalized momentum. Since NS preserves singular
vectors, this scales the i-th update singular value to (1 - s) + s * sigma_i /
||X||_F, a monotone function of sigma_i, yielding a heavier-tailed, finite-
Schatten-p style update (cf. Soft-Muon; Pethick 2606.15268, HTMuon 2603.10067,
Freon 2605.11181).
The blend runs on the gathered (unsharded) matrix inside the NS call, so it
needs no extra communication and composes with the standard, polar-express,
triton, and gram Newton-Schulz backends. s=0 is the default and is byte-
identical to the current behavior (no wrapping).


 softening: fuse blend with lerp, correct closed-form/normalization docs, test real NS backend

dfefe27

- Use torch.lerp for the (1-s)*NS(X)+s*X/||X||_F blend: one fused kernel,
 fewer temporaries (bit-equivalent to the manual blend in fp64).
- Correct the docstring: under a finite-step NS backend the i-th softened
 singular value is (1-s)*f(sigma_i)+s*sigma_i/||X||_F, not the exact
 (1-s)+s*sigma_i/||X||_F (the default backends keep f in ~[0.85,1.15], so
 the closed form is approximate). Verified measured deviation ~6.5%.
- Fix 'spectrally-normalized' -> 'Frobenius-normalized' in the four optimizer
 docstrings: the code divides by ||X||_F, not the spectral norm.
- Reject NaN softening (already covered by the bound check; now asserted).
- Add a test exercising the real polar_express backend through the wrapper
 (the others use exact SVD), checking blend identity, dtype/shape, finiteness,
 and heavier-tail endpoint relation.
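
The docstring correction above can be illustrated on scalars. The sketch assumes the classical cubic Newton-Schulz map x -> 1.5x - 0.5x^3 applied to Frobenius-normalized singular values; the repository's backends use their own tuned polynomials, so the function names, step count, and numbers here are illustrative only.

```python
import math


def ns_scalar(x, steps):
    """Apply the classical cubic Newton-Schulz map x -> 1.5*x - 0.5*x**3 `steps` times."""
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x ** 3
    return x


def softened_sigma(sigma, frob, s, steps):
    """(1 - s) * f(sigma) + s * sigma / ||X||_F, with f = `steps` cubic NS iterations."""
    x = sigma / frob  # assumption: the backend normalizes by ||X||_F first
    return (1.0 - s) * ns_scalar(x, steps) + s * x


sigmas = [1.0, 0.5, 0.1]  # illustrative singular values, not from the PR
frob = math.sqrt(sum(v * v for v in sigmas))
s = 0.3

for sigma in sigmas:
    finite = softened_sigma(sigma, frob, s, steps=5)
    ideal = (1.0 - s) + s * sigma / frob
    print(f"sigma={sigma:.1f}  finite-step={finite:.4f}  ideal={ideal:.4f}")
```

With few steps the small singular values have not yet been pushed to 1, so the finite-step value sits below the ideal closed form; as the step count grows the two agree.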
