Add Soft-Muon softening knob for finite-Schatten-p updates #92
Open
JohnLangford wants to merge 2 commits into
Add an optional softening factor s in [0,1] to Muon, NorMuon, Dion2, and NorDion2. The selected Newton-Schulz orthogonalization is wrapped to return (1 - s) * NS(X) + s * X / ||X||_F, blending the orthogonalized (Schatten-inf) update toward the spectrally-normalized momentum. Since NS preserves singular vectors, this scales the i-th update singular value to (1 - s) + s * sigma_i / ||X||_F, a monotone function of sigma_i, yielding a heavier-tailed, finite-Schatten-p style update (cf. Soft-Muon; Pethick 2606.15268, HTMuon 2603.10067, Freon 2605.11181). The blend runs on the gathered (unsharded) matrix inside the NS call, so it needs no extra communication and composes with the standard, polar-express, triton, and gram Newton-Schulz backends. s=0 is the default and is byte-identical to the current behavior (no wrapping).
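A minimal sketch of the wrapping described above, in NumPy rather than the repository's own code. The names `with_softening` and `exact_orthogonalize` are hypothetical, and the exact SVD polar factor stands in for `NS(X)`, so the closed form for the softened singular values holds exactly here. It also assumes `X` is nonzero (no guard against dividing by a zero Frobenius norm).

```python
import math
import numpy as np


def exact_orthogonalize(x):
    """Stand-in for NS(X): the exact polar factor U @ V^T, computed via SVD."""
    u, _, vt = np.linalg.svd(x, full_matrices=False)
    return u @ vt


def with_softening(ns_fn, s):
    """Hypothetical wrapper; returns ns_fn itself when s == 0 (no wrapping)."""
    # NaN fails both comparisons, so the same bound check rejects it.
    if not (0.0 <= s <= 1.0):
        raise ValueError(f"softening must be in [0, 1], got {s}")
    if s == 0.0:
        return ns_fn  # default path: the original function object, untouched

    def softened(x):
        # (1 - s) * NS(X) + s * X / ||X||_F
        # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
        return (1.0 - s) * ns_fn(x) + s * x / np.linalg.norm(x)

    return softened


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
s = 0.3

update = with_softening(exact_orthogonalize, s)(x)

# Singular values of X, and the closed-form prediction for the softened update.
sigma = np.linalg.svd(x, compute_uv=False)
predicted = (1.0 - s) + s * sigma / np.linalg.norm(x)
measured = np.linalg.svd(update, compute_uv=False)

print(np.allclose(measured, predicted))
```

Returning the unwrapped function at `s == 0` is what makes the default path trivially identical to the existing behavior, since no extra arithmetic touches the update.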
...cs, test real NS backend
- Use torch.lerp for the (1-s)*NS(X)+s*X/||X||_F blend: one fused kernel, fewer temporaries (bit-equivalent to the manual blend in fp64).
- Correct the docstring: under a finite-step NS backend the i-th softened singular value is (1-s)*f(sigma_i)+s*sigma_i/||X||_F, not the exact (1-s)+s*sigma_i/||X||_F (the default backends keep f in ~[0.85,1.15], so the closed form is approximate). Verified measured deviation ~6.5%.
- Fix 'spectrally-normalized' -> 'Frobenius-normalized' in the four optimizer docstrings: the code divides by ||X||_F, not the spectral norm.
- Reject NaN softening (already covered by the bound check; now asserted).
- Add a test exercising the real polar_express backend through the wrapper (the others use exact SVD), checking blend identity, dtype/shape, finiteness, and heavier-tail endpoint relation.
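The docstring correction can be illustrated with a small NumPy sketch. It is an assumption-laden stand-in, not this PR's backends: `newton_schulz5` is a hypothetical helper using the quintic coefficients from the public Muon reference implementation, and the `1e-7` epsilon is illustrative. The sketch reads off `f(sigma_i)` in the singular basis of `X` and compares the finite-step expression with the exact closed form. The lerp-style blend `a + s * (b - a)` mirrors what `torch.lerp(a, b, s)` computes.

```python
import numpy as np


def newton_schulz5(x, steps=5):
    """Quintic Newton-Schulz iteration (assumed coefficients, see lead-in)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius normalization bounds the spectral norm by 1 before iterating.
    y = x / (np.linalg.norm(x) + 1e-7)
    for _ in range(steps):
        gram = y @ y.T
        y = a * y + (b * gram + c * gram @ gram) @ y
    return y


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
s = 0.25

u, sigma, vt = np.linalg.svd(x, full_matrices=False)
fro = np.linalg.norm(x)
ns_x = newton_schulz5(x)

# NS is an odd polynomial in X, so it is diagonal in X's singular basis.
f_matrix = u.T @ ns_x @ vt.T
f = np.diag(f_matrix)              # f(sigma_i): near 1, but not exactly 1
off_diagonal = f_matrix - np.diag(f)

# Lerp form of the blend: NS(X) + s * (X / ||X||_F - NS(X)).
blend = ns_x + s * (x / fro - ns_x)
softened = np.diag(u.T @ blend @ vt.T)

finite_step = (1.0 - s) * f + s * sigma / fro   # corrected docstring formula
closed_form = (1.0 - s) + s * sigma / fro       # exact-orthogonalization formula
max_deviation = np.max(np.abs(finite_step - closed_form))

print("f(sigma):", np.round(f, 3))
print("max |finite_step - closed_form|:", max_deviation)
```

The deviation printed here depends on the iteration, step count, and input spectrum, so it should not be expected to reproduce the ~6.5% figure measured against the PR's backends.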
Summary
Adds an optional `softening` factor `s ∈ [0, 1]` to `Muon`, `NorMuon`, `Dion2`, and `NorDion2` that turns the orthogonalized (Schatten-∞) update into a tunable, heavier-tailed finite-Schatten-p update: the "Soft-Muon" lever.
The selected Newton-Schulz function is wrapped to return `(1 - s) * NS(X) + s * X / ||X||_F`.
Because Newton-Schulz preserves the singular vectors of `X`, this scales the i-th update singular value to `(1 - s) + s * σ_i / ||X||_F`, a monotone function of `σ_i`. This closed form is exact for exact orthogonalization; under a finite-step NS backend the first term becomes `(1 - s) * f(σ_i)`, with `f` in roughly `[0.85, 1.15]` for the default backends, so it holds only approximately. So `s = 0` is plain orthogonalization (all singular values → 1), and `s > 0` retains spectral decay, downweighting noise-dominated directions and producing a heavier-tailed update.
`s = 0` is the default and is byte-identical to current behavior (no wrapping).
Motivation
Recent work argues the optimal Schatten-p geometry is regime-dependent, and that in the token-rich / low-dimensional (Chinchilla) regime a smaller Schatten-p than ∞ is preferable:
- `U Σ^p V^T`; notes finite-step NS already beats exact SVD orthogonalization because it under-flattens the spectrum.
- `(G G^T)^{-c} G` family; shows NS-polynomial iterations are unstable for the exact fractional power, motivating a cheap blend instead.
This PR is the minimal, numerically-safe realization of that idea: a softening blend rather than an exact `σ^{q-1}` power (which would require SVD/QDWH/eigh).
Testing
- `tests/test_softening.py` (17 cases, CPU): closed-form match of softened singular values to `(1-s)+sσ/||X||_F`; monotone spectral-spread increase with `s`; singular-vector preservation; constructor validation; wiring through all classes; a real optimizer step.
- `s = 0` leaves the selected NS function object unwrapped/identical for all four optimizers (no behavior change at the default).
Behavioral (loss) validation should be done in a token-rich run (the regime where smaller-p is expected to help, e.g. ≥100 tokens/param); a short smoke run sits in the high-dimensional regime where `s = 0` is expected to be optimal and softening neutral.
Notes
- `softening` is currently a constant construction-time hyperparameter. A training schedule (softening more later in training, as Soft-Muon does empirically) is a natural follow-up.
- For nonzero `s` the update's spectral norm is `< 1`, so the effective step shrinks slightly; this composes with `adjust_lr="spectral_norm"` and may want LR retuning.
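To make the spectral-norm note concrete, here is a small NumPy sketch. It assumes exact SVD orthogonalization in place of an NS backend, and `softened_update` is a hypothetical helper. Under that assumption the largest singular value of the update is `(1 - s) + s * σ_max / ||X||_F`, which equals 1 at `s = 0` and decreases as `s` grows whenever `X` has rank above 1.

```python
import numpy as np


def softened_update(x, s):
    """(1 - s) * U V^T + s * X / ||X||_F, using exact SVD orthogonalization."""
    u, _, vt = np.linalg.svd(x, full_matrices=False)
    return (1.0 - s) * (u @ vt) + s * x / np.linalg.norm(x)


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))

sigma_max = np.linalg.svd(x, compute_uv=False)[0]
# sigma_max / ||X||_F is at most 1, with equality only for rank-1 X.
ratio = sigma_max / np.linalg.norm(x)

softening_values = (0.0, 0.1, 0.3, 0.5, 1.0)
spectral_norms = {}
for s in softening_values:
    # ord=2 on a 2-D array gives the largest singular value.
    spectral_norms[s] = np.linalg.norm(softened_update(x, s), ord=2)
    expected = (1.0 - s) + s * ratio
    print(f"s={s:.1f}  spectral norm={spectral_norms[s]:.4f}  closed form={expected:.4f}")
```

How much learning-rate retuning this shrinkage calls for in practice is not something the sketch settles; it only shows the direction and size of the effect on one random matrix.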