diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0950944..acbe477 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,13 +9,16 @@ All notable changes to this project are documented in this file.
 - **Breaking (install):** `gram-newton-schulz` and `quack-kernels` are no longer base dependencies. They moved to an optional `dion[gram-newton-schulz]` extra (alias `dion[gns]`), and are also excluded from the `dev` and `train` extras.
- This keeps the default install free of the transitive `nvidia-cutlass-dsl==4.4.2`
- pin, which conflicts with Flash-Attention-4 / Blackwell stacks built on cutlass
- `4.5.2`.
+ This keeps the default install free of the heavy Gram Newton-Schulz GPU stack
+ (and its transitive `nvidia-cutlass-dsl` pin).
 **Action required:** if you run with `use_gram_newton_schulz=True`, install the extra (`pip install "dion[gns] @ git+https://github.com/microsoft/dion.git"`, or `pip install -e ".[gns]"` from a clone). Without it, optimizer construction now raises a clear `ImportError` at runtime instead of the kernels being silently
- present. Opting in re-introduces the cutlass `4.4.2` pin, so use a separate
- environment from FA4/Blackwell.
+ present.
+
+- Bumped the optional `dion[gns]` extra to `gram-newton-schulz==0.1.5`
+ (`quack-kernels==0.5.0`). This moves its transitive `nvidia-cutlass-dsl` pin from
+ `4.4.2` to `4.5.2`, matching current Flash-Attention-4 / Blackwell stacks, so the
+ extra no longer conflicts with them.
diff --git a/README.md b/README.md
index 3f93415..723500f 100644
--- a/README.md
+++ b/README.md
@@ -50,7 +50,7 @@ Our implementations are available as a `pip` package! Install to use in your pro
 pip install git+https://github.com/microsoft/dion.git
 ```
-> The optional Gram Newton-Schulz orthogonalization kernels (enabled with `use_gram_newton_schulz=True`) are not pulled in by the base install. Add them with `pip install "dion[gram-newton-schulz] @ git+https://github.com/microsoft/dion.git"`, or `pip install -e ".[gram-newton-schulz]"` from a clone. Note: this extra pins `nvidia-cutlass-dsl==4.4.2`, which conflicts with Flash-Attention-4 / Blackwell stacks built on cutlass `4.5.2`, so install it in a separate environment if you need both.
+> The optional Gram Newton-Schulz orthogonalization kernels (enabled with `use_gram_newton_schulz=True`) are not pulled in by the base install. Add them with `pip install "dion[gram-newton-schulz] @ git+https://github.com/microsoft/dion.git"`, or `pip install -e ".[gram-newton-schulz]"` from a clone. Note: this extra pins `nvidia-cutlass-dsl==4.5.2`, matching the cutlass version used by current Flash-Attention-4 / Blackwell stacks (the earlier `4.4.2` conflict no longer applies).
 Then in your code, you can use:
@@ -68,7 +68,7 @@ git clone https://github.com/microsoft/dion.git
 cd dion
 pip install -e .[train]
 ```
-> `train` stays free of the Gram Newton-Schulz kernels (and their `nvidia-cutlass-dsl==4.4.2` pin) so the default training install works on Flash-Attention-4 / Blackwell stacks. To train with `--use_gram_newton_schulz`, use `pip install -e ".[train,gns]"` in a separate environment. Likewise, to develop or test the Gram Newton-Schulz path, install `pip install -e ".[dev,gns]"` — a plain `[dev]` install skips the GNS-specific test cases.
+> `train` stays free of the Gram Newton-Schulz kernels, which remain an opt-in extra. To train with `--use_gram_newton_schulz`, use `pip install -e ".[train,gns]"`; the extra's `nvidia-cutlass-dsl==4.5.2` pin now matches Flash-Attention-4 / Blackwell stacks, so the two no longer conflict. Likewise, to develop or test the Gram Newton-Schulz path, install `pip install -e ".[dev,gns]"` — a plain `[dev]` install skips the GNS-specific test cases.
 Download pretokenized FineWeb dataset:
 ```bash
diff --git a/requirements_gns.txt b/requirements_gns.txt
index de046e2..02bf1ec 100644
--- a/requirements_gns.txt
+++ b/requirements_gns.txt
@@ -1,6 +1,6 @@
 # Optional Gram Newton-Schulz orthogonalization kernels (use_gram_newton_schulz=True).
 # quack-kernels is pulled in transitively by gram-newton-schulz; the explicit pin here
 # is for reproducibility and must track whatever quack version the pinned gram-newton-schulz
-# requires (gram-newton-schulz==0.1.4 also transitively pins nvidia-cutlass-dsl==4.4.2).
-gram-newton-schulz==0.1.4
-quack-kernels==0.4.1
+# requires (gram-newton-schulz==0.1.5 also transitively pins nvidia-cutlass-dsl==4.5.2).
+gram-newton-schulz==0.1.5
+quack-kernels==0.5.0
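
Review note: as a rough companion to the pin bump above, the sketch below checks whether an environment matches the versions listed in the updated `requirements_gns.txt`. The helper `check_pins` and the `GNS_PINS` table are illustrative names and are not part of the `dion` package. The `nvidia-cutlass-dsl` entry reflects the transitive pin described in the changelog entry; it is not a direct requirement in that file.

```python
from importlib import metadata

# Pins copied from the updated requirements_gns.txt in this diff.
GNS_PINS = {
    "gram-newton-schulz": "0.1.5",
    "quack-kernels": "0.5.0",
    # Transitive pin, as stated in the changelog entry (assumed here).
    "nvidia-cutlass-dsl": "4.5.2",
}


def check_pins(pins, get_version=metadata.version):
    """Return {name: (expected, installed_or_None)} for each mismatched pin.

    `get_version` is injectable so the check can be exercised without
    touching the real environment.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # distribution is not installed at all
        if installed != expected:
            problems[name] = (expected, installed)
    return problems


if __name__ == "__main__":
    mismatches = check_pins(GNS_PINS)
    for name, (expected, installed) in mismatches.items():
        found = installed or "not installed"
        print(f"{name}: expected {expected}, found {found}")
    if not mismatches:
        print("All Gram Newton-Schulz pins match.")
```

Running this in an environment created with `pip install -e ".[train,gns]"` should report no mismatches if the extra resolved as the diff describes; in a base install it should list all three packages as not installed.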