Commit 4acbfbf

ishan-modi and sayakpaul authored

[Quantization] Add TRT-ModelOpt as a Backend (#11173)

* initial commit
* update
* updates
* update
* update
* update
* update
* update
* update
* addressed PR comments
* update
* addressed PR comments
* update
* update
* update
* update
* update
* update
* updates
* update
* update
* addressed PR comments
* updates
* code formatting
* update
* addressed PR comments
* addressed PR comments
* addressed PR comments
* addressed PR comments
* fix docs and dependencies
* fixed dependency test

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

1 parent 6549b04 commit 4acbfbf

File tree

17 files changed: +936 -3 lines changed

.github/workflows
- nightly_tests.yml
docs/source/en
- _toctree.yml
- quantization
  - modelopt.md
setup.py
src/diffusers
- __init__.py
- dependency_versions_table.py
- quantizers
  - auto.py
  - modelopt
    - __init__.py
    - modelopt_quantizer.py
  - quantization_config.py
- utils
tests
- others
  - test_dependencies.py
- quantization/modelopt
  - __init__.py
  - test_modelopt.py


`‎.github/workflows/nightly_tests.yml‎`

Lines changed: 3 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -340,6 +340,9 @@ jobs:`
`340`	`340`	`- backend: "optimum_quanto"`
`341`	`341`	`test_location: "quanto"`
`342`	`342`	`additional_deps: []`
	`343`	`+ - backend: "nvidia_modelopt"`
	`344`	`+ test_location: "modelopt"`
	`345`	`+ additional_deps: []`
`343`	`346`	`runs-on:`
`344`	`347`	`group: aws-g6e-xlarge-plus`
`345`	`348`	`container:`

`‎docs/source/en/_toctree.yml‎`

Lines changed: 2 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -188,6 +188,8 @@`
`188`	`188`	`title: torchao`
`189`	`189`	`- local: quantization/quanto`
`190`	`190`	`title: quanto`
	`191`	`+ - local: quantization/modelopt`
	`192`	`+ title: NVIDIA ModelOpt`
`191`	`193`
`192`	`194`	`- title: Model accelerators and hardware`
`193`	`195`	`isExpanded: false`

`‎docs/source/en/quantization/modelopt.md‎`

Lines changed: 141 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,141 @@`
	`1`	`+<!-- Copyright 2025 The HuggingFace Team. All rights reserved.`
	`2`	`+`
	`3`	`+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with`
	`4`	`+the License. You may obtain a copy of the License at`
	`5`	`+`
	`6`	`+http://www.apache.org/licenses/LICENSE-2.0`
	`7`	`+`
	`8`	`+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on`
	`9`	`+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the`
	`10`	`+specific language governing permissions and limitations under the License. -->`
	`11`	`+`
	`12`	`+# NVIDIA ModelOpt`
	`13`	`+`
	`14`	`+[NVIDIA-ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.`
	`15`	`+`
	`16`	`+Before you begin, make sure you have nvidia_modelopt installed.`
	`17`	`+`
	`18`	+```bash
	`19`	`+pip install -U "nvidia_modelopt[hf]"`
	`20`	+```
	`21`	`+`
	`22`	+Quantize a model by passing [`NVIDIAModelOptConfig`] to [`~ModelMixin.from_pretrained`] (you can also load pre-quantized models). This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
	`23`	`+`
	`24`	`+The example below only quantizes the weights to FP8.`
	`25`	`+`
	`26`	+```python
	`27`	`+import torch`
	`28`	`+from diffusers import AutoModel, SanaPipeline, NVIDIAModelOptConfig`
	`29`	`+`
	`30`	`+model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"`
	`31`	`+dtype = torch.bfloat16`
	`32`	`+`
	`33`	`+quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")`
	`34`	`+transformer = AutoModel.from_pretrained(`
	`35`	`+ model_id,`
	`36`	`+ subfolder="transformer",`
	`37`	`+ quantization_config=quantization_config,`
	`38`	`+ torch_dtype=dtype,`
	`39`	`+)`
	`40`	`+pipe = SanaPipeline.from_pretrained(`
	`41`	`+ model_id,`
	`42`	`+ transformer=transformer,`
	`43`	`+ torch_dtype=dtype,`
	`44`	`+)`
	`45`	`+pipe.to("cuda")`
	`46`	`+`
	`47`	`+print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")`
	`48`	`+`
	`49`	`+prompt = "A cat holding a sign that says hello world"`
	`50`	`+image = pipe(`
	`51`	`+ prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512`
	`52`	`+).images[0]`
	`53`	`+image.save("output.png")`
	`54`	+```
	`55`	`+`
	`56`	`+> Note:`
	`57`	`+>`
	`58`	`+> The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration.`
	`59`	`+>`
	`60`	`+> More details can be found [here](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples).`
	`61`	`+`
	`62`	`+## NVIDIAModelOptConfig`
	`63`	`+`
	`64`	+The `NVIDIAModelOptConfig` class accepts three parameters:
	`65`	+- `quant_type`: A string value mentioning one of the quantization types below.
	`66`	+- `modules_to_not_convert`: A list of module full/partial module names for which quantization should not be performed. For example, to not perform any quantization of the [`SD3Transformer2DModel`]'s pos_embed projection blocks, one would specify: `modules_to_not_convert=["pos_embed.proj.weight"]`.
	`67`	+- `disable_conv_quantization`: A boolean value which when set to `True` disables quantization for all convolutional layers in the model. This is useful as channel and block quantization generally don't work well with convolutional layers (used with INT4, NF4, NVFP4). If you want to disable quantization for specific convolutional layers, use `modules_to_not_convert` instead.
	`68`	+- `algorithm`: The algorithm to use for determining scale, defaults to `"max"`. You can check modelopt documentation for more algorithms and details.
	`69`	+- `forward_loop`: The forward loop function to use for calibrating activation during quantization. If not provided, it relies on static scale values computed using the weights only.
	`70`	+- `kwargs`: A dict of keyword arguments to pass to the underlying quantization method which will be invoked based on `quant_type`.
	`71`	`+`
	`72`	`+## Supported quantization types`
	`73`	`+`
	`74`	`+ModelOpt supports weight-only, channel and block quantization int8, fp8, int4, nf4, and nvfp4. The quantization methods are designed to reduce the memory footprint of the model weights while maintaining the performance of the model during inference.`
	`75`	`+`
	`76`	+Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.
	`77`	`+`
	`78`	`+The quantization methods supported are as follows:`
	`79`	`+`
	`80`	`+\| Quantization Type \| Supported Schemes \| Required Kwargs \| Additional Notes \|`
	`81`	`+\|-----------------------\|-----------------------\|---------------------\|----------------------\|`
	`82`	+\| INT8 \| `int8 weight only`, `int8 channel quantization`, `int8 block quantization` \| `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` \|
	`83`	+\| FP8 \| `fp8 weight only`, `fp8 channel quantization`, `fp8 block quantization` \| `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` \|
	`84`	+\| INT4 \| `int4 weight only`, `int4 block quantization` \| `quant_type`, `quant_type + channel_quantize + block_quantize` \| `channel_quantize = -1 is only supported for now`\|
	`85`	+\| NF4 \| `nf4 weight only`, `nf4 double block quantization` \| `quant_type`, `quant_type + channel_quantize + block_quantize + scale_channel_quantize` + `scale_block_quantize` \| `channel_quantize = -1 and scale_channel_quantize = -1 are only supported for now` \|
	`86`	+\| NVFP4 \| `nvfp4 weight only`, `nvfp4 block quantization` \| `quant_type`, `quant_type + channel_quantize + block_quantize` \| `channel_quantize = -1 is only supported for now`\|
	`87`	`+`
	`88`	`+`
	`89`	`+Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.`
	`90`	`+`
	`91`	`+## Serializing and Deserializing quantized models`
	`92`	`+`
	`93`	+To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method.
	`94`	`+`
	`95`	+```python
	`96`	`+import torch`
	`97`	`+from diffusers import AutoModel, NVIDIAModelOptConfig`
	`98`	`+from modelopt.torch.opt import enable_huggingface_checkpointing`
	`99`	`+`
	`100`	`+enable_huggingface_checkpointing()`
	`101`	`+`
	`102`	`+model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"`
	`103`	`+quant_config_fp8 = {"quant_type": "FP8", "quant_method": "modelopt"}`
	`104`	`+quant_config_fp8 = NVIDIAModelOptConfig(**quant_config_fp8)`
	`105`	`+model = AutoModel.from_pretrained(`
	`106`	`+ model_id,`
	`107`	`+ subfolder="transformer",`
	`108`	`+ quantization_config=quant_config_fp8,`
	`109`	`+ torch_dtype=torch.bfloat16,`
	`110`	`+)`
	`111`	`+model.save_pretrained('path/to/sana_fp8', safe_serialization=False)`
	`112`	+```
	`113`	`+`
	`114`	+To load a serialized quantized model, use the [`~ModelMixin.from_pretrained`] method.
	`115`	`+`
	`116`	+```python
	`117`	`+import torch`
	`118`	`+from diffusers import AutoModel, NVIDIAModelOptConfig, SanaPipeline`
	`119`	`+from modelopt.torch.opt import enable_huggingface_checkpointing`
	`120`	`+`
	`121`	`+enable_huggingface_checkpointing()`
	`122`	`+`
	`123`	`+quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")`
	`124`	`+transformer = AutoModel.from_pretrained(`
	`125`	`+ "path/to/sana_fp8",`
	`126`	`+ subfolder="transformer",`
	`127`	`+ quantization_config=quantization_config,`
	`128`	`+ torch_dtype=torch.bfloat16,`
	`129`	`+)`
	`130`	`+pipe = SanaPipeline.from_pretrained(`
	`131`	`+ "Efficient-Large-Model/Sana_600M_1024px_diffusers",`
	`132`	`+ transformer=transformer,`
	`133`	`+ torch_dtype=torch.bfloat16,`
	`134`	`+)`
	`135`	`+pipe.to("cuda")`
	`136`	`+prompt = "A cat holding a sign that says hello world"`
	`137`	`+image = pipe(`
	`138`	`+ prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512`
	`139`	`+).images[0]`
	`140`	`+image.save("output.png")`
	`141`	+```
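
The added docs say that weight-only quantization lowers the memory needed for weights while leaving activation peaks unchanged. As a rough illustration only, the sketch below estimates weight storage at different bit widths. The 600M parameter count is an assumption read off the checkpoint name, not a measured value, and the estimate ignores quantization scales and any modules excluded through `modules_to_not_convert`.

```python
# Rough weight-memory estimate for weight-only quantization.
# Assumption: bytes per parameter follow directly from the bit width;
# scale tensors and unquantized modules are ignored.
BYTES_PER_PARAM = {"bfloat16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the storage needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    num_params = 600_000_000  # hypothetical round figure, not measured
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(num_params, dtype):.3f} GiB")
```

The figure printed by `torch.cuda.max_memory_reserved()` in the docs example will differ from this estimate, since it also covers activations, the other pipeline components, and allocator overhead.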

`‎setup.py‎`

Lines changed: 2 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -132,6 +132,7 @@`
`132`	`132`	`"gguf>=0.10.0",`
`133`	`133`	`"torchao>=0.7.0",`
`134`	`134`	`"bitsandbytes>=0.43.3",`
	`135`	`+ "nvidia_modelopt[hf]>=0.33.1",`
`135`	`136`	`"regex!=2019.12.17",`
`136`	`137`	`"requests",`
`137`	`138`	`"tensorboard",`
`@@ -244,6 +245,7 @@ def run(self):`
`244`	`245`	`extras["gguf"] = deps_list("gguf", "accelerate")`
`245`	`246`	`extras["optimum_quanto"] = deps_list("optimum_quanto", "accelerate")`
`246`	`247`	`extras["torchao"] = deps_list("torchao", "accelerate")`
	`248`	`+extras["nvidia_modelopt"] = deps_list("nvidia_modelopt[hf]")`
`247`	`249`
`248`	`250`	`if os.name == "nt": # windows`
`249`	`251`	`extras["flax"] = [] # jax is not supported on windows`

`src/diffusers/__init__.py`

Lines changed: 21 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -13,6 +13,7 @@`
`13`	`13`	`is_k_diffusion_available,`
`14`	`14`	`is_librosa_available,`
`15`	`15`	`is_note_seq_available,`
	`16`	`+ is_nvidia_modelopt_available,`
`16`	`17`	`is_onnx_available,`
`17`	`18`	`is_opencv_available,`
`18`	`19`	`is_optimum_quanto_available,`
`@@ -111,6 +112,18 @@`
`111`	`112`	`else:`
`112`	`113`	`_import_structure["quantizers.quantization_config"].append("QuantoConfig")`
`113`	`114`
	`115`	`+try:`
	`116`	`+ if not is_torch_available() and not is_accelerate_available() and not is_nvidia_modelopt_available():`
	`117`	`+ raise OptionalDependencyNotAvailable()`
	`118`	`+except OptionalDependencyNotAvailable:`
	`119`	`+ from .utils import dummy_nvidia_modelopt_objects`
	`120`	`+`
	`121`	`+ _import_structure["utils.dummy_nvidia_modelopt_objects"] = [`
	`122`	`+ name for name in dir(dummy_nvidia_modelopt_objects) if not name.startswith("_")`
	`123`	`+ ]`
	`124`	`+else:`
	`125`	`+ _import_structure["quantizers.quantization_config"].append("NVIDIAModelOptConfig")`
	`126`	`+`
`114`	`127`	`try:`
`115`	`128`	`if not is_onnx_available():`
`116`	`129`	`raise OptionalDependencyNotAvailable()`
`@@ -795,6 +808,14 @@`
`795`	`808`	`else:`
`796`	`809`	`from .quantizers.quantization_config import QuantoConfig`
`797`	`810`
	`811`	`+ try:`
	`812`	`+ if not is_nvidia_modelopt_available():`
	`813`	`+ raise OptionalDependencyNotAvailable()`
	`814`	`+ except OptionalDependencyNotAvailable:`
	`815`	`+ from .utils.dummy_nvidia_modelopt_objects import *`
	`816`	`+ else:`
	`817`	`+ from .quantizers.quantization_config import NVIDIAModelOptConfig`
	`818`	`+`
`798`	`819`	`try:`
`799`	`820`	`if not is_onnx_available():`
`800`	`821`	`raise OptionalDependencyNotAvailable()`
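
The guard pattern in this hunk can be tried in isolation. The sketch below is a stand-alone approximation, not the diffusers implementation: `is_available` and `resolve_export` are hypothetical helpers, and `OptionalDependencyNotAvailable` is redefined locally as a stand-in. It keeps the same boolean shape as the first added guard, which is true only when all three dependencies are unavailable.

```python
import importlib.util


class OptionalDependencyNotAvailable(Exception):
    """Local stand-in for the exception of the same name used in the diff."""


def is_available(package: str) -> bool:
    # find_spec returns None when a top-level package cannot be found.
    return importlib.util.find_spec(package) is not None


def resolve_export(torch_ok: bool, accelerate_ok: bool, modelopt_ok: bool) -> str:
    """Return which object the guard would expose: the real config or a dummy."""
    try:
        # Same condition shape as the added guard: raises only if all are missing.
        if not torch_ok and not accelerate_ok and not modelopt_ok:
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        return "dummy"
    else:
        return "NVIDIAModelOptConfig"


if __name__ == "__main__":
    print(resolve_export(False, False, False))
    print(resolve_export(True, True, False))
```

Under this condition, a missing `nvidia_modelopt` alone still takes the `else` branch, whereas the second added guard (new lines 811-817) checks only `is_nvidia_modelopt_available()`.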

`‎src/diffusers/dependency_versions_table.py‎`

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -39,6 +39,7 @@`
`39`	`39`	`"gguf": "gguf>=0.10.0",`
`40`	`40`	`"torchao": "torchao>=0.7.0",`
`41`	`41`	`"bitsandbytes": "bitsandbytes>=0.43.3",`
	`42`	`+ "nvidia_modelopt[hf]": "nvidia_modelopt[hf]>=0.33.1",`
`42`	`43`	`"regex": "regex!=2019.12.17",`
`43`	`44`	`"requests": "requests",`
`44`	`45`	`"tensorboard": "tensorboard",`

`‎src/diffusers/quantizers/auto.py‎`

Lines changed: 7 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -21,9 +21,11 @@`
`21`	`21`
`22`	`22`	`from .bitsandbytes import BnB4BitDiffusersQuantizer, BnB8BitDiffusersQuantizer`
`23`	`23`	`from .gguf import GGUFQuantizer`
	`24`	`+from .modelopt import NVIDIAModelOptQuantizer`
`24`	`25`	`from .quantization_config import (`
`25`	`26`	`BitsAndBytesConfig,`
`26`	`27`	`GGUFQuantizationConfig,`
	`28`	`+ NVIDIAModelOptConfig,`
`27`	`29`	`QuantizationConfigMixin,`
`28`	`30`	`QuantizationMethod,`
`29`	`31`	`QuantoConfig,`
`@@ -39,6 +41,7 @@`
`39`	`41`	`"gguf": GGUFQuantizer,`
`40`	`42`	`"quanto": QuantoQuantizer,`
`41`	`43`	`"torchao": TorchAoHfQuantizer,`
	`44`	`+ "modelopt": NVIDIAModelOptQuantizer,`
`42`	`45`	`}`
`43`	`46`
`44`	`47`	`AUTO_QUANTIZATION_CONFIG_MAPPING = {`
`@@ -47,6 +50,7 @@`
`47`	`50`	`"gguf": GGUFQuantizationConfig,`
`48`	`51`	`"quanto": QuantoConfig,`
`49`	`52`	`"torchao": TorchAoConfig,`
	`53`	`+ "modelopt": NVIDIAModelOptConfig,`
`50`	`54`	`}`
`51`	`55`
`52`	`56`
`@@ -137,6 +141,9 @@ def merge_quantization_configs(`
`137`	`141`	`if isinstance(quantization_config, dict):`
`138`	`142`	`quantization_config = cls.from_dict(quantization_config)`
`139`	`143`
	`144`	`+ if isinstance(quantization_config, NVIDIAModelOptConfig):`
	`145`	`+ quantization_config.check_model_patching()`
	`146`	`+`
`140`	`147`	`if warning_msg != "":`
`141`	`148`	`warnings.warn(warning_msg)`
`142`	`149`
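
The two mapping entries added here register the backend under the key `"modelopt"`. A minimal sketch of that kind of string-keyed dispatch follows. `ToyConfig`, `ToyQuantizer`, and `build_quantizer` are hypothetical names used for illustration; the real classes and the logic in `auto.py` carry more behavior than is shown.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a quantization config class."""

    quant_method: str
    quant_type: str

    @classmethod
    def from_dict(cls, data: dict) -> "ToyConfig":
        return cls(**data)


class ToyQuantizer:
    """Hypothetical stand-in for a quantizer class."""

    def __init__(self, config: ToyConfig):
        self.config = config


# One entry per backend key, mirroring the shape of the two mappings in the diff.
CONFIG_MAPPING = {"modelopt": ToyConfig}
QUANTIZER_MAPPING = {"modelopt": ToyQuantizer}


def build_quantizer(quantization_config):
    """Turn a dict or config object into a quantizer via the registries."""
    if isinstance(quantization_config, dict):
        method = quantization_config.get("quant_method")
        if method not in CONFIG_MAPPING:
            raise ValueError(f"Unknown quantization method: {method!r}")
        quantization_config = CONFIG_MAPPING[method].from_dict(quantization_config)
    method = quantization_config.quant_method
    if method not in QUANTIZER_MAPPING:
        raise ValueError(f"No quantizer registered for: {method!r}")
    return QUANTIZER_MAPPING[method](quantization_config)
```

With this layout, supporting a new backend only requires one entry in each mapping, which matches the size of the change to `auto.py` in this commit.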

`src/diffusers/quantizers/modelopt/__init__.py`

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1 @@`
	`1`	`+from .modelopt_quantizer import NVIDIAModelOptQuantizer`
