Commit 4acbfbf

ishan-modi and sayakpaul authored
[Quantization] Add TRT-ModelOpt as a Backend (#11173)
* initial commit * update * updates * update * update * update * update * update * update * addressed PR comments * update * addressed PR comments * update * update * update * update * update * update * updates * update * update * addressed PR comments * updates * code formatting * update * addressed PR comments * addressed PR comments * addressed PR comments * addressed PR comments * fix docs and dependencies * fixed dependency test --------- Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
1 parent 6549b04 commit 4acbfbf

17 files changed, +936 -3 lines changed


‎.github/workflows/nightly_tests.yml‎

Lines changed: 3 additions & 0 deletions
@@ -340,6 +340,9 @@ jobs:
           - backend: "optimum_quanto"
             test_location: "quanto"
             additional_deps: []
+          - backend: "nvidia_modelopt"
+            test_location: "modelopt"
+            additional_deps: []
     runs-on:
       group: aws-g6e-xlarge-plus
     container:

‎docs/source/en/_toctree.yml‎

Lines changed: 2 additions & 0 deletions
@@ -188,6 +188,8 @@
     title: torchao
   - local: quantization/quanto
     title: quanto
+  - local: quantization/modelopt
+    title: NVIDIA ModelOpt

 - title: Model accelerators and hardware
   isExpanded: false
Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# NVIDIA ModelOpt

[NVIDIA-ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.

Before you begin, make sure you have nvidia_modelopt installed.

```bash
pip install -U "nvidia_modelopt[hf]"
```

Quantize a model by passing [`NVIDIAModelOptConfig`] to [`~ModelMixin.from_pretrained`] (you can also load pre-quantized models). This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

The example below only quantizes the weights to FP8.

```python
import torch
from diffusers import AutoModel, SanaPipeline, NVIDIAModelOptConfig

model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
dtype = torch.bfloat16

quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
pipe = SanaPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.to("cuda")

print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
```

> **Note:**
>
> The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration.
>
> More details can be found [here](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples).

## NVIDIAModelOptConfig

The `NVIDIAModelOptConfig` class accepts the following parameters:
- `quant_type`: A string naming one of the quantization types listed below.
- `modules_to_not_convert`: A list of full or partial module names for which quantization should not be performed. For example, to skip quantization of the [`SD3Transformer2DModel`]'s pos_embed projection blocks, specify `modules_to_not_convert=["pos_embed.proj.weight"]`.
- `disable_conv_quantization`: A boolean which, when set to `True`, disables quantization for all convolutional layers in the model. This is useful because channel and block quantization (used with INT4, NF4, NVFP4) generally don't work well with convolutional layers. To disable quantization for specific convolutional layers only, use `modules_to_not_convert` instead.
- `algorithm`: The algorithm used to determine the scale; defaults to `"max"`. See the ModelOpt documentation for other algorithms and details.
- `forward_loop`: The forward loop function used to calibrate activations during quantization. If not provided, quantization relies on static scale values computed from the weights only.
- `kwargs`: A dict of keyword arguments passed to the underlying quantization method, which is selected based on `quant_type`.
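
To make the idea of "full or partial module names" concrete, here is a minimal pure-Python sketch. It assumes simple substring matching; the helper `is_excluded` and the example module names are hypothetical and are not part of diffusers or ModelOpt, whose actual matching rules may differ.

```python
def is_excluded(module_name: str, modules_to_not_convert: list[str]) -> bool:
    """Return True if any excluded entry occurs inside the module name.

    Hypothetical helper: substring matching is an assumption of this sketch.
    """
    return any(entry in module_name for entry in modules_to_not_convert)


# Illustrative module names, not taken from a real checkpoint.
module_names = [
    "pos_embed.proj.weight",
    "transformer_blocks.0.attn.to_q.weight",
    "transformer_blocks.0.ff.net.2.weight",
]
excluded = ["pos_embed.proj.weight", "attn"]  # one full name, one partial name

for name in module_names:
    action = "skip" if is_excluded(name, excluded) else "quantize"
    print(f"{name}: {action}")
```

Under this assumption a partial entry such as `"attn"` excludes every module whose name contains it, so short entries should be chosen with care.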

## Supported quantization types

ModelOpt supports weight-only, channel, and block quantization in INT8, FP8, INT4, NF4, and NVFP4. The quantization methods are designed to reduce the memory footprint of the model weights while maintaining the performance of the model during inference.

Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.
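
The following is a rough pure-Python sketch of weight-only quantization, not ModelOpt's implementation. It assumes symmetric per-tensor INT8 with a scale taken from the largest absolute weight (in the spirit of a `"max"` calibration); the function names and the tiny weight matrix are made up for illustration.

```python
def quantize_int8_max(weights):
    """Symmetric per-tensor INT8 quantization with a max-calibrated scale."""
    amax = max(abs(w) for row in weights for w in row)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [[max(-127, min(127, round(w / scale))) for w in row] for row in weights]
    return q, scale


def linear_weight_only(x, q_weights, scale):
    """Dequantize the stored integers, then compute y = W @ x in float."""
    return [sum((q * scale) * xi for q, xi in zip(row, x)) for row in q_weights]


weights = [[0.50, -1.27, 0.03], [0.10, 0.64, -0.90]]  # illustrative values
x = [1.0, 2.0, 3.0]

q_weights, scale = quantize_int8_max(weights)
y_quant = linear_weight_only(x, q_weights, scale)
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in weights]

print("stored integers:", q_weights, "scale:", scale)
print("quantized output:", y_quant)
print("reference output:", y_ref)
```

Only the integers and one scale need to be stored, while the matrix product itself still runs in floating point, which is why activation memory is unchanged.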

The quantization methods supported are as follows:

| **Quantization Type** | **Supported Schemes** | **Required Kwargs** | **Additional Notes** |
|-----------------------|-----------------------|---------------------|----------------------|
| **INT8** | `int8 weight only`, `int8 channel quantization`, `int8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
| **FP8** | `fp8 weight only`, `fp8 channel quantization`, `fp8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
| **INT4** | `int4 weight only`, `int4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | Only `channel_quantize = -1` is supported for now |
| **NF4** | `nf4 weight only`, `nf4 double block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize + scale_channel_quantize + scale_block_quantize` | Only `channel_quantize = -1` and `scale_channel_quantize = -1` are supported for now |
| **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | Only `channel_quantize = -1` is supported for now |
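
As an illustration of why block quantization can help at low bit widths, the sketch below compares one scale for a whole row with one scale per block. It is a simplified model, not ModelOpt's code: the signed range `[-7, 7]` (a stand-in for a 4-bit type), the block size of 4, and all function names are assumptions of this example.

```python
def quantize_blocks(row, block_size, qmax=7):
    """Quantize a 1-D list in blocks, each with its own max-calibrated scale."""
    q, scales = [], []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        q.extend(max(-qmax, min(qmax, round(v / scale))) for v in block)
    return q, scales


def dequantize_blocks(q, scales, block_size):
    """Map each integer back to float using the scale of its block."""
    return [v * scales[i // block_size] for i, v in enumerate(q)]


def roundtrip_errors(row, block_size):
    """Absolute error per element after quantizing and dequantizing."""
    q, scales = quantize_blocks(row, block_size)
    restored = dequantize_blocks(q, scales, block_size)
    return [abs(a - b) for a, b in zip(row, restored)]


# Four small values followed by four large ones (illustrative data).
row = [0.02, -0.05, 0.04, 0.01, 3.50, -2.80, 1.90, 0.70]

per_tensor = roundtrip_errors(row, block_size=len(row))  # one scale for the row
per_block = roundtrip_errors(row, block_size=4)          # one scale per 4 values

print("worst error on the small values, single scale:", max(per_tensor[:4]))
print("worst error on the small values, block scales:", max(per_block[:4]))
```

With a single scale the large values dictate the step size and the small values collapse to zero; per-block scales keep them distinguishable at the cost of storing one extra scale per block.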


Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.

## Serializing and Deserializing quantized models

To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method.

```python
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig
from modelopt.torch.opt import enable_huggingface_checkpointing

enable_huggingface_checkpointing()

model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
quant_config_fp8 = {"quant_type": "FP8", "quant_method": "modelopt"}
quant_config_fp8 = NVIDIAModelOptConfig(**quant_config_fp8)
model = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quant_config_fp8,
    torch_dtype=torch.bfloat16,
)
model.save_pretrained('path/to/sana_fp8', safe_serialization=False)
```

To load a serialized quantized model, use the [`~ModelMixin.from_pretrained`] method.

```python
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig, SanaPipeline
from modelopt.torch.opt import enable_huggingface_checkpointing

enable_huggingface_checkpointing()

quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
transformer = AutoModel.from_pretrained(
    "path/to/sana_fp8",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_600M_1024px_diffusers",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
```

‎setup.py‎

Lines changed: 2 additions & 0 deletions
@@ -132,6 +132,7 @@
     "gguf>=0.10.0",
     "torchao>=0.7.0",
     "bitsandbytes>=0.43.3",
+    "nvidia_modelopt[hf]>=0.33.1",
     "regex!=2019.12.17",
     "requests",
     "tensorboard",
@@ -244,6 +245,7 @@ def run(self):
 extras["gguf"] = deps_list("gguf", "accelerate")
 extras["optimum_quanto"] = deps_list("optimum_quanto", "accelerate")
 extras["torchao"] = deps_list("torchao", "accelerate")
+extras["nvidia_modelopt"] = deps_list("nvidia_modelopt[hf]")

 if os.name == "nt":  # windows
     extras["flax"] = []  # jax is not supported on windows

‎src/diffusers/__init__.py‎

Lines changed: 21 additions & 0 deletions
@@ -13,6 +13,7 @@
     is_k_diffusion_available,
     is_librosa_available,
     is_note_seq_available,
+    is_nvidia_modelopt_available,
     is_onnx_available,
     is_opencv_available,
     is_optimum_quanto_available,
@@ -111,6 +112,18 @@
 else:
     _import_structure["quantizers.quantization_config"].append("QuantoConfig")

+try:
+    if not is_torch_available() and not is_accelerate_available() and not is_nvidia_modelopt_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    from .utils import dummy_nvidia_modelopt_objects
+
+    _import_structure["utils.dummy_nvidia_modelopt_objects"] = [
+        name for name in dir(dummy_nvidia_modelopt_objects) if not name.startswith("_")
+    ]
+else:
+    _import_structure["quantizers.quantization_config"].append("NVIDIAModelOptConfig")
+
 try:
     if not is_onnx_available():
         raise OptionalDependencyNotAvailable()
@@ -795,6 +808,14 @@
     else:
         from .quantizers.quantization_config import QuantoConfig

+    try:
+        if not is_nvidia_modelopt_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        from .utils.dummy_nvidia_modelopt_objects import *
+    else:
+        from .quantizers.quantization_config import NVIDIAModelOptConfig
+
     try:
         if not is_onnx_available():
             raise OptionalDependencyNotAvailable()
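
The try/except/else guard in the hunks above follows a common optional-dependency pattern: export the real object when the backend is importable, otherwise export a placeholder that fails with a clear message on use. The standalone sketch below shows the general idea only; `is_available`, `make_dummy`, and `resolve` are hypothetical names, and the sketch treats a feature as available only when every listed package can be found, which is an assumption of this example rather than a statement about diffusers.

```python
import importlib.util


class OptionalDependencyNotAvailable(Exception):
    """Raised when an optional backend cannot be imported."""


def is_available(package: str) -> bool:
    """Check whether a top-level package can be found without importing it."""
    return importlib.util.find_spec(package) is not None


def make_dummy(name: str, backends: list[str]):
    """Build a placeholder class that fails with a helpful message when used."""

    def __init__(self, *args, **kwargs):
        raise ImportError(f"{name} requires the following packages: {', '.join(backends)}")

    return type(name, (), {"__init__": __init__})


def resolve(name: str, backends: list[str], real_factory):
    """Return the real object if every backend is importable, else a dummy."""
    try:
        if not all(is_available(pkg) for pkg in backends):
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        return make_dummy(name, backends)
    else:
        return real_factory()


# "json" ships with Python; the second package name is deliberately bogus.
RealConfig = resolve("RealConfig", ["json"], lambda: dict)
MissingConfig = resolve("MissingConfig", ["json", "surely_not_installed_pkg_xyz"], lambda: dict)

print(RealConfig is dict)
try:
    MissingConfig()
except ImportError as err:
    print(err)
```

Deferring the failure to the point of use keeps `import` of the top-level package cheap and lets users without the optional backend work with everything else.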

‎src/diffusers/dependency_versions_table.py‎

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@
     "gguf": "gguf>=0.10.0",
     "torchao": "torchao>=0.7.0",
     "bitsandbytes": "bitsandbytes>=0.43.3",
+    "nvidia_modelopt[hf]": "nvidia_modelopt[hf]>=0.33.1",
     "regex": "regex!=2019.12.17",
     "requests": "requests",
     "tensorboard": "tensorboard",

‎src/diffusers/quantizers/auto.py‎

Lines changed: 7 additions & 0 deletions
@@ -21,9 +21,11 @@

 from .bitsandbytes import BnB4BitDiffusersQuantizer, BnB8BitDiffusersQuantizer
 from .gguf import GGUFQuantizer
+from .modelopt import NVIDIAModelOptQuantizer
 from .quantization_config import (
     BitsAndBytesConfig,
     GGUFQuantizationConfig,
+    NVIDIAModelOptConfig,
     QuantizationConfigMixin,
     QuantizationMethod,
     QuantoConfig,
@@ -39,6 +41,7 @@
     "gguf": GGUFQuantizer,
     "quanto": QuantoQuantizer,
     "torchao": TorchAoHfQuantizer,
+    "modelopt": NVIDIAModelOptQuantizer,
 }

 AUTO_QUANTIZATION_CONFIG_MAPPING = {
@@ -47,6 +50,7 @@
     "gguf": GGUFQuantizationConfig,
     "quanto": QuantoConfig,
     "torchao": TorchAoConfig,
+    "modelopt": NVIDIAModelOptConfig,
 }


@@ -137,6 +141,9 @@ def merge_quantization_configs(
         if isinstance(quantization_config, dict):
             quantization_config = cls.from_dict(quantization_config)

+        if isinstance(quantization_config, NVIDIAModelOptConfig):
+            quantization_config.check_model_patching()
+
         if warning_msg != "":
             warnings.warn(warning_msg)

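
The two dictionaries touched above register the new backend under the key `"modelopt"`, so a config can be selected by its quantization method name. The sketch below shows that kind of string-keyed dispatch in isolation. All class and function names here are hypothetical stand-ins, and the way the `quant_method` key is stripped before construction is an assumption of this example, not a description of the diffusers code.

```python
class BaseConfig:
    """Minimal stand-in for a quantization config (hypothetical)."""

    quant_method = None

    def __init__(self, **kwargs):
        self.options = kwargs

    @classmethod
    def from_dict(cls, config_dict):
        # Drop the dispatch key; everything else becomes an option.
        options = {k: v for k, v in config_dict.items() if k != "quant_method"}
        return cls(**options)


class ModelOptLikeConfig(BaseConfig):
    quant_method = "modelopt"


class QuantoLikeConfig(BaseConfig):
    quant_method = "quanto"


# Registry keyed by method name, mirroring the shape of the mappings above.
CONFIG_MAPPING = {
    "modelopt": ModelOptLikeConfig,
    "quanto": QuantoLikeConfig,
}


def config_from_dict(config_dict):
    """Pick the config class from the `quant_method` key and build it."""
    method = config_dict.get("quant_method")
    if method not in CONFIG_MAPPING:
        raise ValueError(
            f"Unknown quantization method {method!r}; expected one of {sorted(CONFIG_MAPPING)}"
        )
    return CONFIG_MAPPING[method].from_dict(config_dict)


config = config_from_dict({"quant_method": "modelopt", "quant_type": "FP8"})
print(type(config).__name__, config.options)
```

Adding a backend then amounts to one new class plus one new registry entry, which matches the small size of the hunks in this file.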

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
from .modelopt_quantizer import NVIDIAModelOptQuantizer
