[Quantization] Add TRT-ModelOpt as a Backend #11173
Conversation
@sayakpaul, would you mind giving this a quick look and sharing suggestions?
Thanks for getting started on this. I guess there is a problem here: NVIDIA/TensorRT-Model-Optimizer#165? Additionally, the API should have a TRTConfig in place of just a dict being the quantization config.
I think the problem has been fixed in the newest release; I just need to bump it up in the diffusers requirements. Also, we can do the following for passing a config class:
```python
from diffusers.quantizers.quantization_config import ModelOptConfig

quant_config = ModelOptConfig(quant_type="FP8_WO", modules_to_not_convert=["conv"])
model = SanaTransformer2DModel.from_pretrained(checkpoint, subfolder="transformer", quantization_config=quant_config, ...)
```
By TRTConfig, did you mean including the config classes from ModelOptimizer here?
We use names like BitsAndBytesConfig depending on the backend. See here:
https://github.com/huggingface/diffusers/blob/fb54499614f9603bfaa4c026202c5783841b3a80/src/diffusers/quantizers/quantization_config.py#L177C7-L177C25
So, in this case, we should be using TRTConfig or something similar.
> I think the problem has been fixed in the newest release; I just need to bump it up in the diffusers requirements.
Alright, let's try with the latest fixes then.
The newer version wasn't backward compatible, hence the issues; I have fixed it.
Related to naming: the package name is nvidia_modelopt, hence ModelOpt, but I can make it TRTModelOpt if you'd like?
Doesn't it have any reliance on tensorrt?
No, it doesn't; we can use TRT to compile the quantized model.
> No, it doesn't; we can use TRT to compile the quantized model.
Could you elaborate what you mean by this?
This looks nice. Could you demonstrate some memory savings and any speedups when using modelopt, please? We can then add tests, docs, etc.
> Could you elaborate what you mean by this?
Yeah, so for quantizing the model we don't use TensorRT, but once the model is quantized we can compile it using TensorRT (a rough sketch of that path follows below).
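For illustration only (this is not code from the PR, and the exact deployment path may differ): one common way to hand a quantized PyTorch module to TensorRT is to export it to ONNX and build an engine offline. The checkpoint, config, and input shapes are borrowed from the Sana benchmark script below (with the dtype switched to float16 for export friendliness); the file names are hypothetical, and whether the quantized graph exports cleanly depends on the modelopt/TensorRT versions (modelopt also ships its own export utilities).

```python
import torch
from diffusers import SanaTransformer2DModel
from diffusers.quantizers.quantization_config import NVIDIAModelOptConfig

# On-the-fly FP8 quantization through the modelopt backend (settings borrowed from the benchmark script below)
quant_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
quant_model = SanaTransformer2DModel.from_pretrained(
    "Efficient-Large-Model/Sana_600M_1024px_diffusers",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical deployment step: export the quantized module to ONNX ...
dummy_inputs = (
    torch.randn((2, 32, 32, 32), dtype=torch.float16, device="cuda"),
    torch.randn((2, 10, 300, 2304), dtype=torch.float16, device="cuda"),
    torch.Tensor([0, 0]).to("cuda"),
)
torch.onnx.export(quant_model, dummy_inputs, "transformer.onnx", opset_version=17)

# ... and then build a TensorRT engine from it, e.g. with the trtexec CLI:
#   trtexec --onnx=transformer.onnx --saveEngine=transformer.plan
```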
💾 Model & Inference Memory (in MB)
Following is the code:

```python
import torch
from tqdm import tqdm

from diffusers import SanaTransformer2DModel, SD3Transformer2DModel, FluxTransformer2DModel
from diffusers.quantizers.quantization_config import NVIDIAModelOptConfig

checkpoint = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
model_cls = SanaTransformer2DModel
# checkpoint = "stabilityai/stable-diffusion-3-medium-diffusers"
# model_cls = SD3Transformer2DModel
# checkpoint = "black-forest-labs/FLUX.1-dev"
# model_cls = FluxTransformer2DModel

input = lambda: (
    torch.randn((2, 32, 32, 32), dtype=torch.bfloat16).to('cuda'),
    torch.randn((2, 10, 300, 2304), dtype=torch.bfloat16).to('cuda'),
    torch.Tensor([0, 0]).to('cuda'),
)
# input = lambda: (torch.randn((1, 16, 96, 96), dtype=torch.bfloat16).to('cuda'), torch.randn((1, 300, 4096), dtype=torch.bfloat16).to('cuda'), torch.randn((1, 2048), dtype=torch.bfloat16).to('cuda'), torch.Tensor([0]).to('cuda'))
# input = lambda: (torch.randn((1, 1024, 64), dtype=torch.bfloat16).to('cuda'), torch.randn((1, 300, 4096), dtype=torch.bfloat16).to('cuda'), torch.randn((1, 768), dtype=torch.bfloat16).to('cuda'), torch.Tensor([0]).to('cuda'), torch.randn((300, 3)).to('cuda'), torch.randn((1024, 3)).to('cuda'), torch.Tensor([0]).to('cuda'))

quant_config_fp8 = {"quant_type": "FP8", "quant_method": "modelopt"}
quant_config_int4 = {"quant_type": "INT4", "quant_method": "modelopt", "block_quantize": 128, "channel_quantize": -1}
quant_config_nvfp4 = {"quant_type": "NVFP4", "quant_method": "modelopt", "block_quantize": 128, "channel_quantize": -1, "modules_to_not_convert": ["conv"]}


def test_quantization(config, checkpoint, model_cls):
    quant_config = NVIDIAModelOptConfig(**config)
    print(quant_config.get_config_from_quant_type())
    quant_model = model_cls.from_pretrained(checkpoint, subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.bfloat16, device_map="balanced").to('cuda')
    print(f"Quant {config['quant_type']} Model Memory Footprint: ", quant_model.get_memory_footprint() / 1e6)
    return quant_model


def test_quant_inference(model, input, iter=10):
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    inference_memory = 0
    for _ in tqdm(range(iter)):
        with torch.no_grad():
            output = model(*input())
        inference_memory += torch.cuda.max_memory_allocated()
    inference_memory /= iter
    print("Inference Memory: ", inference_memory / 1e6)


test_quant_inference(test_quantization(quant_config_fp8, checkpoint, model_cls), input)
# test_quant_inference(test_quantization(quant_config_int4, checkpoint, model_cls), input)
# test_quant_inference(test_quantization(quant_config_nvfp4, checkpoint, model_cls), input)
# test_quant_inference(model_cls.from_pretrained(checkpoint, subfolder="transformer", torch_dtype=torch.bfloat16).to('cuda'), input)
```

Speed Ups

There is no significant speedup between the different quantizations because internally modelopt still uses high-precision arithmetic (float32).

Sorry for being a bit late on this, @sayakpaul, let me know the next steps!
@ishan-modi let us know if this is ready to be reviewed.
@sayakpaul, I think it is ready for preliminary review; on-the-fly quantization works fine. But loading pre-quantized models errors out, and that will be fixed in the next release here (early May) by the NVIDIA team.
@jingyu-ml, just so that you are in the loop
Looking good so far!
~~Could you also demonstrate some memory and timing numbers with the modelopt toolkit and some visual results?~~
No need, just saw #11173 (comment). But it doesn't measure the inference memory, which is usually done via torch.cuda.max_memory_allocated() (a minimal sketch of that measurement follows after this comment). Could we also see those numbers? Would it be possible to make it clear in the PR description that:
> on-the-fly quantization works fine. But loading pre-quantized models errors out and will be fixed in the next release NVIDIA/TensorRT-Model-Optimizer#185 (early May) by the NVIDIA team.
@jingyu-ml is it expected to not see any speedups in latency?
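For reference, this is roughly what such a measurement looks like (a minimal sketch; model and inputs are hypothetical placeholders, and this is not the PR's benchmark code):

```python
import torch

def peak_inference_memory_mb(model, inputs):
    # Reset CUDA's peak-memory counter, run one forward pass, and report the peak in MB.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(*inputs)
    return torch.cuda.max_memory_allocated() / 1e6
```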
Thanks for this! Just some nits; it could be nice to add this quantization scheme to transformers after this gets merged!
@ishan-modi just a quick question. Do we know if the nunchaku SVDQuant method is supported through modelopt? From https://github.com/NVIDIA/TensorRT-Model-Optimizer?tab=readme-ov-file#quantization-examples-docs, it seems like it is supported. But could you confirm?
@sayakpaul, yes, modelopt does support SVDQuant, but in this integration we only support min-max based calibration (see here). I think we should iteratively add advanced quantizations like svd_quant and awq once we have the base going; let me know if you think otherwise.
That's fine. I wanted to ask because if we can support svd_quant through our modelopt backend, I am happy to drop #12207. Hence I wanted to check.
Will merge after @DN6 has had a chance to review. @ishan-modi can we also include a note in the docs that just performing the conversion step with modelopt won't lead to speed improvements (as pointed out here)?
@realAsma @jingyu-ml after this PR is merged, we could plan to write a post/guide on how to take a modelopt-converted diffusers pipeline and use it in deployment settings to realize the actual speed gains.
Excellent work @ishan-modi 👍🏽 Thank you 🙏🏽
@ishan-modi can we fix the remaining CI problems? Then we should be good to go.
@sayakpaul, should be fixed now.
Congratulations on shipping this thing, @ishan-modi! Thank you!
Let's maybe now focus on the following things to maximize the potential impact:
- SVDQuant Support
- Guide to actually benefit from speedups
Happy to help.
What does this PR do?
WIP, aimed at adding a new backend for quantization #11032. For now, this PR only works for on-the-fly quantization. Loading pre-quantized models errors out, and that is to be fixed by the NVIDIA team in the next release (early May).
Depends on
- ~~this to support latest diffusers~~
- ~~this to enable INT8 quantization~~
- ~~this to enable NF4 quantization~~

Code
Following is a discussion on speedups while using real_quant with the NVIDIA team here.
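For quick reference, these are the configuration variants exercised in the benchmark script earlier in the thread (values copied from there; treat this as an illustrative sketch rather than the finalized API surface):

```python
from diffusers.quantizers.quantization_config import NVIDIAModelOptConfig

# FP8 weight quantization with default settings
fp8_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")

# INT4 with block-wise quantization
int4_config = NVIDIAModelOptConfig(quant_type="INT4", quant_method="modelopt", block_quantize=128, channel_quantize=-1)

# NVFP4, skipping conv modules
nvfp4_config = NVIDIAModelOptConfig(
    quant_type="NVFP4",
    quant_method="modelopt",
    block_quantize=128,
    channel_quantize=-1,
    modules_to_not_convert=["conv"],
)
```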