NVIDIA Model Optimizer

NVIDIA Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques including quantization, pruning, Neural Architecture Search (NAS), distillation, speculative decoding and sparsity to accelerate models.

[Input] Model Optimizer currently accepts Hugging Face, PyTorch, or ONNX models as input.

[Optimize] Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint. Model Optimizer is also integrated with NVIDIA Megatron-Bridge, Megatron-LM and Hugging Face Accelerate for inference optimization techniques that require training.

[Export for deployment] Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like SGLang, TensorRT-LLM, TensorRT, or vLLM. The unified Hugging Face export API now supports both transformers and diffusers models.

Latest News

[2026/05/27] End-to-end optimization tutorial for Nemotron-3-Nano-30B-A3B: Pruning + distillation (with long context extension) + FP8 quantization, achieving higher vLLM throughput and reduced memory use.
[2026/05/13] Puzzletron: A new algorithm for heterogeneous pruning & NAS of LLM and VLM models.
[2026/04/15] Customer story: Domyn compresses Colosseum-355B → 260B using ModelOpt's Minitron pruning + distillation
[2026/03/17] Customer story: Bielik.AI builds Bielik Minitron 7B (33% smaller, 50% faster, 90% quality retained) using ModelOpt's Minitron pruning + distillation
[2026/03/11] Model Optimizer quantized Nemotron-3-Super checkpoints are available on Hugging Face for download: FP8, NVFP4. Learn more in the Nemotron 3 Super release blog. Check out how to quantize Nemotron 3 models for deployment acceleration here
[2026/03/11] NeMo Megatron Bridge now supports Nemotron-3-Super quantization (PTQ and QAT) and export workflows using the Model Optimizer library. See the Quantization (PTQ and QAT) guide for FP8/NVFP4 quantization and HF export instructions.
[2025/12/11] BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference
[2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.
[2025/10/07] BLOG: Pruning and Distilling LLMs Using NVIDIA Model Optimizer
[2025/09/17] BLOG: An Introduction to Speculative Decoding for Reducing Latency in AI Inference
[2025/09/11] BLOG: How Quantization Aware Training Enables Low-Precision Accuracy Recovery
[2025/08/29] BLOG: Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training
[2025/08/01] BLOG: Optimizing LLMs for Performance and Accuracy with Post-Training Quantization
[2025/06/24] BLOG: Introducing NVFP4 for Efficient and Accurate Low-Precision Inference
[2025/05/14] NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs
[2025/04/21] Adobe optimized deployment using Model-Optimizer + TensorRT leading to a 60% reduction in diffusion latency, a 40% reduction in total cost of ownership
[2025/04/05] NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick. Check out how to quantize Llama4 for deployment acceleration here
[2025/03/18] World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell
[2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: DeepSeek-R1-FP4, Llama-3.3-70B-Instruct-FP4, Llama-3.1-405B-Instruct-FP4
[2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ here.
[2025/01/28] Model Optimizer is now open source!

Previous News

[2024/10/23] Model Optimizer quantized FP8 Llama-3.1 Instruct models available on Hugging Face for download: 8B, 70B, 405B.
[2024/09/10] Post-Training Quantization of LLMs with NVIDIA NeMo and Model Optimizer.
[2024/08/28] Boosting Llama 3.1 405B Performance up to 44% with Model Optimizer on NVIDIA H200 GPUs
[2024/08/28] Up to 1.9X Higher Llama 3.1 Performance with Medusa
[2024/08/15] New features in recent releases: Cache Diffusion, QLoRA workflow with NVIDIA NeMo, and more. Check out our blog for details.
[2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow here
[2024/05/08] Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance
[2024/03/27] Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records
[2024/03/18] GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT
[2024/03/07] Model Optimizer's 8-bit Post-Training Quantization enables TensorRT to accelerate Stable Diffusion to nearly 2x faster
[2024/02/01] Speed up inference with Model Optimizer quantization techniques in TRT-LLM

Install

To install stable release packages for Model Optimizer with pip from PyPI:

pip install -U "nvidia-modelopt[all]"

Model Optimizer will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.

To install from source in editable mode with all development dependencies or to use the latest features, run:

# Clone the Model Optimizer repository
git clone git@github.com:NVIDIA/Model-Optimizer.git
cd Model-Optimizer
pip install -e ".[dev]"

You can also directly use NVIDIA container images, which have Model Optimizer pre-installed:

nvcr.io/nvidia/pytorch:<version>-py3
nvcr.io/nvidia/nemo:<version>
nvcr.io/nvidia/tensorrt-llm/release:<version>

Before pulling and using the container images, please review their respective license terms. Make sure to upgrade Model Optimizer to the latest version as described above. Visit our installation guide for finer-grained control over installed dependencies, or for alternative Docker images and environment variables to set up.
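
To check which Model Optimizer release is present in an environment (for example, inside one of the container images before upgrading), a minimal sketch using only the Python standard library might look like this. The distribution name `nvidia-modelopt` comes from the pip command above; the helper name `installed_version` is hypothetical and not part of Model Optimizer.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "nvidia-modelopt") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is an illustrative helper, not a Model Optimizer API.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("nvidia-modelopt is not installed in this environment")
    else:
        print(f"nvidia-modelopt {found} is installed")
```

Running this before and after `pip install -U` is one simple way to confirm that the upgrade took effect.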

Techniques

Technique	Description	Examples	Docs
Post Training Quantization	Compress model size by 2x-4x, speeding up inference while preserving model quality!	[LLMs] [diffusers] [VLMs] [onnx] [windows]	[docs]
Quantization Aware Training	Refine accuracy even further with a few training steps!	[Hugging Face]	[docs]
Pruning	Reduce your model size and accelerate inference by removing unnecessary weights!	[General] [Megatron-Bridge]
Distillation	Reduce deployment model size by teaching small models to behave like larger models!	[Megatron-Bridge] [Megatron-LM] [Hugging Face]	[docs]
Speculative Decoding	Train draft modules to predict extra tokens during inference!	[Megatron] [Hugging Face]	[docs]
Sparsity	Efficiently compress your model by storing only its non-zero parameter values and their locations	[PyTorch]	[docs]
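
The 2x-4x size reduction quoted for post-training quantization follows from simple bit-width arithmetic. The sketch below assumes a 16-bit baseline (FP16/BF16) and ignores the extra storage for scaling factors that real quantization formats carry, so actual savings are somewhat lower. The function names and the 8-billion-parameter model are hypothetical, chosen only for illustration.

```python
def weight_footprint_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter uses the same bit width and no
    per-block scale factors or metadata are counted.
    """
    return num_params * bits_per_weight / 8 / 1e9


def compression_ratio(baseline_bits: float, quantized_bits: float) -> float:
    """How many times smaller the quantized weights are than the baseline."""
    return baseline_bits / quantized_bits


if __name__ == "__main__":
    params = 8_000_000_000  # hypothetical 8B-parameter model
    for label, bits in [("16-bit baseline", 16), ("8-bit", 8), ("4-bit", 4)]:
        size = weight_footprint_gb(params, bits)
        ratio = compression_ratio(16, bits)
        print(f"{label}: {size:.1f} GB ({ratio:.0f}x smaller than baseline)")
```

Under these assumptions, going from 16-bit to 8-bit weights halves the footprint and going to 4-bit quarters it, which matches the range given in the table.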

Pre-Quantized Checkpoints

Ready-to-deploy checkpoints [🤗 Hugging Face - Nvidia Model Optimizer Collection]
Deployable on TensorRT-LLM, vLLM and SGLang
More models coming soon!

Resources

Model Support Matrix

Model Type	Support Matrix
LLM Quantization	View Support Matrix
Diffusers Quantization	View Support Matrix
VLM Quantization	View Support Matrix
ONNX Quantization	View Support Matrix
Windows Quantization	View Support Matrix
Quantization Aware Training	View Support Matrix
Pruning	View Support Matrix
Distillation	View Support Matrix
Speculative Decoding	View Support Matrix

Deprecation Policy

Model Optimizer follows a structured approach to managing deprecated features:

Communication: Deprecation notices are documented in the Changelog. Deprecated items include source code statements indicating deprecation timing, with runtime warnings issued upon use.
Migration Period: Since Model Optimizer is still pre-1.0, we provide a 1-release (~1-month) migration period after deprecation. During this window, deprecated features continue functioning while issuing warnings.
Scope: The policy addresses both complete deprecations (entire APIs removed) and partial ones (specific parameters removed while methods remain).
Removal: Following the migration period, deprecated elements are removed in alignment with semantic versioning standards, potentially including breaking changes in minor version updates while Model Optimizer remains in 0.x.
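
As an illustration of the communication step, here is a minimal sketch of how a deprecated function could state its removal timing in source and warn at runtime, using Python's standard `warnings` module. The `deprecated` decorator, the function names, and the version string `0.Y` are all hypothetical; this is not Model Optimizer's actual implementation.

```python
import functools
import warnings


def deprecated(removed_in: str, alternative: str):
    """Mark a function as deprecated (hypothetical helper, for illustration).

    The decorator arguments record the deprecation timing in source code,
    and each call to the wrapped function emits a DeprecationWarning.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated and will be removed in "
                f"{removed_in}; use {alternative} instead.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


def scale(x: int, factor: int = 2) -> int:
    """Replacement API (hypothetical)."""
    return x * factor


@deprecated(removed_in="0.Y", alternative="scale")
def old_scale(x: int) -> int:
    """Deprecated API that keeps working during the migration period."""
    return scale(x)
```

During the migration period the old function still returns the same result and only adds a warning, so callers can switch to the replacement at their own pace.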

Contributing

Model Optimizer is now open source! We welcome any feedback, feature requests and PRs. Please read our Contributing guidelines for details on how to contribute to this project.

AI Agents

For AI-assisted development setup, see the agent tooling notes.

Top Contributors

Happy optimizing!
