Releases: NVIDIA-NeMo/Automodel
NVIDIA NeMo-Automodel 0.4.0
b651aa8 Release Notes
-
Highlights
- Expanded VLM line-up: Gemma 4, Mistral 4, Qwen3.5 VL
- Diffusion and discrete-diffusion LLM (new tracks)
- NeMo Retriever – bi-encoder + cross-encoder / reranker
- Knowledge Distillation scaled to TP > 1 and PP (Sepehr Sameni)
- MoE infrastructure deepening – UCCL-EP, HybridEP, grouped_mm
- SkyPilot launcher backend (Aditya Saxena, community)
- End-to-end checkpoint + convergence robustness framework
-
Model Support – newly supported families in r0.4.0
- LLM
- VLM / OMNI
- Diffusion
- Flux T2I, Hunyuan T2V, Wan 2.1 T2V (see "Diffusion" section)
- Discrete diffusion LLM
- LLaDA (see "Discrete Diffusion LLM" section)
-
Diffusion – new track in r0.4.0
- HuggingFace Diffuser integration
- r0.4.0 ships full pretrain / finetune / generate pipelines with LoRA support for diffusion models (T2V, T2I)
- Wan integrated with multi-resolution DataLoader (#1475)
- Inference utility for diffusion (#1491)
- LoRA for diffusion (#1653, Linnan Wang)
- Diffusion processor registry (#1379)
- Models / recipes shipped
- Flux T2I – pretrain, SFT, LoRA, generate
- Hunyuan T2V – SFT, LoRA, generate
- Wan 2.1 T2V – pretrain, SFT, LoRA, generate
- Documentation guides for dataset preprocessing and finetuning.
-
Discrete Diffusion LLM (dLLM) – new track in r0.4.0
-
NeMo Retriever (bi-encoder + cross-encoder)
- Refactored cross-encoder / reranker training loop (new in r0.4.0) — (#1449).
- Bi-encoder datasets can be loaded directly from the HuggingFace Hub (#1380)
- Bi-encoder masking + consistent attn_implementation default (#1349)
- Resolve retrieval dataset corpus paths relative to training file (#1367)
- Docs: docs/guides/retrieval/finetune.md
-
Knowledge Distillation — Sepehr Sameni
-
Parallelism / Performance / Train-loop
- FSDP2
- FSDP2 weight prefetching + async TP optimization (#1711)
- Context Parallel
- Pipeline Parallel
- Activation checkpointing
- Gradient_checkpointing overhead reduction in transformers 5.3 (#1621, Yuki Huang)
- MoE infrastructure
- UCCL-EP alternative dispatcher (#1635 – Zhiqi Li & Hemil Desai)
- HybridEP (#1333, #1666)
- DeepEP-on-H100 RDMA fallback detection (#1275 — Piotr Żelasko)
- torch._grouped_mm expert backend (#1228)
- TE FusedAdam QuantizedTensor compatibility patch (#1417)
- MoE LoRA rank scaling + torch_mm path (#1300, #1392)
- Expert / diversity metrics (#1232, #1506), top-k utilization (#1418)
- Packed sequences for MoE with EP+PP (#1685)
- FlashOptim integration (#1492)
- Scheduler-driven python GC (#1391)
- fp32 RMSNorm backend + cast_model_to_dtype for improved stability (#1493)
- Native Comet ML experiment tracking (#1411, Logan Vegna, community)
- Added .generate() with KV-cache for Nemotron v3 (#1332, Piotr Żelasko)
- Added output_hidden_states for NemotronHForCausalLM (#1386, Desh Raj)
-
Launcher & CLI
-
Checkpoint and convergence robustness
- Checkpointing: End-to-end finetune → vLLM-deploy testing (#1606)
- Models covered:
- Gemma 3
- Nemotron (Flash 1B, Super v3, Nano 9B, Nano v3)
- Phi 4, Llama 3.2, Qwen 2.5
- Qwen 3 MoE, GPT-OSS.
- What this catches: prediction divergence, packaging gaps, vLLM loading issues.
- Convergence harness (#1554, #1577, #1602)
- Pipeline: Tulu-3 data prep → model verification → training → eval
- Models covered:
- GPT-OSS 20B (FlashAdamW + TE FusedAdam).
- Moonlight 16B (3 configs incl. EP8+CP2).
- Qwen3 4B (3 configs incl. CP1/CP2 variants).
- Qwen3 MoE 30B (2 configs + experiments/).
-
Datasets
- Neat packing (greedy knapsack) for LLM and VLM (#1485 – Zhiqi Li)
- Pretokenization support for VLM (Zhiqi Li)
- MultiImage dataset support for Qwen family (Zhiqi Li)
- Qwen family video training support (Zhiqi Li)
- LengthGroupedSampler (#1618 – Zhiqi Li)
- Chat datasets THD/BSHD + CP, padding fixes (#1416).
- reasoning_content + tool-calling support in ChatDataset (#1644, Zeel Desai, community).
- Custom chat_template override for VLM finetuning (#1525, Bambuuai, community).
- NEFTune noisy embeddings (#1686, stanley1208, community).
- JSONL malformed-line skip (#1694, Somshubra Majumdar).
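To illustrate the idea behind greedy sequence packing mentioned in the list above, here is a minimal sketch. It uses first-fit-decreasing, a common greedy heuristic, and is not taken from the NeMo Automodel code base; the function name `greedy_pack` and its parameters are hypothetical and chosen only for this example.

```python
def greedy_pack(lengths, max_len):
    """Group sample indices into packs whose total token count fits max_len.

    First-fit-decreasing heuristic: visit samples from longest to shortest
    and place each one into the first pack that still has room.
    (Hypothetical helper for illustration, not the library's implementation.)
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs = []      # packs[p] is a list of sample indices
    remaining = []  # remaining[p] is the free token budget of packs[p]
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, more than max_len={max_len}")
        for p, free in enumerate(remaining):
            if n <= free:
                packs[p].append(i)
                remaining[p] -= n
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([i])
            remaining.append(max_len - n)
    return packs


lengths = [5, 3, 8, 2, 4, 6]
packs = greedy_pack(lengths, max_len=10)
print(packs)
```

With these six samples the 28 tokens fit into three packs of at most 10 tokens each, instead of six padded rows. A real implementation also has to build per-pack attention masks and position IDs so that packed samples do not attend to each other.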
-
Documentation
-
Contributions – we are grateful for all contributions 🙇
- Khazzz1c
- optimized resolve_yaml_env_vars from scanning runtime data in instantiate() (#1827)
- additional contributions in r0.5.0.
- Logan Vegna: added native Comet ML experiment tracking support (#1411).
- Harsha Pasham: fixed error with aten::equal operator on meta tensors (#1769).
- Aditya Saxena: added SkyPilot support (#1590).
- SwekeR-463:
- stanley1208
- Zeel Desai
- Added reasoning_content and tool-calling support to ChatDataset (#1644).
- Additional contributions in the next release.
- Bambuuai: enabled custom chat_template override for VLM fine-tuning (#1525).
- Zakir Jiwani: Fixed instantiation issue in yaml parsing (issue #1496) (#1654).
-
Known Issues
- Minor memory regression in cohere_command_r_7b_hellaswag_fp8 and glm_4_9b_chat_hf_hellaswag_fp8
- Qwen3_5_4b_neat_packing hangs during checkpoint saving
- MegatronFSDP support postponed to 26.06
- About 2% of checkpoint loads currently exercise a less-optimized path; this is being addressed in follow-up work.
Changelog Details
- refactor: extract initialize_model_weights from load_base_model by @hemildesai :: PR: #1356
- fix: prefer moe_config for num_experts in apply_ac by @hemildesai :: PR: #1361
- fix: FSDP pre-shard combined projections on dim 1 for Qwen2.5-7B support by @ZhiyuLi-Nvidia :: PR: #1357
- ci: Update release workflow to include changelog and docs by @chtruong814 :: PR: #1320
- feat: Add .generate() function with KV cache support for Nemotron v3 by @pzelasko :: PR: #1332
- fix: loss masking with pad eos collision by @akoumpa :: PR: #1338
- feat: add Qwen3.5 35b by @HuiyingLi :: PR: #1373
- feat: refactor retriever code by @adil-a :: PR: #1166
- fix: resolve retrieval dataset corpus paths relative to training file by @oliverholworthy :: PR: #1367
- docs: Replace latest docs with nightly by @chtruong814 :: PR: #1358
- fix: EP collective deadlock with variable-length token counts by @ShiftyBlock :: PR: #1365
- fix: guard AutoConfig.from_pretrained in PP mask precomputation by @hemildesai :: PR: #1378
- docs: fix broken links across documentation guides by @chenopis :: PR: #1374
- fix: Handle check_model_inputs removal in transformers 5.2.0 by @oliverholworthy :: PR: #1369
- fix: coverage for customizer_retrieval tests by @akoumpa :: PR: #1382
- docs: add nano-v3 full sft benchmarks by @adil-a :: PR: #1387
- docs: Added installation guidance by @onel :: PR: #1371
- docs: update readme and docs by @akoumpa :: PR: #1370
- feat: make MoE parallelizer mixed precision policy configurable via recipes by @hemildesai :: PR: #1392
- ci: Add-credentials-for-docs by @ko3n1g :: PR: #1389
- feat: add pp_seq_len field to PipelineConfig by @hemildesai :: PR: #1390
- feat: add onnx export for biencoder by @akoumpa :: PR: #1276
- feat: add scheduler-driven manual garbage collection across recipes by @hemildesai :: PR: #1391
- fix: skip instantiation of nested configs overridden by kwargs in ConfigNode by @oliverholworthy :: PR: #1397
- fix: MoE lora adapter layout by @akoumpa :: PR: #1395
- fix: update GLM 4.7 Flash TE DeepEP finetuning config by @hemildesai :: PR: #1401
- fix: read rope config from rope_parameters across all models by @hemildesai :: PR: #1400
- docs: Ensure all docs updates from main are nightly by @chtruong814 :: PR: #1402
- feat: add output_hidden_states support to NemotronHForCausalLM by @desh2608 :: PR: #1386
- refactor: use auto_map for faster init by @akoumpa :: PR: #1405
- feat: allow disabling top-k expert utilization logging in MoE metrics by @hemildesai :: PR: #1418
- feat: add TE FusedAdam QuantizedTensor compatibility patch...
Contributors
- @Separius
- @oliverholworthy
- @HuiyingLi
- @onel
- @titu1994
- @krishnakalyan3
- @desh2608
- @zpqiu
- @terrykong
- @hemildesai
- @thomasdhc
- @pzelasko
- @ko3n1g
- @chenopis
- @adil-a
- @athitten
- @yuki-97
- @stanley1208
- @ShiftyBlock
- @zeel2104
- @aasthajh
- @JiwaniZakir
- @SwekeR-463
- @Anakintano
- @RayenTian
- @Bambuuai
- @davidoneilai
- @akoumpa
- @LoganVegnaSHOP
- @chtruong814
- @ZhiyuLi-Nvidia
- @pthombre
- @svcnvidia-nemo-ci
- @zyzhou5
NVIDIA NeMo-Automodel 0.3.0
9e9472f Release Notes
- Hugging Face Transformers v5
- Upgraded to Transformers v5 with a new device-mesh-only model initialization API
- Drop-in API compatibility: NeMoAutoModelForCausalLM, NeMoAutoModelForImageTextToText, NeMoAutoModelForSequenceClassification, and NeMoAutoTokenizer mirror the standard Transformers Auto* APIs
- Model Support
- LLM
- DeepSeek V3.2
- Step 3.5 Flash
- MiniMax M2
- Nemotron-3-Nano v3 (30B-A3B)
- Nemotron Flash 1B
- GLM 4.7, GLM 4.7 Flash
- Devstral-Small-2-24B
- FunctionGemma (tool-calling)
- Ministral3 (3B, 8B, 14B)
- VLM & OMNI
- Kimi-VL-A3B
- Kimi K2.5 VL
- Nemotron Parse v1.1
- Qwen3 VL MoE (30B, 235B)
- Ministral3 VLM (3B, 8B, 14B)
- Embedding & Retrieval
- NeMo Biencoder training pipeline with Llama-Embed-Nemotron-8B support
- Hard negative mining for retrieval training
- PEFT
- DoRA (Weight-Decomposed Low-Rank Adaptation)
- LoRA for MoE models (DeepSeek MoE, Qwen MoE)
- LoRA support for Biencoder
- Parallelism
- Pipeline parallelism for VLMs
- GroupedExpertsTE backend (prerequisite for MoE FP8)
- TE RoPE fusion for custom MoE models
- Norm fusion and RoPE cache for dense models
- Dataset support for
- VLM multi-turn chat
- Inline text dataset format for retrieval
- Databricks DeltaLake streaming dataset
- Parquet file support for Megatron dataset preprocessing
- xLAM tool-calling dataset
- Answer-only masking in ColumnMappedDataset
- Checkpointing & logging
- Streaming safetensors writer for reduced peak memory during checkpoint saving
- Explicit restore_from for checkpoint loading (replaces auto-loading behavior)
- Checkpoint custom model code files alongside weights
- Configurable remote logging frequency via step_scheduler
- Optimizers
- Dion optimizer (Muon/orthogonal family)
- Performance
- Faster FP8 dequant kernels for DeepSeek V3
- Meta device initialization enabled by default for reduced peak memory during model setup
- Combine projection refactor for dense models
- SDPA as default attention backend when FlashAttention is unavailable
- Misc Infrastructure
- Databricks integration (DeltaLake datasets, Unity Catalog checkpointing, DBFS consolidation)
- Nsys profiling support with model layer name scoping
- Environment variable dereferencing in YAML configs
- Improved import time
- Resolved from 0.2.0
- MoE perf regression with DeepSeek V3 (resolved via faster FP8 dequant kernels and GroupedExpertsTE backend)
- PEFT (LoRA) support for MoE models (now available)
- Validation for packed sequences with TE attention (fixed)
- Validation support for pipeline parallelism (added)
- Known Issues
- Qwen3-next is unsupported on Blackwell because FLA lacks support for it.
- TransformerEngine's Fused Adam does not work with DTensor; this is resolved in the upcoming version.
- LoRA with TE backend is not supported
Community Contributions
- We gratefully acknowledge the following contributions from the OSS community:
- @onel (Andrei Onel) – Founder of @askmanu, Dublin
- @ooooo-create – Community contributor (PaddlePaddle ecosystem)
- @Sparlitu – Community contributor
- fix: leave num_epochs unset if max_steps is specified (#1107)
- @yuhezhang-ai (Yuhe Zhang) – Engineer at Polarr
- @therealnaveenkamal (Naveenraj Kamalakannan) – NYU graduate student
- feat: Implement DoRA (#1150)
- @dongs0104 (Dong Shin) – Samsung Research, Samsung Electronics
- fix: resolving errors in the hf decorator function (#983)
- @jbross-ibm-research (Juergen Bross) – IBM Research
Changelog Details
- feat: auto detect base weights dequant by @adil-a :: PR: #796
- fix: Raise informative error in MegatronPretraining if no glob paths found by @jrbourbeau :: PR: #803
- feat: enable meta device by default by @adil-a :: PR: #797
- feat: checkpoint custom model code files by @adil-a :: PR: #810
- feat: qwen3 vl 30b ootb recipe by @HuiyingLi :: PR: #808
- fix: Include megatron Makefile in package data by @jrbourbeau :: PR: #798
- perf: add qwen2.5 32b lora perf by @ZhiyuLi-Nvidia :: PR: #802
- feat: Add NeMo Biencoder by @meatybobby :: PR: #745
- ci: Add mamba-ssm and causal-conv1d dep by @thomasdhc :: PR: #811
- fix: torchrun single proc by @akoumpa :: PR: #814
- feat: refactor model init kwargs + whitelist NVIDIA models by @adil-a :: PR: #809
- fix: Set max_steps config option for nanogpt pretraining by @jrbourbeau :: PR: #817
- fix: ep shard state dict conversion by @adil-a :: PR: #815
- fix: Cast norm to fp32 in clip_grad_norm by @hemildesai :: PR: #825
- feat: add answer only masking in ColumnMappedDataset by @adil-a :: PR: #832
- feat: add internvl recipe by @HuiyingLi :: PR: #823
- fix: adding safety during model init by @adil-a :: PR: #833
- build: Add OSS NOTICES.txt file to docker build by @chtruong814 :: PR: #838
- ci: Build bitsandbytes from source by @thomasdhc :: PR: #837
- feat: ckpt val loss + run val at ckpt + symlink best ckpt by @anubhutivyas :: PR: #828
- fix: add num_nodes for alignment in benchmark recipe by @ZhiyuLi-Nvidia :: PR: #839
- docs: Update version and contrib by @thomasdhc :: PR: #849
- fix: deepseek v3 pretrain config parallelizer by @hemildesai :: PR: #851
- feat: sft qat support by @akoumpa :: PR: #704
- feat: combine projection refactor by @ZhiyuLi-Nvidia :: PR: #804
- fix: sequence classification metric and training bugs #780 by @yuhezhang-ai :: PR: #841
- fix: fix qwen3 omni config by @HuiyingLi :: PR: #855
- fix: revert recipe change for memory fragmentation OOM in Llama3 70B by @ZhiyuLi-Nvidia :: PR: #818
- fix: update moe finetuning configs to use from_pretrained by @hemildesai :: PR: #863
- feat: adding flags for special tokens & chat template in column mapped dataset by @adil-a :: PR: #844
- ci: Add additional dep for model support by @thomasdhc :: PR: #861
- fix: remove validation for packed seq moe configs by @hemildesai :: PR: #867
- fix: test process launcher error propagation by @akoumpa :: PR: #871
- fix: no meta init when force_hf by @adil-a :: PR: #874
- feat: add glm 4.5 air finetuning config by @hemildesai :: PR: #873
- fix: NeMoAutoTokenizer by @akoumpa :: PR: #878
- feat: add custom implementation for qwen3vlmoe by @HuiyingLi :: PR: #843
- build: bumping timm version by @adil-a :: PR: #886
- fix: moving registry imports under a try catch block by @adil-a :: PR: #889
- fix: consolidate qwen3omni recipes by @HuiyingLi :: PR: #885
- fix: perf regressions for custom MoEs by @hemildesai :: PR: #881
- fix: update weight initialization method in LinearLoRA class by @RayenTian :: PR: #896
- ci: ci: Update changelog for Automodel 0.2.0 by @akoumpa :: PR: #894
- fix: support validation for packed sequences when using TE attention by @hemildesai :: PR: #892
- docs: Update nvidia-sphinx-theme by @chtruong814 :: PR: #906
- feat: configurable max clip grad by @akoumpa :: PR: #904
- feat: vlm multiturn chat support and dataset by @HuiyingLi :: PR: #899
- ci: Update transformers to latest version 4.57.3 by @thomasdhc :: PR: #890
- docs: Update VLM table by @akoumpa :: PR: #917
- ci: Initial PR template by @thomasdhc :: PR: #925
- ci: Bump to 0.2.0 by @thomasdhc :: PR: #927
- feat: update change log for r.0.2.0 by @akoumpa :: PR: #921
- ci: Reorganize optional dependency by @thomasdhc :: PR: #926
- feat: add ministral3 configs and improve tie_emb detection by @HuiyingLi :: PR: #915
- feat: port ministral3 to transformers v4 by @HuiyingLi :: PR: #934
- fix: mute ministral3 autodocstring warning by @HuiyingLi :: PR: #946
- fix: handle zero active experts for 1 ep rank in GroupedExperts by @hemildesai :: PR: #935
- fix: fix dataset load when split is not specified by @HuiyingLi :: PR: #943
- fix: torch buffer warning by @adil-a :: PR: #948
- docs: Fix images not rendering in docs by @jrbourbeau :: PR: #954
- feat: improve yaml logging to stdout by @akoumpa :: PR: #882
- fix: Biencoder consolidated checkpoint and transformers issue by @meatybobby :: PR: #936
- feat: Support for Llama-Embed-Nemotron-8B Training Pipeline by @ybabakhin :: PR: #963
- feat: nano v3 configs and FSDP fix by @adil-a :: PR: #964
- feat: add more PEFT lora recipes by @ZhiyuLi-Nvidia :: PR: #959
- ci: update owners by @akoumpa :: PR: #958
- feat: add nsys model layer name scope and benchmark support (with nsys) in app by @ZhiyuLi-Nvidia :: PR: #951
- docs: update vlm coverage by @akoumpa :: PR: #961
- docs: Update news section for nano-v3 in README.md by @snowmanwwg :: PR: #969
- feat: add nano-v3 to README by @adil-a :: PR: #978
- fix: move print_trainable_parameters calculation to device by @akoumpa :: PR: #966
- feat: simplify from_pretrained/from_config by @akoumpa :: PR: #967
- docs: Update LLM coverage table by @akoumpa :: PR: #982
- fix: misplaced parenthesis; Thanks @jbross-ibm-research by @akoumpa :: PR: #973
- feat: add xlam toolcall dataset by @HuiyingLi :: PR: #975
- feat: add functiongemma yaml by @HuiyingLi :: PR: #985
- docs: functiongemma docs by @HuiyingLi :: PR: #986
- feat: allow passing model-id to from_config by @akoumpa :: PR: #984
- feat: add support for parquet files by @akoumpa :: PR: #919
- docs: update readme with FunctionGemma by @Huiyin...
Contributors
- @oliverholworthy
- @HuiyingLi
- @onel
- @radekosmulski
- @yuhezhang-ai
- @hemildesai
- @thomasdhc
- @meatybobby
- @jrbourbeau
- @roclark
- @dongs0104
- @ybabakhin
- @jbross-ibm-research
- @anubhutivyas
- @adil-a
- @Sparlitu
- @therealnaveenkamal
- @aschilling-nv
- @snowmanwwg
- @ooooo-create
- @RayenTian
- @shan-nvidia
- @akoumpa
- @askmanu
- @chtruong814
- @ZhiyuLi-Nvidia
- @svcnvidia-nemo-ci
NVIDIA NeMo-Automodel 0.2.0
0be83ba - Fast Model Implementations
- LLM
- GPT-OSS 20B and 120B
- Qwen3 next and Qwen3-235B
- GLM-4.5-344BA32B, GLM-4.6, GLM-4.5-Air
- VLM & OMNI
- Qwen3-vl
- Qwen2-5-vl
- Qwen3-omni-30b-a3b
- Intern-vl-4B (ootb)
- Parallelism
- Improved support for CP and sequence packing with MoE models
- Optimized TP plan for LoRA
- Dataset support for
- Single-turn tool calling
- Multi-turn tool calling
- Streaming dataset
- Chat dataset with OpenAI format
- Improved support for truncation/padding
- Checkpointing & logging
- Support for asynchronous checkpointing with DCP
- Symbolic links (LATEST, LOWEST_VAL) pointing to the latest and lowest validation score checkpoints
- MLFlow support
- Task support
- QAT for SFT
- Sequence classification
- Known issues
- Minor perf regression with DSv3
- Sequence parallel plan incorrect for Qwen3
- Support for GPT-OSS 120B with DeepEP will be included in the next patch release
- Validation is not functional for custom models with TE when using packed sequence and pipeline parallel size of 1
- Limitations
- PEFT (LoRA) support for MoE models is scheduled for the 26.02 release
- For non-MoE models, CP support requires the model leveraging the PyTorch SDPA API
NeMo-Automodel 25.11 Container
The 0.2.0 release is also included in the NeMo Automodel 25.11 container on NGC at https://registry.ngc.nvidia.com/orgs/nvidia/containers/nemo-automodel.
Here are the major software components included in the container:
| Software Component | Version |
|---|---|
| CUDA | 13.0 |
| cuDNN | 9.13.0.50-1 |
| PyTorch | 2.9.0a0 |
| NeMo-Automodel | 0.2.0 |
| Transformer Engine | 2.8.0 |
| Transformers | 4.57.1 |
NVIDIA NeMo-Automodel 0.1.2
-
Features:
- Included support for limiting the number of samples with the ColumnMappedDataset
-
Bug Fixes (step scheduler):
- Switched to zero-based indexing
- Epoch length accounts for accumulation steps
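The two step-scheduler fixes above can be pictured with a little arithmetic. The sketch below is one plausible reading, not the library's actual code: the function name `optimizer_steps_per_epoch` is hypothetical, and rounding a trailing partial accumulation window up to a full step is an assumption.

```python
import math


def optimizer_steps_per_epoch(num_micro_batches: int, grad_accum_steps: int) -> int:
    """Number of optimizer steps in one epoch when gradients are accumulated.

    Hypothetical helper: assumes a trailing partial accumulation window
    still triggers one optimizer step (hence the ceiling).
    """
    if grad_accum_steps < 1:
        raise ValueError("grad_accum_steps must be >= 1")
    return math.ceil(num_micro_batches / grad_accum_steps)


# 10 micro-batches with 4 accumulation steps give 3 optimizer steps per epoch.
steps = optimizer_steps_per_epoch(num_micro_batches=10, grad_accum_steps=4)

# With zero-based indexing, the steps of the first epoch are numbered 0..steps-1.
step_indices = list(range(steps))
print(steps, step_indices)
```

Counting epoch length in micro-batches instead of optimizer steps would overstate it by roughly the accumulation factor, which is the kind of mismatch the fix description suggests.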
NVIDIA NeMo-Automodel 0.1.0
7146809 New Features
- Pretraining support for
- Models under 40B with PyT FSDP2
- Larger models by applying PyT PP
- TP can also be used for models with a TP plan
- Large MOE via custom implementations
- Knowledge distillation for LLMs (requires same tokenizer)
- FP8 with torchao (requires torch.compile)
- Parallelism
- HSDP with FSDP2
- Auto Pipelining Support
- Checkpointing
- Pipeline support (load and save)
- Parallel load with meta device
- Data
- ColumnMapped Dataset for single-turn SFT
- Pretrain Data: Megatron-Core and Nano-gpt compatible data
- Performance https://docs.nvidia.com/nemo/automodel/latest/performance-summary.html
- Pretraining benchmark for Large MoE user-defined models
- Fast DeepSeek v3 implementation with DeepEP
- Megatron FSDP support
- Packed sequence support
- Triton kernels for LoRA
NVIDIA NeMo-Automodel 0.1.0rc0
d36402d Prerelease: NVIDIA NeMo-Automodel 0.1.0rc0 (2025-09-17)