Releases: NVIDIA-NeMo/Automodel

NVIDIA NeMo-Automodel 0.4.0

28 Apr 20:04
@svcnvidia-nemo-ci svcnvidia-nemo-ci
b651aa8
This commit was created on GitHub.com and signed with GitHub’s verified signature.
GPG key ID: B5690EEEBB952194
Verified

Release Notes

  • Highlights

    • Expanded VLM line-up: Gemma 4, Mistral 4, Qwen3.5 VL
    • Diffusion and discrete-diffusion LLM (new tracks)
    • NeMo Retriever – bi-encoder + cross-encoder / reranker
    • Knowledge Distillation scaled to TP > 1 and PP (Sepehr Sameni)
    • MoE infrastructure deepening – UCCL-EP, HybridEP, grouped_mm
    • SkyPilot launcher backend (Aditya Saxena, community)
    • End-to-end checkpoint + convergence robustness framework
  • Model Support – newly supported families in r0.4.0

    • LLM
    • VLM / OMNI
      • Gemma 4 family – 2B, 4B, 31B, 26B-A4B MoE (#1658, #1660, #1731)
      • Mistral Small 4 (#1556)
      • Qwen3.5 VL dense – 4B, 9B (#1427)
      • Qwen3.5 VL MoE – 35B (#1373)
    • Diffusion
      • Flux T2I, Hunyuan T2V, Wan 2.1 T2V (see "Diffusion" section)
    • Discrete diffusion LLM
      • LLaDA (see "Discrete Diffusion LLM" section)
  • Diffusion – new track in r0.4.0

    • HuggingFace Diffuser integration
    • r0.4.0 ships full pretrain / finetune / generate pipelines with LoRA support for diffusion models (T2V, T2I)
    • Wan integrated with multi-resolution DataLoader (#1475)
    • Inference utility for diffusion (#1491)
    • LoRA for diffusion (#1653, Linnan Wang)
    • Diffusion processor registry (#1379)
    • Models / recipes shipped
      • Flux T2I – pretrain, SFT, LoRA, generate
      • Hunyuan T2V – SFT, LoRA, generate
      • Wan 2.1 T2V – pretrain, SFT, LoRA, generate
    • Documentation guides for dataset preprocessing and finetuning.
  • Discrete Diffusion LLM (dLLM) – new track in r0.4.0

    • Discrete diffusion LLM SFT support added (#1665)
    • LLaDA SFT recipe (#1672)
    • dLLM generation pipeline (#1692)
  • NeMo Retriever (bi-encoder + cross-encoder)

    • Refactored cross-encoder / reranker training loop (new in r0.4.0) — (#1449).
    • Bi-encoder datasets can be loaded directly from the HuggingFace Hub (#1380)
    • Bi-encoder masking + consistent attn_implementation default (#1349)
    • Resolve retrieval dataset corpus paths relative to training file (#1367)
    • Docs: docs/guides/retrieval/finetune.md
  • Knowledge Distillation — Sepehr Sameni

    • Enable TP > 1 in KD (#1297)
    • TP-aware KDLoss with distributed softmax + T² scaling (#1499)
    • Pipeline-parallelism support for KD (#1500)
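
For readers unfamiliar with temperature scaling in knowledge distillation, here is a minimal single-process sketch. It assumes the scaling mentioned for KDLoss is the usual temperature-squared factor applied to a KL divergence between softened teacher and student distributions. The names `softmax` and `kd_loss` are illustrative and are not the library's API; the TP-aware version would additionally reduce the softmax statistics across vocabulary shards, which is not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits at a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable across
    temperatures (assumed convention, not confirmed by the release notes).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


# Identical logits give zero loss; mismatched logits give a positive loss.
same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```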
  • Parallelism / Performance / Train-loop

    • FSDP2
      • FSDP2 weight prefetching + async TP optimization (#1711)
    • Context Parallel
      • Qwen3.5 dense & MoE CP (#1710, #1560 — alexchiu / Zhaopeng Qiu)
      • Mamba CP for hybrid Nemotron v3 (#1441)
      • 3D mRoPE position_ids sharding under CP (#1482)
      • CP attention-mask hooks for dense / non-TE (#1470)
    • Pipeline Parallel
      • PP shape-inference optimization + pp_seq_len field in PipelineConfig (#1195, #1390)
      • Variable length for PP (#1689 – Zhiqi Li & Hemil Desai)
    • Activation checkpointing
      • Gradient_checkpointing overhead reduction in transformers 5.3 (#1621 – Yuki Huang)
    • MoE infrastructure
      • UCCL-EP alternative dispatcher (#1635 – Zhiqi Li & Hemil Desai)
      • HybridEP (#1333, #1666)
      • DeepEP-on-H100 RDMA fallback detection (#1275 — Piotr Żelasko)
      • torch._grouped_mm expert backend (#1228)
      • TE FusedAdam QuantizedTensor compatibility patch (#1417)
      • MoE LoRA rank scaling + torch_mm path (#1300, #1392)
      • Expert / diversity metrics (#1232, #1506), top-k utilization (#1418)
      • Packed sequences for MoE with EP+PP (#1685)
    • FlashOptim integration (#1492)
    • Scheduler-driven python GC (#1391)
    • fp32 RMSNorm backend + cast_model_to_dtype for improved stability (#1493)
    • Native Comet ML experiment tracking (#1411, Logan Vegna, community)
    • Added .generate() with KV-cache for Nemotron v3 (#1332, Piotr Żelasko)
    • Added output_hidden_states for NemotronHForCausalLM (#1386, Desh Raj)
  • Launcher & CLI

    • SkyPilot backend (#1590 — Aditya Saxena, community contributor)
    • CLI app + launching refactor (#1406)
      • Shim scripts under examples/ will be deprecated post 26.04.
    • Launcher CLI flags no longer leak into recipe YAML overrides (#1766)
    • MFU logging in train recipes (#1413 — SwekeR, community)
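
As a rough illustration of what an MFU (model FLOPs utilization) figure expresses, the sketch below uses the common dense-transformer approximation of about 6 FLOPs per parameter per trained token. This is an assumption for illustration and not necessarily the formula used by the recipes in #1413; `estimate_mfu` and the peak-throughput value are hypothetical.

```python
def estimate_mfu(num_params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Return achieved FLOPs/s divided by aggregate peak FLOPs/s.

    Assumes ~6 FLOPs per parameter per token (forward + backward) for a
    dense transformer and ignores attention FLOPs, so it underestimates
    utilization for long sequences.
    """
    achieved_flops_per_second = 6.0 * num_params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical run: 8B parameters, 40k tokens/s in total across 8 GPUs,
# each with an assumed peak of 1e15 FLOPs/s.
mfu = estimate_mfu(8e9, 40_000, 8, 1e15)
print(f"MFU ~ {mfu:.1%}")
```

MoE models would need the active parameter count rather than the total, which is one reason a real implementation may differ from this sketch.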
  • Checkpoint and convergence robustness

    • Checkpointing: End-to-end finetune → vLLM-deploy testing (#1606)
      • Models covered:
        • Gemma 3
        • Nemotron (Flash 1B, Super v3, Nano 9B, Nano v3)
        • Phi 4, Llama 3.2, Qwen 2.5
        • Qwen 3 MoE, GPT-OSS.
      • What this catches: prediction divergence, packaging gaps, vLLM loading issues.
    • Convergence harness (#1554, #1577, #1602)
      • Pipeline: Tulu-3 data prep → model verification → training → eval
      • Models covered:
        • GPT-OSS 20B (FlashAdamW + TE FusedAdam).
        • Moonlight 16B (3 configs incl. EP8+CP2).
        • Qwen3 4B (3 configs incl. CP1/CP2 variants).
        • Qwen3 MoE 30B (2 configs + experiments/).
  • Datasets

    • Neat packing (greedy knapsack) for LLM and VLM (#1485 – Zhiqi Li)
    • Pretokenization support for VLM (Zhiqi Li)
    • MultiImage dataset support for Qwen family (Zhiqi Li)
    • Qwen family video training support (Zhiqi Li)
    • LengthGroupedSampler (#1618 – Zhiqi Li)
    • Chat datasets THD/BSHD + CP, padding fixes (#1416).
    • reasoning_content + tool-calling support in ChatDataset (#1644, Zeel Desai, community).
    • Custom chat_template override for VLM finetuning (#1525, Bambuuai, community).
    • NEFTune noisy embeddings (#1686, stanley1208, community).
    • JSONL malformed-line skip (#1694, Somshubra Majumdar).
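
To make the idea behind greedy sequence packing concrete, here is a small sketch of one common greedy heuristic (first-fit decreasing): sort samples by length, then place each one into the first pack that still has room. This is an assumed stand-in for illustration; the release notes do not describe the exact algorithm in #1485, and `greedy_pack` is a hypothetical name.

```python
def greedy_pack(lengths, max_len):
    """Group sample indices into packs whose total length is <= max_len."""
    packs = []      # each pack is a list of sample indices
    remaining = []  # free token budget left in each pack
    # Longest samples first, so short ones can fill the leftover gaps.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has length {n} > max_len {max_len}")
        for p, free in enumerate(remaining):
            if n <= free:
                packs[p].append(i)
                remaining[p] -= n
                break
        else:
            # No existing pack has room: open a new one.
            packs.append([i])
            remaining.append(max_len - n)
    return packs


lengths = [5, 3, 4, 2, 6]
packs = greedy_pack(lengths, max_len=8)
```

Packing this way reduces padding compared with one sample per row, at the cost of needing attention masks or position ids that keep packed samples from attending to each other.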
  • Documentation

    • Per-model coverage pages (#1683).
    • Diffusion docs (#1495).
    • Gemma 4 tutorial (#1657).
    • Nemotron Parse fine-tuning notebook + assets (#1655, Krishna Kalyan).
    • Finetune-process + container-usage docs (#1484, Krishna Kalyan).
    • MLflow/Databricks docs (#1170, Andrei Onel).
  • Contributions – we are grateful for all contributions 🙇

    • Khazzz1c
      • Optimized resolve_yaml_env_vars to avoid scanning runtime data in instantiate() (#1827)
      • Additional contributions in r0.5.0.
    • Logan Vegna: added native Comet ML experiment tracking support (#1411).
    • Harsha Pasham: fixed error with aten::equal operator on meta tensors (#1769).
    • Aditya Saxena: added SkyPilot support (#1590).
    • SwekeR-463:
      • Added MFU logging in train recipes (#1413).
      • Added embeddings utility functions for 15 models (#1288).
    • stanley1208
      • Implemented NEFTune noisy embeddings for fine-tuning (#1686).
      • Added best_metric_key field in CheckpointingConfig (#1641).
    • Zeel Desai
      • Added reasoning_content and tool-calling support to ChatDataset (#1644).
      • Additional contributions in the next release.
    • Bambuuai: enabled custom chat_template override for VLM fine-tuning (#1525).
    • Zakir Jiwani: fixed an instantiation issue in YAML parsing (issue #1496) (#1654).
  • Known Issues

    • Minor memory regression in cohere_command_r_7b_hellaswag_fp8 and glm_4_9b_chat_hf_hellaswag_fp8
    • Qwen3_5_4b_neat_packing hangs during checkpoint saving
    • MegatronFSDP support postponed to 26.06
    • ~2% of checkpoint loads currently exercise a less-optimized path; this is being addressed in follow-up work.
Changelog Details
  • refactor: extract initialize_model_weights from load_base_model by @hemildesai :: PR: #1356
  • fix: prefer moe_config for num_experts in apply_ac by @hemildesai :: PR: #1361
  • fix: FSDP pre-shard combined projections on dim 1 for Qwen2.5-7B support by @ZhiyuLi-Nvidia :: PR: #1357
  • ci: Update release workflow to include changelog and docs by @chtruong814 :: PR: #1320
  • feat: Add .generate() function with KV cache support for Nemotron v3 by @pzelasko :: PR: #1332
  • fix: loss masking with pad eos collision by @akoumpa :: PR: #1338
  • feat: add Qwen3.5 35b by @HuiyingLi :: PR: #1373
  • feat: refactor retriever code by @adil-a :: PR: #1166
  • fix: resolve retrieval dataset corpus paths relative to training file by @oliverholworthy :: PR: #1367
  • docs: Replace latest docs with nightly by @chtruong814 :: PR: #1358
  • fix: EP collective deadlock with variable-length token counts by @ShiftyBlock :: PR: #1365
  • fix: guard AutoConfig.from_pretrained in PP mask precomputation by @hemildesai :: PR: #1378
  • docs: fix broken links across documentation guides by @chenopis :: PR: #1374
  • fix: Handle check_model_inputs removal in transformers 5.2.0 by @oliverholworthy :: PR: #1369
  • fix: coverage for customizer_retrieval tests by @akoumpa :: PR: #1382
  • docs: add nano-v3 full sft benchmarks by @adil-a :: PR: #1387
  • docs: Added installation guidance by @onel :: PR: #1371
  • docs: update readme and docs by @akoumpa :: PR: #1370
  • feat: make MoE parallelizer mixed precision policy configurable via recipes by @hemildesai :: PR: #1392
  • ci: Add-credentials-for-docs by @ko3n1g :: PR: #1389
  • feat: add pp_seq_len field to PipelineConfig by @hemildesai :: PR: #1390
  • feat: add onnx export for biencoder by @akoumpa :: PR: #1276
  • feat: add scheduler-driven manual garbage collection across recipes by @hemildesai :: PR: #1391
  • fix: skip instantiation of nested configs overridden by kwargs in ConfigNode by @oliverholworthy :: PR: #1397
  • fix: MoE lora adapter layout by @akoumpa :: PR: #1395
  • fix: update GLM 4.7 Flash TE DeepEP finetuning config by @hemildesai :: PR: #1401
  • fix: read rope config from rope_parameters across all models by @hemildesai :: PR: #1400
  • docs: Ensure all docs updates from main are nightly by @chtruong814 :: PR: #1402
  • feat: add output_hidden_states support to NemotronHForCausalLM by @desh2608 :: PR: #1386
  • refactor: use auto_map for faster init by @akoumpa :: PR: #1405
  • feat: allow disabling top-k expert utilization logging in MoE metrics by @hemildesai :: PR: #1418
  • feat: add TE FusedAdam QuantizedTensor compatibility patch...

NVIDIA NeMo-Automodel 0.3.0

02 Mar 18:57
@svcnvidia-nemo-ci svcnvidia-nemo-ci
9e9472f
This commit was created on GitHub.com and signed with GitHub’s verified signature.
GPG key ID: B5690EEEBB952194
Verified

Release Notes

  • Hugging Face Transformers v5
    • Upgraded to Transformers v5 with a new device-mesh-only model initialization API
    • Drop-in API compatibility: NeMoAutoModelForCausalLM, NeMoAutoModelForImageTextToText, NeMoAutoModelForSequenceClassification, NeMoAutoTokenizer mirror the standard Transformers Auto* APIs
  • Model Support
    • LLM
      • DeepSeek V3.2
      • Step 3.5 Flash
      • MiniMax M2
      • Nemotron-3-Nano v3 (30B-A3B)
      • Nemotron Flash 1B
      • GLM 4.7, GLM 4.7 Flash
      • Devstral-Small-2-24B
      • FunctionGemma (tool-calling)
      • Ministral3 (3B, 8B, 14B)
    • VLM & OMNI
      • Kimi-VL-A3B
      • Kimi K2.5 VL
      • Nemotron Parse v1.1
      • Qwen3 VL MoE (30B, 235B)
      • Ministral3 VLM (3B, 8B, 14B)
    • Embedding & Retrieval
      • NeMo Biencoder training pipeline with Llama-Embed-Nemotron-8B support
      • Hard negative mining for retrieval training
  • PEFT
    • DoRA (Weight-Decomposed Low-Rank Adaptation)
    • LoRA for MoE models (DeepSeek MoE, Qwen MoE)
    • LoRA support for Biencoder
  • Parallelism
    • Pipeline parallelism for VLMs
    • GroupedExpertsTE backend (prerequisite for MoE FP8)
    • TE RoPE fusion for custom MoE models
    • Norm fusion and RoPE cache for dense models
  • Dataset support for
    • VLM multi-turn chat
    • Inline text dataset format for retrieval
    • Databricks DeltaLake streaming dataset
    • Parquet file support for Megatron dataset preprocessing
    • xLAM tool-calling dataset
    • Answer-only masking in ColumnMappedDataset
  • Checkpointing & logging
    • Streaming safetensors writer for reduced peak memory during checkpoint saving
    • Explicit restore_from for checkpoint loading (replaces auto-loading behavior)
    • Checkpoint custom model code files alongside weights
    • Configurable remote logging frequency via step_scheduler
  • Optimizers
    • Dion optimizer (Muon/orthogonal family)
  • Performance
    • Faster FP8 dequant kernels for DeepSeek V3
    • Meta device initialization enabled by default for reduced peak memory during model setup
    • Combine projection refactor for dense models
    • SDPA as default attention backend when FlashAttention is unavailable
  • Misc Infrastructure
    • Databricks integration (DeltaLake datasets, Unity Catalog checkpointing, DBFS consolidation)
    • Nsys profiling support with model layer name scoping
    • Environment variable dereferencing in YAML configs
    • Improved import time
  • Resolved from 0.2.0
    • MoE perf regression with DeepSeek V3 (resolved via faster FP8 dequant kernels and GroupedExpertsTE backend)
    • PEFT (LoRA) support for MoE models (now available)
    • Validation for packed sequences with TE attention (fixed)
    • Validation support for pipeline parallelism (added)
  • Known Issues
    • Qwen3-next is unsupported on Blackwell because FLA lacks Blackwell support.
    • TransformerEngine’s Fused Adam does not work with DTensor; this is resolved in the upcoming version.
    • LoRA with the TE backend is not supported.

Community Contributions

  • We gratefully acknowledge the following contributions from the OSS community:
    • @onel (Andrei Onel) – Founder of @askmanu, Dublin
      • docs: Add documentation for the new ChatDataset class (#990)
      • docs: Added MLflow guide (#1045)
      • docs: Created guide for quantization aware training (#1088)
      • docs: Documentation update for release 0.2.0 (#1041)
      • docs: Update docs/guides/dataset-overview.md (#1145)
    • @ooooo-create – Community contributor (PaddlePaddle ecosystem)
      • fix: Add DeepEP fallback logic and tests (#1000)
      • fix: respect trust_remote_code when building AutoConfig (#1007)
    • @Sparlitu – Community contributor
      • fix: leave num_epochs unset if max_steps is specified (#1107)
    • @yuhezhang-ai (Yuhe Zhang) – Engineer at Polarr
      • feat: Support LoRA for custom MoEs (#1010)
      • fix: sequence classification metric and training bugs #780 (#841)
    • @therealnaveenkamal (Naveenraj Kamalakannan) – NYU graduate student
      • feat: Implement DoRA (#1150)
    • @dongs0104 (Dong Shin) – Samsung Research, Samsung Electronics
      • fix: resolving errors in the hf decorator function (#983)
    • @jbross-ibm-research (Juergen Bross) – IBM Research

NVIDIA NeMo-Automodel 0.2.0

04 Dec 21:22
@chtruong814 chtruong814
0be83ba
This commit was created on GitHub.com and signed with GitHub’s verified signature.
GPG key ID: B5690EEEBB952194
Verified

  • Fast Model Implementations
    • LLM
      • GPT-OSS 20B and 120B
      • Qwen3 next and Qwen3-235B
      • GLM-4.5-344BA32B, GLM-4.6, GLM-4.5-Air
    • VLM & OMNI
      • Qwen3-vl
      • Qwen2-5-vl
      • Qwen3-omni-30b-a3b
      • Intern-vl-4B (ootb)
  • Parallelism
    • Improved support for CP and sequence packing with MoE models
    • Optimized TP plan for LoRA
  • Dataset support for
    • Single-turn tool calling
    • Multi-turn tool calling
    • Streaming dataset
    • Chat dataset with OpenAI format
    • Improved support for truncation/padding
  • Checkpointing & logging
    • Support for asynchronous checkpointing with DCP
    • Symbolic links (LATEST, LOWEST_VAL) pointing to the latest and lowest validation score checkpoints
    • MLflow support
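
The LATEST and LOWEST_VAL symbolic links can be pictured with the sketch below. It assumes POSIX symlink semantics, a hypothetical `step_<N>` directory naming scheme, and that "lowest validation score" means lowest validation loss; `point_symlink` and `record_checkpoints` are illustrative helpers, not functions from the library.

```python
import os
import tempfile


def point_symlink(ckpt_root, link_name, target_name):
    """Make ckpt_root/link_name point at target_name, replacing any old link."""
    link_path = os.path.join(ckpt_root, link_name)
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    # A relative target keeps the checkpoint tree relocatable.
    os.symlink(target_name, tmp_path)
    # Renaming over the old link swaps it in a single step.
    os.replace(tmp_path, link_path)


def record_checkpoints(ckpt_root, history):
    """Create one directory per (step, val_loss) pair and update both links."""
    best = float("inf")
    for step, val_loss in history:
        name = f"step_{step}"  # hypothetical naming scheme
        os.makedirs(os.path.join(ckpt_root, name))
        point_symlink(ckpt_root, "LATEST", name)
        if val_loss < best:
            best = val_loss
            point_symlink(ckpt_root, "LOWEST_VAL", name)


root = tempfile.mkdtemp()
record_checkpoints(root, [(100, 0.9), (200, 0.7), (300, 0.8)])
latest = os.readlink(os.path.join(root, "LATEST"))
lowest = os.readlink(os.path.join(root, "LOWEST_VAL"))
```

With the history above, `latest` resolves to the last step written and `lowest` to the step with the smallest validation loss, so a resume or export script can follow a link instead of parsing directory names.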
  • Task support
    • QAT for SFT
    • Sequence classification
  • Known issues
    • Minor perf regression with DSv3
    • Sequence parallel plan incorrect for Qwen3
    • Support for GPT-OSS 120B with DeepEP will be included in the next patch release
    • Validation is not functional for custom models with TE when using packed sequence and pipeline parallel size of 1
  • Limitations
    • PEFT (LoRA) support for MoE models is scheduled for the 26.02 release
    • For non-MoE models, CP support requires the model leveraging the PyTorch SDPA API

NeMo-Automodel 25.11 Container

The 0.2.0 release is also included in the NeMo Automodel 25.11 container on NGC at https://registry.ngc.nvidia.com/orgs/nvidia/containers/nemo-automodel.
The major software components included in the container are:

Software Component    Version
CUDA                  13.0
cuDNN                 9.13.0.50-1
PyTorch               2.9.0a0
NeMo-Automodel        0.2.0
Transformer Engine    2.8.0
Transformers          4.57.1

NVIDIA NeMo-Automodel 0.1.2

23 Oct 19:24
@chtruong814 chtruong814
45ad729
This commit was signed with the committer’s verified signature.
ko3n1g oliver könig
GPG key ID: 2A0D811D627CDD85
Verified

  • Features:

    • Included support for limiting the number of samples with the ColumnMappedDataset
  • Bug Fixes (step scheduler):

    • Switched to zero-based indexing
    • Epoch length accounts for accumulation steps
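
The two step-scheduler fixes can be illustrated with a small sketch. It assumes that an epoch's length is counted in optimizer steps and that a trailing partial accumulation group still triggers a step; the release notes do not confirm either detail, and `optimizer_steps_per_epoch` is a hypothetical helper.

```python
def optimizer_steps_per_epoch(num_batches, grad_acc_steps):
    """Number of optimizer steps in one epoch.

    Assumes a trailing partial accumulation group still triggers a step,
    hence the ceiling division.
    """
    if grad_acc_steps < 1:
        raise ValueError("grad_acc_steps must be >= 1")
    return -(-num_batches // grad_acc_steps)  # ceiling division on ints


# 10 micro-batches with 4 accumulation steps -> 3 optimizer steps.
steps = optimizer_steps_per_epoch(num_batches=10, grad_acc_steps=4)
step_indices = list(range(steps))  # zero-based: the first step is index 0
```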

NVIDIA NeMo-Automodel 0.1.0

08 Oct 14:18
@chtruong814 chtruong814
7146809
This commit was created on GitHub.com and signed with GitHub’s verified signature.
GPG key ID: B5690EEEBB952194
Verified

New Features

  • Pretraining support for
    • Models under 40B with PyT FSDP2
    • Larger models by applying PyT PP
    • TP can also be used for models with a TP plan
    • Large MoE via custom implementations
  • Knowledge distillation for LLMs (requires same tokenizer)
  • FP8 with torchao (requires torch.compile)
  • Parallelism
    • HSDP with FSDP2
    • Auto Pipelining Support
  • Checkpointing
    • Pipeline support (load and save)
    • Parallel load with meta device
  • Data
    • ColumnMapped Dataset for single-turn SFT
    • Pretrain Data: Megatron-Core and Nano-gpt compatible data
  • Performance https://docs.nvidia.com/nemo/automodel/latest/performance-summary.html
    • Pretraining benchmark for Large MoE user-defined models
    • Fast DeepSeek v3 implementation with DeepEP
  • Megatron FSDP support
  • Packed sequence support
  • Triton kernels for LoRA

NVIDIA NeMo-Automodel 0.1.0rc0

17 Sep 13:59
@chtruong814 chtruong814
d36402d
This commit was created on GitHub.com and signed with GitHub’s verified signature.
GPG key ID: B5690EEEBB952194
Verified

Pre-release

Prerelease: NVIDIA NeMo-Automodel 0.1.0rc0 (2025-09-17)
