Releases: NVIDIA-NeMo/Automodel

NVIDIA NeMo-Automodel 0.4.0

28 Apr 20:04

@svcnvidia-nemo-ci svcnvidia-nemo-ci

v0.4.0

b651aa8


Latest

Release Notes

Highlights
- Expanded VLM line-up: Gemma 4, Mistral 4, Qwen3.5 VL
- Diffusion and discrete-diffusion LLM (new tracks)
- NeMo Retriever – bi-encoder + cross-encoder / reranker
- Knowledge Distillation scaled to TP > 1 and PP (Sepehr Sameni)
- MoE infrastructure deepening – UCCL-EP, HybridEP, grouped_mm
- SkyPilot launcher backend (Aditya Saxena, community)
- End-to-end checkpoint + convergence robustness framework
Model Support – newly supported families in r0.4.0
- LLM
  - GLM 5 (#1372)
  - GLM 5.1 (#1720)
  - MiniMax M2.5 (#1280)
- VLM / OMNI
  - Gemma 4 family – 2B, 4B, 31B, 26B-A4B MoE (#1658, #1660, #1731)
  - Mistral Small 4 (#1556)
  - Qwen3.5 VL dense – 4B, 9B (#1427)
  - Qwen3.5 VL MoE – 35B (#1373)
- Diffusion
  - Flux T2I, Hunyuan T2V, Wan 2.1 T2V (see "Diffusion" section)
- Discrete diffusion LLM
  - LLaDA (see "Discrete Diffusion LLM" section)
Diffusion – new track in r0.4.0
- HuggingFace Diffuser integration
- r0.4.0 ships full pretrain / finetune / generate pipelines with LoRA support for diffusion models (T2V, T2I)
- Wan integrated with multi-resolution DataLoader (#1475)
- Inference utility for diffusion (#1491)
- LoRA for diffusion (#1653, Linnan Wang)
- Diffusion processor registry (#1379)
- Models / recipes shipped
  - Flux T2I – pretrain, SFT, LoRA, generate
  - Hunyuan T2V – SFT, LoRA, generate
  - Wan 2.1 T2V – pretrain, SFT, LoRA, generate
- Documentation guides for dataset preprocessing and finetuning.
Discrete Diffusion LLM (dLLM) – new track in r0.4.0
- Discrete diffusion LLM SFT support added (#1665)
- LLaDA SFT recipe (#1672)
- dLLM generation pipeline (#1692)
NeMo Retriever (bi-encoder + cross-encoder)
- Refactored cross-encoder / reranker training loop, new in r0.4.0 (#1449).
- Bi-encoder datasets can be loaded directly from the HuggingFace Hub (#1380)
- Bi-encoder masking + consistent attn_implementation default (#1349)
- Resolve retrieval dataset corpus paths relative to training file (#1367)
- Docs: docs/guides/retrieval/finetune.md
Knowledge Distillation — Sepehr Sameni
- Enable TP > 1 in KD (#1297)
- TP-aware KDLoss with distributed softmax + T2 scaling (#1499)
- Pipeline-parallelism support for KD (#1500)
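As a rough illustration of the temperature scaling mentioned above, the following is a minimal single-process sketch of a distillation loss. It assumes "T2 scaling" means multiplying the KL term by the squared temperature, which is the common convention; the function names are hypothetical, and the tensor-parallel distributed softmax from #1499 is not reproduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, times T^2.

    Assumption: the T^2 factor keeps gradient magnitudes comparable across
    temperatures. This is an illustrative helper, not the library's KDLoss.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl
```

When the student matches the teacher exactly the loss is zero, and it grows as the two softened distributions diverge.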
Parallelism / Performance / Train-loop
- FSDP2
  - FSDP2 weight prefetching + async TP optimization (#1711)
- Context Parallel
  - Qwen3.5 dense & MoE CP (#1710, #1560 — alexchiu / Zhaopeng Qiu)
  - Mamba CP for hybrid Nemotron v3 (#1441)
  - 3D mRoPE position_ids sharding under CP (#1482)
  - CP attention-mask hooks for dense / non-TE (#1470)
- Pipeline Parallel
  - PP shape-inference optimization + pp_seq_len field in PipelineConfig (#1195, #1390)
  - Variable length for PP (#1689 – Zhiqi Li & Hemil Desai)
- Activation checkpointing
  - Gradient_checkpointing overhead reduction in transformers 5.3 (#1621, Yuki Huang)
- MoE infrastructure
  - UCCL-EP alternative dispatcher (#1635 – Zhiqi Li & Hemil Desai)
  - HybridEP (#1333, #1666)
  - DeepEP-on-H100 RDMA fallback detection (#1275 — Piotr Żelasko)
  - torch._grouped_mm expert backend (#1228)
  - TE FusedAdam QuantizedTensor compatibility patch (#1417)
  - MoE LoRA rank scaling + torch_mm path (#1300, #1392)
  - Expert / diversity metrics (#1232, #1506), top-k utilization (#1418)
  - Packed sequences for MoE with EP+PP (#1685)
- FlashOptim integration (#1492)
- Scheduler-driven python GC (#1391)
- fp32 RMSNorm backend + cast_model_to_dtype for improved stability (#1493)
- Native Comet ML experiment tracking (#1411, Logan Vegna, community)
- Added .generate() with KV-cache for Nemotron v3 (#1332, Piotr Żelasko)
- Added output_hidden_states for NemotronHForCausalLM (#1386, Desh Raj)
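To make the idea of scheduler-driven garbage collection concrete, here is a minimal sketch using only the standard `gc` module. The `ManualGC` class and its `every_n_steps` parameter are hypothetical names chosen for illustration; the actual integration in #1391 may be structured differently.

```python
import gc


class ManualGC:
    """Disable automatic collection and collect on a fixed step interval.

    Hypothetical helper: the goal is to keep collection pauses at the same
    step on every rank instead of letting them fire at arbitrary times.
    """

    def __init__(self, every_n_steps):
        if every_n_steps <= 0:
            raise ValueError("every_n_steps must be positive")
        self.every_n_steps = every_n_steps
        gc.disable()  # stop the interpreter from collecting on its own

    def step(self, step_idx):
        """Collect when step_idx is a multiple of the interval; return True if it ran."""
        if step_idx % self.every_n_steps == 0:
            gc.collect()
            return True
        return False

    def close(self):
        """Restore automatic collection."""
        gc.enable()
```

A training loop would call `step(step_idx)` once per optimizer step and `close()` at shutdown.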
Launcher & CLI
- SkyPilot backend (#1590 — Aditya Saxena, community contributor)
- CLI app + launching refactor (#1406)
  - Shim scripts under examples/ will be deprecated after 26.04.
- Launcher CLI flags no longer leak into recipe YAML overrides (#1766)
- MFU logging in train recipes (#1413 — SwekeR, community)
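For readers unfamiliar with MFU (model FLOPs utilization), the sketch below shows one common way to estimate it. It assumes the widely used approximation of about 6 FLOPs per parameter per token for a dense transformer's forward and backward pass, which ignores attention FLOPs; the formula used by the recipes in #1413 may differ, and the function name is hypothetical.

```python
def model_flops_utilization(num_params, tokens_per_second, peak_flops_per_second):
    """Estimate MFU as achieved training FLOPs/s divided by hardware peak FLOPs/s.

    Assumption: achieved FLOPs are approximated as 6 * parameters * tokens,
    a rule of thumb for dense transformers, not an exact count.
    """
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    achieved = 6.0 * num_params * tokens_per_second
    return achieved / peak_flops_per_second
```

The peak value should be the aggregate peak of all devices in the job at the training precision.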
Checkpoint and convergence robustness
- Checkpointing: End-to-end finetune → vLLM-deploy testing (#1606)
  - Models covered:
    - Gemma 3
    - Nemotron (Flash 1B, Super v3, Nano 9B, Nano v3)
    - Phi 4, Llama 3.2, Qwen 2.5
    - Qwen 3 MoE, GPT-OSS.
  - What this catches: prediction divergence, packaging gaps, vLLM loading issues.
- Convergence harness (#1554, #1577, #1602)
  - Pipeline: Tulu-3 data prep → model verification → training → eval
  - Models covered:
    - GPT-OSS 20B (FlashAdamW + TE FusedAdam).
    - Moonlight 16B (3 configs incl. EP8+CP2).
    - Qwen3 4B (3 configs incl. CP1/CP2 variants).
    - Qwen3 MoE 30B (2 configs + experiments/).
Datasets
- Neat packing (greedy knapsack) for LLM and VLM (#1485 – Zhiqi Li)
- Pretokenization support for VLM (Zhiqi Li)
- MultiImage dataset support for Qwen family (Zhiqi Li)
- Qwen family video training support (Zhiqi Li)
- LengthGroupedSampler (#1618 – Zhiqi Li)
- Chat datasets THD/BSHD + CP, padding fixes (#1416).
- reasoning_content + tool-calling support in ChatDataset (#1644, Zeel Desai, community).
- Custom chat_template override for VLM finetuning (#1525, Bambuuai, community).
- NEFTune noisy embeddings (#1686, stanley1208, community).
- JSONL malformed-line skip (#1694, Somshubra Majumdar).
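The notes describe neat packing as a greedy knapsack. As a hedged illustration of that general idea, the sketch below uses first-fit decreasing, one common greedy heuristic for packing variable-length samples into fixed-size sequences. It is an assumption that this resembles the approach in #1485; `greedy_pack` is a hypothetical name.

```python
def greedy_pack(lengths, max_len):
    """Group sample indices into bins whose total length is at most max_len.

    First-fit decreasing: visit samples from longest to shortest and place
    each one in the first bin with enough free space, opening a new bin
    when none fits.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins = []       # sample indices per packed sequence
    remaining = []  # free capacity per packed sequence
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has length {n} > max_len {max_len}")
        for b, free in enumerate(remaining):
            if n <= free:
                bins[b].append(i)
                remaining[b] -= n
                break
        else:  # no existing bin had room
            bins.append([i])
            remaining.append(max_len - n)
    return bins
```

Sorting by length first tends to leave small gaps that short samples can fill, which reduces padding compared with packing in dataset order.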
Documentation
- Per-model coverage pages (#1683).
- Diffusion docs (#1495).
- Gemma 4 tutorial (#1657).
- Nemotron Parse fine-tuning notebook + assets (#1655, Krishna Kalyan).
- Finetune-process + container-usage docs (#1484, Krishna Kalyan).
- MLflow/Databricks docs (#1170, Andrei Onel).
Contributions – we are grateful for all contributions 🙇
- Khazzz1c
  - Optimized resolve_yaml_env_vars so that it no longer scans runtime data in instantiate() (#1827)
  - additional contributions in r0.5.0.
- Logan Vegna: added native Comet ML experiment tracking support (#1411).
- Harsha Pasham: fixed error with aten::equal operator on meta tensors (#1769).
- Aditya Saxena: added SkyPilot support (#1590).
- SwekeR-463:
  - Added MFU logging in train recipes (#1413).
  - Added embeddings utility functions for 15 models (#1288).
- stanley1208
  - Implemented NEFTune noisy embeddings for fine-tuning (#1686).
  - Added best_metric_key field in CheckpointingConfig (#1641).
- Zeel Desai
  - Added reasoning_content and tool-calling support to ChatDataset (#1644).
  - Additional contributions in the next release.
- Bambuuai: enabled custom chat_template override for VLM fine-tuning (#1525).
- Zakir Jiwani: Fixed instantiation issue in yaml parsing (issue #1496) (#1654).
Known Issues
- Minor memory regression in cohere_command_r_7b_hellaswag_fp8 and glm_4_9b_chat_hf_hellaswag_fp8
- Qwen3_5_4b_neat_packing hangs during checkpoint saving
- MegatronFSDP support postponed to 26.06
- About 2% of checkpoint loads currently exercise a less-optimized path; this is being addressed in follow-up work.

Changelog Details

refactor: extract initialize_model_weights from load_base_model by @hemildesai :: PR: #1356
fix: prefer moe_config for num_experts in apply_ac by @hemildesai :: PR: #1361
fix: FSDP pre-shard combined projections on dim 1 for Qwen2.5-7B support by @ZhiyuLi-Nvidia :: PR: #1357
ci: Update release workflow to include changelog and docs by @chtruong814 :: PR: #1320
feat: Add .generate() function with KV cache support for Nemotron v3 by @pzelasko :: PR: #1332
fix: loss masking with pad eos collision by @akoumpa :: PR: #1338
feat: add Qwen3.5 35b by @HuiyingLi :: PR: #1373
feat: refactor retriever code by @adil-a :: PR: #1166
fix: resolve retrieval dataset corpus paths relative to training file by @oliverholworthy :: PR: #1367
docs: Replace latest docs with nightly by @chtruong814 :: PR: #1358
fix: EP collective deadlock with variable-length token counts by @ShiftyBlock :: PR: #1365
fix: guard AutoConfig.from_pretrained in PP mask precomputation by @hemildesai :: PR: #1378
docs: fix broken links across documentation guides by @chenopis :: PR: #1374
fix: Handle check_model_inputs removal in transformers 5.2.0 by @oliverholworthy :: PR: #1369
fix: coverage for customizer_retrieval tests by @akoumpa :: PR: #1382
docs: add nano-v3 full sft benchmarks by @adil-a :: PR: #1387
docs: Added installation guidance by @onel :: PR: #1371
docs: update readme and docs by @akoumpa :: PR: #1370
feat: make MoE parallelizer mixed precision policy configurable via recipes by @hemildesai :: PR: #1392
ci: Add-credentials-for-docs by @ko3n1g :: PR: #1389
feat: add pp_seq_len field to PipelineConfig by @hemildesai :: PR: #1390
feat: add onnx export for biencoder by @akoumpa :: PR: #1276
feat: add scheduler-driven manual garbage collection across recipes by @hemildesai :: PR: #1391
fix: skip instantiation of nested configs overridden by kwargs in ConfigNode by @oliverholworthy :: PR: #1397
fix: MoE lora adapter layout by @akoumpa :: PR: #1395
fix: update GLM 4.7 Flash TE DeepEP finetuning config by @hemildesai :: PR: #1401
fix: read rope config from rope_parameters across all models by @hemildesai :: PR: #1400
docs: Ensure all docs updates from main are nightly by @chtruong814 :: PR: #1402
feat: add output_hidden_states support to NemotronHForCausalLM by @desh2608 :: PR: #1386
refactor: use auto_map for faster init by @akoumpa :: PR: #1405
feat: allow disabling top-k expert utilization logging in MoE metrics by @hemildesai :: PR: #1418
feat: add TE FusedAdam QuantizedTensor compatibility patch...

Contributors

Separius, oliverholworthy, and 32 other contributors

NVIDIA NeMo-Automodel 0.3.0

02 Mar 18:57

@svcnvidia-nemo-ci svcnvidia-nemo-ci

v0.3.0

9e9472f

Release Notes

Hugging Face Transformers v5
- Upgraded to Transformers v5 with a new device-mesh-only model initialization API
- Drop-in API compatibility: NeMoAutoModelForCausalLM, NeMoAutoModelForImageTextToText, NeMoAutoModelForSequenceClassification, NeMoAutoTokenizer mirror the standard Transformers Auto* APIs
Model Support
- LLM
  - DeepSeek V3.2
  - Step 3.5 Flash
  - MiniMax M2
  - Nemotron-3-Nano v3 (30B-A3B)
  - Nemotron Flash 1B
  - GLM 4.7, GLM 4.7 Flash
  - Devstral-Small-2-24B
  - FunctionGemma (tool-calling)
  - Ministral3 (3B, 8B, 14B)
- VLM & OMNI
  - Kimi-VL-A3B
  - Kimi K2.5 VL
  - Nemotron Parse v1.1
  - Qwen3 VL MoE (30B, 235B)
  - Ministral3 VLM (3B, 8B, 14B)
- Embedding & Retrieval
  - NeMo Biencoder training pipeline with Llama-Embed-Nemotron-8B support
  - Hard negative mining for retrieval training
PEFT
- DoRA (Weight-Decomposed Low-Rank Adaptation)
- LoRA for MoE models (DeepSeek MoE, Qwen MoE)
- LoRA support for Biencoder
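As background on DoRA, the sketch below shows the weight decomposition in plain Python: the adapted weight is a learned per-column magnitude times the column-normalized direction of the base weight plus a low-rank update. The per-column norm follows the DoRA paper's notation; which axis a given implementation normalizes over is an assumption here, and the helper names are hypothetical.

```python
import math


def column_norms(w):
    """L2 norm of each column of a matrix given as a list of rows."""
    rows, cols = len(w), len(w[0])
    return [math.sqrt(sum(w[r][c] ** 2 for r in range(rows))) for c in range(cols)]


def dora_weight(w0, delta, magnitude):
    """Return magnitude * (w0 + delta) / ||w0 + delta||, normalized per column.

    w0 is the frozen base weight, delta stands in for the low-rank update
    (B @ A in LoRA terms), and magnitude is the learned per-column scale.
    """
    v = [[a + b for a, b in zip(r0, rd)] for r0, rd in zip(w0, delta)]
    norms = column_norms(v)
    return [[magnitude[c] * row[c] / norms[c] for c in range(len(norms))] for row in v]
```

Initializing `magnitude` to the column norms of `w0` and `delta` to zero reproduces the base weight, so training starts from the pretrained model.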
Parallelism
- Pipeline parallelism for VLMs
- GroupedExpertsTE backend (prerequisite for MoE FP8)
- TE RoPE fusion for custom MoE models
- Norm fusion and RoPE cache for dense models
Dataset support for
- VLM multi-turn chat
- Inline text dataset format for retrieval
- Databricks DeltaLake streaming dataset
- Parquet file support for Megatron dataset preprocessing
- xLAM tool-calling dataset
- Answer-only masking in ColumnMappedDataset
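To illustrate what answer-only masking means in practice, here is a minimal sketch that builds labels so the loss is computed on answer tokens only. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss; the function name is hypothetical, and the real ColumnMappedDataset logic (including any label shifting or special tokens) is not shown.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def answer_only_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer tokens and mask the prompt in the labels.

    Positions holding IGNORE_INDEX contribute nothing to the loss, so only
    the answer tokens are supervised.
    """
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels
```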
Checkpointing & logging
- Streaming safetensors writer for reduced peak memory during checkpoint saving
- Explicit restore_from for checkpoint loading (replaces auto-loading behavior)
- Checkpoint custom model code files alongside weights
- Configurable remote logging frequency via step_scheduler
Optimizers
- Dion optimizer (Muon/orthogonal family)
Performance
- Faster FP8 dequant kernels for DeepSeek V3
- Meta device initialization enabled by default for reduced peak memory during model setup
- Combine projection refactor for dense models
- SDPA as default attention backend when FlashAttention is unavailable
Misc Infrastructure
- Databricks integration (DeltaLake datasets, Unity Catalog checkpointing, DBFS consolidation)
- Nsys profiling support with model layer name scoping
- Environment variable dereferencing in YAML configs
- Improved import time
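The environment variable dereferencing listed above can be pictured with a small sketch that walks a parsed config and expands variables in string values. It assumes a `${VAR}` style reference and uses the standard library's `os.path.expandvars`; the syntax and the function actually used by NeMo Automodel are not stated in these notes, so `resolve_env_vars` is a hypothetical name.

```python
import os


def resolve_env_vars(node):
    """Recursively expand $VAR and ${VAR} references in string config values.

    Non-string leaves are returned unchanged, and references to unset
    variables are left as written (the behavior of os.path.expandvars).
    """
    if isinstance(node, dict):
        return {key: resolve_env_vars(value) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_env_vars(value) for value in node]
    if isinstance(node, str):
        return os.path.expandvars(node)
    return node
```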
Resolved from 0.2.0
- MoE perf regression with DeepSeek V3 (resolved via faster FP8 dequant kernels and GroupedExpertsTE backend)
- PEFT (LoRA) support for MoE models (now available)
- Validation for packed sequences with TE attention (fixed)
- Validation support for pipeline parallelism (added)
Known Issues
- Qwen3-next is unsupported on Blackwell because FLA lacks support for it.
- TransformerEngine’s Fused Adam does not work with DTensor; this is resolved in the upcoming version.
- LoRA with TE backend is not supported

Community Contributions

We gratefully acknowledge the following contributions from the OSS community:
- @onel (Andrei Onel) – Founder of @askmanu, Dublin
  - docs: Add documentation for the new ChatDataset class (#990)
  - docs: Added MLflow guide (#1045)
  - docs: Created guide for quantization aware training (#1088)
  - docs: Documentation update for release 0.2.0 (#1041)
  - docs: Update docs/guides/dataset-overview.md (#1145)
- @ooooo-create – Community contributor (PaddlePaddle ecosystem)
  - fix: Add DeepEP fallback logic and tests (#1000)
  - fix: respect trust_remote_code when building AutoConfig (#1007)
- @Sparlitu – Community contributor
  - fix: leave num_epochs unset if max_steps is specified (#1107)
- @yuhezhang-ai (Yuhe Zhang) – Engineer at Polarr
  - feat: Support LoRA for custom MoEs (#1010)
  - fix: sequence classification metric and training bugs #780 (#841)
- @therealnaveenkamal (Naveenraj Kamalakannan) – NYU graduate student
  - feat: Implement DoRA (#1150)
- @dongs0104 (Dong Shin) – Samsung Research, Samsung Electronics
  - fix: resolving errors in the hf decorator function (#983)
- @jbross-ibm-research (Juergen Bross) – IBM Research

Changelog Details

feat: auto detect base weights dequant by @adil-a :: PR: #796
fix: Raise informative error in MegatronPretraining if no glob paths found by @jrbourbeau :: PR: #803
feat: enable meta device by default by @adil-a :: PR: #797
feat: checkpoint custom model code files by @adil-a :: PR: #810
feat: qwen3 vl 30b ootb recipe by @HuiyingLi :: PR: #808
fix: Include megatron Makefile in package data by @jrbourbeau :: PR: #798
perf: add qwen2.5 32b lora perf by @ZhiyuLi-Nvidia :: PR: #802
feat: Add NeMo Biencoder by @meatybobby :: PR: #745
ci: Add mamba-ssm and causal-conv1d dep by @thomasdhc :: PR: #811
fix: torchrun single proc by @akoumpa :: PR: #814
feat: refactor model init kwargs + whitelist NVIDIA models by @adil-a :: PR: #809
fix: Set max_steps config option for nanogpt pretraining by @jrbourbeau :: PR: #817
fix: ep shard state dict conversion by @adil-a :: PR: #815
fix: Cast norm to fp32 in clip_grad_norm by @hemildesai :: PR: #825
feat: add answer only masking in ColumnMappedDataset by @adil-a :: PR: #832
feat: add internvl recipe by @HuiyingLi :: PR: #823
fix: adding safety during model init by @adil-a :: PR: #833
build: Add OSS NOTICES.txt file to docker build by @chtruong814 :: PR: #838
ci: Build bitsandbytes from source by @thomasdhc :: PR: #837
feat: ckpt val loss + run val at ckpt + symlink best ckpt by @anubhutivyas :: PR: #828
fix: add num_nodes for alignment in benchmark recipe by @ZhiyuLi-Nvidia :: PR: #839
docs: Update version and contrib by @thomasdhc :: PR: #849
fix: deepseek v3 pretrain config parallelizer by @hemildesai :: PR: #851
feat: sft qat support by @akoumpa :: PR: #704
feat: combine projection refactor by @ZhiyuLi-Nvidia :: PR: #804
fix: sequence classification metric and training bugs #780 by @yuhezhang-ai :: PR: #841
fix: fix qwen3 omni config by @HuiyingLi :: PR: #855
fix: revert recipe change for memory fragmentation OOM in Llama3 70B by @ZhiyuLi-Nvidia :: PR: #818
fix: update moe finetuning configs to use from_pretrained by @hemildesai :: PR: #863
feat: adding flags for special tokens & chat template in column mapped dataset by @adil-a :: PR: #844
ci: Add additional dep for model support by @thomasdhc :: PR: #861
fix: remove validation for packed seq moe configs by @hemildesai :: PR: #867
fix: test process launcher error propagation by @akoumpa :: PR: #871
fix: no meta init when force_hf by @adil-a :: PR: #874
feat: add glm 4.5 air finetuning config by @hemildesai :: PR: #873
fix: NeMoAutoTokenizer by @akoumpa :: PR: #878
feat: add custom implementation for qwen3vlmoe by @HuiyingLi :: PR: #843
build: bumping timm version by @adil-a :: PR: #886
fix: moving registry imports under a try catch block by @adil-a :: PR: #889
fix: consolidate qwen3omni recipes by @HuiyingLi :: PR: #885
fix: perf regressions for custom MoEs by @hemildesai :: PR: #881
fix: update weight initialization method in LinearLoRA class by @RayenTian :: PR: #896
ci: Update changelog for Automodel 0.2.0 by @akoumpa :: PR: #894
fix: support validation for packed sequences when using TE attention by @hemildesai :: PR: #892
docs: Update nvidia-sphinx-theme by @chtruong814 :: PR: #906
feat: configurable max clip grad by @akoumpa :: PR: #904
feat: vlm multiturn chat support and dataset by @HuiyingLi :: PR: #899
ci: Update transformers to latest version 4.57.3 by @thomasdhc :: PR: #890
docs: Update VLM table by @akoumpa :: PR: #917
ci: Initial PR template by @thomasdhc :: PR: #925
ci: Bump to 0.2.0 by @thomasdhc :: PR: #927
feat: update change log for r.0.2.0 by @akoumpa :: PR: #921
ci: Reorganize optional dependency by @thomasdhc :: PR: #926
feat: add ministral3 configs and improve tie_emb detection by @HuiyingLi :: PR: #915
feat: port ministral3 to transformers v4 by @HuiyingLi :: PR: #934
fix: mute ministral3 autodocstring warning by @HuiyingLi :: PR: #946
fix: handle zero active experts for 1 ep rank in GroupedExperts by @hemildesai :: PR: #935
fix: fix dataset load when split is not specified by @HuiyingLi :: PR: #943
fix: torch buffer warning by @adil-a :: PR: #948
docs: Fix images not rendering in docs by @jrbourbeau :: PR: #954
feat: improve yaml logging to stdout by @akoumpa :: PR: #882
fix: Biencoder consolidated checkpoint and transformers issue by @meatybobby :: PR: #936
feat: Support for Llama-Embed-Nemotron-8B Training Pipeline by @ybabakhin :: PR: #963
feat: nano v3 configs and FSDP fix by @adil-a :: PR: #964
feat: add more PEFT lora recipes by @ZhiyuLi-Nvidia :: PR: #959
ci: update owners by @akoumpa :: PR: #958
feat: add nsys model layer name scope and benchmark support (with nsys) in app by @ZhiyuLi-Nvidia :: PR: #951
docs: update vlm coverage by @akoumpa :: PR: #961
docs: Update news section for nano-v3 in README.md by @snowmanwwg :: PR: #969
feat: add nano-v3 to README by @adil-a :: PR: #978
fix: move print_trainable_parameters calculation to device by @akoumpa :: PR: #966
feat: simplify from_pretrained/from_config by @akoumpa :: PR: #967
docs: Update LLM coverage table by @akoumpa :: PR: #982
fix: misplaced parenthesis; Thanks @jbross-ibm-research by @akoumpa :: PR: #973
feat: add xlam toolcall dataset by @HuiyingLi :: PR: #975
feat: add functiongemma yaml by @HuiyingLi :: PR: #985
docs: functiongemma docs by @HuiyingLi :: PR: #986
feat: allow passing model-id to from_config by @akoumpa :: PR: #984
feat: add support for parquet files by @akoumpa :: PR: #919
docs: update readme with FunctionGemma by @Huiyin...

Contributors

oliverholworthy, HuiyingLi, and 25 other contributors

NVIDIA NeMo-Automodel 0.2.0

04 Dec 21:22

@chtruong814 chtruong814

v0.2.0

0be83ba

Fast Model Implementations
- LLM
  - GPT-OSS 20B and 120B
  - Qwen3 next and Qwen3-235B
  - GLM-4.5-344BA32B, GLM-4.6, GLM-4.5-Air
- VLM & OMNI
  - Qwen3-vl
  - Qwen2-5-vl
  - Qwen3-omni-30b-a3b
  - Intern-vl-4B (ootb)
Parallelism
- Improved support for CP and sequence packing with MoE models
- Optimized TP plan for LoRA
Dataset support for
- Single-turn tool calling
- Multi-turn tool calling
- Streaming dataset
- Chat dataset with OpenAI format
- Improved support for truncation/padding
Checkpointing & logging
- Support for asynchronous checkpointing with DCP
- Symbolic links (LATEST, LOWEST_VAL) pointing to the latest and lowest validation score checkpoints
- MLFlow support
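The LATEST and LOWEST_VAL links described above can be sketched with standard library calls on a POSIX filesystem. The directory naming (`step_<n>`) and the helper names are assumptions made for illustration; only the two link names come from the release notes.

```python
import os


def update_link(root, link_name, target_dir_name):
    """Point root/link_name at target_dir_name using a relative symlink.

    The link is created under a temporary name and then renamed over the
    old one, so readers never observe a missing link.
    """
    link_path = os.path.join(root, link_name)
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(target_dir_name, tmp_path)
    os.replace(tmp_path, link_path)


def record_checkpoint(root, step, val_loss, best):
    """Create a checkpoint directory and refresh the links.

    LATEST always moves to the new checkpoint; LOWEST_VAL moves only when
    val_loss improves on the best value seen so far. Returns the new best.
    """
    name = f"step_{step}"
    os.makedirs(os.path.join(root, name), exist_ok=True)
    update_link(root, "LATEST", name)
    if best is None or val_loss < best:
        update_link(root, "LOWEST_VAL", name)
        return val_loss
    return best
```

Resuming from `LATEST` or exporting from `LOWEST_VAL` then needs no knowledge of step numbers.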
Task support
- QAT for SFT
- Sequence classification
Known issues
- Minor perf regression with DSv3
- Sequence parallel plan incorrect for Qwen3
- Support for GPT-OSS 120B with DeepEP will be included in the next patch release
- Validation is not functional for custom models with TE when using packed sequence and pipeline parallel size of 1
Limitations
- PEFT (LoRA) support for MoE models is scheduled for the 26.02 release
- For non-MoE models, CP support requires the model leveraging the PyTorch SDPA API

NeMo-Automodel 25.11 Container

The 0.2.0 release is also included in the NeMo Automodel 25.11 container on NGC at https://registry.ngc.nvidia.com/orgs/nvidia/containers/nemo-automodel.
Here are the major software components included in the container:

Software Component	Version
CUDA	13.0
cuDNN	9.13.0.50-1
PyTorch	2.9.0a0
NeMo-Automodel	0.2.0
Transformer Engine	2.8.0
Transformers	4.57.1

NVIDIA NeMo-Automodel 0.1.2

23 Oct 19:24

@chtruong814 chtruong814

v0.1.2

45ad729

Features:
- Included support for limiting the number of samples with the ColumnMappedDataset
Bug Fixes (step scheduler):
- Switched to zero-based indexing
- Epoch length accounts for accumulation steps
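The arithmetic behind the two step scheduler fixes can be sketched as follows. This is an illustration under stated assumptions, not the scheduler's code: it assumes a trailing partial accumulation group still triggers an optimizer update (hence the ceiling), and both function names are hypothetical.

```python
import math


def optimizer_steps_per_epoch(num_batches, grad_accum_steps):
    """Number of optimizer updates in one epoch under gradient accumulation.

    Assumption: a final, incomplete accumulation group still produces an
    update. A scheduler that drops the remainder would floor instead.
    """
    if num_batches < 0 or grad_accum_steps <= 0:
        raise ValueError("num_batches must be >= 0 and grad_accum_steps > 0")
    return math.ceil(num_batches / grad_accum_steps)


def epoch_step_indices(num_batches, grad_accum_steps):
    """Zero-based optimizer step indices for one epoch."""
    return list(range(optimizer_steps_per_epoch(num_batches, grad_accum_steps)))
```

With 10 micro-batches and 4 accumulation steps this gives three updates, indexed 0 through 2.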

NVIDIA NeMo-Automodel 0.1.0

08 Oct 14:18

@chtruong814 chtruong814

v0.1.0

7146809

New Features

Pretraining support for
- Models under 40B with PyT FSDP2
- Larger models by applying PyT PP
- TP can also be used for models with a TP plan
- Large MOE via custom implementations
Knowledge distillation for LLMs (requires same tokenizer)
FP8 with torchao (requires torch.compile)
Parallelism
- HSDP with FSDP2
- Auto Pipelining Support
Checkpointing
- Pipeline support (load and save)
- Parallel load with meta device
Data
- ColumnMapped Dataset for single-turn SFT
- Pretrain Data: Megatron-Core and Nano-gpt compatible data
Performance https://docs.nvidia.com/nemo/automodel/latest/performance-summary.html
- Pretraining benchmark for Large MoE user-defined models
- Fast DeepSeek v3 implementation with DeepEP

Megatron FSDP support
Packed sequence support
Triton kernels for LoRA

NVIDIA NeMo-Automodel 0.1.0rc0

17 Sep 13:59

@chtruong814 chtruong814

v0.1.0rc0

d36402d

Prerelease: NVIDIA NeMo-Automodel 0.1.0rc0 (2025-09-17)
