Changelog

NVIDIA Neural Modules 2.6.0

Highlights

Known Issues

  • NeMo voice agent connection issues with pipecat

Detailed Changelogs:

ASR

Changelog
  • fixing kernel restarting when transcribing by @weiqingw4ng :: PR: #14665
  • Downgrade "datasets" library version in ASR tutorial to ensure compatibility with HF Datasets used by @KunalDhawan :: PR: #14679
  • Fixing Sortformer training tutorial notebook by @tango4j :: PR: #14680
  • Fix for "EncDecRNNTBPEModel transcribe() failed with TypeError" by @andrusenkoau :: PR: #14698
  • Force activations and weights cast to FP32 Jasper Encoder Squeeze-Excite (merge to main) by @erastorgueva-nv :: PR: #14743
  • Use lhotse dataloader for ASR models to support in-manifest channel selection for multichannel recordings by @racoiaws :: PR: #14586
  • add transducer timestamps without alignments, timestamps to streaming by @lilithgrigoryan :: PR: #14766
  • Adding bf16 Sortformer train and inference by @tango4j :: PR: #14627
  • Replace texterrors with kaldialign library by @andrusenkoau :: PR: #14775
  • fix: Use shutil.copy fallback to handle file metadata permission errors by @vipnydav :: PR: #14639
  • Add Customization Capabilities to Cache-Aware Models by @artbataev :: PR: #14757
  • Documentation for gpu-based phrase boosting by @andrusenkoau :: PR: #14800
  • Streaming decoding policies (Wait-K and AlignAtt) for Canary model by @andrusenkoau :: PR: #14765
  • Add tests for streaming buffered and cache-aware transducer models by @artbataev :: PR: #14823
  • Merge updates of Multi-Talker Parakeet Model, Modules, Dataloader and Utils PR 01 by @weiqingw4ng :: PR: #14905
  • Merge updates of Multi-Talker Parakeet - Unit tests and CI tests PR 02 by @weiqingw4ng :: PR: #14932
  • Add Parakeet Hybrid RNNT CTC BPE Model with Prompt support by @ealbasiri :: PR: #14561
  • fix notebooks by @nithinraok :: PR: #15079
  • cherry pick #15070 by @nithinraok :: PR: #15082

TTS

Changelog
  • Remove outdated TTS Tutorials by @blisc :: PR: #14660
  • Add KokoroTTS support for voice agent framework by @tango4j :: PR: #14910
  • remove language_modeling by @dimapihtar :: PR: #14192

NLP / NMT

Changelog
  • Add gpt-oss by @cuichenx :: PR: #14457
  • Fix sequence packing loss calculation by @rayandasoriya :: PR: #14437
  • [Perf script] Llama and GPT3 perf script use mlp cast fusion by @guyueh1 :: PR: #14575
  • Delete tutorials/llm/llama/biomedical-qa directory by @cuichenx :: PR: #14653
  • Add gpt-oss lora exporter by @cuichenx :: PR: #14589
  • Replace MegatronTokenizer with MegatronLegacyTokenizer by @chtruong814 :: PR: #14721
  • Update ModelCommPGs API from megatron-core by @yaoyu-33 :: PR: #14578
  • feat: Compatibility modification of megatron-fsdp by @shjwudp :: PR: #14593
  • imported get_moe_layer_wise_logging_tracker from megatron core moe_utils by @prathamk-tw :: PR: #14694
  • Fix gpt-oss yarn_original_max_position_embeddings value by @cuichenx :: PR: #14706
  • Update docs per guidance by @pablo-garay :: PR: #14841
  • Fixing three mcore links by @aschilling-nv :: PR: #14839
  • Documentation for gpu-based phrase boosting by @andrusenkoau :: PR: #14800
  • Update gpt-oss configs by @cuichenx :: PR: #14674
  • remove language_modeling by @dimapihtar :: PR: #14192
  • cp: remove ExportDeploy into r2.6.0 by @pablo-garay :: PR: #15053
  • cherry pick #15070 by @nithinraok :: PR: #15082

Export

Changelog
  • fix: fix missing rope scaling in exporting llama embedding model by @ZhiyuLi-Nvidia :: PR: #14523
  • Add gpt-oss lora exporter by @cuichenx :: PR: #14589
  • Skip trt-llm and vllm install in install test by @chtruong814 :: PR: #14663
  • Fix deepseek export dtype by @cuichenx :: PR: #14307
  • Remove export-deploy, automodel, and eval tutorials by @chtruong814 :: PR: #14790
  • cp: remove ExportDeploy into r2.6.0 by @pablo-garay :: PR: #15053

Uncategorized:

Changelog
  • Version bump to 2.6.0rc0.dev0 by @github-actions[bot] :: PR: #14512
  • [Audio]: added conformer U-Net model for SE by @nasretdinovr :: PR: #14442
  • hyena/evo2: Make sure to convert to real after fp32 conversion by @antonvnv :: PR: #14515
  • Force-set restore path for student in KD mode by @AAnoosheh :: PR: #14532
  • Skip PTQ if PTQ model path exists by @jenchen13 :: PR: #14536
  • Support QwenVL for inference API by @meatybobby :: PR: #14534
  • Hyena: Allow to use unfused RMSNorm + TELinear to restore accuracy and some speed by @antonvnv :: PR: #14542
  • [Audio]: added streaming mode to SpectrogramToAudio by @nasretdinovr :: PR: #14524
  • Update evo2 defaults so converted checkpoints have the right parameters by @jstjohn :: PR: #14514
  • deprecate t0 scripts by @dimapihtar :: PR: #14585
  • cfg typo correction by @malay-nagda :: PR: #14588
  • [Perf script] Add use_te_activation_func and activation_func_fp8_input_store flags by @guyueh1 :: PR: #14522
  • Modify logging message to signal that RestoreConfig will be used by @balvisio :: PR: #14469
  • Bump TE and Mcore by @chtruong814 :: PR: #14568
  • Avoid host-device sync in PTL logging by @WanZzzzzz :: PR: #14489
  • Integrate implicit filter kernel with Hyena layer by @farhadrgh :: PR: #14621
  • Fix kv_channels configuration for Gemma2 27b by @ananthsub :: PR: #14590
  • [Flux] small fixes by @CarlosGomes98 :: PR: #14333
  • [Flux] Add MXFP8 Support by @alpha0422 :: PR: #14473
  • Use hugginface_hub for downloading the FLUX checkpoint by @suiyoubi :: PR: #14638
  • Fine-tune embedding models (E5-Large-V2 and LLaMA-3.2-1B) on the allnli triplet dataset with NeMo Framework by @girihemant19 :: PR: #14584
  • remove service launch scripts by @dimapihtar :: PR: #14647
  • Warn instead of error when chat template doesn't contain generation keyword by @jenchen13 :: PR: #14641
  • Fix function calling notebook by @cuichenx :: PR: #14643
  • [Audio]: fixed bug in conformer unet by @nasretdinovr :: PR: #14626
  • Fix code checkout during test by @chtruong814 :: PR: #14658
  • Fix Flux seed as optional Arg by @suiyoubi :: PR: #14652
  • Remove PEFT scheme condition from recipe by @JRD971000 :: PR: #14661
  • Add NeMo Voice Agent by @stevehuang52 :: PR: #14325
  • Update get_tensor_shapes function whose signature was refactored by @AAnoosheh :: PR: #14594
  • Delete nemo1 notebooks by @cuichenx :: PR: #14677
  • Bump latest Mcore 020abf01 by @chtruong814 :: PR: #14676
  • [Flux] correct vae_downscale_factor by @CarlosGomes98 :: PR: #14425
  • Bump modelopt to 0.35.0 and remove safe_import("modelopt") in llm collection by @kevalmorabia97 :: PR: #14656
  • Canary tutorial fix by @nune-tadevosyan :: PR: #14699
  • Add option for LoRA with Transformer Engine op fuser by @timmoon10 :: PR: #14411
  • add load-in-4bit param by @dimapihtar :: PR: #14636
  • Support NVFP4 recipe by @WanZzzzzz :: PR: #14625
  • Fix broken link in Reasoning-SFT.ipynb by @cuichenx :: PR: #14716
  • Remove artificial block to vortex fp8 TP by @jstjohn :: PR: #14684
  • Drop speech_llm example suite by @yaoyu-33 :: PR: #14683
  • remove env var by @malay-nagda :: PR: #14739
  • detach arg option for run scripts by @malay-nagda :: PR: #14722
  • Randomized shard slicing for tarred data by @pzelasko :: PR: #14558
  • Data prediction objective for flow matching speech enhancement models by @racoiaws :: PR: #14749
  • Fix Some Failures by @alpha0422 :: PR: #14763
  • Support additional Slurm parameters (#14701) by @bdubauski :: PR: #14742
  • [Flux] Remove Redundant Host & Device Sync by @alpha0422 :: PR: #14711
  • [Flux] Full Iteration CUDA Graph by @alpha0422 :: PR: #14744
  • Update prune-distill notebooks to Qwen3 + simplify + mmlu eval by @kevalmorabia97 :: PR: #14785
  • ci: Automodel deprecation warning by @thomasdhc :: PR: #14787
  • Bug in MXFP8 recipe by @adityavavreNVDA :: PR: #14793
  • feat: Disable blank Issues by @pablo-garay :: PR: #14788
  • ci: Add community label bot by @chtruong814 :: PR: #14796
  • Add mistral small3 24B config and recipe by @eagle705 :: PR: #14784
  • Update changelog for r2.3.0 by @github-actions[bot] :: PR: #14812
  • QWEN2.5-VL 7B FP8 Recipe by @tomlifu :: PR: #14801
  • Feat: Disk space management: for nemo install test by @pablo-garay :: PR: #14822
  • Evo2 address rare over-masking in 1m context dataset by @jstjohn :: PR: #14821
  • Update cherry-pick workflow to use version 0.63.0 by @pablo-garay :: PR: #14832
  • Removing automodel items by @aschilling-nv :: PR: #14840
  • Update changelog for v2.4.1 by @github-actions[bot] :: PR: #14828
  • Fix lm_eval installation in pruning tutorial for 25.09 container by @kevalmorabia97 :: PR: #14865
  • Add nemotron-nano-v2 support to voice agent by @stevehuang52 :: PR: #14704
  • Update changelog for 2.5.0 by @chtruong814 :: PR: #14890
  • [Qwen3] Fix the flop cal for Qwen3 by @gdengk :: PR: #14897
  • [lhotse][aistore] added support input_cfg.yaml directly from aistore bucket by @XuesongYang :: PR: #14891
  • Harden _is_target_allowed by adding runtime class validation on top of prefix checks to prevent unsafe target resolution by @KunalDhawan :: PR: #14540
  • Enable simplified DistOpt checkpoint formats by @mikolajblaz :: PR: #14428
  • Fix the load checkpointing issue -- onelogger callback gets called multiple time in some case. by @liquor233 :: PR: #14945
  • Revert "new changelog-build" by @pablo-garay :: PR: #14949
  • feat: new changelog-build by @pablo-garay :: PR: #14950
  • Update llama4 utils kwargs by @yaoyu-33 :: PR: #14924
  • Update README.md by @snowmanwwg :: PR: #14917
  • Update all outdated NeMo Curator links by @sarahyurick :: PR: #14760
  • Freeze tags in in r2.6.0 by @github-actions[bot] :: PR: #14957
  • cp: Bump MCore, TE, Pytorch, and modelopt for 25.11 (14946) into r2.6.0 by @chtruong814 :: PR: #14976
  • cp: Update ctc-segmentation (14991) into r2.6.0 by @chtruong814 :: PR: #14998
  • cherry-pick of #14962 by @dimapihtar :: PR: #15000
  • cp: Pass timeout when running speech functional tests (15012) into r2.6.0 by @chtruong814 :: PR: #15013
  • cp: check asr models (14989) into r2.6.0 by @chtruong814 :: PR: #15002
  • cp: Enable EP in PTQ (15015) into r2.6.0 by @chtruong814 :: PR: #15026
  • cp: Update numba to numba-cuda and update cuda python bindings usage (15018) into r2.6.0 by @chtruong814 :: PR: #15024
  • cp: Add import guards for mcore lightning module (14970) into r2.6.0 by @chtruong814 :: PR: #14981
  • cp: fix loading of hyb ctc rnnt bpe models when using from pretrained (15042) into r2.6.0 by @chtruong814 :: PR: #15045
  • cp: fix: fix update-buildcache workflow after ED remove (15051) into r2.6.0 by @chtruong814 :: PR: #15052
  • cp: chore: update Lightning requirements version (15004) into r2.6.0 by @chtruong814 :: PR: #15049
  • cp: update notebook (15093) into r2.6.0 by @chtruong814 :: PR: #15094
  • cp: Fix: Obsolete Attribute [SDE] (15105) into r2.6.0 by @chtruong814 :: PR: #15106
  • cp: Upgrade NeMo ASR tutorials from Mozilla/CommonVoice to Google/FLEURS (15103) into r2.6.0 by @chtruong814 :: PR: #15107
  • cp: chore: Remove Automodel module (15044) into r2.6.0 by @chtruong814 :: PR: #15084
  • cp: Add deprecation notice to modules (15050) into r2.6.0 by @chtruong814 :: PR: #15110

NVIDIA Neural Modules 2.5.3

Highlights

  • This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at psirt@nvidia.com
  • Update nv-one-logger
  • Update ctc-segmentation

Detailed Changelogs:

Text Normalization / Inverse Text Normalization

Changelog
  • chore: update Lightning requirement by @liquor233 :: PR: #15005

Uncategorized:

Changelog
  • cp: Update ctc-segmentation (14991) into r2.5.0 by @chtruong814 :: PR: #15020
  • Bump to 2.5.3 by @chtruong814 :: PR: #15022

NVIDIA Neural Modules 2.5.2

Detailed Changelogs:

Text Normalization / Inverse Text Normalization

Changelog
  • cp: Add import guards for mcore lightning module (#14970) into r2.5.0 by @chtruong814 :: PR: #14982

Uncategorized:

Changelog
  • Bump to 2.5.2 by @chtruong814 :: PR: #14983

NVIDIA Neural Modules 2.5.1

Highlights

  • This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at psirt@nvidia.com
  • Adds nv-one-logger
  • Adds fixes related to Megatron FSDP

Detailed Changelogs:

ASR

Changelog
  • Patch: r2.5.0 with onelogger changes. by @PeiyuanQi :: PR: #14811

TTS

Changelog
  • Patch: r2.5.0 with onelogger changes. by @PeiyuanQi :: PR: #14811

NLP / NMT

Changelog
  • Patch: r2.5.0 with onelogger changes. by @PeiyuanQi :: PR: #14811
  • Megatron FSDP r2.5.0 cherry-pick by @BoxiangW :: PR: #14922

Uncategorized:

Changelog
  • Bump to 2.5.1 by @chtruong814 :: PR: #14898
  • Cherry pick Feat: Disk space management: for nemo install test (14822) into r2.5.0 by @chtruong814 :: PR: #14937
  • cp: Fix the load checkpointing issue -- onelogger callback gets called multiple time in some case. (14945) into r2.5.0 by @chtruong814 :: PR: #14948

NVIDIA Neural Modules 2.5.0

Highlights

  • Collections:

    • LLM
      • Nano v2 12B and 9B
    • Speech
      • New SpeechLM2 collection
      • Streaming Sortformer model
      • Deprecate Confidence Ensemble models
      • parakeet-tdt-0.6b-v3 and canary-1b-v2 models
      • Added chunk inference support with .transcribe() for Canary-based models (see the sketch after this list)
      • Enable prediction of timestamps with streaming ASR
      • Improve ASR models’ invariance to padding/batch size
      • Qwen prompt format support, SALM generation fixes
      • High-level SALM model.generate API closely resembling HF models
      • SALM model initialization with time/memory optimization
      • SpeechLM2: fixed excessive padding, support on-the-fly resampling for SALM
  • Automodel and Export-Deploy functionality now lives in separate standalone repositories and is deprecated in NeMo 2.0
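
As a rough illustration of the chunked-inference and timestamp highlights above, the following minimal sketch loads canary-1b-v2 (named in this release) and transcribes a long recording. The audio filename, the `timestamps` keyword, and the `.text` attribute on the returned hypotheses are assumptions to verify against the NeMo ASR docs for this version.

```python
# Rough sketch of the chunked long-form inference highlight; verify the exact
# .transcribe() keyword arguments against the NeMo version you run.
import nemo.collections.asr as nemo_asr

# canary-1b-v2 is one of the models named in this release's highlights.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/canary-1b-v2")

# Per the highlight, .transcribe() handles chunking of long recordings for
# Canary-based models; timestamps=True (an assumption here) requests the
# timestamp prediction this release also advertises for streaming ASR.
hypotheses = model.transcribe(["long_meeting_recording.wav"], timestamps=True)
print(hypotheses[0].text)  # assumes Hypothesis objects exposing .text
```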

Detailed Changelogs:

ASR

Changelog
  • Modernize logger interface by @emmanuel-ferdman :: PR: #13783
  • Higher-level API for SALM.generate by @pzelasko :: PR: #14034
  • add/refactor docs for asr lm customization by @lilithgrigoryan :: PR: #14088
  • Improve NEST GPU Utilization 1/N by @MahmoudAshraf97 :: PR: #14086
  • Improve ASR models' invariance to padding/batch size by @pzelasko :: PR: #13827
  • Clean up transducer decoding initialization by @artbataev :: PR: #14112
  • Improve NEST GPU Utilization 2/N by @MahmoudAshraf97 :: PR: #14089
  • GPU-accelerated Phrase-Boosting (GPU-PB) for AED decoding by @andrusenkoau :: PR: #14108
  • Fix decoding with ngpu-lm when training (#13994) by @hoangtran9122 :: PR: #13995
  • fix eval_beamsearch_ngram_ctc script by @lilithgrigoryan :: PR: #14238
  • fix wrong typing for ctc-ws context graph by @andrusenkoau :: PR: #14262
  • fix frame vad by @stevehuang52 :: PR: #14337
  • Improve NEST GPU Utilization 3/N by @MahmoudAshraf97 :: PR: #14234
  • remove confidence ensemble models by @lilithgrigoryan :: PR: #14343
  • Fix ASR decoding issues with CUDA graphs in training by @artbataev :: PR: #14184
  • Streaming Sortformer release PR01: uploading bugfixes, refactored variables and yaml file name changes by @tango4j :: PR: #14416
  • Streaming Sortformer release PR02: unit tests for streaming models and modules by @tango4j :: PR: #14417
  • GPU-accelerated Phrase-Boosting (GPU-PB) for CTC, RNN-T, and TDT decoding by @andrusenkoau :: PR: #14277
  • Fix subsampling chunking test by @monica-sekoyan :: PR: #14452
  • Canary2 with NFA by @monica-sekoyan :: PR: #14121
  • Initial Chunking by @nune-tadevosyan :: PR: #14321
  • Chunking fix by @nune-tadevosyan :: PR: #14482
  • Tutorial and doc update by @nune-tadevosyan :: PR: #14484
  • Streaming Sortformer release PR03: NeMo documentations and tutorial notebook by @tango4j :: PR: #14388
  • Add wget_from_nemo by @nune-tadevosyan :: PR: #14623
  • Downgrade "datasets" library version in ASR tutorial to ensure compatibility with HF Datasets used by @KunalDhawan :: PR: #14685
  • Canary tutorial fix by @nune-tadevosyan :: PR: #14708
  • Force activations and weights cast to FP32 Jasper Encoder Squeeze-Excite by @erastorgueva-nv :: PR: #14715

TTS

Changelog
  • Improve ASR models' invariance to padding/batch size by @pzelasko :: PR: #13827
  • remove nlp modules by @dimapihtar :: PR: #14127
  • Temporarily Remove Encoder PP Support by @yaoyu-33 :: PR: #14167
  • Remove T5-TTS by @blisc :: PR: #14252

NLP / NMT

Changelog
  • add extra params for MegatronDataSampler by @dimapihtar :: PR: #13956
  • Modernize logger interface by @emmanuel-ferdman :: PR: #13783
  • remove dialogue collection by @dimapihtar :: PR: #14087
  • remove QA collection by @dimapihtar :: PR: #14092
  • remove text nlp collection by @dimapihtar :: PR: #14110
  • remove nlp modules by @dimapihtar :: PR: #14127
  • remove rag collection by @dimapihtar :: PR: #14157
  • remove nmt collection by @dimapihtar :: PR: #14191
  • Fix importerror in transformer_lm_model after nlp module removals by @chtruong814 :: PR: #14199
  • fix QA comments NVBug by @huvunvidia :: PR: #14196
  • Temporarily Remove Encoder PP Support by @yaoyu-33 :: PR: #14167
  • remove mixins collections by @dimapihtar :: PR: #14281
  • feat: print expert groups on megatron init by @clumsy :: PR: #13874
  • [speechlm2] [lhotse] sharegpt data and testloader by @huckiyang :: PR: #14294
  • Add notebook for LoRA on GPT-OSS-20B by @shashank3959 :: PR: #14439
  • Sketch dist-ckpt content versioning by @mikolajblaz :: PR: #13839
  • Change to enable full iteration CUDA graph for LLMs by @vasunvidia :: PR: #14077

Text Normalization / Inverse Text Normalization

Changelog
  • Check lightning and core imports in install test by @chtruong814 :: PR: #14403

Export

Changelog
  • ci: Set L2_NeMo_2_Export_Deploy_Query_In_Framework to be optional by @chtruong814 :: PR: #13946
  • Remove old export doc by @oyilmaz-nvidia :: PR: #14292
  • Llama4 Export: Remove outdated MLP weight transform by @suiyoubi :: PR: #14297
  • Update mllama hf import/export for transformers 4.53 by @meatybobby :: PR: #14327

Bugfixes

Changelog
  • Bugfix for Hyena to the get_t function which comes up when doing longer context inference by @jstjohn :: PR: #14256
  • fix skipped cuHyena kernel while training by @farhadrgh :: PR: #14365
  • Remove flaky Evo2 dataset performance test by @jstjohn :: PR: #14371
  • Use module prefix in restore_modelopt_state by @jenchen13 :: PR: #14384

Uncategorized:

Changelog
  • Version bump to 2.5.0rc0.dev0 by @github-actions[bot] :: PR: #13944
  • [Llama4] Enable tp comm overlap for llama4 by @gdengk :: PR: #13940
  • Fix for Squad Dataset Download by @rhmukundan :: PR: #13893
  • add nmh HF conversion by @JRD971000 :: PR: #13941
  • Speechlm2 SALM improvements by @pzelasko :: PR: #13829
  • fix dataset issue by @dimapihtar :: PR: #13953
  • Editing MMLU to pull from the correct repo by @ruchaa-apte :: PR: #13991
  • move classes to module to use target feature (#14023) by @nithinraok :: PR: #14031
  • Add Nemotron-H prompt format, fix cut-to-conversation custom attr propagation by @pzelasko :: PR: #13963
  • Bump release_library template to v0.40.0 by @chtruong814 :: PR: #14046
  • [automodel] add support for layer-freezing by @akoumpa :: PR: #14000
  • [Qwen3] Recipe config bug fix by @gdengk :: PR: #14084
  • Add TE import guard in qwen2vl vision module by @chtruong814 :: PR: #14091
  • Update bitsandbytes dependency to v0.46.0 by @pramodk :: PR: #14050
  • Update FSDP2 docstring by @BoxiangW :: PR: #14105
  • Interface to enable fsdp-double-buffer without enabling NCCL-UB by @youngeunkwon0405 :: PR: #14076
  • SpeechLM2 SALM: load ckpt faster, with less GPU memory by @pzelasko :: PR: #14113
  • Add object_storage_cache_path to PreTrainingDataModule by @shunjiad :: PR: #14103
  • Update changelog for r2.3.0 by @github-actions[bot] :: PR: #14160
  • Fix FLUX test with correct env var by @suiyoubi :: PR: #14149
  • add mmap_bin_files param by @dimapihtar :: PR: #14122
  • Add option to suppress import checks in Dockerfile.speech by @artbataev :: PR: #14185
  • Safely import optional python packages by @roclark :: PR: #13936
  • Set flux test as optional by @chtruong814 :: PR: #14190
  • Revert "Safely import optional python packages (#13936)" by @chtruong814 :: PR: #14197
  • Fix "Safely import optional python packages (#13936)" by @chtruong814 :: PR: #14198
  • Add fix for evo2 generate/inference by @jwilber :: PR: #14027
  • Fixing file path suffix by @gautham-kollu :: PR: #14179
  • Update AVLM finetune example for vanilla fine-tuning by @huvunvidia :: PR: #14232
  • [finetune] Add dataset_kwargs to prepare packed sequence data by @jiajunly :: PR: #14169
  • Allow exception in hf ckpt load attempt before fallback to standard l… by @trvachov :: PR: #14214
  • Load master weights from checkpoint by @kunlunl :: PR: #14072
  • Add deploy lora adapter portion by @ruchaa-apte :: PR: #14255
  • fix speechlm lhotse loading nemo_tarred by @stevehuang52 :: PR: #14314
  • Update changelog for r2.4.0 by @github-actions[bot] :: PR: #14334
  • Flaky test timing out: @pytest.mark.pleasefixme by @pablo-garay :: PR: #14351
  • Support dump perf recipe diff from base recipe by @guyueh1 :: PR: #14206
  • Bugfix degenerate bases evo2 dataset by @jstjohn :: PR: #14359
  • Hyena support for flash decode API by @jstjohn :: PR: #14315
  • Fix Gemma2/3 & Llava (Next) & Llama4 conversion issue with latest transformers by @suiyoubi :: PR: #14367
  • fix: reduce the excessive test time of test_msdd_diar_inference by @tango4j :: PR: #14366
  • SpeechLM2: S2S->S2T data reader, excessive padding fixes by @pzelasko :: PR: #14124
  • chore: Release 2.5.0rc0 by @ko3n1g :: PR: #14389
  • Add pyxis flag for container writable. by @sudostock :: PR: #14395
  • [MoE] Partial Cudagraph support for MoE by @gdengk :: PR: #14362
  • Revert "[MoE] Partial Cudagraph support for MoE (#14362)" by @chtruong814 :: PR: #14402
  • Update AVLM recipes for NeMo-CI runs by @huvunvidia :: PR: #14397
  • Remove nemo1 multimodal and vision by @yaoyu-33 :: PR: #14095
  • Fix LazyNeMoIterator supervision for multi-channel cuts by @anteju :: PR: #14409
  • Bump Mcore to 7f7439f by @chtruong814 :: PR: #14373
  • Use cuhyena rearrange when available. by @moradza :: PR: #14383
  • Fix model training/eval state after PTL validation loop by @paul-gibbons :: PR: #14152
  • Add deprecation notice to eval code by @athitten :: PR: #14316
  • Streaming Sortformer release PR04: Adding functional tests for streaming sortformer by @tango4j :: PR: #14435
  • QWEN2.5-VL 7B Performance Recipe by @tomlifu :: PR: #14401
  • Discount FLOPs in dot-product att by @erhoo82 :: PR: #14424
  • Bump to pytorch 25.06 and newer TE commit by @chtruong814 :: PR: #14423
  • Enable precision aware optimizer for dsv3 by @guyueh1 :: PR: #14444
  • Make VBoost activation conditional by @bdubauski :: PR: #14458
  • cuHyena FFTConv support for Hyena Long Implicit (LI) Layer by @farhadrgh :: PR: #14396
  • Alit/nano v2 by @JRD971000 :: PR: #14464
  • Fix reuse_grad_buf_for_mxfp8_param_ag for mxfp8 by @guyueh1 :: PR: #14445
  • Fix loss mask for chat datasets by @cuichenx :: PR: #14369
  • Rename to subquadratic_ops by @farhadrgh :: PR: #14486
  • Allows using other signals (than SIGTERM) with PreemptionPlugin by @zachmoshe :: PR: #14248
  • Qwen2.5-VL 32B Performance Recipe by @tomlifu :: PR: #14485
  • Alit/nanov2 12b by @JRD971000 :: PR: #14483
  • Freeze tags in in r2.5.0 by @github-actions[bot] :: PR: #14513
  • deprecate t0 by @dimapihtar :: PR: #14599
  • Cherry pick Use hugginface_hub for downloading the FLUX checkpoint (14638) into r2.5.0 by @chtruong814 :: PR: #14640
  • Cherry pick Fix function calling notebook (14643) into r2.5.0 by @chtruong814 :: PR: #14650
  • Cherry pick remove service launch scripts (14647) into r2.5.0 by @chtruong814 :: PR: #14648
  • Cherry pick Delete tutorials/llm/llama/biomedical-qa directory (14653) into r2.5.0 by @chtruong814 :: PR: #14654
  • Cherry pick Remove PEFT scheme condition from recipe (14661) into r2.5.0 by @chtruong814 :: PR: #14662
  • Cherry pick fixing kernel restarting when transcribing (14665) into r2.5.0 by @chtruong814 :: PR: #14672
  • Delete nemo 1 notebooks by @cuichenx :: PR: #14675
  • Cherry pick Fixing Sortformer training tutorial notebook (14680) into r2.5.0 by @chtruong814 :: PR: #14681
  • Cherry-pick Update get_tensor_shapes function whose signature was refactored (14594) into r2.5.0 by @chtruong814 :: PR: #14678
  • Cherry pick Skip trt-llm and vllm install in install test (14663) into r2.5.0 by @chtruong814 :: PR: #14697
  • Cherry pick Fix for "EncDecRNNTBPEModel transcribe() failed with TypeError" (14698) into r2.5.0 by @chtruong814 :: PR: #14709
  • Cherry pick Fix broken link in Reasoning-SFT.ipynb (14716) into r2.5.0 by @chtruong814 :: PR: #14717
  • cherry-pick add load-in-4bit param (14636) into r2.5.0 by @dimapihtar :: PR: #14719
  • Cherry pick Fix deepseek export dtype (14307) into r2.5.0 by @chtruong814 :: PR: #14682
  • Cherry pick remove env var (14739) into r2.5.0 by @chtruong814 :: PR: #14746
  • Cherry-pick 'Bump modelopt to 0.35.0 and remove safe_import("modelopt") in llm collection (#14656)' into 'r2.5.0' by @chtruong814 :: PR: #14771
  • Cherry pick Update prune-distill notebooks to Qwen3 + simplify + mmlu eval (14785) into r2.5.0 by @chtruong814 :: PR: #14789
  • Cherry pick Remove export-deploy, automodel, and eval tutorials (14790) into r2.5.0 by @chtruong814 :: PR: #14792
  • Cherry pick ci: Automodel deprecation warning (14787) into r2.5.0 by @chtruong814 :: PR: #14791

NVIDIA Neural Modules 2.4.1

Detailed Changelogs:

Uncategorized:

Changelog
  • Update package_info.py by @ko3n1g :: PR: #14400
  • Patch to address issue 14392 by @youngeunkwon0405 :: PR: #14398
  • Cherry pick Fix callbacks in DSV3 script (14350) into r2.4.0 by @chtruong814 :: PR: #14370
  • Cherry pick Change Llama Embedding Tutorial to use SFT by default (14231) into r2.4.0 by @chtruong814 :: PR: #14303
  • Cherrypick calculate_per_token_loss requirement for context parallel (#14065) (#14282) into r2.4.0 by @chtruong814 :: PR: #14448
  • Pin nvidia-lm-eval to 25.6.1 by @chtruong814 :: PR: #14470

NVIDIA Neural Modules 2.3.3

NVIDIA Neural Modules 2.4.0

Highlights

  • Collections:
    • Speech
      • Batched beam search for transducers (RNN-T and TDT) (see the sketch after this list)
      • RNN-T/TDT buffered/streaming inference + batched decoding support in cache-aware models
      • CTC batched beam search with GPU-LM support
      • Key fixes
        • Punctuation Marks in Timestamps
        • Fix timestamps when CUDA graphs are enabled
        • Fix masking of <pad> tokens in AED inference
        • TDT streaming inference fix
    • LLM
      • Qwen 3 235B-A22B Perf Optimized
      • DeepSeek V3 Perf Optimized
      • Gemma3 support from Google
      • Embedding and Reranker models
    • MM
      • Llama 4
      • AVLM
  • Training performance (speed)
    • NVL sharp + IB sharp for DP/FSDP-communications on H100 and B200
    • MXFP8 with TP communication overlap
    • MXFP8 with reduced memory allocation
    • FP8 sub-channel recipe (128x128 for weight and 1x128 for activation)
    • cudnn fused attention for MLA (both Hopper and Blackwell)
    • Advanced custom asymmetric pipelining (for MTP, loss func, and embd)
    • BF16 optimizer for model memory saving
    • CUDA graph fix for fine-tuning benchmarks
    • CUDA graph support for LLAMA4
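
As a companion to the batched beam search highlight above, here is a minimal sketch of switching a transducer model's decoding strategy at inference time. The checkpoint name, the "beam" strategy key, and the `beam.beam_size` field are assumptions based on the usual NeMo decoding config layout; the batched variants added in this release may use a different strategy name.

```python
# Minimal sketch: enable beam search decoding on an RNN-T/TDT model via
# change_decoding_strategy(). Field names follow the typical NeMo decoding
# config layout and should be verified for this release.
import copy

import nemo.collections.asr as nemo_asr

# Any RNN-T/TDT checkpoint works here; parakeet-tdt-1.1b is an example choice.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")

decoding_cfg = copy.deepcopy(model.cfg.decoding)
decoding_cfg.strategy = "beam"    # batched beam variants may use another key
decoding_cfg.beam.beam_size = 4
model.change_decoding_strategy(decoding_cfg)

print(model.transcribe(["sample.wav"])[0])
```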

Detailed Changelogs:

ASR

Changelog
  • ci: Fix ASR container by @ko3n1g :: PR: #13288
  • Set L2_Segmentation_Tool_Parallel_ctc_segmentation test to be optional by @chtruong814 :: PR: #13296
  • Revert "WebDataset URL refactoring" by @ko3n1g :: PR: #13421
  • Update flagged docs links by @erastorgueva-nv :: PR: #13391
  • [Docs] Fix incorrectly formatted reference tags by @erastorgueva-nv :: PR: #13445
  • Update CP by @pablo-garay :: PR: #13532
  • Tdt buffered inference fix by @hainan-xv :: PR: #13500
  • Fix transcribe when nbest hypotheses are returned by @lilithgrigoryan :: PR: #13540
  • Set ASR test to be optional by @chtruong814 :: PR: #13633
  • Enabling chunked inference for AED models in asr_evaluator by @melllinia :: PR: #13674
  • Ko3n1g/chore/asr only by @ko3n1g :: PR: #13704
  • decompressing joblib file before checking it by @Ssofja :: PR: #13732
  • Revert "decompressing joblib file before checking it (#13732)" by @chtruong814 :: PR: #13791
  • Punctuation Marks in Timestamps by @monica-sekoyan :: PR: #13353
  • AIStore with Webdataset by @monica-sekoyan :: PR: #13604
  • Update to add default for dataclass variables by @nithinraok :: PR: #13814
  • This PR addresses to known security issues by @Ssofja :: PR: #13804
  • remove model_stride var by @nithinraok :: PR: #13867
  • add CTC batched beam search by @lilithgrigoryan :: PR: #13337
  • Clean up streaming ASR script and tests by @artbataev :: PR: #13894
  • add NGPU-LM fusion during CTC greedy by @lilithgrigoryan :: PR: #13917

TTS

Changelog
  • Revert "WebDataset URL refactoring" by @ko3n1g :: PR: #13421
  • Update flagged docs links by @erastorgueva-nv :: PR: #13391
  • [Docs] Fix incorrectly formatted reference tags by @erastorgueva-nv :: PR: #13445
  • Update CP by @pablo-garay :: PR: #13532
  • fix: vpp stage refactoring to match mcore by @ZhiyuLi-Nvidia :: PR: #13673
  • AIStore with Webdataset by @monica-sekoyan :: PR: #13604

NLP / NMT

Changelog
  • Migrate Hyena to Megatron inference_context. by @cspades :: PR: #13436
  • Update CP by @pablo-garay :: PR: #13532
  • fix broken links by @dimapihtar :: PR: #13544
  • Add nlp import checks by @thomasdhc :: PR: #13563
  • PTQ model support, quant_cfg, and documentation updates by @janekl :: PR: #13519
  • feat - GPTSFTChatDataset alignment with OpenAI Messages, compatibility with packed sequences by @soluwalana :: PR: #13367
  • fix: vpp stage refactoring to match mcore by @ZhiyuLi-Nvidia :: PR: #13673
  • Fix resume with MegatronPretrainingBatchSampler by @ashors1 :: PR: #13565
  • Punctuation Marks in Timestamps by @monica-sekoyan :: PR: #13353
  • Revert Adding more doc-strings to megatron_parallel.py #12767 by @ko3n1g :: PR: #13824
  • reasoning model evaluation mmlu gpqa by @ruchaa-apte :: PR: #13880
  • Remove unused DynamicRetrievalServer and Bert dataset loader classes by @dimapihtar :: PR: #14209
  • Huvu/avlm qafix cherrypick from by @huvunvidia :: PR: #14253

Export

Changelog
  • Improve Nemo2Exporter for Models Using Custom Modelling Files on HF by @suiyoubi :: PR: #13400
  • Adding more export tests by @oyilmaz-nvidia :: PR: #13410
  • Add Warning to Export when output_path exists by @suiyoubi :: PR: #13465
  • Move libsox-fmt-all from Dockerfile.ci.export_deploy to Dockerfile.ci by @chtruong814 :: PR: #13452
  • ci: Remove trt-llm breakpoint by @ko3n1g :: PR: #13499
  • Add Qwen2VL export_ckpt by @AtsunoriFujita :: PR: #13398
  • Add MLlama export_ckpt by @AtsunoriFujita :: PR: #13346
  • Update vLLMExporter to use vLLM V1 by @janekl :: PR: #13498
  • Add vLLM Mixtral and TRT-LLM qnemo export tests (plus a couple of bugfixes) by @janekl :: PR: #13697
  • Fix Qwen3 export + misc by @cuichenx :: PR: #13679
  • Extra int cast for successful tracing during ONNX export by @janekl :: PR: #13782
  • FP8 lora export by @cuichenx :: PR: #13748
  • Add PEFT export check by @cuichenx :: PR: #13835
  • Update llm api import_ckpt/export_ckpt docstring by @meatybobby :: PR: #13714
  • Use modelopt export and disable dataset calibration for weight only PTQ by @jenchen13 :: PR: #13756

Bugfixes

Changelog
  • [automodel] move liger kernel patching by @akoumpa :: PR: #13579

Uncategorized:

Changelog
  • build: various bumps by @ko3n1g :: PR: #13285
  • ci: Fixes to selective triggering by @ko3n1g :: PR: #13287
  • ci: Set timeout by @ko3n1g :: PR: #13294
  • Set L2_NeMo_2_T5_Pretraining test as optional by @chtruong814 :: PR: #13282
  • Add test environment approval step for CI by @chtruong814 :: PR: #13297
  • update num nodes in deepseek v3 finetune recipe by @cuichenx :: PR: #13314
  • ci: Increase cache pool by @ko3n1g :: PR: #13306
  • Rename adam_with_cosine_annealing as adam since cosin LR is not setup by @ShriyaRishab :: PR: #13315
  • ci: Update test queue bot to not assume a workflow is launched from a PR by @chtruong814 :: PR: #13318
  • Fix TE pytorch attention doc link by @thomasdhc :: PR: #13327
  • ci: Add all recent buildcaches to update-buildcache job by @ko3n1g :: PR: #13289
  • Fix neva notebook by @yaoyu-33 :: PR: #13334
  • Fix transformer offline for CI/CD llama4 tests by @yaoyu-33 :: PR: #13339
  • [automodel] convert lm head to full tensor before passing to lce by @yuanzhedong :: PR: #13319
  • ci: No dups in queue by @ko3n1g :: PR: #13352
  • ci(hotfix): VLM CPU unit tests by @ko3n1g :: PR: #13348
  • vLLM==0.8.5 update by @janekl :: PR: #13350
  • ci: Allow bypassing approval by @ko3n1g :: PR: #13365
  • Avoid the need to specify optional attributes for lhotse/nemo reader functions by @pzelasko :: PR: #13307
  • ci: Fix selective-triggering for non-PR events by @ko3n1g :: PR: #13374
  • ci: Revert no-concurrency-group-on-main by @ko3n1g :: PR: #13375
  • ci: Improve no-fail-fast mechanism by @ko3n1g :: PR: #13370
  • 2d buckets estimation fix by @monica-sekoyan :: PR: #13377
  • ci: Fix scheduled runs by @ko3n1g :: PR: #13378
  • Ko3n1g/ci/fix nightly runs by @ko3n1g :: PR: #13382
  • [automodel] fix none issue in dataset for qwen model by @yuanzhedong :: PR: #13311
  • update table by @akoumpa :: PR: #13397
  • Improve test coverage for audio modules by @anteju :: PR: #13333
  • Disable failing maxine loss test by @anteju :: PR: #13361
  • Ko3n1g/ci/no notification on cancel by @ko3n1g :: PR: #13403
  • document fp8_recipe by @akoumpa :: PR: #13405
  • Weekly bump main by @ko3n1g :: PR: #13408
  • Handle boolean args for performance scripts and log received config by @guyueh1 :: PR: #13291
  • [automodel] add FirstRankPerNode by @akoumpa :: PR: #13373
  • tests: Disable flaky audio test by @ko3n1g :: PR: #13429
  • ci: Disable flaky audio test by @ko3n1g :: PR: #13435
  • Fix loss compute and reduction by @xrennvidia :: PR: #13295
  • ci: Skip link check on github links by @chtruong814 :: PR: #13425
  • Add NCCL cfg interface to perf scripts by @erhoo82 :: PR: #13407
  • ci: Success only if Run CICD label attached by @ko3n1g :: PR: #13430
  • ci: Add tests to selective triggering by @ko3n1g :: PR: #13404
  • ci: Remove jq by @ko3n1g :: PR: #13440
  • ci: Fix deps tree for tests by @ko3n1g :: PR: #13443
  • Ko3n1g/ci/fix dependency tree by @ko3n1g :: PR: #13448
  • Adding additional unit tests for the deploy module by @pthombre :: PR: #13411
  • [Audio] fix a flaky test (and also make some tests run faster) by @racoiaws :: PR: #13439
  • [automodel] ignore tail padding in TPS calculation by @akoumpa :: PR: #13329
  • Ko3n1g/ci/selective triggering 3 by @ko3n1g :: PR: #13460
  • ci: Disable broken neva tests by @ko3n1g :: PR: #13461
  • fix speechlm data module by @stevehuang52 :: PR: #13362
  • ci: Enter queue only with passing linting by @ko3n1g :: PR: #13462
  • Adding tests for Schroedinger Bridge model by @nasretdinovr :: PR: #13401
  • add more detailed description by @dimapihtar :: PR: #13464
  • [Audio] tests for score-based and flow matching enhancement models by @racoiaws :: PR: #13406
  • Use expandable cuda memory segmentation by @erhoo82 :: PR: #13418
  • Fix llava tokenizer caused nan issue by @yaoyu-33 :: PR: #13466
  • Remove cuda method from ModelPT by @erastorgueva-nv :: PR: #13394
  • Fix BNR 2 unit test + input, case where input length was not specified by @nitin9252 :: PR: #13467
  • ci: Do not run any tests if no match is found by @ko3n1g :: PR: #13479
  • Ko3n1g/ci/selective triggering 4 by @ko3n1g :: PR: #13489
  • Fix typo in the performance script by @youngeunkwon0405 :: PR: #13487
  • ci: No runs on main by @ko3n1g :: PR: #13490
  • ci: Upload on schedule by @ko3n1g :: PR: #13491
  • ci: Run selective triggering on dockerfiles and dependencies by @ko3n1g :: PR: #13493
  • [automodel] fallback FP8 + LCE -> FP8 + CE by @akoumpa :: PR: #13349
  • Update changelog for r2.3.0 by @github-actions[bot] :: PR: #13501
  • Update 2.3.0 changelog by @chtruong814 :: PR: #13504
  • Enabling flash decode for float16 precision only by @pthombre :: PR: #13471
  • Fix changelog formatting by @chtruong814 :: PR: #13505
  • Updating the long context performance number for B200 by @youngeunkwon0405 :: PR: #13468
  • ci: Add more files to filter by @ko3n1g :: PR: #13517
  • Improve error message when HF checkpoint cannot be loaded by @ashors1 :: PR: #13513
  • Add Resume_path to llama_nemotron models by @suiyoubi :: PR: #13515
  • Add Llama4 GHA by @suiyoubi :: PR: #13442
  • add memory profile interface to perf scripts by @erhoo82 :: PR: #13413
  • Add fp8_param argument back to mixed precision plugin for backward compatibility by @guyueh1 :: PR: #13522
  • [automodel] add find_unused_parameters=True for DDP by @akoumpa :: PR: #13366
  • ci: Update success message by @ko3n1g :: PR: #13541
  • [Audio] TransformerUNet: predictive model support added by @nasretdinovr :: PR: #13470
  • Test Hyena mixer CP equivalency by @farhadrgh :: PR: #13330
  • use null tokenizer by @malay-nagda :: PR: #13480
  • ci: Remove optional marker by @ko3n1g :: PR: #13469
  • Update extra_requires and requirements by @thomasdhc :: PR: #13359
  • Fix default config for LlamaNemotron Ultra by @suiyoubi :: PR: #13542
  • [audio] Improve test coverage for audio losses by @anteju :: PR: #13309
  • deepseek finetuning callback error change by @SDcodehub :: PR: #13483
  • ci(fix): Add __init__ to selective-triggering by @ko3n1g :: PR: #13577
  • nsys profile filename ranks info by @malay-nagda :: PR: #13576
  • chore: Update setup.py by @ko3n1g :: PR: #13566
  • Fix Llama importer by @suiyoubi :: PR: #13583
  • [automodel] fix --mbs/gbs dtype and chat-template by @akoumpa :: PR: #13602
  • Reconfigure 'limit_<train|val>_batches' by @maanug-nv :: PR: #13523
  • ci: Optional speech tests by @ko3n1g :: PR: #13606
  • [Automodel] Fix CP device_mesh issue, use PTL distsampler by @BoxiangW :: PR: #13473
  • [automodel] fix log message by @akoumpa :: PR: #13612
  • Tests for evaluation with NVIDIA Evals Factory by @chtruong814 :: PR: #13627
  • Fix ptl import in notebooks by @maanug-nv :: PR: #13608
  • [automodel] dist.abort -> dist.destroy_process_group by @akoumpa :: PR: #13578
  • Skip eval unit test by @chtruong814 :: PR: #13635
  • Fix image_processor config in Energon path by @AtsunoriFujita :: PR: #13618
  • Add Gemma3 VL model by @xiangxu-google :: PR: #13536
  • Set L2_NeMo_2_EVAL as optional by @chtruong814 :: PR: #13644
  • Update install to use pip install by @thomasdhc :: PR: #13605
  • Multi node settings for evaluation nemo-run script by @athitten :: PR: #13568
  • [Llama4] Fix the missing args in the recipe by @gdengk :: PR: #13649
  • Bump nvidia-modelopt to 0.29.0 by @AAnoosheh :: PR: #13599
  • Update README.md for 25.04 release by @snowmanwwg :: PR: #13654
  • [automodel] consolidate sft peft scripts by @akoumpa :: PR: #13634
  • Qwen3 by @cuichenx :: PR: #13554
  • Set env variables for eval tests by @marta-sd :: PR: #13658
  • build: multimodal-only by @ko3n1g :: PR: #13665
  • [Audio] TransformerUNet: predictive model tests added by @nasretdinovr :: PR: #13648
  • [automodel] consolidate vllm scripts by @akoumpa :: PR: #13670
  • build: Pin transformers by @ko3n1g :: PR: #13675
  • ci: Enable codecov checks by @ko3n1g :: PR: #13497
  • ci: Add init-file-checker by @ko3n1g :: PR: #13684
  • Add use_sharp and use user buffer registration args in perf scripts by @youngeunkwon0405 :: PR: #13521
  • Remove is-optional marker for L2_NeMo_2_EVAL by @marta-sd :: PR: #13669
  • gpu type and #devices CLI args by @malay-nagda :: PR: #13620
  • perf scripts updates by @malay-nagda :: PR: #13456
  • Use audio codec without discriminators in SpeechLM2 tests by @pzelasko :: PR: #13711
  • Update changelog for r2.3.1 by @github-actions[bot] :: PR: #13719
  • Recipe default value fix for Llama4 by @suiyoubi :: PR: #13696
  • build: Lift numba by @ko3n1g :: PR: #13735
  • New key override for timestamps by @melllinia :: PR: #13743
  • Fixed Mllama Energon config by @AtsunoriFujita :: PR: #13574
  • Update convert_to_tarred_audio_dataset.py by @ssh-meister :: PR: #13755
  • Enable dropout recompute in LoRA by @michal2409 :: PR: #13745
  • Address VDR feedback for NeMo FW evaluations by @athitten :: PR: #13701
  • remove blocks unused to increase coverage by @romanbrickie :: PR: #13511
  • Fix Flux Recipe for FSDP/DDP by @suiyoubi :: PR: #13715
  • Try soften protobuf version requirement by @pablo-garay :: PR: #13747
  • Flux FP8 recipe by @Victor49152 :: PR: #13584
  • Gemma3 Fix and Tests by @suiyoubi :: PR: #13661
  • Disable local gradient checker in performance scripts by @erhoo82 :: PR: #13768
  • [Audio] Tests: training for mask, pred and SB models by @nasretdinovr :: PR: #13736
  • Refactor MSC integration in exp manager by @shunjiad :: PR: #13626
  • [fix] vpp error in Gemma3 by @ZhiyuLi-Nvidia :: PR: #13784
  • ci: Ensure approval queue fetches all CICD workflows using pagnation by @chtruong814 :: PR: #13798
  • ci: make_request in approval test queue appends next url for status checks only by @chtruong814 :: PR: #13802
  • Remove guard for masking tests and improve coverage by @anteju :: PR: #13787
  • fix: After mcore bump by @ko3n1g :: PR: #13781
  • Fix Gemma3VL training bugs by @sharanmayank :: PR: #13766
  • [NeMo 2.0] Remove the restriction of load_model_state_dict for cfsdp by @shjwudp :: PR: #13512
  • Add option to construct Llama model with Transformer Engine op fuser by @timmoon10 :: PR: #13776
  • [Evaluation] Add support for simple-evals and tasks that require logprobs by @marta-sd :: PR: #13647
  • remove stale section by @akoumpa :: PR: #13759
  • fix moe_router_pre_softmax for Mixtral by @akoumpa :: PR: #13678
  • fix: improve sequence length handling to fix nan in loss when turning on cudagraph by @katec846 :: PR: #13779
  • Gemma3 Energon Dataset by @suiyoubi :: PR: #13813
  • Rectify BLEU evaluation by @ankitapasad :: PR: #13762
  • ci: Moved workflows by @ko3n1g :: PR: #13828
  • ci: Moved templates by @ko3n1g :: PR: #13830
  • [Build] Bump bitsandbytes dependency to 0.45.5 (ubuntu 22.04 compatibility) by @pramodk :: PR: #13789
  • update for PYTORCH_CUDA_ALLOC_CONF env var by @malay-nagda :: PR: #13837
  • [Llama4] Enable VLM Dec cudagraph by @gdengk :: PR: #13767
  • Support MSC URL in LLM checkpointing by @shunjiad :: PR: #13805
  • additional metrics by @dimapihtar :: PR: #13754
  • Expand modelopt version range by @chtruong814 :: PR: #13850
  • Alit/nmh4b by @JRD971000 :: PR: #13481
  • [Tutorial] Train your own reasoning model in 48 hours on a single GPU by @Maghoumi :: PR: #13853
  • Enabled C2C-PCie bridge through NCCL by @sanandaraj5597 :: PR: #13621
  • Added safe loading of models by @nithinraok :: PR: #13607
  • Add NemotronH Performance Script by @guyueh1 :: PR: #13528
  • Hyena SE/MR B2B Kernel integration by @farhadrgh :: PR: #13518
  • chore: Destroy buildcache by @ko3n1g :: PR: #13869
  • tests: Fix Qwen test by @ko3n1g :: PR: #13888
  • fix: improve error handling in is_multistorageclient_url by @shunjiad :: PR: #13885
  • feat(eval): adds benchmark adapters that allow specisal reasoning models by @agronskiy :: PR: #13709
  • perf scripts 25.07 refactor by @malay-nagda :: PR: #13875
  • Fix E5 and LlamaEmbedding Conversion by @suiyoubi :: PR: #13890
  • Bug fix for NCCL vars by @sanandaraj5597 :: PR: #13908
  • Reranker Model Support by @suiyoubi :: PR: #13876
  • numa cmd in bash by @malay-nagda :: PR: #13914
  • Fix BERT issue with PP by @suiyoubi :: PR: #13916
  • [Llama4] Fix Vp_stage to enable VP for VLM llama4 by @gdengk :: PR: #13873
  • Enable NVTX profiling in MCore by @minitu :: PR: #13820
  • [Qwen3-MoE] Add Qwen3 MoE perf recipe for 30b and 235b by @gdengk :: PR: #13895
  • lazy import bnbconfig by @akoumpa :: PR: #13919
  • Set TRANSFORMERS_OFFLINE=1 and HF_HUB_OFFLINE=1 in CI tests by @chtruong814 :: PR: #13932
  • [peft] align adapter output shape with wrapped module output shape by @guyueh1 :: PR: #13922
  • [automodel] move only lora adapters to cpu by @akoumpa :: PR: #13931
  • Fix vp_stage not found when fsdp by @gautham-kollu :: PR: #13817
  • Fix single optional import if ModelOpt not installed by @AAnoosheh :: PR: #13923
  • Revert "Set TRANSFORMERS_OFFLINE=1 and HF_HUB_OFFLINE=1 in CI tests by @chtruong814 :: PR: #13938
  • Enable LoRA for TELinear layers by @cuichenx :: PR: #13929
  • Freeze tags in in r2.4.0 by @github-actions[bot] :: PR: #13945
  • Cherry pick Use jiwer less than 4.0.0 (13997) into r2.4.0 by @ko3n1g :: PR: #13998
  • Cherry pick Remove container license reference (14010) into r2.4.0 by @ko3n1g :: PR: #14017
  • move classes to module to use target feature by @nithinraok :: PR: #14023
  • Cherry pick bf16 grads for bf16 jobs (14016) into r2.4.0 by @ko3n1g :: PR: #14020
  • Cherry pick Remove nemo1 stable diffusion test (14018) into r2.4.0 by @ko3n1g :: PR: #14019
  • Version bump to 2.4.0rc1.dev0 by @github-actions[bot] :: PR: #14047
  • Cherry pick Fix Loading Custom Quantization Config (13934) into r2.4.0 by @ko3n1g :: PR: #13950
  • Cherry pick [automodel] fix sft notebook (14002) into r2.4.0 by @ko3n1g :: PR: #14003
  • Cherry pick Use average reduction in FSDP grad reduce-scatter when grad dtype is … (13981) into r2.4.0 by @ko3n1g :: PR: #14004
  • Cherry pick GPU memory logging update (13982) into r2.4.0 by @ko3n1g :: PR: #14021
  • Cherry pick Remove kaldiio (14006) into r2.4.0 by @ko3n1g :: PR: #14032
  • Cherry pick Set L2_NeMo_2_Flux_Import_Test to be optional (14056) into r2.4.0 by @ko3n1g :: PR: #14058
  • Cherry pick Bump protobuf to 5.29.5 (14045) into r2.4.0 by @ko3n1g :: PR: #14060
  • Cherry pick Detect hardware before enabling DeepEP (14022) into r2.4.0 by @ko3n1g :: PR: #14068
  • Version bump to 2.4.0rc2.dev0 by @github-actions[bot] :: PR: #14115
  • Cherry pick Fix SFT Dataset Bug (13918) into r2.4.0 by @ko3n1g :: PR: #14074
  • Cherry pick Align adapter shape with base linear output shape (14009) into r2.4.0 by @ko3n1g :: PR: #14083
  • Cherry pick [MoE] Update the fp8 precision interface for llama4 and qwen3 (14094) into r2.4.0 by @ko3n1g :: PR: #14104
  • Cherry pick [Llama4] Tokenizer naming update (14114) into r2.4.0 by @ko3n1g :: PR: #14123
  • Cherry pick Bump to pytorch 25.05 container along with TE update (13899) into r2.4.0 by @ko3n1g :: PR: #14145
  • Cherry pick Perf scripts updates (14005) into r2.4.0 by @ko3n1g :: PR: #14129
  • Cherry pick Remove unstructured (14070) into r2.4.0 by @ko3n1g :: PR: #14147
  • Version bump to 2.4.0rc3.dev0 by @github-actions[bot] :: PR: #14165
  • Cherry pick Add checkpoint info for NIM Embedding Expor Tutorial (14177) into r2.4.0 by @ko3n1g :: PR: #14178
  • Cherry pick Fix dsv3 script (14007) into r2.4.0 by @ko3n1g :: PR: #14182
  • Cherry pick 405b perf script updates (14176) into r2.4.0 by @chtruong814 :: PR: #14195
  • Cherry pick Fix nemotronh flops calculator (14161) into r2.4.0 by @chtruong814 :: PR: #14202
  • Cherry pick Add option to disable gloo process groups (#14156) into r2.4.0 by @chtruong814 :: PR: #14220
  • Cherry pick Remove g2p_en (14204) into r2.4.0 by @chtruong814 :: PR: #14212
  • Cherry pick diffusion mock data null args (14173) into r2.4.0 by @chtruong814 :: PR: #14217
  • Cherry pick perf-scripts: Change b200 config to EP8 (14207) into r2.4.0 by @chtruong814 :: PR: #14223
  • Cherry pick Change RerankerSpecter Dataset question key (14200) into r2.4.0 by @chtruong814 :: PR: #14224
  • Cherry pick Fix the forward when final_loss_mask is not present (14201) into r2.4.0 by @chtruong814 :: PR: #14225
  • Cherry pick Fix Llama Nemotron Nano Importer (14222) into r2.4.0 by @chtruong814 :: PR: #14226
  • Cherry pick [automodel] fix loss_mask pad token (14150) into r2.4.0 by @chtruong814 :: PR: #14227
  • [Performance script] FSDP-UBR related recipe update (#14208) by @youngeunkwon0405 :: PR: #14233
  • Fix for MCore dist ckpt loading #14229 by @stevehuang52 :: PR: #14239
  • cherry-pick fix eval beam search ctc script by @lilithgrigoryan :: PR: #14242
  • Cherry pick Moving export security fixes over here (14254) into r2.4.0 by @chtruong814 :: PR: #14261
  • Cherry pick Confidence fix for tutorial (14250) into r2.4.0 by @chtruong814 :: PR: #14266
  • Cherry pick added new models to documentation (14264) into r2.4.0 by @chtruong814 :: PR: #14278
  • Cherry-pick FIx Flux & Flux_Controlnet initialization issue (#14263) into r2.4.0 by @chtruong814 :: PR: #14273
  • Cherry pick update ffmpeg install (14237) into r2.4.0 by @chtruong814 :: PR: #14279

NVIDIA Neural Modules 2.3.2

This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at psirt@nvidia.com

NVIDIA Neural Modules 2.3.1

Highlights

  • Collections
    • LLM
      • Llama 4: Fixed an accuracy issue caused by MoE probability normalization. Improved pre-train and fine-tune performance.
  • Export & Deploy
    • Updated vLLMExporter to use vLLM V1 to address a security vulnerability.
  • AutoModel
    • Improved chat-template handling.
  • Fault Tolerance
    • Local checkpointing: Fixed support for auto-inserted metric names for resuming from local checkpoints.

Detailed Changelogs:

Export

Changelog
  • Cherry-pick Update vLLMExporter to use vLLM V1 (#13498) into r2.3.0 by @chtruong814 :: PR: #13631

Uncategorized:

Changelog
  • Bump to 2.3.1 by @chtruong814 :: PR: #13507
  • Cherry pick Use explicitly cached canary-1b-flash in CI tests (13237) into r2.3.0 by @ko3n1g :: PR: #13508
  • Cherry pick [automodel] bump liger-kernel to 0.5.8 + fallback (13260) into r2.3.0 by @ko3n1g :: PR: #13308
  • Cherry-pick Add recipe and ci scripts for qwen2vl to r2.3.0 by @romanbrickie :: PR: #13336
  • Cherry pick Fix skipme handling (13244) into r2.3.0 by @ko3n1g :: PR: #13376
  • Cherry pick Allow fp8 param gather when using FSDP (13267) into r2.3.0 by @ko3n1g :: PR: #13383
  • Cherry pick Handle boolean args for performance scripts and log received config (13291) into r2.3.0 by @ko3n1g :: PR: #13416
  • Cherry pick new perf configs (13110) into r2.3.0 by @ko3n1g :: PR: #13431
  • Cherry pick Adding additional unit tests for the deploy module (13411) into r2.3.0 by @ko3n1g :: PR: #13449
  • Cherry pick Adding more export tests (13410) into r2.3.0 by @ko3n1g :: PR: #13450
  • Cherry pick [automodel] add FirstRankPerNode (13373) into r2.3.0 by @ko3n1g :: PR: #13559
  • Cherry pick [automodel] deprecate global_batch_size dataset argument (13137) into r2.3.0 by @ko3n1g :: PR: #13560
  • Cherry-pick [automodel] fallback FP8 + LCE -> FP8 + CE (#13349) into r2.3.0 by @chtruong814 :: PR: #13561
  • Cherry pick [automodel] add find_unused_parameters=True for DDP (13366) into r2.3.0 by @ko3n1g :: PR: #13601
  • Cherry pick Add CI test for local checkpointing (#13012) into r2.3.0 by @ananthsub :: PR: #13472
  • Cherry pick [automodel] fix --mbs/gbs dtype and chat-template (13598) into r2.3.0 by @akoumpa :: PR: #13613
  • Cherry-pick Update t5.py (#13082) to r2.3.0 and bump mcore to f98b1a0 by @chtruong814 :: PR: #13642
  • [Automodel] Fix CP device_mesh issue, use PTL distsampler (#13473) by @akoumpa :: PR: #13636
  • [Llama4] Fix the recipe bug - cherrypick #13649 by @gdengk :: PR: #13650
  • build: Pin transformers (#13675) by @ko3n1g :: PR: #13692

NVIDIA Neural Modules 2.3.0

Highlights

  • Export & Deploy
    • NeMo 2.0 export path for NIM
    • ONNX and TensorRT Export for NIM Embedding Container
    • In-framework deployment for HF Models (see the import sketch after this list)
    • TRT-LLM deployment for HF Models in NeMo Framework
  • Evaluation
    • Integrate nvidia-lm-eval to NeMo FW for evaluations with OpenAI API compatible in-framework deployment
  • AutoModel
      • VLM AutoModelForImageTextToText
    • FP8 for AutoModel
    • Support CP with FSDP2
    • Support TP with FSDP2
    • Performance Optimization
      • add support for cut cross entropy & liger kernel
      • Gradient Checkpointing
  • Fault Tolerance
    • Integrate NVRx v0.3 Local checkpointing
  • Collections
    • LLM
      • Llama4
      • Llama Nemotron Ultra
      • Llama Nemotron Super
      • Llama Nemotron Nano
      • Nemotron-h/5
      • DeepSeek V3 Pretraining
      • Evo2
      • Qwen 2.5
      • LoRA for Qwen3-32B and Qwen3-30B-A3B
    • MultiModal
      • FLUX
      • Gemma 3
      • Qwen2-VL
    • ASR
      • NeMo Run support for ASR training
      • N-Gram LM on GPU for AED
      • N-Gram LM on GPU + Transducer greedy decoding (RNN-T, TDT)
      • Timestamp support for timestamp-capable AED models
      • Migrate SpeechLM to NeMo 2.0
      • Canary-1.1
      • Replace ClassificationModels class with LabelModels
  • Performance
    • Functional MXFP8 support for (G)B200
    • Current scaling recipe with TP communication overlap and FP8 param gathers
    • Custom FSDP support that fully utilizes GB200 NVL72
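
The Export & Deploy highlights above all start from converting a Hugging Face checkpoint into NeMo 2.0 format. A minimal sketch of that step is below, using llm.import_ckpt (whose overwrite behavior is touched by PR #12693 in this section); the model class, config, and source repo are illustrative assumptions, not prescribed values.

```python
# Minimal sketch: convert a Hugging Face checkpoint into NeMo 2.0 format with
# llm.import_ckpt. The Llama model/config pairing and the HF repo id below
# are illustrative assumptions.
from nemo.collections import llm

llm.import_ckpt(
    model=llm.LlamaModel(config=llm.Llama2Config7B()),
    source="hf://meta-llama/Llama-2-7b-hf",
    overwrite=False,  # per PR #12693, errors out if the target already exists
)
```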

Detailed Changelogs:

ASR

Changelog
  • Added model config params for Canary-1B-Flash, Canary-180M-Flash models by @KunalDhawan :: PR: #12588
  • Canary tutorial by @ankitapasad :: PR: #12613
  • Canary tutorial fix timestamp by @ankitapasad :: PR: #12677
  • revert config by @nithinraok :: PR: #12689
  • canary longform inference script with timestamps option by @krishnacpuvvada :: PR: #12653
  • Fix default timestamps value for Hybrid ASR models by @artbataev :: PR: #12681
  • Fix k2 installation with PyTorch 2.6.0 by @artbataev :: PR: #12686
  • Improve time and RTFx report for ASR by @artbataev :: PR: #12680
  • Modify train args by @ankitapasad :: PR: #12700
  • Fix asr doc warnings by @nithinraok :: PR: #12720
  • Rename FastNGramLM -> NGramGPULanguageModel by @artbataev :: PR: #12755
  • transcribe fix for new hypotheses by @nune-tadevosyan :: PR: #12801
  • Fix timestamps when cuda graphs enabled by @monica-sekoyan :: PR: #12808
  • update streaming conformer by @stevehuang52 :: PR: #12846
  • AED Decoding with N-Gram LM by @artbataev :: PR: #12730
  • update notebook by @nithinraok :: PR: #13088
  • bugfix ASR_Context_Biasing.ipynb by @lilithgrigoryan :: PR: #13109
  • Change branch for installation from main to r2.3.0 by @ankitapasad :: PR: #13266

TTS

Changelog
  • Add Magpie-TTS and Updates NeMo Audio Codecs by @blisc :: PR: #12606
  • fix bug from prior commit (#13264) by @blisc :: PR: #13328

NLP / NMT

Changelog
  • Remove old peft docs by @cuichenx :: PR: #12675
  • Add code coverage for llm gpt models conversion tests by @suiyoubi :: PR: #12665
  • Make BERT TransformerBlockWithPostLNSupport accept more inputs from Mcore by @suiyoubi :: PR: #12685
  • remove gifs from documentation by @dimapihtar :: PR: #12732
  • Rename FastNGramLM -> NGramGPULanguageModel by @artbataev :: PR: #12755
  • fix NeMo documentation by @dimapihtar :: PR: #12754
  • GPT Model/Data/Recipe Unit Test by @suiyoubi :: PR: #12757
  • ci: Exclude nlp, mm, vision collections by @ko3n1g :: PR: #12816
  • Add vocab size as attr to GPT and T5 Configs, use file name based logger in llm.gpt.data by @hemildesai :: PR: #12862
  • Fix transformer layer api with megatron cbc89b3 by @yaoyu-33 :: PR: #12885

Text Normalization / Inverse Text Normalization

Changelog
  • Rename FastNGramLM -> NGramGPULanguageModel by @artbataev :: PR: #12755

Export

Changelog
  • GHA Conversion Test and Importer/Exporter Refactor by @suiyoubi :: PR: #12597
  • Fix Llama Embedding Model Exporting keys by @suiyoubi :: PR: #12691
  • build: Add trtllm by @ko3n1g :: PR: #12672
  • Fix trt-llm install by @chtruong814 :: PR: #12827
  • Update LLaVA's next HF exporter to load ViT checkpoint from YAML by @eagle705 :: PR: #12841
  • Support huggingface export to tensorrtllm by @pthombre :: PR: #12889
  • Adds a built stage for the trt-llm wheel to reduce the overall test image size by @chtruong814 :: PR: #12883

Uncategorized:

Changelog
  • Update changelog-build.yml by @ko3n1g :: PR: #12584
  • Update changelog for r2.2.0 by @github-actions[bot] :: PR: #12585
  • Add comments for requirements by @thomasdhc :: PR: #12603
  • [automodel] FSDP2Strategy: move to device if using a single-device by @akoumpa :: PR: #12593
  • build: Remove numba pin by @ko3n1g :: PR: #12604
  • docs: Update installation guides by @ko3n1g :: PR: #12596
  • Change Llama Scaling Factor type to Float by @suiyoubi :: PR: #12616
  • ci: Test multiple python versions by @ko3n1g :: PR: #12619
  • ci: Disable reformat by @ko3n1g :: PR: #12620
  • Updating ModelOpt to 0.25.0 by @janekl :: PR: #12633
  • [automodel] add additional hf_dataset tests by @akoumpa :: PR: #12646
  • [automodel] add jit_transform tests by @akoumpa :: PR: #12645
  • [automodel] init eos_token_id inside data module by @yuanzhedong :: PR: #12610
  • [automodel] grad ckpt by @akoumpa :: PR: #12644
  • bugfix(llm/LLaMa) - dropout_position can never be equal to extended string by @soluwalana :: PR: #12649
  • Fix inference pipeline quality issue by @Victor49152 :: PR: #12639
  • [automodel] switch to direct=True to propagate return codes in nemorun by @akoumpa :: PR: #12651
  • add Auto Conf support for bert, t5, qwen, starcoder models by @dimapihtar :: PR: #12601
  • ci: Upload coverage by @ko3n1g :: PR: #12668
  • ci: Re-enable changed-files action by @ko3n1g :: PR: #12683
  • build: Pin sox by @ko3n1g :: PR: #12701
  • add neva quantization by @linnanwang :: PR: #12698
  • Clip coverage by @abhinavg4 :: PR: #12696
  • GHA CI test: Remove unnecessary directive by @pablo-garay :: PR: #12714
  • minor perf fixes by @malay-nagda :: PR: #12656
  • Add DeepSeek V2 Lite into llm init.py by @suiyoubi :: PR: #12664
  • Add Llama-Nemotron Nano and 70B models by @suiyoubi :: PR: #12712
  • Save batch norm running stats in PEFT checkpoints by @cuichenx :: PR: #12666
  • Fix document Readme under nemo to add more information by @yaoyu-33 :: PR: #12699
  • Fix ub_overlap_ag by @cuichenx :: PR: #12721
  • Toggle fast tokenizer if error occurs by @cuichenx :: PR: #12722
  • Update README.md for blackwell and AutoModel by @snowmanwwg :: PR: #12612
  • Raise error on import_ckpt with overwrite=False plus README for checkpoint_converters by @janekl :: PR: #12693
  • [automodel] fix validation_step by @soluwalana :: PR: #12659
  • [automodel] vlm tests by @akoumpa :: PR: #12716
  • Auto Configurator code coverage by @dimapihtar :: PR: #12694
  • [automodel] fix automodel benchmark script by @yuanzhedong :: PR: #12605
  • Remove unnecessary directives by @pablo-garay :: PR: #12743
  • Add recipe tests for coverage by @cuichenx :: PR: #12737
  • Add Qwen2.5 in NeMo2 by @suiyoubi :: PR: #12731
  • add fallback_module to safe_import_from by @akoumpa :: PR: #12726
  • Update quantization scripts & relax modelopt requirement specifier by @janekl :: PR: #12709
  • Import guard fasttext by @thomasdhc :: PR: #12758
  • [automodel] chunked cross entropy by @akoumpa :: PR: #12752
  • Add fsdp automodel test by @BoxiangW :: PR: #12718
  • [automodel] if peft move only adapters to cpu by @akoumpa :: PR: #12735
  • [automodel] update hf mockdataset by @akoumpa :: PR: #12643
  • [automodel] remove unused cell in multinode notebook by @yuanzhedong :: PR: #12624
  • Yash/llava next coverage by @yashaswikarnati :: PR: #12745
  • Tidy code: remove unneeded statements/lines by @pablo-garay :: PR: #12771
  • Pass tensor instead of raw number in _mock_loss_function in PTQ by @janekl :: PR: #12769
  • ci: Run on nightly schedule by @ko3n1g :: PR: #12775
  • Add logs for checkpoint saving start and finalization by @lepan-google :: PR: #12697
  • Alit/test coverage by @JRD971000 :: PR: #12762
  • Fix loss mask with packed sequence by @ashors1 :: PR: #12642
  • Add pruning recipe by @kevalmorabia97 :: PR: #12602
  • Update qwen2-v1 to use NeMo quick_gelu by @thomasdhc :: PR: #12787
  • [doc] Fixes for audio doc warnings by @anteju :: PR: #12736
  • ci: Measure multiprocessing by @ko3n1g :: PR: #12778
  • ci: Fix flaky LLM tests by @ko3n1g :: PR: #12807
  • Add BERT/Qwen2.5 Unit test and Refactor all GHA Conversion Tests by @suiyoubi :: PR: #12785
  • Fix TransformerBlock cuda_graphs compatibility with MCore by @buptzyb :: PR: #12779
  • ci: Remove --branch by @ko3n1g :: PR: #12809
  • ci: Move scripts fully down to files by @ko3n1g :: PR: #12802
  • add __init__.py to make this a package by @akoumpa :: PR: #12814
  • Update changelog for r2.2.1 by @github-actions[bot] :: PR: #12818
  • add finetune support for Auto Configurator by @dimapihtar :: PR: #12770
  • [automodel] add cpu:gloo to backend by @akoumpa :: PR: #12832
  • add missing call to _apply_liger_kernel_to_instance by @akoumpa :: PR: #12806
  • Prune docker images in GHA older than 8hrs by @chtruong814 :: PR: #12838
  • [audio] Adding tests for predictive models by @anteju :: PR: #12823
  • Update resiliency example notebook readme and add links to the brev launchable by @ShriyaRishab :: PR: #12843
  • [automodel] qlora peft by @yzhang123 :: PR: #12817
  • ci: Increase prune time by @ko3n1g :: PR: #12860
  • Update base container in Dockerfile.speech by @artbataev :: PR: #12859
  • Fix qwen2.5 1.5b configuration inheritance bug by @Aprilistic :: PR: #12852
  • Update modelopt upperbound to 0.27 by @thomasdhc :: PR: #12788
  • Non-blocking checkpoint cleanup failure by @jstjohn :: PR: #12804
  • Improve evo2 dataset test and testability by @jstjohn :: PR: #12857
  • Expand test coverage for neva / mllama by @yaoyu-33 :: PR: #12715
  • Weekly bump by @ko3n1g :: PR: #12891
  • ci: Optional_L2_NeMo_2_SSM_Finetuning by @ko3n1g :: PR: #12893
  • docs: Update guide to PEP508 by @ko3n1g :: PR: #12890
  • Replace lm-eval with nvidia-lm-eval by @chtruong814 :: PR: #12888
  • Handle CUDA_DEVICE_MAX_CONNECTIONS before job launch by @guyueh1 :: PR: #12833
  • add nemotron5 by @JRD971000 :: PR: #12660
  • Bump vllm 0.8.2 by @Laplasjan107 :: PR: #12753
  • DeepseekV3 SFT finetuning perf config by @gdengk :: PR: #12829
  • add apply_chat_template method to TokenizerSpec + AutoTokenizer by @akoumpa :: PR: #12878
  • add accelerate to dependencies by @akoumpa :: PR: #12871
  • [automodel] Add FSDPv2-compatible context parallelism support. by @cspades :: PR: #12821
  • [fault tolerance] Add local checkpointing support by @ananthsub :: PR: #12839
  • ci: Bump release-freeze by @ko3n1g :: PR: #12914
  • ci: Use PAT for code-freeze by @ko3n1g :: PR: #12915
  • ci: Use correct environment by @ko3n1g :: PR: #12917
  • Freeze tags in r2.3.0 by @github-actions[bot] :: PR: #12919
  • chore: Bump version to 2.3.0.rc2 by @chtruong814 :: PR: #12920
  • Version bump to 2.3.0rc3.dev0 by @github-actions[bot] :: PR: #12921
  • Cherry pick [automodel] Add linear ce loss support (12825) into r2.3.0 by @ko3n1g :: PR: #12922
  • Cherry pick DeepSeek V3 Multi Token Prediction (12550) into r2.3.0 by @ko3n1g :: PR: #12928
  • Cherry pick Set L2_NeMo_2_EVAL test to be optional (12949) into r2.3.0 by @ko3n1g :: PR: #12951
  • Cherry pick GB200 LLM performance scripts tuning (12791) into r2.3.0 by @ko3n1g :: PR: #12923
  • Cherry pick Allow configuration of PP communication backend to UCC in nemo2 (11755) into r2.3.0 by @ko3n1g :: PR: #12946
  • Cherry pick guard bitsandbytes based on cuda availability (12937) into r2.3.0 by @ko3n1g :: PR: #12958
  • Cherry pick Hugging Face model deployment support (12628) into r2.3.0 by @ko3n1g :: PR: #12962
  • Cherry pick fix macro-acc for pair-audio eval (12908) into r2.3.0 by @ko3n1g :: PR: #12963
  • Cherry pick Add energon dataset support for Qwen2VL (12831) into r2.3.0 by @ko3n1g :: PR: #12966
  • Cherry pick Make TETransformerLayerAutocast Support Cuda Graph (12075) into r2.3.0 by @ko3n1g :: PR: #12967
  • Cherry pick Use nvidia-lm-eval for evaluation (12902) into r2.3.0 by @ko3n1g :: PR: #12971
  • Cherry pick [NeMo 2.0] Interface for using MXFP8 and FP8 current scaling recipes (12503) into r2.3.0 by @ko3n1g :: PR: #12974
  • Cherry pick Fix trtllm and lightning conflict (12943) into r2.3.0 by @ko3n1g :: PR: #12981
  • Cherry pick Update v3 finetuning recipe (12950) and Specify PP first/last in strategy (12992) into r2.3.0 by @ko3n1g :: PR: #12984
  • Cherry pick Resolve an issue in custom megatron FSDP config setting (12948) into r2.3.0 by @ko3n1g :: PR: #12987
  • Cherry pick Remove getattr_proxy to avoid problematic edge cases (12176) into r2.3.0 by @ko3n1g :: PR: #12990
  • Cherry pick Enable async requests for in-fw deployment with OAI compatible server (12980) into r2.3.0 by @ko3n1g :: PR: #12994
  • Cherry pick initialize model with metadata (12496) into r2.3.0 by @ko3n1g :: PR: #12997
  • Cherry pick Bugfix for logits support for hf deployment (12965) into r2.3.0 by @ko3n1g :: PR: #13001
  • Cherry pick Update nvidia-resiliency-ext to be >= 0.3.0 (12925) into r2.3.0 by @ko3n1g :: PR: #13000
  • Cherry-pick Fix params_dtype for distillation and GPT HF Exporter head_dim for pruning to r2.3.0 by @kevalmorabia97 :: PR: #13002
  • Install nvidia-pytriton on arm (#13011) by @thomasdhc :: PR: #13013
  • Version bump to 2.3.0rc4.dev0 by @github-actions[bot] :: PR: #13041
  • Cherry pick Alit/nemotron h (12942) into r2.3.0 by @ko3n1g :: PR: #13007
  • Cherry pick [Automodel] Add TP/SP support with default llama-like sharding plan (12796) into r2.3.0 by @ko3n1g :: PR: #13017
  • Cherry pick Add initial docs broken link check (12977) into r2.3.0 by @ko3n1g :: PR: #13045
  • Cherry pick Fix MoE Init to not use Bias in test_strategy_lib.py (13009) into r2.3.0 by @ko3n1g :: PR: #13014
  • Cherry pick cleaner tflops log name (13005) into r2.3.0 by @ko3n1g :: PR: #13024
  • Cherry pick Improve t5 test coverage (12803) into r2.3.0 by @ko3n1g :: PR: #13025
  • Cherry pick put the warning in the right place (12909) into r2.3.0 by @ko3n1g :: PR: #13035
  • Cherry pick Temporary disable CUDA graphs in DDP mode for transducer decoding (12907) into r2.3.0 by @ko3n1g :: PR: #13036
  • Cherry pick [automodel] peft fix vlm (13010) into r2.3.0 by @ko3n1g :: PR: #13037
  • Cherry pick Only run the docs link check on the container (13068) into r2.3.0 by @ko3n1g :: PR: #13070
  • Cherry pick Add fp8 recipe option to perf script (13032) into r2.3.0 by @ko3n1g :: PR: #13055
  • Cherry pick Unified ptq export (12786) into r2.3.0 by @ko3n1g :: PR: #13062
  • Cherry pick Fix VP list index out of range from Custom FSDP (13021) into r2.3.0 by @ko3n1g :: PR: #13077
  • Cherry pick Add logging to cancel out PTL's warning about dataloader not being resumable (13072) into r2.3.0 by @ko3n1g :: PR: #13100
  • Cherry pick Fix long sequence generation after new arg introduced in mcore engine (13049) into r2.3.0 by @ko3n1g :: PR: #13104
  • Cherry pick Support Mamba models quantization (12631) into r2.3.0 by @ko3n1g :: PR: #13105
  • Cherry pick Add track_io to user buffer configs (13071) into r2.3.0 by @ko3n1g :: PR: #13111
  • ci: Onboard 8-GPU runner (#13115) by @ko3n1g :: PR: #13121
  • Cherry pick Add fine-tuning dataset function for FineWeb-Edu and update automodel… (13027) into r2.3.0 by @ko3n1g :: PR: #13118
  • Cherry pick Re-add sox to asr requirements (13092) into r2.3.0 by @ko3n1g :: PR: #13120
  • Cherry pick Update Mllama cross attn signature to match update MCore (13048) into r2.3.0 by @ko3n1g :: PR: #13122
  • Cherry pick Fix Exporter for baichuan and chatglm (13095) into r2.3.0 by @ko3n1g :: PR: #13126
  • ci: Faster builds (#13142) by @ko3n1g :: PR: #13144
  • Version bump to 2.3.0rc5.dev0 by @github-actions[bot] :: PR: #13146
  • ci: Fix mcore install in test container (#13152) by @ko3n1g :: PR: #13159
  • ci: Fix race-condition of container setup (#13162) by @ko3n1g :: PR: #13163
  • Cherry pick Guard decord and triton import (12861) into r2.3.0 by @ko3n1g :: PR: #13132
  • Cherry pick Bump TE version and apply patch (13087) into r2.3.0 by @ko3n1g :: PR: #13139
  • Cherry pick Update Llama-Minitron pruning-distillation notebooks from NeMo1 to NeMo2 + NeMoRun (12968) into r2.3.0 by @ko3n1g :: PR: #13141
  • Cherry pick Export and Deploy Tests (13076) into r2.3.0 by @ko3n1g :: PR: #13150
  • Cherry pick ub fp8 h100 fixes (13131) into r2.3.0 by @ko3n1g :: PR: #13153
  • Cherry pick Fix Transducer Decoding with CUDA Graphs in DDP with Mixed Precision (12938) into r2.3.0 by @ko3n1g :: PR: #13154
  • Cherry pick build: Pin modelopt (13029) into r2.3.0 by @chtruong814 :: PR: #13170
  • Cherry pick add fixes for nemotron-h (13073) into r2.3.0 by @JRD971000 :: PR: #13165
  • Add dsv3 pretrain script, support flops calculation (previous #12947) by @guyueh1 :: PR: #13186
  • ci: Allow running CI on weekly bump branch by @ko3n1g :: PR: #13233
  • Cherry pick Add Llama Nemotron Super/Ultra models (13044) into r2.3.0 by @ko3n1g :: PR: #13212
  • Cherry pick Add Blockwise FP8 to PTQ & EP to modelopt resume (12670) into r2.3.0 by @ko3n1g :: PR: #13239
  • Cherry pick [OAI Serving] Validate greedy generation args (redo) (13216) into r2.3.0 by @ko3n1g :: PR: #13242
  • Cherry pick drop sample_alpha in speechlm (13208) into r2.3.0 by @ko3n1g :: PR: #13246
  • Cherry pick [Eval bugfix] Move global eval-related imports inside the evaluate function (13166) into r2.3.0 by @ko3n1g :: PR: #13249
  • Cherry pick [Eval bugfix] Change default val of parallel_requests in eval script (13247) into r2.3.0 by @ko3n1g :: PR: #13253
  • Cherry pick Add tutorial for evaluation with Evals Factory (13259) into r2.3.0 by @ko3n1g :: PR: #13271
  • Cherry pick Fix default token durations (13168) into r2.3.0 by @ko3n1g :: PR: #13261
  • Cherry pick [Evaluation] Add support for nvidia-lm-eval==25.04 (13230) into r2.3.0 by @ko3n1g :: PR: #13274
  • Cherry pick [bug fix] set inference max seq len in inference context (13245) into r2.3.0 by @ko3n1g :: PR: #13276
  • Cherry pick More export and deploy unit tests (13178) into r2.3.0 by @ko3n1g :: PR: #13283
  • Cherry pick Reopen 13040 (13199) into r2.3.0 by @ko3n1g :: PR: #13303
  • Cherry pick Fix nemo1's neva notebook (13218) into r2.3.0 by @ko3n1g :: PR: #13312
  • Cherry pick build: various bumps (13285) into r2.3.0 by @ko3n1g :: PR: #13313
  • Cherry-pick ci: Increase cache pool into r2.3.0 by @chtruong814 :: PR: #13317
  • Cherry pick update num nodes in deepseek v3 finetune recipe (13314) into r2.3.0 by @ko3n1g :: PR: #13316
  • Cherry pick Fix neva notebook (13334) into r2.3.0 by @ko3n1g :: PR: #13335
  • Cherry-pick Add Llama4 Scout and Maverick Support (#12898) by @ko3n1g :: PR: #13331
  • Cherry pick Fix handling Llama Embedding dimensions param and prompt type in the ONNX export tutorial (13262) into r2.3.0 by @ko3n1g :: PR: #13326
  • Cherry-pick Fix transformer offline for CI/CD llama4 tests (#13339) to r2.3.0 by @chtruong814 :: PR: #13340
  • Fix llama4 test names by @chtruong814 :: PR: #13358
  • Cherry pick vLLM==0.8.5 update (13350) into r2.3.0 by @ko3n1g :: PR: #13354
  • Cherry-pick a test and doc fix to r2.3.0 by @chtruong814 :: PR: #13338
  • Cherry pick Add llama4 training recipe (12952) into r2.3.0 by @ko3n1g :: PR: #13386

NVIDIA Neural Modules 2.2.1

Highlights

  • Training
    • Fix training instability in MoE-based models.
    • Fix a bug in the Llama exporter for Llama 3.2 1B and 3B.
    • Fix a bug in the LoRA linear_fc1 adapter when different TP sizes are used for saving and loading the adapter checkpoint.

Detailed Changelogs

Uncategorized

Changelog
  • Re-add reverted commits after 2.2.0 and set next version to be 2.2.1 by @chtruong814 :: PR: #12587
  • Cherry pick Fix exporter for llama models with shared embed and output layers (12545) into r2.2.0 by @ko3n1g :: PR: #12608
  • Cherry pick Fix TP for LoRA adapter on linear_fc1 (12519) into r2.2.0 by @ko3n1g :: PR: #12607
  • Bump mcore to use 0.11.1 by @chtruong814 :: PR: #12634

NVIDIA Neural Modules 2.2.0

Highlights

  • Training
    • Blackwell and Grace Blackwell support
    • Pipeline parallel support for distillation
    • Improved NeMo Framework installation
  • Export & Deploy
    • vLLM export for NeMo 2.0
  • Evaluations
    • Integrate lm-eval-harness
  • Collections
    • LLM
      • DAPT example and best practices in NeMo 2.0
      • [NeMo 2.0] Enable Tool Learning and add a tutorial
      • Support GPT Embedding Model (Llama 3.2 1B/3B)
      • Qwen2.5, Phi4 (via AutoModel)
      • SFT for Llama 3.3 model (via AutoModel)
      • Support BERT Embedding Model with NeMo 2.0
      • DeepSeek SFT & PEFT Support
    • MultiModal
      • Clip
      • SP for NeVA
      • CP for NeVA
      • Intern-VIT
  • Automodel
    • Preview release.
    • PEFT and SFT support for LLMs available via Hugging Face’s AutoModelForCausalLM (see the sketch after this list).
    • Support for Hugging Face-native checkpoints (full model and adapter only).
    • Support for distributed training via DDP and FSDP2.
  • ASR/TTS
    • Lhotse: TPS-free 2D bucket estimation and filtering
    • Updated model outputs so that all ASR outputs follow a consistent format
    • Sortformer Release Model

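The AutoModel preview noted above fine-tunes directly from a Hugging Face checkpoint, without first converting it to a NeMo checkpoint. A minimal LoRA sketch, assuming the preview-era names `HFAutoModelForCausalLM`, `HFDatasetDataModule`, and `llm.peft.LoRA`; exact class and argument names may shift between releases, and the model/dataset IDs are placeholders.

```python
# Minimal AutoModel LoRA sketch (preview API; treat names as assumptions and
# compare against the AutoModel examples shipped with the release).
from nemo import lightning as nl
from nemo.collections import llm

llm.api.finetune(
    model=llm.HFAutoModelForCausalLM(model_name="meta-llama/Llama-3.2-1B"),
    data=llm.HFDatasetDataModule("rajpurkar/squad", split="train[:128]"),
    trainer=nl.Trainer(devices=1, accelerator="gpu", max_steps=50),
    peft=llm.peft.LoRA(target_modules=["*_proj"], dim=16),
)
```
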
Detailed Changelogs

ASR

Changelog
  • removed the line which caused a problem in nfa_tutorial by @Ssofja :: PR: #11710
  • TPS-free 2D bucket estimation and filtering by @pzelasko :: PR: #11738
  • Update transcribe_utils.py by @stevehuang52 :: PR: #11984
  • Sortformer Diarizer 4spk v1 model PR Part 4: Sortformer Documents and Notebook Tutorials by @tango4j :: PR: #11707
  • fix the issue during batched inference of Sortformer diarizer by @tango4j :: PR: #12047
  • changed asr models outputs to be consistent by @Ssofja :: PR: #11818
  • chore: Update notebooks by @ko3n1g :: PR: #12161
  • add ctc segmentation by @ko3n1g :: PR: #12312
  • clean up VAD tutorial by @stevehuang52 :: PR: #12410
  • copy from main by @nithinraok :: PR: #12423
  • ci: Disable ASR tests for now (#12443) by @ko3n1g :: PR: #12466
  • ASR_CTC_Language_Finetuning.ipynb bugfix by @lilithgrigoryan :: PR: #12538

TTS

Changelog
  • Add New Transformer Backbone for TTS Models by @blisc :: PR: #11911
  • changed asr models outputs to be consistent by @Ssofja :: PR: #11818
  • chore: Update notebooks by @ko3n1g :: PR: #12161

NLP / NMT

Changelog
  • Use explicit imports from megatronllm_deployable.py by @janekl :: PR: #11705
  • Bug fix minor bug in TRT-LLM deployment by @oyilmaz-nvidia :: PR: #11714
  • gpt moe perf scripts by @malay-nagda :: PR: #11760
  • Bump mcore by @ko3n1g :: PR: #11740
  • Enable packed seqs for validation by @jiemingz :: PR: #11748
  • Revert Mcore update since it caused regression by @pablo-garay :: PR: #11791
  • Fix Gemma2 Attention Init Args by @suiyoubi :: PR: #11792
  • Add null tokenizer by @erhoo82 :: PR: #11789
  • Fix DistCP inference issue by @suiyoubi :: PR: #11801
  • Add BERT Embedding Models E5 Recipe by @suiyoubi :: PR: #11787
  • Add rope scaling configs for NeMo 1 by @BoxiangW :: PR: #11807
  • Fix calculating num_available_samples by @huvunvidia :: PR: #11830
  • fix sentencepiece tokenizer special tokens by @akoumpa :: PR: #11811
  • add chat sft dataset to support agent tool calling by @chenrui17 :: PR: #11759
  • Revert "Revert Mcore update since it caused regression (#11791)" by @ko3n1g :: PR: #11799
  • fix checkpoint load issue by @dimapihtar :: PR: #11859
  • Fix nemo 1 packed sequence TE version error by @cuichenx :: PR: #11874
  • enable loading older TE checkpoints by @dimapihtar :: PR: #11930
  • ci: Use single runner machines for unit tests by @ko3n1g :: PR: #11937
  • llm performance scripts by @malay-nagda :: PR: #11736
  • [MoE] add expert tensor parallelism support for NeMo2.0 MoE by @gdengk :: PR: #11880
  • add exception when loading ckpt saved by TE < 1.13 by @dimapihtar :: PR: #11988
  • remove renormalize_blend_weights flag by @dimapihtar :: PR: #11975
  • Llama3.2 1B Embedding Model Support by @suiyoubi :: PR: #11909
  • Weekly bump by @ko3n1g :: PR: #11896
  • Debug Apex distributed optimizer to handle Transformer Engine 2.0 by @timmoon10 :: PR: #12004
  • throw MegatronOptimizerModule warning only with mcore models by @akoumpa :: PR: #12085
  • fix nmt dataclass issue by @dimapihtar :: PR: #12081
  • Propagate dp last changes from mcore by @ryantwolf :: PR: #12012
  • Add error message when downloading failed. by @yuanzhedong :: PR: #12139
  • interface for asymmetric pipeline schedule by @erhoo82 :: PR: #12039
  • chore: Update notebooks by @ko3n1g :: PR: #12161
  • Cherrypick #12382, #12415 and #12424 by @cuichenx :: PR: #12425
  • ASR_CTC_Language_Finetuning.ipynb bugfix by @lilithgrigoryan :: PR: #12538

Text Normalization / Inverse Text Normalization

Changelog
  • surface attn_implementation option by @akoumpa :: PR: #11873
  • attn_implementation eager fallback by @akoumpa :: PR: #12060

NeMo Tools

Changelog
  • build: Add sox to SDE by @ko3n1g :: PR: #11882
  • add ctc segmentation by @ko3n1g :: PR: #12312

Export

Changelog
  • Bug fix minor bug in TRT-LLM deployment by @oyilmaz-nvidia :: PR: #11714
  • In-framework deployment NeMo 2.0 nemo_export.py test by @janekl :: PR: #11749
  • Fix starcoder2 missing bias in nemo2 config for TRTLLM by @meatybobby :: PR: #11809
  • Autodetect dtype on exporting to TensorRT-LLM by @janekl :: PR: #11907
  • PTQ & TRT-LLM updates related to upcoming PyTorch 25.01 bump by @janekl :: PR: #11941
  • Run Flake8 for nemo.export module by @janekl :: PR: #11728
  • Skip initialization in hf export by @cuichenx :: PR: #12136
  • update export io call by @akoumpa :: PR: #12144
  • add default kwargs for trtllm model runner by @pablo-garay :: PR: #12248
  • cherry-pick: fix[export]: reshard model correctly handles extra_state when it's a tensor (#12132) by @terrykong :: PR: #12335

Bugfixes

Changelog
  • added required installation for sox to process mp3 files by @Ssofja :: PR: #11709
  • removed the line which caused a problem in nfa_tutorial by @Ssofja :: PR: #11710
  • Bug fix minor bug in TRT-LLM deployment by @oyilmaz-nvidia :: PR: #11714

Uncategorized

Changelog
  • Allow using vocab size from config by @shanmugamr1992 :: PR: #11718
  • Fix baseline recipes by @erhoo82 :: PR: #11725
  • Update changelog for r2.1.0 by @github-actions[bot] :: PR: #11745
  • ci: Fix changelog generator by @ko3n1g :: PR: #11744
  • Fix 'http_port' parameter name in DeployPyTriton usages and update .qnemo compress=True path by @janekl :: PR: #11747
  • Conversion NeMo and HF checkpoint script for T5 by @huvunvidia :: PR: #11739
  • Add BERT Embedding Models by @suiyoubi :: PR: #11737
  • Add server ready check before starting evaluation by @athitten :: PR: #11731
  • only install bitsandbytes on x86 by @akoumpa :: PR: #11781
  • [Bugfix] Skip processing if extra_state loads as None by @janekl :: PR: #11778
  • chore(beep boop 🤖): Bump MCORE_TAG=4dc8977... (2025-01-07) by @ko3n1g :: PR: #11768
  • make progress printer compatible with PTL v2.5.0 by @ashors1 :: PR: #11779
  • Fix Mistral Conversion Issue by @suiyoubi :: PR: #11786
  • build: Fix build-arg by @ko3n1g :: PR: #11815
  • Lora ckpt in HF format for NeMo AutoModel by @oyilmaz-nvidia :: PR: #11712
  • 8x22b seq len by @malay-nagda :: PR: #11788
  • Bugfix for output_generation_logits in tensorrtllm by @athitten :: PR: #11820
  • handle mistralai/Mistral-7B-Instruct-v0.3 tokenizer correctly by @akoumpa :: PR: #11839
  • remove tensorstore pin in requirements*.txt by @pstjohn :: PR: #11777
  • Do not load context for model transform in llm inference by @hemildesai :: PR: #11751
  • update nemo2sftpeft tutorial container verison by @HuiyingLi :: PR: #11832
  • Latest News updated for Cosmos by @lbliii :: PR: #11806
  • Removes tensorstore 0.1.45 pin from requirements_deploy.txt by @pstjohn :: PR: #11858
  • ci: Prune dangling images by @ko3n1g :: PR: #11885
  • Disable tests that download datasets from web by @akoumpa :: PR: #11878
  • Add context_logits for eval accuracy calculation in case of multi token prediction tasks by @athitten :: PR: #11753
  • add dataset_root to SpecterDataModule by @suiyoubi :: PR: #11837
  • Support both Path and str for APIs by @maanug-nv :: PR: #11865
  • Run nsys callback on GBS not on MBS by @akoumpa :: PR: #11861
  • ci: Set bump-branch to weekly by @ko3n1g :: PR: #11889
  • chore: Update mcore-tag-bump-bot.yml by @ko3n1g :: PR: #11891
  • ci: Bump Mcore in weekly PR by @ko3n1g :: PR: #11897
  • check restore_config first by @akoumpa :: PR: #11890
  • LinearAdapter: propagate args to _init_adapter by @akoumpa :: PR: #11902
  • NeMo 2.0 fp8 conversion by @Laplasjan107 :: PR: #11845
  • nemo ux expert tensor parallel by @akoumpa :: PR: #11903
  • Add CP support to Neva in NeMo2 by @yaoyu-33 :: PR: #11850
  • build: Move dependencies by @ko3n1g :: PR: #11790
  • Add Flux and Flux Controlnet Support to Diffusion folder by @Victor49152 :: PR: #11794
  • ci: Adjust bump mcore workflow by @ko3n1g :: PR: #11918
  • ci: Small fix to bump workflow by @ko3n1g :: PR: #11919
  • Revert #11890 and add a test that would have caught the error by @cuichenx :: PR: #11914
  • ci: Adjust input argument by @ko3n1g :: PR: #11921
  • Create test_phi3.py by @mayani-nv :: PR: #11843
  • Enable NeMo importer and loading dist CKPT for training by @Victor49152 :: PR: #11927
  • build: Pin triton by @ko3n1g :: PR: #11938
  • Add sharding for speechlm and vlm by @BoxiangW :: PR: #11876
  • Update torch load for load from disk by @thomasdhc :: PR: #11963
  • Add options to add mp_policy and parallel_fn for NeMo automodel fsdp2 by @BoxiangW :: PR: #11956
  • ci: Add coverage reports by @ko3n1g :: PR: #11912
  • Add batching support for evaluation by @athitten :: PR: #11934
  • add use_fast option by @akoumpa :: PR: #11976
  • improve error and debug messages in model connector by @cuichenx :: PR: #11979
  • [checkpoint][docs] Fix typos in dist checkpointing docs by @ananthsub :: PR: #11983
  • callbacks and bf16 grad by @malay-nagda :: PR: #11985
  • remove --disable-ckpt from tests by @akoumpa :: PR: #11996
  • nemo automodel sft squad data prep fix by @akoumpa :: PR: #11994
  • Introduce evaluation API by @Glorf :: PR: #11895
  • Remove deprecated tests/infer_data_path.py by @janekl :: PR: #11997
  • Checkpoint saving for automodels via ModelCheckpoint by @akoumpa :: PR: #11998
  • Mask vocab padding token ids from CE loss by @maanug-nv :: PR: #11999
  • Add the NeMo2 memory profiling plugin by @gdengk :: PR: #12009
  • chore(ci): Disable VMs cron job on forks by @mikemckiernan :: PR: #12020
  • Adding speechlm AutoModel test by @oyilmaz-nvidia :: PR: #11990
  • minor fix and simplify by @akoumpa :: PR: #12007
  • ci: Build wheel workflow by @ko3n1g :: PR: #12021
  • ci: Release workflow by @ko3n1g :: PR: #12022
  • Version bump to 2.2.0rc1 by @github-actions[bot] :: PR: #12023
  • ci: Run unit tests on main by @ko3n1g :: PR: #11986
  • [Audio] Fix extra step in Euler sampler for flow matching inference by @racoiaws :: PR: #11989
  • Set zarr range to >=2.18.2 and <3.0.0 by @chtruong814 :: PR: #12005
  • ci: Run linting per domain by @ko3n1g :: PR: #12027
  • Replace reference of requirements_infer.txt with requirements_deploy.txt by @chtruong814 :: PR: #12029
  • ci: Always run linting by @ko3n1g :: PR: #12035
  • ci: Retry on timeout by @ko3n1g :: PR: #11974
  • [MoE] fix run err in mixtral22B recipe and update its perf config by @gdengk :: PR: #12036
  • Version bump to 2.2.0rc2.dev0 by @github-actions[bot] :: PR: #12040
  • ci: Update weekly brain by @ko3n1g :: PR: #12043
  • ci: Update workflow by @ko3n1g :: PR: #12044
  • nemo-automodel: fsdp2 support for peft by @akoumpa :: PR: #12008
  • fix llama-3.1 hf model_id by @AtsunoriFujita :: PR: #11774
  • Clip Model in Nemo2 by @abhinavg4 :: PR: #11980
  • Adding TFLOPs callback for Multimodal models and NeVA calculator by @parthmannan :: PR: #11969
  • ci: Allow skipping docs by @ko3n1g :: PR: #12048
  • avoid mismatch error when loading older TE checkpoints by @dimapihtar :: PR: #12028
  • Add padding in mllama vision encoder to align with HF by @meatybobby :: PR: #11808
  • chore: Add warning for rebase by @ko3n1g :: PR: #12061
  • ci: Lint Python files only by @ko3n1g :: PR: #12064
  • Recipe changes for performance by @guyueh1 :: PR: #11763
  • Pipeline-parallel support for Knowledge Distillation (NeMo 2) by @AAnoosheh :: PR: #11766
  • add cp_comm_type param to Mistral config by @dimapihtar :: PR: #12049
  • Conformer-based spectrogram estimator by @anteju :: PR: #12002
  • Adding nemo CI by @abhinavg4 :: PR: #12052
  • Update optimization features readme from nemo1 to nemo2 by @yaoyu-33 :: PR: #12071
  • Add Llama Embedding Tutorial by @suiyoubi :: PR: #12042
  • Fix Linting by @suiyoubi :: PR: #12079
  • Fix hf_dataset bug by @BoxiangW :: PR: #12072
  • set TOKENIZERS_PARALLELISM=True by @akoumpa :: PR: #12083
  • minor fix in model's summary indentation during logging by @akoumpa :: PR: #12084
  • Refactor VLM modules / Add InternVit submodule support by @yaoyu-33 :: PR: #11851
  • Fix SBERT with sequence_len_offset by @suiyoubi :: PR: #12057
  • ci: codecov by @ko3n1g :: PR: #12030
  • build: Improve installer by @ko3n1g :: PR: #12016
  • ci: Modular unit tests by @ko3n1g :: PR: #12104
  • ci: Update bump workflow by @ko3n1g :: PR: #12106
  • etp docs by @akoumpa :: PR: #12111
  • build: Better caching by @ko3n1g :: PR: #12109
  • ci: Fix flaky test by @ko3n1g :: PR: #12113
  • Ensure nemo.collections.vlm does not strictly require transformer engine by @chtruong814 :: PR: #12108
  • build: Optimize by @ko3n1g :: PR: #12112
  • refactor peft module matching; introduce exclude_modules by @akoumpa :: PR: #12066
  • Update mcore commit (02.06.25) by @pablo-garay :: PR: #12114
  • ci: Bump Mcore inplace by @ko3n1g :: PR: #12115
  • ci: Bump bot by @ko3n1g :: PR: #12117
  • Add neva pretrain script by @yaoyu-33 :: PR: #12033
  • DAPT playbooks - with NeMo 2.0 by @jvamaraju :: PR: #12067
  • Malay/bw scripts by @malay-nagda :: PR: #11961
  • [MoE] Add type annotation for mixtral configs by @gdengk :: PR: #12126
  • ci: Disable checks by @ko3n1g :: PR: #12129
  • Add performance-optimized example for llama2 70b LoRA by @vysarge :: PR: #12055
  • Add Automodel support for Deepseek v3 model by @BoxiangW :: PR: #12099
  • Bug fix with generation of expert_tensor_parallel_rank by @guyueh1 :: PR: #12125
  • Rename neva datamodule by @yaoyu-33 :: PR: #12121
  • Update vLLM to 0.7.2 by @Laplasjan107 :: PR: #12078
  • Prevent downloading dataset every time in ci test by @cuichenx :: PR: #12095
  • AudioToAudioModel: fix model->dataloader sample_rate parameter injection by @racoiaws :: PR: #12092
  • Minor Bug Fixes - LLaMa Embedding by @soluwalana :: PR: #12146
  • build: Force re-install VCS dependencies by @ko3n1g :: PR: #12155
  • Cherry pick build: Force re-install VCS dependencies (12155) into r2.2.0 by @ko3n1g :: PR: #12191
  • Cherry pick Add function calling SFT NeMo2.0 tutorial (11868) into r2.2.0 by @ko3n1g :: PR: #12180
  • Cherry pick Update TTS code to remove calls to deprecated functions (12153) into r2.2.0 by @ko3n1g :: PR: #12201
  • Cherry pick Fix multi-GPU in-framework deployment (12090) into r2.2.0 by @ko3n1g :: PR: #12172
  • Cherry pick disable moe logging to avoid deepseek hang (12168) into r2.2.0 by @ko3n1g :: PR: #12192
  • Cherry pick build: Pin down transformers (12229) into r2.2.0 by @ko3n1g :: PR: #12230
  • Cherry pick Fix loading extra states from torch tensor (12185) into r2.2.0 by @ko3n1g :: PR: #12226
  • Cherry pick nemo-automodel checkpoint-io refactor (12070) into r2.2.0 by @ko3n1g :: PR: #12234
  • ci: Flaky tests release by @ko3n1g :: PR: #12293
  • Cherry pick Set L2_Speech_Batch_Size_OOMptimizer_Canary to be optional (12299) into r2.2.0 by @ko3n1g :: PR: #12300
  • build: Editable nemo install (#12304) by @ko3n1g :: PR: #12308
  • ci: Fix test workflow by @ko3n1g :: PR: #12311
  • Cherry pick build: Exclude tensorstore 0.1.72 (12317) into r2.2.0 by @ko3n1g :: PR: #12318
  • Cherry pick Fix the local path in Sortformer diarizer training tutorial (12135) into r2.2.0 by @ko3n1g :: PR: #12316
  • Cherry pick Add eval requirement to setup.py (12152) into r2.2.0 by @ko3n1g :: PR: #12277
  • Cherry pick Add modelopt to requirements_nlp.txt (12261) into r2.2.0 by @ko3n1g :: PR: #12278
  • cherry pick 12209 by @akoumpa :: PR: #12240
  • Cherry pick Energon ckpt multimodal (12245) into r2.2.0 by @ko3n1g :: PR: #12307
  • Cherry pick [nemo1] Fix Mamba/Bert loading from checkpoint after TE extra states were introduced (12275) into r2.2.0 by @ko3n1g :: PR: #12314
  • Cherry pick fix masked loss calculation (12255) into r2.2.0 by @ko3n1g :: PR: #12286
  • chore: Cherry pick deepseek by @ko3n1g :: PR: #12324
  • build: Bump PyT to 25.01 (#11973) by @ko3n1g :: PR: #12323
  • Cherry pick build: Bump mcore (12320) into r2.2.0 by @ko3n1g :: PR: #12328
  • Cherry pick [automodel] re-enable FSDP2 tests (12325) into r2.2.0 by @ko3n1g :: PR: #12331
  • Cherry pick [automodel] fix loss reporting (12303) into r2.2.0 by @ko3n1g :: PR: #12334
  • build: Bump Mcore by @ko3n1g :: PR: #12340
  • Cherry-pick Asr fixes 2.2 (#12227) by @ko3n1g :: PR: #12345
  • Cherry-pick Bug fixes (#12315) by @chtruong814 :: PR: #12346
  • Cherry pick [automodel] remove fix_progress_bar from fsdp2 strategy (12339) into r2.2.0 by @ko3n1g :: PR: #12347
  • Cherry pick Fix NeMo1 Bert Embedding Dataset args (12341) into r2.2.0 by @ko3n1g :: PR: #12349
  • Cherry pick Fix NeMo1 sequence_len_offset in Bert fwd (12350) into r2.2.0 by @ko3n1g :: PR: #12359
  • Cherry pick Add nemo-run recipe for evaluation (12301) into r2.2.0 by @ko3n1g :: PR: #12352
  • Cherry pick Add DeepSeek-R1 Distillation NeMo 2.0 tutorial (12187) into r2.2.0 by @ko3n1g :: PR: #12355
  • chore: Update package_info.py by @ko3n1g :: PR: #12362
  • Version bump to 2.2.0rc4.dev0 by @github-actions[bot] :: PR: #12363
  • Bump mcore to latest commit on release branch by @chtruong814 :: PR: #12360
  • Cherry pick [automodel] add lr scheduler (12351) into r2.2.0 by @ko3n1g :: PR: #12361
  • Cherry pick [automodel] add distributed data sampler (12326) into r2.2.0 by @ko3n1g :: PR: #12373
  • Cherry pick [NeVA] Fix for CP+THD (12366) into r2.2.0 by @ko3n1g :: PR: #12375
  • Cherry pick Ignore attribute error when serializing mcore specs (12353) into r2.2.0 by @ko3n1g :: PR: #12383
  • Cherry pick Avoid init_ddp for inference (12011) into r2.2.0 by @ko3n1g :: PR: #12385
  • Cherry pick [docs] fix notebook render (12374) into r2.2.0 by @ko3n1g :: PR: #12394
  • Cherry pick Neva finetune scripts and PP fix (12387) into r2.2.0 by @ko3n1g :: PR: #12397
  • Cherry pick [automodel] update runner tags for notebooks (12428) into r2.2.0 by @ko3n1g :: PR: #12431
  • Cherry pick [automodel] update examples (12411) into r2.2.0 by @ko3n1g :: PR: #12432
  • Cherry pick Evaluation docs (12348) into r2.2.0 by @ko3n1g :: PR: #12460
  • Cherry pick Update prompt format (12452) into r2.2.0 by @ko3n1g :: PR: #12455
  • Cherry pick Fixing a wrong Sortformer Tutorial Notebook path. (12479) into r2.2.0 by @ko3n1g :: PR: #12480
  • Cherry pick added needed checks and changes for bugfix (12400) into r2.2.0 by @Ssofja :: PR: #12447
  • Cherry pick [automodel] fix loss/tps reporting across ranks (12389) into r2.2.0 by @ko3n1g :: PR: #12413
  • Cherry pick enable fsdp flag for FSDP2Strategy (12392) into r2.2.0 by @ko3n1g :: PR: #12429
  • Cherry pick Fix lita notebook issue (12474) into r2.2.0 by @ko3n1g :: PR: #12476
  • Cherrypick multinode tutorial changes by @BoxiangW :: PR: #12501
  • Cherry pick Changed the argument types passed to metrics calculation functions (12500) into r2.2.0 by @ko3n1g :: PR: #12502
  • Cherry pick added needed fixes (12495) into r2.2.0 by @ko3n1g :: PR: #12509
  • Cherry pick update transformers version requirements (12475) into r2.2.0 by @ko3n1g :: PR: #12507
  • Cherry pick [checkpoint] Log timings for checkpoint IO save and load (11972) into r2.2.0 by @ko3n1g :: PR: #12520
  • Cherry pick a few checks needed because of the change of ASR models output (12499) into r2.2.0 by @ko3n1g :: PR: #12513
  • Oyilmaz nvidia/chore/cherry pick 12242 by @oyilmaz-nvidia :: PR: #12523
  • Cherry pick Remove _attn_implementation in LlamaBidirectionalModel constructor (12364) into r2.2.0 by @ko3n1g :: PR: #12525
  • Cherry pick Configure FSDP to keep module params (12074) into r2.2.0 by @ko3n1g :: PR: #12524
  • Cherry pick [automodel] docs (11942) into r2.2.0 by @ko3n1g :: PR: #12530
  • Cherry pick [automodel] update examples' comments (12518) and [automodel] Move PEFT to configure_model (#12491) into r2.2.0 by @ko3n1g :: PR: #12529
  • Cherry pick update readme to include latest pytorch version (12539) into r2.2.0 by @ko3n1g :: PR: #12577
  • Publish r2.2.0 by @chtruong814 :: PR: #12583

NVIDIA Neural Modules 2.1.0

Highlights

  • Training
    • Fault Tolerance
      • Straggler Detection
      • Auto Relaunch
  • LLM & MM
    • MM models
      • Llava-next
      • Llama 3.2
    • Sequence Model Parallel for NeVA
    • Enable Energon
    • SigLIP (NeMo 1.0 only)
    • LLM 2.0 migration
      • Starcoder2
      • Gemma 2
      • T5
      • Baichuan
      • BERT
      • Mamba
      • ChatGLM
    • DoRA support
  • Export
    • NeMo 2.0 base model export path for NIM
    • PTQ in NeMo 2.0
  • ASR
    • Timestamps with TDT decoder
    • Timestamps option with .transcribe()
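
The timestamps option above attaches timing information to ordinary transcription calls. A minimal sketch, assuming a public TDT checkpoint; the exact keys inside each timestamp entry can vary by release, so inspect one hypothesis to confirm.

```python
# Sketch of the 2.1.0 timestamps option on .transcribe(). The checkpoint name
# is an example; word/segment/char granularities are exposed via a dict.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")
hypotheses = asr_model.transcribe(["audio.wav"], timestamps=True)

for word_stamp in hypotheses[0].timestamp["word"]:  # also "segment", "char"
    print(word_stamp)
```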

Detailed Changelogs

ASR

Changelog
  • [Fix] Fixed sampler override and audio_key in prepare_audio_data by @anteju :: PR: #10980
  • Akoumparouli/mixtral recipe fix r2.0.0 by @akoumpa :: PR: #10994
  • TDT compute timestamps option and Extra Whitespace handling for SPE by @monica-sekoyan :: PR: #10875
  • ci: Switch to CPU only runner by @ko3n1g :: PR: #11035
  • Fix timestamps tests by @monica-sekoyan :: PR: #11053
  • ci: Pin release freeze by @ko3n1g :: PR: #11143
  • Fix RNN-T loss memory usage by @artbataev :: PR: #11144
  • Added deprecation notice by @Ssofja :: PR: #11133
  • Fixes for Canary adapters tutorial by @pzelasko :: PR: #11184
  • add ipython import guard by @nithinraok :: PR: #11191
  • Self Supervised Pre-Training tutorial Fix by @monica-sekoyan :: PR: #11206
  • update the return type by @nithinraok :: PR: #11210
  • Timestamps to transcribe by @nithinraok :: PR: #10950
  • [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
  • Beam search algorithm implementation for TDT models by @lilithgrigoryan :: PR: #10903
  • Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
  • Remove pytorch-lightning by @maanug-nv :: PR: #11306
  • update hypothesis when passed through cfg by @nithinraok :: PR: #11366
  • Revert "update hypothesis when passed through cfg" by @pablo-garay :: PR: #11373
  • Fix transcribe speech by @nithinraok :: PR: #11379
  • Lhotse support for transcribe_speech_parallel by @nune-tadevosyan :: PR: #11249
  • Sortformer Diarizer 4spk v1 model PR Part 1: models, modules and dataloaders by @tango4j :: PR: #11282
  • Removing unnecessary lines by @nune-tadevosyan :: PR: #11408
  • Support for initializing lhotse shar dataloader via field: list[path] mapping by @pzelasko :: PR: #11460
  • New extended prompt format for Canary, short utterances inference fix, and training micro-optimizations by @pzelasko :: PR: #11058
  • Fixing Multi_Task_Adapters.ipynb by replacing canary2 with canary_custom by @weiqingw4ng :: PR: #11636

TTS

Changelog
  • [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
  • Add T5TTS by @blisc :: PR: #11193
  • Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
  • Remove pytorch-lightning by @maanug-nv :: PR: #11306
  • Add nvidia/low-frame-rate-speech-codec-22khz model on docs by @Edresson :: PR: #11457

NLP / NMT

Changelog
  • Move collections.nlp imports inline for t5 by @marcromeyn :: PR: #10877
  • Use a context-manager when opening files by @akoumpa :: PR: #10895
  • Packed sequence bug fixes by @cuichenx :: PR: #10898
  • ckpt convert bug fixes by @dimapihtar :: PR: #10878
  • remove deprecated ci tests by @dimapihtar :: PR: #10922
  • Update T5 tokenizer (adding additional tokens to tokenizer config) by @huvunvidia :: PR: #10972
  • Add support and recipes for HF models via AutoModelForCausalLM by @akoumpa :: PR: #10962
  • gpt3 175b cli by @malay-nagda :: PR: #10985
  • Fix for crash with LoRA + tp_overlap_comm=false + sequence_parallel=true by @vysarge :: PR: #10920
  • Update BaseMegatronSampler for compatibility with PTL's _BatchProgress by @ashors1 :: PR: #11016
  • add deprecation note by @dimapihtar :: PR: #11024
  • Update ModelOpt Width Pruning example defaults by @kevalmorabia97 :: PR: #10902
  • switch to NeMo 2.0 recipes by @dimapihtar :: PR: #10948
  • NeMo 1.0: upcycle dense to moe by @akoumpa :: PR: #11002
  • Gemma2 in Nemo2 with Recipes by @suiyoubi :: PR: #11037
  • Add Packed Seq option to GPT based models by @suiyoubi :: PR: #11100
  • Fix MCoreGPTModel import in llm.gpt.model.base by @hemildesai :: PR: #11109
  • TP+MoE peft fix by @akoumpa :: PR: #11114
  • GPT recipes to use full te spec by @JimmyZhang12 :: PR: #11119
  • Virtual pipeline parallel support for LoRA in NLPAdapterModelMixin by @vysarge :: PR: #11128
  • update nemo args for mcore flash decode arg change by @HuiyingLi :: PR: #11138
  • Call ckpt_to_weights_subdir from MegatronCheckpointIO by @ashors1 :: PR: #10897
  • [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
  • fix(export): GPT models w/ bias=False convert properly by @terrykong :: PR: #11255
  • Use MegatronDataSampler in HfDatasetDataModule by @akoumpa :: PR: #11274
  • Add T5TTS by @blisc :: PR: #11193
  • ci: Exclude CPU machines from scan by @ko3n1g :: PR: #11300
  • Revert "fix(export): GPT models w/ bias=False convert properly" by @terrykong :: PR: #11301
  • remove redundant docs by @sharathts :: PR: #11302
  • Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
  • Add attention_bias argument in transformer block and transformer layer modules, addressing change in MCore by @yaoyu-33 :: PR: #11289
  • Remove pytorch-lightning by @maanug-nv :: PR: #11306
  • Update T5 attention-mask shapes to be compatible with all attention-backend in new TE versions by @huvunvidia :: PR: #11059
  • Add support for restoring from 2.0 checkpoint in 1.0 by @hemildesai :: PR: #11347
  • Fix Gemma2 Attention Args by @suiyoubi :: PR: #11365
  • mlm conversion & tiktokenizer support by @dimapihtar :: PR: #11349
  • [Nemo1] Generate sharded optimizer state dicts only if needed for saving by @ananthsub :: PR: #11451
  • add hindi tn/itn coverage by @mgrafu :: PR: #11382
  • chore(beep boop 🤖): Bump MCORE_TAG=67a50f2... (2024-11-28) by @ko3n1g :: PR: #11427
  • Handle exception when importing RetroGPTChunkDatasets by @guyueh1 :: PR: #11415
  • Update restore from config for gpt type continual training in NeMo1 by @yaoyu-33 :: PR: #11471
  • ci: Re-enable L2_Megatron_LM_To_NeMo_Conversion by @ko3n1g :: PR: #11484
  • Apply packed sequence params change for fused rope compatibility by @ananthsub :: PR: #11506
  • Huvu/tiktoken tokenizer update by @huvunvidia :: PR: #11494

Text Normalization / Inverse Text Normalization

Changelog
  • Adding support for LightningDataModule inside Fabric-API by @marcromeyn :: PR: #10879
  • Add registry to register all needed classes with artifacts in nemo.lightning.io by @hemildesai :: PR: #10861
  • Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
  • Remove pytorch-lightning by @maanug-nv :: PR: #11306
  • add hindi tn/itn coverage by @mgrafu :: PR: #11382

Export

Changelog
  • Update engine build step for TRT-LLM 0.13.0 by @janekl :: PR: #10880
  • Nemo 2.0 ckpt support in TRT-LLM export by @oyilmaz-nvidia :: PR: #10891
  • Fix TRTLLM parallel_embedding by @meatybobby :: PR: #10975
  • Export & deploy updates (part I) by @janekl :: PR: #10941
  • Add doc-strings to import & export + improve logging by @marcromeyn :: PR: #11078
  • NeMo-UX: fix nemo-ux export path by @akoumpa :: PR: #11081
  • Fix TRTLLM nemo2 activation parsing by @meatybobby :: PR: #11062
  • Support exporting Nemotron-340B for TensorRT-LLM by @jinyangyuan-nvidia :: PR: #11015
  • vLLM Hugging Face exporter by @oyilmaz-nvidia :: PR: #11124
  • Fix export of configuration parameters to Weights and Biases by @soluwalana :: PR: #10995
  • Change activation parsing in TRTLLM by @meatybobby :: PR: #11173
  • Remove builder_opt param from trtllm-build for TensorRT-LLM >= 0.14.0 by @janekl :: PR: #11259
  • fix(export): GPT models w/ bias=False convert properly by @terrykong :: PR: #11255
  • fix(export): update API for disabling device reassignment in TRTLLM for Aligner by @terrykong :: PR: #10863
  • Add openai-gelu in gated activation for TRTLLM export by @meatybobby :: PR: #11293
  • Revert "fix(export): GPT models w/ bias=False convert properly" by @terrykong :: PR: #11301
  • Adding aligner export by @shanmugamr1992 :: PR: #11269
  • Export & deploy updates (part II) by @janekl :: PR: #11344
  • Introducing TensorRT lazy export and caching option with trt_compile() by @borisfom :: PR: #11266
  • fix: export converts properly if no model_prefix by @terrykong :: PR: #11477

Bugfixes

Changelog
  • Change default ckpt name by @maanug-nv :: PR: #11277
  • Fix patching of NeMo tokenizers for correct Lambada evaluation by @janekl :: PR: #11326

Uncategorized

Changelog
  • ci: Use Slack group by @ko3n1g :: PR: #10866
  • Bump Dockerfile.ci (2024-10-14) by @ko3n1g :: PR: #10871
  • Fix peft resume by @cuichenx :: PR: #10887
  • call post_init after altering config values by @akoumpa :: PR: #10885
  • Late import prettytable by @maanug-nv :: PR: #10912
  • Bump Dockerfile.ci (2024-10-17) by @ko3n1g :: PR: #10919
  • Warning for missing FP8 checkpoint support for vLLM deployment by @janekl :: PR: #10906
  • Fix artifact saving by @hemildesai :: PR: #10914
  • Lora improvement by @cuichenx :: PR: #10918
  • Huvu/t5 nemo2.0 peft by @huvunvidia :: PR: #10916
  • perf recipes and Mcore DistOpt params by @malay-nagda :: PR: #10883
  • ci: Fix cherry pick team by @ko3n1g :: PR: #10945
  • Fix requirements for MacOS by @artbataev :: PR: #10930
  • Fix nemo 2.0 recipes by @BoxiangW :: PR: #10915
  • Akoumparouli/nemo ux fix dir or string artifact by @akoumpa :: PR: #10936
  • Fix typo in docstring by @ashors1 :: PR: #10955
  • [Nemo CICD] Remove deprecated tests by @pablo-garay :: PR: #10960
  • Restore NeMo 2.0 T5 pretraining CICD test by @huvunvidia :: PR: #10952
  • Convert perf plugin env vars to strings by @hemildesai :: PR: #10947
  • disable dynamo for ddp checker by @akoumpa :: PR: #10961
  • Bump Dockerfile.ci (2024-10-21) by @ko3n1g :: PR: #10965
  • respect warnings' filters by @akoumpa :: PR: #10953
  • Alit/mamba recipe by @JRD971000 :: PR: #10935
  • Long context performance doc hot fix by @youngeunkwon0405 :: PR: #10946
  • Performance mode by @malay-nagda :: PR: #10926
  • Bump Dockerfile.ci (2024-10-22) by @ko3n1g :: PR: #10979
  • Add more recipes by @cuichenx :: PR: #10957
  • ci: Update tests by @ko3n1g :: PR: #10987
  • Bump Dockerfile.ci (2024-10-23) by @ko3n1g :: PR: #11001
  • llm.generate fixes by @HuiyingLi :: PR: #10983
  • use dict in check by @akoumpa :: PR: #11012
  • LoRA support for HF::AutoModelForCausalLM by @akoumpa :: PR: #10982
  • Change default for always_save_context to True by @athitten :: PR: #11014
  • Fix pip install by @marcromeyn :: PR: #11026
  • Change dist ckpt defaults by @ShriyaPalsamudram :: PR: #10913
  • Fix _strategy_lib tests by @maanug-nv :: PR: #11033
  • Basic online dynamic FP8 quantization with vLLM by @janekl :: PR: #10904
  • Expose packed seq in finetuning recipes by @cuichenx :: PR: #11006
  • PEFT Inference by @cuichenx :: PR: #11030
  • added Lhotse online augmentation tutorial for SE by @nasretdinovr :: PR: #10944
  • Bump Dockerfile.ci (2024-10-27) by @ko3n1g :: PR: #11051
  • ci: Send team alerts on specific keywords by @ko3n1g :: PR: #10986
  • Qwen2 Recipe by @suiyoubi :: PR: #10974
  • Bump Dockerfile.ci (2024-10-28) by @ko3n1g :: PR: #11054
  • Generalizing Inference pipeline in NeMo 2.0 to support encoder-decoder models by @huvunvidia :: PR: #10924
  • [Bug fix] In energon MultiModalSampleConfig use default_factory in dataclass by @guyueh1 :: PR: #11041
  • fix: Resolve mutable default issue in MultiModalSampleConfig dataclass by @michal2409 :: PR: #11061
  • SC1/SC2 Recipe by @suiyoubi :: PR: #10971
  • Wrap batch_sampler with _IndexBatchSamplerWrapper by @farhadrgh :: PR: #10934
  • Performance fine-tuning recipes for llama3 8b + 70b by @vysarge :: PR: #11046
  • Set TE spec name for NeMo to HF checkpoint converters by @kevalmorabia97 :: PR: #11036
  • ci: Re-add secrets detector by @ko3n1g :: PR: #11038
  • Adding nemo-run recipes for NeMo 2.0 T5 by @huvunvidia :: PR: #10964
  • Minor fixes for NeMo 2.0 PTQ by @Laplasjan107 :: PR: #11079
  • Add copyright check by @pablo-garay :: PR: #11048
  • Fix finalize model grad for PEFT by @cuichenx :: PR: #11065
  • ci: Less verbose infra alerts by @ko3n1g :: PR: #11080
  • Add copyright notice by @pablo-garay :: PR: #11085
  • ci: Fix cron schedule by @ko3n1g :: PR: #11076
  • ci: Use code-freeze via Nemo-FW-Templates by @ko3n1g :: PR: #11073
  • Akoumparouli/hf lit module peft ckpt bugfix by @akoumpa :: PR: #11022
  • PEFT perf and TE spec fixes by @JimmyZhang12 :: PR: #11070
  • Bump Dockerfile.ci (2024-10-30) by @ko3n1g :: PR: #11092
  • NeMorun for NeMo 2.0 T5 finetuning by @huvunvidia :: PR: #11040
  • fix model_checkpoint.py by @ethanhe42 :: PR: #11057
  • Update PTQ tests and ModelOpt version by @janekl :: PR: #11095
  • Fix datasets in CLI by @marcromeyn :: PR: #11097
  • Fix yaml serialization in io mixin by @hemildesai :: PR: #11106
  • disable overlap_param_gather_with_optimizer_step by @JimmyZhang12 :: PR: #11102
  • nemo1 to nemo2 checkpoint convert by @HuiyingLi :: PR: #10937
  • fix expert regex filter by @akoumpa :: PR: #11103
  • Remove stale checkpoint deletion on checkpoint saving failure by @akoumpa :: PR: #11116
  • NeMo-UX: Mistral/mixtral peft ci test by @akoumpa :: PR: #11094
  • Make nemo.collections.llm PreTrainingDataModule num samples configurable by @hemildesai :: PR: #11088
  • Fix packed seq path by @cuichenx :: PR: #11121
  • Allow arguments passed to dataset class + Gemma recipe fix by @cuichenx :: PR: #11125
  • Nemotron Recipe by @suiyoubi :: PR: #11118
  • NeMo-UX: HF PeFT fix by @akoumpa :: PR: #11096
  • Remove deprecated tests by @pablo-garay :: PR: #11134
  • Recipe Fix for NeMo CI by @suiyoubi :: PR: #11127
  • Fix freeze_model call in peft by @cuichenx :: PR: #11146
  • Bump Dockerfile.ci (2024-11-05) by @ko3n1g :: PR: #11159
  • NeMo-UX: Add sgd optim by @akoumpa :: PR: #11157
  • Update copyright check by @pablo-garay :: PR: #11168
  • add lora recipt for 405b by @JRD971000 :: PR: #10991
  • DiT training diagrams by @zpx01 :: PR: #10873
  • ci: Switch to FW templates for build by @ko3n1g :: PR: #11077
  • Bump Dockerfile.ci (2024-11-06) by @ko3n1g :: PR: #11174
  • feat: Run PyLint by @ko3n1g :: PR: #11147
  • Add Alpaca Finetune Datamodule by @suiyoubi :: PR: #11185
  • Updated Diffusion Collection README by @zpx01 :: PR: #11179
  • Add support for Cosmos Tokenizers by @jojennin :: PR: #11194
  • Run formatting only if files changed. Echo message if pylint fails. by @artbataev :: PR: #11188
  • Bump Dockerfile.ci (2024-11-07) by @ko3n1g :: PR: #11196
  • Fix rotary_percentage parsing in nemo2 config by @meatybobby :: PR: #11197
  • ci: Update cherry pick workflow by @ko3n1g :: PR: #11202
  • ci: Build, test, publish a wheel by @ko3n1g :: PR: #11183
  • Bump Dockerfile.ci (2024-11-08) by @ko3n1g :: PR: #11222
  • update default pipeline_parallelism_type by @akoumpa :: PR: #11213
  • check actual value of vocab_file by @akoumpa :: PR: #11228
  • Fix VP Initialization Issue with Latest MCore by @suiyoubi :: PR: #11209
  • ci: Run Pylint strictly on new files, softly on history by @ko3n1g :: PR: #11212
  • ci: Add release workflow by @ko3n1g :: PR: #11180
  • Fix llm.generate by @hemildesai :: PR: #11217
  • Bump Dockerfile.ci (2024-11-11) by @ko3n1g :: PR: #11247
  • Bump Dockerfile.ci (2024-11-12) by @ko3n1g :: PR: #11254
  • Handling tokenizer in PTQ for Nemo 2.0 by @janekl :: PR: #11237
  • Fix finetuning datamodule resume by @cuichenx :: PR: #11187
  • ci: Move bump mcore to templates by @ko3n1g :: PR: #11229
  • ci: Fix secrets detector by @ko3n1g :: PR: #11205
  • chore(beep boop 🤖): Bump MCORE_TAG=aded519... (2024-11-12) by @ko3n1g :: PR: #11260
  • ci: Run secrets detector on pull_request_target by @ko3n1g :: PR: #11263
  • Advanced Diffusion Training Features by @zpx01 :: PR: #11246
  • Update pruning and distillation tutorial notebooks by @gvenkatakris :: PR: #11091
  • update nemo1->2 conversion according to changes in main by @HuiyingLi :: PR: #11253
  • Add llama 3.1 recipes by @cuichenx :: PR: #11273
  • Fix Finetune Recipe by @suiyoubi :: PR: #11267
  • Configure no restart validation loop in nl.Trainer by @hemildesai :: PR: #11029
  • Handle _io_unflatten_object when _thread_local.output_dir is not available by @hemildesai :: PR: #11199
  • Remove opencc upperbound by @thomasdhc :: PR: #10909
  • Fix head_size in NeMo to HF checkpoint converters for width pruned model support by @eagle705 :: PR: #11230
  • Fixes per comments by @gvenkatakris :: PR: #11280
  • Create phi3mini.py by @mayani-nv :: PR: #11281
  • ci: Fix release workflow by @ko3n1g :: PR: #11286
  • fix perf plugin CUDA_DEVICE_MAX_CONNECTIONS setting by @JimmyZhang12 :: PR: #11299
  • PTQ via NeMo-Run CLI by @janekl :: PR: #10984
  • PTQ memory optimization by @Laplasjan107 :: PR: #11257
  • Update README.md for collection page by @yaoyu-33 :: PR: #11223
  • Adding multimodal examples by @shanmugamr1992 :: PR: #11279
  • Add HF untrusted code toggle by @akoumpa :: PR: #11313
  • P2p chunk size setting in nemo 2.0 by @erhoo82 :: PR: #11312
  • Nemo2 batcheval by @HuiyingLi :: PR: #11158
  • DoRA by @cuichenx :: PR: #11104
  • Profiling - support Chakra & Kineto trace dumping by @lilyw97 :: PR: #11115
  • NeMo 2.0 SFT PEFT notebooks by @HuiyingLi :: PR: #10874
  • Update symlink option for save_last in ModelCheckpoint by @paul-gibbons :: PR: #11319
  • ci: Pass-through of workflow_event by @ko3n1g :: PR: #11322
  • Add StragglerDetection and auto-relaunch to NeMo2.0 by @ShriyaPalsamudram :: PR: #11328
  • Huvu/t5 nemo2.0 nemoci by @huvunvidia :: PR: #11291
  • TE acceleration using callbacks by @oyilmaz-nvidia :: PR: #11261
  • Leave target_module as default in PEFT Recipes by @cuichenx :: PR: #11334
  • More robust tar file loading from AIStore by @pzelasko :: PR: #11323
  • Fix CLIP transformer layer api by @yaoyu-33 :: PR: #11337
  • pass trust_remote_code to AutoTokenizer by @akoumpa :: PR: #11343
  • Fix linear layer replacement by @oyilmaz-nvidia :: PR: #11356
  • fix typo by @JRD971000 :: PR: #11351
  • Add torchrun local executor to recipes by @marcromeyn :: PR: #11342
  • Add PP support in NeVA along with few bug fixes by @yaoyu-33 :: PR: #11170
  • nemo2 peft merge by @HuiyingLi :: PR: #11017
  • Add dora recipes by @cuichenx :: PR: #11330
  • add fix to recipe by @JRD971000 :: PR: #11368
  • Add missing test to CICD needed list by @pablo-garay :: PR: #11376
  • update SquadDataModule to use run.config by @huvunvidia :: PR: #11358
  • Add llama 3.2 1b and 3b by @cuichenx :: PR: #11335
  • calculate metrics for nemo2 sftpeft notebook by @HuiyingLi :: PR: #11381
  • Enable packed dataset for validation; add a2a_experimental argument by @michal2409 :: PR: #11378
  • Fix DDP unused param error when TE is enabled in NeMo Lite by @oyilmaz-nvidia :: PR: #11364
  • Update llama32 vision (mllama) use attention bias by @yaoyu-33 :: PR: #11316
  • Fix environment variables in torchrun executor by @hemildesai :: PR: #11363
  • Add sample generate to PTQ for NeMo 2.0 by @Laplasjan107 :: PR: #11339
  • Fix selective restore by explicitly verifying keys by @hemildesai :: PR: #11377
  • Minor fix by @gvenkatakris :: PR: #11353
  • Add a fix for single-GPU nsys. by @tfogal :: PR: #11354
  • capitalize HF as HF instead of Hf by @akoumpa :: PR: #11384
  • ci: Add HF cache by @ko3n1g :: PR: #11398
  • Remove logic to skip checkpoint save if checkpoint exists by @ashors1 :: PR: #11362
  • Rewire tokenizer exception handling in model resume by @cuichenx :: PR: #11375
  • Adding LLava-Next model class by @yashaswikarnati :: PR: #11399
  • Fix vllm test issue when run_accuracy is enabled by @oyilmaz-nvidia :: PR: #11413
  • data modules for llava_next by @yashaswikarnati :: PR: #11400
  • Fix strategies saving unsharded optimizer states by @ananthsub :: PR: #11392
  • Adjust CLI support for PTQ by @janekl :: PR: #11421
  • Nemo run recipe's and example scripts for Llava Next by @yashaswikarnati :: PR: #11405
  • Huvu/t5 nemo2.0 nemoci 3b11b by @huvunvidia :: PR: #11388
  • ci: Allow dry-run of release by @ko3n1g :: PR: #11418
  • fix dtype when init HF model from config by @akoumpa :: PR: #11420
  • Handle import errors in virtual environment when running vLLM tests by @janekl :: PR: #11435
  • Fix loss mask when answer_only_loss=True by @ashors1 :: PR: #11444
  • [audio] Keep input directory structure when saving processed files by @anteju :: PR: #11403
  • Add different recipe examples to NeMo 2.0 by @BoxiangW :: PR: #11317
  • [Scripts] Remove fixed seed for adding noise by @anteju :: PR: #11401
  • Add option to provide prior NeMo 2 ckpt path to convert_nemo1_to_nemo… by @hemildesai :: PR: #11452
  • PTQ CLI and param updates by @janekl :: PR: #11459
  • Add tests for resiliency feature integration by @maanug-nv :: PR: #11406
  • ci: Disable HexHighEntropyString plugin by @ko3n1g :: PR: #11470
  • Fix broken links by @shashank3959 :: PR: #11294
  • Nemo 2.0 canonical lora by @cuichenx :: PR: #11416
  • ci: Run secrets detector on merge-commit by @ko3n1g :: PR: #11479
  • Formatting (minor) by @pablo-garay :: PR: #11485
  • Fix bug related to naming by @pablo-garay :: PR: #11487
  • Add BERT Model To NeMo2.0 by @suiyoubi :: PR: #11333
  • Update Nemo Distributed Checkpoint User Guide by @FortunaZhang :: PR: #11489
  • fix: regular torch optims (e.g., sgd) no longer error with closure spec by @terrykong :: PR: #11189
  • Add recipe configs validating by @BoxiangW :: PR: #10954
  • Fix finetuning PP by @cuichenx :: PR: #11474
  • [docs] Documentation for audio collection by @anteju :: PR: #11426
  • config hierarchy by @malay-nagda :: PR: #11145
  • Force param sync when using distributed optimizer and overlap_param_gather by @hemildesai :: PR: #11486
  • chore(beep boop 🤖): Bump MCORE_TAG=bd677bf... (2024-12-06) by @ko3n1g :: PR: #11492
  • Remove default mutable arguments from AbstractEmbModel constructor by @ananthsub :: PR: #11348
  • minor fix for nemo2 sftpeft readme by @HuiyingLi :: PR: #11502
  • Update Llama3 Fine-Tuning Notebook by @roclark :: PR: #11522
  • Fix CI issue on validation config by @BoxiangW :: PR: #11521
  • Freeze tags in r2.1.0 by @github-actions[bot] :: PR: #11556
  • Cherrypick all + R2.1.0 fix cicd by @pablo-garay :: PR: #11622
  • Cherry pick Add fix docstring for speech commands (11638) into r2.1.0 by @ko3n1g :: PR: #11639
  • Cherrypick #11628 to r2.1.0 by @nasretdinovr :: PR: #11630
  • Update package_info.py by @ko3n1g :: PR: #11646
  • Cherry pick Add fix docstring for VAD (11659) into r2.1.0 by @ko3n1g :: PR: #11660
  • Fix tokenizer trust_remote_code by @cuichenx :: PR: #11657
  • Cherrypick 11568 by @cuichenx :: PR: #11656
  • Cherry pick Downgrading the 'datasets' package from 3.0.0 to 2.21.0 for Multilang_ASR.ipynb and ASR_CTC_Language_Finetuning.ipynb (11675) into r2.1.0 by @ko3n1g :: PR: #11677
  • r2.1.0 cherrypick by @pablo-garay :: PR: #11680
  • Cherry pick Rename multimodal data module - EnergonMultiModalDataModule (11654) into r2.1.0 by @ko3n1g :: PR: #11685
  • chore: Bump to r2.1.0rc2 by @ko3n1g :: PR: #11693
  • r2.1.0 ptl fix by @pablo-garay :: PR: #11694

NVIDIA Neural Modules 2.1.0rc2

Prerelease: NVIDIA Neural Modules 2.1.0rc2 (2024-12-21)

NVIDIA Neural Modules 2.1.0rc1

Prerelease: NVIDIA Neural Modules 2.1.0rc1 (2024-12-20)

NVIDIA Neural Modules 2.1.0rc0

Prerelease: NVIDIA Neural Modules 2.1.0rc0 (2024-12-12)

NVIDIA Neural Modules 2.0.0rc1

Highlights

Large language models

  • PEFT: QLoRA support, LoRA/QLoRA for Mixture-of-Experts (MoE) dense layer
  • State Space Models & Hybrid Architecture support (Mamba2 and NV-Mamba2-hybrid)
  • Support Nemotron, Minitron, Gemma2, Qwen, RAG
  • Custom Tokenizer training in NeMo
  • Update the Auto-Configurator for EP, CP and FSDP

Multimodal

  • NeVA: Add SOTA LLM backbone support (Mixtral/LLaMA3) and suite of model parallelism support (PP/EP)
  • Support Language Instructed Temporal-Localization Assistant (LITA) on top of video NeVA

ASR

  • SpeechLM and SALM
  • Adapters for Canary Customization
  • PyTorch CUDA allocator optimization in PyTorch 2.2 improves training speed by up to 30% for all ASR models
  • CUDA Graphs for Transducer Inference
  • Replaced webdataset with Lhotse - gives up to 2x speedup
  • Transcription Improvements - Speedup and QoL changes; see the sketch after this list
  • ASR Prompt Formatter for multimodal Canary
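
A hedged sketch of the one-line transcription workflow highlighted above, assuming the public nemo.collections.asr API; the checkpoint name and audio path are illustrative placeholders, not values from this changelog:

# Minimal sketch; the model name and audio file below are placeholders.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")
transcriptions = asr_model.transcribe(["sample.wav"])  # also accepts tensors and dataloaders
print(transcriptions[0])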

Export & Deploy

  • In-framework PyTriton deployment with backends PyTorch, vLLM, and TRT-LLM (updated to 0.10); see the sketch after this list
  • TRT-LLM C++ runtime
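
A hedged sketch of the in-framework TRT-LLM export path referenced above, assuming the nemo.export module's TensorRTLLM class; the directories and model_type are illustrative assumptions:

# Minimal sketch; paths and model_type are placeholders, not values from
# this changelog.
from nemo.export import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/trtllm_engine")  # engine output directory
exporter.export(nemo_checkpoint_path="/models/llama.nemo", model_type="llama")
print(exporter.forward(["Hello, NeMo!"]))  # run a quick sanity-check generation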

Detailed Changelogs

ASR

Changelog
  • Support dataloader as input to audio for transcription by @titu1994 :: PR: #9201
  • Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
  • Fix Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9251
  • Remove .nemo instead of renaming by @mikolajblaz :: PR: #9281
  • Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. by @galv :: PR: #9347
  • Revert "Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer." by @titu1994 :: PR: #9351
  • Prompt formatter API and canary transcribe tensor input support by @pzelasko :: PR: #9206
  • Fix prompt formatter's defaults=None case in multi-task model by @pzelasko :: PR: #9366
  • move AED chunked infer script by @stevehuang52 :: PR: #9367
  • Use model-cast-to-bfloat16 rather than AMP-to-bfloat16 for inference. by @galv :: PR: #9198
  • ci: Fix `L2_Segmentation_Tool_Parallel_ctc_segmentation_test_L2_Eng_C… by @ko3n1g :: PR: #9399
  • Fix logging message for ASR by @titu1994 :: PR: #9469
  • Add support to change Multi task model prompt by @titu1994 :: PR: #9542
  • Enable encoder adapters for Canary and MultiTaskAED models by @titu1994 :: PR: #9409
  • Audio model collection by @anteju :: PR: #9263
  • TitaNet Batch Verify Speaker by @monica-sekoyan :: PR: #9337
  • Fix the arguments of forward_for_export function in msdd_models by @tango4j :: PR: #9624
  • chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
  • refactor: notebook branch release by @ko3n1g :: PR: #9711
  • Canary Adapters tutorial (#9670) by @nithinraok :: PR: #9777
  • typos and branch name update to r2.0.0rc1 by @nithinraok :: PR: #9846
  • Fix RNNT alignments test by @artbataev :: PR: #9770
  • By default trust remote code from HF Datasets by @nithinraok :: PR: #9886
  • Temporarily disable cuda graph based RNN-T greedy inference for r2.0.0rc1 by @galv :: PR: #9904
  • Enable CUDA graphs by default, but require CUDA 12.6 for full graphs by @artbataev :: PR: #9919
  • update branch name for script by @nithinraok :: PR: #9936
  • update branch by @nithinraok :: PR: #9942

TTS

Changelog
  • Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
  • Add mel codec checkpoints by @anteju :: PR: #9228
  • GPU unit tests: Mark flaky tests to be fixed by @pablo-garay :: PR: #9559
  • chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
  • refactor: notebook branch release by @ko3n1g :: PR: #9711

LLM/Multimodal

Changelog
  • Update nemo.export module for quantized models by @janekl :: PR: #9218
  • Add save option to the TRT-LLM export test script by @oyilmaz-nvidia :: PR: #9221
  • Checkpoint resuming compatible for 2403 container by @suiyoubi :: PR: #9199
  • Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
  • use get with fallback when reading checkpoint_callback_params by @akoumpa :: PR: #9223
  • Revert rope fusion defaults by @cuichenx :: PR: #9237
  • fix import by @akoumpa :: PR: #9240
  • Add TRT-LLM params like max_num_tokens and opt_num_tokens by @oyilmaz-nvidia :: PR: #9210
  • sum-reduce grad_norm in DP+CP domain by @erhoo82 :: PR: #9262
  • Alit/bert convert fix by @JRD971000 :: PR: #9285
  • conv1d stable version by @JRD971000 :: PR: #9330
  • Fix trainer builder when exp_manager is not in config by @yaoyu-33 :: PR: #9293
  • Fix Peft Weights Loading in NeVA by @yaoyu-33 :: PR: #9341
  • Skip sequence_parallel allreduce when using Mcore DistOpt by @akoumpa :: PR: #9344
  • Fix FSDP gradient calculation with orig params by @janEbert :: PR: #9335
  • TRT-LLM Export Code Cleanup by @oyilmaz-nvidia :: PR: #9270
  • support null/None truncation field by @arendu :: PR: #9355
  • NeVa token fusion by @paul-gibbons :: PR: #9245
  • bugfix if using mcore distOpt with sft by @akoumpa :: PR: #9356
  • Re-org export code by @oyilmaz-nvidia :: PR: #9353
  • QLoRA by @cuichenx :: PR: #9340
  • PeFT fix for distOpt by @akoumpa :: PR: #9392
  • [NeMo-UX] Integrating mcore's DistributedDataParallel into MegatronStrategy by @marcromeyn :: PR: #9387
  • cherry pick of #9266 by @dimapihtar :: PR: #9411
  • Enable specifying alpha for PTQ INT8 SmoothQuant method by @janekl :: PR: #9423
  • add support for new mcore ds features by @dimapihtar :: PR: #9388
  • LoRA for MoE Layer by @cuichenx :: PR: #9396
  • Mistral-7B: apply user's precision to output checkpoint by @akoumpa :: PR: #9222
  • Add option to merge distributed optimizer buckets by @timmoon10 :: PR: #9414
  • TRT-LLM 0.10 Update by @oyilmaz-nvidia :: PR: #9402
  • In-framework deployment by @oyilmaz-nvidia :: PR: #9438
  • Bugfix missing variables and argument changes to MegatronPretrainingRandomSampler by @jstjohn :: PR: #9458
  • Hyena Operator by @guyjacob :: PR: #9264
  • Refactor Quantizer for reusing in QAT by @kevalmorabia97 :: PR: #9276
  • move load state dict after initialize parallel state in nlp_model by @ryxli :: PR: #9382
  • Enable user to optionally upgrade Megatron by @jstjohn :: PR: #9478
  • Fix unwrap model by @cuichenx :: PR: #9480
  • fix operator precedence by @akoumpa :: PR: #9403
  • [NeMo-UX] Adding context- & expert-parallelism to MegatronStrategy by @marcromeyn :: PR: #9525
  • update mcoreddp call by @akoumpa :: PR: #9345
  • mcore distOpt restore fix by @akoumpa :: PR: #9421
  • vLLM Export Support by @apanteleev :: PR: #9381
  • PL: Delete precision if using plugin. TODO switch to MegatronTrainerB… by @akoumpa :: PR: #9535
  • extend get_gpt_layer_modelopt_spec to support MoE by @akoumpa :: PR: #9532
  • fix mock data generation for legacy dataset by @dimapihtar :: PR: #9530
  • add reset learning rate functionality by @dimapihtar :: PR: #9372
  • Use closed-formula to round by multiple by @akoumpa :: PR: #9307
  • GPU unit tests: Mark flaky tests to be fixed by @pablo-garay :: PR: #9559
  • Consolidate gpt continue training script into pretraining script by @yaoyu-33 :: PR: #9413
  • Enable encoder adapters for Canary and MultiTaskAED models by @titu1994 :: PR: #9409
  • PTQ refinements by @janekl :: PR: #9574
  • Add ModelOpt QAT example for Llama2 SFT model by @kevalmorabia97 :: PR: #9326
  • Multimodal projection layer adapter fix for PP>1 by @paul-gibbons :: PR: #9445
  • Add offline quantization script for QLoRA deployment by @cuichenx :: PR: #9455
  • Make QLoRA more model-agnostic by @cuichenx :: PR: #9488
  • Set n_gpu to None in nemo export by @oyilmaz-nvidia :: PR: #9593
  • [NeMo-UX] Fix Megatron-optimizer by @marcromeyn :: PR: #9599
  • Chat template support for megatron_gpt_eval.py by @akoumpa :: PR: #9354
  • [NeMo-UX] Add PEFT by @cuichenx :: PR: #9490
  • Alit/mamba tmp by @JRD971000 :: PR: #9612
  • Enable MCore checkpointing optimizations by @mikolajblaz :: PR: #9505
  • Change mixtral moe key name for trt-llm by @oyilmaz-nvidia :: PR: #9620
  • fix ckpt load bug by @dimapihtar :: PR: #9621
  • Alit/mamba by @JRD971000 :: PR: #9575
  • Unwrap ckpt_io for model opt (async save) by @mikolajblaz :: PR: #9622
  • MCore T5 support for NeMo - Training by @huvunvidia :: PR: #9432
  • [Nemo-UX] Expose transformer_layer_spec inside GPTConfig by @marcromeyn :: PR: #9592
  • Update NeMo Clip to Use MCore Modules by @yaoyu-33 :: PR: #9594
  • Mistral + Mixtral Support for NeVa by @paul-gibbons :: PR: #9459
  • Adding support for mcore generate by @shanmugamr1992 :: PR: #9566
  • Improve error messaging during trt-llm export by @oyilmaz-nvidia :: PR: #9638
  • [Cherrypick] support lora when kv_channel != hidden_size / num_heads by @cuichenx :: PR: #9644
  • Parametrize FPS group by @mikolajblaz :: PR: #9648
  • Cherry-pick megatron export fix from main by @borisfom :: PR: #9643
  • add documentation for reset_lr feature by @dimapihtar
  • chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
  • Cherry pick: LITA Integration by @Slyne :: PR: #9684
  • SDXL improvements (and support for Draft+) by @rohitrango :: PR: #9654
  • Gemma 2 by @cuichenx :: PR: #9672
  • Allows non-strict load with distributed checkpoints by @mikolajblaz :: PR: #9613
  • refactor: notebook branch release by @ko3n1g :: PR: #9711
  • [NeMo-UX] Make TE and Apex dependencies optional by @ashors1 :: PR: #9550
  • Alit/r2.0.0 by @JRD971000 :: PR: #9718
  • Manually cherry-pick from PR 9679 (PR to main - Support SFT/Eval/PEFT for mcore T5) by @huvunvidia :: PR: #9737
  • In framework export by @oyilmaz-nvidia :: PR: #9658
  • T5 changes based on mcore changes by @pablo-garay :: PR: #9829
  • [NeMo-UX] Use single instance of loss reductions in GPTModel by @hemildesai :: PR: #9801
  • deprecate NeMo NLP tutorial by @dimapihtar :: PR: #9864
  • Disable nvFuser setup with PyTorch 23.11 and later by @athitten :: PR: #9837
  • make torch_dist ckpt strategy as default by @dimapihtar :: PR: #9852
  • add rampup bs documentation by @dimapihtar :: PR: #9884
  • copy of #9576 by @dimapihtar :: PR: #9986
  • Support Nvidia Torch and Arch versions by @thomasdhc :: PR: #9897
  • Bug fix for pooler causing dist checkpointing exception by @shanmugamr1992 :: PR: #10008

Export

Changelog
  • Update nemo.export module for quantized models by @janekl :: PR: #9218
  • Add save option to the TRT-LLM export test script by @oyilmaz-nvidia :: PR: #9221
  • Add TRT-LLM params like max_num_tokens and opt_num_tokens by @oyilmaz-nvidia :: PR: #9210
  • TRT-LLM Export Code Cleanup by @oyilmaz-nvidia :: PR: #9270
  • Re-org export code by @oyilmaz-nvidia :: PR: #9353
  • Use TensorRT-LLM native parameter names in nemo.export module by @janekl :: PR: #9424
  • TRT-LLM 0.10 Update by @oyilmaz-nvidia :: PR: #9402
  • vLLM Export Support by @apanteleev :: PR: #9381
  • Add page context fmha option in TensorRTLLM export by @meatybobby :: PR: #9526
  • Test C++ runtime on demand in nemo_export.py to avoid possible OOMs by @janekl :: PR: #9544
  • Fix nemo export test by @oyilmaz-nvidia :: PR: #9547
  • Add tps and pps params to the export script by @oyilmaz-nvidia :: PR: #9558
  • Add Multimodal Exporter by @meatybobby :: PR: #9256
  • Set n_gpu to None in nemo export by @oyilmaz-nvidia :: PR: #9593
  • Inflight nemo model export support by @JimmyZhang12 :: PR: #9527
  • vLLM Export Improvements by @apanteleev :: PR: #9596
  • Akoumparouli/nemo ux mixtral export by @akoumpa :: PR: #9603
  • Change mixtral moe key name for trt-llm by @oyilmaz-nvidia :: PR: #9620
  • Fix the arguments of forward_for_export function in msdd_models by @tango4j :: PR: #9624
  • Improve error messaging during trt-llm export by @oyilmaz-nvidia :: PR: #9638
  • Cherry-pick megatron export fix from main by @borisfom :: PR: #9643
  • In framework export by @oyilmaz-nvidia :: PR: #9658
  • Add missing imports for torch dist ckpt in export by @oyilmaz-nvidia :: PR: #9826

Bugfixes

Changelog
  • use get with fallback when reading checkpoint_callback_params by @akoumpa :: PR: #9223
  • fix import by @akoumpa :: PR: #9240
  • Remove .nemo instead of renaming by @mikolajblaz :: PR: #9281
  • call set_expert_model_parallel_world_size instead of set_cpu_expert_m… by @akoumpa :: PR: #9275
  • Fix typos in Mixtral NeMo->HF and Starcoder2 NeMo->HF conversion scripts by @evellasques :: PR: #9325
  • Skip sequence_parallel allreduce when using Mcore DistOpt by @akoumpa :: PR: #9344
  • Add OpenAI format response to r2.0.0rc1 by @athitten :: PR: #9796
  • [NeMo UX] Support generating datasets using different train/valid/test distributions by @ashors1 :: PR: #9771
  • Add missing imports for torch dist ckpt in export by @oyilmaz-nvidia :: PR: #9826

General Improvements

Changelog
  • [Nemo CICD] run_cicd_for_release_branches_also by @pablo-garay :: PR: #9213
  • rename paths2audiofiles to audio by @github-actions[bot] :: PR: #9220
  • Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @github-actions[bot] :: PR: #9234
  • ci: Remove duplicated job by @ko3n1g :: PR: #9258
  • Fix document links by @yaoyu-33 :: PR: #9260
  • Pin transformers by @github-actions[bot] :: PR: #9273
  • Fix loading github raw images on notebook by @github-actions[bot] :: PR: #9283
  • Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @github-actions[bot] :: PR: #9278
  • Refactor Sequence Packing Script by @cuichenx :: PR: #9271
  • [Nemo-UX] Move code to collections + fix some small bugs by @marcromeyn :: PR: #9277
  • Fix typo in HF tutorial by @github-actions[bot] :: PR: #9304
  • Expand documentation for data parallelism and distributed optimizer by @timmoon10 :: PR: #9227
  • Install alerting by @ko3n1g :: PR: #9311
  • typos by @github-actions[bot] :: PR: #9315
  • FP8 feature documentation by @ksivaman :: PR: #9265
  • [Nemo CICD] Comment out flaky tests by @pablo-garay :: PR: #9333
  • Fixed typos in README.rst by @gdevakumar :: PR: #9322
  • Update README.rst to clarify installation via Conda by @SimonCW :: PR: #9323
  • [Nemo CICD] update flaky test by @pablo-garay :: PR: #9339
  • fix lora and ptuning and isort/black by @github-actions[bot] :: PR: #9295
  • Fix P-tuning for Llama based models by @github-actions[bot] :: PR: #9300
  • add large model stable training fix and contrastive loss update for variable seq by @github-actions[bot] :: PR: #9348
  • Guard cuda memory allocator update by @github-actions[bot] :: PR: #9313
  • [Nemo CICD] Remove unnecessary commented out code by @pablo-garay :: PR: #9364
  • Update Gemma conversion script by @yaoyu-33 :: PR: #9365
  • Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @github-actions[bot] :: PR: #9371
  • Re-enable cuda graphs in training modes. by @github-actions[bot] :: PR: #9343
  • fix typo infer_seq_lenght -> infer_seq_length by @akoumpa :: PR: #9370
  • Make a backward compatibility for old MSDD configs in label models by @github-actions[bot] :: PR: #9378
  • Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @github-actions[bot] :: PR: #9253
  • Update README.rst by @jgerh :: PR: #9393
  • Force diarizer to use CUDA if cuda is available and if device=None. by @github-actions[bot] :: PR: #9390
  • ci: Properly catch failed tests by introduction of workflow templates by @ko3n1g :: PR: #9324
  • Fix T5 G2P Input and Output Types by @github-actions[bot] :: PR: #9269
  • Huvu/rag pipeline citest by @huvunvidia :: PR: #9384
  • Fix circular import for MM dataprep notebook by @github-actions[bot] :: PR: #9292
  • add check if num layers is divisible by pp size by @github-actions[bot] :: PR: #9298
  • [Nemo CICD] timeouts fix by @pablo-garay :: PR: #9407
  • [NeMo-UX] Removing un-used ModelConfig class by @marcromeyn :: PR: #9389
  • Add tutorial for Llama-3-8B lora training and deployment by @shashank3959 :: PR: #9359
  • [NeMo-UX] Removing default_path from ModelConnector by @marcromeyn :: PR: #9401
  • Fix README by @ericharper :: PR: #9415
  • [SD] Fix SD CUDA Graph Failure by @alpha0422 :: PR: #9319
  • [NeMo-UX] Adding file-lock to Connector by @marcromeyn :: PR: #9400
  • Add Dev Container Bug Report by @pablo-garay :: PR: #9430
  • Akoumparouli/profiling docs by @akoumpa :: PR: #9420
  • ci: Enrich notifications by @ko3n1g :: PR: #9412
  • Fix failing RIR unit test with lhotse 1.24+ by @pzelasko :: PR: #9444
  • [NeMo-UX] Adding support for mcore distributed optimizer by @marcromeyn :: PR: #9435
  • Use ModelOpt build_tensorrt_llm for building engines for qnemo checkpoints by @janekl :: PR: #9452
  • ci(notifications): Fix extraction of last 2K chars by @ko3n1g :: PR: #9450
  • Update readme with mlperf news by @ericharper :: PR: #9457
  • [NeMo-UX] Add nsys callback by @ashors1 :: PR: #9461
  • [NeMo UX] Introducing optimizer module by @marcromeyn :: PR: #9454
  • Fix minor import bug in deploy module by @oyilmaz-nvidia :: PR: #9463
  • ci(notifications): Fetch all jobs by @ko3n1g :: PR: #9465
  • Update build_dataset.py by @stevehuang52 :: PR: #9467
  • bionemo: bn2/add pipelineparallel dtype by @skothenhill-nv :: PR: #9475
  • [NeMo-UX] Integrate experiment manager features with NeMo-UX APIs by @ashors1 :: PR: #9460
  • Add python_requires by @galv :: PR: #9431
  • [NeMo-UX] Fixing imports of NeMoLogging, AutoResume & ModelCheckpoint by @marcromeyn :: PR: #9476
  • Modelopt Refactor for SDXL Quantization by @suiyoubi :: PR: #9279
  • [NeMo-UX] Fixing defaults in llm.train & Mistral7BModel by @marcromeyn :: PR: #9486
  • In framework deploy using deploy script by @oyilmaz-nvidia :: PR: #9468
  • [NeMo-UX] Integrate tokenizer import into model.import_ckpt by @marcromeyn :: PR: #9485
  • append to file by @malay-nagda :: PR: #9483
  • [NeMo-UX] Fix bug in import_ckpt by @marcromeyn :: PR: #9492
  • Add nemotron news by @ericharper :: PR: #9510
  • Add CICD test for Stable Diffusion by @michal2409 :: PR: #9464
  • Akoumparouli/nemo ux mixtral by @akoumpa :: PR: #9446
  • [NeMo-UX] Llama and Gemma by @cuichenx :: PR: #9528
  • [NeMo-UX] minor logging bug fixes by @ashors1 :: PR: #9529
  • Update neva conversion script from and to HF by @yaoyu-33 :: PR: #9296
  • [Nemo-UX] IO fixes by @marcromeyn :: PR: #9512
  • Fix lhotse tests for v1.24.2 by @pzelasko :: PR: #9546
  • [Nemo CICD] Make GPU Unit Tests non-optional by @pablo-garay :: PR: #9551
  • Add Python AIStore SDK to container and bump min Lhotse version by @pzelasko :: PR: #9537
  • [NeMo-UX] Fix tokenizer IO by @marcromeyn :: PR: #9555
  • [NeMo UX] Move mistral_7b.py to mistral.py by @akoumpa :: PR: #9545
  • ci: Do not attempt to send slack on fork by @ko3n1g :: PR: #9556
  • Fix SDXL incorrect name in Docs by @suiyoubi :: PR: #9534
  • Bump PTL version by @athitten :: PR: #9557
  • [Resiliency] Straggler detection by @jbieniusiewi :: PR: #9473
  • [NeMo-UX] Switch to torch_dist as default distributed checkpointing backend by @ashors1 :: PR: #9541
  • [NeMo-UX] Checkpointing bug fixes by @ashors1 :: PR: #9562
  • Expose MCore path_to_cache option by @maanug-nv :: PR: #9570
  • [NeMo-UX] Fix Trainer serialization by @marcromeyn :: PR: #9571
  • Update click version requirement by @thomasdhc :: PR: #9580
  • [Fault tolerance] Heartbeat detection by @maanug-nv :: PR: #9352
  • [Nemo-UX] Add fabric-API for manual forward-pass by @marcromeyn :: PR: #9577
  • [Nemo-UX] Add SDK-factories to llm-collection by @marcromeyn :: PR: #9589
  • [NeMo-UX] Some improvements to NeMoLogger by @marcromeyn :: PR: #9591
  • Set no_sync_func & grad_sync_func by @akoumpa :: PR: #9601
  • [NeMo-UX] Fix nemo logger when trainer has no loggers by @ashors1 :: PR: #9607
  • Fix the dictionary format returned by the scheduler method by @sararb :: PR: #9609
  • [NeMo-UX] Dataloading enhancements and bug fixes by @ashors1 :: PR: #9595
  • Fix serialization of AutoResume by @sararb :: PR: #9616
  • Jsonl support by @adityavavre :: PR: #9611
  • Akoumparouli/mistral import instruct chat template fix by @akoumpa :: PR: #9567
  • Remove .cuda calls, use device instead by @akoumpa :: PR: #9602
  • fix converter default args by @akoumpa :: PR: #9565
  • fix: remove non_blocking from PTL's .cuda call by @akoumpa :: PR: #9618
  • NeVA Minor Fixes by @yaoyu-33 :: PR: #9608
  • [NeMo-UX] fix pretraining data sizes and weights by @cuichenx :: PR: #9627
  • [NeMo-UX] async checkpointing support by @ashors1 :: PR: #9466
  • Change default parallel_save to False by @mikolajblaz :: PR: #9632
  • Add REST API to deploy module by @athitten :: PR: #9539
  • ci: Timeout per step, not job by @ko3n1g :: PR: #9635
  • [NeMo-UX] Fix when optimizers are setup for PEFT by @marcromeyn :: PR: #9619
  • [NeMo-UX] Fix pipeline parallel bug by @ashors1 :: PR: #9637
  • Fixing import error for llama-index (RAG pipeline) by @pablo-garay :: PR: #9662
  • llama CI fix by @rohitrango :: PR: #9663
  • [NeMo-UX] Make 'load_directly_on_device' configurable by @ashors1 :: PR: #9657
  • [Nemo-UX] Including all trainable-params in a PEFT-checkpoint by @marcromeyn :: PR: #9650
  • [NeMo-UX] Fix imports so local configuration of runs works again by @marcromeyn :: PR: #9690
  • Set TE flag in legacy -> mcore conversion script by @terrykong :: PR: #9722
  • Update starthere docs text by @erastorgueva-nv :: PR: #9724
  • TorchAudio installation workaround for incorrect PYTORCH_VERSION variable by @artbataev :: PR: #9736
  • [NeMo-UX] Match nemo 1's default behavior for drop_last and pad_samples_to_global_batch_size by @ashors1 :: PR: #9707
  • add a bit more for timeout (#9702) by @pablo-garay :: PR: #9754
  • Fix missing parallelisms by @maanug-nv :: PR: #9725
  • update branch by @nithinraok :: PR: #9764
  • Fix data preprocessing script by @cuichenx :: PR: #9759
  • vLLM 0.5.1 update by @apanteleev :: PR: #9779
  • upper bound hf-hub by @akoumpa :: PR: #9805
  • Fix few issues and docs for neva and clip in r2.0.0rc1 by @yaoyu-33 :: PR: #9681
  • add dummy vision and text transformer config (assumed mcore to be false) by @rohitrango :: PR: #9699
  • fix lita bugs by @Slyne :: PR: #9810
  • [NeMo-UX] Log val_loss by @ashors1 :: PR: #9814
  • [NeMo-UX] Fix some dataloading bugs by @ashors1 :: PR: #9807
  • [NeMo-UX] Adding recipes by @marcromeyn :: PR: #9720
  • [NeMo-UX] Set async_save from strategy rather than ModelCheckpoint by @ashors1 :: PR: #9800
  • Fix hf hub for 0.24+ by @titu1994 :: PR: #9806
  • [NeMo-UX] Fix a minor bug with async checkpointing by @ashors1 :: PR: #9856
  • [NeMo-UX] make progress bar easier to parse by @ashors1 :: PR: #9877
  • Docs: add "Nemo Fundamentals" page by @erastorgueva-nv :: PR: #9835
  • Create __init__.py by @stevehuang52 :: PR: #9892
  • [NeMo-UX] Fixes to make PreemptionCallback work by @hemildesai :: PR: #9830
  • Fix Docker build. Make Dockerfile consistent with CI by @artbataev :: PR: #9784
  • Multimodal data prep notebook fix by @cuichenx :: PR: #9910
  • [NeMo-UX] Add distributed checkpointing unit tests by @ashors1 :: PR: #9794
  • r2.0.0rc1 fix for dist checkpoint loading by @yaoyu-33 :: PR: #9854
  • [NeMo-UX] Rename sdk references to NeMo Run by @hemildesai :: PR: #9872
  • [NeMo-UX] Fix some serialization bugs by @ashors1 :: PR: #9868
  • add mixtral neva tutorial (moe + token fusion + siglip) by @paul-gibbons :: PR: #9926
  • [NeMo-UX] Add more NeMo Logger tests by @ashors1 :: PR: #9795
  • Akoumparouli/mixtral fixes for r2.0.0rc1 by @akoumpa :: PR: #9911
  • R2.0.0rc1 clip fix by @Slyne :: PR: #9871
  • [NeMo-UX] Add missing docstrings and update some defaults by @ashors1 :: PR: #9895
  • Add REST service requirements.txt by @oyilmaz-nvidia :: PR: #9923
  • add bert latest fix by @JRD971000 :: PR: #9921
  • remove empty reconfigure_limit_batches by @akoumpa :: PR: #9934
  • fix mem by @terrykong :: PR: #9964
  • Run a sample query for a quantized model conditionally by @janekl :: PR: #9965
  • Add pydantic-settings by @oyilmaz-nvidia :: PR: #9961
  • Resiliency features update by @jbieniusiewi :: PR: #9714
  • [NeMo-UX] Wrap task config save in a try/except by @ashors1 :: PR: #9956
  • [NeMo-UX] Update default PTL logging save_dir by @ashors1 :: PR: #9954
  • Fix lita tutorial by @Slyne :: PR: #9980
  • Add deploy and REST API support to NeMo 2.0 by @athitten :: PR: #9834
  • ci: Allow changelog manual (#10156) by @ko3n1g :: PR: #10157
  • docs: Add changelog by @ko3n1g :: PR: #10155
  • add manifest file by @ko3n1g :: PR: #10161

NVIDIA Neural Modules 2.0.0rc0

Highlights

LLM and MM

Models
  • Megatron Core RETRO

    • Pre-training
    • Zero-shot Evaluation
  • Pretraining, conversion, evaluation, SFT, and PEFT for:

    • Mixtral 8X22B
    • Llama 3
    • SpaceGemma
  • Embedding Models Fine Tuning

    • Mistral
    • BERT
  • BERT models

    • Context Parallel
    • Distributed checkpoint
  • Video capabilities with NeVa

Performance
  • Distributed Checkpointing

    • Torch native backend
    • Parallel read/write
    • Async write
  • Multimodal LLM (LLAVA/NeVA)

    • Pipeline Parallelism support
    • Sequence packing support
Export
  • Integration of Export & Deploy Modules into NeMo Framework container
    • Upgrade to TRT-LLM 0.9

Speech (ASR & TTS)

Models
  • AED Multi Task Models (Canary) - Multi-Task Multi-Lingual Speech Recognition / Speech Translation model
  • Multimodal Domain - Speech LLM supporting SALM Model
  • Parakeet-tdt_ctc-1.1b Model - RTFx of > 1500 (can transcribe 1500 seconds of audio in 1 second)
  • Audio Codec 16kHz Small - NeMo Neural Audio Codec for discretizing speech for use in LLMs
    • mel_codec_22khz_medium
    • mel_codec_44khz_medium
Perf Improvements
  • Transcribe() upgrade - Enables one-line transcription with files, tensors, and data loaders
  • Frame looping algorithm for faster RNNT decoding - Improves Real Time Factor (RTF) by 2-3x
  • CUDA Graphs + Label-Looping algorithm for RNN-T and TDT decoding - Transducer greedy decoding at over 1500x RTFx, on par with non-autoregressive CTC models; see the sketch after this list
  • Semi-sorted batching support - External user contribution that speeds up training by 15-30%
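
A hedged sketch of opting into the CUDA-graph greedy decoder mentioned in the list above; the use_cuda_graph_decoder flag is an assumption based on the RNNT greedy decoding config and may differ across NeMo versions:

# Minimal sketch: enable batched greedy decoding with CUDA graphs on a loaded
# RNN-T/TDT model. The flag name is an assumption; verify it against your
# NeMo version's decoding config.
from omegaconf import open_dict

decoding_cfg = asr_model.cfg.decoding  # asr_model: a loaded RNN-T/TDT model
with open_dict(decoding_cfg):
  decoding_cfg.strategy = "greedy_batch"
  decoding_cfg.greedy.use_cuda_graph_decoder = True
asr_model.change_decoding_strategy(decoding_cfg)
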
Customization
  • Context biasing for CTC word stamping - Improves accuracy for custom vocabulary and pronunciation
  • Longform inference
    • Longform inference support for AED models
  • Transcription of multi-channel audio for AED models
Misc
  • Upgraded webdataset - Speech and LLM / Multimodal unified container

Detailed Changelogs

ASR

Changelog
  • Enable using hybrid asr models in CTC Segmentation tool by @erastorgueva-nv :: PR: #8828
  • TDT confidence fix by @GNroy :: PR: #8982
  • Fix union type annotations for autodoc+mock-import rendering by @pzelasko :: PR: #8956
  • NeMo dev doc restructure by @yaoyu-33 :: PR: #8896
  • Improved random seed configuration for Lhotse dataloaders with docs by @pzelasko :: PR: #9001
  • Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization by @galv :: PR: #8964
  • [ASR] Support for transcription of multi-channel audio for AED models by @anteju :: PR: #9007
  • Add ASR latest news by @titu1994 :: PR: #9073
  • Fix docs errors and most warnings by @erastorgueva-nv :: PR: #9006
  • PyTorch CUDA allocator optimization for dynamic batch shape dataloading in ASR by @pzelasko :: PR: #9061
  • RNN-T and TDT inference: use CUDA graphs by default by @artbataev :: PR: #8972
  • Fix #8891 by supported GPU-side batched CTC Greedy Decoding by @galv :: PR: #9100
  • Update branch for notebooks and ci in release by @ericharper :: PR: #9189
  • Enable CUDA graphs by default only for transcription by @artbataev :: PR: #9196
  • rename paths2audiofiles to audio by @nithinraok :: PR: #9209
  • Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @andrusenkoau :: PR: #9233
  • Cherrypick: Support dataloader as input to audio for transcription (#9201) by @titu1994 :: PR: #9235
  • Update Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9252
  • Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @galv :: PR: #9243
  • Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @galv :: PR: #9246
  • Fix loading github raw images on notebook by @nithinraok :: PR: #9282
  • typos by @nithinraok :: PR: #9314
  • Re-enable cuda graphs in training modes. by @galv :: PR: #9338
  • add large model stable training fix and contrastive loss update for variable seq by @nithinraok :: PR: #9259
  • Fix conv1d package in r2.0.0rc0 by @pablo-garay :: PR: #9369
  • Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @titu1994 :: PR: #9350
  • Make a backward compatibility for old MSDD configs in label models by @tango4j :: PR: #9377
  • Force diarizer to use CUDA if cuda is available and if device=None. by @tango4j :: PR: #9380

TTS

Changelog
  • [TTS] Add tutorial for training audio codecs by @rlangman :: PR: #8723
  • Update radtts.py by @blisc :: PR: #9097
  • [Nemo CICD] RADTTS test optional by @pablo-garay :: PR: #9112
  • Remove Radtts CI test by @blisc :: PR: #9144
  • Fix T5 G2P Input and Output Types by @blisc :: PR: #9224

LLM and MM

Changelog
  • Rachitg/dpa by @rachitgarg91 :: PR: #8911
  • Remove precision args in trainer due to PTL update by @yaoyu-33 :: PR: #8908
  • Huvu/mcore retro by @huvunvidia :: PR: #8861
  • fsdp tp > 1 bug fix by @dimapihtar :: PR: #8947
  • Fix memory leak at loss func by @minitu :: PR: #8868
  • change the condition for get qkv tensor from linear_qkv output in mcoremixin by @HuiyingLi :: PR: #8965
  • Add safety checks for 'data' key in MegatronGPTModel cfg by @HuiyingLi :: PR: #8991
  • [NeMo-UX] Adding MegatronParallel by @cuichenx :: PR: #8987
  • Skip top_p computations when set to 1.0 by @odelalleau :: PR: #8905
  • Gemma bug by @cuichenx :: PR: #8962
  • [NeMo-UX] Adding megatron strategy by @marcromeyn :: PR: #8995
  • Quantized checkpoint support in export and deploy modules by @janekl :: PR: #8859
  • add geglu to mlp swap by @JRD971000 :: PR: #8999
  • add timeout for new_group by @acphile :: PR: #8998
  • Zero-shot evaluation pipeline for mcore RETRO by @huvunvidia :: PR: #8941
  • Added fusion for squared relu by @sanandaraj5597 :: PR: #8963
  • Developer Documents for mcore RETRO by @huvunvidia :: PR: #9026
  • [NeMo-UX] Adding GPTModel & MockDataModule by @marcromeyn :: PR: #9011
  • Adding unit test for mcore RETRO model by @huvunvidia :: PR: #9022
  • docs and simplification of cmd args by @arendu :: PR: #8979
  • [NeMo-UX] Add checkpoint-io to MegatronStrategy by @marcromeyn :: PR: #9057
  • Enable Sequence Packing and Pipeline Parallel in NeVA by @yaoyu-33 :: PR: #8957
  • Mingyuanm/add back fp8 support to sd by @Victor49152 :: PR: #9070
  • unfused lora by @arendu :: PR: #9004
  • Handle case where num_query_groups is set to null for LoRA config setup by @vysarge :: PR: #9075
  • Alit/griffin by @JRD971000 :: PR: #9021
  • Implement DistributedCheckpointIO by @mikolajblaz :: PR: #9016
  • Video Neva Pretraining + Inference Implementation by @paul-gibbons :: PR: #9095
  • HF to .nemo for Mixtral-8x22B-instruct by @akoumpa :: PR: #9060
  • mcore ds updates by @dimapihtar :: PR: #8951
  • Alit/griffin perf by @JRD971000 :: PR: #9107
  • Add assert for max_steps to be positive in MegatronGPTSFTModel by @athitten :: PR: #9110
  • Extend sequence length padding for GPT SFT to account for context parallel by @vysarge :: PR: #8869
  • Update gpt dataset config parameter for mock by @thomasdhc :: PR: #9118
  • Add Mcore DistributedDataParallel and distributed optimizer into Nemo by @gdengk :: PR: #9034
  • Revert "Add assert for max_steps to be positive in MegatronGPTSFTMode… by @pablo-garay :: PR: #9128
  • scripts to convert HF lora to nemo by @arendu :: PR: #9102
  • Prevent duplicated checkpoints by @mikolajblaz :: PR: #9015
  • add TN/ITN link in speech tools list by @erastorgueva-nv :: PR: #9142
  • Cleanup deprecated files and temporary changes by @cuichenx :: PR: #9088
  • Use DP+CP groups as the FSDP sharding domain by @erhoo82 :: PR: #9145
  • CUDA memory profile by @erhoo82 :: PR: #9096
  • Fix missing func for T5 model by @gdengk :: PR: #9141
  • Add knob for load_directly_on_device by @mikolajblaz :: PR: #9125
  • Revert rope fusion defaults by @cuichenx :: PR: #9238
  • Update nemo.export module for quantized models by @janekl :: PR: #9250
  • Fix circular import for MM dataprep notebook by @cuichenx :: PR: #9287
  • neva media_type + text generation default fix by @paul-gibbons :: PR: #9257
  • fix lora and ptuning and isort/black by @oyilmaz-nvidia :: PR: #9290
  • add check if num layers is divisible by pp size by @dimapihtar :: PR: #9208
  • Fix P-tuning for Llama based models by @apanteleev :: PR: #9297
  • add deprecation warnings by @pablo-garay :: PR: #9266
  • move pooler under post_process by @dimapihtar :: PR: #9328
  • add deprecation note for nmt by @dimapihtar :: PR: #9342
  • Fix incorrect checkpoint removal logic (#9192) by @mikolajblaz :: PR: #9204
  • fix fp16 precision issue by @dimapihtar :: PR: #9376
  • Fix module.training for Neva in FusedAttn backward which causes nan by @yaoyu-33 :: PR: #8877

Export

Changelog
  • Updates for TRT-LLM 0.9 by @oyilmaz-nvidia :: PR: #8873
  • Mingyuanm/sdxl export by @Victor49152 :: PR: #8926
  • Avoid unpacking NeMo checkpoints before exporting to TRT-LLM by @apanteleev :: PR: #8866
  • Update gemma for trt-llm 0.9 by @oyilmaz-nvidia :: PR: #8974
  • TRT-LLM export P-tuning related fixes by @apanteleev :: PR: #8863

General Improvements

Changelog
  • Update package info by @ericharper :: PR: #8793
  • [Nemo CICD] Update mcore 4.13.24 by @pablo-garay :: PR: #8917
  • Akoumparouli/low mem mixtral ckpt converter by @akoumpa :: PR: #8895
  • Adding RETRO tests to Action Tests (cicd-main.yml) by @huvunvidia :: PR: #8942
  • Akoumparouli/fix sd train 2 by @akoumpa :: PR: #8883
  • Update te install for jenkins by @ericharper :: PR: #8954
  • [Nemo CICD] Add last job depending on others for blocking check by @pablo-garay :: PR: #8959
  • Minor quantization pipeline updates by @janekl :: PR: #8924
  • Fix External CLIP Converter by @yaoyu-33 :: PR: #8960
  • PP support in LoRA merge script by @cuichenx :: PR: #8934
  • Update PR template by @ericharper :: PR: #8978
  • Update Latest News by @shashank3959 :: PR: #8837
  • Fix incorrect link to latest news in README by @shashank3959 :: PR: #8985
  • Update dependency install for LLM and MM by @ericharper :: PR: #8990
  • Temporarily remove mcore dep by @ericharper :: PR: #9010
  • [Nemo CICD] further specialize runners for more parallelism by @pablo-garay :: PR: #9036
  • Update mm dataprep notebook based on feedback by @cuichenx :: PR: #9029
  • Fix import in lora merge script by @cuichenx :: PR: #9032
  • [Nemo CICD] Run when labeled:Run CICD by @pablo-garay :: PR: #9044
  • [Nemo CICD] Add tag/label for 1-gpu runner by @pablo-garay :: PR: #9046
  • [Nemo CICD] checkout v4 by @pablo-garay :: PR: #9048
  • [Nemo CICD] Remove temp test change by @pablo-garay :: PR: #9049
  • remove in-place addition for dreambooth train with text encoder by @Victor49152 :: PR: #8825
  • Mingyuanm/sdxl quantization notebook by @Victor49152 :: PR: #9042
  • [Nemo CICD] Trigger on comment issued by @pablo-garay :: PR: #9062
  • zarr ckpt to torch_dist ckpt converter by @dimapihtar :: PR: #8842
  • Restore PTQ tests for Llama2 (reopened) by @janekl :: PR: #9064
  • add clip H config by @JRD971000 :: PR: #9082
  • [NeMo-UX] Add mixed-precision plugin by @marcromeyn :: PR: #9065
  • Comment baichuan test and update pr template by @ericharper :: PR: #9085
  • Add safe extraction of nemo tar files by @athitten :: PR: #8976
  • Improved shard_id parsing in LazyNemoTarredIterator, enables AIS dataloading by @pzelasko :: PR: #9077
  • [NeMo-UX] Add mistral-7b model by @marcromeyn :: PR: #9066
  • Llama3 Conversion Script Update by @suiyoubi :: PR: #9089
  • dehardcode test string by @JimmyZhang12 :: PR: #8865
  • [Nemo CICD] Try trigger cicd run on comment by @pablo-garay :: PR: #9111
  • Lhotse dataloading: RIR augmentation and nemo/tarred input support for RIR and noise aug by @pzelasko :: PR: #9109
  • mixtral evaluation PR by @Slyne :: PR: #8989
  • [Nemo CICD] Revert: run GHA cicd on comment by @pablo-garay :: PR: #9119
  • [Nemo CICD] Comment out flaky test: running too long by @pablo-garay :: PR: #9123
  • [Nemo CICD] Add timeout to unit tests by @pablo-garay :: PR: #9132
  • [Nemo CICD] Indicate optional test in name (prefix) by @pablo-garay :: PR: #9139
  • video neva null image+video folder path fix by @paul-gibbons :: PR: #9116
  • [NeMo-UX] Add data module by @cuichenx :: PR: #9133
  • NeMo Inference Requirements by @oyilmaz-nvidia :: PR: #9093
  • Remove debug print by @maanug-nv :: PR: #9074
  • Remove legacy CI by @pablo-garay :: PR: #9149
  • Update support for push_to_hf_hub() by @titu1994 :: PR: #9159
  • [Nemo CICD] comment out flaky PTQ tests by @pablo-garay :: PR: #9160
  • Update branch by @ericharper :: PR: #9211
  • dist adam transpose fix by @dimapihtar :: PR: #9239
  • [Nemo CICD] Increase time limit for Speech_Checkpoints_tests (#9186) by @pablo-garay :: PR: #9247
  • Pin transformers by @ericharper :: PR: #9261
  • Fix typo in HF tutorial by @titu1994 :: PR: #9302

NVIDIA Neural Modules 1.23.0

Highlights

Models

  • Nvidia Starcoder 2 - 15B
  • NeMo Canary

Announcement - https://nvidia.github.io/NeMo/blogs/2024/2024-02-canary/

NeMo LLM

  • Falcon
  • Code Llama
  • StarCoder
  • GPT perf improvements
  • Context parallelism
  • Mistral
  • Mixtral (without expert parallelism)
  • Mcore GPT Dataset integration

NeMo MM

  • CLIP
  • Stable Diffusion (supporting LoRA)
  • Imagen
  • ControlNet (for SD)
  • Instruct pix2pix (for SD)
  • LLAVA
  • NeVA
  • DreamFusion++
  • NSFW filtering

NeMo ASR

  • Lhotse Dataloading support #7880
  • Canary: Multi task multi lingual ASR #8242; see the sketch after this list
  • LongForm Audio for Diarization #7737
  • Faster algorithm for RNN-T Greedy #7926
  • Cache-Aware streaming notebook #8296
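
A hedged sketch of running the Canary multi-task model called out above, assuming the public EncDecMultiTaskModel API; the Hugging Face model id and audio path are illustrative placeholders:

# Minimal sketch; the model id and audio file are placeholders.
from nemo.collections.asr.models import EncDecMultiTaskModel

canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")
predictions = canary.transcribe(["sample.wav"], batch_size=4)
print(predictions[0])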

NeMo TTS

NeMo Vision

Known Issues

ASR
RNNT WER calculation when fused batch size > 1 during validation / test step()

Previously (r1.22.0 and earlier), the RNNT metric was stateful while the CTC one was not, so the WER calculation in the RNNT joint worked properly for the fused operation. However, with the unification of metrics in r1.23.0, a bug was introduced where only the last sub-batch of metrics calculates the scores and does not accumulate. This is patched via https://github.com/NVIDIA/NeMo/pull/8587 and will be fixed in the next release.

Workaround: Explicitly disable fused batch size during inference using the following command

from omegaconf import open_dict
model = ...  # any RNNT-based ASR model
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
  decoding_cfg.fused_batch_size = -1  # -1 disables the fused batch path
model.change_decoding_strategy(decoding_cfg)

Note: This bug does not affect scores calculated via model.transcribe() (since it does not calculate metrics during inference, just text), or when using transcribe_speech.py or speech_to_text_eval.py in examples/asr.

Two failing unit tests due to a change in expected results, caused by lhotse version update

Container

For additional information regarding NeMo containers, please visit: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo

docker pull nvcr.io/nvidia/nemo:24.01.speech

ASR

Changelog
  • Update link to yaml file in ASR_with_Transducers.ipynb by @Faith-Nchifor :: PR: #8014
  • Use convert_hf_dataset_to_nemo by @karpnv :: PR: #8017
  • Update asr_language_modeling.rst: Add a missing word by @martin0258 :: PR: #8007
  • spelling mistake by @orena1 :: PR: #7903
  • update asr eval by @stevehuang52 :: PR: #8045
  • fix noise aug by @stevehuang52 :: PR: #8057
  • Various fixes for typos and urls by @titu1994 :: PR: #8066
  • [Fix] Increase length check tolerance to prevent test failing by @anteju :: PR: #8067
  • Add text metrics to asr eval by @stevehuang52 :: PR: #8087
  • fix device setting to allow using accelerator cpu by @orena1 :: PR: #8084
  • .ctm in data simulator annotator compliant with RT-09 specification by @popcornell :: PR: #8004
  • Fix AST eval by @stevehuang52 :: PR: #8112
  • fix: numba.*_num_threads resets torch num_threads #8141 by @itzsimpl :: PR: #8145
  • Update dependencies by @titu1994 :: PR: #8156
  • NeMo + Lhotse integration by @pzelasko :: PR: #7880
  • Speedup RNN-T greedy decoding by @artbataev :: PR: #7926
  • [docker] Install k2 before NeMo for faster image rebuilding by @pzelasko :: PR: #8204
  • [docs] Add --force_codec to tarred dataset creation examples by @pzelasko :: PR: #8227
  • Temporarily use the previous RNN-T decoding algorithm as default by @artbataev :: PR: #8226
  • Make TDT inference not require duration params by @hainan-xv :: PR: #8207
  • Cache Aware Streaming tutorial notebook by @erastorgueva-nv :: PR: #8296
  • fix path location and branch by @nithinraok :: PR: #8304
  • Attention encoder-decoder models for multiple speech-to-text tasks … by @titu1994 :: PR: #8324
  • Remove asr webapp by @titu1994 :: PR: #8347
  • remove target at model level in aed model config [ASR] by @krishnacpuvvada :: PR: #8351
  • Add change_vocabulary and save_tokenizers() support to Multitask ASR models by @titu1994 :: PR: #8357
  • Change default beam size by @titu1994 :: PR: #8371
  • adding jenkins test for speech_to_text_aed model by @krishnacpuvvada :: PR: #8368
  • Add Finetuning tutorial with HF Datasets by @nithinraok :: PR: #8356
  • wer fix by @tbartley94 :: PR: #8404
  • add ensemble decoding fix by @nithinraok :: PR: #8427
  • Update k2 by @artbataev :: PR: #8492

TTS

Changelog
  • [TTS] Scale sampler steps by number of devices by @rlangman :: PR: #7947
  • Add All Multimodal Source Code Part 2: Text to image, x to nerf by @yaoyu-33 :: PR: #7970
  • [TTS] Add period discriminator and feature matching loss to codec recipe by @rlangman :: PR: #7884
  • Added VectorQuantizer base class by @anteju :: PR: #8011

LLMS

Changelog
  • Add interface to set NCCL options of each process group by @erhoo82 :: PR: #7923
  • Support O2 training of PEFT and SFT by @cuichenx :: PR: #7971
  • [NLP] Access scaler only in FP16 case by @janekl :: PR: #7916
  • [NLP] Minor improvements in Llama conversion script by @janekl :: PR: #7978
  • [NLP] Use helpers from utils_funcs.py in Llama conversion by @janekl :: PR: #7979
  • [NLP] Remove replace_sampler_ddp (deprecated in Trainer) by @janekl :: PR: #7981
  • Reworked MegatronPretrainingRandomBatchSampler to correctly handle epochs > 1 by @trias702 :: PR: #7920
  • Remove deprecated arguments from TE's TransformerLayer by @jbaczek :: PR: #7917
  • Add All Multimodal Source Code by @yaoyu-33 :: PR: #7791
  • First draft of mcore bert model in NeMo by @shanmugamr1992 :: PR: #7814
  • Support Falcon Variants (7B/40B/180B) in Mcore NeMo by @xuanzic :: PR: #7666
  • FSDP + Tensor Parallelism by @erhoo82 :: PR: #7897
  • Packed Sequence by @cuichenx :: PR: #7945
  • Adding method back that was removed accidentally by @ericharper :: PR: #8038
  • [NLP] ArtifactItem with init=True to make it debuggable by @janekl :: PR: #7980
  • SFT patch: (1) enable sequence parallelism and (2) enable profile by @erhoo82 :: PR: #7963
  • migration to PTL 2.0 for spellmapper model by @bene-ges :: PR: #7924
  • Change the megatron config lr scheduler default and fix to change partitions script by @shan18 :: PR: #8094
  • (1) Add SHARP interface to M-CORE, (2) use send/recv to send train loss to the first rank instead of b-cast by @erhoo82 :: PR: #7793
  • Reconfigure limit_val_batches only for int by @athitten :: PR: #8099
  • Fixing wrapper and moving it to base class by @shanmugamr1992 :: PR: #8055
  • fix gated_linear_unit bug by @Agoniii :: PR: #8042
  • Fix Adapter for MCore models by @cuichenx :: PR: #8124
  • add war fix for sync issues by @gshennvm :: PR: #8130
  • Improve PEFT UX by @cuichenx :: PR: #8131
  • Enhance flexibility by passing callbacks as method argument by @michal2409 :: PR: #8015
  • context parallelism by @xrennvidia :: PR: #7739
  • Make pipelined TP comm overlap available with mcore by @erhoo82 :: PR: #8005
  • remove deprecated scripts by @arendu :: PR: #8138
  • adding OnlineSampleMapping by @arendu :: PR: #8137
  • Add distopt support for FP8 params and BF16 optimizer state by @timmoon10 :: PR: #7909
  • Revert adding OnlineSampleMapping by @pablo-garay :: PR: #8164
  • Token count and sequence length logging for MegatronGPTSFTModel by @vysarge :: PR: #8136
  • Use latest apex internal API by @jbaczek :: PR: #8129
  • tune specific params in the base model by @arendu :: PR: #7745
  • Virtual pipeline parallel support for MegatronGPTSFTModel by @vysarge :: PR: #7964
  • removed deprecated peft model by @arendu :: PR: #8183
  • remove more deprecated files by @arendu :: PR: #8169
  • Pre-generate cu_seqlens argmin and max_seqlen to remove host-to-device sync by @erhoo82 :: PR: #8108
  • Add the interface to use SHARP to FSDP strategy by @erhoo82 :: PR: #8202
  • Multimodal required NLP base model changes by @yaoyu-33 :: PR: #8188
  • [NLP] Improve and unify loading state_dict for community models by @janekl :: PR: #7977
  • Rename Finetuning Scripts by @cuichenx :: PR: #8201
  • Final multimodal PR with our recent developments on MM side by @yaoyu-33 :: PR: #8127
  • Add include_text parameter to SFT dataloaders by @Kipok :: PR: #8198
  • Add random_seed argument to generate by @Kipok :: PR: #8162
  • Added support for neptune logger by @harishankar-gopalan :: PR: #8210
  • Pre-compute max_seqlen and cu_seqlens_argmin in all model-parallel cases by @erhoo82 :: PR: #8222
  • Use PackedSeqParams in accordance with changes in Megatron-LM by @cuichenx :: PR: #8205
  • Fix to peft & virtual pipeline parallel unsupported check by @vysarge :: PR: #8216
  • Fixed the tp overlap switch by @sanandaraj5597 :: PR: #8195
  • add knobs for rope/swiglu fusion by @lhb8125 :: PR: #8184
  • Added sample cpu_offloading switch to YAML by @sanandaraj5597 :: PR: #8148
  • Syncing random seed between ranks in generate by @Kipok :: PR: #8230
  • add first_val_step to mcore scheduler by @JimmyZhang12 :: PR: #8150
  • Correct padding for SFT input data to account for sequence parallel + TE's fp8 op dimension requirements by @vysarge :: PR: #8240
  • Mistral 7b conversion script by @akoumpa :: PR: #8052
  • switch to mcore dataset [with FIM support] by @dimapihtar :: PR: #8149
  • Mixtral to NeMo conversion script. by @akoumpa :: PR: #8155
  • fixes to accommodate mcore changes by @HuiyingLi :: PR: #8261
  • Allow MegatronPretrainingRandomSampler to do multi-epoch training by @trias702 :: PR: #8239
  • Add dist ckpt support for regular optimizers by @mikolajblaz :: PR: #7749
  • add deallocate pipeline output optimization by @JimmyZhang12 :: PR: #8279
  • Fix memory leak caused by context parallelism hanging references by omegaconf by @JimmyZhang12 :: PR: #8299
  • distributed fused adam + rampup bs support by @dimapihtar :: PR: #8302
  • Update PEFT Doc by @cuichenx :: PR: #8262
  • Converter script fixes for mixtral/mistral by @akoumpa :: PR: #8272
  • Keep max_seqlen and cu_seqlens_argmin for later micro-batches when PP>1 by @erhoo82 :: PR: #8334
  • Enable megatron core loggers for GPT pretraining by @ashbhandare :: PR: #8354
  • mcore ds fix by @dimapihtar :: PR: #8283
  • release updates by @dimapihtar :: PR: #8378
  • Mcore customization doc by @HuiyingLi :: PR: #8298
  • updated link to pubmed by @nithinraok :: PR: #8402
  • mcore customization doc minor fix by @HuiyingLi :: PR: #8421
  • Fixing mcore bert for TP, PP and SP by @shanmugamr1992 :: PR: #8336
  • Add settings to suppress bf16 compile errors in CI on V100 by @athitten :: PR: #8481
  • MoE parameter passing by @akoumpa :: PR: #8255
  • Add fp8 support for SD/Update notebook paths by @Victor49152 :: PR: #8489

NeMo Tools

Changelog
  • SDE bugfix log by @Jorjeous :: PR: #8430

General Improvements

Changelog
  • Add news section to README by @ericharper :: PR: #7984
  • Fixing conversion script to work for code llama by @shanmugamr1992 :: PR: #7997
  • Fix crash when converting to mcore a model using rotary embeddings by @odelalleau :: PR: #7998
  • Added a procedure for Windows users, README by @Jorjeous :: PR: #7942
  • Update manifest.py to speedup loading tarred datasets by @stevehuang52 :: PR: #7900
  • [Fix] Fixed name of a test by @anteju :: PR: #7986
  • Fix lora merge script by @cuichenx :: PR: #8113
  • Support transcoding audio formats when saving tarred datasets (FLAC, OPUS) by @pzelasko :: PR: #8102
  • README edit to change Apple Silicon install instructions (to fix a break introduced by pytorch 2) by @stephenmcconnachie :: PR: #8122
  • Fixes NVIDIA/apex installation to not erroneously install the pkg by @terrykong :: PR: #8126
  • Graphviz fix by @GNroy :: PR: #7843
  • Update README.rst by @fayejf :: PR: #8154
  • Fix TP>1 issue for conversion script by @cuichenx :: PR: #8144
  • Support torch jit script by @artbataev :: PR: #8027
  • NeMo Multimodal Docs and Tests Initial PR by @yaoyu-33 :: PR: #8028
  • Remove left-over prints in NeMo+Lhotse code by @pzelasko :: PR: #8180
  • Upgrade to DLFW PyTorch 23.12 by @ericharper :: PR: #8163
  • Add Lhotse support for key in NeMo manifests by @pzelasko :: PR: #8197
  • Fix CPU Initialization and TP>1 for LoRA Merge Script by @cuichenx :: PR: #8199
  • Add support in Neural Typecheck to disable semantic checks by @titu1994 :: PR: #8212
  • Pin lhotse=1.19.2 in r1.23.0 by @pzelasko :: PR: #8303
  • Multimodal r1.23.0 bug fix by @yaoyu-33 :: PR: #8315
  • MCore dataset compatibility for tokenizers by @vysarge :: PR: #8390
  • Update NFA video download link by @erastorgueva-nv :: PR: #8406
  • Update MM Dataprep Tutorial by @cuichenx :: PR: #8410
  • Fix dreambooth data sampler issue by @yaoyu-33 :: PR: #8400
  • Fix a bug in CTM line processing function for multi-speaker data simulations by @tango4j :: PR: #8416
  • Akoumparouli/mistral bugfix by @akoumpa :: PR: #8353
  • pin to 0.5.0 by @ericharper :: PR: #8465
  • Update NeMo Multimodal Requirements by @yaoyu-33 :: PR: #8515
  • Fix link in multimodal dataprep tutorial by @cuichenx :: PR: #8517