
NT Nano cfg update #2662

Open

malay-nagda wants to merge 7 commits into main from malay/nano_260201_opt

Conversation

@malay-nagda malay-nagda (Collaborator) commented Mar 5, 2026

What does this PR do ?

Updated configs for better perf

Changelog

```python
cfg.model.moe_router_force_load_balancing = True
cuda_graph_impl="transformer_engine",
cuda_graph_scope=["attn", "mamba", "moe_router", "moe_preprocess"],
if model_recipe_name in ["nemotron_3_nano"]:
    del_cudnn_ln = False
```
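The `del_cudnn_ln` gating in the changelog can be sketched as a small helper. This is an illustration only: the helper name is hypothetical, and the real logic lives inside `perf_plugins.py`; only the conditional mirrors the diff above.

```python
# Hypothetical helper illustrating the recipe-gated flag from the changelog;
# in the actual PR this conditional sits inside
# _set_model_specific_environment_variables in perf_plugins.py.
def should_remove_cudnn_ln_vars(model_recipe_name: str) -> bool:
    """Return True when the cuDNN LayerNorm env vars should be removed."""
    del_cudnn_ln = True
    if model_recipe_name in ["nemotron_3_nano"]:
        # Nemotron 3 Nano keeps the cuDNN LayerNorm variables in place.
        del_cudnn_ln = False
    return del_cudnn_ln
```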

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Chores
    • Updated Nemotron 3 Nano model configurations across multiple GPU platforms (H100, GB300, GB200, B300, B200).
    • Added CUDA graph optimization settings for improved performance.
    • Adjusted batch size configurations for different hardware variants.
    • Enabled load balancing for mixture-of-experts routing.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 5, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@malay-nagda malay-nagda changed the title Malay/nano 260201 opt NT Nano cfg update Mar 5, 2026
@malay-nagda malay-nagda added the labels performance, r0.3.0 (cherry-pick label for the r0.3.0 release branch), and performance/optimize (performance optimization tracking) Mar 5, 2026
@malay-nagda malay-nagda marked this pull request as ready for review March 5, 2026 16:16
@coderabbitai

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Configures Nemotron 3 Nano model pretraining across various hardware platforms by enabling MoE router load balancing, adjusting batch size parameters across GPU variants (GB300/B300 and GB200/B200), adding CUDA graph optimization scope definitions, and disabling cuDNN LayerNorm environment variable removal for this recipe.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **MoE Configuration**<br>`scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py` | Enables MoE router force load balancing by setting `cfg.model.moe_router_force_load_balancing = True` in `set_nemotron_3_nano_common_configs`. |
| **Workload Base Configuration**<br>`scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py` | Removes `micro_batch_size` from the base config; adds `cuda_graph_impl` and `cuda_graph_scope` fields. Updates pretrain configs to replace `tensor_model_parallel_size=1` with `micro_batch_size` (4 for GB300/B300, 2 for GB200/B200). Extends H100 configs with `cuda_graph_scope` and expanded `recompute_modules`. |
| **Performance Plugin Setup**<br>`scripts/performance/perf_plugins.py` | Adds a conditional branch in `_set_model_specific_environment_variables` to prevent cuDNN LayerNorm variable removal (`del_cudnn_ln = False`) when the model recipe is `nemotron_3_nano`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • PR #2060: Modifies perf_plugins.py to adjust model-specific del_cudnn_ln gating logic for different recipes.
  • PR #2617: Updates nemotron_3_nano H100 pretraining configuration definitions in the same workload config file.
  • PR #2152: Prevents removal of cuDNN LayerNorm environment variables for MoE recipes in perf_plugins.py.

Suggested reviewers

  • erhoo82
  • tomlifu
  • ko3n1g
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | The PR claims a performance optimization but provides no test results, before/after benchmarks, or performance metrics to validate the claimed improvements. | Add performance benchmark results showing the impact of the configuration changes on throughput/latency, with the specific hardware context and configurations tested. |
| Title check | ❓ Inconclusive | The title "NT Nano cfg update" is vague and generic, using abbreviated terms ("NT", "cfg") that lack clarity about the specific changes being made. | Provide a more descriptive title that explains the main change, such as "Update Nemotron 3 Nano configurations for performance optimization" or "Configure CUDA Graph settings and load balancing for Nemotron Nano". |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py (1)

75-75: Redundant `cuda_graph_impl` override can be removed.

Line 75 repeats the same `cuda_graph_impl` already defined in `BASE_NEMOTRON_3_NANO_CONFIG` (line 38). Dropping it here would reduce config drift risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py`
at line 75, Remove the redundant cuda_graph_impl override in the specific
workload config: delete the line setting cuda_graph_impl="transformer_engine" in
the nemotron_3_nano workload so it inherits the value from
BASE_NEMOTRON_3_NANO_CONFIG instead of shadowing it, ensuring the config no
longer duplicates the same setting and reducing config drift risk.
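The inheritance this nitpick relies on can be sketched with plain dict merging. The names below are illustrative (the real configs are config objects, not dicts); the point is that a derived config which merges the base should not restate a value the base already carries.

```python
# Base config already carries cuda_graph_impl; a derived config that merges
# the base should only override what actually differs, or the two copies
# can silently drift apart. Names are hypothetical.
BASE_NEMOTRON_3_NANO_CONFIG = {
    "cuda_graph_impl": "transformer_engine",
    "cuda_graph_scope": ["attn", "mamba", "moe_router", "moe_preprocess"],
}

gb300_pretrain_config = {
    **BASE_NEMOTRON_3_NANO_CONFIG,
    "micro_batch_size": 4,  # cuda_graph_impl is inherited, not repeated
}
```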

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7cca4e97-bce1-400e-b5ad-353938665028

📥 Commits

Reviewing files that changed from the base of the PR and between 01cb6be and 540e8e7.

📒 Files selected for processing (3)
  • scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py
  • scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py
  • scripts/performance/perf_plugins.py

```python
expert_model_parallel_size=8,
moe_flex_dispatcher_backend="hybridep",
cuda_graph_impl="transformer_engine",
cuda_graph_scope=["attn", "mamba", "moe_router", "moe_preprocess"],
```
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: NVIDIA-NeMo/Megatron-Bridge):

```shell
# Find the WorkloadBaseConfig class definition and cuda_graph_scope type annotation
rg -A 20 "class WorkloadBaseConfig" scripts/performance/utils/utils.py | head -50

# Find all cuda_graph_scope assignments in config files
rg -n "cuda_graph_scope\s*=" scripts/performance/configs/

# Check the specific file and lines mentioned in the review
sed -n '35,90p' scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py | cat -n

# Search for the cuda_graph_scope field in the WorkloadBaseConfig class
rg -A 100 "class WorkloadBaseConfig" scripts/performance/utils/utils.py | grep -A 5 -B 5 "cuda_graph_scope"

# Check the file length before reading it in full
wc -l scripts/performance/utils/utils.py

# Read the start of utils.py to see the full WorkloadBaseConfig definition
cat -n scripts/performance/utils/utils.py | head -200
```

Fix type annotation for `cuda_graph_scope` to accept both string and list values.

Line 55 in `scripts/performance/utils/utils.py` defines `cuda_graph_scope: Optional[str] = None`, but the actual usage across config files assigns both `list[str]` values (e.g., `["attn", "mamba"]`) and `str` values (e.g., `"full_iteration"`). Update the type annotation to `str | list[str] | None` to match the actual usage pattern. This also applies to lines 81 and 85 of the file under review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py`
at line 39, Change the type annotation for the cuda_graph_scope variable in
scripts/performance/utils/utils.py from Optional[str] to allow both strings and
lists by using the union type str | list[str] | None; update every occurrence of
the cuda_graph_scope annotation (the variable named cuda_graph_scope and any
function signatures or defaults referencing it) so the annotation matches actual
usage (accepting values like "full_iteration" or ["attn","mamba","moe_router"]).
Ensure imports or typing usage remain valid for the project's Python version
(use PEP 604 union syntax).

@copy-pr-bot copy-pr-bot bot requested a deployment to nemo-ci March 5, 2026 18:36 Abandoned