
NT Nano cfg update #2662

Open

malay-nagda wants to merge 7 commits into main from malay/nano_260201_opt

Conversation

@malay-nagda malay-nagda (Collaborator) commented Mar 5, 2026

What does this PR do ?

Updated configs for better perf

Changelog

```python
cfg.model.moe_router_force_load_balancing = True
cuda_graph_impl="transformer_engine",
cuda_graph_scope=["attn", "mamba", "moe_router", "moe_preprocess"],
if model_recipe_name in ["nemotron_3_nano"]:
    del_cudnn_ln = False
```
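The `del_cudnn_ln` gating in the changelog can be sketched as a small helper. This is an illustration only: the helper name is hypothetical, and the real logic lives inside `perf_plugins.py`; only the conditional mirrors the diff above.

```python
# Hypothetical helper illustrating the recipe-gated flag from the changelog;
# in the actual PR this conditional sits inside
# _set_model_specific_environment_variables in perf_plugins.py.
def should_remove_cudnn_ln_vars(model_recipe_name: str) -> bool:
    """Return True when the cuDNN LayerNorm env vars should be removed."""
    del_cudnn_ln = True
    if model_recipe_name in ["nemotron_3_nano"]:
        # Nemotron 3 Nano keeps the cuDNN LayerNorm variables in place.
        del_cudnn_ln = False
    return del_cudnn_ln
```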

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Chores
    • Updated Nemotron 3 Nano model configurations across multiple GPU platforms (H100, GB300, GB200, B300, B200).
    • Added CUDA graph optimization settings for improved performance.
    • Adjusted batch size configurations for different hardware variants.
    • Enabled load balancing for mixture-of-experts routing.

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 5, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@malay-nagda malay-nagda changed the title Malay/nano 260201 opt NT Nano cfg update Mar 5, 2026
@malay-nagda malay-nagda added the labels performance, r0.3.0 (cherry-pick label for the r0.3.0 release branch), and performance/optimize (performance optimization tracking) Mar 5, 2026
@malay-nagda malay-nagda marked this pull request as ready for review March 5, 2026 16:16
@coderabbitai

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Configures Nemotron 3 Nano model pretraining across various hardware platforms by enabling MoE router load balancing, adjusting batch size parameters across GPU variants (GB300/B300 and GB200/B200), adding CUDA graph optimization scope definitions, and disabling cuDNN LayerNorm environment variable removal for this recipe.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **MoE Configuration**<br>`scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py` | Enables MoE router force load balancing by setting `cfg.model.moe_router_force_load_balancing = True` in `set_nemotron_3_nano_common_configs`. |
| **Workload Base Configuration**<br>`scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py` | Removes `micro_batch_size` from the base config; adds `cuda_graph_impl` and `cuda_graph_scope` fields. Updates pretrain configs to replace `tensor_model_parallel_size=1` with `micro_batch_size` (4 for GB300/B300, 2 for GB200/B200). Extends H100 configs with `cuda_graph_scope` and expanded `recompute_modules`. |
| **Performance Plugin Setup**<br>`scripts/performance/perf_plugins.py` | Adds a conditional branch in `_set_model_specific_environment_variables` to prevent cuDNN LayerNorm variable removal (`del_cudnn_ln = False`) when the model recipe is `nemotron_3_nano`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • PR #2060: Modifies perf_plugins.py to adjust model-specific del_cudnn_ln gating logic for different recipes.
  • PR #2617: Updates nemotron_3_nano H100 pretraining configuration definitions in the same workload config file.
  • PR #2152: Prevents removal of cuDNN LayerNorm environment variables for MoE recipes in perf_plugins.py.

Suggested reviewers

  • erhoo82
  • tomlifu
  • ko3n1g
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | The PR claims a performance optimization but provides no test results, before/after benchmarks, or performance metrics to validate the claimed improvements. | Add performance benchmark results showing the impact of the configuration changes on throughput/latency, with the specific hardware context and configurations tested. |
| Title check | ❓ Inconclusive | The title "NT Nano cfg update" is vague and generic, using abbreviated terms ("NT", "cfg") that lack clarity about the specific changes being made. | Provide a more descriptive title that explains the main change, such as "Update Nemotron 3 Nano configurations for performance optimization" or "Configure CUDA Graph settings and load balancing for Nemotron Nano". |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py (1)

75-75: Redundant `cuda_graph_impl` override can be removed.

Line 75 repeats the same `cuda_graph_impl` already defined in `BASE_NEMOTRON_3_NANO_CONFIG` (line 38). Dropping it here would reduce config drift risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py`
at line 75, Remove the redundant cuda_graph_impl override in the specific
workload config: delete the line setting cuda_graph_impl="transformer_engine" in
the nemotron_3_nano workload so it inherits the value from
BASE_NEMOTRON_3_NANO_CONFIG instead of shadowing it, ensuring the config no
longer duplicates the same setting and reducing config drift risk.
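The inheritance this nitpick relies on can be sketched with plain dict merging. The names below are illustrative (the real configs are config objects, not dicts); the point is that a derived config which merges the base should not restate a value the base already carries.

```python
# Base config already carries cuda_graph_impl; a derived config that merges
# the base should only override what actually differs, or the two copies
# can silently drift apart. Names are hypothetical.
BASE_NEMOTRON_3_NANO_CONFIG = {
    "cuda_graph_impl": "transformer_engine",
    "cuda_graph_scope": ["attn", "mamba", "moe_router", "moe_preprocess"],
}

gb300_pretrain_config = {
    **BASE_NEMOTRON_3_NANO_CONFIG,
    "micro_batch_size": 4,  # cuda_graph_impl is inherited, not repeated
}
```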

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7cca4e97-bce1-400e-b5ad-353938665028

📥 Commits

Reviewing files that changed from the base of the PR and between 01cb6be and 540e8e7.

📒 Files selected for processing (3)
  • scripts/performance/configs/nemotronh/nemotron_3_nano_llm_pretrain.py
  • scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py
  • scripts/performance/perf_plugins.py

```python
expert_model_parallel_size=8,
moe_flex_dispatcher_backend="hybridep",
cuda_graph_impl="transformer_engine",
cuda_graph_scope=["attn", "mamba", "moe_router", "moe_preprocess"],
```
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: NVIDIA-NeMo/Megatron-Bridge):

```shell
# Find the WorkloadBaseConfig class definition and cuda_graph_scope type annotation
rg -A 20 "class WorkloadBaseConfig" scripts/performance/utils/utils.py | head -50

# Find all cuda_graph_scope assignments in config files
rg -n "cuda_graph_scope\s*=" scripts/performance/configs/

# Check the specific file and lines mentioned in the review
sed -n '35,90p' scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py | cat -n

# Search for the cuda_graph_scope field in the WorkloadBaseConfig class
rg -A 100 "class WorkloadBaseConfig" scripts/performance/utils/utils.py | grep -A 5 -B 5 "cuda_graph_scope"

# Check the file length before reading it in full
wc -l scripts/performance/utils/utils.py

# Read the start of utils.py to see the full WorkloadBaseConfig definition
cat -n scripts/performance/utils/utils.py | head -200
```

Fix type annotation for `cuda_graph_scope` to accept both string and list values.

Line 55 in `scripts/performance/utils/utils.py` defines `cuda_graph_scope: Optional[str] = None`, but the actual usage across config files assigns both `list[str]` values (e.g., `["attn", "mamba"]`) and `str` values (e.g., `"full_iteration"`). Update the type annotation to `str | list[str] | None` to match the actual usage pattern. This also applies to lines 81 and 85 of the file under review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/performance/configs/nemotronh/nemotron_3_nano_workload_base_configs.py`
at line 39, Change the type annotation for the cuda_graph_scope variable in
scripts/performance/utils/utils.py from Optional[str] to allow both strings and
lists by using the union type str | list[str] | None; update every occurrence of
the cuda_graph_scope annotation (the variable named cuda_graph_scope and any
function signatures or defaults referencing it) so the annotation matches actual
usage (accepting values like "full_iteration" or ["attn","mamba","moe_router"]).
Ensure imports or typing usage remain valid for the project's Python version
(use PEP 604 union syntax).

@copy-pr-bot copy-pr-bot bot requested a deployment to nemo-ci March 5, 2026 18:36 Abandoned