
[model] feat: add MiniMax-M2 MoE bridge #2602

Open
yaoyu-33 wants to merge 4 commits into main from yuya/add-minimax-m2-bridge

Conversation


yaoyu-33 (Contributor) commented Mar 1, 2026

Summary

Add Megatron Bridge for MiniMaxAI/MiniMax-M2, a sparse MoE model with 256 experts, top-8 sigmoid routing, and expert bias correction.

Bridge (minimax_m2_bridge.py)

  • Config mapping with manual rotary_percent calculation (rotary_dim / head_dim)
  • Per-expert weight mapping using block_sparse_moe prefix (w1/w2/w3 format)
  • Sigmoid routing with e_score_correction_bias buffer mapping
  • QK layernorm intentionally disabled (full-dim vs per-head mismatch — documented TODO)
  • MoE settings: grouped gemm, alltoall dispatcher, aux_loss load balancing
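The config mapping above can be sketched as a small function. This is an illustrative sketch only: the HF-side field names (head_dim, rotary_dim, num_local_experts) and the helper name are assumptions, not the exact MiniMax-M2 config keys or the bridge's real API.

```python
def map_minimax_m2_config(hf_cfg: dict) -> dict:
    """Hypothetical sketch of deriving Megatron-side settings from an HF-style config."""
    head_dim = hf_cfg["head_dim"]
    return {
        # Partial RoPE: Megatron takes a fraction, HF stores an absolute rotary dim.
        "rotary_percent": hf_cfg["rotary_dim"] / head_dim,
        # Top-8 sigmoid routing with expert bias correction, per the PR summary.
        "moe_router_topk": 8,
        "moe_router_score_function": "sigmoid",
        "moe_router_enable_expert_bias": True,
        "num_moe_experts": hf_cfg["num_local_experts"],
        # MoE execution settings listed above.
        "moe_grouped_gemm": True,
        "moe_token_dispatcher_type": "alltoall",
        "moe_router_load_balancing_type": "aux_loss",
    }

# Example with illustrative dimensions (not the real model's values):
cfg = map_minimax_m2_config({"head_dim": 128, "rotary_dim": 64, "num_local_experts": 256})
print(cfg["rotary_percent"])  # 0.5
```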

Examples

  • conversion.sh — single/multi-GPU round-trip and checkpoint import/export
  • inference.sh — text generation with TP support
  • verify_toy_model.py — creates a toy model and runs forward-pass comparison via compare.py

Tests

  • test_minimax_m2_conversion.py — toy model creation, single-GPU round-trip, TP=2 and PP=2 parallelism tests

Bug fix in compare.py

  • Truncate Megatron logits to HF vocab size before comparison, fixing shape mismatch when Megatron pads vocab for GPU kernel efficiency
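The fix can be illustrated with a minimal pure-Python sketch (names and shapes are illustrative, not compare.py's actual code): truncate the padded Megatron logit row to the HF vocab size, then score agreement with cosine similarity.

```python
import math

def compare_logits(megatron_row, hf_row):
    """Sketch: drop vocab-padding columns, then return cosine similarity."""
    trimmed = megatron_row[: len(hf_row)]  # truncate to HF vocab size
    dot = sum(a * b for a, b in zip(trimmed, hf_row))
    norm = math.sqrt(sum(a * a for a in trimmed)) * math.sqrt(sum(b * b for b in hf_row))
    return dot / norm

# Logits that are identical after truncation give cosine similarity 1.0;
# the trailing zeros stand in for Megatron's vocab padding.
megatron = [0.1, 0.2, 0.3, 0.0, 0.0]
hf = [0.1, 0.2, 0.3]
print(round(compare_logits(megatron, hf), 6))  # 1.0
```

Without the truncation, the two tensors would have mismatched vocab dimensions and element-wise comparison would fail.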

Verification

  • 1-GPU forward pass: cosine similarity 0.999990 ✅
  • EP=2 forward pass: cosine similarity 0.999990 ✅
  • EP=2 weight round-trip: all 43 weights match ✅
  • TP=2+EP=2 weight round-trip: all weights match ✅

Known Limitations

  • QK layernorm weights (q_norm, k_norm) are dropped — MiniMax-M2 uses full-dimension QK norm while Megatron uses per-head. Acceptable for fine-tuning.
  • MTP (Multi-Token Prediction) modules are not mapped.
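The QK-norm mismatch can be illustrated with hypothetical shapes (the head counts below are placeholders, not the real MiniMax-M2 dimensions): the HF checkpoint stores one norm weight over the full projected dimension, while Megatron's qk_layernorm expects a weight per head dimension.

```python
# Illustrative values only; not taken from the actual model config.
num_heads, head_dim = 48, 128

full_dim_q_norm = [1.0] * (num_heads * head_dim)  # HF-style: [num_heads * head_dim]
per_head_q_norm = [1.0] * head_dim                # Megatron-style: [head_dim]

# The parameterizations are not interchangeable, so the bridge drops these
# weights rather than guessing a reshape.
print(len(full_dim_q_norm), len(per_head_q_norm))  # 6144 128
```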

Test plan

  • Run verify_toy_model.py on 1-GPU, TP=2, EP=2
  • Run test_minimax_m2_conversion.py with 2+ GPUs
  • Verify weight round-trip with TP=2+EP=2

Made with Cursor

Summary by CodeRabbit

  • New Features

    • Added MiniMax-M2 model support with conversion and inference capabilities.
    • Added toy model verification script for MiniMax-M2 conversion testing.
  • Bug Fixes

    • Improved Megatron/HF model comparison with proper logit truncation and similarity metrics.
  • Refactor

    • Consolidated deprecated model provider classes into unified base providers.
    • Migrated recipes to AutoBridge-based model configuration.
  • Tests

    • Updated tests to align with new provider infrastructure.
    • Added comprehensive MiniMax-M2 conversion test suite.
    • Removed deprecated provider test coverage.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
- Remove NemotronNano12Bv2Provider from nemotron_vl/__init__.py
  (was a deprecated alias from deleted nemotron_h_provider.py)
- Remove invalid max_position_embeddings kwarg from kimi and moonlight
  recipes (not a field on MLAModelProvider)
- Update moonlight test to monkeypatch MLAModelProvider instead of
  deleted MoonlightModelProvider16B

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Add Megatron Bridge for MiniMaxAI/MiniMax-M2, a sparse MoE model with
256 experts, top-8 sigmoid routing, and expert bias correction.

Includes:
- Bridge with config mapping and per-expert weight conversion
  (block_sparse_moe prefix, w1/w2/w3 format)
- Partial RoPE support (rotary_dim -> rotary_percent)
- QK layernorm intentionally disabled (full-dim vs per-head mismatch)
- Functional test with toy model for TP/PP/EP parallelism
- Example scripts for conversion, inference, and verification
- compare.py fix: truncate Megatron logits to HF vocab size for
  proper comparison when Megatron pads vocab for kernel efficiency

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor

copy-pr-bot bot commented Mar 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Mar 1, 2026

📝 Walkthrough

This PR adds MiniMax-M2 model support to Megatron-Bridge and consolidates the model provider architecture by removing numerous deprecated provider classes across DeepSeek, Llama, Nemotron, GLM, GPT-OSS, and Kimi variants, replacing them with unified bridge implementations and generic providers (GPTModelProvider, MLAModelProvider, MambaModelProvider). It includes new conversion/inference scripts, updates to recipes, tests, and examples.

Changes

  • MiniMax-M2 Bridge & Support (src/megatron/bridge/models/minimax_m2/*, examples/models/minimax_m2/*, tests/functional_tests/models/minimax_m2/*): New MiniMax-M2 model bridge with provider configuration and a mapping registry for the MoE architecture; adds conversion/inference shell scripts and comprehensive conversion tests, including toy model creation and multi-GPU verification.
  • Example Conversion Updates (examples/conversion/compare_hf_and_megatron/compare.py): Modified Megatron model comparison: disabled MTP layer configuration, added dtype normalization for NCCL broadcasting, removed model reloading, added logits truncation to the HF vocabulary size, and enhanced comparison metrics (cosine similarity, threshold-based status).
  • Safe Repository Whitelist (src/megatron/bridge/models/hf_pretrained/utils.py): Added "MiniMaxAI" to the SAFE_REPOS list for HF model loading.
  • Bridge Module Initialization (src/megatron/bridge/models/__init__.py): Consolidated public API exports: removed numerous DeepSeek, Llama, Nemotron, and GPT-OSS provider classes; added MiniMaxM2Bridge, LlamaBridge, and NemotronHBridge to unified exports.
  • Deprecated DeepSeek Provider Removal (src/megatron/bridge/models/deepseek/deepseek_provider.py, src/megatron/bridge/models/deepseek/__init__.py): Removed all DeepSeek provider classes (DeepSeekModelProvider, DeepSeekV2*/DeepSeekV3ModelProvider, MoonlightModelProvider16B) and deprecation warnings; retained only bridge definitions.
  • Llama Provider Consolidation (src/megatron/bridge/models/llama/*): Deleted the entire llama_provider.py file (20+ provider variants); updated __init__.py to expose only LlamaBridge; simplified the public API surface.
  • Llama-Nemotron Provider Refactor (src/megatron/bridge/models/llama_nemotron/*): Replaced Llama-based provider inheritance with GPTModelProvider; removed per-model provider variants; updated the bridge return type to LlamaNemotronHeterogeneousProvider.
  • GLM Provider & File Removal (src/megatron/bridge/models/glm/*, src/megatron/bridge/models/glm_vl/glm_45v_provider.py): Deleted glm45_provider.py (3 provider classes removed); updated GLM45VModelProvider to inherit from GPTModelProvider with expanded configuration fields; narrowed GLM module exports to GLM45Bridge.
  • GPT-OSS Provider Removal (src/megatron/bridge/models/gpt_oss/*): Completely removed the gpt_oss_provider.py file (GPTOSSProvider base and variants); updated __init__.py to export only the bridge.
  • Kimi Provider Consolidation (src/megatron/bridge/models/kimi/*): Deleted kimi_provider.py (KimiK2Provider removed); switched imports to use KimiBridge; updated recipes to use AutoBridge instead of direct provider instantiation.
  • Nemotron & NemotronH Provider Removal (src/megatron/bridge/models/nemotron/*, src/megatron/bridge/models/nemotronh/*): Deleted nemotron_provider.py and nemotron_h_provider.py entirely (13+ provider classes removed); narrowed module exports to bridge definitions only.
  • Nemotron-VL Provider Refactor (src/megatron/bridge/models/nemotron_vl/*): Replaced NemotronNano12Bv2Provider inheritance with MambaModelProvider; expanded public configuration fields; removed legacy provider exports.
  • Recipe Updates for Kimi, Moonlight, and Nemotron (src/megatron/bridge/recipes/kimi/kimi_k2.py, src/megatron/bridge/recipes/moonlight/moonlight_16b.py, src/megatron/bridge/recipes/nemotronh/*): Replaced direct provider instantiation with AutoBridge/MambaModelProvider usage; updated imports and provider configurations across pretrain/SFT/PEFT paths; adjusted activation function references.
  • Example & Test Sampler Updates (examples/conversion/.., tests/functional_tests/data/test_samplers.py): Updated import statements to use the new provider classes (GPTModelProvider, MLAModelProvider); simplified test bridge implementations.
  • Model Provider Test Removal (tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py, tests/unit_tests/models/*/test_*_provider.py): Deleted 8 provider test files covering DeepSeek, GLM, Kimi, Llama, Nemotron, and NemotronH variants (totaling 1000+ deleted test lines).
  • Functional Training Test Updates (tests/functional_tests/training/test_*.py, 15+ files): Replaced Llama-based provider usage (Llama32ModelProvider1B, Llama3ModelProvider variants) with GPTModelProvider across all training tests; added expanded configuration parameters.
  • Unit Test & Recipe Test Updates (tests/unit_tests/models/test_models_imports.py, tests/unit_tests/recipes/test_*.py, tests/unit_tests/recipes/nemotronh/test_*.py): Updated imports and isinstance checks to use the new provider classes (MambaModelProvider, GPTModelProvider, MLAModelProvider); removed test imports for deleted provider classes; simplified monkeypatch strategies.
  • Conversion Test Helper & Config Tests (tests/unit_tests/training/test_config.py, tests/unit_tests/training/test_log_non_default_values.py): Updated DeepSeekModelProvider references to MLAModelProvider; adjusted expected transformer config parent classes.
  • Provider Bridge Type Hint Updates (src/megatron/bridge/models/conversion/model_bridge.py, src/megatron/bridge/models/llama_nemotron/llama_nemotron_bridge.py): Updated provider_bridge method return types from Llama-based providers to GPTModelProvider and LlamaNemotronHeterogeneousProvider respectively.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Modifications to example conversion comparison logic in compare.py align with logits normalization and truncation improvements.
  • Large-scale provider consolidation from legacy hierarchies (Llama, Nemotron, DeepSeek variants) to unified bridge-based architecture matches ongoing refactoring initiatives.
  • Test suite updates across functional and unit tests reflect the shift from provider-centric to bridge-centric model configuration patterns.

Suggested labels

refactoring, model-provider, bridge, consolidation, tests

Suggested reviewers

  • chtruong814
  • cuichenx
🚥 Pre-merge checks: 4 passed

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly summarizes the main change: adding a MiniMax-M2 MoE bridge to Megatron. It is concise, specific, and directly related to the core objective of the PR.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 82.08%, which meets the required 80.00% threshold.
  • Test Results For Major Changes: ✅ Passed. PR objectives document comprehensive test results, including forward-pass cosine similarity (0.999990 for 1-GPU and EP=2) and weight round-trip verification across multiple parallelism settings, with functional tests and verification scripts included.


coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/unit_tests/recipes/test_run_plugins.py (1)

77-102: ⚠️ Potential issue | 🟡 Minor

Use the caller-provided sequence length in model config.

Line 77 accepts seq_length, but Line 101 hardcodes 8192. This ignores overrides and can desynchronize model vs dataset sequence lengths.

🔧 Minimal fix
-        seq_length=8192,
+        seq_length=seq_length,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/recipes/test_run_plugins.py` around lines 77 - 102, The
model config currently hardcodes seq_length=8192 in the GPTModelProvider call,
ignoring the caller-provided seq_length variable; update the GPTModelProvider
instantiation to use the seq_length variable (the one popped from kwargs at the
top of the function) instead of the literal 8192 so the model config matches the
dataset/training sequence length and respects overrides.
src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py (1)

79-151: ⚠️ Potential issue | 🟠 Major

Do not silently ignore vp_stage in provide().

Line 79 accepts vp_stage, but the implementation never uses or validates it. That can silently construct the wrong model scope under virtual pipeline usage instead of failing fast.

🔒 Suggested guard
 def provide(self, pre_process=None, post_process=None, vp_stage=None):  # noqa: D401
     """Assemble a full :class:`~megatron.core.models.multimodal.llava_model.LLaVAModel`."""
+    if vp_stage is not None or getattr(self, "virtual_pipeline_model_parallel_size", None) is not None:
+        raise ValueError(
+            "Virtual pipeline parallelism is not supported in "
+            "NemotronNano12Bv2VLModelProvider.provide()."
+        )

     language_cfg = copy.deepcopy(self)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py` around lines
79 - 151, The provide() method accepts vp_stage but never uses or validates it,
which can silently construct an incorrect model under virtual pipeline; update
provide() (in nemotron_vl_provider.py) to validate vp_stage at the start (e.g.,
ensure it's None or within allowed stage range) and either pass it into the
LLaVAModel constructor if that class supports a virtual pipeline stage parameter
or raise a clear exception when an unsupported vp_stage is given; reference the
provide() function and the vp_stage parameter and ensure the guard runs before
assembling language_cfg/vision_cfg and creating the LLaVAModel so incorrect
scopes cannot be built.
🧹 Nitpick comments (10)
tests/unit_tests/recipes/test_moonlight_recipes.py (1)

18-18: Consider updating stale documentation and class naming.

The docstring on line 18 still references MoonlightModelProvider16B, and the fake class _FakeMoonlightModelProvider16B (line 76) is now used to mock MLAModelProvider. This inconsistency could confuse future maintainers.

✏️ Suggested documentation updates
 # Test purpose:
 # - Parametrize over all exported Moonlight recipe functions in `megatron.bridge.recipes.moonlight`.
-# - For each recipe, monkeypatch `MoonlightModelProvider16B` with a lightweight fake to avoid I/O.
+# - For each recipe, monkeypatch `MLAModelProvider` with a lightweight fake to avoid I/O.
 # - Build a config with small, safe overrides and assert it forms a valid `ConfigContainer`.
 # - Verify tokenizer selection and sanity-check parallelism fields.
-class _FakeMoonlightModelProvider16B:
-    """Fake MoonlightModelProvider16B for testing without model I/O."""
+class _FakeMLAModelProvider:
+    """Fake MLAModelProvider for testing Moonlight recipes without model I/O."""

If renaming the class, update all usages (lines 136, 171, 196, 222, 238, 261, 284, 307, 329).

Also applies to: 76-77

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/recipes/test_moonlight_recipes.py` at line 18, Update the
stale docstring and fake class name so they match the current provider type:
change any docstring text referencing MoonlightModelProvider16B to reference
MLAModelProvider (or the current provider name) and rename the fake class
_FakeMoonlightModelProvider16B to _FakeMLAModelProvider (or similar) and update
all monkeypatch usages that reference the old name (the test class and each
place where the fake is injected in the tests). Ensure the docstring, the fake
class declaration, and every usage/monkeypatch that currently points to
_FakeMoonlightModelProvider16B are updated to the new MLAModelProvider-based
names so they are consistent.
tests/unit_tests/training/test_config.py (1)

79-89: Consider renaming this function to reflect its current purpose.

The function is named create_test_deepseek_config but now creates an MLAModelProvider. Consider renaming to create_test_mla_config for consistency with the updated implementation. The test comment on line 1229 also references "Deepseek" but tests MLAModelProvider.

♻️ Suggested rename
-def create_test_deepseek_config(**kwargs: Any) -> MLAModelProvider:
-    """Creates an instance of MLAModelProvider for testing."""
+def create_test_mla_config(**kwargs: Any) -> MLAModelProvider:
+    """Creates an instance of MLAModelProvider for testing MLA-based models."""
     defaults = {
         "num_layers": 1,
         "hidden_size": 128,
         "num_attention_heads": 4,
         "seq_length": 512,
         "apply_rope_fusion": False,
     }
     defaults.update(kwargs)
     return MLAModelProvider(**defaults)

Also update the reference on line 1225 and comment on line 1229:

-    @pytest.mark.parametrize("model_factory", [create_test_gpt_config, create_test_deepseek_config])
+    @pytest.mark.parametrize("model_factory", [create_test_gpt_config, create_test_mla_config])
     def test_default_pipeline_dtype(self, model_factory, monkeypatch):
         """
-        Test pipeline_dtype is automatically set if None and PP enabled.
-        Test for both GPT and Deepseek to test both TransformerConfig types.
+        Test pipeline_dtype is automatically set if None and PP enabled.
+        Test for both GPT and MLA to test both TransformerConfig types.
         """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/training/test_config.py` around lines 79 - 89, Rename the
helper function create_test_deepseek_config to create_test_mla_config and update
all call sites and comments that still reference "Deepseek" to instead reference
MLAModelProvider/MLA (e.g., change the function name and any test references and
the comment that mentions Deepseek to reflect MLAModelProvider), ensuring the
factory still returns MLAModelProvider(**defaults) and preserving the same
default kwargs.
tests/unit_tests/training/test_log_non_default_values.py (2)

58-71: Test logic is correct; consider renaming for clarity.

The test correctly verifies that MLAModelProvider returns MCoreMLATransformerConfig as its parent, and the docstring is accurate. The method name test_deepseek_provider_returns_mla_transformer_config could be renamed to test_mla_provider_returns_mla_transformer_config for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/training/test_log_non_default_values.py` around lines 58 -
71, Rename the test function to match the subject under test: change the
function name test_deepseek_provider_returns_mla_transformer_config to
test_mla_provider_returns_mla_transformer_config in the test file so it clearly
reflects MLAModelProvider; keep the existing docstring and assertions (including
the reference to _get_mcore_transformer_parent and MCoreMLATransformerConfig)
unchanged.

299-315: Test logic is correct; naming could be improved for consistency.

The test correctly verifies that MLAModelProvider uses MLATransformerConfig for comparison. However, the method name, docstring, and variable name still reference "DeepSeek" while testing the generic MLAModelProvider.

♻️ Suggested naming updates
     @patch("megatron.bridge.training.config.print_rank_0")
-    def test_handles_deepseek_model_correctly(self, mock_print_rank_0):
-        """Should use MLATransformerConfig for DeepSeek models."""
-        deepseek_model = MLAModelProvider(
+    def test_handles_mla_model_correctly(self, mock_print_rank_0):
+        """Should use MLATransformerConfig for MLA-based models."""
+        mla_model = MLAModelProvider(
             num_layers=2,
             hidden_size=128,
             num_attention_heads=4,
             seq_length=512,
         )

-        cfg = self._create_minimal_config_container(model_provider=deepseek_model)
+        cfg = self._create_minimal_config_container(model_provider=mla_model)

         cfg.log_non_default_values()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/training/test_log_non_default_values.py` around lines 299 -
315, Rename references to "DeepSeek" in the test to reflect the generic
MLAModelProvider: update the test method name
test_handles_deepseek_model_correctly, its docstring, and the variable
deepseek_model to names that reference MLAModelProvider (e.g.,
test_handles_mla_model_correctly and mla_model) so they match the actual class
under test (MLAModelProvider); keep the assertions and usage of
mock_print_rank_0, cfg, and cfg.log_non_default_values() unchanged and ensure
the assertion still verifies "MLATransformerConfig" appears in
mock_print_rank_0.call_args[0][0].
examples/conversion/compare_hf_and_megatron/compare.py (1)

612-613: Also disable grad_scale_func when turning off MTP.

For consistency with the existing MTP-disable pattern, set m.config.grad_scale_func = None alongside mtp_num_layers = None.

🔧 Suggested patch
     for m in megatron_model:
         m.config.mtp_num_layers = None
+        m.config.grad_scale_func = None

Based on learnings: In examples/conversion/compare_hf_and_megatron/compare.py, disabling MTP for inference used m.config.mtp_num_layers = None and m.config.grad_scale_func = None together.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/conversion/compare_hf_and_megatron/compare.py` around lines 612 -
613, When disabling MTP in the loop over megatron_model, also set the grad scale
function to None: inside the for m in megatron_model loop where you currently do
m.config.mtp_num_layers = None, add m.config.grad_scale_func = None so both
m.config.mtp_num_layers and m.config.grad_scale_func are cleared to disable MTP
consistently for inference.
src/megatron/bridge/recipes/moonlight/moonlight_16b.py (1)

199-255: Deduplicate shared MLAModelProvider kwargs between SFT and PEFT.

These two constructor blocks are large and mostly identical, which increases config drift risk.

♻️ Suggested refactor sketch
+def _moonlight_16b_model_kwargs(*, tp: int, ep: int, sequence_parallel: bool) -> dict:
+    return dict(
+        num_layers=27,
+        hidden_size=2048,
+        ffn_hidden_size=11264,
+        num_attention_heads=16,
+        kv_channels=16,
+        q_lora_rank=None,
+        kv_lora_rank=512,
+        num_moe_experts=64,
+        moe_ffn_hidden_size=1408,
+        moe_shared_expert_intermediate_size=2816,
+        moe_layer_freq=[0] * 1 + [1] * 26,
+        moe_router_topk=6,
+        moe_router_num_groups=1,
+        moe_router_group_topk=1,
+        moe_router_topk_scaling_factor=2.446,
+        moe_aux_loss_coeff=0.001,
+        make_vocab_size_divisible_by=1280,
+        moe_router_score_function="sigmoid",
+        moe_router_enable_expert_bias=True,
+        rotary_scaling_factor=1.0,
+        mscale=1.0,
+        mscale_all_dim=1.0,
+        rotary_base=50000,
+        layernorm_epsilon=1e-5,
+        init_method_std=0.02,
+        moe_router_bias_update_rate=1e-3,
+        rotary_percent=1.0,
+        vocab_size=163842,
+        normalization="RMSNorm",
+        activation_func=F.silu,
+        gated_linear_unit=True,
+        position_embedding_type="rope",
+        add_bias_linear=False,
+        share_embeddings_and_output_weights=False,
+        qk_layernorm=True,
+        bf16=True,
+        params_dtype=torch.bfloat16,
+        moe_grouped_gemm=True,
+        moe_token_dispatcher_type="alltoall",
+        tensor_model_parallel_size=tp,
+        pipeline_model_parallel_size=1,
+        pipeline_dtype=torch.bfloat16,
+        virtual_pipeline_model_parallel_size=None,
+        context_parallel_size=1,
+        expert_model_parallel_size=ep,
+        sequence_parallel=sequence_parallel,
+        expert_tensor_parallel_size=1,
+        recompute_granularity="selective",
+        recompute_modules=None,
+        recompute_method=None,
+        recompute_num_layers=None,
+    )
@@
-    cfg.model = MLAModelProvider(
-        ...
-    )
+    cfg.model = MLAModelProvider(**_moonlight_16b_model_kwargs(tp=2, ep=8, sequence_parallel=True))
@@
-    cfg.model = MLAModelProvider(
-        ...
-    )
+    cfg.model = MLAModelProvider(**_moonlight_16b_model_kwargs(tp=1, ep=2, sequence_parallel=False))

Also applies to: 413-468

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py` around lines 199 -
255, The MLAModelProvider constructor kwargs are duplicated between the SFT and
PEFT blocks (MLAModelProvider in moonlight_16b.py); extract the shared keyword
args into a single dict (e.g., common_model_kwargs or common_mla_config) and
pass it into both MLAModelProvider(...) calls (merge per-block overrides after
the shared dict), so SFT and PEFT only specify differences. Update any
references to parameters like num_layers, hidden_size, moe_*, rotary_*,
normalization, activation_func, bf16, params_dtype, parallelism and recompute_*
to live in the shared dict and leave only unique overrides in the individual
SFT/PEFT constructor calls.
tests/unit_tests/recipes/test_glm45_recipes.py (1)

138-153: Extract common monkeypatch setup into a pytest fixture.

This setup pattern is repeated and should be centralized to reduce drift across tests.

♻️ Example fixture extraction
+@pytest.fixture
+def patch_glm_recipe_io(monkeypatch: pytest.MonkeyPatch):
+    def _apply(mod, needs_tokenizer: bool):
+        monkeypatch.setattr(mod, "AutoBridge", _FakeBridge)
+        if needs_tokenizer:
+            import transformers
+            monkeypatch.setattr(
+                transformers,
+                "AutoTokenizer",
+                type("FakeAutoTokenizer", (), {"from_pretrained": staticmethod(lambda *args, **kwargs: _FakeTokenizer())}),
+            )
+    return _apply
@@
-def test_each_glm45_recipe_builds_config(recipe_func: Callable, monkeypatch: pytest.MonkeyPatch):
+def test_each_glm45_recipe_builds_config(recipe_func: Callable, patch_glm_recipe_io):
@@
-    monkeypatch.setattr(mod, "AutoBridge", _FakeBridge)
+    patch_glm_recipe_io(mod, is_sft_or_peft)

As per coding guidelines, "tests/**/*.py: Use pytest fixtures for common setup in unit tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/recipes/test_glm45_recipes.py` around lines 138 - 153,
Extract the repeated monkeypatch setup into a pytest fixture (e.g.,
fake_bridge_and_tokenizer) that accepts monkeypatch and recipe_func (or uses
request.param) and performs the same work: import the recipe module via
recipe_func.__module__, monkeypatch the module's AutoBridge to _FakeBridge,
compute is_sft_or_peft by checking "sft" or "peft" in
recipe_func.__name__.lower(), and when true monkeypatch
transformers.AutoTokenizer to a FakeAutoTokenizer that returns
_FakeTokenizer.from_pretrained; replace the inline code in tests with a use of
this fixture to centralize the AutoBridge/AutoTokenizer mocking and avoid
duplication.
examples/models/minimax_m2/inference.sh (1)

1-24: Add strict shell mode for safer multi-node execution.

Please add fail-fast shell options so partial failures do not continue silently.

♻️ Suggested hardening
 #!/usr/bin/env bash
+set -euo pipefail
+
 # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
@@
-uv run python -m torch.distributed.run --nproc_per_node=8 \
+uv run --no-sync python -m torch.distributed.run --nproc_per_node=8 \

As per coding guidelines, "**/*.sh: Follow Google Shell Style Guide." Based on learnings, "prefer including `--no-sync` when dependencies are already installed and environment/toolchain compatibility is known."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/minimax_m2/inference.sh` around lines 1 - 24, Add fail-fast
shell options and the recommended --no-sync flag: at the top of inference.sh
(right after the shebang) enable strict mode by adding "set -euo pipefail"
(optionally "set -x" for debug) so the script exits on any error/unset
variable/pipe failure, and update the distributed run command line (the "uv run
python -m torch.distributed.run --nproc_per_node=8 ...") to include the
"--no-sync" option when appropriate to avoid unnecessary dependency sync; ensure
any future shell variables in this file are quoted to remain safe under "set
-u".
src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py (1)

37-77: Extract a shared Nano-v2 provider factory to reduce drift risk.

These six MambaModelProvider blocks duplicate a large set of identical fields. A small helper with variant-specific overrides would make updates safer and easier.

♻️ Refactor direction (example)
+def _build_nemotron_nano_v2_provider(
+    *,
+    variant: str,  # "9b" | "12b"
+    seq_length: int,
+    tensor_model_parallel_size: int,
+    sequence_parallel: bool,
+) -> MambaModelProvider:
+    base_kwargs = dict(
+        mamba_num_groups=8,
+        num_query_groups=8,
+        make_vocab_size_divisible_by=128,
+        activation_func=squared_relu,
+        masked_softmax_fusion=True,
+        apply_query_key_layer_scaling=False,
+        persist_layer_norm=True,
+        attention_softmax_in_fp32=False,
+        first_last_layers_bf16=True,
+        is_hybrid_model=True,
+        moe_aux_loss_coeff=0.0001,
+        moe_router_score_function="sigmoid",
+        moe_router_enable_expert_bias=True,
+        moe_router_load_balancing_type="seq_aux_loss",
+        moe_router_dtype="fp32",
+        moe_grouped_gemm=True,
+        moe_token_dispatcher_type="alltoall",
+        moe_permute_fusion=True,
+        moe_shared_expert_overlap=True,
+        pipeline_model_parallel_size=1,
+        pipeline_dtype=torch.bfloat16,
+        virtual_pipeline_model_parallel_size=None,
+        context_parallel_size=1,
+    )
+    variant_kwargs = {
+        "9b": dict(...),
+        "12b": dict(...),
+    }[variant]
+    return MambaModelProvider(
+        **base_kwargs,
+        **variant_kwargs,
+        seq_length=seq_length,
+        tensor_model_parallel_size=tensor_model_parallel_size,
+        sequence_parallel=sequence_parallel,
+    )

Also applies to: 172-212, 309-349, 435-475, 571-610, 718-757

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py` around lines 37 -
77, Multiple identical MambaModelProvider blocks (e.g., the cfg.model
assignment) should be consolidated into a shared factory to avoid drift; create
a helper function (e.g., make_nano_v2_provider or create_nano_v2_provider) that
returns a MambaModelProvider pre-populated with the common fields shown
(hybrid_override_pattern, num_layers, hidden_size, mamba_num_heads, kv_channels,
mamba_state_dim, ffn_hidden_size, num_attention_heads, mamba_head_dim,
seq_length, mamba_num_groups, num_query_groups, make_vocab_size_divisible_by,
activation_func, masked_softmax_fusion, apply_query_key_layer_scaling,
persist_layer_norm, attention_softmax_in_fp32, first_last_layers_bf16,
is_hybrid_model, moe_* settings, parallelism defaults, etc.), then replace each
duplicated MambaModelProvider instantiation (including cfg.model and the other
five blocks) with a call to this factory passing only variant-specific overrides
(e.g., tensor_model_parallel_size, pipeline_model_parallel_size, pipeline_dtype,
virtual_pipeline_model_parallel_size, context_parallel_size, sequence_parallel)
so updates to common settings are made in one place.
tests/functional_tests/training/test_decentralized_pg.py (1)

104-140: Extract a shared model-config factory to reduce maintenance drift across duplicated test configurations.

The same GPTModelProvider baseline is repeated across six test functions. A local helper with base kwargs and per-test overrides would eliminate drift-prone duplication and make test-specific variations clearer.

♻️ Refactor sketch
+    def _create_model_cfg(self, **overrides):
+        base = dict(
+            normalization="RMSNorm",
+            activation_func=F.silu,
+            gated_linear_unit=True,
+            position_embedding_type="rope",
+            add_bias_linear=False,
+            attention_dropout=0.0,
+            hidden_dropout=0.0,
+            bias_activation_fusion=True,
+            masked_softmax_fusion=True,
+            persist_layer_norm=True,
+            bias_dropout_fusion=True,
+            apply_rope_fusion=True,
+            num_query_groups=8,
+            init_method_std=0.02,
+            layernorm_epsilon=1e-05,
+            rotary_percent=1.0,
+            rope_scaling=True,
+            rope_scaling_factor=32.0,
+            rotary_base=500_000,
+            hidden_size=2048,
+            ffn_hidden_size=8192,
+            num_attention_heads=32,
+            attention_softmax_in_fp32=True,
+            pipeline_dtype=torch.bfloat16,
+            bf16=True,
+            make_vocab_size_divisible_by=128,
+            vocab_size=None,
+        )
+        base.update(overrides)
+        return GPTModelProvider(**base)

Applies to lines 104–140, 255–290, 410–446, 566–602, 722–758, 878–914.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/training/test_decentralized_pg.py` around lines 104 -
140, The repeated GPTModelProvider instantiation should be refactored into a
shared factory: create a local helper (e.g., make_gpt_model_cfg or
gpt_model_provider_factory) that constructs the base kwargs shown
(normalization, activation_func, gated_linear_unit, position_embedding_type,
etc.), accepts overrides for per-test changes, and returns a GPTModelProvider
instance; replace the six duplicated blocks that assign model_cfg with calls to
this helper, passing only test-specific overrides (like seq_length, num_layers,
share_embeddings_and_output_weights), so maintenance drift is eliminated and
variations remain explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/models/glm_vl/glm_45v_provider.py`:
- Around line 17-18: The file mixes old typing (Union/Optional/List) and unsafe
dataclass defaults: replace Union/Optional/List annotations with modern PEP
604/PEP 585 syntax (e.g., A | B, list[T], T | None) for the annotated variables
and function signatures (search for uses in e.g., any function or dataclass
fields around init_method_std and other annotated parameters), change the
annotation of init_method_std from int to float to match its float literal
default, and remove direct calls to functools.partial in dataclass defaults by
using dataclasses.field(default_factory=...) to construct the partial at
instance creation time (or convert to a module-level factory function) for
fields currently assigned partial(...) so you comply with RUF009 and
UP006/UP007.

In `@src/megatron/bridge/models/minimax_m2/minimax_m2_bridge.py`:
- Around line 27-31: The bridge currently registered with MegatronModelBridge
(decorator usage around MiniMaxM2ForCausalLM -> GPTModel) should be split to
follow repo conventions: move the orchestration/bridge implementation (the class
and register_bridge call) into a new model_bridge.py inside the model package,
and extract all parameter mapping logic into a new param_mapping.py; ensure
model_bridge.py imports the mapping functions/classes from param_mapping.py and
registers the bridge (same class name used now), and move the code currently
spanning the mapping section (previously around lines 95-149) into
param_mapping.py as exported mapping utilities referenced by model_bridge.py.
- Around line 73-82: The code currently silences unsupported QK-norm by setting
provider.qk_layernorm = False which drops q_norm.weight and k_norm.weight;
instead, make conversion fail-fast by detecting when those weights are present
and raising a NotImplementedError (or require an explicit opt-in flag) so users
cannot mistakenly get a misleading “successful” conversion. Update the logic
around provider.qk_layernorm in minimax_m2_bridge (and any conversion entrypoint
that calls it) to check for q_norm.weight/k_norm.weight in the source model and
either raise NotImplementedError with a descriptive message referencing
q_norm.weight and k_norm.weight and provider.qk_layernorm, or accept an explicit
parameter (e.g., allow_inexact_qk_norm) that must be true to proceed; ensure the
error message tells users how to opt in if desired.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py`:
- Around line 211-259: The recipe passes raw strings to
MambaModelProvider.attention_backend but the provider expects the AttnBackend
enum; import AttnBackend and replace string usages (e.g.,
attention_backend="fused" and any later assignments like
cfg.model.attention_backend = "fused") with the corresponding AttnBackend enum
member (for example AttnBackend.fused or AttnBackend.auto as appropriate) so all
occurrences in this recipe (including the MambaModelProvider constructor and
subsequent cfg.model.attention_backend assignments) use the enum type instead of
raw strings.
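
The string-vs-enum mismatch can be sketched with a minimal stand-in (AttnBackend and Provider below are hypothetical reductions, not the real megatron.core types):

```python
from dataclasses import dataclass
from enum import Enum


class AttnBackend(Enum):
    # Stand-in for megatron.core's AttnBackend enum; member names are illustrative.
    auto = "auto"
    fused = "fused"
    flash = "flash"


@dataclass
class Provider:
    # Stand-in for MambaModelProvider's attention_backend field.
    attention_backend: AttnBackend = AttnBackend.auto


# A raw string slips through the dataclass but never compares equal to the enum member,
# so downstream checks like `backend is AttnBackend.fused` silently fail:
p_bad = Provider(attention_backend="fused")
assert p_bad.attention_backend != AttnBackend.fused

# Passing the enum member behaves as the provider expects:
p_ok = Provider(attention_backend=AttnBackend.fused)
assert p_ok.attention_backend is AttnBackend.fused
```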

In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py`:
- Around line 443-456: The NemotronH 56B configs create MambaModelProvider with
attention_backend=AttnBackend.auto but later assignments overwrite it to None;
update the code so the constructor parameter is authoritative by removing the
subsequent assignments that set attention_backend = None (or, if auto is
intended later, replace those None assignments with AttnBackend.auto) in each
config block where cfg.model = MambaModelProvider(...) appears (e.g., the block
starting with cfg.model = MambaModelProvider and the later places noted around
the other config functions); ensure only one definitive assignment to
attention_backend remains so AttnBackend.auto is not silently discarded.

In `@tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py`:
- Around line 169-173: Replace the fragile "assert False" in the failure path
with an explicit exception so failures are raised even under Python -O;
specifically, where you check result.returncode != 0 (using variables result and
test_name), keep the STDOUT/STDERR prints but replace the assert False line with
a raise (e.g., raise RuntimeError or AssertionError) that includes the same
message and return code information so the test fails reliably and provides
context.
- Around line 15-17: The test uses hardcoded interpreter and absolute repo paths
and a bare assert; update the subprocess invocation(s) in
test_minimax_m2_conversion.py to use sys.executable instead of "python", compute
repository paths relative to the test file using
Path(__file__).resolve().parents[...] (replace occurrences of
"/opt/Megatron-Bridge/" with the computed path) and pass cwd to subprocess.run,
use capture_output=True/text=True and check=True for portable behavior, and wrap
the subprocess call in a try/except catching subprocess.CalledProcessError (or
Exception) and call pytest.fail(...) with the captured stdout/stderr on error
instead of using assert False. Ensure you update imports to include sys and
pytest and adjust the function or block that performs the conversion accordingly
(identify the subprocess.run call and its surrounding error handling).

In `@tests/unit_tests/models/glm_vl/test_glm_45v_provider.py`:
- Around line 240-243: The test class TestGLM45VModelProviderInheritance is
missing a pytest category marker; add the `@pytest.mark.unit` decorator above the
class definition to categorize it as a unit test and ensure pytest is imported
(add "import pytest" at the top if not already present). Locate the class named
TestGLM45VModelProviderInheritance and prepend the decorator line
`@pytest.mark.unit` directly above the class declaration so the test follows the
repository's pytest.mark conventions.

---

Outside diff comments:
In `@src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py`:
- Around line 79-151: The provide() method accepts vp_stage but never uses or
validates it, which can silently construct an incorrect model under virtual
pipeline; update provide() (in nemotron_vl_provider.py) to validate vp_stage at
the start (e.g., ensure it's None or within allowed stage range) and either pass
it into the LLaVAModel constructor if that class supports a virtual pipeline
stage parameter or raise a clear exception when an unsupported vp_stage is
given; reference the provide() function and the vp_stage parameter and ensure
the guard runs before assembling language_cfg/vision_cfg and creating the
LLaVAModel so incorrect scopes cannot be built.

In `@tests/unit_tests/recipes/test_run_plugins.py`:
- Around line 77-102: The model config currently hardcodes seq_length=8192 in
the GPTModelProvider call, ignoring the caller-provided seq_length variable;
update the GPTModelProvider instantiation to use the seq_length variable (the
one popped from kwargs at the top of the function) instead of the literal 8192
so the model config matches the dataset/training sequence length and respects
overrides.

---

Nitpick comments:
In `@examples/conversion/compare_hf_and_megatron/compare.py`:
- Around line 612-613: When disabling MTP in the loop over megatron_model, also
set the grad scale function to None: inside the for m in megatron_model loop
where you currently do m.config.mtp_num_layers = None, add
m.config.grad_scale_func = None so both m.config.mtp_num_layers and
m.config.grad_scale_func are cleared to disable MTP consistently for inference.

In `@examples/models/minimax_m2/inference.sh`:
- Around line 1-24: Add fail-fast shell options and the recommended --no-sync
flag: at the top of inference.sh (right after the shebang) enable strict mode by
adding "set -euo pipefail" (optionally "set -x" for debug) so the script exits
on any error/unset variable/pipe failure, and update the distributed run command
line (the "uv run python -m torch.distributed.run --nproc_per_node=8 ...") to
include the "--no-sync" option when appropriate to avoid unnecessary dependency
sync; ensure any future shell variables in this file are quoted to remain safe
under "set -u".
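
A minimal sketch of that header (NPROC and the commented launch line are illustrative, not the script's actual contents):

```shell
#!/usr/bin/env bash
# Fail fast: exit on any error (-e), unset variable (-u), or pipeline failure (pipefail).
set -euo pipefail

# Quote every expansion so the script stays safe under `set -u`.
NPROC="${NPROC:-8}"
echo "launching with ${NPROC} processes"

# Hypothetical launch line mirroring the suggestion; `uv run --no-sync` skips
# re-syncing the environment before running the command:
# uv run --no-sync python -m torch.distributed.run --nproc_per_node="${NPROC}" ...
```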

In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py`:
- Around line 199-255: The MLAModelProvider constructor kwargs are duplicated
between the SFT and PEFT blocks (MLAModelProvider in moonlight_16b.py); extract
the shared keyword args into a single dict (e.g., common_model_kwargs or
common_mla_config) and pass it into both MLAModelProvider(...) calls (merge
per-block overrides after the shared dict), so SFT and PEFT only specify
differences. Update any references to parameters like num_layers, hidden_size,
moe_*, rotary_*, normalization, activation_func, bf16, params_dtype, parallelism
and recompute_* to live in the shared dict and leave only unique overrides in
the individual SFT/PEFT constructor calls.
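
The shared-dict merge suggested above can be sketched with a hypothetical stand-in config class (ModelCfg and its fields are illustrative, not the real MLAModelProvider):

```python
from dataclasses import dataclass


@dataclass
class ModelCfg:
    # Stand-in for MLAModelProvider; field names are illustrative.
    num_layers: int
    hidden_size: int
    bf16: bool
    seq_length: int = 2048


# Shared kwargs live in one place...
common_model_kwargs = dict(num_layers=27, hidden_size=2048, bf16=True)

# ...and each block merges in only its own overrides (later keys win):
sft_cfg = ModelCfg(**{**common_model_kwargs, "seq_length": 4096})
peft_cfg = ModelCfg(**{**common_model_kwargs, "seq_length": 2048})

assert sft_cfg.hidden_size == peft_cfg.hidden_size == 2048
assert sft_cfg.seq_length == 4096
```

Updating a common setting then touches only `common_model_kwargs`, while the SFT/PEFT call sites keep their differences explicit.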

In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py`:
- Around line 37-77: Multiple identical MambaModelProvider blocks (e.g., the
cfg.model assignment) should be consolidated into a shared factory to avoid
drift; create a helper function (e.g., make_nano_v2_provider or
create_nano_v2_provider) that returns a MambaModelProvider pre-populated with
the common fields shown (hybrid_override_pattern, num_layers, hidden_size,
mamba_num_heads, kv_channels, mamba_state_dim, ffn_hidden_size,
num_attention_heads, mamba_head_dim, seq_length, mamba_num_groups,
num_query_groups, make_vocab_size_divisible_by, activation_func,
masked_softmax_fusion, apply_query_key_layer_scaling, persist_layer_norm,
attention_softmax_in_fp32, first_last_layers_bf16, is_hybrid_model, moe_*
settings, parallelism defaults, etc.), then replace each duplicated
MambaModelProvider instantiation (including cfg.model and the other five blocks)
with a call to this factory passing only variant-specific overrides (e.g.,
tensor_model_parallel_size, pipeline_model_parallel_size, pipeline_dtype,
virtual_pipeline_model_parallel_size, context_parallel_size, sequence_parallel)
so updates to common settings are made in one place.

In `@tests/functional_tests/training/test_decentralized_pg.py`:
- Around line 104-140: The repeated GPTModelProvider instantiation should be
refactored into a shared factory: create a local helper (e.g.,
make_gpt_model_cfg or gpt_model_provider_factory) that constructs the base
kwargs shown (normalization, activation_func, gated_linear_unit,
position_embedding_type, etc.), accepts overrides for per-test changes, and
returns a GPTModelProvider instance; replace the six duplicated blocks that
assign model_cfg with calls to this helper, passing only test-specific overrides
(like seq_length, num_layers, share_embeddings_and_output_weights), so
maintenance drift is eliminated and variations remain explicit.

In `@tests/unit_tests/recipes/test_glm45_recipes.py`:
- Around line 138-153: Extract the repeated monkeypatch setup into a pytest
fixture (e.g., fake_bridge_and_tokenizer) that accepts monkeypatch and
recipe_func (or uses request.param) and performs the same work: import the
recipe module via recipe_func.__module__, monkeypatch the module's AutoBridge to
_FakeBridge, compute is_sft_or_peft by checking "sft" or "peft" in
recipe_func.__name__.lower(), and when true monkeypatch
transformers.AutoTokenizer to a FakeAutoTokenizer that returns
_FakeTokenizer.from_pretrained; replace the inline code in tests with a use of
this fixture to centralize the AutoBridge/AutoTokenizer mocking and avoid
duplication.

In `@tests/unit_tests/recipes/test_moonlight_recipes.py`:
- Line 18: Update the stale docstring and fake class name so they match the
current provider type: change any docstring text referencing
MoonlightModelProvider16B to reference MLAModelProvider (or the current provider
name) and rename the fake class _FakeMoonlightModelProvider16B to
_FakeMLAModelProvider (or similar) and update all monkeypatch usages that
reference the old name (the test class and each place where the fake is injected
in the tests). Ensure the docstring, the fake class declaration, and every
usage/monkeypatch that currently points to _FakeMoonlightModelProvider16B are
updated to the new MLAModelProvider-based names so they are consistent.

In `@tests/unit_tests/training/test_config.py`:
- Around line 79-89: Rename the helper function create_test_deepseek_config to
create_test_mla_config and update all call sites and comments that still
reference "Deepseek" to instead reference MLAModelProvider/MLA (e.g., change the
function name and any test references and the comment that mentions Deepseek to
reflect MLAModelProvider), ensuring the factory still returns
MLAModelProvider(**defaults) and preserving the same default kwargs.

In `@tests/unit_tests/training/test_log_non_default_values.py`:
- Around line 58-71: Rename the test function to match the subject under test:
change the function name test_deepseek_provider_returns_mla_transformer_config
to test_mla_provider_returns_mla_transformer_config in the test file so it
clearly reflects MLAModelProvider; keep the existing docstring and assertions
(including the reference to _get_mcore_transformer_parent and
MCoreMLATransformerConfig) unchanged.
- Around line 299-315: Rename references to "DeepSeek" in the test to reflect
the generic MLAModelProvider: update the test method name
test_handles_deepseek_model_correctly, its docstring, and the variable
deepseek_model to names that reference MLAModelProvider (e.g.,
test_handles_mla_model_correctly and mla_model) so they match the actual class
under test (MLAModelProvider); keep the assertions and usage of
mock_print_rank_0, cfg, and cfg.log_non_default_values() unchanged and ensure
the assertion still verifies "MLATransformerConfig" appears in
mock_print_rank_0.call_args[0][0].

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 383b610 and 1434650.

📒 Files selected for processing (70)
  • examples/conversion/compare_hf_and_megatron/compare.py
  • examples/models/minimax_m2/conversion.sh
  • examples/models/minimax_m2/inference.sh
  • examples/models/minimax_m2/verify_toy_model.py
  • src/megatron/bridge/models/__init__.py
  • src/megatron/bridge/models/conversion/model_bridge.py
  • src/megatron/bridge/models/deepseek/__init__.py
  • src/megatron/bridge/models/deepseek/deepseek_provider.py
  • src/megatron/bridge/models/glm/__init__.py
  • src/megatron/bridge/models/glm/glm45_provider.py
  • src/megatron/bridge/models/glm_vl/glm_45v_provider.py
  • src/megatron/bridge/models/gpt_oss/__init__.py
  • src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py
  • src/megatron/bridge/models/hf_pretrained/utils.py
  • src/megatron/bridge/models/kimi/__init__.py
  • src/megatron/bridge/models/kimi/kimi_provider.py
  • src/megatron/bridge/models/llama/__init__.py
  • src/megatron/bridge/models/llama/llama_provider.py
  • src/megatron/bridge/models/llama_nemotron/__init__.py
  • src/megatron/bridge/models/llama_nemotron/llama_nemotron_bridge.py
  • src/megatron/bridge/models/llama_nemotron/llama_nemotron_provider.py
  • src/megatron/bridge/models/minimax_m2/__init__.py
  • src/megatron/bridge/models/minimax_m2/minimax_m2_bridge.py
  • src/megatron/bridge/models/nemotron/__init__.py
  • src/megatron/bridge/models/nemotron/nemotron_provider.py
  • src/megatron/bridge/models/nemotron_vl/__init__.py
  • src/megatron/bridge/models/nemotron_vl/nemotron_vl_bridge.py
  • src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py
  • src/megatron/bridge/models/nemotronh/__init__.py
  • src/megatron/bridge/models/nemotronh/nemotron_h_provider.py
  • src/megatron/bridge/recipes/kimi/kimi_k2.py
  • src/megatron/bridge/recipes/moonlight/moonlight_16b.py
  • src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py
  • src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py
  • src/megatron/bridge/recipes/nemotronh/nemotronh.py
  • tests/functional_tests/data/test_samplers.py
  • tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py
  • tests/functional_tests/models/minimax_m2/__init__.py
  • tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py
  • tests/functional_tests/training/test_callbacks.py
  • tests/functional_tests/training/test_decentralized_pg.py
  • tests/functional_tests/training/test_finetune_dora.py
  • tests/functional_tests/training/test_finetune_lora.py
  • tests/functional_tests/training/test_inprocess_restart.py
  • tests/functional_tests/training/test_megatron_fsdp.py
  • tests/functional_tests/training/test_nvrx_straggler.py
  • tests/functional_tests/training/test_pretrain.py
  • tests/functional_tests/training/test_pretrain_resume.py
  • tests/functional_tests/training/test_sample_based_training.py
  • tests/functional_tests/training/test_sft.py
  • tests/functional_tests/training/test_tensor_inspect.py
  • tests/unit_tests/models/deepseek/test_deepseek_provider.py
  • tests/unit_tests/models/glm/test_glm45_provider.py
  • tests/unit_tests/models/glm_vl/test_glm_45v_provider.py
  • tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py
  • tests/unit_tests/models/kimi/test_kimi_provider.py
  • tests/unit_tests/models/llama/test_llama_provider.py
  • tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py
  • tests/unit_tests/models/nemotron/test_nemotron_provider.py
  • tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py
  • tests/unit_tests/models/test_models_imports.py
  • tests/unit_tests/recipes/kimi/test_kimi_k2.py
  • tests/unit_tests/recipes/nemotronh/test_nemotron_3_nano.py
  • tests/unit_tests/recipes/nemotronh/test_nemotron_nano_v2.py
  • tests/unit_tests/recipes/nemotronh/test_nemotronh.py
  • tests/unit_tests/recipes/test_glm45_recipes.py
  • tests/unit_tests/recipes/test_moonlight_recipes.py
  • tests/unit_tests/recipes/test_run_plugins.py
  • tests/unit_tests/training/test_config.py
  • tests/unit_tests/training/test_log_non_default_values.py
💤 Files with no reviewable changes (22)
  • src/megatron/bridge/models/glm/glm45_provider.py
  • tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py
  • tests/unit_tests/models/llama/test_llama_provider.py
  • tests/unit_tests/models/kimi/test_kimi_provider.py
  • src/megatron/bridge/models/nemotron/__init__.py
  • src/megatron/bridge/models/kimi/kimi_provider.py
  • tests/unit_tests/models/deepseek/test_deepseek_provider.py
  • src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py
  • src/megatron/bridge/models/llama_nemotron/__init__.py
  • tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py
  • src/megatron/bridge/models/llama/llama_provider.py
  • src/megatron/bridge/models/deepseek/deepseek_provider.py
  • tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py
  • src/megatron/bridge/models/glm/__init__.py
  • tests/unit_tests/models/glm/test_glm45_provider.py
  • src/megatron/bridge/models/nemotronh/__init__.py
  • src/megatron/bridge/models/nemotron_vl/__init__.py
  • src/megatron/bridge/models/gpt_oss/__init__.py
  • src/megatron/bridge/models/nemotronh/nemotron_h_provider.py
  • src/megatron/bridge/models/nemotron/nemotron_provider.py
  • tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py
  • tests/unit_tests/models/nemotron/test_nemotron_provider.py

Comment on lines +17 to 18
from typing import Callable, List, Optional, Union

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/megatron/bridge/models/glm_vl/glm_45v_provider.py | head -120

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 5018


🏁 Script executed:

ruff check --select RUF009,UP006,UP007 src/megatron/bridge/models/glm_vl/glm_45v_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3574


Modernize typing annotations and fix dataclass defaults to comply with Python 3.10+ guidelines.

The file uses deprecated typing syntax (Union, Optional, List) and violates RUF009 by calling partial() directly in a dataclass default. Additionally, init_method_std is annotated as int but assigned a float literal.

♻️ Proposed changes
-from typing import Callable, List, Optional, Union
+from collections.abc import Callable

@@
-    transformer_layer_spec: Union[ModuleSpec, Callable[["GPTModelProvider"], ModuleSpec]] = partial(
-        get_gpt_decoder_block_spec, use_transformer_engine=HAVE_TE
-    )
+    transformer_layer_spec: ModuleSpec | Callable[["GPTModelProvider"], ModuleSpec] = field(
+        default_factory=lambda: partial(get_gpt_decoder_block_spec, use_transformer_engine=HAVE_TE)
+    )
@@
-    init_method_std: int = 0.02
+    init_method_std: float = 0.02
@@
-    share_embeddings_and_output_weights: Optional[bool] = False
+    share_embeddings_and_output_weights: bool | None = False
@@
-    mrope_section: List[int] = field(default_factory=lambda: [8, 12, 12])
+    mrope_section: list[int] = field(default_factory=lambda: [8, 12, 12])
@@
-    mtp_num_layers: Optional[int] = 1
-    mtp_loss_scaling_factor: Optional[float] = 0.3
+    mtp_num_layers: int | None = 1
+    mtp_loss_scaling_factor: float | None = 0.3
@@
-    moe_layer_freq: Union[int, List[int]] = field(default_factory=lambda: [0] * 1 + [1] * 45)
+    moe_layer_freq: int | list[int] = field(default_factory=lambda: [0] * 1 + [1] * 45)

Ruff confirms UP006, UP007, and RUF009 violations at lines 47, 75, and 105. Lines 56, 59, 97–98 contain similar issues requiring modernization per Python 3.10+ guidelines.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm_vl/glm_45v_provider.py` around lines 17 - 18,
The file mixes old typing (Union/Optional/List) and unsafe dataclass defaults:
replace Union/Optional/List annotations with modern PEP 604/PEP 585 syntax
(e.g., A | B, list[T], T | None) for the annotated variables and function
signatures (search for uses in e.g., any function or dataclass fields around
init_method_std and other annotated parameters), change the annotation of
init_method_std from int to float to match its float literal default, and remove
direct calls to functools.partial in dataclass defaults by using
dataclasses.field(default_factory=...) to construct the partial at instance
creation time (or convert to a module-level factory function) for fields
currently assigned partial(...) so you comply with RUF009 and UP006/UP007.

Comment on lines +27 to +31
@MegatronModelBridge.register_bridge(
source="MiniMaxM2ForCausalLM",
target=GPTModel,
model_type="minimax_m2",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Align MiniMax-M2 bridge layout with repository bridge conventions.

Please move bridge orchestration to model_bridge.py and mapping definitions to param_mapping.py under the same model package to match the established layout and keep future bridge maintenance consistent.

Based on learnings: Applies to src/megatron/bridge/models/*/model_bridge.py : Create model bridge implementation in model_bridge.py when adding new model bridges; applies to src/megatron/bridge/models/*/param_mapping.py : Create parameter mapping in param_mapping.py when adding new model bridges.

Also applies to: 95-149

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/minimax_m2/minimax_m2_bridge.py` around lines 27 -
31, The bridge currently registered with MegatronModelBridge (decorator usage
around MiniMaxM2ForCausalLM -> GPTModel) should be split to follow repo
conventions: move the orchestration/bridge implementation (the class and
register_bridge call) into a new model_bridge.py inside the model package, and
extract all parameter mapping logic into a new param_mapping.py; ensure
model_bridge.py imports the mapping functions/classes from param_mapping.py and
registers the bridge (same class name used now), and move the code currently
spanning the mapping section (previously around lines 95-149) into
param_mapping.py as exported mapping utilities referenced by model_bridge.py.

Comment on lines +73 to +82
# TODO: MiniMax-M2 uses full-dimension QK norm (q_norm weight shape = num_heads * head_dim)
# while Megatron uses per-head QK norm (weight shape = head_dim). These are
# mathematically different (different normalization denominators), so the weights
# cannot be losslessly converted. Disabled here, which means q_norm.weight and
# k_norm.weight are dropped during conversion. This is acceptable for fine-tuning
# but will cause forward-pass divergence from HF for inference.
# Fix: add full-dimension QK norm support to Megatron-Core, or write a custom
# layer spec for this model.
provider.qk_layernorm = False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Do not silently drop unsupported QK-norm weights during conversion.

Current behavior proceeds with conversion while dropping q_norm.weight / k_norm.weight, which can be misread as successful exact conversion. This should fail fast (or require explicit opt-in for inexact conversion).

💡 Suggested guard
-        provider.qk_layernorm = False
+        if getattr(hf_config, "use_qk_norm", False):
+            raise NotImplementedError(
+                "MiniMax-M2 q_norm/k_norm conversion is not supported yet. "
+                "Conversion currently drops these weights and is inexact."
+            )
+        provider.qk_layernorm = False

Based on learnings: In Python implementations, when a feature is not supported, raise an explicit error (e.g., NotImplementedError) with a descriptive message.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/minimax_m2/minimax_m2_bridge.py` around lines 73 -
82, The code currently silences unsupported QK-norm by setting
provider.qk_layernorm = False which drops q_norm.weight and k_norm.weight;
instead, make conversion fail-fast by detecting when those weights are present
and raising a NotImplementedError (or require an explicit opt-in flag) so users
cannot mistakenly get a misleading “successful” conversion. Update the logic
around provider.qk_layernorm in minimax_m2_bridge (and any conversion entrypoint
that calls it) to check for q_norm.weight/k_norm.weight in the source model and
either raise NotImplementedError with a descriptive message referencing
q_norm.weight and k_norm.weight and provider.qk_layernorm, or accept an explicit
parameter (e.g., allow_inexact_qk_norm) that must be true to proceed; ensure the
error message tells users how to opt in if desired.

Comment on lines +211 to +259
# Model config - Nemotron 3 Nano
cfg.model = MambaModelProvider(
# Architecture (Nemotron 3 Nano 30B-A3B)
hybrid_override_pattern="MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME",
num_layers=52,
hidden_size=2688,
mamba_num_heads=64,
kv_channels=128,
mamba_state_dim=128,
ffn_hidden_size=1856,
num_attention_heads=32,
mamba_head_dim=64,
seq_length=2048,
num_query_groups=2,
# MoE
num_moe_experts=128,
moe_ffn_hidden_size=1856,
moe_shared_expert_intermediate_size=3712,
moe_router_topk=6,
moe_router_topk_scaling_factor=2.5,
moe_router_num_groups=1,
moe_router_group_topk=1,
# NemotronH base
mamba_num_groups=8,
make_vocab_size_divisible_by=128,
activation_func=squared_relu,
masked_softmax_fusion=True,
apply_query_key_layer_scaling=False,
persist_layer_norm=True,
attention_softmax_in_fp32=False,
first_last_layers_bf16=True,
is_hybrid_model=True,
moe_aux_loss_coeff=0.0001,
moe_router_score_function="sigmoid",
moe_router_enable_expert_bias=True,
moe_router_load_balancing_type="seq_aux_loss",
moe_router_dtype="fp32",
moe_grouped_gemm=True,
moe_token_dispatcher_type="alltoall",
moe_permute_fusion=True,
moe_shared_expert_overlap=True,
# Extra config
apply_rope_fusion=False,
attention_backend="fused",
gradient_accumulation_fusion=True,
init_method_std=0.0173,
use_fused_weighted_squared_relu=True,
calculate_per_token_loss=True,
# Parallelism
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect attention_backend assignments in this recipe
echo "=== attention_backend in nemotron_3_nano.py ==="
rg -n 'attention_backend\s*=' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py

# Inspect Mamba provider backend typing
echo -e "\n=== MambaModelProvider attention_backend type ==="
rg -n 'attention_backend' src/megatron/bridge/models/mamba/mamba_provider.py | head -20

# Check if AttnBackend is imported/used in Mamba provider
echo -e "\n=== AttnBackend usage in mamba_provider.py ==="
rg -n 'AttnBackend' src/megatron/bridge/models/mamba/mamba_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 610


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check how attention_backend is used in other recipes
echo "=== AttnBackend usage in recipes ==="
rg -l 'AttnBackend' src/megatron/bridge/recipes/

# Check string usage in other recipes
echo -e "\n=== attention_backend=\"fused\" usage in recipes ==="
rg -n 'attention_backend\s*=' src/megatron/bridge/recipes/ | grep -E '(fused|sdpa|flash_attention)' | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 711


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find AttnBackend enum definition
echo "=== AttnBackend enum definition ==="
rg -n 'class AttnBackend|AttnBackend.*=' --type py | grep -E '(class|enum)' | head -20

# Check imports in megatron.core
fd -e py -p 'enums\.py' src/megatron/bridge
rg -n 'AttnBackend' src/megatron/bridge/models/mamba/mamba_provider.py -B5 -A5

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 106


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check how nemotronh.py (parent recipe) uses attention_backend
echo "=== nemotronh.py attention_backend usage ==="
rg -n -B2 -A2 'attention_backend' src/megatron/bridge/recipes/nemotronh/nemotronh.py

# Check the exact lines in the identified ranges
echo -e "\n=== nemotron_3_nano.py lines 211-259 ==="
sed -n '211,259p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py | grep -n 'attention_backend' || echo "attention_backend not in this range"

echo -e "\n=== nemotron_3_nano.py lines 402-449 ==="
sed -n '402,449p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py | grep -n 'attention_backend' || echo "attention_backend not in this range"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3062


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the full context of the flagged assignments
echo "=== Full context: line 254 (in range 211-259) ==="
sed -n '250,260p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py

echo -e "\n=== Full context: line 444 (in range 402-449) ==="
sed -n '440,450p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py

# Check AttnBackend enum import in nemotron_3_nano.py
echo -e "\n=== Imports in nemotron_3_nano.py ==="
head -40 src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py | grep -E '(import|from).*AttnBackend|megatron'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1482


Use AttnBackend enum instead of raw string values for attention_backend in MambaModelProvider.

The MambaModelProvider type signature requires attention_backend: AttnBackend, but the recipe passes raw strings ("fused"). The parent recipe nemotronh.py correctly uses AttnBackend.auto. Add the missing import and update all assignments to use enum values.

♻️ Suggested update
+from megatron.core.transformer.enums import AttnBackend
 from megatron.core.activations import squared_relu
 from megatron.bridge.models.mamba.mamba_provider import MambaModelProvider
@@
-        attention_backend="fused",
+        attention_backend=AttnBackend.fused,
@@
-        attention_backend="fused",
+        attention_backend=AttnBackend.fused,

Also applies to lines 125, 298, and 488 where cfg.model.attention_backend = "fused" should use enum values.

🧰 Tools
🪛 Ruff (0.15.2)

[error] 249-249: Possible hardcoded password assigned to argument: "moe_token_dispatcher_type"

(S106)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py` around lines 211 -
259, The recipe passes raw strings to MambaModelProvider.attention_backend but
the provider expects the AttnBackend enum; import AttnBackend and replace string
usages (e.g., attention_backend="fused" and any later assignments like
cfg.model.attention_backend = "fused") with the corresponding AttnBackend enum
member (for example AttnBackend.fused or AttnBackend.auto as appropriate) so all
occurrences in this recipe (including the MambaModelProvider constructor and
subsequent cfg.model.attention_backend assignments) use the enum type instead of
raw strings.
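The underlying type mismatch is easy to reproduce with a toy enum. `AttnBackend` below is a stand-in that mirrors the shape of the megatron.core enum, not the real import:

```python
from enum import Enum

class AttnBackend(Enum):
    auto = 0
    fused = 1
    flash = 2

# A raw string never compares equal to an enum member, so any downstream
# check like `config.attention_backend == AttnBackend.fused` silently fails:
assert "fused" != AttnBackend.fused

# Looking the string up through the enum restores the expected type:
assert AttnBackend["fused"] is AttnBackend.fused
```

This is why passing `attention_backend="fused"` can type-check loosely at call time yet misbehave wherever the provider compares against enum members.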

Comment on lines +443 to +456
cfg.model = MambaModelProvider(
# Architecture (NemotronH 56B)
hybrid_override_pattern=(
"M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-"
"M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M-"
),
num_layers=118,
hidden_size=8192,
mamba_state_dim=256,
mamba_num_heads=256,
ffn_hidden_size=32768,
num_attention_heads=64,
attention_backend=AttnBackend.auto,
# NemotronH base

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# First, check if the file exists and get its structure
if [ -f "src/megatron/bridge/recipes/nemotronh/nemotronh.py" ]; then
  echo "=== File found. Showing lines around claimed assignments ==="
  
  # Show constructor assignment at 443-456
  echo -e "\n--- Constructor at lines 443-456 ---"
  sed -n '443,456p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
  
  # Show line 524 (first override claim)
  echo -e "\n--- Assignment at line 524 ---"
  sed -n '520,530p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
  
  # Show 969-983 (second constructor claim)
  echo -e "\n--- Constructor at lines 969-983 ---"
  sed -n '969,983p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
  
  # Show line 1032 (second override claim)
  echo -e "\n--- Assignment at line 1032 ---"
  sed -n '1028,1038p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
  
  # Show 1551-1564 (third constructor claim)
  echo -e "\n--- Constructor at lines 1551-1564 ---"
  sed -n '1551,1564p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
  
  # Show line 1613 (third override claim)
  echo -e "\n--- Assignment at line 1613 ---"
  sed -n '1609,1619p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
  
  echo -e "\n--- All attention_backend assignments in file ---"
  rg -n 'attention_backend\s*=' src/megatron/bridge/recipes/nemotronh/nemotronh.py
else
  echo "File not found at src/megatron/bridge/recipes/nemotronh/nemotronh.py"
  echo "Searching for nemotronh files..."
  fd nemotronh.py
fi

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3687


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if AttnBackend has 'auto' variant
echo "=== AttnBackend enum definition ==="
fd AttnBackend -t f | head -10

# Search for AttnBackend enum definition
rg -l "class AttnBackend|enum AttnBackend" --type py

# Look at the definition
rg -A 10 "class AttnBackend" src/megatron/core/transformer/enums.py 2>/dev/null || rg -A 10 "class AttnBackend"

# Verify MambaModelProvider signature
echo -e "\n=== MambaModelProvider constructor ==="
rg -A 5 "class MambaModelProvider|def __init__" src/megatron/bridge/models/mamba/mamba_provider.py | head -40

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 106


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check imports in nemotronh.py
echo "=== Imports in nemotronh.py ==="
head -50 src/megatron/bridge/recipes/nemotronh/nemotronh.py | grep -i "attnbackend\|import"

# Search for AttnBackend definition across codebase
echo -e "\n=== Searching for AttnBackend definition ==="
rg "class AttnBackend|enum.*AttnBackend" -A 5

# Also check what's imported from megatron
echo -e "\n=== Checking for AttnBackend.auto usage ==="
rg "AttnBackend\." --type py | head -20

# Check MambaModelProvider to see if it accepts attention_backend
echo -e "\n=== MambaModelProvider signature ==="
rg -B 5 -A 20 "class MambaModelProvider" src/megatron/bridge/models/mamba/mamba_provider.py | head -50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 9705


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify MambaModelProvider's attention_backend field definition
echo "=== MambaModelProvider field definition (full class) ==="
rg -B 2 -A 40 "class MambaModelProvider" src/megatron/bridge/models/mamba/mamba_provider.py | grep -A 35 "class MambaModelProvider"

# Check if attention_backend is inherited from TransformerConfig
echo -e "\n=== TransformerConfig definition ==="
rg -A 5 "class TransformerConfig" src/megatron/bridge/models/transformer_config.py | head -20

# Verify what the default attention_backend is in MambaModelProvider
echo -e "\n=== MambaModelProvider attention_backend initialization ==="
rg "attention_backend.*=" src/megatron/bridge/models/mamba/mamba_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2083


Consolidate conflicting attention_backend assignments in NemotronH 56B configs.

AttnBackend.auto is set in the constructor, then immediately overwritten to None in the same function, rendering the constructor parameter ineffective. This pattern repeats across three separate configuration functions.

♻️ Suggested fix

Either remove the constructor parameter:

-        attention_backend=AttnBackend.auto,

Or, if auto is the intended behavior, replace the later None assignments:

-    cfg.model.attention_backend = None
+    cfg.model.attention_backend = AttnBackend.auto

Also applies to: lines 524, 969–983, 1032, 1551–1564, 1613

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py` around lines 443 - 456,
The NemotronH 56B configs create MambaModelProvider with
attention_backend=AttnBackend.auto but later assignments overwrite it to None;
update the code so the constructor parameter is authoritative by removing the
subsequent assignments that set attention_backend = None (or, if auto is
intended later, replace those None assignments with AttnBackend.auto) in each
config block where cfg.model = MambaModelProvider(...) appears (e.g., the block
starting with cfg.model = MambaModelProvider and the later places noted around
the other config functions); ensure only one definitive assignment to
attention_backend remains so AttnBackend.auto is not silently discarded.

Comment on lines +15 to +17
import json
import subprocess
from pathlib import Path

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check if the file exists and get its full content
wc -l tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 145


🏁 Script executed:

# Read the test file with line numbers
cat -n tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 7712


Use portable subprocess configuration and explicit exception handling in conversion test.

The subprocess command contains hardcoded interpreter path and absolute repo paths that break outside the specific CI environment. Line 138 uses "python" instead of sys.executable, and lines 146–147 hardcode /opt/Megatron-Bridge/ instead of computing from the test file location. Additionally, line 172 uses assert False which can be optimized away; use pytest.fail() instead.

🔧 Suggested fix
 import json
 import subprocess
+import sys
 from pathlib import Path
@@
-        cmd = [
-            "python",
+        repo_root = Path(__file__).resolve().parents[4]
+        coverage_file = tmp_path / ".coverage"
+        cmd = [
+            sys.executable,
             "-m",
             "torch.distributed.run",
             "--nproc_per_node=2",
             "--nnodes=1",
             "-m",
             "coverage",
             "run",
-            "--data-file=/opt/Megatron-Bridge/.coverage",
-            "--source=/opt/Megatron-Bridge/",
+            f"--data-file={coverage_file}",
+            f"--source={repo_root}",
             "--parallel-mode",
-            "examples/conversion/hf_megatron_roundtrip_multi_gpu.py",
+            str(repo_root / "examples/conversion/hf_megatron_roundtrip_multi_gpu.py"),
@@
             cwd=repo_root,
         )
@@
         if result.returncode != 0:
             print(f"STDOUT: {result.stdout}")
             print(f"STDERR: {result.stderr}")
-            assert False, f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}"
+            pytest.fail(f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py`
around lines 15 - 17, The test uses hardcoded interpreter and absolute repo
paths and a bare assert; update the subprocess invocation(s) in
test_minimax_m2_conversion.py to use sys.executable instead of "python", compute
repository paths relative to the test file using
Path(__file__).resolve().parents[...] (replace occurrences of
"/opt/Megatron-Bridge/" with the computed path) and pass cwd to subprocess.run,
use capture_output=True/text=True and check=True for portable behavior, and wrap
the subprocess call in a try/except catching subprocess.CalledProcessError (or
Exception) and call pytest.fail(...) with the captured stdout/stderr on error
instead of using assert False. Ensure you update imports to include sys and
pytest and adjust the function or block that performs the conversion accordingly
(identify the subprocess.run call and its surrounding error handling).
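The portable pattern the prompt asks for — same interpreter as the test run, repo paths derived from the test file — can be sketched like this. `build_cmd` and the `depth` default are illustrative assumptions matching the review comment, not the test's actual helper:

```python
import sys
from pathlib import Path

def build_cmd(test_file: str, script_rel: str, depth: int = 4) -> list[str]:
    """Build a launch command: current interpreter plus a repo-relative script."""
    # Walk `depth` directories up from the test file to reach the repo root.
    repo_root = Path(test_file).resolve().parents[depth]
    return [sys.executable, str(repo_root / script_rel)]

# Typical usage from inside a test module:
#   result = subprocess.run(build_cmd(__file__, "examples/conversion/..."),
#                           capture_output=True, text=True)
```

Because `sys.executable` points at the interpreter actually running pytest, the same virtualenv (and its installed packages) is reused by the subprocess regardless of what `python` resolves to on `PATH`.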

Comment on lines +169 to +173
if result.returncode != 0:
print(f"STDOUT: {result.stdout}")
print(f"STDERR: {result.stderr}")
assert False, f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}"


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check the file structure and verify the code at lines 169-173
head -180 tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py | tail -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 770


🏁 Script executed:

# Search for all assert False patterns in this file
rg -n 'assert False' tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 181


🏁 Script executed:

# Get full context around lines 169-173
sed -n '160,180p' tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 780


Replace assert False with an explicit exception.

Using assert False in critical failure paths is problematic because asserts can be disabled when Python runs in optimized mode (-O or -OO flags). This is particularly important for functional tests where failures must be explicitly raised regardless of optimization settings.

Suggested fix
-            assert False, f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}"
+            raise AssertionError(
+                f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}"
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if result.returncode != 0:
    print(f"STDOUT: {result.stdout}")
    print(f"STDERR: {result.stderr}")
    assert False, f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}"
if result.returncode != 0:
    print(f"STDOUT: {result.stdout}")
    print(f"STDERR: {result.stderr}")
    raise AssertionError(
        f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}"
    )
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 172-172: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py`
around lines 169 - 173, Replace the fragile "assert False" in the failure path
with an explicit exception so failures are raised even under Python -O;
specifically, where you check result.returncode != 0 (using variables result and
test_name), keep the STDOUT/STDERR prints but replace the assert False line with
a raise (e.g., raise RuntimeError or AssertionError) that includes the same
message and return code information so the test fails reliably and provides
context.
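The `-O` behavior the reviewer warns about is straightforward to demonstrate: under optimized mode a bare `assert False` is compiled out entirely, while an explicit `raise` fires regardless.

```python
import subprocess
import sys

# `assert` statements disappear under -O, so this exits cleanly:
opt = subprocess.run([sys.executable, "-O", "-c", "assert False"])
print(opt.returncode)  # 0 - the assert was stripped before execution

# An explicit raise survives optimization and fails as expected:
raised = subprocess.run(
    [sys.executable, "-O", "-c", "raise AssertionError('boom')"]
)
print(raised.returncode)  # 1 - the exception still fires
```

A test that guards a critical failure path with `assert False` therefore passes silently whenever the suite happens to run under `python -O`, which is exactly what `raise AssertionError(...)` (or `pytest.fail`) avoids.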

Comment on lines 240 to +243
class TestGLM45VModelProviderInheritance:
    """Test inheritance behavior from GLM45AirModelProvider106B."""
    """Test inheritance behavior from GPTModelProvider."""

    def test_glm_45v_inherits_from_air_provider(self, mock_vision_config):
        """Test GLM45VModelProvider inherits from GLM45AirModelProvider106B."""
        from megatron.bridge.models import GLM45AirModelProvider106B
    def test_glm_45v_inherits_from_gpt_provider(self, mock_vision_config):

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether touched test class has a pytest category marker.
rg -n -C1 '^\s*@pytest\.mark\.(unit|integration|system)\b|^class TestGLM45VModelProviderInheritance\b' \
  tests/unit_tests/models/glm_vl/test_glm_45v_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 184


🏁 Script executed:

#!/bin/bash
# Get more context around the class definition (lines 235-250)
head -n 250 tests/unit_tests/models/glm_vl/test_glm_45v_provider.py | tail -n 20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 717


🏁 Script executed:

#!/bin/bash
# Check if other test classes in this file have pytest markers
rg -n '^@pytest\.mark\.(unit|integration|system)|^class Test' \
  tests/unit_tests/models/glm_vl/test_glm_45v_provider.py | head -n 30

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 194


🏁 Script executed:

#!/bin/bash
# Check the first few lines of the test classes to see if ANY have pytest markers
rg -B5 'class TestGLM45VModelProvider' tests/unit_tests/models/glm_vl/test_glm_45v_provider.py | grep -E '@pytest\.mark|class Test'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 183


Add the @pytest.mark.unit decorator to the test class.

The class TestGLM45VModelProviderInheritance lacks a pytest category marker. Per coding guidelines, all tests in tests/**/*.py must be explicitly categorized with pytest.mark (unit, integration, or system).

Proposed change
+@pytest.mark.unit
 class TestGLM45VModelProviderInheritance:
     """Test inheritance behavior from GPTModelProvider."""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
class TestGLM45VModelProviderInheritance:
    """Test inheritance behavior from GLM45AirModelProvider106B."""
    """Test inheritance behavior from GPTModelProvider."""
    def test_glm_45v_inherits_from_air_provider(self, mock_vision_config):
        """Test GLM45VModelProvider inherits from GLM45AirModelProvider106B."""
        from megatron.bridge.models import GLM45AirModelProvider106B
    def test_glm_45v_inherits_from_gpt_provider(self, mock_vision_config):
@pytest.mark.unit
class TestGLM45VModelProviderInheritance:
    """Test inheritance behavior from GPTModelProvider."""
    def test_glm_45v_inherits_from_gpt_provider(self, mock_vision_config):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/models/glm_vl/test_glm_45v_provider.py` around lines 240 -
243, The test class TestGLM45VModelProviderInheritance is missing a pytest
category marker; add the `@pytest.mark.unit` decorator above the class definition
to categorize it as a unit test and ensure pytest is imported (add "import
pytest" at the top if not already present). Locate the class named
TestGLM45VModelProviderInheritance and prepend the decorator line
`@pytest.mark.unit` directly above the class declaration so the test follows the
repository's pytest.mark conventions.

…ti-node support for MiniMax-M2

Add custom full-dimension QK normalization (minimax_m2_provider.py) since
MiniMax-M2 applies RMSNorm over the entire Q/K projection rather than
per-head. The implementation uses sum-of-squares all-reduce across TP
ranks and provides sharded_state_dict for distributed checkpointing.
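The full-dim vs per-head distinction matters numerically: RMSNorm over the entire Q projection computes one scale per token, while per-head RMSNorm rescales each head independently, so the two generally disagree on the same weights. A small NumPy illustration (shapes are illustrative, not MiniMax-M2's actual dimensions):

```python
import numpy as np

def rmsnorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMS-normalize over the last axis."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
num_heads, head_dim = 4, 8
q = rng.standard_normal(num_heads * head_dim)

full_dim = rmsnorm(q)                                        # one scale for all heads
per_head = rmsnorm(q.reshape(num_heads, head_dim)).ravel()   # one scale per head

# The two normalizations differ in general, which is why a full-dim
# q_norm weight cannot simply be mapped onto a per-head norm.
print(np.allclose(full_dim, per_head))  # False in general
```

The sum-of-squares all-reduce mentioned above follows from the same observation: with tensor parallelism each rank holds only a slice of the full projection, so the global mean-square must be assembled across ranks before the single full-dim scale can be applied.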

Add on-the-fly FP8 block-wise dequantization in the bridge via
maybe_modify_loaded_hf_weight, converting float8_e4m3fn weights to
bfloat16 using per-block scale_inv factors during HF->Megatron
conversion.
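The dequantization step amounts to expanding a per-block `scale_inv` tensor back to the weight's shape and multiplying. A NumPy sketch of the idea, under stated assumptions: the block size and function name are hypothetical, and the real code operates on `torch.float8_e4m3fn` tensors rather than float32 stand-ins.

```python
import numpy as np

def dequantize_blockwise(w_q: np.ndarray, scale_inv: np.ndarray,
                         block: int = 2) -> np.ndarray:
    """Multiply each (block x block) tile of w_q by its scalar from scale_inv."""
    rows, cols = w_q.shape
    assert scale_inv.shape == (rows // block, cols // block)
    # Broadcast each per-block scale over its tile.
    scales = np.kron(scale_inv, np.ones((block, block)))
    return w_q.astype(np.float32) * scales

w_q = np.ones((4, 4), dtype=np.float32)        # stand-in for the fp8 payload
scale_inv = np.array([[2.0, 3.0], [4.0, 5.0]])  # one scale per 2x2 block
w = dequantize_blockwise(w_q, scale_inv)
print(w[0, 0], w[0, 2], w[2, 0], w[2, 2])  # 2.0 3.0 4.0 5.0
```

In the bridge the analogous multiply would run during `maybe_modify_loaded_hf_weight`, upcasting to bfloat16 so the converted checkpoint no longer depends on the FP8 scale tensors.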

Add multi-node Slurm scripts (slurm_conversion.sh, slurm_inference.sh)
for configurations requiring TP*EP*PP > 8 GPUs.

Update verify_toy_model.py to extract real pretrained weights (N layers)
from the FP8 model, dequantize to bf16, and verify round-trip accuracy.

Fix dtype mismatch handling in hf_megatron_roundtrip_multi_gpu.py for
FP8 source models.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
@yaoyu-33 yaoyu-33 mentioned this pull request Mar 4, 2026
9 tasks