- Remove NemotronNano12Bv2Provider from nemotron_vl/__init__.py (was a deprecated alias from the deleted nemotron_h_provider.py)
- Remove the invalid max_position_embeddings kwarg from the kimi and moonlight recipes (not a field on MLAModelProvider)
- Update the moonlight test to monkeypatch MLAModelProvider instead of the deleted MoonlightModelProvider16B

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Add Megatron Bridge for MiniMaxAI/MiniMax-M2, a sparse MoE model with 256 experts, top-8 sigmoid routing, and expert bias correction. Includes:

- Bridge with config mapping and per-expert weight conversion (block_sparse_moe prefix, w1/w2/w3 format)
- Partial RoPE support (rotary_dim -> rotary_percent)
- QK layernorm intentionally disabled (full-dim vs per-head mismatch)
- Functional test with a toy model for TP/PP/EP parallelism
- Example scripts for conversion, inference, and verification
- compare.py fix: truncate Megatron logits to the HF vocab size for proper comparison when Megatron pads the vocab for kernel efficiency

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
📝 Walkthrough

This PR adds MiniMax-M2 model support to Megatron-Bridge and consolidates the model provider architecture by removing numerous deprecated provider classes across the DeepSeek, Llama, Nemotron, GLM, GPT-OSS, and Kimi variants, replacing them with unified bridge implementations and generic providers (GPTModelProvider, MLAModelProvider, MambaModelProvider). It also includes new conversion/inference scripts and updates to recipes, tests, and examples.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (4 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/unit_tests/recipes/test_run_plugins.py (1)
Lines 77-102: ⚠️ Potential issue | 🟡 Minor: Use the caller-provided sequence length in the model config.
Line 77 accepts `seq_length`, but Line 101 hardcodes `8192`. This ignores overrides and can desynchronize model vs. dataset sequence lengths.

🔧 Minimal fix

```diff
- seq_length=8192,
+ seq_length=seq_length,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/recipes/test_run_plugins.py` around lines 77 - 102, The model config currently hardcodes seq_length=8192 in the GPTModelProvider call, ignoring the caller-provided seq_length variable; update the GPTModelProvider instantiation to use the seq_length variable (the one popped from kwargs at the top of the function) instead of the literal 8192 so the model config matches the dataset/training sequence length and respects overrides.

src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py (1)
Lines 79-151: ⚠️ Potential issue | 🟠 Major: Do not silently ignore `vp_stage` in `provide()`.

Line 79 accepts `vp_stage`, but the implementation never uses or validates it. That can silently construct the wrong model scope under virtual pipeline usage instead of failing fast.

🔒 Suggested guard

```diff
 def provide(self, pre_process=None, post_process=None, vp_stage=None):  # noqa: D401
     """Assemble a full :class:`~megatron.core.models.multimodal.llava_model.LLaVAModel`."""
+    if vp_stage is not None or getattr(self, "virtual_pipeline_model_parallel_size", None) is not None:
+        raise ValueError(
+            "Virtual pipeline parallelism is not supported in "
+            "NemotronNano12Bv2VLModelProvider.provide()."
+        )
     language_cfg = copy.deepcopy(self)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py` around lines 79 - 151, The provide() method accepts vp_stage but never uses or validates it, which can silently construct an incorrect model under virtual pipeline; update provide() (in nemotron_vl_provider.py) to validate vp_stage at the start (e.g., ensure it's None or within allowed stage range) and either pass it into the LLaVAModel constructor if that class supports a virtual pipeline stage parameter or raise a clear exception when an unsupported vp_stage is given; reference the provide() function and the vp_stage parameter and ensure the guard runs before assembling language_cfg/vision_cfg and creating the LLaVAModel so incorrect scopes cannot be built.
🧹 Nitpick comments (10)
tests/unit_tests/recipes/test_moonlight_recipes.py (1)
Line 18: Consider updating stale documentation and class naming.

The docstring on line 18 still references `MoonlightModelProvider16B`, and the fake class `_FakeMoonlightModelProvider16B` (line 76) is now used to mock `MLAModelProvider`. This inconsistency could confuse future maintainers.

✏️ Suggested documentation updates

```diff
 # Test purpose:
 # - Parametrize over all exported Moonlight recipe functions in `megatron.bridge.recipes.moonlight`.
-# - For each recipe, monkeypatch `MoonlightModelProvider16B` with a lightweight fake to avoid I/O.
+# - For each recipe, monkeypatch `MLAModelProvider` with a lightweight fake to avoid I/O.
 # - Build a config with small, safe overrides and assert it forms a valid `ConfigContainer`.
 # - Verify tokenizer selection and sanity-check parallelism fields.

-class _FakeMoonlightModelProvider16B:
-    """Fake MoonlightModelProvider16B for testing without model I/O."""
+class _FakeMLAModelProvider:
+    """Fake MLAModelProvider for testing Moonlight recipes without model I/O."""
```

If renaming the class, update all usages (lines 136, 171, 196, 222, 238, 261, 284, 307, 329).
Also applies to: 76-77
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/recipes/test_moonlight_recipes.py` at line 18, Update the stale docstring and fake class name so they match the current provider type: change any docstring text referencing MoonlightModelProvider16B to reference MLAModelProvider (or the current provider name) and rename the fake class _FakeMoonlightModelProvider16B to _FakeMLAModelProvider (or similar) and update all monkeypatch usages that reference the old name (the test class and each place where the fake is injected in the tests). Ensure the docstring, the fake class declaration, and every usage/monkeypatch that currently points to _FakeMoonlightModelProvider16B are updated to the new MLAModelProvider-based names so they are consistent.

tests/unit_tests/training/test_config.py (1)
Lines 79-89: Consider renaming this function to reflect its current purpose.

The function is named `create_test_deepseek_config` but now creates an `MLAModelProvider`. Consider renaming to `create_test_mla_config` for consistency with the updated implementation. The test comment on line 1229 also references "Deepseek" but tests `MLAModelProvider`.

♻️ Suggested rename

```diff
-def create_test_deepseek_config(**kwargs: Any) -> MLAModelProvider:
-    """Creates an instance of MLAModelProvider for testing."""
+def create_test_mla_config(**kwargs: Any) -> MLAModelProvider:
+    """Creates an instance of MLAModelProvider for testing MLA-based models."""
     defaults = {
         "num_layers": 1,
         "hidden_size": 128,
         "num_attention_heads": 4,
         "seq_length": 512,
         "apply_rope_fusion": False,
     }
     defaults.update(kwargs)
     return MLAModelProvider(**defaults)
```

Also update the reference on line 1225 and comment on line 1229:

```diff
-@pytest.mark.parametrize("model_factory", [create_test_gpt_config, create_test_deepseek_config])
+@pytest.mark.parametrize("model_factory", [create_test_gpt_config, create_test_mla_config])
 def test_default_pipeline_dtype(self, model_factory, monkeypatch):
     """
-    Test pipeline_dtype is automatically set if None and PP enabled.
-    Test for both GPT and Deepseek to test both TransformerConfig types.
+    Test pipeline_dtype is automatically set if None and PP enabled.
+    Test for both GPT and MLA to test both TransformerConfig types.
     """
```
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/training/test_config.py` around lines 79 - 89, Rename the helper function create_test_deepseek_config to create_test_mla_config and update all call sites and comments that still reference "Deepseek" to instead reference MLAModelProvider/MLA (e.g., change the function name and any test references and the comment that mentions Deepseek to reflect MLAModelProvider), ensuring the factory still returns MLAModelProvider(**defaults) and preserving the same default kwargs.

tests/unit_tests/training/test_log_non_default_values.py (2)
Lines 58-71: Test logic is correct; consider renaming for clarity.

The test correctly verifies that `MLAModelProvider` returns `MCoreMLATransformerConfig` as its parent, and the docstring is accurate. The method name `test_deepseek_provider_returns_mla_transformer_config` could be renamed to `test_mla_provider_returns_mla_transformer_config` for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/training/test_log_non_default_values.py` around lines 58 - 71, Rename the test function to match the subject under test: change the function name test_deepseek_provider_returns_mla_transformer_config to test_mla_provider_returns_mla_transformer_config in the test file so it clearly reflects MLAModelProvider; keep the existing docstring and assertions (including the reference to _get_mcore_transformer_parent and MCoreMLATransformerConfig) unchanged.
Lines 299-315: Test logic is correct; naming could be improved for consistency.

The test correctly verifies that `MLAModelProvider` uses `MLATransformerConfig` for comparison. However, the method name, docstring, and variable name still reference "DeepSeek" while testing the generic `MLAModelProvider`.

♻️ Suggested naming updates

```diff
 @patch("megatron.bridge.training.config.print_rank_0")
-def test_handles_deepseek_model_correctly(self, mock_print_rank_0):
-    """Should use MLATransformerConfig for DeepSeek models."""
-    deepseek_model = MLAModelProvider(
+def test_handles_mla_model_correctly(self, mock_print_rank_0):
+    """Should use MLATransformerConfig for MLA-based models."""
+    mla_model = MLAModelProvider(
         num_layers=2,
         hidden_size=128,
         num_attention_heads=4,
         seq_length=512,
     )
-    cfg = self._create_minimal_config_container(model_provider=deepseek_model)
+    cfg = self._create_minimal_config_container(model_provider=mla_model)
     cfg.log_non_default_values()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/training/test_log_non_default_values.py` around lines 299 - 315, Rename references to "DeepSeek" in the test to reflect the generic MLAModelProvider: update the test method name test_handles_deepseek_model_correctly, its docstring, and the variable deepseek_model to names that reference MLAModelProvider (e.g., test_handles_mla_model_correctly and mla_model) so they match the actual class under test (MLAModelProvider); keep the assertions and usage of mock_print_rank_0, cfg, and cfg.log_non_default_values() unchanged and ensure the assertion still verifies "MLATransformerConfig" appears in mock_print_rank_0.call_args[0][0].

examples/conversion/compare_hf_and_megatron/compare.py (1)
Lines 612-613: Also disable `grad_scale_func` when turning off MTP.

For consistency with the existing MTP-disable pattern, set `m.config.grad_scale_func = None` alongside `mtp_num_layers = None`.

🔧 Suggested patch

```diff
 for m in megatron_model:
     m.config.mtp_num_layers = None
+    m.config.grad_scale_func = None
```

Based on learnings: In `examples/conversion/compare_hf_and_megatron/compare.py`, disabling MTP for inference used `m.config.mtp_num_layers = None` and `m.config.grad_scale_func = None` together.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/conversion/compare_hf_and_megatron/compare.py` around lines 612 - 613, When disabling MTP in the loop over megatron_model, also set the grad scale function to None: inside the for m in megatron_model loop where you currently do m.config.mtp_num_layers = None, add m.config.grad_scale_func = None so both m.config.mtp_num_layers and m.config.grad_scale_func are cleared to disable MTP consistently for inference.

src/megatron/bridge/recipes/moonlight/moonlight_16b.py (1)
Lines 199-255: Deduplicate shared `MLAModelProvider` kwargs between SFT and PEFT.

These two constructor blocks are large and mostly identical, which increases config drift risk.

♻️ Suggested refactor sketch

```diff
+def _moonlight_16b_model_kwargs(*, tp: int, ep: int, sequence_parallel: bool) -> dict:
+    return dict(
+        num_layers=27,
+        hidden_size=2048,
+        ffn_hidden_size=11264,
+        num_attention_heads=16,
+        kv_channels=16,
+        q_lora_rank=None,
+        kv_lora_rank=512,
+        num_moe_experts=64,
+        moe_ffn_hidden_size=1408,
+        moe_shared_expert_intermediate_size=2816,
+        moe_layer_freq=[0] * 1 + [1] * 26,
+        moe_router_topk=6,
+        moe_router_num_groups=1,
+        moe_router_group_topk=1,
+        moe_router_topk_scaling_factor=2.446,
+        moe_aux_loss_coeff=0.001,
+        make_vocab_size_divisible_by=1280,
+        moe_router_score_function="sigmoid",
+        moe_router_enable_expert_bias=True,
+        rotary_scaling_factor=1.0,
+        mscale=1.0,
+        mscale_all_dim=1.0,
+        rotary_base=50000,
+        layernorm_epsilon=1e-5,
+        init_method_std=0.02,
+        moe_router_bias_update_rate=1e-3,
+        rotary_percent=1.0,
+        vocab_size=163842,
+        normalization="RMSNorm",
+        activation_func=F.silu,
+        gated_linear_unit=True,
+        position_embedding_type="rope",
+        add_bias_linear=False,
+        share_embeddings_and_output_weights=False,
+        qk_layernorm=True,
+        bf16=True,
+        params_dtype=torch.bfloat16,
+        moe_grouped_gemm=True,
+        moe_token_dispatcher_type="alltoall",
+        tensor_model_parallel_size=tp,
+        pipeline_model_parallel_size=1,
+        pipeline_dtype=torch.bfloat16,
+        virtual_pipeline_model_parallel_size=None,
+        context_parallel_size=1,
+        expert_model_parallel_size=ep,
+        sequence_parallel=sequence_parallel,
+        expert_tensor_parallel_size=1,
+        recompute_granularity="selective",
+        recompute_modules=None,
+        recompute_method=None,
+        recompute_num_layers=None,
+    )
@@
-    cfg.model = MLAModelProvider(
-        ...
-    )
+    cfg.model = MLAModelProvider(**_moonlight_16b_model_kwargs(tp=2, ep=8, sequence_parallel=True))
@@
-    cfg.model = MLAModelProvider(
-        ...
-    )
+    cfg.model = MLAModelProvider(**_moonlight_16b_model_kwargs(tp=1, ep=2, sequence_parallel=False))
```

Also applies to: 413-468
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py` around lines 199 - 255, The MLAModelProvider constructor kwargs are duplicated between the SFT and PEFT blocks (MLAModelProvider in moonlight_16b.py); extract the shared keyword args into a single dict (e.g., common_model_kwargs or common_mla_config) and pass it into both MLAModelProvider(...) calls (merge per-block overrides after the shared dict), so SFT and PEFT only specify differences. Update any references to parameters like num_layers, hidden_size, moe_*, rotary_*, normalization, activation_func, bf16, params_dtype, parallelism and recompute_* to live in the shared dict and leave only unique overrides in the individual SFT/PEFT constructor calls.

tests/unit_tests/recipes/test_glm45_recipes.py (1)
Lines 138-153: Extract common monkeypatch setup into a pytest fixture.

This setup pattern is repeated and should be centralized to reduce drift across tests.

♻️ Example fixture extraction

```diff
+@pytest.fixture
+def patch_glm_recipe_io(monkeypatch: pytest.MonkeyPatch):
+    def _apply(mod, needs_tokenizer: bool):
+        monkeypatch.setattr(mod, "AutoBridge", _FakeBridge)
+        if needs_tokenizer:
+            import transformers
+            monkeypatch.setattr(
+                transformers,
+                "AutoTokenizer",
+                type("FakeAutoTokenizer", (), {"from_pretrained": staticmethod(lambda *args, **kwargs: _FakeTokenizer())}),
+            )
+    return _apply
@@
-def test_each_glm45_recipe_builds_config(recipe_func: Callable, monkeypatch: pytest.MonkeyPatch):
+def test_each_glm45_recipe_builds_config(recipe_func: Callable, patch_glm_recipe_io):
@@
-    monkeypatch.setattr(mod, "AutoBridge", _FakeBridge)
+    patch_glm_recipe_io(mod, is_sft_or_peft)
```

As per coding guidelines, "tests/**/*.py: Use pytest fixtures for common setup in unit tests."
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/recipes/test_glm45_recipes.py` around lines 138 - 153, Extract the repeated monkeypatch setup into a pytest fixture (e.g., fake_bridge_and_tokenizer) that accepts monkeypatch and recipe_func (or uses request.param) and performs the same work: import the recipe module via recipe_func.__module__, monkeypatch the module's AutoBridge to _FakeBridge, compute is_sft_or_peft by checking "sft" or "peft" in recipe_func.__name__.lower(), and when true monkeypatch transformers.AutoTokenizer to a FakeAutoTokenizer that returns _FakeTokenizer.from_pretrained; replace the inline code in tests with a use of this fixture to centralize the AutoBridge/AutoTokenizer mocking and avoid duplication.

examples/models/minimax_m2/inference.sh (1)
Lines 1-24: Add strict shell mode for safer multi-node execution.

Please add fail-fast shell options so partial failures do not continue silently.

♻️ Suggested hardening

```diff
 #!/usr/bin/env bash
+set -euo pipefail
+
 # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
@@
-uv run python -m torch.distributed.run --nproc_per_node=8 \
+uv run --no-sync python -m torch.distributed.run --nproc_per_node=8 \
```

As per coding guidelines, "**/*.sh: Follow Google Shell Style Guide." Based on learnings, "prefer including `--no-sync` when dependencies are already installed and environment/toolchain compatibility is known."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/minimax_m2/inference.sh` around lines 1 - 24, Add fail-fast shell options and the recommended --no-sync flag: at the top of inference.sh (right after the shebang) enable strict mode by adding "set -euo pipefail" (optionally "set -x" for debug) so the script exits on any error/unset variable/pipe failure, and update the distributed run command line (the "uv run python -m torch.distributed.run --nproc_per_node=8 ...") to include the "--no-sync" option when appropriate to avoid unnecessary dependency sync; ensure any future shell variables in this file are quoted to remain safe under "set -u".

src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py (1)
Lines 37-77: Extract a shared Nano-v2 provider factory to reduce drift risk.

These six `MambaModelProvider` blocks duplicate a large set of identical fields. A small helper with variant-specific overrides would make updates safer and easier.

♻️ Refactor direction (example)

```diff
+def _build_nemotron_nano_v2_provider(
+    *,
+    variant: str,  # "9b" | "12b"
+    seq_length: int,
+    tensor_model_parallel_size: int,
+    sequence_parallel: bool,
+) -> MambaModelProvider:
+    base_kwargs = dict(
+        mamba_num_groups=8,
+        num_query_groups=8,
+        make_vocab_size_divisible_by=128,
+        activation_func=squared_relu,
+        masked_softmax_fusion=True,
+        apply_query_key_layer_scaling=False,
+        persist_layer_norm=True,
+        attention_softmax_in_fp32=False,
+        first_last_layers_bf16=True,
+        is_hybrid_model=True,
+        moe_aux_loss_coeff=0.0001,
+        moe_router_score_function="sigmoid",
+        moe_router_enable_expert_bias=True,
+        moe_router_load_balancing_type="seq_aux_loss",
+        moe_router_dtype="fp32",
+        moe_grouped_gemm=True,
+        moe_token_dispatcher_type="alltoall",
+        moe_permute_fusion=True,
+        moe_shared_expert_overlap=True,
+        pipeline_model_parallel_size=1,
+        pipeline_dtype=torch.bfloat16,
+        virtual_pipeline_model_parallel_size=None,
+        context_parallel_size=1,
+    )
+    variant_kwargs = {
+        "9b": dict(...),
+        "12b": dict(...),
+    }[variant]
+    return MambaModelProvider(
+        **base_kwargs,
+        **variant_kwargs,
+        seq_length=seq_length,
+        tensor_model_parallel_size=tensor_model_parallel_size,
+        sequence_parallel=sequence_parallel,
+    )
```

Also applies to: 172-212, 309-349, 435-475, 571-610, 718-757
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py` around lines 37 - 77, Multiple identical MambaModelProvider blocks (e.g., the cfg.model assignment) should be consolidated into a shared factory to avoid drift; create a helper function (e.g., make_nano_v2_provider or create_nano_v2_provider) that returns a MambaModelProvider pre-populated with the common fields shown (hybrid_override_pattern, num_layers, hidden_size, mamba_num_heads, kv_channels, mamba_state_dim, ffn_hidden_size, num_attention_heads, mamba_head_dim, seq_length, mamba_num_groups, num_query_groups, make_vocab_size_divisible_by, activation_func, masked_softmax_fusion, apply_query_key_layer_scaling, persist_layer_norm, attention_softmax_in_fp32, first_last_layers_bf16, is_hybrid_model, moe_* settings, parallelism defaults, etc.), then replace each duplicated MambaModelProvider instantiation (including cfg.model and the other five blocks) with a call to this factory passing only variant-specific overrides (e.g., tensor_model_parallel_size, pipeline_model_parallel_size, pipeline_dtype, virtual_pipeline_model_parallel_size, context_parallel_size, sequence_parallel) so updates to common settings are made in one place.

tests/functional_tests/training/test_decentralized_pg.py (1)
Lines 104-140: Extract a shared model-config factory to reduce maintenance drift across duplicated test configurations.

The same `GPTModelProvider` baseline is repeated across six test functions. A local helper with base kwargs and per-test overrides would eliminate drift-prone duplication and make test-specific variations clearer.

♻️ Refactor sketch

```diff
+    def _create_model_cfg(self, **overrides):
+        base = dict(
+            normalization="RMSNorm",
+            activation_func=F.silu,
+            gated_linear_unit=True,
+            position_embedding_type="rope",
+            add_bias_linear=False,
+            attention_dropout=0.0,
+            hidden_dropout=0.0,
+            bias_activation_fusion=True,
+            masked_softmax_fusion=True,
+            persist_layer_norm=True,
+            bias_dropout_fusion=True,
+            apply_rope_fusion=True,
+            num_query_groups=8,
+            init_method_std=0.02,
+            layernorm_epsilon=1e-05,
+            rotary_percent=1.0,
+            rope_scaling=True,
+            rope_scaling_factor=32.0,
+            rotary_base=500_000,
+            hidden_size=2048,
+            ffn_hidden_size=8192,
+            num_attention_heads=32,
+            attention_softmax_in_fp32=True,
+            pipeline_dtype=torch.bfloat16,
+            bf16=True,
+            make_vocab_size_divisible_by=128,
+            vocab_size=None,
+        )
+        base.update(overrides)
+        return GPTModelProvider(**base)
```

Applies to lines 104–140, 255–290, 410–446, 566–602, 722–758, 878–914.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/training/test_decentralized_pg.py` around lines 104 - 140, The repeated GPTModelProvider instantiation should be refactored into a shared factory: create a local helper (e.g., make_gpt_model_cfg or gpt_model_provider_factory) that constructs the base kwargs shown (normalization, activation_func, gated_linear_unit, position_embedding_type, etc.), accepts overrides for per-test changes, and returns a GPTModelProvider instance; replace the six duplicated blocks that assign model_cfg with calls to this helper, passing only test-specific overrides (like seq_length, num_layers, share_embeddings_and_output_weights), so maintenance drift is eliminated and variations remain explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/models/glm_vl/glm_45v_provider.py`:
- Around line 17-18: The file mixes old typing (Union/Optional/List) and unsafe
dataclass defaults: replace Union/Optional/List annotations with modern PEP
604/PEP 585 syntax (e.g., A | B, list[T], T | None) for the annotated variables
and function signatures (search for uses in e.g., any function or dataclass
fields around init_method_std and other annotated parameters), change the
annotation of init_method_std from int to float to match its float literal
default, and remove direct calls to functools.partial in dataclass defaults by
using dataclasses.field(default_factory=...) to construct the partial at
instance creation time (or convert to a module-level factory function) for
fields currently assigned partial(...) so you comply with RUF009 and
UP006/UP007.
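As a concrete illustration of the annotation and default-factory fixes described above, here is a minimal, self-contained sketch; the field names are illustrative stand-ins, not the real `GLM45VModelProvider` fields, and `init_normal` is a hypothetical placeholder for a torch init routine:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from functools import partial


def init_normal(std: float = 0.02) -> str:
    """Stand-in for a weight-init routine; real code would call torch.nn.init."""
    return f"normal(std={std})"


@dataclass
class ProviderSketch:
    # PEP 585/604 annotations instead of List[...] / Optional[...] (UP006/UP007).
    layer_sizes: list[int] = field(default_factory=lambda: [128, 128])
    vocab_size: int | None = None
    # Annotated as float to match the float literal default the review flags.
    init_method_std: float = 0.02
    # field(default_factory=...) instead of a bare partial(...) default (RUF009),
    # so each instance gets its own callable rather than one shared class-level object.
    init_method: partial = field(default_factory=lambda: partial(init_normal, std=0.02))
```

The `default_factory` runs at instance creation, which is exactly the behavior RUF009 asks for.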
In `@src/megatron/bridge/models/minimax_m2/minimax_m2_bridge.py`:
- Around line 27-31: The bridge currently registered with MegatronModelBridge
(decorator usage around MiniMaxM2ForCausalLM -> GPTModel) should be split to
follow repo conventions: move the orchestration/bridge implementation (the class
and register_bridge call) into a new model_bridge.py inside the model package,
and extract all parameter mapping logic into a new param_mapping.py; ensure
model_bridge.py imports the mapping functions/classes from param_mapping.py and
registers the bridge (same class name used now), and move the code currently
spanning the mapping section (previously around lines 95-149) into
param_mapping.py as exported mapping utilities referenced by model_bridge.py.
- Around line 73-82: The code currently silences unsupported QK-norm by setting
provider.qk_layernorm = False which drops q_norm.weight and k_norm.weight;
instead, make conversion fail-fast by detecting when those weights are present
and raising a NotImplementedError (or require an explicit opt-in flag) so users
cannot mistakenly get a misleading “successful” conversion. Update the logic
around provider.qk_layernorm in minimax_m2_bridge (and any conversion entrypoint
that calls it) to check for q_norm.weight/k_norm.weight in the source model and
either raise NotImplementedError with a descriptive message referencing
q_norm.weight and k_norm.weight and provider.qk_layernorm, or accept an explicit
parameter (e.g., allow_inexact_qk_norm) that must be true to proceed; ensure the
error message tells users how to opt in if desired.
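The fail-fast guard described above can be sketched as follows; the weight names come from the review, but the function name, the `state_dict_keys` parameter, and the `allow_inexact_qk_norm` flag are illustrative, not the bridge's real API:

```python
def check_qk_norm_supported(state_dict_keys, allow_inexact_qk_norm=False):
    """Fail fast instead of silently dropping QK-norm weights during conversion."""
    # Collect any q_norm/k_norm weights present in the source checkpoint.
    qk_norm_keys = [
        k
        for k in state_dict_keys
        if k.endswith("q_norm.weight") or k.endswith("k_norm.weight")
    ]
    if qk_norm_keys and not allow_inexact_qk_norm:
        raise NotImplementedError(
            "Source checkpoint contains QK-norm weights this bridge cannot map "
            f"exactly (full-dim vs per-head mismatch): {qk_norm_keys[:4]}. "
            "Pass allow_inexact_qk_norm=True to drop them deliberately."
        )
    return qk_norm_keys
```

With the explicit opt-in, a user who really wants the lossy conversion still can have it, but never by accident.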
In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py`:
- Around line 211-259: The recipe passes raw strings to
MambaModelProvider.attention_backend but the provider expects the AttnBackend
enum; import AttnBackend and replace string usages (e.g.,
attention_backend="fused" and any later assignments like
cfg.model.attention_backend = "fused") with the corresponding AttnBackend enum
member (for example AttnBackend.fused or AttnBackend.auto as appropriate) so all
occurrences in this recipe (including the MambaModelProvider constructor and
subsequent cfg.model.attention_backend assignments) use the enum type instead of
raw strings.
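A small sketch of the string-vs-enum issue; the enum below is a local stand-in for `megatron.core.transformer.enums.AttnBackend` (member names should be checked against the installed Megatron-Core version), and the validator is hypothetical:

```python
from enum import Enum


class AttnBackend(Enum):
    """Illustrative stand-in for megatron.core.transformer.enums.AttnBackend."""

    flash = 1
    fused = 2
    unfused = 3
    local = 4
    auto = 5


def validate_attention_backend(value):
    """Reject raw strings so a typo fails fast instead of silently misconfiguring."""
    if isinstance(value, str):
        raise TypeError(
            f"attention_backend must be an AttnBackend member, got string {value!r}; "
            f"use AttnBackend[{value!r}] instead"
        )
    return value
```

Passing `AttnBackend.fused` (or looking it up via `AttnBackend["fused"]`) keeps the type contract the provider expects.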
In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py`:
- Around line 443-456: The NemotronH 56B configs create MambaModelProvider with
attention_backend=AttnBackend.auto but later assignments overwrite it to None;
update the code so the constructor parameter is authoritative by removing the
subsequent assignments that set attention_backend = None (or, if auto is
intended later, replace those None assignments with AttnBackend.auto) in each
config block where cfg.model = MambaModelProvider(...) appears (e.g., the block
starting with cfg.model = MambaModelProvider and the later places noted around
the other config functions); ensure only one definitive assignment to
attention_backend remains so AttnBackend.auto is not silently discarded.
In `@tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py`:
- Around line 169-173: Replace the fragile "assert False" in the failure path
with an explicit exception so failures are raised even under Python -O;
specifically, where you check result.returncode != 0 (using variables result and
test_name), keep the STDOUT/STDERR prints but replace the assert False line with
a raise (e.g., raise RuntimeError or AssertionError) that includes the same
message and return code information so the test fails reliably and provides
context.
- Around line 15-17: The test uses hardcoded interpreter and absolute repo paths
and a bare assert; update the subprocess invocation(s) in
test_minimax_m2_conversion.py to use sys.executable instead of "python", compute
repository paths relative to the test file using
Path(__file__).resolve().parents[...] (replace occurrences of
"/opt/Megatron-Bridge/" with the computed path) and pass cwd to subprocess.run,
use capture_output=True/text=True and check=True for portable behavior, and wrap
the subprocess call in a try/except catching subprocess.CalledProcessError (or
Exception) and call pytest.fail(...) with the captured stdout/stderr on error
instead of using assert False. Ensure you update imports to include sys and
pytest and adjust the function or block that performs the conversion accordingly
(identify the subprocess.run call and its surrounding error handling).
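The two test-hardening fixes above (raise instead of `assert False`, portable interpreter and paths) can be sketched together; `run_checked` is a hypothetical helper, not an existing API in the repo:

```python
import subprocess
import sys


def run_checked(cmd, cwd=None):
    """Run a command portably; raise with full output instead of a bare assert.

    Use sys.executable rather than a hardcoded "python", and compute cwd from
    the test file (e.g. Path(__file__).resolve().parents[3]) rather than
    hardcoding /opt/Megatron-Bridge/.
    """
    try:
        # check=True raises CalledProcessError, which survives `python -O`,
        # unlike an `assert False` failure path.
        return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as exc:
        # Inside a pytest test, pytest.fail(...) with this message is equivalent.
        raise RuntimeError(
            f"command failed (rc={exc.returncode})\n"
            f"STDOUT:\n{exc.stdout}\nSTDERR:\n{exc.stderr}"
        ) from exc


result = run_checked([sys.executable, "-c", "print('ok')"])
```

The captured stdout/stderr lands in the exception message, so a CI failure still shows the conversion script's output.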
In `@tests/unit_tests/models/glm_vl/test_glm_45v_provider.py`:
- Around line 240-243: The test class TestGLM45VModelProviderInheritance is
missing a pytest category marker; add the `@pytest.mark.unit` decorator above the
class definition to categorize it as a unit test and ensure pytest is imported
(add "import pytest" at the top if not already present). Locate the class named
TestGLM45VModelProviderInheritance and prepend the decorator line
`@pytest.mark.unit` directly above the class declaration so the test follows the
repository's pytest.mark conventions.
---
Outside diff comments:
In `@src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py`:
- Around line 79-151: The provide() method accepts vp_stage but never uses or
validates it, which can silently construct an incorrect model under virtual
pipeline; update provide() (in nemotron_vl_provider.py) to validate vp_stage at
the start (e.g., ensure it's None or within allowed stage range) and either pass
it into the LLaVAModel constructor if that class supports a virtual pipeline
stage parameter or raise a clear exception when an unsupported vp_stage is
given; reference the provide() function and the vp_stage parameter and ensure
the guard runs before assembling language_cfg/vision_cfg and creating the
LLaVAModel so incorrect scopes cannot be built.
In `@tests/unit_tests/recipes/test_run_plugins.py`:
- Around line 77-102: The model config currently hardcodes seq_length=8192 in
the GPTModelProvider call, ignoring the caller-provided seq_length variable;
update the GPTModelProvider instantiation to use the seq_length variable (the
one popped from kwargs at the top of the function) instead of the literal 8192
so the model config matches the dataset/training sequence length and respects
overrides.
---
Nitpick comments:
In `@examples/conversion/compare_hf_and_megatron/compare.py`:
- Around line 612-613: When disabling MTP in the loop over megatron_model, also
set the grad scale function to None: inside the for m in megatron_model loop
where you currently do m.config.mtp_num_layers = None, add
m.config.grad_scale_func = None so both m.config.mtp_num_layers and
m.config.grad_scale_func are cleared to disable MTP consistently for inference.
In `@examples/models/minimax_m2/inference.sh`:
- Around line 1-24: Add fail-fast shell options and the recommended --no-sync
flag: at the top of inference.sh (right after the shebang) enable strict mode by
adding "set -euo pipefail" (optionally "set -x" for debug) so the script exits
on any error/unset variable/pipe failure, and update the distributed run command
line (the "uv run python -m torch.distributed.run --nproc_per_node=8 ...") to
include the "--no-sync" option when appropriate to avoid unnecessary dependency
sync; ensure any future shell variables in this file are quoted to remain safe
under "set -u".
In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py`:
- Around line 199-255: The MLAModelProvider constructor kwargs are duplicated
between the SFT and PEFT blocks (MLAModelProvider in moonlight_16b.py); extract
the shared keyword args into a single dict (e.g., common_model_kwargs or
common_mla_config) and pass it into both MLAModelProvider(...) calls (merge
per-block overrides after the shared dict), so SFT and PEFT only specify
differences. Update any references to parameters like num_layers, hidden_size,
moe_*, rotary_*, normalization, activation_func, bf16, params_dtype, parallelism
and recompute_* to live in the shared dict and leave only unique overrides in
the individual SFT/PEFT constructor calls.
In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py`:
- Around line 37-77: Multiple identical MambaModelProvider blocks (e.g., the
cfg.model assignment) should be consolidated into a shared factory to avoid
drift; create a helper function (e.g., make_nano_v2_provider or
create_nano_v2_provider) that returns a MambaModelProvider pre-populated with
the common fields shown (hybrid_override_pattern, num_layers, hidden_size,
mamba_num_heads, kv_channels, mamba_state_dim, ffn_hidden_size,
num_attention_heads, mamba_head_dim, seq_length, mamba_num_groups,
num_query_groups, make_vocab_size_divisible_by, activation_func,
masked_softmax_fusion, apply_query_key_layer_scaling, persist_layer_norm,
attention_softmax_in_fp32, first_last_layers_bf16, is_hybrid_model, moe_*
settings, parallelism defaults, etc.), then replace each duplicated
MambaModelProvider instantiation (including cfg.model and the other five blocks)
with a call to this factory passing only variant-specific overrides (e.g.,
tensor_model_parallel_size, pipeline_model_parallel_size, pipeline_dtype,
virtual_pipeline_model_parallel_size, context_parallel_size, sequence_parallel)
so updates to common settings are made in one place.
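A factory along these lines would centralize the common fields; the field subset below is illustrative, and `make_nano_v2_provider` is the hypothetical name from the suggestion above:

```python
from dataclasses import dataclass


@dataclass
class MambaModelProvider:  # stand-in for the real provider
    num_layers: int = 0
    seq_length: int = 0
    is_hybrid_model: bool = False
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1


def make_nano_v2_provider(**overrides) -> MambaModelProvider:
    """Build a provider pre-populated with the shared Nano v2 fields."""
    common = dict(num_layers=52, seq_length=8192, is_hybrid_model=True)
    common.update(overrides)  # variant-specific values win over the defaults
    return MambaModelProvider(**common)


# Call sites pass only what differs per variant.
cfg_model = make_nano_v2_provider(tensor_model_parallel_size=4)
```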
In `@tests/functional_tests/training/test_decentralized_pg.py`:
- Around line 104-140: The repeated GPTModelProvider instantiation should be
refactored into a shared factory: create a local helper (e.g.,
make_gpt_model_cfg or gpt_model_provider_factory) that constructs the base
kwargs shown (normalization, activation_func, gated_linear_unit,
position_embedding_type, etc.), accepts overrides for per-test changes, and
returns a GPTModelProvider instance; replace the six duplicated blocks that
assign model_cfg with calls to this helper, passing only test-specific overrides
(like seq_length, num_layers, share_embeddings_and_output_weights), so
maintenance drift is eliminated and variations remain explicit.
In `@tests/unit_tests/recipes/test_glm45_recipes.py`:
- Around line 138-153: Extract the repeated monkeypatch setup into a pytest
fixture (e.g., fake_bridge_and_tokenizer) that accepts monkeypatch and
recipe_func (or uses request.param) and performs the same work: import the
recipe module via recipe_func.__module__, monkeypatch the module's AutoBridge to
_FakeBridge, compute is_sft_or_peft by checking "sft" or "peft" in
recipe_func.__name__.lower(), and when true monkeypatch
transformers.AutoTokenizer to a FakeAutoTokenizer that returns
_FakeTokenizer.from_pretrained; replace the inline code in tests with a use of
this fixture to centralize the AutoBridge/AutoTokenizer mocking and avoid
duplication.
In `@tests/unit_tests/recipes/test_moonlight_recipes.py`:
- Line 18: Update the stale docstring and fake class name so they match the
current provider type: change any docstring text referencing
MoonlightModelProvider16B to reference MLAModelProvider (or the current provider
name) and rename the fake class _FakeMoonlightModelProvider16B to
_FakeMLAModelProvider (or similar) and update all monkeypatch usages that
reference the old name (the test class and each place where the fake is injected
in the tests). Ensure the docstring, the fake class declaration, and every
usage/monkeypatch that currently points to _FakeMoonlightModelProvider16B are
updated to the new MLAModelProvider-based names so they are consistent.
In `@tests/unit_tests/training/test_config.py`:
- Around line 79-89: Rename the helper function create_test_deepseek_config to
create_test_mla_config and update all call sites and comments that still
reference "Deepseek" to instead reference MLAModelProvider/MLA (e.g., change the
function name and any test references and the comment that mentions Deepseek to
reflect MLAModelProvider), ensuring the factory still returns
MLAModelProvider(**defaults) and preserving the same default kwargs.
In `@tests/unit_tests/training/test_log_non_default_values.py`:
- Around line 58-71: Rename the test function to match the subject under test:
change the function name test_deepseek_provider_returns_mla_transformer_config
to test_mla_provider_returns_mla_transformer_config in the test file so it
clearly reflects MLAModelProvider; keep the existing docstring and assertions
(including the reference to _get_mcore_transformer_parent and
MCoreMLATransformerConfig) unchanged.
- Around line 299-315: Rename references to "DeepSeek" in the test to reflect
the generic MLAModelProvider: update the test method name
test_handles_deepseek_model_correctly, its docstring, and the variable
deepseek_model to names that reference MLAModelProvider (e.g.,
test_handles_mla_model_correctly and mla_model) so they match the actual class
under test (MLAModelProvider); keep the assertions and usage of
mock_print_rank_0, cfg, and cfg.log_non_default_values() unchanged and ensure
the assertion still verifies "MLATransformerConfig" appears in
mock_print_rank_0.call_args[0][0].
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (70)
- examples/conversion/compare_hf_and_megatron/compare.py
- examples/models/minimax_m2/conversion.sh
- examples/models/minimax_m2/inference.sh
- examples/models/minimax_m2/verify_toy_model.py
- src/megatron/bridge/models/__init__.py
- src/megatron/bridge/models/conversion/model_bridge.py
- src/megatron/bridge/models/deepseek/__init__.py
- src/megatron/bridge/models/deepseek/deepseek_provider.py
- src/megatron/bridge/models/glm/__init__.py
- src/megatron/bridge/models/glm/glm45_provider.py
- src/megatron/bridge/models/glm_vl/glm_45v_provider.py
- src/megatron/bridge/models/gpt_oss/__init__.py
- src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py
- src/megatron/bridge/models/hf_pretrained/utils.py
- src/megatron/bridge/models/kimi/__init__.py
- src/megatron/bridge/models/kimi/kimi_provider.py
- src/megatron/bridge/models/llama/__init__.py
- src/megatron/bridge/models/llama/llama_provider.py
- src/megatron/bridge/models/llama_nemotron/__init__.py
- src/megatron/bridge/models/llama_nemotron/llama_nemotron_bridge.py
- src/megatron/bridge/models/llama_nemotron/llama_nemotron_provider.py
- src/megatron/bridge/models/minimax_m2/__init__.py
- src/megatron/bridge/models/minimax_m2/minimax_m2_bridge.py
- src/megatron/bridge/models/nemotron/__init__.py
- src/megatron/bridge/models/nemotron/nemotron_provider.py
- src/megatron/bridge/models/nemotron_vl/__init__.py
- src/megatron/bridge/models/nemotron_vl/nemotron_vl_bridge.py
- src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py
- src/megatron/bridge/models/nemotronh/__init__.py
- src/megatron/bridge/models/nemotronh/nemotron_h_provider.py
- src/megatron/bridge/recipes/kimi/kimi_k2.py
- src/megatron/bridge/recipes/moonlight/moonlight_16b.py
- src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py
- src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py
- src/megatron/bridge/recipes/nemotronh/nemotronh.py
- tests/functional_tests/data/test_samplers.py
- tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py
- tests/functional_tests/models/minimax_m2/__init__.py
- tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py
- tests/functional_tests/training/test_callbacks.py
- tests/functional_tests/training/test_decentralized_pg.py
- tests/functional_tests/training/test_finetune_dora.py
- tests/functional_tests/training/test_finetune_lora.py
- tests/functional_tests/training/test_inprocess_restart.py
- tests/functional_tests/training/test_megatron_fsdp.py
- tests/functional_tests/training/test_nvrx_straggler.py
- tests/functional_tests/training/test_pretrain.py
- tests/functional_tests/training/test_pretrain_resume.py
- tests/functional_tests/training/test_sample_based_training.py
- tests/functional_tests/training/test_sft.py
- tests/functional_tests/training/test_tensor_inspect.py
- tests/unit_tests/models/deepseek/test_deepseek_provider.py
- tests/unit_tests/models/glm/test_glm45_provider.py
- tests/unit_tests/models/glm_vl/test_glm_45v_provider.py
- tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py
- tests/unit_tests/models/kimi/test_kimi_provider.py
- tests/unit_tests/models/llama/test_llama_provider.py
- tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py
- tests/unit_tests/models/nemotron/test_nemotron_provider.py
- tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py
- tests/unit_tests/models/test_models_imports.py
- tests/unit_tests/recipes/kimi/test_kimi_k2.py
- tests/unit_tests/recipes/nemotronh/test_nemotron_3_nano.py
- tests/unit_tests/recipes/nemotronh/test_nemotron_nano_v2.py
- tests/unit_tests/recipes/nemotronh/test_nemotronh.py
- tests/unit_tests/recipes/test_glm45_recipes.py
- tests/unit_tests/recipes/test_moonlight_recipes.py
- tests/unit_tests/recipes/test_run_plugins.py
- tests/unit_tests/training/test_config.py
- tests/unit_tests/training/test_log_non_default_values.py
💤 Files with no reviewable changes (22)
- src/megatron/bridge/models/glm/glm45_provider.py
- tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py
- tests/unit_tests/models/llama/test_llama_provider.py
- tests/unit_tests/models/kimi/test_kimi_provider.py
- src/megatron/bridge/models/nemotron/__init__.py
- src/megatron/bridge/models/kimi/kimi_provider.py
- tests/unit_tests/models/deepseek/test_deepseek_provider.py
- src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py
- src/megatron/bridge/models/llama_nemotron/__init__.py
- tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py
- src/megatron/bridge/models/llama/llama_provider.py
- src/megatron/bridge/models/deepseek/deepseek_provider.py
- tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py
- src/megatron/bridge/models/glm/__init__.py
- tests/unit_tests/models/glm/test_glm45_provider.py
- src/megatron/bridge/models/nemotronh/__init__.py
- src/megatron/bridge/models/nemotron_vl/__init__.py
- src/megatron/bridge/models/gpt_oss/__init__.py
- src/megatron/bridge/models/nemotronh/nemotron_h_provider.py
- src/megatron/bridge/models/nemotron/nemotron_provider.py
- tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py
- tests/unit_tests/models/nemotron/test_nemotron_provider.py
from typing import Callable, List, Optional, Union
🧩 Analysis chain
🏁 Script executed:
cat -n src/megatron/bridge/models/glm_vl/glm_45v_provider.py | head -120
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 5018
🏁 Script executed:
ruff check --select RUF009,UP006,UP007 src/megatron/bridge/models/glm_vl/glm_45v_provider.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3574
Modernize typing annotations and fix dataclass defaults to comply with Python 3.10+ guidelines.
The file uses deprecated typing syntax (Union, Optional, List) and violates RUF009 by calling partial() directly in a dataclass default. Additionally, init_method_std is annotated as int but assigned a float literal.
♻️ Proposed changes
-from typing import Callable, List, Optional, Union
+from collections.abc import Callable
@@
- transformer_layer_spec: Union[ModuleSpec, Callable[["GPTModelProvider"], ModuleSpec]] = partial(
- get_gpt_decoder_block_spec, use_transformer_engine=HAVE_TE
- )
+ transformer_layer_spec: ModuleSpec | Callable[["GPTModelProvider"], ModuleSpec] = field(
+ default_factory=lambda: partial(get_gpt_decoder_block_spec, use_transformer_engine=HAVE_TE)
+ )
@@
- init_method_std: int = 0.02
+ init_method_std: float = 0.02
@@
- share_embeddings_and_output_weights: Optional[bool] = False
+ share_embeddings_and_output_weights: bool | None = False
@@
- mrope_section: List[int] = field(default_factory=lambda: [8, 12, 12])
+ mrope_section: list[int] = field(default_factory=lambda: [8, 12, 12])
@@
- mtp_num_layers: Optional[int] = 1
- mtp_loss_scaling_factor: Optional[float] = 0.3
+ mtp_num_layers: int | None = 1
+ mtp_loss_scaling_factor: float | None = 0.3
@@
- moe_layer_freq: Union[int, List[int]] = field(default_factory=lambda: [0] * 1 + [1] * 45)
+ moe_layer_freq: int | list[int] = field(default_factory=lambda: [0] * 1 + [1] * 45)
Ruff confirms UP006, UP007, and RUF009 violations at lines 47, 75, and 105. Lines 56, 59, 97–98 contain similar issues requiring modernization per Python 3.10+ guidelines.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/glm_vl/glm_45v_provider.py` around lines 17 - 18,
The file mixes old typing (Union/Optional/List) and unsafe dataclass defaults:
replace Union/Optional/List annotations with modern PEP 604/PEP 585 syntax
(e.g., A | B, list[T], T | None) for the annotated variables and function
signatures (search for uses in e.g., any function or dataclass fields around
init_method_std and other annotated parameters), change the annotation of
init_method_std from int to float to match its float literal default, and remove
direct calls to functools.partial in dataclass defaults by using
dataclasses.field(default_factory=...) to construct the partial at instance
creation time (or convert to a module-level factory function) for fields
currently assigned partial(...) so you comply with RUF009 and UP006/UP007.
@MegatronModelBridge.register_bridge(
    source="MiniMaxM2ForCausalLM",
    target=GPTModel,
    model_type="minimax_m2",
)
🛠️ Refactor suggestion | 🟠 Major
Align MiniMax-M2 bridge layout with repository bridge conventions.
Please move bridge orchestration to model_bridge.py and mapping definitions to param_mapping.py under the same model package to match the established layout and keep future bridge maintenance consistent.
Based on learnings: Applies to src/megatron/bridge/models/*/model_bridge.py : Create model bridge implementation in model_bridge.py when adding new model bridges; applies to src/megatron/bridge/models/*/param_mapping.py : Create parameter mapping in param_mapping.py when adding new model bridges.
Also applies to: 95-149
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/minimax_m2/minimax_m2_bridge.py` around lines 27 -
31, The bridge currently registered with MegatronModelBridge (decorator usage
around MiniMaxM2ForCausalLM -> GPTModel) should be split to follow repo
conventions: move the orchestration/bridge implementation (the class and
register_bridge call) into a new model_bridge.py inside the model package, and
extract all parameter mapping logic into a new param_mapping.py; ensure
model_bridge.py imports the mapping functions/classes from param_mapping.py and
registers the bridge (same class name used now), and move the code currently
spanning the mapping section (previously around lines 95-149) into
param_mapping.py as exported mapping utilities referenced by model_bridge.py.
# TODO: MiniMax-M2 uses full-dimension QK norm (q_norm weight shape = num_heads * head_dim)
# while Megatron uses per-head QK norm (weight shape = head_dim). These are
# mathematically different (different normalization denominators), so the weights
# cannot be losslessly converted. Disabled here, which means q_norm.weight and
# k_norm.weight are dropped during conversion. This is acceptable for fine-tuning
# but will cause forward-pass divergence from HF for inference.
# Fix: add full-dimension QK norm support to Megatron-Core, or write a custom
# layer spec for this model.
provider.qk_layernorm = False
Do not silently drop unsupported QK-norm weights during conversion.
Current behavior proceeds with conversion while dropping q_norm.weight / k_norm.weight, which can be misread as successful exact conversion. This should fail fast (or require explicit opt-in for inexact conversion).
💡 Suggested guard
- provider.qk_layernorm = False
+ if getattr(hf_config, "use_qk_norm", False):
+ raise NotImplementedError(
+ "MiniMax-M2 q_norm/k_norm conversion is not supported yet. "
+ "Conversion currently drops these weights and is inexact."
+ )
+ provider.qk_layernorm = False
Based on learnings: In Python implementations, when a feature is not supported, raise an explicit error (e.g., NotImplementedError) with a descriptive message.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/minimax_m2/minimax_m2_bridge.py` around lines 73 -
82, The code currently silences unsupported QK-norm by setting
provider.qk_layernorm = False which drops q_norm.weight and k_norm.weight;
instead, make conversion fail-fast by detecting when those weights are present
and raising a NotImplementedError (or requiring an explicit opt-in flag) so users
cannot mistakenly get a misleading “successful” conversion. Update the logic
around provider.qk_layernorm in minimax_m2_bridge (and any conversion entrypoint
that calls it) to check for q_norm.weight/k_norm.weight in the source model and
either raise NotImplementedError with a descriptive message referencing
q_norm.weight and k_norm.weight and provider.qk_layernorm, or accept an explicit
parameter (e.g., allow_inexact_qk_norm) that must be true to proceed; ensure the
error message tells users how to opt in if desired.
# Model config - Nemotron 3 Nano
cfg.model = MambaModelProvider(
    # Architecture (Nemotron 3 Nano 30B-A3B)
    hybrid_override_pattern="MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME",
    num_layers=52,
    hidden_size=2688,
    mamba_num_heads=64,
    kv_channels=128,
    mamba_state_dim=128,
    ffn_hidden_size=1856,
    num_attention_heads=32,
    mamba_head_dim=64,
    seq_length=2048,
    num_query_groups=2,
    # MoE
    num_moe_experts=128,
    moe_ffn_hidden_size=1856,
    moe_shared_expert_intermediate_size=3712,
    moe_router_topk=6,
    moe_router_topk_scaling_factor=2.5,
    moe_router_num_groups=1,
    moe_router_group_topk=1,
    # NemotronH base
    mamba_num_groups=8,
    make_vocab_size_divisible_by=128,
    activation_func=squared_relu,
    masked_softmax_fusion=True,
    apply_query_key_layer_scaling=False,
    persist_layer_norm=True,
    attention_softmax_in_fp32=False,
    first_last_layers_bf16=True,
    is_hybrid_model=True,
    moe_aux_loss_coeff=0.0001,
    moe_router_score_function="sigmoid",
    moe_router_enable_expert_bias=True,
    moe_router_load_balancing_type="seq_aux_loss",
    moe_router_dtype="fp32",
    moe_grouped_gemm=True,
    moe_token_dispatcher_type="alltoall",
    moe_permute_fusion=True,
    moe_shared_expert_overlap=True,
    # Extra config
    apply_rope_fusion=False,
    attention_backend="fused",
    gradient_accumulation_fusion=True,
    init_method_std=0.0173,
    use_fused_weighted_squared_relu=True,
    calculate_per_token_loss=True,
    # Parallelism
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect attention_backend assignments in this recipe
echo "=== attention_backend in nemotron_3_nano.py ==="
rg -n 'attention_backend\s*=' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py
# Inspect Mamba provider backend typing
echo -e "\n=== MambaModelProvider attention_backend type ==="
rg -n 'attention_backend' src/megatron/bridge/models/mamba/mamba_provider.py | head -20
# Check if AttnBackend is imported/used in Mamba provider
echo -e "\n=== AttnBackend usage in mamba_provider.py ==="
rg -n 'AttnBackend' src/megatron/bridge/models/mamba/mamba_provider.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 610
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check how attention_backend is used in other recipes
echo "=== AttnBackend usage in recipes ==="
rg -l 'AttnBackend' src/megatron/bridge/recipes/
# Check string usage in other recipes
echo -e "\n=== attention_backend=\"fused\" usage in recipes ==="
rg -n 'attention_backend\s*=' src/megatron/bridge/recipes/ | grep -E '(fused|sdpa|flash_attention)' | head -20
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 711
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find AttnBackend enum definition
echo "=== AttnBackend enum definition ==="
rg -n 'class AttnBackend|AttnBackend.*=' --type py | grep -E '(class|enum)' | head -20
# Check imports in megatron.core
fd -e py -p 'enums\.py' src/megatron/bridge
rg -n 'AttnBackend' src/megatron/bridge/models/mamba/mamba_provider.py -B5 -A5
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 106
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check how nemotronh.py (parent recipe) uses attention_backend
echo "=== nemotronh.py attention_backend usage ==="
rg -n -B2 -A2 'attention_backend' src/megatron/bridge/recipes/nemotronh/nemotronh.py
# Check the exact lines in the identified ranges
echo -e "\n=== nemotron_3_nano.py lines 211-259 ==="
sed -n '211,259p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py | grep -n 'attention_backend' || echo "attention_backend not in this range"
echo -e "\n=== nemotron_3_nano.py lines 402-449 ==="
sed -n '402,449p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py | grep -n 'attention_backend' || echo "attention_backend not in this range"
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3062
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the full context of the flagged assignments
echo "=== Full context: line 254 (in range 211-259) ==="
sed -n '250,260p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py
echo -e "\n=== Full context: line 444 (in range 402-449) ==="
sed -n '440,450p' src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py
# Check AttnBackend enum import in nemotron_3_nano.py
echo -e "\n=== Imports in nemotron_3_nano.py ==="
head -40 src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py | grep -E '(import|from).*AttnBackend|megatron'
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1482
Use AttnBackend enum instead of raw string values for attention_backend in MambaModelProvider.
The MambaModelProvider type signature requires attention_backend: AttnBackend, but the recipe passes raw strings ("fused"). The parent recipe nemotronh.py correctly uses AttnBackend.auto. Add the missing import and update all assignments to use enum values.
♻️ Suggested update
+from megatron.core.transformer.enums import AttnBackend
from megatron.core.activations import squared_relu
from megatron.bridge.models.mamba.mamba_provider import MambaModelProvider
@@
- attention_backend="fused",
+ attention_backend=AttnBackend.fused,
@@
- attention_backend="fused",
+ attention_backend=AttnBackend.fused,
Also applies to lines 125, 298, and 488, where cfg.model.attention_backend = "fused" should use enum values.
🧰 Tools
🪛 Ruff (0.15.2)
[error] 249-249: Possible hardcoded password assigned to argument: "moe_token_dispatcher_type"
(S106)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py` around lines 211 -
259, The recipe passes raw strings to MambaModelProvider.attention_backend but
the provider expects the AttnBackend enum; import AttnBackend and replace string
usages (e.g., attention_backend="fused" and any later assignments like
cfg.model.attention_backend = "fused") with the corresponding AttnBackend enum
member (for example AttnBackend.fused or AttnBackend.auto as appropriate) so all
occurrences in this recipe (including the MambaModelProvider constructor and
subsequent cfg.model.attention_backend assignments) use the enum type instead of
raw strings.
cfg.model = MambaModelProvider(
    # Architecture (NemotronH 56B)
    hybrid_override_pattern=(
        "M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-"
        "M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M-"
    ),
    num_layers=118,
    hidden_size=8192,
    mamba_state_dim=256,
    mamba_num_heads=256,
    ffn_hidden_size=32768,
    num_attention_heads=64,
    attention_backend=AttnBackend.auto,
    # NemotronH base
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# First, check if the file exists and get its structure
if [ -f "src/megatron/bridge/recipes/nemotronh/nemotronh.py" ]; then
echo "=== File found. Showing lines around claimed assignments ==="
# Show constructor assignment at 443-456
echo -e "\n--- Constructor at lines 443-456 ---"
sed -n '443,456p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
# Show line 524 (first override claim)
echo -e "\n--- Assignment at line 524 ---"
sed -n '520,530p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
# Show 969-983 (second constructor claim)
echo -e "\n--- Constructor at lines 969-983 ---"
sed -n '969,983p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
# Show line 1032 (second override claim)
echo -e "\n--- Assignment at line 1032 ---"
sed -n '1028,1038p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
# Show 1551-1564 (third constructor claim)
echo -e "\n--- Constructor at lines 1551-1564 ---"
sed -n '1551,1564p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
# Show line 1613 (third override claim)
echo -e "\n--- Assignment at line 1613 ---"
sed -n '1609,1619p' src/megatron/bridge/recipes/nemotronh/nemotronh.py
echo -e "\n--- All attention_backend assignments in file ---"
rg -n 'attention_backend\s*=' src/megatron/bridge/recipes/nemotronh/nemotronh.py
else
echo "File not found at src/megatron/bridge/recipes/nemotronh/nemotronh.py"
echo "Searching for nemotronh files..."
fd nemotronh.py
fiRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3687
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if AttnBackend has 'auto' variant
echo "=== AttnBackend enum definition ==="
fd AttnBackend -t f | head -10
# Search for AttnBackend enum definition
rg -l "class AttnBackend|enum AttnBackend" --type py
# Look at the definition
rg -A 10 "class AttnBackend" src/megatron/core/transformer/enums.py 2>/dev/null || rg -A 10 "class AttnBackend"
# Verify MambaModelProvider signature
echo -e "\n=== MambaModelProvider constructor ==="
rg -A 5 "class MambaModelProvider|def __init__" src/megatron/bridge/models/mamba/mamba_provider.py | head -40
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 106
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check imports in nemotronh.py
echo "=== Imports in nemotronh.py ==="
head -50 src/megatron/bridge/recipes/nemotronh/nemotronh.py | grep -i "attnbackend\|import"
# Search for AttnBackend definition across codebase
echo -e "\n=== Searching for AttnBackend definition ==="
rg "class AttnBackend|enum.*AttnBackend" -A 5
# Also check what's imported from megatron
echo -e "\n=== Checking for AttnBackend.auto usage ==="
rg "AttnBackend\." --type py | head -20
# Check MambaModelProvider to see if it accepts attention_backend
echo -e "\n=== MambaModelProvider signature ==="
rg -B 5 -A 20 "class MambaModelProvider" src/megatron/bridge/models/mamba/mamba_provider.py | head -50
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 9705
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify MambaModelProvider's attention_backend field definition
echo "=== MambaModelProvider field definition (full class) ==="
rg -B 2 -A 40 "class MambaModelProvider" src/megatron/bridge/models/mamba/mamba_provider.py | grep -A 35 "class MambaModelProvider"
# Check if attention_backend is inherited from TransformerConfig
echo -e "\n=== TransformerConfig definition ==="
rg -A 5 "class TransformerConfig" src/megatron/bridge/models/transformer_config.py | head -20
# Verify what the default attention_backend is in MambaModelProvider
echo -e "\n=== MambaModelProvider attention_backend initialization ==="
rg "attention_backend.*=" src/megatron/bridge/models/mamba/mamba_provider.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2083
Consolidate conflicting attention_backend assignments in NemotronH 56B configs.
AttnBackend.auto is set in the constructor, then immediately overwritten to None in the same function, rendering the constructor parameter ineffective. This pattern repeats across three separate configuration functions.
♻️ Suggested fix
Either remove the constructor parameter:
- attention_backend=AttnBackend.auto,
Or, if auto is the intended behavior, replace the later None assignments:
- cfg.model.attention_backend = None
+ cfg.model.attention_backend = AttnBackend.auto
Also applies to: lines 524, 969–983, 1032, 1551–1564, 1613
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py` around lines 443 - 456,
The NemotronH 56B configs create MambaModelProvider with
attention_backend=AttnBackend.auto but later assignments overwrite it to None;
update the code so the constructor parameter is authoritative by removing the
subsequent assignments that set attention_backend = None (or, if auto is
intended later, replace those None assignments with AttnBackend.auto) in each
config block where cfg.model = MambaModelProvider(...) appears (e.g., the block
starting with cfg.model = MambaModelProvider and the later places noted around
the other config functions); ensure only one definitive assignment to
attention_backend remains so AttnBackend.auto is not silently discarded.
import json
import subprocess
from pathlib import Path
🧩 Analysis chain
🏁 Script executed:
# First, let's check if the file exists and get its full content
wc -l tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 145
🏁 Script executed:
# Read the test file with line numbers
cat -n tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 7712
Use portable subprocess configuration and explicit exception handling in conversion test.
The subprocess command contains hardcoded interpreter path and absolute repo paths that break outside the specific CI environment. Line 138 uses "python" instead of sys.executable, and lines 146–147 hardcode /opt/Megatron-Bridge/ instead of computing from the test file location. Additionally, line 172 uses assert False which can be optimized away; use pytest.fail() instead.
🔧 Suggested fix
import json
import subprocess
+import sys
from pathlib import Path
@@
- cmd = [
- "python",
+ repo_root = Path(__file__).resolve().parents[4]
+ coverage_file = tmp_path / ".coverage"
+ cmd = [
+ sys.executable,
"-m",
"torch.distributed.run",
"--nproc_per_node=2",
"--nnodes=1",
"-m",
"coverage",
"run",
- "--data-file=/opt/Megatron-Bridge/.coverage",
- "--source=/opt/Megatron-Bridge/",
+ f"--data-file={coverage_file}",
+ f"--source={repo_root}",
"--parallel-mode",
- "examples/conversion/hf_megatron_roundtrip_multi_gpu.py",
+ str(repo_root / "examples/conversion/hf_megatron_roundtrip_multi_gpu.py"),
@@
cwd=repo_root,
)
@@
if result.returncode != 0:
print(f"STDOUT: {result.stdout}")
print(f"STDERR: {result.stderr}")
- assert False, f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}"
+ pytest.fail(f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}")
🤖 Prompt for AI Agents
+ pytest.fail(f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py`
around lines 15 - 17, The test uses hardcoded interpreter and absolute repo
paths and a bare assert; update the subprocess invocation(s) in
test_minimax_m2_conversion.py to use sys.executable instead of "python", compute
repository paths relative to the test file using
Path(__file__).resolve().parents[...] (replace occurrences of
"/opt/Megatron-Bridge/" with the computed path) and pass cwd to subprocess.run,
use capture_output=True/text=True and check=True for portable behavior, and wrap
the subprocess call in a try/except catching subprocess.CalledProcessError (or
Exception) and call pytest.fail(...) with the captured stdout/stderr on error
instead of using assert False. Ensure you update imports to include sys and
pytest and adjust the function or block that performs the conversion accordingly
(identify the subprocess.run call and its surrounding error handling).
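The portable pattern suggested above can be sketched as a small helper; `run_repo_script` and its arguments are illustrative names rather than the actual test code, which wraps a `torch.distributed.run` command:

```python
import subprocess
import sys
from pathlib import Path


def run_repo_script(script_rel_path: str, repo_root: Path) -> str:
    """Run a repo script with the interpreter running this test, from the repo root."""
    cmd = [sys.executable, str(repo_root / script_rel_path)]
    result = subprocess.run(cmd, capture_output=True, text=True, cwd=repo_root)
    if result.returncode != 0:
        # Inside a pytest test body, prefer pytest.fail(...) with this message.
        raise RuntimeError(
            f"{script_rel_path} failed ({result.returncode}):\n"
            f"STDOUT: {result.stdout}\nSTDERR: {result.stderr}"
        )
    return result.stdout
```

In the real test, `repo_root` would come from `Path(__file__).resolve().parents[...]` so the same invocation works inside and outside the CI container.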
if result.returncode != 0:
    print(f"STDOUT: {result.stdout}")
    print(f"STDERR: {result.stderr}")
    assert False, f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}"
🧩 Analysis chain
🏁 Script executed:
# Check the file structure and verify the code at lines 169-173
head -180 tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py | tail -20

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 770
🏁 Script executed:
# Search for all assert False patterns in this file
rg -n 'assert False' tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 181
🏁 Script executed:
# Get full context around lines 169-173
sed -n '160,180p' tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 780
Replace assert False with an explicit exception.
Using assert False in critical failure paths is problematic because asserts can be disabled when Python runs in optimized mode (-O or -OO flags). This is particularly important for functional tests where failures must be explicitly raised regardless of optimization settings.
Suggested fix
- assert False, f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}"
+ raise AssertionError(
+ f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}"
+ )

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested replacement:

if result.returncode != 0:
    print(f"STDOUT: {result.stdout}")
    print(f"STDERR: {result.stderr}")
    raise AssertionError(
        f"MiniMax-M2 {test_name} conversion failed with return code {result.returncode}"
    )
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 172-172: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/models/minimax_m2/test_minimax_m2_conversion.py`
around lines 169 - 173, Replace the fragile "assert False" in the failure path
with an explicit exception so failures are raised even under Python -O;
specifically, where you check result.returncode != 0 (using variables result and
test_name), keep the STDOUT/STDERR prints but replace the assert False line with
a raise (e.g., raise RuntimeError or AssertionError) that includes the same
message and return code information so the test fails reliably and provides
context.
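The B011 warning above is behavioral, not stylistic: under `python -O`, `assert` statements are compiled out entirely. A small self-contained demonstration that spawns the interpreter twice:

```python
import subprocess
import sys

snippet = "assert False, 'unreachable under -O'"

# Default mode: the assert raises AssertionError and the process exits non-zero.
normal = subprocess.run([sys.executable, "-c", snippet], capture_output=True)

# Optimized mode: the assert is stripped at compile time, so nothing fails.
optimized = subprocess.run([sys.executable, "-O", "-c", snippet], capture_output=True)

print(normal.returncode, optimized.returncode)  # 1 0
```

This is why the suggested fix raises AssertionError explicitly: an explicit raise survives optimization flags.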
 class TestGLM45VModelProviderInheritance:
-    """Test inheritance behavior from GLM45AirModelProvider106B."""
+    """Test inheritance behavior from GPTModelProvider."""

-    def test_glm_45v_inherits_from_air_provider(self, mock_vision_config):
-        """Test GLM45VModelProvider inherits from GLM45AirModelProvider106B."""
-        from megatron.bridge.models import GLM45AirModelProvider106B
+    def test_glm_45v_inherits_from_gpt_provider(self, mock_vision_config):
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify whether touched test class has a pytest category marker.
rg -n -C1 '^\s*@pytest\.mark\.(unit|integration|system)\b|^class TestGLM45VModelProviderInheritance\b' \
  tests/unit_tests/models/glm_vl/test_glm_45v_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 184
🏁 Script executed:
#!/bin/bash
# Get more context around the class definition (lines 235-250)
head -n 250 tests/unit_tests/models/glm_vl/test_glm_45v_provider.py | tail -n 20

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 717
🏁 Script executed:
#!/bin/bash
# Check if other test classes in this file have pytest markers
rg -n '^@pytest\.mark\.(unit|integration|system)|^class Test' \
  tests/unit_tests/models/glm_vl/test_glm_45v_provider.py | head -n 30

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 194
🏁 Script executed:
#!/bin/bash
# Check the first few lines of the test classes to see if ANY have pytest markers
rg -B5 'class TestGLM45VModelProvider' tests/unit_tests/models/glm_vl/test_glm_45v_provider.py | grep -E '@pytest\.mark|class Test'

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 183
Add the @pytest.mark.unit decorator to the test class.
The class TestGLM45VModelProviderInheritance lacks a pytest category marker. Per coding guidelines, all tests in tests/**/*.py must be explicitly categorized with pytest.mark (unit, integration, or system).
Proposed change
+@pytest.mark.unit
 class TestGLM45VModelProviderInheritance:
     """Test inheritance behavior from GPTModelProvider."""

📝 Committable suggestion
Suggested replacement:

@pytest.mark.unit
class TestGLM45VModelProviderInheritance:
    """Test inheritance behavior from GPTModelProvider."""

    def test_glm_45v_inherits_from_gpt_provider(self, mock_vision_config):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit_tests/models/glm_vl/test_glm_45v_provider.py` around lines 240 -
243, The test class TestGLM45VModelProviderInheritance is missing a pytest
category marker; add the `@pytest.mark.unit` decorator above the class definition
to categorize it as a unit test and ensure pytest is imported (add "import
pytest" at the top if not already present). Locate the class named
TestGLM45VModelProviderInheritance and prepend the decorator line
`@pytest.mark.unit` directly above the class declaration so the test follows the
repository's pytest.mark conventions.
…ti-node support for MiniMax-M2

Add custom full-dimension QK normalization (minimax_m2_provider.py) since MiniMax-M2 applies RMSNorm over the entire Q/K projection rather than per-head. The implementation uses a sum-of-squares all-reduce across TP ranks and provides sharded_state_dict for distributed checkpointing.

Add on-the-fly FP8 block-wise dequantization in the bridge via maybe_modify_loaded_hf_weight, converting float8_e4m3fn weights to bfloat16 using per-block scale_inv factors during HF->Megatron conversion.

Add multi-node Slurm scripts (slurm_conversion.sh, slurm_inference.sh) for configurations requiring TP*EP*PP > 8 GPUs.

Update verify_toy_model.py to extract real pretrained weights (N layers) from the FP8 model, dequantize to bf16, and verify round-trip accuracy.

Fix dtype mismatch handling in hf_megatron_roundtrip_multi_gpu.py for FP8 source models.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
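The block-wise dequantization described in the commit can be sketched with NumPy standing in for torch. The tile size and the `scale_inv` layout (one inverse scale per tile) are illustrative assumptions; the bridge applies the same idea to `float8_e4m3fn` tensors during HF->Megatron loading:

```python
import numpy as np


def dequantize_blockwise(w_q: np.ndarray, scale_inv: np.ndarray, block: int = 128) -> np.ndarray:
    """Scale each (block x block) tile of a quantized 2D weight by its inverse scale."""
    out = w_q.astype(np.float32).copy()
    for i in range(scale_inv.shape[0]):
        for j in range(scale_inv.shape[1]):
            r, c = i * block, j * block
            out[r:r + block, c:c + block] *= scale_inv[i, j]
    return out
```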
Summary
Add Megatron Bridge for MiniMaxAI/MiniMax-M2, a sparse MoE model with 256 experts, top-8 sigmoid routing, and expert bias correction.
Bridge (minimax_m2_bridge.py)
- rotary_percent calculation (rotary_dim / head_dim)
- block_sparse_moe prefix (w1/w2/w3 format)
- e_score_correction_bias buffer mapping

Examples
- conversion.sh — single/multi-GPU round-trip and checkpoint import/export
- inference.sh — text generation with TP support
- verify_toy_model.py — creates a toy model and runs forward-pass comparison via compare.py

Tests
- test_minimax_m2_conversion.py — toy model creation, single-GPU round-trip, TP=2 and PP=2 parallelism tests

Bug fix in compare.py

Verification

Known Limitations
- QK norm weights (q_norm, k_norm) are dropped — MiniMax-M2 uses full-dimension QK norm while Megatron uses per-head. Acceptable for fine-tuning.

Test plan
- verify_toy_model.py on 1-GPU, TP=2, EP=2
- test_minimax_m2_conversion.py with 2+ GPUs

Made with Cursor
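The full-dimension vs per-head QK norm mismatch noted in the limitations, and the sum-of-squares all-reduce used by the provider, can both be illustrated with a NumPy sketch; shapes are arbitrary, and a plain Python `sum` over shards stands in for the TP all-reduce:

```python
import numpy as np


def rmsnorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMS normalization over the last axis (no learned scale, for clarity)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


heads, head_dim = 4, 8
q = np.random.default_rng(0).standard_normal(heads * head_dim)

full = rmsnorm(q)                                            # one norm over all 32 dims
per_head = rmsnorm(q.reshape(heads, head_dim)).reshape(-1)   # one norm per 8-dim head
print(np.allclose(full, per_head))  # False: the two schemes disagree in general

# Under tensor parallelism, the full-dim statistic is assembled from shards:
shards = np.split(q, 2)                               # each TP rank holds a slice
sumsq = sum(float(np.sum(s * s)) for s in shards)     # stands in for the all-reduce
tp_full = q / np.sqrt(sumsq / q.size + 1e-6)
print(np.allclose(tp_full, full))  # True: matches the single-device full-dim norm
```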
Summary by CodeRabbit
- New Features
- Bug Fixes
- Refactor
- Tests