
Add GLM5 support#2469

Open
pengdurice wants to merge 19 commits into NVIDIA-NeMo:main from pengdurice:peng-add-glm5

Conversation


@pengdurice pengdurice commented Feb 20, 2026

What does this PR do ?

This PR adds support for the GLM5 model.

Changelog

  • Add the GLM5 model bridge and model provider classes
  • The GLM5 model is only available starting with transformers 5.2.0, and Megatron-Bridge has not yet migrated to transformers 5, so additional changes are needed to support it without breaking backward compatibility. In the new transformers, the experts are fused into a single 3D tensor whose 0th dimension indexes the experts, which breaks the 1-1 mapping between HF and Megatron tensors. We therefore override the maybe_modify_loaded_hf_weight and maybe_modify_converted_hf_weight functions so that the HF tensor is sliced into per-expert Megatron tensors on load, and the Megatron tensors are concatenated back into the fused HF tensor on export.
  • Please update your local transformers installation to 5.2.0 before using this feature.
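The slice/concat round trip described above can be sketched as follows. This is a minimal illustration with hypothetical shapes and key names, not the actual bridge code:

```python
import torch

# Hypothetical toy shapes; the real fused HF tensor packs all experts along dim 0.
num_experts, in_dim, out_dim = 4, 8, 16
fused_hf = torch.randn(num_experts, in_dim, out_dim)

# HF -> Megatron: slice the fused 3D tensor into one 2D weight per expert.
per_expert = {f"experts.{e}.weight": fused_hf[e] for e in range(num_experts)}

# Megatron -> HF: stack the per-expert tensors back into the fused layout.
repacked = torch.stack(
    [per_expert[f"experts.{e}.weight"] for e in range(num_experts)], dim=0
)
assert torch.equal(repacked, fused_hf)
```

Each slice `fused_hf[e]` is a view with the same storage layout a per-expert Megatron weight expects, so the round trip is lossless.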

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Summary by CodeRabbit

  • New Features

    • Added GLM-5 MoE model support for converting between HuggingFace and Megatron-Core formats, including provider/config translation and export/import hooks for MoE weights.
    • Exposed GLM-5 bridge and model provider in the public API.
  • Bug Fixes / Improvements

    • Improved handling of per-expert (MoE) weights for import/export and added logic to tolerate certain HuggingFace parameter name patterns for gated MLPs.
  • Tests

    • Added functional test suite validating GLM-5 MoE conversion workflows across parallelism configurations.


copy-pr-bot bot commented Feb 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@HollowMan6 HollowMan6 left a comment


Probably we can also integrate the tilelang fused kernels for DSA (https://github.com/THUDM/slime/tree/14413cf7c5c0d1c9854a28f471d3ab575461aa51/slime_plugins/models/glm5) to accelerate training.

@pengdurice
Author

Probably we can also integrate the tilelang fused kernels for DSA (https://github.com/THUDM/slime/tree/14413cf7c5c0d1c9854a28f471d3ab575461aa51/slime_plugins/models/glm5) to accelerate training.

that's a great idea, let me add it in the next PR;-)

@pengdurice pengdurice marked this pull request as ready for review February 24, 2026 18:32
@pengdurice pengdurice changed the title [Draft] Add GLM5 support Add GLM5 support Feb 24, 2026
@pengdurice pengdurice changed the title Add GLM5 support [Draft] Add GLM5 support Feb 24, 2026

pengdurice commented Feb 24, 2026

@HollowMan6 would you mind reviewing this PR when you get a chance? Thanks! Edit: rebasing caused some 3rd-party module changes to be wrongly picked up; I need to fix them first, so I'm converting back to draft.

Contributor

coderabbitai bot commented Feb 24, 2026

📝 Walkthrough

Walkthrough

Adds GLM-5 MoE import/export support: a GLM5Bridge and GLM5ModelProvider, MoE-aware weight slicing/packing and mapping logic, updates to gated MLP mapping, functional tests for GLM-5 conversion, and a Megatron-LM submodule pointer update.

Changes

Cohort / File(s) Summary
Submodule Update
3rdparty/Megatron-LM
Updated submodule pointer to a newer commit for GLM-5 MoE compatibility.
Bridge & Provider
src/megatron/bridge/models/glm_moe_dsa/__init__.py, src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py, src/megatron/bridge/models/glm_moe_dsa/glm5_provider.py, src/megatron/bridge/models/__init__.py
Adds GLM5Bridge (bridge registration, config translation, mapping registry, MoE-aware weight hooks) and GLM5ModelProvider (dataclass with DSA/MoE defaults); exports added to package init.
Weight Mapping & Conversion
src/megatron/bridge/models/conversion/param_mapping.py, src/megatron/bridge/models/glm/glm45_bridge.py
GatedMLPMapping: new helper to detect HF expert-name patterns and flag allow_hf_name_mismatch. GLM45Bridge: new methods to slice per-expert HF weights on load and pack per-expert weights for HF export, plus related helpers and caching.
Tests
tests/functional_tests/models/glm_moe_dsa/__init__.py, tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py
Adds functional tests that build a toy GLM-5 MoE model, run roundtrip conversions across TP/PP/EP combinations, and assert presence/structure of MoE config and weight artifacts.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant Bridge as GLM5Bridge
    participant Config as ConfigBuilder
    participant Registry as MappingRegistry
    participant Weights as WeightHandler
    participant Target as MegatronModel

    User->>Bridge: start HF→Megatron conversion
    Bridge->>Config: _get_glm5_configs(hf_pretrained)
    Config-->>Bridge: provider_config
    Bridge->>Bridge: provider_bridge() -> GLM5ModelProvider
    Bridge->>Registry: mapping_registry()
    Registry-->>Bridge: param mappings
    Bridge->>Weights: maybe_modify_loaded_hf_weight(hf_param, hf_state_dict)
    Weights-->>Bridge: sliced per-expert tensors
    Bridge->>Target: apply mapped weights and provider config
    Target-->>User: converted Megatron-Core model

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Suggested labels

Run CICD

Suggested reviewers

  • erhoo82
  • malay-nagda
  • yaoyu-33
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR adds major GLM5 MoE support but lacks documented test results, numeric validation, or convergence verification despite complex weight conversion logic. Document and include test execution results, numeric validation, and convergence characteristics in the PR description to verify the changes work correctly without regressions.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check ✅ Passed The title 'Add GLM5 support' accurately and concisely summarizes the main objective of the pull request, which is to introduce GLM5 model support with a new bridge and provider.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@pengdurice pengdurice marked this pull request as draft February 24, 2026 18:41
@pengdurice pengdurice marked this pull request as ready for review February 24, 2026 22:31
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🟡 Minor comments (8)
tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py-331-332 (1)

331-332: ⚠️ Potential issue | 🟡 Minor

Commented-out # try: is a leftover debug artifact.

As per coding guidelines: "If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it as debug code before merging."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` around
lines 331 - 332, Remove the leftover commented-out debug token "# try:" above
the subprocess.run call in test_glm5_conversion.py (the block where result =
subprocess.run(cmd, capture_output=True, text=True, cwd=repo_root) is invoked);
either delete the commented marker entirely or replace it with a proper
try/except around subprocess.run (handling exceptions and failing the test) if
you intended to catch errors — ensure the change updates the surrounding test
logic (the test function that invokes subprocess.run) consistently and does not
leave unexplained commented-out code.
tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py-194-197 (1)

194-197: ⚠️ Potential issue | 🟡 Minor

Docstring mentions "GLM 4.5" but this is the GLM 5 bridge test class.

🐛 Proposed fix
-    """
-    Test GLM 4.5 MoE model conversion from local HuggingFace model with different parallelism configurations.
-    """
+    """Test GLM 5 MoE model conversion from local HuggingFace model with different parallelism configurations."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` around
lines 194 - 197, The class docstring for TestGLM5Conversion incorrectly
references "GLM 4.5"; update the docstring in the TestGLM5Conversion class to
mention "GLM 5" (or a neutral phrase like "GLM 5 bridge") so it accurately
describes the test coverage for the GLM 5 MoE model conversion; locate the
TestGLM5Conversion class and edit its triple-quoted docstring accordingly.
tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py-335-338 (1)

335-338: ⚠️ Potential issue | 🟡 Minor

assert False is stripped under python -O; use raise AssertionError (B011).

This pattern applies to both lines 338 and 434.

🐛 Proposed fix (applies similarly to line 434)
-            assert False, f"GLM 5 MoE {test_name} conversion failed with return code {result.returncode}"
+            raise AssertionError(f"GLM 5 MoE {test_name} conversion failed with return code {result.returncode}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` around
lines 335 - 338, Replace the use of "assert False" which is ignored under
optimized Python with explicit raises: in the failure branches that inspect
result.returncode (using the result variable and test_name) change the "assert
False, f'GLM 5 MoE {test_name} conversion failed with return code
{result.returncode}'" to "raise AssertionError(...)" so the failure always
raises regardless of optimization; make the same change for the similar
assertion at the other occurrence later in the file (the second block around
line 434).
tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py-293-293 (1)

293-293: ⚠️ Potential issue | 🟡 Minor

Line exceeds the 119-character maximum.

The print(...) at line 293 is a very long line and also contains debug output that should be removed before merging.

As per coding guidelines: "Maximum line length is 119 characters (matching ruff configuration)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` at line
293, Remove the long debug print statement that exceeds 119 chars: delete the
print(f"second_layer mlp experts: {second_layer.mlp.experts} and type:
{type(second_layer.mlp.experts)} and size: {total_size} and shapes:
{total_shapes}") in
tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py; if you need
to keep information for debugging, replace it with a concise logger.debug or a
focused assertion that checks specific properties of second_layer.mlp.experts,
total_size, or total_shapes and keep any kept message well under 119 characters.
tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py-404-404 (1)

404-404: ⚠️ Potential issue | 🟡 Minor

Unused method argument tmp_path (ARG002).

test_glm5_conversion_parallelism_local_model declares tmp_path but never uses it; all output goes under local_root_path.

♻️ Proposed fix
-    def test_glm5_conversion_parallelism_local_model(self, tmp_path, tp, pp, ep, test_name):
+    def test_glm5_conversion_parallelism_local_model(self, tp, pp, ep, test_name):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` at line
404, The test function test_glm5_conversion_parallelism_local_model declares an
unused pytest fixture parameter tmp_path; either remove tmp_path from the
function signature or rename it to _tmp_path to silence the unused-argument
warning. Update the signature of test_glm5_conversion_parallelism_local_model
accordingly (and any direct references if present) so the test uses only
local_root_path, leaving other parameters tp, pp, ep, test_name unchanged.
src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py-108-142 (1)

108-142: ⚠️ Potential issue | 🟡 Minor

make_vocab_size_divisible_by is set twice in provider_bridge.

Line 116 sets it in the configs dict, and line 141 sets it again as a direct attribute on the constructed provider. The second assignment is redundant.

🐛 Proposed fix
-        configs["make_vocab_size_divisible_by"] = 1280
         configs["moe_router_score_function"] = "sigmoid"
         ...
-        provider.make_vocab_size_divisible_by = 1280

Keep only the constructor-time assignment (configs["make_vocab_size_divisible_by"] = 1280) since the provider is constructed at line 122 with **configs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` around lines 108 -
142, The provider_bridge function sets make_vocab_size_divisible_by twice
(configs["make_vocab_size_divisible_by"] and
provider.make_vocab_size_divisible_by) which is redundant; remove the later
direct attribute assignment on the provider (the
provider.make_vocab_size_divisible_by = 1280 line) and keep the constructor-time
setting in configs so GLM5ModelProvider(**configs) receives the value once.
src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py-51-58 (1)

51-58: ⚠️ Potential issue | 🟡 Minor

Docstring contains two inaccuracies.

  1. Line 52: "Glm5MoeForCausalLM" — the registered source class (line 47) is GlmMoeDsaForCausalLM.
  2. Line 57: The example shows zai-org/GLM-4.5 instead of a GLM-5 model path.
🐛 Proposed fix
-    This bridge handles the conversion between HuggingFace Glm5MoeForCausalLM
-    (used for GLM 5 models) and Megatron-Core GPTModel formats.
+    This bridge handles the conversion between HuggingFace GlmMoeDsaForCausalLM
+    (used for GLM 5 models) and Megatron-Core GPTModel formats.

     Example:
         >>> from megatron.bridge import AutoBridge
-        >>> bridge = AutoBridge.from_hf_pretrained("zai-org/GLM-4.5")
+        >>> bridge = AutoBridge.from_hf_pretrained("zai-org/GLM-5")
         >>> provider = bridge.to_megatron_provider()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` around lines 51 - 58,
Update the docstring to correct two inaccuracies: replace the incorrect source
class name "Glm5MoeForCausalLM" with the registered class name
GlmMoeDsaForCausalLM, and change the example model path from "zai-org/GLM-4.5"
to a GLM-5 example such as "zai-org/GLM-5" in the AutoBridge.from_hf_pretrained
usage so the example and description correctly reflect GLM-5 and the
GlmMoeDsaForCausalLM source class (references: GlmMoeDsaForCausalLM,
AutoBridge.from_hf_pretrained, to_megatron_provider).
src/megatron/bridge/models/glm/glm45_bridge.py-326-326 (1)

326-326: ⚠️ Potential issue | 🟡 Minor

Import WeightConversionTask directly instead of using a forward reference string.

Line 326 uses a string forward reference task: "WeightConversionTask", but this class should be imported directly from megatron.bridge.models.conversion.model_bridge like other bridge implementations (gpt_oss_bridge.py, deepseek_v3_bridge.py, qwen3_vl_bridge.py). Update the import on line 26 to include WeightConversionTask and remove the quotes from the type hint.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm/glm45_bridge.py` at line 326, Replace the
string-forward type hint for WeightConversionTask with a direct class import:
add WeightConversionTask to the module imports (import it from
megatron.bridge.models.conversion.model_bridge alongside the other bridge
imports) and change the annotated field from task: "WeightConversionTask" to
task: WeightConversionTask so the type is a real symbol rather than a
forward-reference string.
🧹 Nitpick comments (2)
src/megatron/bridge/models/glm_moe_dsa/__init__.py (1)

21-24: Sort __all__ alphabetically.

Ruff (RUF022) flags the unsorted order; "GLM5Bridge" should precede "GLM5ModelProvider".

♻️ Proposed fix
 __all__ = [
-    "GLM5ModelProvider",
     "GLM5Bridge",
+    "GLM5ModelProvider",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm_moe_dsa/__init__.py` around lines 21 - 24, The
__all__ list is not alphabetized which triggers RUF022; update the list in
__init__.py so entries are sorted alphabetically, placing "GLM5Bridge" before
"GLM5ModelProvider" (i.e., change the order of the symbols GLM5Bridge and
GLM5ModelProvider in the __all__ declaration).
src/megatron/bridge/models/glm/glm45_bridge.py (1)

18-18: Replace deprecated typing generics with Python 3.10+ built-ins.

Dict, Optional, and Tuple should be replaced with their built-in equivalents; Mapping should come from collections.abc.

♻️ Proposed fix
-from typing import Dict, Mapping, Optional, Tuple
+from collections.abc import Mapping

Then update all usages in the changed lines:

  • Dict[str, torch.Tensor]dict[str, torch.Tensor]
  • Optional[...]... | None
  • Tuple[str, int, Optional[str]]tuple[str, int, str | None]

As per coding guidelines: "Use built-in generics (list, dict, tuple) instead of typing equivalents", "Use 'T | None' for nullable types instead of 'Optional[T]'".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm/glm45_bridge.py` at line 18, The import and
type annotations in glm45_bridge.py use deprecated typing generics; replace
"from typing import Dict, Mapping, Optional, Tuple" with "from collections.abc
import Mapping" (only) and use built-in generics and PEP 604 unions in
annotations: change Dict[...] → dict[...], Tuple[...] → tuple[...], Optional[T]
→ T | None, and ensure any Mapping stays typed as collections.abc.Mapping;
update all occurrences such as Dict[str, torch.Tensor] → dict[str,
torch.Tensor], Optional[...] → ... | None, and Tuple[str, int, Optional[str]] →
tuple[str, int, str | None] (adjust other similar annotations accordingly).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/models/conversion/param_mapping.py`:
- Around line 1985-1995: Update the _should_allow_hf_name_mismatch signature to
use built-in generics and PEP 604 union syntax: change the parameter annotation
from Union[str, Dict[str, str]] to str | dict[str, str] and add the return type
bool; add type hints to the nested helpers has_expert_wildcard(param_name: str)
-> bool and has_resolved_expert(param_name: str) -> bool so all functions in the
_should_allow_hf_name_mismatch scope have explicit parameter and return type
annotations (keep logic unchanged).

In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py`:
- Around line 385-397: Add an explicit __init__ to GLM5Bridge that initializes
the instance attributes _hf_expert_cache, _hf_expert_expected and _hf_config
(e.g., to empty dicts or None as appropriate) so their lifecycle is
deterministic; then remove the lazy hasattr guards that create those attributes
(the check in maybe_modify_converted_hf_weight that sets
_hf_expert_cache/_hf_expert_expected and the guard that sets _hf_config
elsewhere) and rely on the constructor-initialized members instead, updating any
usages to handle their initialized (possibly empty/None) values.
- Line 18: The import and type hints use typing generics; change those to Python
3.10+ built-ins and appropriate stdlib ABCs: remove Dict, Optional, Tuple (and
Mapping) from the typing import line in glm5_bridge.py and update all
annotations—use dict[...] instead of Dict[...], tuple[...] instead of
Tuple[...], and use X | None instead of Optional[X]; for Mapping use
collections.abc.Mapping (import it) or replace with dict if a concrete mapping
is intended; update all function/method signatures and return annotations (e.g.,
"-> Dict[str, torch.Tensor]" → "-> dict[str, torch.Tensor]",
"Optional[torch.Tensor]" → "torch.Tensor | None", "Tuple[str, int,
Optional[str]]" → "tuple[str, int, str | None]") and remove the now-unused
typing imports.
- Around line 414-428: Replace the broad "except Exception: pass" in
_should_pack_experts with a narrow exception handler (e.g., catch
AttributeError/TypeError) and log the caught error instead of silencing it;
specifically handle cases where accessing hf_state_dict attributes or methods
may fail, and include the error and base_key in the log message. Apply the same
change to the analogous try/except in _expected_experts, and prefer checking for
hasattr/getattr safely before calling has_glob to avoid exceptions where
possible.
- Around line 385-390: The annotated type "WeightConversionTask" used in
maybe_modify_converted_hf_weight is not imported; add an import for
WeightConversionTask from megatron.bridge.models.conversion.model_bridge (or, if
circular imports are possible, import it inside a TYPE_CHECKING block from
typing import TYPE_CHECKING and then from
megatron.bridge.models.conversion.model_bridge import WeightConversionTask
within that block) so the annotation resolves for type checkers while leaving
runtime behavior unchanged.

In `@src/megatron/bridge/models/glm_moe_dsa/glm5_provider.py`:
- Around line 24-32: Override GLM5ModelProvider.__post_init__ to suppress the
inherited deprecation message: implement a __post_init__ in GLM5ModelProvider
that temporarily replaces DeepSeekV3ModelProvider._warn_deprecated with a no-op,
calls super().__post_init__ (so any other base init runs), and then restores the
original _warn_deprecated. This prevents DeepSeekV3ModelProvider.__post_init__
from emitting the misleading warning for GLM5ModelProvider while preserving the
base initialization behavior.
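One way to implement the suggested override, sketched with stand-in classes (the real providers carry many config fields; only the warning mechanics are shown, and the method names mirror those in the prompt above):

```python
import warnings
from dataclasses import dataclass


@dataclass
class DeepSeekV3ModelProvider:  # stand-in for the real base provider
    def __post_init__(self):
        self._warn_deprecated()

    def _warn_deprecated(self):
        warnings.warn("use the newer provider instead", DeprecationWarning)


@dataclass
class GLM5ModelProvider(DeepSeekV3ModelProvider):  # stand-in subclass
    def __post_init__(self):
        # Temporarily no-op the base-class warning, run the base init,
        # then restore the original method even if init raises.
        original = DeepSeekV3ModelProvider._warn_deprecated
        DeepSeekV3ModelProvider._warn_deprecated = lambda self: None
        try:
            super().__post_init__()
        finally:
            DeepSeekV3ModelProvider._warn_deprecated = original
```

Constructing `GLM5ModelProvider()` now runs the base `__post_init__` without emitting the deprecation warning, while `DeepSeekV3ModelProvider()` still warns as before.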

In `@src/megatron/bridge/models/glm/glm45_bridge.py`:
- Around line 407-425: The assembly logic in glm45_bridge.py (using
self._hf_expert_cache, base_key, expert_id, expected, output_base_key) assumes
cached keys are 0..expected-1 and so does torch.stack([cache[i] for i in
range(expected)]), which breaks under expert-parallel shards whose global expert
IDs are a non-zero-based range; update the assembly to (a) check that all
expected global expert IDs are present in cache (e.g., compare set(cache.keys())
to the expected global ID set), then sort the actual global IDs and stack using
that sorted order so tensors align to their global expert index, or (b) if you
intend to support only EP==1, add an explicit assertion that cached expert IDs
are contiguous and start at 0 and fail fast; apply the same fix for both the
single-tensor path and the per-kind ("gate"/"up") path that currently uses
cache[i].
- Around line 334-336: The attributes _hf_config, _hf_expert_cache and
_hf_expert_expected are being created lazily inside
maybe_modify_converted_hf_weight, which can cause AttributeError if accessed
earlier; add an __init__ override on the class to call super().__init__(*args,
**kwargs) and initialize self._hf_config = None, self._hf_expert_cache = {} and
self._hf_expert_expected = {} so these members always exist, and then remove the
hasattr guard around _hf_expert_cache in maybe_modify_converted_hf_weight.
- Around line 353-367: The helper _should_pack_experts swallows all exceptions
around the hf_state_dict membership checks, hiding real bugs; replace the broad
"except Exception: pass" with a narrow handler that only catches expected errors
from the membership test (e.g., TypeError for unhashable keys) and lets other
exceptions propagate; update the try/except around "if base_key in
hf_state_dict" and the ".weight" variant to catch TypeError (or ValueError if
applicable) only, and re-raise any other exception so unexpected
AttributeError/TypeError logic bugs in _should_pack_experts or hf_state_dict
surface.
- Around line 252-254: The return type of maybe_modify_loaded_hf_weight is
wrong: it currently declares -> torch.Tensor but the branch where hf_param is a
dict returns a dict[str, torch.Tensor]; update the annotation to reflect both
possibilities (e.g., -> torch.Tensor | dict[str, torch.Tensor] or use
typing.Union[torch.Tensor, dict[str, torch.Tensor]]), and ensure any necessary
typing imports (from __future__ import annotations or typing.Union/Dict) are
added so type checkers accept the union; keep the existing logic in
maybe_modify_loaded_hf_weight and only change the return type to match both the
tensor and dict branches.
- Around line 373-379: In _expected_experts, replace the broad "except
Exception" around the .shape[0] access for base_key/hf_state_dict with a narrow
except that only catches the anticipated errors (e.g., KeyError, IndexError,
AttributeError) so programming errors are not swallowed; specifically, in the
block that reads hf_state_dict[base_key].shape[0] (and the .weight fallback
using base_key[:-len(".weight")]) catch only those exceptions and set expected =
None in those cases.

In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py`:
- Around line 273-278: The test currently calls
GlmMoeDsaForCausalLM.from_pretrained(..., trust_remote_code=True); remove the
security risk by deleting the trust_remote_code argument or setting
trust_remote_code=False when loading from the local glm5_toy_model_path in
test_glm5_conversion.py so the known local class is used without executing
remote code.
- Around line 119-152: The test fixture _create_glm5_toy_model uses
AutoConfig.from_pretrained("zai-org/GLM-5") and
AutoTokenizer.from_pretrained("zai-org/GLM-5") which creates a network
dependency; replace these with a hermetic local setup by constructing the config
with AutoConfig.for_model("glm_moe_dsa", **HF_GLM5_TOY_MODEL_CONFIG) (so the
test uses HF_GLM5_TOY_MODEL_CONFIG directly and still sets
config.torch_dtype=torch.bfloat16) and stop calling
AutoTokenizer.from_pretrained; instead create a minimal tokenizer
stub/tokenizer_config.json in model_dir (matching patterns used in other tests)
and save that file before calling model.save_pretrained, ensuring
tokenizer.save_pretrained is not used to fetch from HF Hub and the local
config.json is written from HF_GLM5_TOY_MODEL_CONFIG as already done.

---

Minor comments:
In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py`:
- Around line 108-142: The provider_bridge function sets
make_vocab_size_divisible_by twice (configs["make_vocab_size_divisible_by"] and
provider.make_vocab_size_divisible_by) which is redundant; remove the later
direct attribute assignment on the provider (the
provider.make_vocab_size_divisible_by = 1280 line) and keep the constructor-time
setting in configs so GLM5ModelProvider(**configs) receives the value once.
- Around line 51-58: Update the docstring to correct two inaccuracies: replace
the incorrect source class name "Glm5MoeForCausalLM" with the registered class
name GlmMoeDsaForCausalLM, and change the example model path from
"zai-org/GLM-4.5" to a GLM-5 example such as "zai-org/GLM-5" in the
AutoBridge.from_hf_pretrained usage so the example and description correctly
reflect GLM-5 and the GlmMoeDsaForCausalLM source class (references:
GlmMoeDsaForCausalLM, AutoBridge.from_hf_pretrained, to_megatron_provider).

In `@src/megatron/bridge/models/glm/glm45_bridge.py`:
- Line 326: Replace the string-forward type hint for WeightConversionTask with a
direct class import: add WeightConversionTask to the module imports (import it
from megatron.bridge.models.conversion.model_bridge alongside the other bridge
imports) and change the annotated field from task: "WeightConversionTask" to
task: WeightConversionTask so the type is a real symbol rather than a
forward-reference string.

In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py`:
- Around line 331-332: Remove the leftover commented-out debug token "# try:"
above the subprocess.run call in test_glm5_conversion.py (the block where result
= subprocess.run(cmd, capture_output=True, text=True, cwd=repo_root) is
invoked); either delete the commented marker entirely or replace it with a
proper try/except around subprocess.run (handling exceptions and failing the
test) if you intended to catch errors — ensure the change updates the
surrounding test logic (the test function that invokes subprocess.run)
consistently and does not leave unexplained commented-out code.
- Around line 194-197: The class docstring for TestGLM5Conversion incorrectly
references "GLM 4.5"; update the docstring in the TestGLM5Conversion class to
mention "GLM 5" (or a neutral phrase like "GLM 5 bridge") so it accurately
describes the test coverage for the GLM 5 MoE model conversion; locate the
TestGLM5Conversion class and edit its triple-quoted docstring accordingly.
- Around line 335-338: Replace the use of "assert False" which is ignored under
optimized Python with explicit raises: in the failure branches that inspect
result.returncode (using the result variable and test_name) change the "assert
False, f'GLM 5 MoE {test_name} conversion failed with return code
{result.returncode}'" to "raise AssertionError(...)" so the failure always
raises regardless of optimization; make the same change for the similar
assertion at the other occurrence later in the file (the second block around
line 434).
- Line 293: Remove the long debug print statement that exceeds 119 chars: delete
the print(f"second_layer mlp experts: {second_layer.mlp.experts} and type:
{type(second_layer.mlp.experts)} and size: {total_size} and shapes:
{total_shapes}") in
tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py; if you need
to keep information for debugging, replace it with a concise logger.debug or a
focused assertion that checks specific properties of second_layer.mlp.experts,
total_size, or total_shapes and keep any kept message well under 119 characters.
- Line 404: The test function test_glm5_conversion_parallelism_local_model
declares an unused pytest fixture parameter tmp_path; either remove tmp_path
from the function signature or rename it to _tmp_path to silence the
unused-argument warning. Update the signature of
test_glm5_conversion_parallelism_local_model accordingly (and any direct
references if present) so the test uses only local_root_path, leaving other
parameters tp, pp, ep, test_name unchanged.

---

Nitpick comments:
In `@src/megatron/bridge/models/glm_moe_dsa/__init__.py`:
- Around line 21-24: The __all__ list is not alphabetized which triggers RUF022;
update the list in __init__.py so entries are sorted alphabetically, placing
"GLM5Bridge" before "GLM5ModelProvider" (i.e., change the order of the symbols
GLM5Bridge and GLM5ModelProvider in the __all__ declaration).

In `@src/megatron/bridge/models/glm/glm45_bridge.py`:
- Line 18: The import and type annotations in glm45_bridge.py use deprecated
typing generics; replace "from typing import Dict, Mapping, Optional, Tuple"
with "from collections.abc import Mapping" (only) and use built-in generics and
PEP 604 unions in annotations: change Dict[...] → dict[...], Tuple[...] →
tuple[...], Optional[T] → T | None, and ensure any Mapping stays typed as
collections.abc.Mapping; update all occurrences such as Dict[str, torch.Tensor]
→ dict[str, torch.Tensor], Optional[...] → ... | None, and Tuple[str, int,
Optional[str]] → tuple[str, int, str | None] (adjust other similar annotations
accordingly).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cbcf4a and d308a0b.

📒 Files selected for processing (9)
  • 3rdparty/Megatron-LM
  • src/megatron/bridge/models/__init__.py
  • src/megatron/bridge/models/conversion/param_mapping.py
  • src/megatron/bridge/models/glm/glm45_bridge.py
  • src/megatron/bridge/models/glm_moe_dsa/__init__.py
  • src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py
  • src/megatron/bridge/models/glm_moe_dsa/glm5_provider.py
  • tests/functional_tests/models/glm_moe_dsa/__init__.py
  • tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py

Comment on lines +1985 to +1995
@staticmethod
def _should_allow_hf_name_mismatch(hf_param: Union[str, Dict[str, str]]) -> bool:
"""Return True when HF params may be synthesized (e.g., expert slices)."""
def has_expert_wildcard(param_name: str) -> bool:
return ".experts.*." in param_name or ".experts.**." in param_name
def has_resolved_expert(param_name: str) -> bool:
return re.search(r"\.experts\.\d+\.", param_name) is not None

if isinstance(hf_param, str):
return has_expert_wildcard(hf_param) or has_resolved_expert(hf_param)
return any(has_expert_wildcard(name) or has_resolved_expert(name) for name in hf_param.values())

🛠️ Refactor suggestion | 🟠 Major

Apply type hints and use built-in generics per coding guidelines.

  1. The hf_param annotation uses Union[str, Dict[str, str]] — should be str | dict[str, str].
  2. The nested helpers has_expert_wildcard and has_resolved_expert lack type hints.
♻️ Proposed fix
     @staticmethod
-    def _should_allow_hf_name_mismatch(hf_param: Union[str, Dict[str, str]]) -> bool:
+    def _should_allow_hf_name_mismatch(hf_param: str | dict[str, str]) -> bool:
         """Return True when HF params may be synthesized (e.g., expert slices)."""
-        def has_expert_wildcard(param_name: str) -> bool:
+        def has_expert_wildcard(param_name: str) -> bool:  # noqa: E306
             return ".experts.*." in param_name or ".experts.**." in param_name
-        def has_resolved_expert(param_name: str) -> bool:
+        def has_resolved_expert(param_name: str) -> bool:
             return re.search(r"\.experts\.\d+\.", param_name) is not None
 
         if isinstance(hf_param, str):
             return has_expert_wildcard(hf_param) or has_resolved_expert(hf_param)
         return any(has_expert_wildcard(name) or has_resolved_expert(name) for name in hf_param.values())

As per coding guidelines: "Use type hints for function arguments and return types" and "Use built-in generics (list, dict, tuple) instead of typing equivalents" and "Use 'X | Y' for union types instead of 'Union[X, Y]'".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/conversion/param_mapping.py` around lines 1985 -
1995, Update the _should_allow_hf_name_mismatch signature to use built-in
generics and PEP 604 union syntax: change the parameter annotation from
Union[str, Dict[str, str]] to str | dict[str, str] and add the return type bool;
add type hints to the nested helpers has_expert_wildcard(param_name: str) ->
bool and has_resolved_expert(param_name: str) -> bool so all functions in the
_should_allow_hf_name_mismatch scope have explicit parameter and return type
annotations (keep logic unchanged).

import logging
import re
from functools import partial
from typing import Dict, Mapping, Optional, Tuple

🛠️ Refactor suggestion | 🟠 Major

Replace typing generics with built-in equivalents per coding guidelines.

Dict, Optional, and Tuple from typing should be dict, X | None, and tuple respectively (Python 3.10+).

♻️ Proposed fix
-from typing import Dict, Mapping, Optional, Tuple
+from collections.abc import Mapping
+from typing import Optional, Tuple  # remove once all usages are updated

Then update method signatures throughout the file, e.g.:

  • `-> Dict[str, torch.Tensor]` → `-> dict[str, torch.Tensor]`
  • `Optional[torch.Tensor]` → `torch.Tensor | None`
  • `Tuple[str, int, Optional[str]]` → `tuple[str, int, str | None]`

As per coding guidelines: "Use built-in generics (list, dict, tuple) instead of typing equivalents" and "Use 'T | None' for nullable types instead of 'Optional[T]'".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` at line 18, The import
and type hints use typing generics; change those to Python 3.10+ built-ins and
appropriate stdlib ABCs: remove Dict, Optional, Tuple (and Mapping) from the
typing import line in glm5_bridge.py and update all annotations—use dict[...]
instead of Dict[...], tuple[...] instead of Tuple[...], and use X | None instead
of Optional[X]; for Mapping use collections.abc.Mapping (import it) or replace
with dict if a concrete mapping is intended; update all function/method
signatures and return annotations (e.g., "-> Dict[str, torch.Tensor]" → "->
dict[str, torch.Tensor]", "Optional[torch.Tensor]" → "torch.Tensor | None",
"Tuple[str, int, Optional[str]]" → "tuple[str, int, str | None]") and remove the
now-unused typing imports.

Comment on lines +385 to +397
def maybe_modify_converted_hf_weight(
self,
task: "WeightConversionTask",
converted_weights_dict: Dict[str, torch.Tensor],
hf_state_dict: Mapping[str, torch.Tensor],
) -> Dict[str, torch.Tensor]:
"""Pack per-expert weights into GLM MoE tensors when exporting to HF."""
if not converted_weights_dict:
return converted_weights_dict

if not hasattr(self, "_hf_expert_cache"):
self._hf_expert_cache = {}
self._hf_expert_expected = {}

🛠️ Refactor suggestion | 🟠 Major

Instance state should be initialized in __init__, not lazily via hasattr.

_hf_expert_cache, _hf_expert_expected (lines 396-397) and _hf_config (line 147) are all created via hasattr guards scattered across methods. This violates the coding guideline "Initialize all externally visible members of a class in the constructor" and makes the lifecycle of these caches opaque. If maybe_modify_converted_hf_weight is called concurrently or build_conversion_tasks is skipped, the state is silently absent.

Add an __init__ to GLM5Bridge that initialises these to None / empty dicts.

♻️ Proposed fix
 class GLM5Bridge(MegatronModelBridge):
+    def __init__(self):
+        super().__init__()
+        self._hf_config = None
+        self._hf_expert_cache: dict = {}
+        self._hf_expert_expected: dict = {}
+

Then remove the if not hasattr(self, "_hf_expert_cache"): block at lines 395-397 and the if not hasattr(self, "_hf_config"): guard at line 241.

🧰 Tools
🪛 Flake8 (7.3.0)

[error] 387-387: undefined name 'WeightConversionTask'

(F821)

🪛 Ruff (0.15.2)

[error] 387-387: Undefined name WeightConversionTask

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` around lines 385 -
397, Add an explicit __init__ to GLM5Bridge that initializes the instance
attributes _hf_expert_cache, _hf_expert_expected and _hf_config (e.g., to empty
dicts or None as appropriate) so their lifecycle is deterministic; then remove
the lazy hasattr guards that create those attributes (the check in
maybe_modify_converted_hf_weight that sets _hf_expert_cache/_hf_expert_expected
and the guard that sets _hf_config elsewhere) and rely on the
constructor-initialized members instead, updating any usages to handle their
initialized (possibly empty/None) values.

Comment on lines +385 to +390
def maybe_modify_converted_hf_weight(
self,
task: "WeightConversionTask",
converted_weights_dict: Dict[str, torch.Tensor],
hf_state_dict: Mapping[str, torch.Tensor],
) -> Dict[str, torch.Tensor]:

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n "WeightConversionTask" --type py -C 3

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 37586


🏁 Script executed:

head -100 src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py | cat -n

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 5084


🏁 Script executed:

rg "from.*import" src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py | head -30

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 768


🏁 Script executed:

# Check if glm5_bridge.py has a TYPE_CHECKING block
grep -n "TYPE_CHECKING" src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check the full imports section of glm5_bridge.py to see if WeightConversionTask should be there
sed -n '1,50p' src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1793


Import WeightConversionTask or use a TYPE_CHECKING block to resolve the type annotation.

The method signature on line 387 uses task: "WeightConversionTask" as a string annotation, but WeightConversionTask is never imported. While Python evaluates string annotations lazily, type checkers cannot resolve the annotation without the type being in scope. This violates the coding guideline to use proper type hints. Compare with other bridge implementations (e.g., deepseek_v3_bridge.py, qwen3_vl_bridge.py) that import WeightConversionTask directly from megatron.bridge.models.conversion.model_bridge.

Add WeightConversionTask to the existing import on line 26, or import it conditionally within a TYPE_CHECKING block if circular imports are a concern.

🧰 Tools
🪛 Flake8 (7.3.0)

[error] 387-387: undefined name 'WeightConversionTask'

(F821)

🪛 Ruff (0.15.2)

[error] 387-387: Undefined name WeightConversionTask

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` around lines 385 -
390, The annotated type "WeightConversionTask" used in
maybe_modify_converted_hf_weight is not imported; add an import for
WeightConversionTask from megatron.bridge.models.conversion.model_bridge (or, if
circular imports are possible, guard it with `from typing import TYPE_CHECKING`
and import WeightConversionTask from
megatron.bridge.models.conversion.model_bridge inside the `if TYPE_CHECKING:`
block) so the annotation resolves for type checkers while leaving runtime
behavior unchanged.
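
The TYPE_CHECKING pattern suggested here has this general shape (the module path and class name below are placeholders standing in for the real ones):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for type checkers; this block is never executed at
    # runtime, so it cannot create an import cycle.
    from some.heavy.module import WeightConversionTask  # hypothetical path


def describe(task: "WeightConversionTask") -> str:
    # The string annotation is resolved lazily, so the name only needs
    # to exist for static analysis, not at runtime.
    return f"task={task!r}"
```

At runtime the annotation stays a plain string, so the function works even though the guarded import never ran.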

Comment on lines +414 to +428
def _should_pack_experts(base_key: str) -> bool:
try:
if base_key in hf_state_dict:
return True
if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
return True
except Exception:
pass
has_glob = getattr(hf_state_dict, "has_glob", None)
if callable(has_glob):
if has_glob(base_key):
return True
if base_key.endswith(".weight") and has_glob(base_key[: -len(".weight")]):
return True
return False

⚠️ Potential issue | 🟠 Major

except Exception: pass silently swallows all errors.

_should_pack_experts catches a broad Exception and suppresses it without logging (lines 420-421). This makes diagnosing failures during HF weight export very difficult. At minimum the exception should be logged.

♻️ Proposed fix
         try:
             if base_key in hf_state_dict:
                 return True
             if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
                 return True
-        except Exception:
-            pass
+        except Exception as e:
+            logger.debug("Exception while probing hf_state_dict for %s: %s", base_key, e)

The same pattern applies to lines 439-440 in _expected_experts. Based on coding guidelines: "When using try-except blocks, limit the except clause to the smallest set of errors possible."

🧰 Tools
🪛 Ruff (0.15.2)

[error] 420-421: try-except-pass detected, consider logging the exception

(S110)


[warning] 420-420: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` around lines 414 -
428, Replace the broad "except Exception: pass" in _should_pack_experts with a
narrow exception handler (e.g., catch AttributeError/TypeError) and log the
caught error instead of silencing it; specifically handle cases where accessing
hf_state_dict attributes or methods may fail, and include the error and base_key
in the log message. Apply the same change to the analogous try/except in
_expected_experts, and prefer checking for hasattr/getattr safely before calling
has_glob to avoid exceptions where possible.
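
The narrowed try/except with logging that this comment asks for might look like the following sketch (hf_state_dict here is any mapping-like object; the helper name and suffix handling are illustrative, not the bridge's actual code):

```python
import logging

logger = logging.getLogger(__name__)


def should_pack(base_key: str, hf_state_dict) -> bool:
    """Probe a state dict for base_key, catching only expected errors."""
    try:
        if base_key in hf_state_dict:
            return True
        stripped = base_key.removesuffix(".weight")
        if stripped != base_key and stripped in hf_state_dict:
            return True
    except TypeError as e:  # e.g. unhashable key; log instead of silencing
        logger.debug("Probe of hf_state_dict for %r failed: %s", base_key, e)
    return False
```

Unexpected errors (AttributeError from a malformed state dict, logic bugs) now propagate instead of being swallowed.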

Comment on lines +353 to +367
def _should_pack_experts(base_key: str) -> bool:
try:
if base_key in hf_state_dict:
return True
if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
return True
except Exception:
pass
has_glob = getattr(hf_state_dict, "has_glob", None)
if callable(has_glob):
if has_glob(base_key):
return True
if base_key.endswith(".weight") and has_glob(base_key[: -len(".weight")]):
return True
return False

🛠️ Refactor suggestion | 🟠 Major

Swallow-all except Exception: pass masks real failures in _should_pack_experts.

Lines 354–360 catch every possible exception silently, including bugs like attribute errors or type errors that should surface immediately. The outer try body only performs a dict membership test (in hf_state_dict) which should raise at most a TypeError if the key is unhashable.

🐛 Proposed fix
-        def _should_pack_experts(base_key: str) -> bool:
-            try:
-                if base_key in hf_state_dict:
-                    return True
-                if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
-                    return True
-            except Exception:
-                pass
+        def _should_pack_experts(base_key: str) -> bool:
+            try:
+                if base_key in hf_state_dict:
+                    return True
+                if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
+                    return True
+            except TypeError:
+                pass

As per coding guidelines: "When using try-except blocks, limit the except clause to the smallest set of errors possible".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _should_pack_experts(base_key: str) -> bool:
try:
if base_key in hf_state_dict:
return True
if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
return True
except Exception:
pass
has_glob = getattr(hf_state_dict, "has_glob", None)
if callable(has_glob):
if has_glob(base_key):
return True
if base_key.endswith(".weight") and has_glob(base_key[: -len(".weight")]):
return True
return False
def _should_pack_experts(base_key: str) -> bool:
try:
if base_key in hf_state_dict:
return True
if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
return True
except TypeError:
pass
has_glob = getattr(hf_state_dict, "has_glob", None)
if callable(has_glob):
if has_glob(base_key):
return True
if base_key.endswith(".weight") and has_glob(base_key[: -len(".weight")]):
return True
return False
🧰 Tools
🪛 Ruff (0.15.2)

[error] 359-360: try-except-pass detected, consider logging the exception

(S110)


[warning] 359-359: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm/glm45_bridge.py` around lines 353 - 367, The
helper _should_pack_experts swallows all exceptions around the hf_state_dict
membership checks, hiding real bugs; replace the broad "except Exception: pass"
with a narrow handler that only catches expected errors from the membership test
(e.g., TypeError for unhashable keys) and lets other exceptions propagate;
update the try/except around "if base_key in hf_state_dict" and the ".weight"
variant to catch TypeError (or ValueError if applicable) only, and re-raise any
other exception so unexpected AttributeError/TypeError logic bugs in
_should_pack_experts or hf_state_dict surface.

Comment on lines +373 to +379
try:
if base_key in hf_state_dict:
expected = int(hf_state_dict[base_key].shape[0])
elif base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
expected = int(hf_state_dict[base_key[: -len(".weight")]].shape[0])
except Exception:
expected = None

🛠️ Refactor suggestion | 🟠 Major

Broad except Exception in _expected_experts hides shape-access bugs.

The try body accesses .shape[0] on a tensor. The only expected exceptions are KeyError (key absent) and potentially IndexError (scalar tensor). Catching all Exception silently discards programming errors.

🐛 Proposed fix
-            try:
-                if base_key in hf_state_dict:
-                    expected = int(hf_state_dict[base_key].shape[0])
-                elif base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
-                    expected = int(hf_state_dict[base_key[: -len(".weight")]].shape[0])
-            except Exception:
-                expected = None
+            try:
+                if base_key in hf_state_dict:
+                    expected = int(hf_state_dict[base_key].shape[0])
+                elif base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
+                    expected = int(hf_state_dict[base_key[: -len(".weight")]].shape[0])
+            except (KeyError, IndexError, TypeError):
+                expected = None

As per coding guidelines: "When using try-except blocks, limit the except clause to the smallest set of errors possible".

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 378-378: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm/glm45_bridge.py` around lines 373 - 379, In
_expected_experts, replace the broad "except Exception" around the .shape[0]
access for base_key/hf_state_dict with a narrow except that only catches the
anticipated errors (e.g., KeyError, IndexError, AttributeError) so programming
errors are not swallowed; specifically, in the block that reads
hf_state_dict[base_key].shape[0] (and the .weight fallback using
base_key[:-len(".weight")]) catch only those exceptions and set expected = None
in those cases.

Comment on lines +407 to +425
cache = self._hf_expert_cache.setdefault(base_key, {})
if kind is None:
cache[expert_id] = tensor
if len(cache) == expected:
stacked = torch.stack([cache[i] for i in range(expected)], dim=0)
output[output_base_key] = stacked
del self._hf_expert_cache[base_key]
else:
per_kind = cache.setdefault(expert_id, {})
per_kind[kind] = tensor
if all(
(i in cache and "gate" in cache[i] and "up" in cache[i]) for i in range(expected)
):
stacked = torch.stack(
[torch.cat([cache[i]["gate"], cache[i]["up"]], dim=0) for i in range(expected)],
dim=0,
)
output[output_base_key] = stacked
del self._hf_expert_cache[base_key]

⚠️ Potential issue | 🟠 Major

Expert packing assumes contiguous global expert IDs starting at 0 — breaks under expert parallelism.

When packing per-expert slices back into the fused [num_experts, ...] tensor at lines 411 and 420–422, the code iterates range(expected) and looks up cache[i]. Under expert parallelism (EP > 1), the Megatron→HF conversion task on a given rank will only produce weights for the local expert shard, whose global expert IDs are a non-zero-based contiguous range (e.g., experts 64–127 on rank 1). The local cache will contain those global IDs as keys, so cache[i] for i in range(expected) will raise a KeyError or silently assemble the wrong tensor.

The fix is to either (a) reassemble only when all expected global expert IDs are present and use the sorted keys, or (b) document that this path is only safe for single-rank EP=1 conversion and add an assertion.

As per coding guidelines: "Handle tensor parallel and pipeline parallel distribution correctly in weight conversion".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm/glm45_bridge.py` around lines 407 - 425, The
assembly logic in glm45_bridge.py (using self._hf_expert_cache, base_key,
expert_id, expected, output_base_key) assumes cached keys are 0..expected-1 and
so does torch.stack([cache[i] for i in range(expected)]), which breaks under
expert-parallel shards whose global expert IDs are a non-zero-based range;
update the assembly to (a) check that all expected global expert IDs are present
in cache (e.g., compare set(cache.keys()) to the expected global ID set), then
sort the actual global IDs and stack using that sorted order so tensors align to
their global expert index, or (b) if you intend to support only EP==1, add an
explicit assertion that cached expert IDs are contiguous and start at 0 and fail
fast; apply the same fix for both the single-tensor path and the per-kind
("gate"/"up") path that currently uses cache[i].
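
Option (a) above can be sketched with plain lists standing in for tensors (the real code would use torch.stack; expected_ids as the rank's set of global expert IDs is my assumption about how the expected count would be generalized):

```python
def try_assemble(cache: dict[int, list[float]], expected_ids: set[int]):
    """Stack cached per-expert slices once every expected global ID is present.

    Works for non-zero-based ID ranges (e.g. experts 64..127 on EP rank 1)
    by sorting the actual keys instead of assuming range(num_experts).
    """
    if set(cache.keys()) != expected_ids:
        return None  # still waiting for more expert slices
    # Sort by global expert ID so row i of the result is the i-th expert
    # of this shard, regardless of the shard's starting offset.
    return [cache[eid] for eid in sorted(cache)]
```

Returning None until the full ID set arrives also makes the "partially filled cache" state explicit rather than relying on len(cache) == expected.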

Comment on lines +119 to +152
def _create_glm5_toy_model(model_dir: Path) -> None:
model_dir.mkdir(parents=True, exist_ok=True)

# Create GLM 4.5 config from the toy model config using AutoConfig
config = AutoConfig.from_pretrained("zai-org/GLM-5")

# Override with toy model config
for key, value in HF_GLM5_TOY_MODEL_CONFIG.items():
setattr(config, key, value)

config.torch_dtype = torch.bfloat16 # Explicitly set the torch_dtype in config

# Create model with random weights and convert to bfloat16
from transformers import GlmMoeDsaForCausalLM

model = GlmMoeDsaForCausalLM(config)

model = model.bfloat16() # Use .bfloat16() method instead of .to()
for k, v in model.named_buffers():
if "e_score_correction_bias" in k:
v.data = v.data.to(torch.float32)

# Download and save tokenizer from a reference GLM model
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-5")
tokenizer.save_pretrained(model_dir)

# Save model and config to directory
model.save_pretrained(model_dir, safe_serialization=True)

# Also save config.json explicitly to ensure compatibility with correct torch_dtype
config_to_save = HF_GLM5_TOY_MODEL_CONFIG.copy()
config_path = model_dir / "config.json"
with open(config_path, "w") as f:
json.dump(config_to_save, f, indent=2)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "test_glm5_conversion.py" -type f

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 138


🏁 Script executed:

cat -n tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py | sed -n '119,152p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1652


🏁 Script executed:

rg "GlmMoeDsaConfig|AutoConfig\.for_model" --type py -A 2 -B 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 604


🏁 Script executed:

rg "from transformers import" tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py -A 5

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 717


🏁 Script executed:

rg "save_pretrained|tokenizer" tests/functional_tests/models/olmoe/test_olmoe_conversion.py -B 2 -A 2

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1484


🏁 Script executed:

rg "class.*Config|PretrainedConfig" tests/functional_tests/ -t py | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 166


🏁 Script executed:

rg "PretrainedConfig|from_config" tests/functional_tests/ -t py | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 490


🏁 Script executed:

grep -r "tokenizer_config.json" tests/functional_tests/ --include="*.py" -B 3 -A 3

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 30162


🏁 Script executed:

rg "vocab_size|vocab\\.json|merges\\.txt" tests/functional_tests/ --include="*.py" -B 2 -A 2 | head -30

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 512


🏁 Script executed:

cat -n tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py | sed -n '1,30p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1229


🏁 Script executed:

cat -n tests/functional_tests/models/llama_nemotron/test_llama_nemotron_conversion.py | sed -n '50,100p'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2448


Use local config construction and tokenizer stub to eliminate network dependency.

Lines 123 and 142 call AutoConfig.from_pretrained("zai-org/GLM-5") and AutoTokenizer.from_pretrained("zai-org/GLM-5"), creating a non-hermetic test fixture that fails in offline CI environments, when the HuggingFace Hub is unreachable, or when credentials are required.

Replace with:

  • AutoConfig.for_model("glm_moe_dsa", **HF_GLM5_TOY_MODEL_CONFIG) to construct config directly from the toy model config (pattern already used in tests/functional_tests/models/olmoe/test_olmoe_conversion.py)
  • Create a minimal tokenizer_config.json locally instead of downloading (pattern already used in several test files like llama_nemotron and qwen_vl)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` around
lines 119 - 152, The test fixture _create_glm5_toy_model uses
AutoConfig.from_pretrained("zai-org/GLM-5") and
AutoTokenizer.from_pretrained("zai-org/GLM-5") which creates a network
dependency; replace these with a hermetic local setup by constructing the config
with AutoConfig.for_model("glm_moe_dsa", **HF_GLM5_TOY_MODEL_CONFIG) (so the
test uses HF_GLM5_TOY_MODEL_CONFIG directly and still sets
config.torch_dtype=torch.bfloat16) and stop calling
AutoTokenizer.from_pretrained; instead create a minimal tokenizer
stub/tokenizer_config.json in model_dir (matching patterns used in other tests)
and save that file before calling model.save_pretrained, ensuring
tokenizer.save_pretrained is not used to fetch from HF Hub and the local
config.json is written from HF_GLM5_TOY_MODEL_CONFIG as already done.
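
A minimal local tokenizer stub of the kind suggested here could be written as follows (the exact fields GLM's tokenizer needs are an assumption; a real test should mirror the existing llama_nemotron/qwen_vl fixtures):

```python
import json
from pathlib import Path


def write_tokenizer_stub(model_dir: Path) -> Path:
    """Write a minimal tokenizer_config.json so tests never hit the HF Hub."""
    model_dir.mkdir(parents=True, exist_ok=True)
    config = {
        "tokenizer_class": "PreTrainedTokenizerFast",  # assumed placeholder
        "model_max_length": 2048,
    }
    path = model_dir / "tokenizer_config.json"
    path.write_text(json.dumps(config, indent=2))
    return path
```

Calling this before model.save_pretrained keeps the fixture fully hermetic, since nothing is downloaded from "zai-org/GLM-5".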

Comment on lines +273 to +278
model = GlmMoeDsaForCausalLM.from_pretrained(
glm5_toy_model_path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=False, # Ensure full loading
trust_remote_code=True,
)

⚠️ Potential issue | 🟠 Major

trust_remote_code=True is a security risk in tests.

Loading a model with trust_remote_code=True executes arbitrary code from the HuggingFace Hub. Since the toy model is created locally from a known class (GlmMoeDsaForCausalLM), the flag is unnecessary here.

🐛 Proposed fix
         model = GlmMoeDsaForCausalLM.from_pretrained(
             glm5_toy_model_path,
             torch_dtype=torch.bfloat16,
             low_cpu_mem_usage=False,
-            trust_remote_code=True,
         )
📝 Committable suggestion


Suggested change
model = GlmMoeDsaForCausalLM.from_pretrained(
glm5_toy_model_path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=False, # Ensure full loading
trust_remote_code=True,
)
model = GlmMoeDsaForCausalLM.from_pretrained(
glm5_toy_model_path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=False, # Ensure full loading
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` around
lines 273 - 278, The test currently calls
GlmMoeDsaForCausalLM.from_pretrained(..., trust_remote_code=True); remove the
security risk by deleting the trust_remote_code argument or setting
trust_remote_code=False when loading from the local glm5_toy_model_path in
test_glm5_conversion.py so the known local class is used without executing
remote code.

@pengdurice pengdurice changed the title [Draft]Add GLM5 support → Add GLM5 support Feb 25, 2026
pengdurice and others added 10 commits February 26, 2026 16:11
… tests and cleaning

Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
)

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Co-authored-by: maanug-nv <109391026+maanug-nv@users.noreply.github.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
@pengdurice
Copy link
Author

@malay-nagda hi, nice to e-meet you. Do you mind reviewing this PR? thanks! This is to add GLM5 support.

@wuyaoxuehun

@malay-nagda hi, nice to e-meet you. Do you mind reviewing this PR? thanks! This is to add GLM5 support.

@HollowMan6 @malay-nagda
+1

@malay-nagda
Collaborator

@malay-nagda hi, nice to e-meet you. Do you mind reviewing this PR? thanks! This is to add GLM5 support.

@HollowMan6 @malay-nagda +1

Hi. For base recipes, I think @yaoyu-33 can help review?

@pengdurice
Author

@malay-nagda hi, nice to e-meet you. Do you mind reviewing this PR? thanks! This is to add GLM5 support.

@HollowMan6 @malay-nagda +1

Hi. For base recipes, I think @yaoyu-33 can help review?

Hi @malay-nagda thank you for the reply. @yaoyu-33 , nice to meet you, do you have time for reviewing this PR? Thanks!

@pengdurice
Author

@yaoyu-33, a quick reminder about this PR, thank you! Have a nice week ahead!

Contributor

@HollowMan6 HollowMan6 left a comment

I checked this PR, and it looks like the current implementation doesn't actually make GLM5 use DSA; it wrongly falls back to MLA instead:

[screenshot]

After applying all the fixes I propose (Suggested changes) below, it seems to work fine (tested with LoRA enabled):

[screenshot]

Note that it also needs dependency NVIDIA/Megatron-LM#3026, and for now mcore doesn't support CP and thd format for DSA.

[screenshots]

@HollowMan6
Contributor

We could probably also integrate the tilelang fused kernels for DSA (https://github.com/THUDM/slime/tree/14413cf7c5c0d1c9854a28f471d3ab575461aa51/slime_plugins/models/glm5) to speed up training.

Note that it also needs dependency NVIDIA/Megatron-LM#3026, and for now mcore doesn't support CP and thd format for DSA.

I've now opened NVIDIA/Megatron-LM#3674 to integrate the tilelang fused kernels for DSA, as well as to support CP and the thd format. Please feel free to try it out and let me know if you find any issues.

pengdurice and others added 5 commits March 3, 2026 13:38
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Contributor

@HollowMan6 HollowMan6 left a comment

Enable and fix MTP mappings support

pengdurice and others added 3 commits March 3, 2026 14:57
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
@wuyaoxuehun

@pengdurice @HollowMan6 hi, is this PR ready to merge then?
Thanks a lot!

@HollowMan6
Contributor

@wuyaoxuehun I haven't tested the convergence yet, just using a dummy model for now due to constrained resources on my side. Other than that, it looks good to me.

Contributor

@HollowMan6 HollowMan6 left a comment

Some more optional changes so that we don't need to further override build_adapter_conversion_tasks (need to wait until #2644 is merged)

dsa_indexer_topk: int = 2048
dsa_indexer_loss_coeff: float = 0.001
dsa_indexer_use_sparse_loss: bool = True
mtp_num_layers: int = 1
Contributor

Maybe don't pass this param here, so that users can decide whether to enable MTP.

Suggested change
mtp_num_layers: int = 1

"dsa_indexer_loss_coeff": 0.001,
"dsa_indexer_use_sparse_loss": True,
# MTP params
"mtp_num_layers": hf_config.num_nextn_predict_layers,
Contributor

Maybe don't pass this param here, so that users can decide whether to enable MTP.

Suggested change
"mtp_num_layers": hf_config.num_nextn_predict_layers,
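If the param is dropped from the provider defaults as suggested, one way to keep MTP opt-in is an optional field that only consults the HF value when the user explicitly enables MTP. A minimal sketch, assuming hypothetical names (`Glm5ProviderConfig` and `resolve_mtp_num_layers` are illustrative, not the PR's actual classes):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Glm5ProviderConfig:  # hypothetical name, not the PR's actual provider
    dsa_indexer_topk: int = 2048
    dsa_indexer_loss_coeff: float = 0.001
    dsa_indexer_use_sparse_loss: bool = True
    # None means "user did not opt in"; MTP stays disabled by default.
    mtp_num_layers: Optional[int] = None


def resolve_mtp_num_layers(cfg: Glm5ProviderConfig, hf_num_nextn_predict_layers: int) -> int:
    """Use the HF checkpoint's value only when the user explicitly enabled MTP."""
    if cfg.mtp_num_layers is None:
        return 0  # MTP disabled unless requested
    # Clamp to what the checkpoint actually provides.
    return min(cfg.mtp_num_layers, hf_num_nextn_predict_layers)


print(resolve_mtp_num_layers(Glm5ProviderConfig(), 1))                   # 0: off by default
print(resolve_mtp_num_layers(Glm5ProviderConfig(mtp_num_layers=1), 1))   # 1: user opted in
```

With a default of `None`, deleting the `"mtp_num_layers": hf_config.num_nextn_predict_layers` line from the config translation no longer silently turns MTP on for every user.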

Comment on lines +145 to +150
def build_conversion_tasks(self, hf_pretrained, megatron_model):
"""Override to store config before mapping_registry is called."""
# Store config on instance for use in mapping_registry
self._hf_config = hf_pretrained.config
return super().build_conversion_tasks(hf_pretrained, megatron_model)

Contributor

@HollowMan6 HollowMan6 Mar 4, 2026

need to wait until #2644 is merged

Suggested change
def build_conversion_tasks(self, hf_pretrained, megatron_model):
"""Override to store config before mapping_registry is called."""
# Store config on instance for use in mapping_registry
self._hf_config = hf_pretrained.config
return super().build_conversion_tasks(hf_pretrained, megatron_model)

Comment on lines +241 to +245
# optionally add MTP mappings
if not hasattr(self, "_hf_config"):
logger.warning("No HF config found, skipping MTP mappings.")
return MegatronMappingRegistry(*mapping_list)
hf_config = self._hf_config
Contributor

@HollowMan6 HollowMan6 Mar 4, 2026

need to wait until #2644 is merged

Suggested change
# optionally add MTP mappings
if not hasattr(self, "_hf_config"):
logger.warning("No HF config found, skipping MTP mappings.")
return MegatronMappingRegistry(*mapping_list)
hf_config = self._hf_config
hf_config = self.hf_config

Comment on lines +249 to +279
for layer_prefix in ("transformer_layer", "mtp_model_layer"):
for megatron_param, hf_param in (param_mappings | layer_specific_mappings).items():
megatron_param = (
megatron_param.replace(".*", f".*.{layer_prefix}")
.replace("decoder", "mtp")
.replace(".*", f".{mtp_layer}")
)
hf_param = hf_param.replace("layers.*", f"layers.{mtp_layer + num_transformer_layers}")
mapping_list.append(AutoMapping(megatron_param=megatron_param, hf_param=hf_param))

# MTP specific mappings
mapping_list.extend(
[
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.enorm.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.enorm.weight",
),
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.hnorm.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.hnorm.weight",
),
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.eh_proj.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.eh_proj.weight",
),
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.final_layernorm.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.shared_head.norm.weight",
),
]
)
Contributor

Move mappings that don't use layer_prefix out of the loop

Suggested change
for layer_prefix in ("transformer_layer", "mtp_model_layer"):
for megatron_param, hf_param in (param_mappings | layer_specific_mappings).items():
megatron_param = (
megatron_param.replace(".*", f".*.{layer_prefix}")
.replace("decoder", "mtp")
.replace(".*", f".{mtp_layer}")
)
hf_param = hf_param.replace("layers.*", f"layers.{mtp_layer + num_transformer_layers}")
mapping_list.append(AutoMapping(megatron_param=megatron_param, hf_param=hf_param))
# MTP specific mappings
mapping_list.extend(
[
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.enorm.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.enorm.weight",
),
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.hnorm.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.hnorm.weight",
),
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.eh_proj.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.eh_proj.weight",
),
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.final_layernorm.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.shared_head.norm.weight",
),
]
)
# MTP specific mappings
mapping_list.extend(
[
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.enorm.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.enorm.weight",
),
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.hnorm.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.hnorm.weight",
),
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.eh_proj.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.eh_proj.weight",
),
AutoMapping(
megatron_param=f"mtp.layers.{mtp_layer}.final_layernorm.weight",
hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.shared_head.norm.weight",
),
]
)
for layer_prefix in ("transformer_layer", "mtp_model_layer"):
for megatron_param, hf_param in (param_mappings | layer_specific_mappings).items():
megatron_param = (
megatron_param.replace(".*", f".*.{layer_prefix}")
.replace("decoder", "mtp")
.replace(".*", f".{mtp_layer}")
)
hf_param = hf_param.replace("layers.*", f"layers.{mtp_layer + num_transformer_layers}")
mapping_list.append(AutoMapping(megatron_param=megatron_param, hf_param=hf_param))
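For readers following the rename chain in the loop, here is a standalone trace of how one decoder mapping is rewritten for MTP layer 0. The wildcard patterns and the layer count are illustrative placeholders, not taken from the PR's actual mapping tables:

```python
mtp_layer = 0
num_transformer_layers = 46  # illustrative; read from the real HF config in practice
layer_prefix = "transformer_layer"

# Illustrative wildcard patterns in the style of the mapping tables.
megatron_param = "decoder.layers.*.self_attention.linear_qkv.weight"
hf_param = "model.layers.*.input_layernorm.weight"  # hypothetical HF-side name

# Same rewrite chain as the suggested change: insert the sub-layer prefix,
# retarget "decoder" to "mtp", then pin the wildcard to a concrete layer index.
megatron_param = (
    megatron_param.replace(".*", f".*.{layer_prefix}")
    .replace("decoder", "mtp")
    .replace(".*", f".{mtp_layer}")
)
# HF numbers MTP layers after the regular transformer layers.
hf_param = hf_param.replace("layers.*", f"layers.{mtp_layer + num_transformer_layers}")

print(megatron_param)  # mtp.layers.0.transformer_layer.self_attention.linear_qkv.weight
print(hf_param)        # model.layers.46.input_layernorm.weight
```

Note the order matters: the first `replace(".*", ...)` adds the sub-layer prefix while the pattern still contains a wildcard, and only the final `replace` pins that wildcard to the concrete MTP layer index.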

@pengdurice
Author

@wuyaoxuehun I haven't tested the convergence yet, just using a dummy model for now due to constrained resources on my side. Other than that, it looks good to me.

@HollowMan6 thank you so much for your contributions to this PR. Due to resource constraints, I was only able to run local unit tests; I am doing some more testing on my side as well. Hope we get to a workable stage soon!

6 participants