Conversation
HollowMan6 left a comment
Probably we can also integrate the tilelang fused kernels for DSA (https://github.com/THUDM/slime/tree/14413cf7c5c0d1c9854a28f471d3ab575461aa51/slime_plugins/models/glm5) to accelerate training speed.
that's a great idea, let me add it in the next PR ;-)

@HollowMan6 would you mind reviewing this PR when you get a chance? thanks! edit: rebasing caused some 3rd-party module changes to be wrongly picked up, will need to fix them first, converting back to draft
Force-pushed d0bc00a to b19f84c
📝 Walkthrough

Adds GLM-5 MoE import/export support: a GLM5Bridge and GLM5ModelProvider, MoE-aware weight slicing/packing and mapping logic, updates to gated MLP mapping, functional tests for GLM-5 conversion, and a Megatron-LM submodule pointer update.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User
    participant Bridge as GLM5Bridge
    participant Config as ConfigBuilder
    participant Registry as MappingRegistry
    participant Weights as WeightHandler
    participant Target as MegatronModel
    User->>Bridge: start HF→Megatron conversion
    Bridge->>Config: _get_glm5_configs(hf_pretrained)
    Config-->>Bridge: provider_config
    Bridge->>Bridge: provider_bridge() -> GLM5ModelProvider
    Bridge->>Registry: mapping_registry()
    Registry-->>Bridge: param mappings
    Bridge->>Weights: maybe_modify_loaded_hf_weight(hf_param, hf_state_dict)
    Weights-->>Bridge: sliced per-expert tensors
    Bridge->>Target: apply mapped weights and provider config
    Target-->>User: converted Megatron-Core model
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Force-pushed b19f84c to 6cafd46
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🟡 Minor comments (8)
tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py-331-332 (1)
331-332: ⚠️ Potential issue | 🟡 Minor

Commented-out `# try:` is a leftover debug artifact.

As per coding guidelines: "If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it as debug code before merging."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` around lines 331 - 332, Remove the leftover commented-out debug token "# try:" above the subprocess.run call in test_glm5_conversion.py (the block where result = subprocess.run(cmd, capture_output=True, text=True, cwd=repo_root) is invoked); either delete the commented marker entirely or replace it with a proper try/except around subprocess.run (handling exceptions and failing the test) if you intended to catch errors — ensure the change updates the surrounding test logic (the test function that invokes subprocess.run) consistently and does not leave unexplained commented-out code.

tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py-194-197 (1)
194-197: ⚠️ Potential issue | 🟡 Minor

Docstring mentions "GLM 4.5" but this is the GLM 5 bridge test class.
🐛 Proposed fix
```diff
-    """
-    Test GLM 4.5 MoE model conversion from local HuggingFace model with different parallelism configurations.
-    """
+    """Test GLM 5 MoE model conversion from local HuggingFace model with different parallelism configurations."""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` around lines 194 - 197, The class docstring for TestGLM5Conversion incorrectly references "GLM 4.5"; update the docstring in the TestGLM5Conversion class to mention "GLM 5" (or a neutral phrase like "GLM 5 bridge") so it accurately describes the test coverage for the GLM 5 MoE model conversion; locate the TestGLM5Conversion class and edit its triple-quoted docstring accordingly.

tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py-335-338 (1)
335-338: ⚠️ Potential issue | 🟡 Minor

`assert False` is stripped under `python -O`; use `raise AssertionError` (B011). This pattern applies to both lines 338 and 434.
🐛 Proposed fix (applies similarly to line 434)
```diff
-        assert False, f"GLM 5 MoE {test_name} conversion failed with return code {result.returncode}"
+        raise AssertionError(f"GLM 5 MoE {test_name} conversion failed with return code {result.returncode}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` around lines 335 - 338, Replace the use of "assert False" which is ignored under optimized Python with explicit raises: in the failure branches that inspect result.returncode (using the result variable and test_name) change the "assert False, f'GLM 5 MoE {test_name} conversion failed with return code {result.returncode}'" to "raise AssertionError(...)" so the failure always raises regardless of optimization; make the same change for the similar assertion at the other occurrence later in the file (the second block around line 434).

tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py-293-293 (1)
293-293: ⚠️ Potential issue | 🟡 Minor

Line exceeds the 119-character maximum.

The `print(...)` at line 293 is a very long line and also contains debug output that should be removed before merging.

As per coding guidelines: "Maximum line length is 119 characters (matching ruff configuration)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` at line 293, Remove the long debug print statement that exceeds 119 chars: delete the print(f"second_layer mlp experts: {second_layer.mlp.experts} and type: {type(second_layer.mlp.experts)} and size: {total_size} and shapes: {total_shapes}") in tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py; if you need to keep information for debugging, replace it with a concise logger.debug or a focused assertion that checks specific properties of second_layer.mlp.experts, total_size, or total_shapes and keep any kept message well under 119 characters.

tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py-404-404 (1)
404-404: ⚠️ Potential issue | 🟡 Minor

Unused method argument `tmp_path` (ARG002).

`test_glm5_conversion_parallelism_local_model` declares `tmp_path` but never uses it; all output goes under `local_root_path`.

♻️ Proposed fix

```diff
-    def test_glm5_conversion_parallelism_local_model(self, tmp_path, tp, pp, ep, test_name):
+    def test_glm5_conversion_parallelism_local_model(self, tp, pp, ep, test_name):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` at line 404, The test function test_glm5_conversion_parallelism_local_model declares an unused pytest fixture parameter tmp_path; either remove tmp_path from the function signature or rename it to _tmp_path to silence the unused-argument warning. Update the signature of test_glm5_conversion_parallelism_local_model accordingly (and any direct references if present) so the test uses only local_root_path, leaving other parameters tp, pp, ep, test_name unchanged.

src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py-108-142 (1)
108-142: ⚠️ Potential issue | 🟡 Minor

`make_vocab_size_divisible_by` is set twice in `provider_bridge`.

Line 116 sets it in the `configs` dict, and line 141 sets it again as a direct attribute on the constructed `provider`. The second assignment is redundant.

🐛 Proposed fix

```diff
     configs["make_vocab_size_divisible_by"] = 1280
     configs["moe_router_score_function"] = "sigmoid"
     ...
-    provider.make_vocab_size_divisible_by = 1280
```

Keep only the constructor-time assignment (`configs["make_vocab_size_divisible_by"] = 1280`) since the provider is constructed at line 122 with `**configs`.
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` around lines 108 - 142, The provider_bridge function sets make_vocab_size_divisible_by twice (configs["make_vocab_size_divisible_by"] and provider.make_vocab_size_divisible_by) which is redundant; remove the later direct attribute assignment on the provider (the provider.make_vocab_size_divisible_by = 1280 line) and keep the constructor-time setting in configs so GLM5ModelProvider(**configs) receives the value once.

src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py-51-58 (1)
51-58: ⚠️ Potential issue | 🟡 Minor

Docstring contains two inaccuracies.

- Line 52: `Glm5MoeForCausalLM` — the registered source class (line 47) is `GlmMoeDsaForCausalLM`.
- Line 57: The example shows `zai-org/GLM-4.5` instead of a GLM-5 model path.

🐛 Proposed fix

```diff
-    This bridge handles the conversion between HuggingFace Glm5MoeForCausalLM
-    (used for GLM 5 models) and Megatron-Core GPTModel formats.
+    This bridge handles the conversion between HuggingFace GlmMoeDsaForCausalLM
+    (used for GLM 5 models) and Megatron-Core GPTModel formats.

     Example:
         >>> from megatron.bridge import AutoBridge
-        >>> bridge = AutoBridge.from_hf_pretrained("zai-org/GLM-4.5")
+        >>> bridge = AutoBridge.from_hf_pretrained("zai-org/GLM-5")
         >>> provider = bridge.to_megatron_provider()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` around lines 51 - 58, Update the docstring to correct two inaccuracies: replace the incorrect source class name "Glm5MoeForCausalLM" with the registered class name GlmMoeDsaForCausalLM, and change the example model path from "zai-org/GLM-4.5" to a GLM-5 example such as "zai-org/GLM-5" in the AutoBridge.from_hf_pretrained usage so the example and description correctly reflect GLM-5 and the GlmMoeDsaForCausalLM source class (references: GlmMoeDsaForCausalLM, AutoBridge.from_hf_pretrained, to_megatron_provider).

src/megatron/bridge/models/glm/glm45_bridge.py-326-326 (1)
326-326: ⚠️ Potential issue | 🟡 Minor

Import `WeightConversionTask` directly instead of using a forward reference string.

Line 326 uses a string forward reference `task: "WeightConversionTask"`, but this class should be imported directly from `megatron.bridge.models.conversion.model_bridge` like other bridge implementations (`gpt_oss_bridge.py`, `deepseek_v3_bridge.py`, `qwen3_vl_bridge.py`). Update the import on line 26 to include `WeightConversionTask` and remove the quotes from the type hint.
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/glm/glm45_bridge.py` at line 326, Replace the string-forward type hint for WeightConversionTask with a direct class import: add WeightConversionTask to the module imports (import it from megatron.bridge.models.conversion.model_bridge alongside the other bridge imports) and change the annotated field from task: "WeightConversionTask" to task: WeightConversionTask so the type is a real symbol rather than a forward-reference string.
🧹 Nitpick comments (2)
src/megatron/bridge/models/glm_moe_dsa/__init__.py (1)
21-24: Sort `__all__` alphabetically.

Ruff (RUF022) flags the unsorted order; `"GLM5Bridge"` should precede `"GLM5ModelProvider"`.

♻️ Proposed fix

```diff
 __all__ = [
-    "GLM5ModelProvider",
     "GLM5Bridge",
+    "GLM5ModelProvider",
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/glm_moe_dsa/__init__.py` around lines 21 - 24, The __all__ list is not alphabetized which triggers RUF022; update the list in __init__.py so entries are sorted alphabetically, placing "GLM5Bridge" before "GLM5ModelProvider" (i.e., change the order of the symbols GLM5Bridge and GLM5ModelProvider in the __all__ declaration).

src/megatron/bridge/models/glm/glm45_bridge.py (1)
18-18: Replace deprecated `typing` generics with Python 3.10+ built-ins.

`Dict`, `Optional`, and `Tuple` should be replaced with their built-in equivalents; `Mapping` should come from `collections.abc`.

♻️ Proposed fix

```diff
-from typing import Dict, Mapping, Optional, Tuple
+from collections.abc import Mapping
```

Then update all usages in the changed lines:

- `Dict[str, torch.Tensor]` → `dict[str, torch.Tensor]`
- `Optional[...]` → `... | None`
- `Tuple[str, int, Optional[str]]` → `tuple[str, int, str | None]`

As per coding guidelines: "Use built-in generics (list, dict, tuple) instead of typing equivalents", "Use 'T | None' for nullable types instead of 'Optional[T]'".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/glm/glm45_bridge.py` at line 18, The import and type annotations in glm45_bridge.py use deprecated typing generics; replace "from typing import Dict, Mapping, Optional, Tuple" with "from collections.abc import Mapping" (only) and use built-in generics and PEP 604 unions in annotations: change Dict[...] → dict[...], Tuple[...] → tuple[...], Optional[T] → T | None, and ensure any Mapping stays typed as collections.abc.Mapping; update all occurrences such as Dict[str, torch.Tensor] → dict[str, torch.Tensor], Optional[...] → ... | None, and Tuple[str, int, Optional[str]] → tuple[str, int, str | None] (adjust other similar annotations accordingly).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/models/conversion/param_mapping.py`:
- Around line 1985-1995: Update the _should_allow_hf_name_mismatch signature to
use built-in generics and PEP 604 union syntax: change the parameter annotation
from Union[str, Dict[str, str]] to str | dict[str, str] and add the return type
bool; add type hints to the nested helpers has_expert_wildcard(param_name: str)
-> bool and has_resolved_expert(param_name: str) -> bool so all functions in the
_should_allow_hf_name_mismatch scope have explicit parameter and return type
annotations (keep logic unchanged).
In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py`:
- Around line 385-397: Add an explicit __init__ to GLM5Bridge that initializes
the instance attributes _hf_expert_cache, _hf_expert_expected and _hf_config
(e.g., to empty dicts or None as appropriate) so their lifecycle is
deterministic; then remove the lazy hasattr guards that create those attributes
(the check in maybe_modify_converted_hf_weight that sets
_hf_expert_cache/_hf_expert_expected and the guard that sets _hf_config
elsewhere) and rely on the constructor-initialized members instead, updating any
usages to handle their initialized (possibly empty/None) values.
- Line 18: The import and type hints use typing generics; change those to Python
3.10+ built-ins and appropriate stdlib ABCs: remove Dict, Optional, Tuple (and
Mapping) from the typing import line in glm5_bridge.py and update all
annotations—use dict[...] instead of Dict[...], tuple[...] instead of
Tuple[...], and use X | None instead of Optional[X]; for Mapping use
collections.abc.Mapping (import it) or replace with dict if a concrete mapping
is intended; update all function/method signatures and return annotations (e.g.,
"-> Dict[str, torch.Tensor]" → "-> dict[str, torch.Tensor]",
"Optional[torch.Tensor]" → "torch.Tensor | None", "Tuple[str, int,
Optional[str]]" → "tuple[str, int, str | None]") and remove the now-unused
typing imports.
- Around line 414-428: Replace the broad "except Exception: pass" in
_should_pack_experts with a narrow exception handler (e.g., catch
AttributeError/TypeError) and log the caught error instead of silencing it;
specifically handle cases where accessing hf_state_dict attributes or methods
may fail, and include the error and base_key in the log message. Apply the same
change to the analogous try/except in _expected_experts, and prefer checking for
hasattr/getattr safely before calling has_glob to avoid exceptions where
possible.
- Around line 385-390: The annotated type "WeightConversionTask" used in
maybe_modify_converted_hf_weight is not imported; add an import for
WeightConversionTask from megatron.bridge.models.conversion.model_bridge (or, if
circular imports are possible, import it inside a TYPE_CHECKING block from
typing import TYPE_CHECKING and then from
megatron.bridge.models.conversion.model_bridge import WeightConversionTask
within that block) so the annotation resolves for type checkers while leaving
runtime behavior unchanged.
In `@src/megatron/bridge/models/glm_moe_dsa/glm5_provider.py`:
- Around line 24-32: Override GLM5ModelProvider.__post_init__ to suppress the
inherited deprecation message: implement a __post_init__ in GLM5ModelProvider
that temporarily replaces DeepSeekV3ModelProvider._warn_deprecated with a no-op,
calls super().__post_init__ (so any other base init runs), and then restores the
original _warn_deprecated. This prevents DeepSeekV3ModelProvider.__post_init__
from emitting the misleading warning for GLM5ModelProvider while preserving the
base initialization behavior.
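The swap-and-restore pattern described above can be sketched with stand-in classes; the real providers have many fields and a different deprecation mechanism, so treat this as an assumption-laden illustration. Note that temporarily patching a class attribute is not thread-safe.

```python
import warnings
from dataclasses import dataclass


@dataclass
class DeepSeekV3ModelProvider:
    """Stand-in for the real base provider (hypothetical simplification)."""

    def _warn_deprecated(self) -> None:
        warnings.warn("DeepSeekV3ModelProvider is deprecated", DeprecationWarning)

    def __post_init__(self) -> None:
        self._warn_deprecated()


@dataclass
class GLM5ModelProvider(DeepSeekV3ModelProvider):
    def __post_init__(self) -> None:
        # Temporarily no-op the base warning, run the base init so any other
        # setup still happens, then restore the original method.
        original = DeepSeekV3ModelProvider._warn_deprecated
        DeepSeekV3ModelProvider._warn_deprecated = lambda self: None
        try:
            super().__post_init__()
        finally:
            DeepSeekV3ModelProvider._warn_deprecated = original
```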
In `@src/megatron/bridge/models/glm/glm45_bridge.py`:
- Around line 407-425: The assembly logic in glm45_bridge.py (using
self._hf_expert_cache, base_key, expert_id, expected, output_base_key) assumes
cached keys are 0..expected-1 and so does torch.stack([cache[i] for i in
range(expected)]), which breaks under expert-parallel shards whose global expert
IDs are a non-zero-based range; update the assembly to (a) check that all
expected global expert IDs are present in cache (e.g., compare set(cache.keys())
to the expected global ID set), then sort the actual global IDs and stack using
that sorted order so tensors align to their global expert index, or (b) if you
intend to support only EP==1, add an explicit assertion that cached expert IDs
are contiguous and start at 0 and fail fast; apply the same fix for both the
single-tensor path and the per-kind ("gate"/"up") path that currently uses
cache[i].
- Around line 334-336: The attributes _hf_config, _hf_expert_cache and
_hf_expert_expected are being created lazily inside
maybe_modify_converted_hf_weight, which can cause AttributeError if accessed
earlier; add an __init__ override on the class to call super().__init__(*args,
**kwargs) and initialize self._hf_config = None, self._hf_expert_cache = {} and
self._hf_expert_expected = {} so these members always exist, and then remove the
hasattr guard around _hf_expert_cache in maybe_modify_converted_hf_weight.
- Around line 353-367: The helper _should_pack_experts swallows all exceptions
around the hf_state_dict membership checks, hiding real bugs; replace the broad
"except Exception: pass" with a narrow handler that only catches expected errors
from the membership test (e.g., TypeError for unhashable keys) and lets other
exceptions propagate; update the try/except around "if base_key in
hf_state_dict" and the ".weight" variant to catch TypeError (or ValueError if
applicable) only, and re-raise any other exception so unexpected
AttributeError/TypeError logic bugs in _should_pack_experts or hf_state_dict
surface.
- Around line 252-254: The return type of maybe_modify_loaded_hf_weight is
wrong: it currently declares -> torch.Tensor but the branch where hf_param is a
dict returns a dict[str, torch.Tensor]; update the annotation to reflect both
possibilities (e.g., -> torch.Tensor | dict[str, torch.Tensor] or use
typing.Union[torch.Tensor, dict[str, torch.Tensor]]), and ensure any necessary
typing imports (from __future__ import annotations or typing.Union/Dict) are
added so type checkers accept the union; keep the existing logic in
maybe_modify_loaded_hf_weight and only change the return type to match both the
tensor and dict branches.
- Around line 373-379: In _expected_experts, replace the broad "except
Exception" around the .shape[0] access for base_key/hf_state_dict with a narrow
except that only catches the anticipated errors (e.g., KeyError, IndexError,
AttributeError) so programming errors are not swallowed; specifically, in the
block that reads hf_state_dict[base_key].shape[0] (and the .weight fallback
using base_key[:-len(".weight")]) catch only those exceptions and set expected =
None in those cases.
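The global-ID ordering fix suggested for the expert-assembly path can be sketched as below. Plain lists stand in for `torch` tensors and `assemble_expert_stack` is a hypothetical helper name; the point is validating the expected ID set and stacking in sorted global-ID order rather than assuming IDs start at 0.

```python
def assemble_expert_stack(cache: dict[int, list[float]], expected_ids: range) -> list[list[float]]:
    """Order cached per-expert weights by global expert ID before stacking.

    `cache` maps global expert ID -> weight. This works even when a shard
    owns a non-zero-based ID range, e.g. experts 8..15 on expert-parallel
    rank 1, where `cache[i] for i in range(expected)` would raise KeyError.
    """
    missing = set(expected_ids) - set(cache)
    if missing:
        raise ValueError(f"missing expert weights for global IDs: {sorted(missing)}")
    # Stack in sorted global-ID order so row i corresponds to the i-th
    # expected global expert, not to cache insertion order.
    return [cache[i] for i in sorted(expected_ids)]
```

The same check-then-sort step would apply to both the single-tensor path and the per-kind ("gate"/"up") path mentioned in the review.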
In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py`:
- Around line 273-278: The test currently calls
GlmMoeDsaForCausalLM.from_pretrained(..., trust_remote_code=True); remove the
security risk by deleting the trust_remote_code argument or setting
trust_remote_code=False when loading from the local glm5_toy_model_path in
test_glm5_conversion.py so the known local class is used without executing
remote code.
- Around line 119-152: The test fixture _create_glm5_toy_model uses
AutoConfig.from_pretrained("zai-org/GLM-5") and
AutoTokenizer.from_pretrained("zai-org/GLM-5") which creates a network
dependency; replace these with a hermetic local setup by constructing the config
with AutoConfig.for_model("glm_moe_dsa", **HF_GLM5_TOY_MODEL_CONFIG) (so the
test uses HF_GLM5_TOY_MODEL_CONFIG directly and still sets
config.torch_dtype=torch.bfloat16) and stop calling
AutoTokenizer.from_pretrained; instead create a minimal tokenizer
stub/tokenizer_config.json in model_dir (matching patterns used in other tests)
and save that file before calling model.save_pretrained, ensuring
tokenizer.save_pretrained is not used to fetch from HF Hub and the local
config.json is written from HF_GLM5_TOY_MODEL_CONFIG as already done.
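A hermetic tokenizer stub of the kind suggested above could look like this. The exact field set is an assumption — real tests would need whatever fields the loading code actually reads — but the pattern (write a local `tokenizer_config.json` instead of fetching from the Hub) is the point.

```python
import json
import tempfile
from pathlib import Path


def write_tokenizer_stub(model_dir: Path) -> Path:
    """Write a minimal tokenizer_config.json so the test never hits the HF Hub.

    The keys below are a hypothetical minimum, not the real GLM-5 config.
    """
    config = {
        "tokenizer_class": "PreTrainedTokenizerFast",
        "model_max_length": 131072,
    }
    path = model_dir / "tokenizer_config.json"
    path.write_text(json.dumps(config, indent=2))
    return path
```

Called with the toy model directory before `model.save_pretrained`, this keeps the fixture fully offline.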
---
Minor comments:
In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py`:
- Around line 108-142: The provider_bridge function sets
make_vocab_size_divisible_by twice (configs["make_vocab_size_divisible_by"] and
provider.make_vocab_size_divisible_by) which is redundant; remove the later
direct attribute assignment on the provider (the
provider.make_vocab_size_divisible_by = 1280 line) and keep the constructor-time
setting in configs so GLM5ModelProvider(**configs) receives the value once.
- Around line 51-58: Update the docstring to correct two inaccuracies: replace
the incorrect source class name "Glm5MoeForCausalLM" with the registered class
name GlmMoeDsaForCausalLM, and change the example model path from
"zai-org/GLM-4.5" to a GLM-5 example such as "zai-org/GLM-5" in the
AutoBridge.from_hf_pretrained usage so the example and description correctly
reflect GLM-5 and the GlmMoeDsaForCausalLM source class (references:
GlmMoeDsaForCausalLM, AutoBridge.from_hf_pretrained, to_megatron_provider).
In `@src/megatron/bridge/models/glm/glm45_bridge.py`:
- Line 326: Replace the string-forward type hint for WeightConversionTask with a
direct class import: add WeightConversionTask to the module imports (import it
from megatron.bridge.models.conversion.model_bridge alongside the other bridge
imports) and change the annotated field from task: "WeightConversionTask" to
task: WeightConversionTask so the type is a real symbol rather than a
forward-reference string.
In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py`:
- Around line 331-332: Remove the leftover commented-out debug token "# try:"
above the subprocess.run call in test_glm5_conversion.py (the block where result
= subprocess.run(cmd, capture_output=True, text=True, cwd=repo_root) is
invoked); either delete the commented marker entirely or replace it with a
proper try/except around subprocess.run (handling exceptions and failing the
test) if you intended to catch errors — ensure the change updates the
surrounding test logic (the test function that invokes subprocess.run)
consistently and does not leave unexplained commented-out code.
- Around line 194-197: The class docstring for TestGLM5Conversion incorrectly
references "GLM 4.5"; update the docstring in the TestGLM5Conversion class to
mention "GLM 5" (or a neutral phrase like "GLM 5 bridge") so it accurately
describes the test coverage for the GLM 5 MoE model conversion; locate the
TestGLM5Conversion class and edit its triple-quoted docstring accordingly.
- Around line 335-338: Replace the use of "assert False" which is ignored under
optimized Python with explicit raises: in the failure branches that inspect
result.returncode (using the result variable and test_name) change the "assert
False, f'GLM 5 MoE {test_name} conversion failed with return code
{result.returncode}'" to "raise AssertionError(...)" so the failure always
raises regardless of optimization; make the same change for the similar
assertion at the other occurrence later in the file (the second block around
line 434).
- Line 293: Remove the long debug print statement that exceeds 119 chars: delete
the print(f"second_layer mlp experts: {second_layer.mlp.experts} and type:
{type(second_layer.mlp.experts)} and size: {total_size} and shapes:
{total_shapes}") in
tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py; if you need
to keep information for debugging, replace it with a concise logger.debug or a
focused assertion that checks specific properties of second_layer.mlp.experts,
total_size, or total_shapes and keep any kept message well under 119 characters.
- Line 404: The test function test_glm5_conversion_parallelism_local_model
declares an unused pytest fixture parameter tmp_path; either remove tmp_path
from the function signature or rename it to _tmp_path to silence the
unused-argument warning. Update the signature of
test_glm5_conversion_parallelism_local_model accordingly (and any direct
references if present) so the test uses only local_root_path, leaving other
parameters tp, pp, ep, test_name unchanged.
---
Nitpick comments:
In `@src/megatron/bridge/models/glm_moe_dsa/__init__.py`:
- Around line 21-24: The __all__ list is not alphabetized which triggers RUF022;
update the list in __init__.py so entries are sorted alphabetically, placing
"GLM5Bridge" before "GLM5ModelProvider" (i.e., change the order of the symbols
GLM5Bridge and GLM5ModelProvider in the __all__ declaration).
In `@src/megatron/bridge/models/glm/glm45_bridge.py`:
- Line 18: The import and type annotations in glm45_bridge.py use deprecated
typing generics; replace "from typing import Dict, Mapping, Optional, Tuple"
with "from collections.abc import Mapping" (only) and use built-in generics and
PEP 604 unions in annotations: change Dict[...] → dict[...], Tuple[...] →
tuple[...], Optional[T] → T | None, and ensure any Mapping stays typed as
collections.abc.Mapping; update all occurrences such as Dict[str, torch.Tensor]
→ dict[str, torch.Tensor], Optional[...] → ... | None, and Tuple[str, int,
Optional[str]] → tuple[str, int, str | None] (adjust other similar annotations
accordingly).
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- 3rdparty/Megatron-LM
- src/megatron/bridge/models/__init__.py
- src/megatron/bridge/models/conversion/param_mapping.py
- src/megatron/bridge/models/glm/glm45_bridge.py
- src/megatron/bridge/models/glm_moe_dsa/__init__.py
- src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py
- src/megatron/bridge/models/glm_moe_dsa/glm5_provider.py
- tests/functional_tests/models/glm_moe_dsa/__init__.py
- tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py
```python
@staticmethod
def _should_allow_hf_name_mismatch(hf_param: Union[str, Dict[str, str]]) -> bool:
    """Return True when HF params may be synthesized (e.g., expert slices)."""
    def has_expert_wildcard(param_name: str) -> bool:
        return ".experts.*." in param_name or ".experts.**." in param_name
    def has_resolved_expert(param_name: str) -> bool:
        return re.search(r"\.experts\.\d+\.", param_name) is not None

    if isinstance(hf_param, str):
        return has_expert_wildcard(hf_param) or has_resolved_expert(hf_param)
    return any(has_expert_wildcard(name) or has_resolved_expert(name) for name in hf_param.values())
```
🛠️ Refactor suggestion | 🟠 Major
Apply type hints and use built-in generics per coding guidelines.
- The `hf_param` annotation uses `Union[str, Dict[str, str]]` — should be `str | dict[str, str]`.
- The nested helpers `has_expert_wildcard` and `has_resolved_expert` lack type hints.
♻️ Proposed fix
```diff
     @staticmethod
-    def _should_allow_hf_name_mismatch(hf_param: Union[str, Dict[str, str]]) -> bool:
+    def _should_allow_hf_name_mismatch(hf_param: str | dict[str, str]) -> bool:
         """Return True when HF params may be synthesized (e.g., expert slices)."""
-        def has_expert_wildcard(param_name: str) -> bool:
+        def has_expert_wildcard(param_name: str) -> bool:  # noqa: E306
             return ".experts.*." in param_name or ".experts.**." in param_name
         def has_resolved_expert(param_name: str) -> bool:
             return re.search(r"\.experts\.\d+\.", param_name) is not None
         if isinstance(hf_param, str):
             return has_expert_wildcard(hf_param) or has_resolved_expert(hf_param)
         return any(has_expert_wildcard(name) or has_resolved_expert(name) for name in hf_param.values())
```

As per coding guidelines: "Use type hints for function arguments and return types" and "Use built-in generics (list, dict, tuple) instead of typing equivalents" and "Use 'X | Y' for union types instead of 'Union[X, Y]'".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/conversion/param_mapping.py` around lines 1985 -
1995, Update the _should_allow_hf_name_mismatch signature to use built-in
generics and PEP 604 union syntax: change the parameter annotation from
Union[str, Dict[str, str]] to str | dict[str, str] and add the return type bool;
add type hints to the nested helpers has_expert_wildcard(param_name: str) ->
bool and has_resolved_expert(param_name: str) -> bool so all functions in the
_should_allow_hf_name_mismatch scope have explicit parameter and return type
annotations (keep logic unchanged).
```python
import logging
import re
from functools import partial
from typing import Dict, Mapping, Optional, Tuple
```
🛠️ Refactor suggestion | 🟠 Major
Replace typing generics with built-in equivalents per coding guidelines.
Dict, Optional, and Tuple from typing should be dict, X | None, and tuple respectively (Python 3.10+).
♻️ Proposed fix
```diff
-from typing import Dict, Mapping, Optional, Tuple
+from collections.abc import Mapping
+from typing import Optional, Tuple  # remove once all usages are updated
```

Then update method signatures throughout the file, e.g.:

- `-> Dict[str, torch.Tensor]` → `-> dict[str, torch.Tensor]`
- `Optional[torch.Tensor]` → `torch.Tensor | None`
- `Tuple[str, int, Optional[str]]` → `tuple[str, int, str | None]`
As per coding guidelines: "Use built-in generics (list, dict, tuple) instead of typing equivalents" and "Use 'T | None' for nullable types instead of 'Optional[T]'".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` at line 18, The import
and type hints use typing generics; change those to Python 3.10+ built-ins and
appropriate stdlib ABCs: remove Dict, Optional, Tuple (and Mapping) from the
typing import line in glm5_bridge.py and update all annotations—use dict[...]
instead of Dict[...], tuple[...] instead of Tuple[...], and use X | None instead
of Optional[X]; for Mapping use collections.abc.Mapping (import it) or replace
with dict if a concrete mapping is intended; update all function/method
signatures and return annotations (e.g., "-> Dict[str, torch.Tensor]" → "->
dict[str, torch.Tensor]", "Optional[torch.Tensor]" → "torch.Tensor | None",
"Tuple[str, int, Optional[str]]" → "tuple[str, int, str | None]") and remove the
now-unused typing imports.
| def maybe_modify_converted_hf_weight( | ||
| self, | ||
| task: "WeightConversionTask", | ||
| converted_weights_dict: Dict[str, torch.Tensor], | ||
| hf_state_dict: Mapping[str, torch.Tensor], | ||
| ) -> Dict[str, torch.Tensor]: | ||
| """Pack per-expert weights into GLM MoE tensors when exporting to HF.""" | ||
| if not converted_weights_dict: | ||
| return converted_weights_dict | ||
|
|
||
| if not hasattr(self, "_hf_expert_cache"): | ||
| self._hf_expert_cache = {} | ||
| self._hf_expert_expected = {} |
🛠️ Refactor suggestion | 🟠 Major
Instance state should be initialized in __init__, not lazily via hasattr.
_hf_expert_cache, _hf_expert_expected (lines 396-397) and _hf_config (line 147) are all created via hasattr guards scattered across methods. This violates the coding guideline "Initialize all externally visible members of a class in the constructor" and makes the lifecycle of these caches opaque. If maybe_modify_converted_hf_weight is called concurrently or build_conversion_tasks is skipped, the state is silently absent.
Add an __init__ to GLM5Bridge that initialises these to None / empty dicts.
♻️ Proposed fix
class GLM5Bridge(MegatronModelBridge):
+ def __init__(self):
+ super().__init__()
+ self._hf_config = None
+ self._hf_expert_cache: dict = {}
+ self._hf_expert_expected: dict = {}
Then remove the if not hasattr(self, "_hf_expert_cache"): block at lines 395-397 and the if not hasattr(self, "_hf_config"): guard at line 241.
🧰 Tools
🪛 Flake8 (7.3.0)
[error] 387-387: undefined name 'WeightConversionTask'
(F821)
🪛 Ruff (0.15.2)
[error] 387-387: Undefined name WeightConversionTask
(F821)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` around lines 385 -
397, Add an explicit __init__ to GLM5Bridge that initializes the instance
attributes _hf_expert_cache, _hf_expert_expected and _hf_config (e.g., to empty
dicts or None as appropriate) so their lifecycle is deterministic; then remove
the lazy hasattr guards that create those attributes (the check in
maybe_modify_converted_hf_weight that sets _hf_expert_cache/_hf_expert_expected
and the guard that sets _hf_config elsewhere) and rely on the
constructor-initialized members instead, updating any usages to handle their
initialized (possibly empty/None) values.
| def maybe_modify_converted_hf_weight( | ||
| self, | ||
| task: "WeightConversionTask", | ||
| converted_weights_dict: Dict[str, torch.Tensor], | ||
| hf_state_dict: Mapping[str, torch.Tensor], | ||
| ) -> Dict[str, torch.Tensor]: |
🧩 Analysis chain
🏁 Script executed:
rg -n "WeightConversionTask" --type py -C 3Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 37586
🏁 Script executed:
head -100 src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py | cat -nRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 5084
🏁 Script executed:
rg "from.*import" src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py | head -30Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 768
🏁 Script executed:
# Check if glm5_bridge.py has a TYPE_CHECKING block
grep -n "TYPE_CHECKING" src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 53
🏁 Script executed:
# Check the full imports section of glm5_bridge.py to see if WeightConversionTask should be there
sed -n '1,50p' src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1793
Import WeightConversionTask or use a TYPE_CHECKING block to resolve the type annotation.
The method signature on line 387 uses task: "WeightConversionTask" as a string annotation, but WeightConversionTask is never imported. While Python evaluates string annotations lazily, type checkers cannot resolve the annotation without the type being in scope. This violates the coding guideline to use proper type hints. Compare with other bridge implementations (e.g., deepseek_v3_bridge.py, qwen3_vl_bridge.py) that import WeightConversionTask directly from megatron.bridge.models.conversion.model_bridge.
Import WeightConversionTask from the existing import on line 26, or conditionally import it within a TYPE_CHECKING block if circular imports are a concern.
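A minimal sketch of the `TYPE_CHECKING` pattern suggested above (the megatron module path is assumed from this comment; the guarded import is only evaluated by static type checkers, so it cannot trigger a circular import at runtime):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Resolved by type checkers only; never executed at runtime.
    from megatron.bridge.models.conversion.model_bridge import WeightConversionTask


def maybe_modify_converted_hf_weight(task: WeightConversionTask) -> None:
    """With `from __future__ import annotations`, annotations stay strings at
    runtime, so the guarded import above is sufficient for the checker."""
    ...
```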
🧰 Tools
🪛 Flake8 (7.3.0)
[error] 387-387: undefined name 'WeightConversionTask'
(F821)
🪛 Ruff (0.15.2)
[error] 387-387: Undefined name WeightConversionTask
(F821)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` around lines 385 -
390, The annotated type "WeightConversionTask" used in
maybe_modify_converted_hf_weight is not imported; add an import for
WeightConversionTask from megatron.bridge.models.conversion.model_bridge (or, if
circular imports are possible, import it inside a TYPE_CHECKING block from
typing import TYPE_CHECKING and then from
megatron.bridge.models.conversion.model_bridge import WeightConversionTask
within that block) so the annotation resolves for type checkers while leaving
runtime behavior unchanged.
| def _should_pack_experts(base_key: str) -> bool: | ||
| try: | ||
| if base_key in hf_state_dict: | ||
| return True | ||
| if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict: | ||
| return True | ||
| except Exception: | ||
| pass | ||
| has_glob = getattr(hf_state_dict, "has_glob", None) | ||
| if callable(has_glob): | ||
| if has_glob(base_key): | ||
| return True | ||
| if base_key.endswith(".weight") and has_glob(base_key[: -len(".weight")]): | ||
| return True | ||
| return False |
except Exception: pass silently swallows all errors.
_should_pack_experts catches a broad Exception and suppresses it without logging (lines 420-421). This makes diagnosing failures during HF weight export very difficult. At minimum the exception should be logged.
♻️ Proposed fix
try:
if base_key in hf_state_dict:
return True
if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
return True
- except Exception:
- pass
+ except Exception as e:
+ logger.debug("Exception while probing hf_state_dict for %s: %s", base_key, e)The same pattern applies to lines 439-440 in _expected_experts. Based on coding guidelines: "When using try-except blocks, limit the except clause to the smallest set of errors possible."
🧰 Tools
🪛 Ruff (0.15.2)
[error] 420-421: try-except-pass detected, consider logging the exception
(S110)
[warning] 420-420: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py` around lines 414 -
428, Replace the broad "except Exception: pass" in _should_pack_experts with a
narrow exception handler (e.g., catch AttributeError/TypeError) and log the
caught error instead of silencing it; specifically handle cases where accessing
hf_state_dict attributes or methods may fail, and include the error and base_key
in the log message. Apply the same change to the analogous try/except in
_expected_experts, and prefer checking for hasattr/getattr safely before calling
has_glob to avoid exceptions where possible.
| def _should_pack_experts(base_key: str) -> bool: | ||
| try: | ||
| if base_key in hf_state_dict: | ||
| return True | ||
| if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict: | ||
| return True | ||
| except Exception: | ||
| pass | ||
| has_glob = getattr(hf_state_dict, "has_glob", None) | ||
| if callable(has_glob): | ||
| if has_glob(base_key): | ||
| return True | ||
| if base_key.endswith(".weight") and has_glob(base_key[: -len(".weight")]): | ||
| return True | ||
| return False |
🛠️ Refactor suggestion | 🟠 Major
Swallow-all except Exception: pass masks real failures in _should_pack_experts.
Lines 354–360 catch every possible exception silently, including bugs like attribute errors or type errors that should surface immediately. The outer try body only performs a dict membership test (in hf_state_dict) which should raise at most a TypeError if the key is unhashable.
🐛 Proposed fix
- def _should_pack_experts(base_key: str) -> bool:
- try:
- if base_key in hf_state_dict:
- return True
- if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
- return True
- except Exception:
- pass
+ def _should_pack_experts(base_key: str) -> bool:
+ try:
+ if base_key in hf_state_dict:
+ return True
+ if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
+ return True
+ except TypeError:
+ passAs per coding guidelines: "When using try-except blocks, limit the except clause to the smallest set of errors possible".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _should_pack_experts(base_key: str) -> bool: | |
| try: | |
| if base_key in hf_state_dict: | |
| return True | |
| if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict: | |
| return True | |
| except Exception: | |
| pass | |
| has_glob = getattr(hf_state_dict, "has_glob", None) | |
| if callable(has_glob): | |
| if has_glob(base_key): | |
| return True | |
| if base_key.endswith(".weight") and has_glob(base_key[: -len(".weight")]): | |
| return True | |
| return False | |
| def _should_pack_experts(base_key: str) -> bool: | |
| try: | |
| if base_key in hf_state_dict: | |
| return True | |
| if base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict: | |
| return True | |
| except TypeError: | |
| pass | |
| has_glob = getattr(hf_state_dict, "has_glob", None) | |
| if callable(has_glob): | |
| if has_glob(base_key): | |
| return True | |
| if base_key.endswith(".weight") and has_glob(base_key[: -len(".weight")]): | |
| return True | |
| return False |
🧰 Tools
🪛 Ruff (0.15.2)
[error] 359-360: try-except-pass detected, consider logging the exception
(S110)
[warning] 359-359: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/glm/glm45_bridge.py` around lines 353 - 367, The
helper _should_pack_experts swallows all exceptions around the hf_state_dict
membership checks, hiding real bugs; replace the broad "except Exception: pass"
with a narrow handler that only catches expected errors from the membership test
(e.g., TypeError for unhashable keys) and lets other exceptions propagate;
update the try/except around "if base_key in hf_state_dict" and the ".weight"
variant to catch TypeError (or ValueError if applicable) only, and re-raise any
other exception so unexpected AttributeError/TypeError logic bugs in
_should_pack_experts or hf_state_dict surface.
| try: | ||
| if base_key in hf_state_dict: | ||
| expected = int(hf_state_dict[base_key].shape[0]) | ||
| elif base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict: | ||
| expected = int(hf_state_dict[base_key[: -len(".weight")]].shape[0]) | ||
| except Exception: | ||
| expected = None |
🛠️ Refactor suggestion | 🟠 Major
Broad except Exception in _expected_experts hides shape-access bugs.
The try body accesses .shape[0] on a tensor. The only expected exceptions are KeyError (key absent) and potentially IndexError (scalar tensor). Catching all Exception silently discards programming errors.
🐛 Proposed fix
- try:
- if base_key in hf_state_dict:
- expected = int(hf_state_dict[base_key].shape[0])
- elif base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
- expected = int(hf_state_dict[base_key[: -len(".weight")]].shape[0])
- except Exception:
- expected = None
+ try:
+ if base_key in hf_state_dict:
+ expected = int(hf_state_dict[base_key].shape[0])
+ elif base_key.endswith(".weight") and base_key[: -len(".weight")] in hf_state_dict:
+ expected = int(hf_state_dict[base_key[: -len(".weight")]].shape[0])
+ except (KeyError, IndexError, TypeError):
+ expected = NoneAs per coding guidelines: "When using try-except blocks, limit the except clause to the smallest set of errors possible".
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 378-378: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/glm/glm45_bridge.py` around lines 373 - 379, In
_expected_experts, replace the broad "except Exception" around the .shape[0]
access for base_key/hf_state_dict with a narrow except that only catches the
anticipated errors (e.g., KeyError, IndexError, AttributeError) so programming
errors are not swallowed; specifically, in the block that reads
hf_state_dict[base_key].shape[0] (and the .weight fallback using
base_key[:-len(".weight")]) catch only those exceptions and set expected = None
in those cases.
| cache = self._hf_expert_cache.setdefault(base_key, {}) | ||
| if kind is None: | ||
| cache[expert_id] = tensor | ||
| if len(cache) == expected: | ||
| stacked = torch.stack([cache[i] for i in range(expected)], dim=0) | ||
| output[output_base_key] = stacked | ||
| del self._hf_expert_cache[base_key] | ||
| else: | ||
| per_kind = cache.setdefault(expert_id, {}) | ||
| per_kind[kind] = tensor | ||
| if all( | ||
| (i in cache and "gate" in cache[i] and "up" in cache[i]) for i in range(expected) | ||
| ): | ||
| stacked = torch.stack( | ||
| [torch.cat([cache[i]["gate"], cache[i]["up"]], dim=0) for i in range(expected)], | ||
| dim=0, | ||
| ) | ||
| output[output_base_key] = stacked | ||
| del self._hf_expert_cache[base_key] |
Expert packing assumes contiguous global expert IDs starting at 0 — breaks under expert parallelism.
When packing per-expert slices back into the fused [num_experts, ...] tensor at lines 411 and 420–422, the code iterates range(expected) and looks up cache[i]. Under expert parallelism (EP > 1), the Megatron→HF conversion task on a given rank will only produce weights for the local expert shard, whose global expert IDs are a non-zero-based contiguous range (e.g., experts 64–127 on rank 1). The local cache will contain those global IDs as keys, so cache[i] for i in range(expected) will KeyError or silently assemble the wrong tensor.
The fix is to either (a) reassemble only when all expected global expert IDs are present and use the sorted keys, or (b) document that this path is only safe for single-rank EP=1 conversion and add an assertion.
As per coding guidelines: "Handle tensor parallel and pipeline parallel distribution correctly in weight conversion".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/glm/glm45_bridge.py` around lines 407 - 425, The
assembly logic in glm45_bridge.py (using self._hf_expert_cache, base_key,
expert_id, expected, output_base_key) assumes cached keys are 0..expected-1 and
so does torch.stack([cache[i] for i in range(expected)]), which breaks under
expert-parallel shards whose global expert IDs are a non-zero-based range;
update the assembly to (a) check that all expected global expert IDs are present
in cache (e.g., compare set(cache.keys()) to the expected global ID set), then
sort the actual global IDs and stack using that sorted order so tensors align to
their global expert index, or (b) if you intend to support only EP==1, add an
explicit assertion that cached expert IDs are contiguous and start at 0 and fail
fast; apply the same fix for both the single-tensor path and the per-kind
("gate"/"up") path that currently uses cache[i].
| def _create_glm5_toy_model(model_dir: Path) -> None: | ||
| model_dir.mkdir(parents=True, exist_ok=True) | ||
|
|
||
| # Create GLM 4.5 config from the toy model config using AutoConfig | ||
| config = AutoConfig.from_pretrained("zai-org/GLM-5") | ||
|
|
||
| # Override with toy model config | ||
| for key, value in HF_GLM5_TOY_MODEL_CONFIG.items(): | ||
| setattr(config, key, value) | ||
|
|
||
| config.torch_dtype = torch.bfloat16 # Explicitly set the torch_dtype in config | ||
|
|
||
| # Create model with random weights and convert to bfloat16 | ||
| from transformers import GlmMoeDsaForCausalLM | ||
|
|
||
| model = GlmMoeDsaForCausalLM(config) | ||
|
|
||
| model = model.bfloat16() # Use .bfloat16() method instead of .to() | ||
| for k, v in model.named_buffers(): | ||
| if "e_score_correction_bias" in k: | ||
| v.data = v.data.to(torch.float32) | ||
|
|
||
| # Download and save tokenizer from a reference GLM model | ||
| tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-5") | ||
| tokenizer.save_pretrained(model_dir) | ||
|
|
||
| # Save model and config to directory | ||
| model.save_pretrained(model_dir, safe_serialization=True) | ||
|
|
||
| # Also save config.json explicitly to ensure compatibility with correct torch_dtype | ||
| config_to_save = HF_GLM5_TOY_MODEL_CONFIG.copy() | ||
| config_path = model_dir / "config.json" | ||
| with open(config_path, "w") as f: | ||
| json.dump(config_to_save, f, indent=2) |
🧩 Analysis chain
🏁 Script executed:
find . -name "test_glm5_conversion.py" -type fRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 138
🏁 Script executed:
cat -n tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py | sed -n '119,152p'Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1652
🏁 Script executed:
rg "GlmMoeDsaConfig|AutoConfig\.for_model" --type py -A 2 -B 2Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 604
🏁 Script executed:
rg "from transformers import" tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py -A 5Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 717
🏁 Script executed:
rg "save_pretrained|tokenizer" tests/functional_tests/models/olmoe/test_olmoe_conversion.py -B 2 -A 2Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1484
🏁 Script executed:
rg "class.*Config|PretrainedConfig" tests/functional_tests/ -t py | head -20Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 166
🏁 Script executed:
rg "PretrainedConfig|from_config" tests/functional_tests/ -t py | head -20Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 490
🏁 Script executed:
grep -r "tokenizer_config.json" tests/functional_tests/ --include="*.py" -B 3 -A 3Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 30162
🏁 Script executed:
rg "vocab_size|vocab\\.json|merges\\.txt" tests/functional_tests/ --include="*.py" -B 2 -A 2 | head -30Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 512
🏁 Script executed:
cat -n tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py | sed -n '1,30p'Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1229
🏁 Script executed:
cat -n tests/functional_tests/models/llama_nemotron/test_llama_nemotron_conversion.py | sed -n '50,100p'Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2448
Use local config construction and tokenizer stub to eliminate network dependency.
Lines 123 and 142 call AutoConfig.from_pretrained("zai-org/GLM-5") and AutoTokenizer.from_pretrained("zai-org/GLM-5"), creating a non-hermetic test fixture that fails in offline CI environments or when HuggingFace Hub is unreachable or requires credentials.
Replace with:
- `AutoConfig.for_model("glm_moe_dsa", **HF_GLM5_TOY_MODEL_CONFIG)` to construct the config directly from the toy model config (pattern already used in `tests/functional_tests/models/olmoe/test_olmoe_conversion.py`)
- Create a minimal `tokenizer_config.json` locally instead of downloading (pattern already used in several test files like `llama_nemotron` and `qwen_vl`)
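The tokenizer-stub half of that fix might look like this (field values are placeholders; a real test should copy whatever minimal fields its tokenizer class actually needs):

```python
import json
from pathlib import Path


def write_tokenizer_stub(model_dir: Path) -> Path:
    """Write a minimal tokenizer_config.json so the fixture never touches the Hub."""
    model_dir.mkdir(parents=True, exist_ok=True)
    stub = {"tokenizer_class": "PreTrainedTokenizerFast", "model_max_length": 131072}
    path = model_dir / "tokenizer_config.json"
    path.write_text(json.dumps(stub, indent=2))
    return path
```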
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` around
lines 119 - 152, The test fixture _create_glm5_toy_model uses
AutoConfig.from_pretrained("zai-org/GLM-5") and
AutoTokenizer.from_pretrained("zai-org/GLM-5") which creates a network
dependency; replace these with a hermetic local setup by constructing the config
with AutoConfig.for_model("glm_moe_dsa", **HF_GLM5_TOY_MODEL_CONFIG) (so the
test uses HF_GLM5_TOY_MODEL_CONFIG directly and still sets
config.torch_dtype=torch.bfloat16) and stop calling
AutoTokenizer.from_pretrained; instead create a minimal tokenizer
stub/tokenizer_config.json in model_dir (matching patterns used in other tests)
and save that file before calling model.save_pretrained, ensuring
tokenizer.save_pretrained is not used to fetch from HF Hub and the local
config.json is written from HF_GLM5_TOY_MODEL_CONFIG as already done.
| model = GlmMoeDsaForCausalLM.from_pretrained( | ||
| glm5_toy_model_path, | ||
| torch_dtype=torch.bfloat16, | ||
| low_cpu_mem_usage=False, # Ensure full loading | ||
| trust_remote_code=True, | ||
| ) |
trust_remote_code=True is a security risk in tests.
Loading a model with trust_remote_code=True executes arbitrary code from the HuggingFace Hub. Since the toy model is created locally from a known class (GlmMoeDsaForCausalLM), the flag is unnecessary here.
🐛 Proposed fix
model = GlmMoeDsaForCausalLM.from_pretrained(
glm5_toy_model_path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=False,
- trust_remote_code=True,
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| model = GlmMoeDsaForCausalLM.from_pretrained( | |
| glm5_toy_model_path, | |
| torch_dtype=torch.bfloat16, | |
| low_cpu_mem_usage=False, # Ensure full loading | |
| trust_remote_code=True, | |
| ) | |
| model = GlmMoeDsaForCausalLM.from_pretrained( | |
| glm5_toy_model_path, | |
| torch_dtype=torch.bfloat16, | |
| low_cpu_mem_usage=False, # Ensure full loading | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/models/glm_moe_dsa/test_glm5_conversion.py` around
lines 273 - 278, The test currently calls
GlmMoeDsaForCausalLM.from_pretrained(..., trust_remote_code=True); remove the
security risk by deleting the trust_remote_code argument or setting
trust_remote_code=False when loading from the local glm5_toy_model_path in
test_glm5_conversion.py so the known local class is used without executing
remote code.
d892a2d to c6b909f Compare
… tests and cleaning Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: pengdurice <pengduhit@gmail.com>
c6b909f to 8a71314 Compare
Signed-off-by: pengdurice <pengduhit@gmail.com>
|
@malay-nagda hi, nice to e-meet you. Do you mind reviewing this PR? thanks! This is to add GLM5 support. |
|
Hi. For base recipes, I think @yaoyu-33 can help review? |
Hi @malay-nagda thank you for the reply. @yaoyu-33 , nice to meet you, do you have time for reviewing this PR? Thanks! |
|
@yaoyu-33, quickly reminder of this PR, thank you! Have a nice week ahead! |
HollowMan6
left a comment
I checked this PR, and it looks like the current implementation doesn't properly make GLM5 use DSA but wrongly uses MLA instead:
After applying all the fixes I propose (Suggested changes) below, it seems to work fine (tested with LoRA enabled):
Note that it also needs dependency NVIDIA/Megatron-LM#3026, and for now mcore doesn't support CP and thd format for DSA.
Now I opened NVIDIA/Megatron-LM#3674 to integrate the tilelang fused kernels for DSA, as well as supporting CP and thd format. Please feel free to try it out and let me know if you find any issues. |
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com> Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com> Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com> Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com> Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com> Signed-off-by: pengdurice <pengduhit@gmail.com>
HollowMan6
left a comment
Enable and fix MTP mappings support
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com> Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com> Signed-off-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <j88437182@hotmail.com> Signed-off-by: pengdurice <pengduhit@gmail.com>
|
@pengdurice @HollowMan6 hi, is this pr ready to merge then? |
|
@wuyaoxuehun I haven't tested the convergence yet, just using a dummy model for now due to constrained resources on my side. Other than that, it looks good to me. |
Some more optional changes so that we don't need to further override build_adapter_conversion_tasks (need to wait until #2644 is merged)
| dsa_indexer_topk: int = 2048 | ||
| dsa_indexer_loss_coeff: float = 0.001 | ||
| dsa_indexer_use_sparse_loss: bool = True | ||
| mtp_num_layers: int = 1 |
Maybe not pass this param here so that users can decide whether to enable MTP.
| mtp_num_layers: int = 1 |
| "dsa_indexer_loss_coeff": 0.001, | ||
| "dsa_indexer_use_sparse_loss": True, | ||
| # MTP params | ||
| "mtp_num_layers": hf_config.num_nextn_predict_layers, |
Maybe not pass this param here so that users can decide whether to enable MTP.
| "mtp_num_layers": hf_config.num_nextn_predict_layers, |
| def build_conversion_tasks(self, hf_pretrained, megatron_model): | ||
| """Override to store config before mapping_registry is called.""" | ||
| # Store config on instance for use in mapping_registry | ||
| self._hf_config = hf_pretrained.config | ||
| return super().build_conversion_tasks(hf_pretrained, megatron_model) | ||
|
|
need to wait until #2644 is merged
| def build_conversion_tasks(self, hf_pretrained, megatron_model): | |
| """Override to store config before mapping_registry is called.""" | |
| # Store config on instance for use in mapping_registry | |
| self._hf_config = hf_pretrained.config | |
| return super().build_conversion_tasks(hf_pretrained, megatron_model) |
| # optionally add MTP mappings | ||
| if not hasattr(self, "_hf_config"): | ||
| logger.warning("No HF config found, skipping MTP mappings.") | ||
| return MegatronMappingRegistry(*mapping_list) | ||
| hf_config = self._hf_config |
need to wait until #2644 is merged
| # optionally add MTP mappings | |
| if not hasattr(self, "_hf_config"): | |
| logger.warning("No HF config found, skipping MTP mappings.") | |
| return MegatronMappingRegistry(*mapping_list) | |
| hf_config = self._hf_config | |
| hf_config = self.hf_config |
| for layer_prefix in ("transformer_layer", "mtp_model_layer"): | ||
| for megatron_param, hf_param in (param_mappings | layer_specific_mappings).items(): | ||
| megatron_param = ( | ||
| megatron_param.replace(".*", f".*.{layer_prefix}") | ||
| .replace("decoder", "mtp") | ||
| .replace(".*", f".{mtp_layer}") | ||
| ) | ||
| hf_param = hf_param.replace("layers.*", f"layers.{mtp_layer + num_transformer_layers}") | ||
| mapping_list.append(AutoMapping(megatron_param=megatron_param, hf_param=hf_param)) | ||
|
|
||
| # MTP specific mappings | ||
| mapping_list.extend( | ||
| [ | ||
| AutoMapping( | ||
| megatron_param=f"mtp.layers.{mtp_layer}.enorm.weight", | ||
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.enorm.weight", | ||
| ), | ||
| AutoMapping( | ||
| megatron_param=f"mtp.layers.{mtp_layer}.hnorm.weight", | ||
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.hnorm.weight", | ||
| ), | ||
| AutoMapping( | ||
| megatron_param=f"mtp.layers.{mtp_layer}.eh_proj.weight", | ||
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.eh_proj.weight", | ||
| ), | ||
| AutoMapping( | ||
| megatron_param=f"mtp.layers.{mtp_layer}.final_layernorm.weight", | ||
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.shared_head.norm.weight", | ||
| ), | ||
| ] | ||
| ) |
Move mappings that don't use layer_prefix out of the loop
| for layer_prefix in ("transformer_layer", "mtp_model_layer"): | |
| for megatron_param, hf_param in (param_mappings | layer_specific_mappings).items(): | |
| megatron_param = ( | |
| megatron_param.replace(".*", f".*.{layer_prefix}") | |
| .replace("decoder", "mtp") | |
| .replace(".*", f".{mtp_layer}") | |
| ) | |
| hf_param = hf_param.replace("layers.*", f"layers.{mtp_layer + num_transformer_layers}") | |
| mapping_list.append(AutoMapping(megatron_param=megatron_param, hf_param=hf_param)) | |
| # MTP specific mappings | |
| mapping_list.extend( | |
| [ | |
| AutoMapping( | |
| megatron_param=f"mtp.layers.{mtp_layer}.enorm.weight", | |
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.enorm.weight", | |
| ), | |
| AutoMapping( | |
| megatron_param=f"mtp.layers.{mtp_layer}.hnorm.weight", | |
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.hnorm.weight", | |
| ), | |
| AutoMapping( | |
| megatron_param=f"mtp.layers.{mtp_layer}.eh_proj.weight", | |
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.eh_proj.weight", | |
| ), | |
| AutoMapping( | |
| megatron_param=f"mtp.layers.{mtp_layer}.final_layernorm.weight", | |
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.shared_head.norm.weight", | |
| ), | |
| ] | |
| ) | |
| # MTP specific mappings | |
| mapping_list.extend( | |
| [ | |
| AutoMapping( | |
| megatron_param=f"mtp.layers.{mtp_layer}.enorm.weight", | |
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.enorm.weight", | |
| ), | |
| AutoMapping( | |
| megatron_param=f"mtp.layers.{mtp_layer}.hnorm.weight", | |
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.hnorm.weight", | |
| ), | |
| AutoMapping( | |
| megatron_param=f"mtp.layers.{mtp_layer}.eh_proj.weight", | |
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.eh_proj.weight", | |
| ), | |
| AutoMapping( | |
| megatron_param=f"mtp.layers.{mtp_layer}.final_layernorm.weight", | |
| hf_param=f"model.layers.{mtp_layer + num_transformer_layers}.shared_head.norm.weight", | |
| ), | |
| ] | |
| ) | |
| for layer_prefix in ("transformer_layer", "mtp_model_layer"): | |
| for megatron_param, hf_param in (param_mappings | layer_specific_mappings).items(): | |
| megatron_param = ( | |
| megatron_param.replace(".*", f".*.{layer_prefix}") | |
| .replace("decoder", "mtp") | |
| .replace(".*", f".{mtp_layer}") | |
| ) | |
| hf_param = hf_param.replace("layers.*", f"layers.{mtp_layer + num_transformer_layers}") | |
| mapping_list.append(AutoMapping(megatron_param=megatron_param, hf_param=hf_param)) |
@HollowMan6 thank you so much for your contribution to this PR. Due to resource constraints, I was able to only do local unit tests. I am doing some more testing on my side as well. Hope we get to a workable stage soon! |
What does this PR do ?
This PR adds support of GLM5 Model.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit
New Features
Bug Fixes / Improvements
Tests