@spacegoing

1. DeepSeekV3 / Moonlight Architecture Context

The DeepSeekV3 (and its smaller variant Moonlight) architecture differs from standard MoE models in how it handles the router (gate) layer:

  • Standard Linear Router: Typically nn.Linear(hidden_size, num_experts, bias=False).
  • Score Correction Bias: Instead of a standard additive bias in the linear layer, DeepSeekV3 introduces a separate parameter often termed e_score_correction_bias in Hugging Face.
    • HF Structure: model.layers[i].mlp.gate (Linear, no bias) + e_score_correction_bias.
    • Megatron-Core Mapping (via MBridge): MBridge maps e_score_correction_bias to mlp.router.expert_bias. The standard mlp.router.bias attribute still exists on the router object but is initialized to None (see the sketch after this list).
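For orientation, here is a minimal illustrative stand-in (not the actual Hugging Face module; hidden_size and num_experts are placeholder arguments) showing how the bias-free gate and the separate correction parameter sit side by side on the HF side:

import torch
from torch import nn

class ToyDeepseekV3Routing(nn.Module):
    # Illustrative sketch only: mirrors the layout described above,
    # not the real HF modeling code.
    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        # model.layers[i].mlp.gate -> Linear with no additive bias
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Separate score-correction term; MBridge maps this to
        # mlp.router.expert_bias on the Megatron-Core side.
        self.e_score_correction_bias = nn.Parameter(torch.zeros(num_experts))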

2. Execution Trace & Root Cause

The bridge (e.g., DeepseekV3Bridge) instantiates the model via get_model inside mbridge/core/bridge.py and mbridge/models/deepseek_v3.py.
The GPTModel is created, and MCore initializes the TopKRouter.

  • Crucial Detail: In MCore, TopKRouter attributes are often defined even if unused. For DeepSeekV3, self.bias is defined as an attribute but set to None.

Callback Execution (The Crash)

After model creation, the callbacks are executed.
In mbridge/utils/post_creation_callbacks.py:

def freeze_moe_router(model, ...):
    for layer in model.decoder.layers:
        if hasattr(layer.mlp, "router"):
            # ...
            if hasattr(layer.mlp.router, "bias"):
                # CRASH HERE: layer.mlp.router.bias is None!
                layer.mlp.router.bias.requires_grad = False 

Why it crashes: hasattr(obj, "bias") returns True even when obj.bias is None, so the guard passes; assigning .requires_grad on None then raises AttributeError.
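The failure mode can be reproduced in isolation with a toy class standing in for MCore's TopKRouter (illustrative only, not the real class):

class ToyRouter:
    def __init__(self):
        self.bias = None  # attribute exists, but holds no parameter

router = ToyRouter()
print(hasattr(router, "bias"))         # True -- the hasattr guard passes
try:
    router.bias.requires_grad = False  # same line as the crash above
except AttributeError as e:
    print(e)                           # 'NoneType' object has no attribute 'requires_grad'

# The safe pattern used in the fix below: getattr plus an explicit None check.
param = getattr(router, "bias", None)
if param is not None:
    param.requires_grad = False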

Secondary Issue: The original code never touched expert_bias, so even without the crash the router would not have been fully frozen for DeepSeekV3 models.

3. Solution

The fix makes the callback robust by explicitly checking for None values and expanding the attribute list to include expert_bias.

Code Changes

Modified mbridge/utils/post_creation_callbacks.py:

def freeze_moe_router(model, pre_process, post_process, config, hf_config):
    for layer in model.decoder.layers:
        if hasattr(layer.mlp, "router"):
            router = layer.mlp.router
            # 1. Added 'expert_bias' to support DeepSeekV3
            # 2. Iterate safely over potential attributes
            for attr in ["weight", "bias", "expert_bias"]:
                param = getattr(router, attr, None)
                # 3. Explicit check prevents crash on None parameters
                if param is not None:
                    param.requires_grad = False
        
        # Similar logic applied to shared_experts
        if hasattr(layer.mlp, "shared_experts"):
            shared_experts = layer.mlp.shared_experts
            for attr in ["gate_weight", "gate_bias"]:
                param = getattr(shared_experts, attr, None)
                if param is not None:
                    param.requires_grad = False
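As an illustrative sanity check (hypothetical helper, not part of the patch), the same attribute names can be used to verify that the callback left nothing trainable:

def assert_router_frozen(model):
    # Mirrors the attribute lists used in the fixed callback above.
    targets = [("router", ["weight", "bias", "expert_bias"]),
               ("shared_experts", ["gate_weight", "gate_bias"])]
    for layer in model.decoder.layers:
        for module_name, attrs in targets:
            module = getattr(layer.mlp, module_name, None)
            if module is None:
                continue
            for attr in attrs:
                param = getattr(module, attr, None)
                assert param is None or not param.requires_grad, \
                    f"{module_name}.{attr} is still trainable"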
