Add openseek modeling code #53
base: main
Conversation
Make the modeling code available to the Hugging Face community. It implements DeepSeek MLA and DeepSeek MoE, but the drawback is that the training forward of DeepSeek MoE has not been implemented yet. Some modules, including learnable residual #47, DMA sparse attention #48, and the CDMoE feedforward network #45, are added for experimentation.
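For context, the learnable residual idea referenced in #47 replaces the fixed skip connection `x + f(x)` with a gated one whose mixing coefficients are trained. A minimal sketch, assuming a per-dimension gate (the class name and initialization below are illustrative, not necessarily the exact formulation in the PR):

```python
import torch
import torch.nn as nn

class LearnableResidual(nn.Module):
    """Residual connection with trainable per-dimension gates (illustrative sketch)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Initialized to 1.0 so training starts from a standard identity residual.
        self.residual_weight = nn.Parameter(torch.ones(hidden_size))

    def forward(self, residual: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
        # Scale the skip path elementwise before adding the block output.
        return residual * self.residual_weight + hidden_states
```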
Given that the open-source DeepSeek V3 modeling code does not provide an MoE training forward, I provide trainable modeling code here, although, of course, optimization is still needed in terms of expert parallelism.
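To illustrate what a trainable MoE forward involves, here is a minimal dense-dispatch sketch in the style of earlier DeepSeek-MoE reference code; the function name, softmax scoring, and shapes are assumptions for the example, not the exact code in this PR:

```python
import torch
import torch.nn.functional as F

def moe_train_forward(x, experts, gate_weight, top_k=2):
    """Differentiable top-k MoE forward (illustrative sketch).

    x: [num_tokens, hidden]; experts: list of expert modules;
    gate_weight: [num_experts, hidden] router matrix.
    """
    scores = F.linear(x, gate_weight).softmax(dim=-1)      # [tokens, num_experts]
    topk_weight, topk_idx = scores.topk(top_k, dim=-1)     # [tokens, top_k]

    # Replicate each token once per selected expert, run the matching
    # expert on its slice, then recombine with the routing weights so
    # gradients flow through both the experts and the router.
    flat_idx = topk_idx.reshape(-1)                        # [tokens * top_k]
    x_rep = x.repeat_interleave(top_k, dim=0)              # [tokens * top_k, hidden]
    y = torch.empty_like(x_rep)
    for e, expert in enumerate(experts):
        mask = flat_idx == e
        if mask.any():
            y[mask] = expert(x_rep[mask])
    y = y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)
    return y.sum(dim=1)                                    # [num_tokens, hidden]
```

Unlike an inference path wrapped in `@torch.no_grad()`, nothing here detaches from the graph, which is what makes the forward usable for training; expert-parallel dispatch would still need the all-to-all machinery.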
@LoserCheems We have implemented the training code in the FlagScale repository, which supports various types of parallelism. FlagScale is our official framework for large-scale training. If you want to add any features, we can assist you directly.
Thank you @aoyulong, but I observed that the FlagScale implementation only includes the
@LoserCheems FlagScale can support all the modules you mentioned, such as MLA and DeepSeekMoE, and can fully pretrain DeepSeek-V3.
@aoyulong I'm sorry for my oversight; I did not fully understand the structure of Megatron in a short time. May I provide an improved implementation in PyTorch, and then ask for your help integrating the improvement into FlagScale?
@LoserCheems We will provide a contribution guide to help other contributors who have similar questions. This will also be updated in the README.
@OpenHands please give a summary.
I'm on it! ftgreat can track my progress at all-hands.dev |
**Summary of PR #53: "Add openseek modeling code"**

**Overview**
PR #53 by @LoserCheems adds comprehensive HuggingFace-compatible modeling code for OpenSeek, implementing the DeepSeek V3 architecture with several experimental enhancements. The PR introduces 4 new Python files totaling ~3,000 lines of code.

**What was Added**
The PR creates a new
**Key Features Implemented**
Core Architecture
Experimental Enhancements (Referenced Issues #45, #47, #48)
Technical Capabilities
**Context from Discussion**
The conversation in the PR reveals:
**Assessment**
This is a substantial contribution that fills a gap in the open-source ecosystem by providing trainable DeepSeek V3 modeling code with experimental enhancements. The code appears well-structured and follows HuggingFace conventions, making it accessible to the broader ML community that prefers PyTorch/HuggingFace over Megatron-based frameworks like FlagScale. The PR has been open since March 2025 with ongoing discussion about the best path forward for integration with the project's training infrastructure.
Thank you for tagging me. I have reviewed the summary provided by @OpenHands.
@gemini-code-assist please give a summary of this PR.
**Summary of Changes**
This pull request introduces comprehensive modeling code for Deepseek V3 and OpenSeek to the HuggingFace community. It enables the implementation and training of Deepseek MLA and Deepseek MoE, addressing the previous lack of a trainable forward pass for Deepseek MoE. Additionally, it incorporates several experimental modules, including learnable residual connections, dynamic mask attention, and a Cross Domain Mixture of Experts feedforward network, enhancing the model's capabilities and flexibility for research and development.

**Highlights**
**Changelog**

**Activity**
/gemini review |
Code Review
This pull request introduces modeling code for DeepSeek and OpenSeek models. The implementation includes several advanced features like MoE, MLA, and DMA. However, there are several critical issues that need to be addressed. The directory `modeliing` seems to be a typo and should likely be `modeling`. There are also several critical bugs, such as calls to undefined loss functions, incorrect use of attention function arguments, and unraised exceptions. Additionally, there is some confusing naming and dead code in `modeling_openseek.py` that should be cleaned up. I've left specific comments on these issues.
```python
        loss = None
        if labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
```
`self.loss_function` is called here but it's not defined within this class or its parent classes. This will result in an `AttributeError` at runtime when labels are provided. You should implement the loss calculation, typically using `torch.nn.CrossEntropyLoss` with appropriate shifting of logits and labels.
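For reference, the usual causal-LM loss shifts logits and labels by one position; a minimal sketch (a hypothetical helper, shown only to illustrate the shifting the review mentions):

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels, vocab_size):
    # Next-token prediction: position t predicts token t + 1, so drop
    # the last logit and the first label before flattening.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, vocab_size),
        shift_labels.view(-1),
        ignore_index=-100,  # convention for positions excluded from the loss
    )
```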
```python
            dropout_p=self.attention_dropout,
            scale=self.scaling,
            is_causal=self.is_causal,
            enable_gqa=True,
```
```python
            rate_value = torch.kthvalue(attn_mask, num_dynamic_mask, dim=-1, keepdim=True).values
            attn_mask = attn_mask.masked_fill(attn_mask < rate_value, min_type)
        else:
            ValueError("`dynamic_mask_ratio` should be in the range (0.0, 1.0)")
```
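For intuition, `torch.kthvalue` returns the k-th smallest score per row, and everything strictly below that threshold is filled with the dtype minimum. A standalone toy example with invented numbers (assuming `num_dynamic_mask` is derived from the ratio):

```python
import torch

attn_mask = torch.tensor([[0.9, 0.1, 0.5, 0.3]])
dynamic_mask_ratio = 0.5
num_dynamic_mask = int(attn_mask.shape[-1] * dynamic_mask_ratio)  # k = 2
min_type = torch.finfo(attn_mask.dtype).min

# The 2nd-smallest value in the row is 0.3; only 0.1 falls strictly below it.
rate_value = torch.kthvalue(attn_mask, num_dynamic_mask, dim=-1, keepdim=True).values
attn_mask = attn_mask.masked_fill(attn_mask < rate_value, min_type)
# 0.1 is replaced by the float32 minimum; 0.9, 0.5, and 0.3 survive.
```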
```python
        loss = None
        if labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.vocab_size, **kwargs)
```
```python
        loss = None
        if labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, pooled_logits=pooled_logits, config=self.config)
```
```python
class DeepseekV3MLP(nn.Module):
    def __init__(self, config: OpenseekConfig, hidden_size=None, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
        self.intermediate_size = (
            config.intermediate_size if intermediate_size is None else intermediate_size
        )

        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj


class DeepseekV3MoEGate(nn.Module):
    def __init__(self, config: OpenseekConfig):
        super().__init__()
        self.config = config
        self.top_k = config.num_experts_per_tok
        self.n_routed_experts = config.num_routed_experts
        self.routed_scaling_factor = config.routed_scaling_factor
        self.scoring_func = config.scoring_func
        self.seq_aux = config.seq_aux
        self.topk_method = config.topk_method
        self.n_group = config.n_group
        self.topk_group = config.topk_group

        # topk selection algorithm
        self.norm_topk_prob = config.norm_topk_prob
        self.gating_dim = config.hidden_size
        self.weight = nn.Parameter(
            torch.empty((self.n_routed_experts, self.gating_dim))
        )
        if self.topk_method == "noaux_tc":
            self.e_score_correction_bias = nn.Parameter(
                torch.empty((self.n_routed_experts))
            )
        self.reset_parameters()

    def reset_parameters(self) -> None:
        import torch.nn.init as init

        init.kaiming_uniform_(self.weight, a=math.sqrt(5))

    def forward(self, hidden_states):
        bsz, seq_len, h = hidden_states.shape
        ### compute gating score
        hidden_states = hidden_states.view(-1, h)
        logits = F.linear(
            hidden_states.type(torch.float32), self.weight.type(torch.float32), None
        )
        if self.scoring_func == "sigmoid":
            scores = logits.sigmoid()
        else:
            raise NotImplementedError(
                f"insupportable scoring function for MoE gating: {self.scoring_func}"
            )

        ### select top-k experts
        if self.topk_method == "noaux_tc":
            assert not self.training
            scores_for_choice = scores.view(bsz * seq_len, -1) + self.e_score_correction_bias.unsqueeze(0)
            group_scores = (
                scores_for_choice.view(bsz * seq_len, self.n_group, -1).topk(2, dim=-1)[0].sum(dim=-1)
            )  # [n, n_group]
            group_idx = torch.topk(
                group_scores, k=self.topk_group, dim=-1, sorted=False
            )[
                1
            ]  # [n, top_k_group]
            group_mask = torch.zeros_like(group_scores)  # [n, n_group]
            group_mask.scatter_(1, group_idx, 1)  # [n, n_group]
            score_mask = (
                group_mask.unsqueeze(-1)
                .expand(
                    bsz * seq_len, self.n_group, self.n_routed_experts // self.n_group
                )
                .reshape(bsz * seq_len, -1)
            )  # [n, e]
            tmp_scores = scores_for_choice.masked_fill(~score_mask.bool(), 0.0)  # [n, e]
            _, topk_idx = torch.topk(
                tmp_scores, k=self.top_k, dim=-1, sorted=False
            )
            topk_weight = scores.gather(1, topk_idx)
        else:
            raise NotImplementedError(
                f"insupportable TopK function for MoE gating: {self.topk_method}"
            )

        ### norm gate to sum 1
        if self.top_k > 1 and self.norm_topk_prob:
            denominator = topk_weight.sum(dim=-1, keepdim=True) + 1e-20
            topk_weight = topk_weight / denominator
        topk_weight = topk_weight * self.routed_scaling_factor  # must multiply the scaling factor

        return topk_idx, topk_weight


class DeepseekV3MoE(nn.Module):
    """
    A mixed expert module containing shared experts.
    """

    def __init__(self, config: OpenseekConfig):
        super().__init__()
        self.config = config
        self.num_experts_per_tok = config.num_experts_per_tok
        self.num_experts = config.num_experts
        self.n_routed_experts = config.num_routed_experts
        self.n_shared_experts = self.num_experts - self.n_routed_experts

        if hasattr(config, "ep_size") and config.ep_size > 1:
            assert config.ep_size == dist.get_world_size()
            self.ep_size = config.ep_size
            self.experts_per_rank = self.n_routed_experts // config.ep_size
            self.ep_rank = dist.get_rank()
            self.experts = nn.ModuleList(
                [
                    (
                        DeepseekV3MLP(
                            config, intermediate_size=config.moe_intermediate_size
                        )
                        if i >= self.ep_rank * self.experts_per_rank
                        and i < (self.ep_rank + 1) * self.experts_per_rank
                        else None
                    )
                    for i in range(self.n_routed_experts)
                ]
            )
        else:
            self.ep_size = 1
            self.experts_per_rank = self.n_routed_experts
            self.ep_rank = 0
            self.experts = nn.ModuleList(
                [
                    DeepseekV3MLP(
                        config, intermediate_size=config.moe_intermediate_size
                    )
                    for i in range(self.n_routed_experts)
                ]
            )
        self.gate = DeepseekV3MoEGate(config)
        if self.n_shared_experts > 0:
            intermediate_size = config.moe_intermediate_size * self.n_shared_experts
            self.shared_experts = DeepseekV3MLP(config, intermediate_size=intermediate_size)

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        topk_idx, topk_weight = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
        # TODO: Deepseek is not open-source training method, we need to implement the training method
        if not self.training:
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.n_shared_experts is not None:
            y = y + self.shared_experts(identity)
        return y

    @torch.no_grad()
    def moe_infer(self, x, topk_ids, topk_weight):
        cnts = topk_ids.new_zeros((topk_ids.shape[0], len(self.experts)))
        cnts.scatter_(1, topk_ids, 1)
        tokens_per_expert = cnts.sum(dim=0)
        idxs = topk_ids.view(-1).argsort()
        sorted_tokens = x[idxs // topk_ids.shape[1]]
        sorted_tokens_shape = sorted_tokens.shape
        if self.ep_size > 1:
            tokens_per_ep_rank = tokens_per_expert.view(self.ep_size, -1).sum(dim=1)
            tokens_per_expert_group = tokens_per_expert.new_empty(
                tokens_per_expert.shape[0]
            )
            dist.all_to_all_single(tokens_per_expert_group, tokens_per_expert)
            output_splits = (
                tokens_per_expert_group.view(self.ep_size, -1)
                .sum(1)
                .cpu()
                .numpy()
                .tolist()
            )
            gathered_tokens = sorted_tokens.new_empty(
                tokens_per_expert_group.sum(dim=0).cpu().item(), sorted_tokens.shape[1]
            )
            input_split_sizes = tokens_per_ep_rank.cpu().numpy().tolist()
            dist.all_to_all(
                list(gathered_tokens.split(output_splits)),
                list(sorted_tokens.split(input_split_sizes)),
            )
            tokens_per_expert_post_gather = tokens_per_expert_group.view(
                self.ep_size, self.experts_per_rank
            ).sum(dim=0)
            gatherd_idxs = np.zeros(shape=(gathered_tokens.shape[0],), dtype=np.int32)
            s = 0
            for i, k in enumerate(tokens_per_expert_group.cpu().numpy()):
                gatherd_idxs[s : s + k] = i % self.experts_per_rank
                s += k
            gatherd_idxs = gatherd_idxs.argsort()
            sorted_tokens = gathered_tokens[gatherd_idxs]
            tokens_per_expert = tokens_per_expert_post_gather
        tokens_per_expert = tokens_per_expert.cpu().numpy()

        outputs = []
        start_idx = 0
        for i, num_tokens in enumerate(tokens_per_expert):
            end_idx = start_idx + num_tokens
            if num_tokens == 0:
                continue
            expert = self.experts[i + self.ep_rank * self.experts_per_rank]
            tokens_for_this_expert = sorted_tokens[start_idx:end_idx]
            expert_out = expert(tokens_for_this_expert)
            outputs.append(expert_out)
            start_idx = end_idx

        outs = torch.cat(outputs, dim=0) if len(outputs) else sorted_tokens.new_empty(0)
        if self.ep_size > 1:
            new_x = torch.empty_like(outs)
            new_x[gatherd_idxs] = outs
            gathered_tokens = new_x.new_empty(*sorted_tokens_shape)
            dist.all_to_all(
                list(gathered_tokens.split(input_split_sizes)),
                list(new_x.split(output_splits)),
            )
            outs = gathered_tokens

        new_x = torch.empty_like(outs)
        new_x[idxs] = outs
        final_out = (
            new_x.view(*topk_ids.shape, -1)
            .type(topk_weight.dtype)
            .mul_(topk_weight.unsqueeze(dim=-1))
            .sum(dim=1)
            .type(new_x.dtype)
        )
        return final_out
# ====================
```
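To make the `noaux_tc` group-limited routing in `DeepseekV3MoEGate.forward` concrete, here is a standalone toy walk-through with invented numbers (one token, 4 experts in 2 groups, keeping 1 group and 2 experts):

```python
import torch

scores = torch.tensor([[0.1, 0.8, 0.3, 0.2]])  # [n_tokens=1, n_experts=4]
n_group, topk_group, top_k = 2, 1, 2

# Score each group by the sum of its two best experts: group 0 -> 0.9, group 1 -> 0.5.
group_scores = scores.view(1, n_group, -1).topk(2, dim=-1)[0].sum(dim=-1)
group_idx = torch.topk(group_scores, k=topk_group, dim=-1)[1]  # group 0 wins

# Zero out every expert outside the winning group, then take the per-token top-k.
group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1)
score_mask = group_mask.unsqueeze(-1).expand(1, n_group, 2).reshape(1, -1)
tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)  # [[0.1, 0.8, 0.0, 0.0]]
topk_weight, topk_idx = torch.topk(tmp_scores, k=top_k, dim=-1)
# topk_idx == [[1, 0]]: both routed experts come from group 0; group 1 is never used.
```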
```python
        self.expert_retrieval_dim = config.expert_retrieval_size
        self.num_experts = config.num_experts
        self.top_k = config.num_experts_per_tok
        self.num_keys = int(math.sqrt(self.num_experts))
```
The number of keys is calculated as `int(math.sqrt(self.num_experts))`. If `self.num_experts` is not a perfect square, this will truncate the result, and some experts will never be used. For the default `num_experts=2048`, sqrt(2048) is ~45.25, which becomes 45. This means only 45*45=2025 experts are addressable. Please ensure this is the intended behavior or add a check to ensure `num_experts` is a perfect square.
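A guard along the lines the review suggests could look like this; a sketch meant for the `__init__` shown above, with an illustrative error message:

```python
import math

num_keys = int(math.sqrt(self.num_experts))
if num_keys * num_keys != self.num_experts:
    raise ValueError(
        f"num_experts ({self.num_experts}) must be a perfect square; "
        f"only {num_keys * num_keys} experts would be addressable."
    )
self.num_keys = num_keys
```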
```python
            position_embeddings=position_embeddings,
            **kwargs,
        )
        self_attn_weights = None
```
```python
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.max_seq_len_cached:  # growth
            inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)
            self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: may break with compilation
```
```python
        past_key_value: Optional[Cache] = None,
        cache_position: Optional[torch.LongTensor] = None,
        **kwargs: Unpack[FlashAttentionKwargs],
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
```
The function signature indicates a return type of `Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]`, but the function returns only two values (`attn_output`, `attn_weights`). The `past_key_value` cache is updated in place and not returned. Please correct the return type annotation to match the implementation, which should be `Tuple[torch.Tensor, Optional[torch.Tensor]]`.
```diff
-    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
```
Pull Request Overview
This PR adds comprehensive OpenSeek modeling code to provide Hugging Face community access to DeepSeek-style architectures. The implementation includes experimental features like Multi-Head Latent Attention (MLA), Mixture of Experts (MoE), Dynamic Mask Attention (DMA), learnable residuals, and Cross Domain MoE (CDMoE). Notably, the training forward pass for DeepSeek MoE is not yet implemented, limiting its use to inference only.
- Implements DeepSeek MLA and MoE architectures with experimental enhancements
- Adds configuration classes for both OpenSeek and DeepSeek models
- Provides inference-only capabilities for DeepSeek MoE with training methods pending
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| modeliing/hf/modeling_openseek.py | Main modeling implementation with OpenSeek architecture, experimental modules (DMA, CDMoE, learnable residuals), and multiple attention mechanisms |
| modeliing/hf/modeling_deepseek.py | DeepSeek V3 model implementation focusing on MLA attention and MoE with inference-only support |
| modeliing/hf/configuration_openseek.py | Configuration class for OpenSeek models with extensive hyperparameter options |
| modeliing/hf/configuration_deepseek.py | Configuration class for DeepSeek V3 models with MoE-specific parameters |
```diff
@@ -0,0 +1,1560 @@
+# coding=utf-8
```
Copilot AI (Aug 21, 2025)
The directory name 'modeliing' contains a typo. It should be 'modeling' (single 'i').
```python
            rate_value = torch.kthvalue(attn_mask, num_dynamic_mask, dim=-1, keepdim=True).values
            attn_mask = attn_mask.masked_fill(attn_mask < rate_value, min_type)
        else:
            ValueError("`dynamic_mask_ratio` should be in the range (0.0, 1.0)")
```
Copilot AI (Aug 21, 2025)
Missing `raise` keyword before `ValueError`. This will not raise an exception but instead create an unused `ValueError` object.

```diff
-            ValueError("`dynamic_mask_ratio` should be in the range (0.0, 1.0)")
+            raise ValueError("`dynamic_mask_ratio` should be in the range (0.0, 1.0)")
```
```python
        topk_idx, topk_weight = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
        # TODO: Deepseek is not open-source training method, we need to implement the training method
```
Copilot AI (Aug 21, 2025)
[nitpick] The comment contains unclear language. Consider revising to "TODO: DeepSeek's training method is not open-source, we need to implement the training method" for better clarity.

```diff
-        # TODO: Deepseek is not open-source training method, we need to implement the training method
+        # TODO: DeepSeek's training method is not open-source, we need to implement the training method
```
```python
        # TODO: Deepseek is not open-source training method, we need to implement the training method
        if not self.training:
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.n_shared_experts is not None:
```
Copilot AI (Aug 21, 2025)
The condition checks whether `n_shared_experts` is not None, but it should check whether it is greater than 0, since it is an integer count. The variable is defined as `self.n_shared_experts = self.num_experts - self.n_routed_experts`, which will never be None.

```diff
-        if self.n_shared_experts is not None:
+        if self.n_shared_experts > 0:
```
```diff
@@ -0,0 +1,1044 @@
+# coding=utf-8
```
Copilot AI (Aug 21, 2025)
The directory name 'modeliing' contains a typo. It should be 'modeling' (single 'i').
```diff
@@ -0,0 +1,182 @@
+# coding=utf-8
```
Copilot AI (Aug 21, 2025)
The directory name 'modeliing' contains a typo. It should be 'modeling' (single 'i').
```diff
@@ -0,0 +1,226 @@
+"""DeepSeekV3 model configuration"""
```
Copilot AI (Aug 21, 2025)
The directory name 'modeliing' contains a typo. It should be 'modeling' (single 'i').