Add openseek modeling code #53
base: main
Conversation
Make the modeling code available to the Hugging Face community. It implements DeepSeek MLA and DeepSeek MoE, but the drawback is that the training forward of DeepSeek MoE has not been implemented yet. Some modules, including learnable residual #47, DMA sparse attention #48, and the CDMoE feedforward network #45, are added for experimentation.
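For context, the learnable residual idea referenced in #47 replaces the fixed skip connection `x + f(x)` with a gated one whose mixing coefficients are trained. A minimal sketch, assuming a per-dimension gate (the class name and initialization below are illustrative, not necessarily the exact formulation in the PR):

```python
import torch
import torch.nn as nn

class LearnableResidual(nn.Module):
    """Residual connection with trainable per-dimension gates (illustrative sketch)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Initialized to 1.0 so training starts from a standard identity residual.
        self.residual_weight = nn.Parameter(torch.ones(hidden_size))

    def forward(self, residual: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
        # Scale the skip path elementwise before adding the block output.
        return residual * self.residual_weight + hidden_states
```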
Given that the open-source DeepSeek V3 modeling code does not provide an MoE training forward, I provide trainable modeling code here, although, of course, optimization is still needed in terms of expert parallelism.
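To illustrate what a trainable MoE forward involves, here is a minimal dense-dispatch sketch in the style of earlier DeepSeek-MoE reference code; the function name, softmax scoring, and shapes are assumptions for the example, not the exact code in this PR:

```python
import torch
import torch.nn.functional as F

def moe_train_forward(x, experts, gate_weight, top_k=2):
    """Differentiable top-k MoE forward (illustrative sketch).

    x: [num_tokens, hidden]; experts: list of expert modules;
    gate_weight: [num_experts, hidden] router matrix.
    """
    scores = F.linear(x, gate_weight).softmax(dim=-1)      # [tokens, num_experts]
    topk_weight, topk_idx = scores.topk(top_k, dim=-1)     # [tokens, top_k]

    # Replicate each token once per selected expert, run the matching
    # expert on its slice, then recombine with the routing weights so
    # gradients flow through both the experts and the router.
    flat_idx = topk_idx.reshape(-1)                        # [tokens * top_k]
    x_rep = x.repeat_interleave(top_k, dim=0)              # [tokens * top_k, hidden]
    y = torch.empty_like(x_rep)
    for e, expert in enumerate(experts):
        mask = flat_idx == e
        if mask.any():
            y[mask] = expert(x_rep[mask])
    y = y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)
    return y.sum(dim=1)                                    # [num_tokens, hidden]
```

Unlike an inference path wrapped in `@torch.no_grad()`, nothing here detaches from the graph, which is what makes the forward usable for training; expert-parallel dispatch would still need the all-to-all machinery.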
@LoserCheems We have implemented the training code in the FlagScale repository, which supports various types of parallelism. FlagScale is our official framework for large-scale training. If you want to add any features, we can assist you directly.
Thank you @aoyulong, but I observed that the FlagScale implementation only includes the
@LoserCheems FlagScale can support all the modules you mentioned, such as MLA and DeepSeekMoE, and can fully pretrain DeepSeek-V3.
@aoyulong I'm sorry for my oversight; I did not fully understand the structure of Megatron in a short time. May I provide an improved implementation in PyTorch, and then ask for your help integrating the improvement into FlagScale?
@LoserCheems We will provide a contribution guide to help other contributors who have similar questions. This will also be updated in the README.
@OpenHands please give a summary.
I'm on it! ftgreat can track my progress at all-hands.dev |
**Summary of PR #53: "Add openseek modeling code"**

**Overview**
PR #53 by @LoserCheems adds comprehensive HuggingFace-compatible modeling code for OpenSeek, implementing the DeepSeek V3 architecture with several experimental enhancements. The PR introduces 4 new Python files totaling ~3,000 lines of code.

**What was Added**
The PR creates a new
**Key Features Implemented**
Core Architecture
Experimental Enhancements (Referenced Issues #45, #47, #48)
Technical Capabilities
**Context from Discussion**
The conversation in the PR reveals:
**Assessment**
This is a substantial contribution that fills a gap in the open-source ecosystem by providing trainable DeepSeek V3 modeling code with experimental enhancements. The code appears well-structured and follows HuggingFace conventions, making it accessible to the broader ML community that prefers PyTorch/HuggingFace over Megatron-based frameworks like FlagScale. The PR has been open since March 2025 with ongoing discussion about the best path forward for integration with the project's training infrastructure.
Thank you for tagging me. I have reviewed the summary provided by @OpenHands.
@gemini-code-assist please give a summary of this PR.
**Summary of Changes**
This pull request introduces comprehensive modeling code for Deepseek V3 and OpenSeek to the HuggingFace community. It enables the implementation and training of Deepseek MLA and Deepseek MoE, addressing the previous lack of a trainable forward pass for Deepseek MoE. Additionally, it incorporates several experimental modules, including learnable residual connections, dynamic mask attention, and a Cross Domain Mixture of Experts feedforward network, enhancing the model's capabilities and flexibility for research and development.

**Highlights**
**Changelog**

**Activity**
/gemini review |
Code Review
This pull request introduces modeling code for DeepSeek and OpenSeek models. The implementation includes several advanced features like MoE, MLA, and DMA. However, there are several critical issues that need to be addressed. The directory `modeliing` seems to be a typo and should likely be `modeling`. There are also several critical bugs, such as calls to undefined loss functions, incorrect use of attention function arguments, and unraised exceptions. Additionally, there is some confusing naming and dead code in `modeling_openseek.py` that should be cleaned up. I've left specific comments on these issues.
```python
        loss = None
        if labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
```
`self.loss_function` is called here but it's not defined within this class or its parent classes. This will result in an `AttributeError` at runtime when labels are provided. You should implement the loss calculation, typically using `torch.nn.CrossEntropyLoss` with appropriate shifting of logits and labels.
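For reference, the usual causal-LM loss shifts logits and labels by one position; a minimal sketch (a hypothetical helper, shown only to illustrate the shifting the review mentions):

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels, vocab_size):
    # Next-token prediction: position t predicts token t + 1, so drop
    # the last logit and the first label before flattening.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, vocab_size),
        shift_labels.view(-1),
        ignore_index=-100,  # convention for positions excluded from the loss
    )
```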
```python
            dropout_p=self.attention_dropout,
            scale=self.scaling,
            is_causal=self.is_causal,
            enable_gqa=True,
```
```python
            rate_value = torch.kthvalue(attn_mask, num_dynamic_mask, dim=-1, keepdim=True).values
            attn_mask = attn_mask.masked_fill(attn_mask < rate_value, min_type)
        else:
            ValueError("`dynamic_mask_ratio` should be in the range (0.0, 1.0)")
```
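For intuition, `torch.kthvalue` returns the k-th smallest score per row, and everything strictly below that threshold is filled with the dtype minimum. A standalone toy example with invented numbers (assuming `num_dynamic_mask` is derived from the ratio):

```python
import torch

attn_mask = torch.tensor([[0.9, 0.1, 0.5, 0.3]])
dynamic_mask_ratio = 0.5
num_dynamic_mask = int(attn_mask.shape[-1] * dynamic_mask_ratio)  # k = 2
min_type = torch.finfo(attn_mask.dtype).min

# The 2nd-smallest value in the row is 0.3; only 0.1 falls strictly below it.
rate_value = torch.kthvalue(attn_mask, num_dynamic_mask, dim=-1, keepdim=True).values
attn_mask = attn_mask.masked_fill(attn_mask < rate_value, min_type)
# 0.1 is replaced by the float32 minimum; 0.9, 0.5, and 0.3 survive.
```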
```python
        loss = None
        if labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.vocab_size, **kwargs)
```
```python
        loss = None
        if labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, pooled_logits=pooled_logits, config=self.config)
```
```python
class DeepseekV3MLP(nn.Module):
    def __init__(self, config: OpenseekConfig, hidden_size=None, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
        self.intermediate_size = (
            config.intermediate_size if intermediate_size is None else intermediate_size
        )

        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj


class DeepseekV3MoEGate(nn.Module):
    def __init__(self, config: OpenseekConfig):
        super().__init__()
        self.config = config
        self.top_k = config.num_experts_per_tok
        self.n_routed_experts = config.num_routed_experts
        self.routed_scaling_factor = config.routed_scaling_factor
        self.scoring_func = config.scoring_func
        self.seq_aux = config.seq_aux
        self.topk_method = config.topk_method
        self.n_group = config.n_group
        self.topk_group = config.topk_group

        # topk selection algorithm
        self.norm_topk_prob = config.norm_topk_prob
        self.gating_dim = config.hidden_size
        self.weight = nn.Parameter(
            torch.empty((self.n_routed_experts, self.gating_dim))
        )
        if self.topk_method == "noaux_tc":
            self.e_score_correction_bias = nn.Parameter(
                torch.empty((self.n_routed_experts))
            )
        self.reset_parameters()

    def reset_parameters(self) -> None:
        import torch.nn.init as init

        init.kaiming_uniform_(self.weight, a=math.sqrt(5))

    def forward(self, hidden_states):
        bsz, seq_len, h = hidden_states.shape
        ### compute gating score
        hidden_states = hidden_states.view(-1, h)
        logits = F.linear(
            hidden_states.type(torch.float32), self.weight.type(torch.float32), None
        )
        if self.scoring_func == "sigmoid":
            scores = logits.sigmoid()
        else:
            raise NotImplementedError(
                f"insupportable scoring function for MoE gating: {self.scoring_func}"
            )

        ### select top-k experts
        if self.topk_method == "noaux_tc":
            assert not self.training
            scores_for_choice = scores.view(bsz * seq_len, -1) + self.e_score_correction_bias.unsqueeze(0)
            group_scores = (
                scores_for_choice.view(bsz * seq_len, self.n_group, -1).topk(2, dim=-1)[0].sum(dim=-1)
            )  # [n, n_group]
            group_idx = torch.topk(
                group_scores, k=self.topk_group, dim=-1, sorted=False
            )[
                1
            ]  # [n, top_k_group]
            group_mask = torch.zeros_like(group_scores)  # [n, n_group]
            group_mask.scatter_(1, group_idx, 1)  # [n, n_group]
            score_mask = (
                group_mask.unsqueeze(-1)
                .expand(
                    bsz * seq_len, self.n_group, self.n_routed_experts // self.n_group
                )
                .reshape(bsz * seq_len, -1)
            )  # [n, e]
            tmp_scores = scores_for_choice.masked_fill(~score_mask.bool(), 0.0)  # [n, e]
            _, topk_idx = torch.topk(
                tmp_scores, k=self.top_k, dim=-1, sorted=False
            )
            topk_weight = scores.gather(1, topk_idx)
        else:
            raise NotImplementedError(
                f"insupportable TopK function for MoE gating: {self.topk_method}"
            )

        ### norm gate to sum 1
        if self.top_k > 1 and self.norm_topk_prob:
            denominator = topk_weight.sum(dim=-1, keepdim=True) + 1e-20
            topk_weight = topk_weight / denominator
        topk_weight = topk_weight * self.routed_scaling_factor  # must multiply the scaling factor

        return topk_idx, topk_weight


class DeepseekV3MoE(nn.Module):
    """
    A mixed expert module containing shared experts.
    """

    def __init__(self, config: OpenseekConfig):
        super().__init__()
        self.config = config
        self.num_experts_per_tok = config.num_experts_per_tok
        self.num_experts = config.num_experts
        self.n_routed_experts = config.num_routed_experts
        self.n_shared_experts = self.num_experts - self.n_routed_experts

        if hasattr(config, "ep_size") and config.ep_size > 1:
            assert config.ep_size == dist.get_world_size()
            self.ep_size = config.ep_size
            self.experts_per_rank = self.n_routed_experts // config.ep_size
            self.ep_rank = dist.get_rank()
            self.experts = nn.ModuleList(
                [
                    (
                        DeepseekV3MLP(
                            config, intermediate_size=config.moe_intermediate_size
                        )
                        if i >= self.ep_rank * self.experts_per_rank
                        and i < (self.ep_rank + 1) * self.experts_per_rank
                        else None
                    )
                    for i in range(self.n_routed_experts)
                ]
            )
        else:
            self.ep_size = 1
            self.experts_per_rank = self.n_routed_experts
            self.ep_rank = 0
            self.experts = nn.ModuleList(
                [
                    DeepseekV3MLP(
                        config, intermediate_size=config.moe_intermediate_size
                    )
                    for i in range(self.n_routed_experts)
                ]
            )
        self.gate = DeepseekV3MoEGate(config)
        if self.n_shared_experts > 0:
            intermediate_size = config.moe_intermediate_size * self.n_shared_experts
            self.shared_experts = DeepseekV3MLP(config, intermediate_size=intermediate_size)

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        topk_idx, topk_weight = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
        # TODO: Deepseek is not open-source training method, we need to implement the training method
        if not self.training:
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.n_shared_experts is not None:
            y = y + self.shared_experts(identity)
        return y

    @torch.no_grad()
    def moe_infer(self, x, topk_ids, topk_weight):
        cnts = topk_ids.new_zeros((topk_ids.shape[0], len(self.experts)))
        cnts.scatter_(1, topk_ids, 1)
        tokens_per_expert = cnts.sum(dim=0)
        idxs = topk_ids.view(-1).argsort()
        sorted_tokens = x[idxs // topk_ids.shape[1]]
        sorted_tokens_shape = sorted_tokens.shape
        if self.ep_size > 1:
            tokens_per_ep_rank = tokens_per_expert.view(self.ep_size, -1).sum(dim=1)
            tokens_per_expert_group = tokens_per_expert.new_empty(
                tokens_per_expert.shape[0]
            )
            dist.all_to_all_single(tokens_per_expert_group, tokens_per_expert)
            output_splits = (
                tokens_per_expert_group.view(self.ep_size, -1)
                .sum(1)
                .cpu()
                .numpy()
                .tolist()
            )
            gathered_tokens = sorted_tokens.new_empty(
                tokens_per_expert_group.sum(dim=0).cpu().item(), sorted_tokens.shape[1]
            )
            input_split_sizes = tokens_per_ep_rank.cpu().numpy().tolist()
            dist.all_to_all(
                list(gathered_tokens.split(output_splits)),
                list(sorted_tokens.split(input_split_sizes)),
            )
            tokens_per_expert_post_gather = tokens_per_expert_group.view(
                self.ep_size, self.experts_per_rank
            ).sum(dim=0)
            gatherd_idxs = np.zeros(shape=(gathered_tokens.shape[0],), dtype=np.int32)
            s = 0
            for i, k in enumerate(tokens_per_expert_group.cpu().numpy()):
                gatherd_idxs[s : s + k] = i % self.experts_per_rank
                s += k
            gatherd_idxs = gatherd_idxs.argsort()
            sorted_tokens = gathered_tokens[gatherd_idxs]
            tokens_per_expert = tokens_per_expert_post_gather
        tokens_per_expert = tokens_per_expert.cpu().numpy()

        outputs = []
        start_idx = 0
        for i, num_tokens in enumerate(tokens_per_expert):
            end_idx = start_idx + num_tokens
            if num_tokens == 0:
                continue
            expert = self.experts[i + self.ep_rank * self.experts_per_rank]
            tokens_for_this_expert = sorted_tokens[start_idx:end_idx]
            expert_out = expert(tokens_for_this_expert)
            outputs.append(expert_out)
            start_idx = end_idx

        outs = torch.cat(outputs, dim=0) if len(outputs) else sorted_tokens.new_empty(0)
        if self.ep_size > 1:
            new_x = torch.empty_like(outs)
            new_x[gatherd_idxs] = outs
            gathered_tokens = new_x.new_empty(*sorted_tokens_shape)
            dist.all_to_all(
                list(gathered_tokens.split(input_split_sizes)),
                list(new_x.split(output_splits)),
            )
            outs = gathered_tokens

        new_x = torch.empty_like(outs)
        new_x[idxs] = outs
        final_out = (
            new_x.view(*topk_ids.shape, -1)
            .type(topk_weight.dtype)
            .mul_(topk_weight.unsqueeze(dim=-1))
            .sum(dim=1)
            .type(new_x.dtype)
        )
        return final_out
# ====================
```
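To make the `noaux_tc` group-limited routing in `DeepseekV3MoEGate.forward` concrete, here is a standalone toy walk-through with invented numbers (one token, 4 experts in 2 groups, keeping 1 group and 2 experts):

```python
import torch

scores = torch.tensor([[0.1, 0.8, 0.3, 0.2]])  # [n_tokens=1, n_experts=4]
n_group, topk_group, top_k = 2, 1, 2

# Score each group by the sum of its two best experts: group 0 -> 0.9, group 1 -> 0.5.
group_scores = scores.view(1, n_group, -1).topk(2, dim=-1)[0].sum(dim=-1)
group_idx = torch.topk(group_scores, k=topk_group, dim=-1)[1]  # group 0 wins

# Zero out every expert outside the winning group, then take the per-token top-k.
group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1)
score_mask = group_mask.unsqueeze(-1).expand(1, n_group, 2).reshape(1, -1)
tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)  # [[0.1, 0.8, 0.0, 0.0]]
topk_weight, topk_idx = torch.topk(tmp_scores, k=top_k, dim=-1)
# topk_idx == [[1, 0]]: both routed experts come from group 0; group 1 is never used.
```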
```python
        self.expert_retrieval_dim = config.expert_retrieval_size
        self.num_experts = config.num_experts
        self.top_k = config.num_experts_per_tok
        self.num_keys = int(math.sqrt(self.num_experts))
```
The number of keys is calculated as `int(math.sqrt(self.num_experts))`. If `self.num_experts` is not a perfect square, this will truncate the result, and some experts will never be used. For the default `num_experts=2048`, sqrt(2048) is ~45.25, which becomes 45. This means only 45*45=2025 experts are addressable. Please ensure this is the intended behavior or add a check to ensure `num_experts` is a perfect square.
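A guard along the lines the review suggests could look like this; a sketch meant for the `__init__` shown above, with an illustrative error message:

```python
import math

num_keys = int(math.sqrt(self.num_experts))
if num_keys * num_keys != self.num_experts:
    raise ValueError(
        f"num_experts ({self.num_experts}) must be a perfect square; "
        f"only {num_keys * num_keys} experts would be addressable."
    )
self.num_keys = num_keys
```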
```python
            position_embeddings=position_embeddings,
            **kwargs,
        )
        self_attn_weights = None
```
```python
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.max_seq_len_cached:  # growth
            inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)
            self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: may break with compilation
```
```python
        past_key_value: Optional[Cache] = None,
        cache_position: Optional[torch.LongTensor] = None,
        **kwargs: Unpack[FlashAttentionKwargs],
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
```
The function signature indicates a return type of `Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]`, but the function returns only two values (`attn_output`, `attn_weights`). The `past_key_value` cache is updated in place and not returned. Please correct the return type annotation to match the implementation, which should be `Tuple[torch.Tensor, Optional[torch.Tensor]]`.
```diff
-    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
```
Pull Request Overview
This PR adds comprehensive OpenSeek modeling code to provide Hugging Face community access to DeepSeek-style architectures. The implementation includes experimental features like Multi-Head Latent Attention (MLA), Mixture of Experts (MoE), Dynamic Mask Attention (DMA), learnable residuals, and Cross Domain MoE (CDMoE). Notably, the training forward pass for DeepSeek MoE is not yet implemented, limiting its use to inference only.
- Implements DeepSeek MLA and MoE architectures with experimental enhancements
- Adds configuration classes for both OpenSeek and DeepSeek models
- Provides inference-only capabilities for DeepSeek MoE with training methods pending
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| modeliing/hf/modeling_openseek.py | Main modeling implementation with OpenSeek architecture, experimental modules (DMA, CDMoE, learnable residuals), and multiple attention mechanisms |
| modeliing/hf/modeling_deepseek.py | DeepSeek V3 model implementation focusing on MLA attention and MoE with inference-only support |
| modeliing/hf/configuration_openseek.py | Configuration class for OpenSeek models with extensive hyperparameter options |
| modeliing/hf/configuration_deepseek.py | Configuration class for DeepSeek V3 models with MoE-specific parameters |
```diff
@@ -0,0 +1,1560 @@
+# coding=utf-8
```
Copilot AI (Aug 21, 2025)
The directory name 'modeliing' contains a typo. It should be 'modeling' (single 'i').
```python
            rate_value = torch.kthvalue(attn_mask, num_dynamic_mask, dim=-1, keepdim=True).values
            attn_mask = attn_mask.masked_fill(attn_mask < rate_value, min_type)
        else:
            ValueError("`dynamic_mask_ratio` should be in the range (0.0, 1.0)")
```
Copilot AI (Aug 21, 2025)
Missing `raise` keyword before `ValueError`. This will not raise an exception but instead create an unused `ValueError` object.

```diff
-            ValueError("`dynamic_mask_ratio` should be in the range (0.0, 1.0)")
+            raise ValueError("`dynamic_mask_ratio` should be in the range (0.0, 1.0)")
```
```python
        topk_idx, topk_weight = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
        # TODO: Deepseek is not open-source training method, we need to implement the training method
```
Copilot AI (Aug 21, 2025)
[nitpick] The comment contains unclear language. Consider revising to "TODO: DeepSeek's training method is not open-source, we need to implement the training method" for better clarity.

```diff
-        # TODO: Deepseek is not open-source training method, we need to implement the training method
+        # TODO: DeepSeek's training method is not open-source, we need to implement the training method
```
```python
        # TODO: Deepseek is not open-source training method, we need to implement the training method
        if not self.training:
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.n_shared_experts is not None:
```
Copilot AI (Aug 21, 2025)
The condition checks whether `n_shared_experts` is not None, but it should check whether it is greater than 0, since it is an integer count. The variable is defined as `self.n_shared_experts = self.num_experts - self.n_routed_experts`, which will never be None.

```diff
-        if self.n_shared_experts is not None:
+        if self.n_shared_experts > 0:
```
```diff
@@ -0,0 +1,1044 @@
+# coding=utf-8
```
Copilot AI (Aug 21, 2025)
The directory name 'modeliing' contains a typo. It should be 'modeling' (single 'i').
```diff
@@ -0,0 +1,182 @@
+# coding=utf-8
```
Copilot AI (Aug 21, 2025)
The directory name 'modeliing' contains a typo. It should be 'modeling' (single 'i').
```diff
@@ -0,0 +1,226 @@
+"""DeepSeekV3 model configuration"""
```
Copilot AI (Aug 21, 2025)
The directory name 'modeliing' contains a typo. It should be 'modeling' (single 'i').