
[RFC]: Parallel strategy for Deepseek v3.2 like models #196

@ganyi1996ppo

Description


Suggestion Description

DeepSeek v3.2 is currently inefficient with either TP or DP alone. Based on DeepSeek v3.2's architecture, we found that deploying this model with sequence parallelism, together with specialized metadata preparation, is much more efficient than other parallel strategies.

Old Impl

Image

Problems with TP:

  • Redundant calculation in the indexer path, where wq_b and w_k are data parallel
  • MLA runs with num_head = 16 and each token requires a separate cache calculation; compute intensity is quite low during prefill, especially for long prompts

Problems with DP:

  • Risk of high latency
  • Potential load imbalance across requests

New proposal

Sequence parallelism over the token dim for both prefill and decode, applied across the tensor parallel group. Weights are not sharded along the tensor parallel dim. The cache is all-gathered before the kv cache update step.

Image
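The token-dim sharding described above can be sketched as follows. This is a minimal single-process simulation, not the real distributed implementation: each "rank" holds the full (unsharded) projection weight, computes only its contiguous token slice, and the concatenation stands in for the all-gather over the token dim. All names (`split_tokens`, `sp_forward`, `w_kv`) are illustrative.

```python
import numpy as np

def split_tokens(num_tokens: int, tp_size: int):
    """Contiguous token shards per rank; remainder goes to the first ranks."""
    base, rem = divmod(num_tokens, tp_size)
    sizes = [base + (1 if r < rem else 0) for r in range(tp_size)]
    starts = np.cumsum([0] + sizes[:-1])
    return [(int(s), int(s + n)) for s, n in zip(starts, sizes)]

def sp_forward(hidden: np.ndarray, w_kv: np.ndarray, tp_size: int):
    """Each rank projects only its token slice with the FULL weight;
    concatenation over the token dim simulates the all-gather."""
    shards = split_tokens(hidden.shape[0], tp_size)
    local_kv = [hidden[s:e] @ w_kv for s, e in shards]  # per-rank compute
    return np.concatenate(local_kv, axis=0)             # "all_gather" step

rng = np.random.default_rng(0)
hidden = rng.standard_normal((10, 8))
w_kv = rng.standard_normal((8, 4))
# Sharded compute + gather matches the unsharded projection.
assert np.allclose(sp_forward(hidden, w_kv, 4), hidden @ w_kv)
```

In a real deployment the gather would be a `torch.distributed.all_gather` over the TP group before the kv cache update, as the proposal states.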

Pros:

  • Full heads on the MLA part, high TFLOPS
  • No redundant calculation in the indexer or MLA part
  • The gathered kv cache is quite small and fast to communicate, and the transfer can also be overlapped using two streams
  • Communication volume cut to half of hidden_states
  • Supports both piecewise and full graph modes
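One way to read the halved-communication point: with standard ring-collective costs, an all-reduce of hidden_states moves roughly twice the data per rank that a single all-gather of the same tensor does, so replacing the TP all-reduce with an all-gather halves the traffic. A back-of-envelope sketch; `p` and `n` are illustrative, not measured:

```python
# Standard ring-collective per-rank traffic for a tensor of n elements
# across p ranks (see e.g. the NCCL ring cost model):
#   all-reduce:  2 * (p - 1) / p * n   (reduce-scatter + all-gather phases)
#   all-gather:      (p - 1) / p * n
def ring_allreduce_volume(n_elems: int, p: int) -> float:
    return 2 * (p - 1) / p * n_elems

def ring_allgather_volume(n_elems: int, p: int) -> float:
    return (p - 1) / p * n_elems

p, n = 8, 7168  # illustrative: 8-way group, DeepSeek-like hidden size
# An all-gather moves exactly half of what an all-reduce moves per rank.
assert ring_allgather_volume(n, p) == ring_allreduce_volume(n, p) / 2
```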

Cons:

  • Complicated metadata calculation
  • Potential execution-time divergence across ranks, especially with piecewise graphs
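To illustrate why the metadata calculation gets complicated: once the token dim is split across ranks, a single request's tokens can straddle a shard boundary, so each rank must derive its own local cumulative sequence lengths from the global ones. A hypothetical sketch (the helper name and exact metadata layout are assumptions, not the actual implementation):

```python
def local_cu_seqlens(global_cu, start: int, end: int):
    """Clip global cumulative sequence lengths to one rank's token
    slice [start, end), producing that rank's local cu_seqlens."""
    local = [0]
    for boundary in global_cu[1:]:
        clipped = min(max(boundary, start), end) - start
        if clipped > local[-1]:
            local.append(clipped)
    return local

# 3 requests of lengths 5, 4, 3 -> global cu_seqlens [0, 5, 9, 12].
# A rank owning tokens [4, 10) sees 1 token of req0, all 4 of req1,
# and 1 token of req2 -> local cu_seqlens [0, 1, 5, 6].
assert local_cu_seqlens([0, 5, 9, 12], 4, 10) == [0, 1, 5, 6]
```

Every attention-metadata structure (slot mappings, block tables, etc.) needs a similar per-rank re-derivation, which is the main source of complexity here.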

Operating System

No response

GPU

No response

ROCm Component

No response
