Suggestion Description
The current DeepSeek V3.2 deployment is inefficient with either TP or DP alone. Based on DeepSeek V3.2's architecture, we found that deploying this model with sequence parallelism plus specialized metadata preparation is much more efficient than other parallel strategies.
Old Impl
Problems with TP:
- Redundant computation in the indexer path, where wq_b and w_k are data parallel (replicated on every rank)
- MLA runs with num_head 16, and each token requires a separate cache computation; compute intensity is quite low during prefill, especially for long prompts
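As a rough illustration of the indexer redundancy under TP, here is a back-of-envelope sketch. The dimensions and the TP degree are illustrative assumptions, not the model's exact sizes: since wq_b and w_k are replicated, every rank repeats the full projection.

```python
# Rough arithmetic sketch of redundant indexer compute under TP.
# tp degree and projection shapes are illustrative assumptions.
tp = 8                      # tensor-parallel degree (assumed)
tokens = 4096               # prefill tokens (assumed)
d_in, d_out = 1536, 2048    # assumed indexer projection shapes

flops_per_rank = 2 * tokens * d_in * d_out   # each rank does the full projection
wasted = (tp - 1) * flops_per_rank           # replicated work on the other ranks

print(f"wasted fraction: {wasted / (tp * flops_per_rank):.2%}")  # wasted fraction: 87.50%
```

With the replicated layout, (tp - 1)/tp of the indexer FLOPs are duplicated work.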
Problems with DP:
- Risk of high latency
- Potential load imbalance across requests
New proposal
Sequence parallelism over the token dimension for both prefill and decode, applied within the tensor-parallel group. Weights are not sharded along the tensor-parallel dimension. An all-gather over the kv entries is performed before the kv cache update step.
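A minimal single-process simulation of the idea: shard the flattened token stream across the TP group, let each rank compute kv for its own shard with full heads, then all-gather the shards before the cache update. `shard_tokens` and the stand-in kv computation are hypothetical names for illustration, not the actual implementation.

```python
# Single-process sketch of sequence parallelism over the token dimension.
# Each "rank" owns a contiguous token range; kv is all-gathered before the
# kv cache update. Names and the toy kv computation are assumptions.

def shard_tokens(num_tokens: int, world_size: int) -> list[range]:
    """Contiguous token ranges per rank; early ranks absorb the remainder."""
    base, rem = divmod(num_tokens, world_size)
    splits, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)
        splits.append(range(start, start + size))
        start += size
    return splits

def run_step(tokens: list[float], world_size: int) -> list[float]:
    shards = shard_tokens(len(tokens), world_size)
    # Each rank computes kv only for its own token shard (stand-in compute).
    per_rank_kv = [[tokens[i] * 2.0 for i in rng] for rng in shards]
    # All-gather: every rank reconstructs the full kv before the cache update.
    return [kv for rank_kv in per_rank_kv for kv in rank_kv]

full_kv = run_step([float(i) for i in range(10)], world_size=4)
print(full_kv)  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
```

Because no weight is sharded, each rank runs the full attention heads over 1/world_size of the tokens, and only the compact kv entries cross the wire.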
Pros:
- Full heads on the MLA part, high TFLOPS
- No redundant computation in the indexer or MLA part
- The gathered kv cache is quite small, so the all-gather is fast and can also be overlapped using two streams
- Communication volume is cut to half of hidden_states
- Supports both piecewise and full-graph modes
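The "cut to half" claim can be sanity-checked with a back-of-envelope model, assuming ring-style collectives where an all-reduce moves roughly 2*(N - N/P) elements per rank while an all-gather moves (N - N/P). The sizes below are illustrative assumptions.

```python
# Back-of-envelope per-rank traffic: all-gather (this proposal) vs the
# all-reduce of hidden_states that TP would need. Ring-collective cost
# model and tensor sizes are assumptions for illustration.
def ring_all_reduce_bytes(n_elems: int, world: int, elem_bytes: int = 2) -> float:
    return 2 * (n_elems - n_elems / world) * elem_bytes

def ring_all_gather_bytes(n_elems: int, world: int, elem_bytes: int = 2) -> float:
    return (n_elems - n_elems / world) * elem_bytes

tokens, hidden, tp = 4096, 7168, 8   # assumed sizes
n = tokens * hidden
ratio = ring_all_gather_bytes(n, tp) / ring_all_reduce_bytes(n, tp)
print(f"all-gather / all-reduce per-rank traffic: {ratio:.2f}")  # 0.50
```

Under this cost model the per-rank traffic halves even before accounting for the kv entries being far smaller than hidden_states.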
Cons:
- Complicated metadata calculation
- Potential execution-time divergence across ranks, especially in piecewise graph mode
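To make the metadata cost concrete, here is a sketch of the kind of bookkeeping the proposal would require: after splitting the flattened token stream across ranks, each rank must recompute which requests, and how many tokens of each, land in its shard (e.g. to rebuild cu_seqlens-style offsets for attention). The helper name and shapes are hypothetical.

```python
# Sketch of per-rank metadata after sequence-parallel token splitting.
# Given variable-length requests, each rank recovers the per-request token
# counts inside its own contiguous shard. Names are illustrative assumptions.
def per_rank_seqlens(seq_lens: list[int], world: int) -> list[list[int]]:
    total = sum(seq_lens)
    base, rem = divmod(total, world)
    bounds, start = [], 0
    for r in range(world):
        size = base + (1 if r < rem else 0)
        bounds.append((start, start + size))
        start += size
    # token index -> request id over the flattened stream
    owner = [i for i, n in enumerate(seq_lens) for _ in range(n)]
    result = []
    for lo, hi in bounds:
        counts: dict[int, int] = {}
        for t in range(lo, hi):
            counts[owner[t]] = counts.get(owner[t], 0) + 1
        result.append([counts[k] for k in sorted(counts)])
    return result

# Three requests of 5, 3, and 8 tokens split across 4 ranks: a request can
# straddle a rank boundary, which is what makes the metadata non-trivial.
print(per_rank_seqlens([5, 3, 8], world=4))  # [[4], [1, 3], [4], [4]]
```

Note how request 0 straddles ranks 0 and 1: every such split must be tracked per step, which is the complexity cost the cons list refers to.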