Suggestion Description
The current DeepSeek V3.2 deployment is inefficient with either TP or DP alone. Based on DeepSeek V3.2's architecture, we found that deploying this model with sequence parallelism plus specialized metadata preparation is much more efficient than other parallel strategies.
Old Impl
Problems with TP:
- Redundant computation in the indexer path, where wq_b and w_k are data parallel (replicated on every rank)
- MLA runs with num_head 16, and each token requires a separate cache computation; compute intensity is quite low during prefill, especially for long prompts
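As a rough illustration of the indexer redundancy under TP, here is a back-of-envelope sketch. The dimensions and the TP degree are illustrative assumptions, not the model's exact sizes: since wq_b and w_k are replicated, every rank repeats the full projection.

```python
# Rough arithmetic sketch of redundant indexer compute under TP.
# tp degree and projection shapes are illustrative assumptions.
tp = 8                      # tensor-parallel degree (assumed)
tokens = 4096               # prefill tokens (assumed)
d_in, d_out = 1536, 2048    # assumed indexer projection shapes

flops_per_rank = 2 * tokens * d_in * d_out   # each rank does the full projection
wasted = (tp - 1) * flops_per_rank           # replicated work on the other ranks

print(f"wasted fraction: {wasted / (tp * flops_per_rank):.2%}")  # wasted fraction: 87.50%
```

With the replicated layout, (tp - 1)/tp of the indexer FLOPs are duplicated work.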
Problems with DP:
- Risk of high latency
- Potential load imbalance across requests
New proposal
Sequence parallelism over the token dimension for both prefill and decode, applied within the tensor-parallel group. Weights are not sharded along the tensor-parallel dimension. An all-gather over the kv entries is performed before the kv cache update step.
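A minimal single-process simulation of the idea: shard the flattened token stream across the TP group, let each rank compute kv for its own shard with full heads, then all-gather the shards before the cache update. `shard_tokens` and the stand-in kv computation are hypothetical names for illustration, not the actual implementation.

```python
# Single-process sketch of sequence parallelism over the token dimension.
# Each "rank" owns a contiguous token range; kv is all-gathered before the
# kv cache update. Names and the toy kv computation are assumptions.

def shard_tokens(num_tokens: int, world_size: int) -> list[range]:
    """Contiguous token ranges per rank; early ranks absorb the remainder."""
    base, rem = divmod(num_tokens, world_size)
    splits, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)
        splits.append(range(start, start + size))
        start += size
    return splits

def run_step(tokens: list[float], world_size: int) -> list[float]:
    shards = shard_tokens(len(tokens), world_size)
    # Each rank computes kv only for its own token shard (stand-in compute).
    per_rank_kv = [[tokens[i] * 2.0 for i in rng] for rng in shards]
    # All-gather: every rank reconstructs the full kv before the cache update.
    return [kv for rank_kv in per_rank_kv for kv in rank_kv]

full_kv = run_step([float(i) for i in range(10)], world_size=4)
print(full_kv)  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
```

Because no weight is sharded, each rank runs the full attention heads over 1/world_size of the tokens, and only the compact kv entries cross the wire.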
Pros:
- Full heads on the MLA part, high TFLOPS
- No redundant computation in the indexer or MLA part
- The gathered kv cache is quite small, so the all-gather is fast and can also be overlapped using two streams
- Communication volume is cut to half of hidden_states
- Supports both piecewise and full-graph modes
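The "cut to half" claim can be sanity-checked with a back-of-envelope model, assuming ring-style collectives where an all-reduce moves roughly 2*(N - N/P) elements per rank while an all-gather moves (N - N/P). The sizes below are illustrative assumptions.

```python
# Back-of-envelope per-rank traffic: all-gather (this proposal) vs the
# all-reduce of hidden_states that TP would need. Ring-collective cost
# model and tensor sizes are assumptions for illustration.
def ring_all_reduce_bytes(n_elems: int, world: int, elem_bytes: int = 2) -> float:
    return 2 * (n_elems - n_elems / world) * elem_bytes

def ring_all_gather_bytes(n_elems: int, world: int, elem_bytes: int = 2) -> float:
    return (n_elems - n_elems / world) * elem_bytes

tokens, hidden, tp = 4096, 7168, 8   # assumed sizes
n = tokens * hidden
ratio = ring_all_gather_bytes(n, tp) / ring_all_reduce_bytes(n, tp)
print(f"all-gather / all-reduce per-rank traffic: {ratio:.2f}")  # 0.50
```

Under this cost model the per-rank traffic halves even before accounting for the kv entries being far smaller than hidden_states.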
Cons:
- Complicated metadata calculation
- Potential execution-time divergence across ranks, especially in piecewise graph mode
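To make the metadata cost concrete, here is a sketch of the kind of bookkeeping the proposal would require: after splitting the flattened token stream across ranks, each rank must recompute which requests, and how many tokens of each, land in its shard (e.g. to rebuild cu_seqlens-style offsets for attention). The helper name and shapes are hypothetical.

```python
# Sketch of per-rank metadata after sequence-parallel token splitting.
# Given variable-length requests, each rank recovers the per-request token
# counts inside its own contiguous shard. Names are illustrative assumptions.
def per_rank_seqlens(seq_lens: list[int], world: int) -> list[list[int]]:
    total = sum(seq_lens)
    base, rem = divmod(total, world)
    bounds, start = [], 0
    for r in range(world):
        size = base + (1 if r < rem else 0)
        bounds.append((start, start + size))
        start += size
    # token index -> request id over the flattened stream
    owner = [i for i, n in enumerate(seq_lens) for _ in range(n)]
    result = []
    for lo, hi in bounds:
        counts: dict[int, int] = {}
        for t in range(lo, hi):
            counts[owner[t]] = counts.get(owner[t], 0) + 1
        result.append([counts[k] for k in sorted(counts)])
    return result

# Three requests of 5, 3, and 8 tokens split across 4 ranks: a request can
# straddle a rank boundary, which is what makes the metadata non-trivial.
print(per_rank_seqlens([5, 3, 8], world=4))  # [[4], [1, 3], [4], [4]]
```

Note how request 0 straddles ranks 0 and 1: every such split must be tracked per step, which is the complexity cost the cons list refers to.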