Suggestion Description
Hello, I've briefly compared deepseek-ai/DeepEP and ROCm/DeepEP and have the following questions:
- deepseek-ai/DeepEP implements its own low-latency IBGDA path, which avoids polling the CQ when issuing WQEs. The maintainers address this in deepseek-ai/DeepEP#180 ("about ibgda_reserve_wqe_slots"). ROCm/DeepEP, however, directly calls the API provided by rocSHMEM. Each time a WQE is issued, it checks for available space in the queue; if there is none, it polls the CQ. This is essentially the same as the IBGDA path implemented by NVSHMEM, and could lead to higher latency in low-latency mode.
- ROCm/DeepEP calls a warp-level interface similar to put_nbi_warp. In rocSHMEM, only one thread in the warp actually issues WQEs, while in deepseek-ai/DeepEP, all threads in the warp participate. Wouldn't this affect performance?
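To make the first question concrete, here is a minimal host-side sketch of the two slot-reservation strategies. It is a toy model and not code from rocSHMEM, NVSHMEM, or either DeepEP repository: `ToyQueue`, `poll_cq`, `reserve_with_check`, `reserve_unchecked`, and the `batch` completion rate are all hypothetical names and assumptions introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a send queue (ring of WQE slots). All names are hypothetical.
struct ToyQueue {
    uint64_t depth;          // number of WQE slots in the ring
    uint64_t produced = 0;   // WQEs reserved so far
    uint64_t completed = 0;  // WQEs the simulated NIC has retired
    uint64_t cq_polls = 0;   // CQ polls performed on the issue path
    explicit ToyQueue(uint64_t d) : depth(d) {}
};

// Simulated CQ poll: assume each poll retires up to `batch` outstanding WQEs.
void poll_cq(ToyQueue& q, uint64_t batch) {
    ++q.cq_polls;
    uint64_t outstanding = q.produced - q.completed;
    q.completed += (outstanding < batch) ? outstanding : batch;
}

// Strategy A: check for free space on every issue and poll the CQ
// while the ring is full.
uint64_t reserve_with_check(ToyQueue& q, uint64_t batch) {
    while (q.produced - q.completed >= q.depth) {
        poll_cq(q, batch);
    }
    return q.produced++ % q.depth;
}

// Strategy B: reserve unconditionally. This is only safe if the caller
// guarantees the ring never overflows (assumption, enforced here by assert).
uint64_t reserve_unchecked(ToyQueue& q) {
    assert(q.produced - q.completed < q.depth);
    return q.produced++ % q.depth;
}

// Issue `n` WQEs with strategy A and report how many CQ polls were needed.
uint64_t count_polls_checked(uint64_t depth, uint64_t n, uint64_t batch) {
    ToyQueue q(depth);
    for (uint64_t i = 0; i < n; ++i) reserve_with_check(q, batch);
    return q.cq_polls;
}

// Issue `n` WQEs with strategy B; the issue path never touches the CQ.
uint64_t count_polls_unchecked(uint64_t depth, uint64_t n) {
    ToyQueue q(depth);
    for (uint64_t i = 0; i < n; ++i) reserve_unchecked(q);
    return q.cq_polls;
}
```

In this model, strategy A pays for a CQ poll on the issue path whenever the ring is full, while strategy B never does but depends on the queue being sized for the worst-case number of outstanding WQEs. Whether that trade-off matters on real hardware is exactly what this issue is asking.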
Operating System
No response
GPU
No response
ROCm Component
No response