[Feature]: Suggestions Regarding IBGDA Performance #13

@ywj55555

Suggestion Description

Hello, I've briefly compared deepseek-ai/DeepEP and ROCm/DeepEP and have the following questions:

  1. deepseek-ai/DeepEP implements its own low-latency IBGDA path, which avoids polling the CQ when issuing WQEs. The maintainers' response on this is in "about ibgda_reserve_wqe_slots" (deepseek-ai/DeepEP#180). ROCm/DeepEP, however, directly calls the API provided by rocshmem. Each time a WQE is issued, it checks for available space and, if there is none, polls the CQ. This is essentially the same as the IBGDA path implemented by nvshmem, and it could lead to higher latency in low-latency mode.
  2. ROCm/DeepEP calls a warp-level interface similar to put_nbi_warp. In rocshmem, only one thread in the warp actually issues WQEs, while in deepseek-ai/DeepEP all threads in the warp participate. Wouldn't this affect performance?
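
To make question 1 concrete, here is a minimal host-side C++ sketch of the two reservation strategies as I understand them. It is only a model: `SendQueueModel`, `poll_cq`, `reserve_without_poll`, and `reserve_with_poll` are hypothetical names, not symbols from DeepEP, nvshmem, or rocshmem, and the "CQ poll" is simulated by a counter rather than a real completion queue read.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side model of a send queue. All names are hypothetical and do not
// correspond to DeepEP, nvshmem, or rocshmem symbols.
struct SendQueueModel {
    std::atomic<uint64_t> resv_head{0};  // next WQE index to hand out
    uint64_t completed = 0;              // WQEs known to be finished (learned from the CQ)
    uint64_t depth = 0;                  // number of WQE slots in the ring
    uint64_t cq_polls = 0;               // how many times the CQ was polled
};

// Stand-in for reading one completion queue entry: each poll retires one WQE.
inline void poll_cq(SendQueueModel& q) {
    ++q.cq_polls;
    ++q.completed;
}

// Strategy A: reserve slots with a single atomic add and never touch the CQ.
// Assumes the caller guarantees by other means that the ring cannot overflow.
inline uint64_t reserve_without_poll(SendQueueModel& q, uint64_t num_wqes) {
    return q.resv_head.fetch_add(num_wqes);
}

// Strategy B: reserve slots, then poll the CQ until the reserved range
// fits inside the ring.
inline uint64_t reserve_with_poll(SendQueueModel& q, uint64_t num_wqes) {
    const uint64_t base = q.resv_head.fetch_add(num_wqes);
    while (base + num_wqes - q.completed > q.depth) {
        poll_cq(q);
    }
    return base;
}
```

In this model, once the ring is full, every reservation under strategy B costs at least one CQ poll on the issue path, while strategy A never polls. That extra work on the critical path is the latency concern raised in question 1. The model says nothing about real hardware timings.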

Operating System

No response

GPU

No response

ROCm Component

No response
