[Feature]: Suggestions Regarding IBGDA Performance

### Suggestion Description

Hello, I've briefly compared deepseek-ai/DeepEP and ROCm/DeepEP and have the following questions:
1. deepseek-ai/DeepEP implements its own low-latency IBGDA path, which avoids polling the CQ when issuing WQEs. The official explanation is in https://github.com/deepseek-ai/DeepEP/issues/180. ROCm/DeepEP, by contrast, calls the API provided by rocshmem directly. Each time a WQE is issued, that path checks whether the queue has free space and, if it does not, polls the CQ. This is essentially the same as the IBGDA path implemented in nvshmem, and it may lead to higher latency in low-latency mode.
2. ROCm/DeepEP calls a warp-level interface similar to put_nbi_warp. In rocshmem, only one thread in the warp actually issues WQEs, whereas in deepseek-ai/DeepEP all threads in the warp participate. Wouldn't this affect performance?
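
To make the two questions concrete, here is a rough host-side C++ toy model. It is not code from rocshmem, nvshmem, or either DeepEP repository; every name in it (`ToySendQueue`, `cq_polls_with_space_check`, `cq_polls_without_space_check`, `issue_rounds`) is hypothetical. It assumes the no-poll path is only safe because the caller keeps the number of in-flight WQEs within the queue depth, and it treats lane parallelism as ideal, ignoring doorbell serialization and memory effects.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy send queue; models only the counters relevant to question 1.
struct ToySendQueue {
    uint32_t depth;          // number of WQE slots
    uint64_t posted = 0;     // WQEs written by the issuing thread
    uint64_t nic_done = 0;   // WQEs the simulated NIC has finished
    uint64_t seen_done = 0;  // completions the issuer has observed via the CQ
    uint64_t cq_polls = 0;   // CQ reads performed on the issue path
    explicit ToySendQueue(uint32_t d) : depth(d) {}
};

// Question 1, check-then-poll style: before each post, check for a free slot
// and poll the CQ while the queue looks full.
uint64_t cq_polls_with_space_check(uint32_t depth, uint32_t num_wqes) {
    assert(depth > 0);
    ToySendQueue q(depth);
    for (uint32_t i = 0; i < num_wqes; ++i) {
        while (q.posted - q.seen_done >= q.depth) {
            if (q.nic_done < q.posted) ++q.nic_done;  // simulated NIC progress
            ++q.cq_polls;                             // read the CQ
            q.seen_done = q.nic_done;                 // update consumer view
        }
        ++q.posted;
    }
    return q.cq_polls;
}

// Question 1, no-poll style: post unconditionally. Assumption: the caller
// guarantees that in-flight WQEs never exceed the queue depth.
uint64_t cq_polls_without_space_check(uint32_t depth, uint32_t num_wqes) {
    assert(num_wqes <= depth);  // the caller-side guarantee, checked here
    ToySendQueue q(depth);
    for (uint32_t i = 0; i < num_wqes; ++i) ++q.posted;
    return q.cq_polls;  // never incremented on this path
}

// Question 2, idealized: sequential rounds needed to write num_wqes WQEs
// when issuing_lanes lanes of a warp each write one WQE per round.
uint32_t issue_rounds(uint32_t num_wqes, uint32_t issuing_lanes) {
    assert(issuing_lanes > 0);
    return (num_wqes + issuing_lanes - 1) / issuing_lanes;
}
```

In this model, `cq_polls_with_space_check(4, 16)` performs 12 CQ reads while `cq_polls_without_space_check(16, 16)` performs none, and `issue_rounds(32, 1)` is 32 versus `issue_rounds(32, 32)` at 1. Real latency depends on the hardware and on how each library rings the doorbell, so these counts only illustrate where the extra work could come from.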

### Operating System

_No response_

### GPU

_No response_

### ROCm Component

_No response_

[Feature]: Suggestions Regarding IBGDA Performance #13