[Issue]: 2GPU per node hangs with rccl-tests's alltoall_perf and aws-ofi-rccl plugin #90

### Problem Description

When running the `alltoall_perf` test with Open MPI 5.0.3, the aws-ofi-rccl plugin, and the OPX libfabric provider, we see infrequent hangs at message sizes of 8 MB and 16 MB. We launch with `-np 4 -N 2` (two processes per node) and `-g 1` (one GPU per process).

It appears that the RCCL plugin calls `fi_endpoint` to establish a connection, which causes OPX to open endpoints and generate RX and TX flowkeys. When we hit the hang and inspect the flowkeys, they appear to have been overwritten at runtime by the RCCL plugin calling `fi_endpoint` again, even though the connections were already established.
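
To make the suspected sequence concrete, here is a toy Python model. It is not libfabric, OPX, or plugin code: `ToyProvider`, `open_endpoint`, and `deliver` are hypothetical names, and the assumption that a peer keeps using the flowkey it learned at connect time is ours and has not been confirmed. It only shows why a second endpoint open after connection setup could leave traffic unmatched.

```python
import itertools


class ToyProvider:
    """Hypothetical stand-in for a provider that issues one flowkey per endpoint open."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self.active_key = {}  # rank -> flowkey the provider currently matches on

    def open_endpoint(self, rank):
        key = next(self._next_key)
        # Assumption: a repeated open replaces the earlier key for that rank.
        self.active_key[rank] = key
        return key

    def deliver(self, rank, key):
        """Return True if a message tagged with `key` matches rank's current key."""
        return self.active_key.get(rank) == key


provider = ToyProvider()

# Connection setup: the peer learns the key generated by the first open.
peer_view = provider.open_endpoint(rank=0)
fresh_ok = provider.deliver(0, peer_view)

# Suspected problem: a second open on the same rank after setup.
provider.open_endpoint(rank=0)
stale_ok = provider.deliver(0, peer_view)  # the peer still holds the old key
```

In this model the first delivery matches and the second does not, which is the kind of mismatch that would show up as a hang if the receiver keeps waiting for data that never matches.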

### Operating System

RHEL 9.4

### CPU

AMD Milan

### GPU

AMD Instinct MI210

### ROCm Version

ROCm 6.2.0

### ROCm Component

rccl

### Steps to Reproduce

Run `alltoall_perf` from rccl-tests on two nodes, each with 2 GPUs.

Run command to reproduce:

`mpirun -np 4 -N 2 -x MPIR_CVAR_CH4_OFI_ENABLE_RMA=1 -x MPIR_CVAR_ENABLE_GPU=1 -x MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1 -mca mtl ofi -x FI_PROVIDER=opx -mca btl self,vader alltoall_perf -b 4 -e 268435456 -f 2 -g 1`
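
Because the hang is infrequent, it may take several runs to hit it. Below is a minimal sketch of a wrapper that repeats the command above and reports the first run that exceeds a wall-clock limit. The helper name `run_until_hang`, the timeout, and the run count are assumptions for illustration and are not part of rccl-tests; a timeout only suggests a hang and does not prove one.

```python
import shlex
import subprocess
import sys

# The reproduction command from this issue, split into an argument list.
REPRO_CMD = shlex.split(
    "mpirun -np 4 -N 2 -x MPIR_CVAR_CH4_OFI_ENABLE_RMA=1 "
    "-x MPIR_CVAR_ENABLE_GPU=1 -x MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1 "
    "-mca mtl ofi -x FI_PROVIDER=opx -mca btl self,vader "
    "alltoall_perf -b 4 -e 268435456 -f 2 -g 1"
)


def run_until_hang(cmd, timeout_s, max_runs):
    """Run `cmd` up to `max_runs` times.

    Returns the 1-based number of the first run that exceeded `timeout_s`
    seconds, or None if every run finished in time.
    """
    for run in range(1, max_runs + 1):
        print(f"run {run}/{max_runs}", file=sys.stderr)
        try:
            subprocess.run(
                cmd,
                timeout=timeout_s,
                check=False,  # a non-zero exit is not treated as a hang
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the launched process on timeout; ranks on
            # remote nodes may still need manual cleanup.
            return run
    return None
```

For example, `run_until_hang(REPRO_CMD, timeout_s=600, max_runs=50)` would stop at the first run that takes longer than ten minutes; pick a limit comfortably above the normal runtime on your cluster.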



### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

_No response_

### Additional Information

_No response_
