Problem Description
When running the alltoall_perf test with Open MPI 5.0.3, the aws-ofi-rccl plugin, and the OPX libfabric provider, launched with -np 4 -N 2 (two processes per node) and -g 1 (one GPU per process), we see infrequent hangs at message sizes 8MB and 16MB.
The RCCL plugin appears to call fi_endpoint to establish a connection, which causes OPX to open endpoints and generate rx and tx flowkeys. When the hang occurs and we inspect the flowkeys, they appear to have been overwritten at runtime by the RCCL plugin calling fi_endpoint again, even though the connections were already established.
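For reference, this is roughly the standard libfabric endpoint-open sequence involved (a minimal sketch, not the plugin's actual code; the setup_endpoint wrapper and its parameters are illustrative, and binding of completion queues and the address vector is elided):

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Minimal sketch of the fi_endpoint() open-and-enable sequence.
 * The suspected failure pattern is this sequence running a second
 * time against an already-connected peer: the second fi_endpoint()
 * makes OPX generate fresh rx/tx flowkeys, overwriting the ones
 * the established connection is still using. */
static int setup_endpoint(struct fid_domain *domain,
                          struct fi_info *info,
                          struct fid_ep **ep)
{
    int ret = fi_endpoint(domain, info, ep, NULL);
    if (ret)
        return ret;

    /* ... fi_ep_bind() of CQs and the AV would happen here ... */

    return fi_enable(*ep);
}
```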
Operating System
RHEL 9.4
CPU
AMD Milan
GPU
AMD Instinct MI210
ROCm Version
ROCm 6.2.0
ROCm Component
rccl
Steps to Reproduce
Run alltoall_perf from rccl-tests on two nodes, each with 2 GPUs.
Command to reproduce:
mpirun -np 4 -N 2 -x MPIR_CVAR_CH4_OFI_ENABLE_RMA=1 -x MPIR_CVAR_ENABLE_GPU=1 -x MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1 -mca mtl ofi -x FI_PROVIDER=opx -mca btl self,vader alltoall_perf -b 4 -e 268435456 -f 2 -g 1
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response