[Issue]: 2GPU per node hangs with rccl-tests's alltoall_perf and aws-ofi-rccl plugin #90

@tmh97

Problem Description

When running the alltoall_perf test with Open MPI 5.0.3, the aws-ofi-rccl plugin, and the OPX libfabric provider, we see infrequent hangs at message sizes of 8 MB and 16 MB. The run uses -np 4 -N 2 to place two processes on each node and -g 1 to assign one GPU per process.

It appears that the RCCL plugin calls fi_endpoint to establish a connection, which causes OPX to open endpoints and generate rx and tx flowkeys. When we hit the hang and inspect the flowkeys, they appear to have been overwritten at runtime by the RCCL plugin calling fi_endpoint again, even though the connections were already established.
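The suspected failure mode can be illustrated with a toy model. This is only a sketch of the hypothesis above, not a reflection of how OPX or the aws-ofi-rccl plugin actually store their state: `ToyEndpointTable`, `open_endpoint`, and `accepts` are hypothetical names invented for illustration. The model assumes that every endpoint open regenerates the flowkeys in place, so a peer holding a key from the first open no longer matches after the second.

```python
import itertools

# Hypothetical key generator; real flowkeys are produced by the provider.
_key_counter = itertools.count(1)


class ToyEndpointTable:
    """Toy stand-in for per-process endpoint state (not OPX's real layout)."""

    def __init__(self):
        self.rx_key = None
        self.tx_key = None

    def open_endpoint(self):
        # Assumption: each open generates fresh keys and replaces the old ones.
        self.rx_key = next(_key_counter)
        self.tx_key = next(_key_counter)
        return self.rx_key, self.tx_key

    def accepts(self, key):
        # A message is matched only if it carries the current rx key.
        return key == self.rx_key


table = ToyEndpointTable()

# First open: the connection is established and the peer caches the rx key.
peer_cached_rx_key, _ = table.open_endpoint()
matched_before = table.accepts(peer_cached_rx_key)

# Second open on the same state: the keys are overwritten.
table.open_endpoint()
matched_after = table.accepts(peer_cached_rx_key)

print("matched before second open:", matched_before)
print("matched after second open:", matched_after)
```

In this model a message sent with the cached key is never matched after the second open, which would look like a hang to the sender. Whether the real code path behaves this way is exactly what needs to be confirmed.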

Operating System

RHEL 9.4

CPU

AMD Milan

GPU

AMD Instinct MI210

ROCm Version

ROCm 6.2.0

ROCm Component

rccl

Steps to Reproduce

Run alltoall_perf from rccl-tests on two nodes, each with 2 GPUs.

Run command to reproduce:

mpirun -np 4 -N 2 -x MPIR_CVAR_CH4_OFI_ENABLE_RMA=1 -x MPIR_CVAR_ENABLE_GPU=1 -x MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1 -mca mtl ofi -x FI_PROVIDER=opx -mca btl self,vader alltoall_perf -b 4 -e 268435456 -f 2 -g 1
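Because the hang is infrequent, it may take several runs to reproduce. A minimal sketch of a repeat-until-hang wrapper around the command above follows. The helper names (`build_command`, `run_once`, `repeat`) and the idea of treating a timeout as a hang are assumptions for illustration; a suitable timeout depends on how long a healthy run takes on your system.

```python
import subprocess


def build_command(np=4, per_node=2, gpus_per_proc=1):
    """Return the mpirun command from this issue as an argument list."""
    return [
        "mpirun", "-np", str(np), "-N", str(per_node),
        "-x", "MPIR_CVAR_CH4_OFI_ENABLE_RMA=1",
        "-x", "MPIR_CVAR_ENABLE_GPU=1",
        "-x", "MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1",
        "-mca", "mtl", "ofi",
        "-x", "FI_PROVIDER=opx",
        "-mca", "btl", "self,vader",
        "alltoall_perf", "-b", "4", "-e", "268435456", "-f", "2",
        "-g", str(gpus_per_proc),
    ]


def run_once(cmd, timeout_s):
    """Run cmd once; classify the result as 'ok', 'fail', or 'hang'."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the launched process on timeout. Ranks on
        # remote nodes may need separate cleanup.
        return "hang"
    return "ok" if result.returncode == 0 else "fail"


def repeat(cmd, runs, timeout_s):
    """Run cmd up to `runs` times; stop at the first non-'ok' outcome."""
    for attempt in range(1, runs + 1):
        outcome = run_once(cmd, timeout_s)
        print(f"run {attempt}: {outcome}")
        if outcome != "ok":
            return attempt, outcome
    return runs, "ok"
```

For example, `repeat(build_command(), runs=50, timeout_s=600)` would stop at the first run that exceeds ten minutes. Both values are placeholders, not measured thresholds.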

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
