
[Issue]: rccl-tests on two MI300X nodes give poor performance using mpirun #146

@rgee18


Problem Description

echo "OS:" && cat /etc/os-release | grep -E "^(NAME=|VERSION=)";
OS:
NAME="Ubuntu"
VERSION="22.04.5 LTS (Jammy Jellyfish)"

echo "CPU: " && cat /proc/cpuinfo | grep "model name" | sort --unique;
CPU:
model name	: AMD EPYC 9534 64-Core Processor

echo "GPU:" && /opt/rocm/bin/rocminfo | grep -E "^\s*(Name|Marketing Name)";
GPU:
  Name:                    AMD EPYC 9534 64-Core Processor
  Marketing Name:          AMD EPYC 9534 64-Core Processor
  Name:                    AMD EPYC 9534 64-Core Processor
  Marketing Name:          AMD EPYC 9534 64-Core Processor
  Name:                    gfx942
  Marketing Name:          AMD Instinct MI300X
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
  Name:                    gfx942
  Marketing Name:          AMD Instinct MI300X
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
  Name:                    gfx942
  Marketing Name:          AMD Instinct MI300X
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
  Name:                    gfx942
  Marketing Name:          AMD Instinct MI300X
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
  Name:                    gfx942
  Marketing Name:          AMD Instinct MI300X
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
  Name:                    gfx942
  Marketing Name:          AMD Instinct MI300X
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
  Name:                    gfx942
  Marketing Name:          AMD Instinct MI300X
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
  Name:                    gfx942
  Marketing Name:          AMD Instinct MI300X
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-

Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

CPU

AMD EPYC 9534 64-Core Processor

GPU

AMD Instinct MI300X

ROCm Version

ROCm version: 6.2.0

ROCm Component

No response

Steps to Reproduce

I followed the steps at https://github.com/ROCm/rccl-tests to compile rccl-tests.

I built with MPI=1 and created a hostfile listing my nodes' private IP addresses, each with slots=8.
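For reference, a hostfile of this shape could be created as in the sketch below. It assumes Open MPI's `address slots=N` hostfile syntax; the two addresses are hypothetical placeholders, not the real node IPs. The slot sum is only a sanity check that the file offers one slot per GPU across both nodes.

```shell
# Write a two-node hostfile; the addresses are hypothetical placeholders.
# Note: this overwrites any existing hostfile.txt in the current directory.
cat > hostfile.txt <<'EOF'
10.0.0.1 slots=8
10.0.0.2 slots=8
EOF

# Sum the slots to confirm they match the 16 GPUs across both nodes.
total_slots=$(awk -F'slots=' '{ sum += $2 } END { print sum }' hostfile.txt)
echo "total slots: ${total_slots}"
```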

The command I'm using to run the MPI job is:

HSA_NO_SCRATCH_RECLAIM=1 mpirun -np 16 --hostfile hostfile.txt --bind-to numa ./all_reduce_perf -b 8 -e 128M -f 2 -g 8

The results are attached:

all_reduce_test.txt
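One thing worth checking in the command above is how many GPU ranks it requests. The sketch below assumes the convention documented in the rccl-tests README, where the total number of ranks is (number of processes) x (number of threads) x (GPUs per thread); the variable names are illustrative only, and this is a possible lead rather than a confirmed cause of the poor performance.

```shell
# Values taken from the mpirun command in this issue.
np=16             # -np: MPI processes launched across both nodes
gpus_per_rank=8   # -g: GPUs driven by each rccl-tests process
physical_gpus=16  # 2 nodes x 8 MI300X, per the rocminfo output

# Total GPU ranks the test is asked to create (one thread per process).
requested=$((np * gpus_per_rank))
echo "requested GPU ranks: ${requested}"
echo "physical GPUs:       ${physical_gpus}"

if [ "${requested}" -gt "${physical_gpus}" ]; then
  echo "more ranks requested than GPUs; consider -np ${physical_gpus} with -g 1"
fi
```

If that convention applies here, launching one process per GPU (`-np 16 -g 1`) would match the 16 physical GPUs and the 16 hostfile slots.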

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
