Add NCCL send/recv ring benchmark for multi-GPU testing#1013

Open
paulogallotti wants to merge 1 commit into awslabs:main from paulogallotti:main

Conversation

@paulogallotti

Purpose

Minimal point-to-point benchmark using ncclSend/ncclRecv in a ring topology. Measures latency and bandwidth across GPUs over 100 timed iterations with a 64 MB buffer.

Includes a Slurm batch script configured for SageMaker HyperPod clusters with EFA networking (p4d.24xlarge, 2 nodes, 8 GPUs/node).

@KeitaW (Collaborator) left a comment

Review 1/4 — Structure & Repository Hygiene

Thanks for putting this together, @paulogallotti! A point-to-point ring benchmark is a genuinely useful addition to the micro-benchmarks collection. I have a few suggestions below.

Directory placement: this is not NVIDIA nccl-tests

The micro-benchmarks/nccl-tests/ directory hosts benchmarks that wrap NVIDIA's nccl-tests — running the upstream all_reduce_perf, alltoall_perf, etc. binaries against different configurations (AMI, container, topology-aware). This PR adds a custom C benchmark that uses NCCL APIs directly; it has no dependency on the nccl-tests project.

Placing it under nccl-tests/ could confuse users looking for official nccl-tests wrappers. I'd suggest moving this to a new top-level sibling directory under micro-benchmarks/, for example:

  • micro-benchmarks/nccl-sendrecv/ — describes exactly what the benchmark does
  • micro-benchmarks/nccl-p2p/ — a bit more general if you plan to add other point-to-point patterns later

This aligns with how micro-benchmarks/ is currently organized: each directory (nccl-tests, nccom-tests, nvshmem, expert-parallelism) corresponds to a distinct tool or benchmark suite.

@@ -0,0 +1,21 @@
CUDA_HOME ?= /usr/local/cuda

Missing license header

The repo convention requires every file to include the Amazon copyright header. The existing nccl-tests scripts (e.g., nccl-tests-ami.sbatch) all follow this pattern. Could you add headers to all four files?

For the Makefile and sbatch, use # comments; for the .c file use // or /* */.

Suggested change
CUDA_HOME ?= /usr/local/cuda
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
CUDA_HOME ?= /usr/local/cuda

@KeitaW (Collaborator) left a comment

Review 2/4 — Deployment Pipeline (Slurm)

Comment on lines +10 to +11

###############################################################################

Missing set -ex

The existing nccl-tests sbatch scripts (e.g., nccl-tests-ami.sbatch) include set -ex immediately after the #SBATCH directives and copyright header. This ensures the script fails fast on errors and logs each command for debugging. I'd suggest adding it here to follow the established pattern.

Suggested change
###############################################################################
#SBATCH --wait-all-nodes=1
set -ex

Comment on lines +52 to +55
srun \
--mpi=pmix \
--gpus-per-node=${SLURM_GPUS_PER_NODE} \
--ntasks-per-node=${SLURM_NTASKS_PER_NODE} \

Unquoted variable expansions

Per repo conventions, variable expansions should be quoted. While Slurm sets these to numeric values (so word-splitting isn't a practical risk here), quoting is the repo-wide standard.

Suggested change
srun \
--mpi=pmix \
--gpus-per-node=${SLURM_GPUS_PER_NODE} \
--ntasks-per-node=${SLURM_NTASKS_PER_NODE} \
srun \
--mpi=pmix \
--gpus-per-node="${SLURM_GPUS_PER_NODE}" \
--ntasks-per-node="${SLURM_NTASKS_PER_NODE}" \

#
# Current config: p4d.24xlarge (8 A100 GPUs/node, 2 nodes)
###############################################################################


HyperPod auto-resume detection

Since the README explicitly targets SageMaker HyperPod, it might be worth including the standard HyperPod auto-resume detection block that other HyperPod-targeted scripts in the repo use:

if [ -d "/opt/sagemaker_cluster" ]; then
    # HyperPod auto-resume logic
fi

Not strictly required, but would be consistent with the repo convention.

@KeitaW (Collaborator) left a comment

Left a few comments.

@KeitaW (Collaborator) left a comment

Review 3/4 — Infrastructure & NCCL Configuration

Comment on lines +27 to +28
# (previously forced Ring/Simple which limited performance)


Changelog-style comment in a fresh contribution

This reads like a changelog entry ("previously forced Ring/Simple which limited performance") which could be confusing for readers since there's no prior version to compare against. I'd suggest simplifying:

Suggested change
# (previously forced Ring/Simple which limited performance)
# Let NCCL auto-select the best algorithm and protocol
# for optimal performance across different message sizes.


# Use all 4 EFA interfaces on p4d.24xlarge
export NCCL_NET_GDR_LEVEL=SYS


Brief comment on GDR level

It might be worth adding a note that this setting enables GPU Direct RDMA and assumes EFA is available on the instance, e.g.:

Suggested change
export NCCL_NET_GDR_LEVEL=SYS # GPU Direct RDMA at system level (requires EFA)

@KeitaW (Collaborator) left a comment

Review 4/4 — Documentation Consistency & Positives

Branch setup note

The head branch is main on the fork, which means the fork's main branch will diverge from upstream after merge. Not a blocker, just worth being aware of for future PRs.


Things That Look Great

  • Clean, readable C code: The benchmark is well-structured with proper error checking via CUDACHECK and NCCLCHECK macros, clean resource cleanup, and a logical flow from init → warm-up → timed iterations → reporting.
  • Good use of MPI_Comm_split_type for local rank: This is the correct way to determine local rank for GPU assignment, rather than relying on environment variables or hostname parsing.
  • Ring topology fills a real gap: The existing nccl-tests wrappers focus on collective operations (allreduce, alltoall). A point-to-point ring benchmark is genuinely useful for diagnosing pairwise connectivity issues.
  • Comprehensive README: The documentation covers building, running, customization, and configuration in a clear, well-formatted way with helpful tables.
  • Makefile supports path overrides: The ?= syntax for CUDA_HOME, NCCL_HOME, and MPI_HOME makes the build portable across different environments without editing the file.
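The local-rank idiom praised above is the standard MPI-3 pattern; a minimal sketch of it follows, assuming an MPI environment (the `cudaSetDevice` line is commented out since it additionally requires CUDA):

```c
#include <mpi.h>
#include <stdio.h>

/* Split COMM_WORLD into per-node communicators: ranks that share a node
 * (shared memory) land in the same communicator, and the rank within that
 * communicator is the local rank, suitable for GPU assignment. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm local;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        0, MPI_INFO_NULL, &local);

    int world_rank, local_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_rank(local, &local_rank);
    printf("global rank %d -> local rank %d\n", world_rank, local_rank);
    /* cudaSetDevice(local_rank);  // one GPU per local rank */

    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}
```

This avoids parsing hostnames or trusting launcher-specific environment variables, which is why it is the portable choice.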

| Define | Default | Description |
|---|---|---|
| `MSG_SIZE` | 64 MB | Buffer size per send/recv |
| `ITERATIONS` | 100 | Timed iterations |

Example output could note values are illustrative

The specific bandwidth numbers shown (27.39 GB/s algo BW, 54.78 GB/s bus BW) are plausible for p4d but users might interpret them as expected baselines. It might be worth adding a brief note like:

Values will vary based on instance type, message size, and network conditions.

