Add NCCL send/recv ring benchmark for multi-GPU testing#1013

Open
paulogallotti wants to merge 1 commit into awslabs:main from paulogallotti:main

Conversation

@paulogallotti

Purpose

Minimal point-to-point benchmark using ncclSend/ncclRecv in a ring topology. Measures latency and bandwidth across GPUs over 100 timed iterations with a 64 MB buffer.

Includes a Slurm batch script configured for SageMaker HyperPod clusters with EFA networking (p4d.24xlarge, 2 nodes, 8 GPUs/node).

@KeitaW (Collaborator) left a comment

Review 1/4 — Structure & Repository Hygiene

Thanks for putting this together, @paulogallotti! A point-to-point ring benchmark is a genuinely useful addition to the micro-benchmarks collection. I have a few suggestions below.

Directory placement: this is not NVIDIA nccl-tests

The micro-benchmarks/nccl-tests/ directory hosts benchmarks that wrap NVIDIA's nccl-tests — running the upstream all_reduce_perf, alltoall_perf, etc. binaries against different configurations (AMI, container, topology-aware). This PR adds a custom C benchmark that uses NCCL APIs directly; it has no dependency on the nccl-tests project.

Placing it under nccl-tests/ could confuse users looking for official nccl-tests wrappers. I'd suggest moving this to a new top-level sibling directory under micro-benchmarks/, for example:

  • micro-benchmarks/nccl-sendrecv/ — describes exactly what the benchmark does
  • micro-benchmarks/nccl-p2p/ — a bit more general if you plan to add other point-to-point patterns later

This aligns with how micro-benchmarks/ is currently organized: each directory (nccl-tests, nccom-tests, nvshmem, expert-parallelism) corresponds to a distinct tool or benchmark suite.

@@ -0,0 +1,21 @@
CUDA_HOME ?= /usr/local/cuda

Missing license header

The repo convention requires every file to include the Amazon copyright header. The existing nccl-tests scripts (e.g., nccl-tests-ami.sbatch) all follow this pattern. Could you add headers to all four files?

For the Makefile and sbatch, use # comments; for the .c file use // or /* */.

Suggested change
CUDA_HOME ?= /usr/local/cuda
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
CUDA_HOME ?= /usr/local/cuda

@KeitaW (Collaborator) left a comment

Review 2/4 — Deployment Pipeline (Slurm)

Comment on lines +10 to +11

###############################################################################

Missing set -ex

The existing nccl-tests sbatch scripts (e.g., nccl-tests-ami.sbatch) include set -ex immediately after the #SBATCH directives and copyright header. This ensures the script fails fast on errors and logs each command for debugging. I'd suggest adding it here to follow the established pattern.

Suggested change
###############################################################################
#SBATCH --wait-all-nodes=1
set -ex

Comment on lines +52 to +55
srun \
--mpi=pmix \
--gpus-per-node=${SLURM_GPUS_PER_NODE} \
--ntasks-per-node=${SLURM_NTASKS_PER_NODE} \

Unquoted variable expansions

Per repo conventions, variable expansions should be quoted. While Slurm sets these to numeric values (so word-splitting isn't a practical risk here), quoting is the repo-wide standard.

Suggested change
srun \
--mpi=pmix \
--gpus-per-node=${SLURM_GPUS_PER_NODE} \
--ntasks-per-node=${SLURM_NTASKS_PER_NODE} \
srun \
--mpi=pmix \
--gpus-per-node="${SLURM_GPUS_PER_NODE}" \
--ntasks-per-node="${SLURM_NTASKS_PER_NODE}" \

#
# Current config: p4d.24xlarge (8 A100 GPUs/node, 2 nodes)
###############################################################################


HyperPod auto-resume detection

Since the README explicitly targets SageMaker HyperPod, it might be worth including the standard HyperPod auto-resume detection block that other HyperPod-targeted scripts in the repo use:

if [ -d "/opt/sagemaker_cluster" ]; then
    # HyperPod auto-resume logic
fi

Not strictly required, but would be consistent with the repo convention.

@KeitaW (Collaborator) left a comment

Left a few comments.

@KeitaW (Collaborator) left a comment

Review 3/4 — Infrastructure & NCCL Configuration

Comment on lines +27 to +28
# (previously forced Ring/Simple which limited performance)


Changelog-style comment in a fresh contribution

This reads like a changelog entry ("previously forced Ring/Simple which limited performance") which could be confusing for readers since there's no prior version to compare against. I'd suggest simplifying:

Suggested change
# (previously forced Ring/Simple which limited performance)
# Let NCCL auto-select the best algorithm and protocol
# for optimal performance across different message sizes.


# Use all 4 EFA interfaces on p4d.24xlarge
export NCCL_NET_GDR_LEVEL=SYS


Brief comment on GDR level

It might be worth adding a note that this setting enables GPU Direct RDMA and assumes EFA is available on the instance, e.g.:

Suggested change
export NCCL_NET_GDR_LEVEL=SYS # GPU Direct RDMA at system level (requires EFA)

@KeitaW (Collaborator) left a comment

Review 4/4 — Documentation Consistency & Positives

Branch setup note

The head branch is main on the fork, which means the fork's main branch will diverge from upstream after merge. Not a blocker, just worth being aware of for future PRs.


Things That Look Great

  • Clean, readable C code: The benchmark is well-structured with proper error checking via CUDACHECK and NCCLCHECK macros, clean resource cleanup, and a logical flow from init → warm-up → timed iterations → reporting.
  • Good use of MPI_Comm_split_type for local rank: This is the correct way to determine local rank for GPU assignment, rather than relying on environment variables or hostname parsing.
  • Ring topology fills a real gap: The existing nccl-tests wrappers focus on collective operations (allreduce, alltoall). A point-to-point ring benchmark is genuinely useful for diagnosing pairwise connectivity issues.
  • Comprehensive README: The documentation covers building, running, customization, and configuration in a clear, well-formatted way with helpful tables.
  • Makefile supports path overrides: The ?= syntax for CUDA_HOME, NCCL_HOME, and MPI_HOME makes the build portable across different environments without editing the file.
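The local-rank idiom praised above is the standard MPI-3 pattern; a minimal sketch of it follows, assuming an MPI environment (the `cudaSetDevice` line is commented out since it additionally requires CUDA):

```c
#include <mpi.h>
#include <stdio.h>

/* Split COMM_WORLD into per-node communicators: ranks that share a node
 * (shared memory) land in the same communicator, and the rank within that
 * communicator is the local rank, suitable for GPU assignment. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm local;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        0, MPI_INFO_NULL, &local);

    int world_rank, local_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_rank(local, &local_rank);
    printf("global rank %d -> local rank %d\n", world_rank, local_rank);
    /* cudaSetDevice(local_rank);  // one GPU per local rank */

    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}
```

This avoids parsing hostnames or trusting launcher-specific environment variables, which is why it is the portable choice.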

| Define | Default | Description |
|---|---|---|
| `MSG_SIZE` | 64 MB | Buffer size per send/recv |
| `ITERATIONS` | 100 | Timed iterations |

Example output could note values are illustrative

The specific bandwidth numbers shown (27.39 GB/s algo BW, 54.78 GB/s bus BW) are plausible for p4d but users might interpret them as expected baselines. It might be worth adding a brief note like:

Values will vary based on instance type, message size, and network conditions.

