Add NCCL send/recv ring benchmark for multi-GPU testing#1013
paulogallotti wants to merge 1 commit into awslabs:main from paulogallotti:main
Conversation
Minimal point-to-point benchmark using ncclSend/ncclRecv in a ring topology. Measures latency and bandwidth across GPUs over 100 timed iterations with a 64 MB buffer. Includes a Slurm batch script configured for SageMaker HyperPod clusters with EFA networking (p4d.24xlarge, 2 nodes, 8 GPUs/node).
KeitaW left a comment
Review 1/4 — Structure & Repository Hygiene
Thanks for putting this together, @paulogallotti! A point-to-point ring benchmark is a genuinely useful addition to the micro-benchmarks collection. I have a few suggestions below.
Directory placement: this is not NVIDIA nccl-tests
The micro-benchmarks/nccl-tests/ directory hosts benchmarks that wrap NVIDIA's nccl-tests — running the upstream all_reduce_perf, alltoall_perf, etc. binaries against different configurations (AMI, container, topology-aware). This PR adds a custom C benchmark that uses NCCL APIs directly; it has no dependency on the nccl-tests project.
Placing it under nccl-tests/ could confuse users looking for official nccl-tests wrappers. I'd suggest moving this to a new top-level sibling directory under micro-benchmarks/, for example:
- `micro-benchmarks/nccl-sendrecv/` — describes exactly what the benchmark does
- `micro-benchmarks/nccl-p2p/` — a bit more general, if you plan to add other point-to-point patterns later
This aligns with how micro-benchmarks/ is currently organized: each directory (nccl-tests, nccom-tests, nvshmem, expert-parallelism) corresponds to a distinct tool or benchmark suite.
```diff
@@ -0,0 +1,21 @@
+CUDA_HOME ?= /usr/local/cuda
```
Missing license header
The repo convention requires every file to include the Amazon copyright header. The existing nccl-tests scripts (e.g., nccl-tests-ami.sbatch) all follow this pattern. Could you add headers to all four files?
For the Makefile and sbatch, use # comments; for the .c file use // or /* */.
```suggestion
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
CUDA_HOME ?= /usr/local/cuda
```
KeitaW left a comment
Review 2/4 — Deployment Pipeline (Slurm)
```
###############################################################################
```
Missing set -ex
The existing nccl-tests sbatch scripts (e.g., nccl-tests-ami.sbatch) include set -ex immediately after the #SBATCH directives and copyright header. This ensures the script fails fast on errors and logs each command for debugging. I'd suggest adding it here to follow the established pattern.
```suggestion
###############################################################################
#SBATCH --wait-all-nodes=1
set -ex
```
```
srun \
  --mpi=pmix \
  --gpus-per-node=${SLURM_GPUS_PER_NODE} \
  --ntasks-per-node=${SLURM_NTASKS_PER_NODE} \
```
Unquoted variable expansions
Per repo conventions, variable expansions should be quoted. While Slurm sets these to numeric values (so word-splitting isn't a practical risk here), quoting is the repo-wide standard.
```suggestion
srun \
  --mpi=pmix \
  --gpus-per-node="${SLURM_GPUS_PER_NODE}" \
  --ntasks-per-node="${SLURM_NTASKS_PER_NODE}" \
```
```
#
# Current config: p4d.24xlarge (8 A100 GPUs/node, 2 nodes)
###############################################################################
```
HyperPod auto-resume detection
Since the README explicitly targets SageMaker HyperPod, it might be worth including the standard HyperPod auto-resume detection block that other HyperPod-targeted scripts in the repo use:
```bash
if [ -d "/opt/sagemaker_cluster" ]; then
    # HyperPod auto-resume logic
fi
```

Not strictly required, but it would be consistent with the repo convention.
KeitaW left a comment
Review 3/4 — Infrastructure & NCCL Configuration
```
# (previously forced Ring/Simple which limited performance)
```
Changelog-style comment in a fresh contribution
This reads like a changelog entry ("previously forced Ring/Simple which limited performance"), which could confuse readers since there's no prior version to compare against. I'd suggest simplifying:
```suggestion
# Let NCCL auto-select the best algorithm and protocol
# for optimal performance across different message sizes.
```
```
# Use all 4 EFA interfaces on p4d.24xlarge
export NCCL_NET_GDR_LEVEL=SYS
```
Brief comment on GDR level
It might be worth adding a note that this setting enables GPU Direct RDMA and assumes EFA is available on the instance, e.g.:
```suggestion
export NCCL_NET_GDR_LEVEL=SYS  # GPU Direct RDMA at system level (requires EFA)
```
KeitaW left a comment
Review 4/4 — Documentation Consistency & Positives
Branch setup note
The head branch is main from a fork, which means the contributor's fork main will diverge from upstream after merge. Not a blocker — just worth being aware of for future PRs.
Things That Look Great
- **Clean, readable C code:** The benchmark is well-structured with proper error checking via `CUDACHECK` and `NCCLCHECK` macros, clean resource cleanup, and a logical flow from init → warm-up → timed iterations → reporting.
- **Good use of `MPI_Comm_split_type` for local rank:** This is the correct way to determine local rank for GPU assignment, rather than relying on environment variables or hostname parsing.
- **Ring topology fills a real gap:** The existing nccl-tests wrappers focus on collective operations (allreduce, alltoall). A point-to-point ring benchmark is genuinely useful for diagnosing pairwise connectivity issues.
- **Comprehensive README:** The documentation covers building, running, customization, and configuration in a clear, well-formatted way with helpful tables.
- **Makefile supports path overrides:** The `?=` syntax for `CUDA_HOME`, `NCCL_HOME`, and `MPI_HOME` makes the build portable across different environments without editing the file.
```
| Define | Default | Description |
|---|---|---|
| `MSG_SIZE` | 64 MB | Buffer size per send/recv |
| `ITERATIONS` | 100 | Timed iterations |
```
Example output could note values are illustrative
The specific bandwidth numbers shown (27.39 GB/s algo BW, 54.78 GB/s bus BW) are plausible for p4d but users might interpret them as expected baselines. It might be worth adding a brief note like:
Values will vary based on instance type, message size, and network conditions.