Add NeMo RL GRPO training with fault tolerance (NVRx) on EKS #1010
dmvevents wants to merge 1 commit into awslabs:main
Conversation
New test case for NVIDIA NeMo RL GRPO training on Amazon EKS with:

- NVIDIA Resiliency Extension (NVRx) for fault tolerance
- Multi-arch Dockerfile (g5 A10G, g6e L40S, p5 H100)
- KubeRay RayJob manifests with FSx Lustre shared storage
- Training scripts with heartbeat monitoring and straggler detection
- Before/after evaluation demonstrating math reasoning improvement

Tested on 2x g5.8xlarge with Qwen2.5-1.5B-Instruct, 20 GRPO steps.
KeitaW left a comment
Review Batch 1/3 — Structure & Repository Hygiene
This PR adds a comprehensive new test case for NVIDIA NeMo RL GRPO training with fault tolerance (NVRx) on Amazon EKS. The contribution is well-structured and clearly written, though there are several repo-convention issues that I'd suggest addressing before merging.
Missing copyright and license headers on all files
None of the 8 files in this PR include the required copyright header:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
I'd suggest adding it to the top of each file — for the Dockerfile, before the first comment block; for Python files, before the docstring; for YAML files, as the first line(s); for the shell script, after the shebang line.
Reference: CLAUDE.md conventions
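As a sketch, this is how the placement looks for the shell-script case (the `set -euo pipefail` line and the demo variable are illustrative, not from the PR):

```shell
#!/bin/bash
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
# The header sits immediately after the shebang. For Python files it goes
# above the module docstring; for YAML it forms the very first lines.
set -euo pipefail

license_id="MIT-0"
echo "license header in place: ${license_id}"
```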
```
# Build (multi-arch, ~40 min on c5.4xlarge):
#   docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .
```
Dockerfile header references wrong file path
This looks like a leftover from when the file was at a different path in the upstream NeMo RL repo. The actual file is 3.test_cases/pytorch/nemo-rl/Dockerfile, not docker/Dockerfile.workshop.
I'd suggest updating these comments to match the actual location in this repo:
```diff
 # Build (multi-arch, ~40 min on c5.4xlarge):
-#   docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .
+#   docker build -t nemo-rl-workshop .
```
```
RUN cd /tmp && \
    curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
    | tar xz && \
```
EFA installer uses unpinned latest URL
The EFA installer is fetched via aws-efa-installer-latest.tar.gz, which resolves to whatever the current version is at build time. This violates the repo convention that all external dependencies must be pinned to a specific version.
I'd suggest pinning to a specific version, consistent with how OFI_NCCL_VERSION is already pinned:
```diff
 RUN cd /tmp && \
-    curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
+    curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-1.38.0.tar.gz \
     | tar xz && \
```
Alternatively, add an ARG EFA_INSTALLER_VERSION=1.38.0 alongside the other version ARGs and use it here.
Reference: CI version check enforces EFA >= 1.47.0.
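A sketch of the ARG-based variant (the version value follows the suggestion above, and the installer flags mirror the ones already used in this Dockerfile; treat both as placeholders for whatever the repo's version policy requires):

```dockerfile
# Pin the EFA installer the same way the other version ARGs are pinned.
ARG EFA_INSTALLER_VERSION=1.38.0

RUN cd /tmp && \
    curl -sL "https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz" \
    | tar xz && \
    cd aws-efa-installer && \
    ./efa_installer.sh --yes --skip-kmod --skip-limit-conf --no-verify
```

Bumping the version then becomes a one-line change (or a `--build-arg` at build time) rather than an unreproducible build.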
KeitaW left a comment
Review Batch 2/3 — Deployment Pipeline & Infrastructure
```
@@ -0,0 +1,89 @@
#!/bin/bash
set -e
```
Shell script uses set -e instead of set -euo pipefail
The repo convention requires shell scripts to start with set -euo pipefail (or set -ex at minimum). Using only set -e means unset variable references won't be caught (-u) and failures in piped commands won't propagate (-o pipefail).
```diff
-set -e
+set -exuo pipefail
```
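A minimal sketch of what the stricter mode catches (variable names here are illustrative, not taken from the PR's script):

```shell
#!/bin/bash
set -euo pipefail

# -u: expanding an unset variable is a hard error instead of silently
# becoming "". Use ${VAR:-default} where a default is genuinely intended.
ckpt_dir="${CKPT_DIR_DEMO:-/tmp/ckpt}"
echo "checkpoint dir: ${ckpt_dir}"

# -o pipefail: a pipeline fails if any stage fails, not just the last one.
# With plain `set -e`, `false | cat` would succeed silently.
if ! (false | cat); then
  status="pipeline failure detected"
  echo "${status}"
fi
```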
```
# 3. NVRx scripts on FSx: /shared/nvrx-demo/{patches,scripts}/
# 4. NVIDIA + EFA device plugins on g5 nodes
#
# Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml
```
rayjob.yaml comment references wrong filename
The deploy comment references g5-rayjob-qwen-nvrx.yaml but the actual filename is rayjob.yaml.
```diff
-# Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml
+# Deploy: kubectl apply -f kubernetes/rayjob.yaml
```
```
- name: NCCL_SOCKET_IFNAME
  value: "ens5"
```
NCCL_SOCKET_IFNAME uses positive interface selection
The repo convention requires exclusion-based patterns (^lo for K8s, ^docker,lo,veth for Dockerfiles), not positive selection like ens5. While the comment explains the g5 dual-ENI rationale, positive selection breaks portability on other instance types.
I noticed the Dockerfile correctly uses ^lo,docker,veth,eni (line 218). I'd suggest using an exclusion pattern here too, since hostNetwork: true means container interfaces match the host:
```diff
 - name: NCCL_SOCKET_IFNAME
-  value: "ens5"
+  value: "^lo,docker,veth"
```
Reference: EFA cheatsheet.
KeitaW left a comment
Review Batch 3/3 — Things That Look Great
- Excellent README structure: The walkthrough is thorough, with clear architecture diagrams, a troubleshooting table, and step-by-step instructions. The ASCII architecture and resiliency stack diagrams are genuinely helpful.
- Multi-arch Dockerfile is well-engineered: The multi-stage build with clear separation (clone → deps → python → release → EFA) is clean. The CUDA arch handling (SM 9.0-only for deep_ep/deep_gemm) shows deep understanding of the compilation requirements.
- YAML anchor reuse: The `&common-env`, `&common-mounts`, and `&common-volumes` anchors in rayjob.yaml eliminate duplication between head and worker specs, which is exactly the right pattern.
- Defensive patching code: `patch_nvrx_features.py` is careful about idempotency (checks if already patched), provides clear logging, and handles missing files gracefully.
- Workshop-optimized design: The `CLEAR_CHECKPOINTS` env var, pre-cache job, and evaluation script show this was designed for live demos with real operator ergonomics in mind.
- Conservative backoffLimit: `backoffLimit: 2` in both manifests follows the repo convention for GPU jobs.
- HF_HOME over TRANSFORMERS_CACHE: Correctly uses the non-deprecated env var throughout.
- EFA configuration in Dockerfile: Proper `--skip-kmod --skip-limit-conf --no-verify` flags and exclusion-based `NCCL_SOCKET_IFNAME` pattern.
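For readers unfamiliar with the anchor/alias pattern praised above, here is an illustrative sketch (keys are simplified stand-ins, not the actual rayjob.yaml contents; in the real manifest the anchor is defined at its first occurrence and aliased afterwards):

```yaml
headGroupSpec:
  template:
    spec:
      containers:
        - name: ray-head
          env: &common-env            # anchor: define the list once
            - name: HF_HOME
              value: /shared/hf-cache
          volumeMounts: &common-mounts
            - name: fsx
              mountPath: /shared
workerGroupSpecs:
  - template:
      spec:
        containers:
          - name: ray-worker
            env: *common-env          # alias: reuse the anchored list verbatim
            volumeMounts: *common-mounts
```

Any edit to the anchored list now propagates to both head and worker specs, which is why it removes a whole class of copy-paste drift.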
KeitaW left a comment
Thanks for the PR. Left a few comments.
Summary
Details
This example shows how to run reinforcement learning-based LLM alignment using NeMo RL's GRPO algorithm on EKS GPU clusters with:
What's included
- README.md
- Dockerfile
- kubernetes/rayjob.yaml
- kubernetes/dataset-download-job.yaml
- scripts/rayjob_entrypoint.sh
- scripts/run_grpo_nvrx.py
- scripts/evaluate_before_after.py
- patches/patch_nvrx_features.py

Tested on
Test plan