
Add NeMo RL GRPO training with fault tolerance (NVRx) on EKS#1010

Open
dmvevents wants to merge 1 commit into awslabs:main from dmvevents:add-nemo-rl-grpo-example

Conversation

@dmvevents

Summary

  • Add a new test case for NVIDIA NeMo RL GRPO (Group Relative Policy Optimization) training on Amazon EKS with fault tolerance via NVIDIA Resiliency Extension (NVRx)
  • Includes a multi-arch Dockerfile (g5 A10G, g6e L40S, p5 H100), Kubernetes RayJob manifests, training scripts, and a workshop walkthrough
  • Demonstrates resilient LLM fine-tuning with automatic checkpoint recovery on GPU failure

Details

This example shows how to run reinforcement learning-based LLM alignment using NeMo RL's GRPO algorithm on EKS GPU clusters with:

  • KubeRay for declarative Ray cluster management
  • NVIDIA NVRx (ft_launcher, heartbeat monitoring, straggler detection) for process-level fault tolerance
  • Amazon FSx for Lustre for shared checkpoint storage and model caching
  • EFA networking via aws-ofi-nccl for low-latency GPU communication
  • LoRA for parameter-efficient training on smaller GPUs (A10G 24GB)
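
Operationally, these pieces come together in a short deploy sequence. The following is a sketch using the manifests included in this PR; the KubeRay `ray.io/node-type` label selector is KubeRay's standard label, but the exact pod labels in this example may differ:

```shell
# 1. Pre-cache the dataset and base model onto FSx for Lustre
kubectl apply -f kubernetes/dataset-download-job.yaml

# 2. Launch GRPO training as a KubeRay RayJob (KubeRay operator must be installed)
kubectl apply -f kubernetes/rayjob.yaml

# 3. Follow training progress from the Ray head pod
kubectl logs -l ray.io/node-type=head -f
```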

What's included

File                                    Purpose
README.md                               Overview, prerequisites, step-by-step instructions
Dockerfile                              Multi-arch container image (g5/g6e/p5)
kubernetes/rayjob.yaml                  RayJob manifest for EKS deployment
kubernetes/dataset-download-job.yaml    Pre-cache dataset and model to FSx
scripts/rayjob_entrypoint.sh            Training entrypoint
scripts/run_grpo_nvrx.py                NVRx wrapper (heartbeat + health check)
scripts/evaluate_before_after.py        Before/after training evaluation
patches/patch_nvrx_features.py          Runtime NVRx feature patches

Tested on

  • 2x g5.8xlarge (1x A10G each) with EKS 1.31, KubeRay 1.3.0
  • Qwen2.5-1.5B-Instruct, 20 GRPO steps, ~15 min end-to-end
  • Fault injection (kill -9) with automatic recovery from FSx checkpoint

Test plan

  • Deploy on fresh EKS cluster with 2x g5.8xlarge
  • Run dataset download job to completion
  • Deploy RayJob and verify 20/20 training steps complete
  • Inject fault at step 6+ and verify checkpoint recovery
  • Run evaluation script and verify improvement (3/6 -> 6/6)
  • Build Docker image from Dockerfile on CPU-only instance
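
The fault-injection step can be driven with something like the following sketch. The worker label selector and the process name pattern are illustrative assumptions, not taken from the PR's scripts:

```shell
# Pick a Ray worker pod and kill the training process with SIGKILL
POD=$(kubectl get pods -l ray.io/node-type=worker -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -- pkill -9 -f run_grpo_nvrx.py

# NVRx heartbeat monitoring should flag the dead rank; training then
# restarts from the most recent checkpoint on FSx. Watch for the resume:
kubectl logs -l ray.io/node-type=head -f | grep -i checkpoint
```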

New test case for NVIDIA NeMo RL GRPO training on Amazon EKS with:
- NVIDIA Resiliency Extension (NVRx) for fault tolerance
- Multi-arch Dockerfile (g5 A10G, g6e L40S, p5 H100)
- KubeRay RayJob manifests with FSx Lustre shared storage
- Training scripts with heartbeat monitoring and straggler detection
- Before/after evaluation demonstrating math reasoning improvement

Tested on 2x g5.8xlarge with Qwen2.5-1.5B-Instruct, 20 GRPO steps.

@KeitaW (Collaborator) left a comment


Review Batch 1/3 — Structure & Repository Hygiene

This PR adds a comprehensive new test case for NVIDIA NeMo RL GRPO training with fault tolerance (NVRx) on Amazon EKS. The contribution is well-structured and clearly written, though there are several repo-convention issues that I'd suggest addressing before merging.

Missing copyright and license headers on all files

None of the 8 files in this PR include the required copyright header:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

I'd suggest adding it to the top of each file — for the Dockerfile, before the first comment block; for Python files, before the docstring; for YAML files, as the first line(s); for the shell script, after the shebang line.

Reference: CLAUDE.md conventions
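
A quick local audit for missing headers could look like this (a sketch; the demo below uses throwaway files in `/tmp` rather than the PR's actual paths):

```shell
# Flag files whose first few lines lack the required SPDX identifier
check_header() {
  head -n 5 "$1" | grep -q 'SPDX-License-Identifier: MIT-0'
}

# Demo on two throwaway files: one compliant, one not
mkdir -p /tmp/hdr_demo
printf '# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n# SPDX-License-Identifier: MIT-0\n' > /tmp/hdr_demo/good.py
printf 'print("no header")\n' > /tmp/hdr_demo/bad.py

for f in /tmp/hdr_demo/*.py; do
  check_header "$f" || echo "missing header: $f"
done
# prints: missing header: /tmp/hdr_demo/bad.py
```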

Comment on lines +22 to +23
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .

Dockerfile header references wrong file path

This looks like a leftover from when the file was at a different path in the upstream NeMo RL repo. The actual file is 3.test_cases/pytorch/nemo-rl/Dockerfile, not docker/Dockerfile.workshop.

I'd suggest updating these comments to match the actual location in this repo:

Suggested change
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -t nemo-rl-workshop .

Comment on lines +182 to +184
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
| tar xz && \

EFA installer uses unpinned latest URL

The EFA installer is fetched via aws-efa-installer-latest.tar.gz, which resolves to whatever the current version is at build time. This violates the repo convention that all external dependencies must be pinned to a specific version.

I'd suggest pinning to a specific version, consistent with how OFI_NCCL_VERSION is already pinned:

Suggested change
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
| tar xz && \
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-1.47.0.tar.gz \
| tar xz && \

Alternatively, add an ARG EFA_INSTALLER_VERSION=1.47.0 alongside the other version ARGs and use it here.

Reference: the CI version check enforces EFA >= 1.47.0, so any pinned version should be at least 1.47.0.
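
The ARG-based variant would look roughly like this (a sketch; the install flags match those the Dockerfile already uses, and the version is chosen to satisfy the CI check):

```dockerfile
# Pin the EFA installer version alongside the other version ARGs
ARG EFA_INSTALLER_VERSION=1.47.0

RUN cd /tmp && \
    curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz \
    | tar xz && \
    cd aws-efa-installer && \
    ./efa_installer.sh -y --skip-kmod --skip-limit-conf --no-verify && \
    rm -rf /tmp/aws-efa-installer*
```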


@KeitaW (Collaborator) left a comment


Review Batch 2/3 — Deployment Pipeline & Infrastructure

@@ -0,0 +1,89 @@
#!/bin/bash
set -e

Shell script uses set -e instead of set -euo pipefail

The repo convention requires shell scripts to start with set -euo pipefail (or set -ex at minimum). Using only set -e means unset variable references won't be caught (-u) and failures in piped commands won't propagate (-o pipefail).

Suggested change
set -e
set -exuo pipefail
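
The difference is easy to demonstrate (a standalone sketch, not code from the PR):

```shell
# With `set -e` alone, a pipeline's status is that of its LAST command,
# so a failing producer is silently ignored:
bash -c 'set -e; false | cat; echo "survived: $?"'
# -> survived: 0

# With pipefail, the producer's failure propagates and `set -e` aborts:
bash -c 'set -eo pipefail; false | cat; echo "not reached"' || echo "aborted"
# -> aborted
```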

# 3. NVRx scripts on FSx: /shared/nvrx-demo/{patches,scripts}/
# 4. NVIDIA + EFA device plugins on g5 nodes
#
# Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml

rayjob.yaml comment references wrong filename

The deploy comment references g5-rayjob-qwen-nvrx.yaml but the actual filename is rayjob.yaml.

Suggested change
# Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml
# Deploy: kubectl apply -f kubernetes/rayjob.yaml

Comment on lines +85 to +86
- name: NCCL_SOCKET_IFNAME
value: "ens5"

NCCL_SOCKET_IFNAME uses positive interface selection

The repo convention requires exclusion-based patterns (^lo for K8s, ^docker,lo,veth for Dockerfiles), not positive selection like ens5. While the comment explains the g5 dual-ENI rationale, positive selection breaks portability on other instance types.

I noticed the Dockerfile correctly uses ^lo,docker,veth,eni (line 218). I'd suggest using an exclusion pattern here too, since hostNetwork: true means container interfaces match the host:

Suggested change
- name: NCCL_SOCKET_IFNAME
value: "ens5"
- name: NCCL_SOCKET_IFNAME
value: "^lo,docker,veth"

Reference: EFA cheatsheet.


@KeitaW (Collaborator) left a comment


Review Batch 3/3 — Things That Look Great

  • Excellent README structure: The walkthrough is thorough, with clear architecture diagrams, a troubleshooting table, and step-by-step instructions. The ASCII architecture and resiliency stack diagrams are genuinely helpful.
  • Multi-arch Dockerfile is well-engineered: The multi-stage build with clear separation (clone → deps → python → release → EFA) is clean. The CUDA arch handling (SM 9.0-only for deep_ep/deep_gemm) shows deep understanding of the compilation requirements.
  • YAML anchor reuse: The &common-env, &common-mounts, and &common-volumes anchors in rayjob.yaml eliminate duplication between head and worker specs — exactly the right pattern.
  • Defensive patching code: patch_nvrx_features.py is careful about idempotency (checks if already patched), provides clear logging, and handles missing files gracefully.
  • Workshop-optimized design: The CLEAR_CHECKPOINTS env var, pre-cache job, and evaluation script show this was designed for live demos with real operator ergonomics in mind.
  • Conservative backoffLimit: backoffLimit: 2 in both manifests follows the repo convention for GPU jobs.
  • HF_HOME over TRANSFORMERS_CACHE: Correctly uses the non-deprecated env var throughout.
  • EFA configuration in Dockerfile: Proper --skip-kmod --skip-limit-conf --no-verify flags and exclusion-based NCCL_SOCKET_IFNAME pattern.
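
For readers unfamiliar with the pattern, a minimal sketch of the anchor/alias reuse (only the &common-env name comes from the PR; all field values here are illustrative):

```yaml
headGroupSpec:
  template:
    spec:
      containers:
        - name: ray-head
          env: &common-env            # define the shared block once...
            - name: HF_HOME
              value: /shared/hf-cache
workerGroupSpecs:
  - template:
      spec:
        containers:
          - name: ray-worker
            env: *common-env          # ...and alias it everywhere else
```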


@KeitaW (Collaborator) left a comment


Thanks for the PR. Left a few comments.

