
Add NeMo RL GRPO training with fault tolerance (NVRx) on EKS#1010

Open
dmvevents wants to merge 1 commit into awslabs:main from dmvevents:add-nemo-rl-grpo-example

Conversation

@dmvevents

Summary

  • Add a new test case for NVIDIA NeMo RL GRPO (Group Relative Policy Optimization) training on Amazon EKS with fault tolerance via NVIDIA Resiliency Extension (NVRx)
  • Includes a multi-arch Dockerfile (g5 A10G, g6e L40S, p5 H100), Kubernetes RayJob manifests, training scripts, and a workshop walkthrough
  • Demonstrates resilient LLM fine-tuning with automatic checkpoint recovery on GPU failure

Details

This example shows how to run reinforcement learning-based LLM alignment using NeMo RL's GRPO algorithm on EKS GPU clusters with:

  • KubeRay for declarative Ray cluster management
  • NVIDIA NVRx (ft_launcher, heartbeat monitoring, straggler detection) for process-level fault tolerance
  • Amazon FSx for Lustre for shared checkpoint storage and model caching
  • EFA networking via aws-ofi-nccl for low-latency GPU communication
  • LoRA for parameter-efficient training on smaller GPUs (A10G 24GB)
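
Operationally, these pieces come together in a short deploy sequence. The following is a sketch using the manifests included in this PR; the KubeRay `ray.io/node-type` label selector is KubeRay's standard label, but the exact pod labels in this example may differ:

```shell
# 1. Pre-cache the dataset and base model onto FSx for Lustre
kubectl apply -f kubernetes/dataset-download-job.yaml

# 2. Launch GRPO training as a KubeRay RayJob (KubeRay operator must be installed)
kubectl apply -f kubernetes/rayjob.yaml

# 3. Follow training progress from the Ray head pod
kubectl logs -l ray.io/node-type=head -f
```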

What's included

File                                    Purpose
README.md                               Overview, prerequisites, step-by-step instructions
Dockerfile                              Multi-arch container image (g5/g6e/p5)
kubernetes/rayjob.yaml                  RayJob manifest for EKS deployment
kubernetes/dataset-download-job.yaml    Pre-cache dataset and model to FSx
scripts/rayjob_entrypoint.sh            Training entrypoint
scripts/run_grpo_nvrx.py                NVRx wrapper (heartbeat + health check)
scripts/evaluate_before_after.py        Before/after training evaluation
patches/patch_nvrx_features.py          Runtime NVRx feature patches

Tested on

  • 2x g5.8xlarge (1x A10G each) with EKS 1.31, KubeRay 1.3.0
  • Qwen2.5-1.5B-Instruct, 20 GRPO steps, ~15 min end-to-end
  • Fault injection (kill -9) with automatic recovery from FSx checkpoint

Test plan

  • Deploy on fresh EKS cluster with 2x g5.8xlarge
  • Run dataset download job to completion
  • Deploy RayJob and verify 20/20 training steps complete
  • Inject fault at step 6+ and verify checkpoint recovery
  • Run evaluation script and verify improvement (3/6 -> 6/6)
  • Build Docker image from Dockerfile on CPU-only instance
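
The fault-injection step can be driven with something like the following sketch. The worker label selector and the process name pattern are illustrative assumptions, not taken from the PR's scripts:

```shell
# Pick a Ray worker pod and kill the training process with SIGKILL
POD=$(kubectl get pods -l ray.io/node-type=worker -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -- pkill -9 -f run_grpo_nvrx.py

# NVRx heartbeat monitoring should flag the dead rank; training then
# restarts from the most recent checkpoint on FSx. Watch for the resume:
kubectl logs -l ray.io/node-type=head -f | grep -i checkpoint
```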

New test case for NVIDIA NeMo RL GRPO training on Amazon EKS with:
- NVIDIA Resiliency Extension (NVRx) for fault tolerance
- Multi-arch Dockerfile (g5 A10G, g6e L40S, p5 H100)
- KubeRay RayJob manifests with FSx Lustre shared storage
- Training scripts with heartbeat monitoring and straggler detection
- Before/after evaluation demonstrating math reasoning improvement

Tested on 2x g5.8xlarge with Qwen2.5-1.5B-Instruct, 20 GRPO steps.

@KeitaW (Collaborator) left a comment


Review Batch 1/3 — Structure & Repository Hygiene

This PR adds a comprehensive new test case for NVIDIA NeMo RL GRPO training with fault tolerance (NVRx) on Amazon EKS. The contribution is well-structured and clearly written, though there are several repo-convention issues that I'd suggest addressing before merging.

Missing copyright and license headers on all files

None of the 8 files in this PR include the required copyright header:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

I'd suggest adding it to the top of each file — for the Dockerfile, before the first comment block; for Python files, before the docstring; for YAML files, as the first line(s); for the shell script, after the shebang line.

Reference: CLAUDE.md conventions
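
A quick local audit for missing headers could look like this (a sketch; the demo below uses throwaway files in `/tmp` rather than the PR's actual paths):

```shell
# Flag files whose first few lines lack the required SPDX identifier
check_header() {
  head -n 5 "$1" | grep -q 'SPDX-License-Identifier: MIT-0'
}

# Demo on two throwaway files: one compliant, one not
mkdir -p /tmp/hdr_demo
printf '# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n# SPDX-License-Identifier: MIT-0\n' > /tmp/hdr_demo/good.py
printf 'print("no header")\n' > /tmp/hdr_demo/bad.py

for f in /tmp/hdr_demo/*.py; do
  check_header "$f" || echo "missing header: $f"
done
# prints: missing header: /tmp/hdr_demo/bad.py
```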

Comment on lines +22 to +23
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .

Dockerfile header references wrong file path

This looks like a leftover from when the file was at a different path in the upstream NeMo RL repo. The actual file is 3.test_cases/pytorch/nemo-rl/Dockerfile, not docker/Dockerfile.workshop.

I'd suggest updating these comments to match the actual location in this repo:

Suggested change
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -t nemo-rl-workshop .

Comment on lines +182 to +184
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
| tar xz && \

EFA installer uses unpinned latest URL

The EFA installer is fetched via aws-efa-installer-latest.tar.gz, which resolves to whatever the current version is at build time. This violates the repo convention that all external dependencies must be pinned to a specific version.

I'd suggest pinning to a specific version, consistent with how OFI_NCCL_VERSION is already pinned:

Suggested change
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
| tar xz && \
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-1.47.0.tar.gz \
| tar xz && \

Alternatively, add an ARG EFA_INSTALLER_VERSION=1.47.0 alongside the other version ARGs and use it here.

Reference: the CI version check enforces EFA >= 1.47.0, so any pinned version should be at least 1.47.0.
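
The ARG-based variant would look roughly like this (a sketch; the install flags match those the Dockerfile already uses, and the version is chosen to satisfy the CI check):

```dockerfile
# Pin the EFA installer version alongside the other version ARGs
ARG EFA_INSTALLER_VERSION=1.47.0

RUN cd /tmp && \
    curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz \
    | tar xz && \
    cd aws-efa-installer && \
    ./efa_installer.sh -y --skip-kmod --skip-limit-conf --no-verify && \
    rm -rf /tmp/aws-efa-installer*
```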


@KeitaW (Collaborator) left a comment


Review Batch 2/3 — Deployment Pipeline & Infrastructure

@@ -0,0 +1,89 @@
#!/bin/bash
set -e

Shell script uses set -e instead of set -euo pipefail

The repo convention requires shell scripts to start with set -euo pipefail (or set -ex at minimum). Using only set -e means unset variable references won't be caught (-u) and failures in piped commands won't propagate (-o pipefail).

Suggested change
set -e
set -exuo pipefail
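
The difference is easy to demonstrate (a standalone sketch, not code from the PR):

```shell
# With `set -e` alone, a pipeline's status is that of its LAST command,
# so a failing producer is silently ignored:
bash -c 'set -e; false | cat; echo "survived: $?"'
# -> survived: 0

# With pipefail, the producer's failure propagates and `set -e` aborts:
bash -c 'set -eo pipefail; false | cat; echo "not reached"' || echo "aborted"
# -> aborted
```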

# 3. NVRx scripts on FSx: /shared/nvrx-demo/{patches,scripts}/
# 4. NVIDIA + EFA device plugins on g5 nodes
#
# Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml

rayjob.yaml comment references wrong filename

The deploy comment references g5-rayjob-qwen-nvrx.yaml but the actual filename is rayjob.yaml.

Suggested change
# Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml
# Deploy: kubectl apply -f kubernetes/rayjob.yaml

Comment on lines +85 to +86
- name: NCCL_SOCKET_IFNAME
value: "ens5"

NCCL_SOCKET_IFNAME uses positive interface selection

The repo convention requires exclusion-based patterns (^lo for K8s, ^docker,lo,veth for Dockerfiles), not positive selection like ens5. While the comment explains the g5 dual-ENI rationale, positive selection breaks portability on other instance types.

I noticed the Dockerfile correctly uses ^lo,docker,veth,eni (line 218). I'd suggest using an exclusion pattern here too, since hostNetwork: true means container interfaces match the host:

Suggested change
- name: NCCL_SOCKET_IFNAME
value: "ens5"
- name: NCCL_SOCKET_IFNAME
value: "^lo,docker,veth"

Reference: EFA cheatsheet.


@KeitaW (Collaborator) left a comment


Review Batch 3/3 — Things That Look Great

  • Excellent README structure: The walkthrough is thorough, with clear architecture diagrams, a troubleshooting table, and step-by-step instructions. The ASCII architecture and resiliency stack diagrams are genuinely helpful.
  • Multi-arch Dockerfile is well-engineered: The multi-stage build with clear separation (clone → deps → python → release → EFA) is clean. The CUDA arch handling (SM 9.0-only for deep_ep/deep_gemm) shows deep understanding of the compilation requirements.
  • YAML anchor reuse: The &common-env, &common-mounts, and &common-volumes anchors in rayjob.yaml eliminate duplication between head and worker specs — exactly the right pattern.
  • Defensive patching code: patch_nvrx_features.py is careful about idempotency (checks if already patched), provides clear logging, and handles missing files gracefully.
  • Workshop-optimized design: The CLEAR_CHECKPOINTS env var, pre-cache job, and evaluation script show this was designed for live demos with real operator ergonomics in mind.
  • Conservative backoffLimit: backoffLimit: 2 in both manifests follows the repo convention for GPU jobs.
  • HF_HOME over TRANSFORMERS_CACHE: Correctly uses the non-deprecated env var throughout.
  • EFA configuration in Dockerfile: Proper --skip-kmod --skip-limit-conf --no-verify flags and exclusion-based NCCL_SOCKET_IFNAME pattern.
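
For readers unfamiliar with the pattern, a minimal sketch of the anchor/alias reuse (only the &common-env name comes from the PR; all field values here are illustrative):

```yaml
headGroupSpec:
  template:
    spec:
      containers:
        - name: ray-head
          env: &common-env            # define the shared block once...
            - name: HF_HOME
              value: /shared/hf-cache
workerGroupSpecs:
  - template:
      spec:
        containers:
          - name: ray-worker
            env: *common-env          # ...and alias it everywhere else
```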


@KeitaW (Collaborator) left a comment


Thanks for the PR. Left a few comments.

