diff --git a/3.test_cases/23.SMHP-esm2/README.md b/3.test_cases/23.SMHP-esm2/README.md index 8780202d5..7eaab4ead 100644 --- a/3.test_cases/23.SMHP-esm2/README.md +++ b/3.test_cases/23.SMHP-esm2/README.md @@ -1,5 +1,18 @@ # How to finetune ESM2 with SageMaker Hyperpod using Amazon G5 instances +## Tested Configurations + +| Instance | GPUs | Model | Status | Notes | +|----------|------|-------|--------|-------| +| g5.24xlarge | 4 x A10G 24 GB | ESM2 150M | Tested | Primary target | +| g5.12xlarge | 4 x A10G 24 GB | ESM2 150M | Tested | See benchmark tables below | +| p5.48xlarge | 8 x H100 80 GB | ESM2 150M | Tested | See benchmark tables below | +| p4de.24xlarge | 8 x A100 80 GB | ESM2 | Untested | Expected to work | +| p5en.48xlarge | 8 x H200 141 GB | ESM2 | Untested | Expected to work | + +> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md) +> for parameter adjustments needed across instance types. + ## What is SageMaker Hyperpod? [Amazon SageMaker Hyperpod](https://aws.amazon.com/sagemaker/hyperpod/) offers advanced training tools to help you accelerate scalable, reliable, and secure generative AI application development. It removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure for training foundation models (FMs), significantly reducing training time. SageMaker Hyperpod ensures customers can continue FM training uninterrupted by periodically saving checkpoints. When a hardware failure occurs during training, SageMaker Hyperpod automatically detects the failure, repairs or replaces the faulty instance, and resumes training from the last saved checkpoint, removing the need for customers to manually manage this process and helping them train for weeks or months in a distributed setting without disruption.
diff --git a/3.test_cases/jax/README.md b/3.test_cases/jax/README.md index 0195ff75c..d3a3f4374 100644 --- a/3.test_cases/jax/README.md +++ b/3.test_cases/jax/README.md @@ -2,6 +2,18 @@ This directory contains a sample Dockerfile `jax_paxml.Dockerfile` to run [JAX](https://github.com/google/jax) and [Paxml](https://github.com/google/paxml) on AWS. +## Tested Configurations + +| Instance | GPUs | Status | Notes | +|----------|------|--------|-------| +| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work | +| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work | +| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work | +| g5.12xlarge | 4 x A10G 24 GB | Untested | May need smaller model configs | + +> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md) +> for parameter adjustments needed across instance types. + ## Container description In principle, the reference `Dockerfile` does the following: diff --git a/3.test_cases/megatron/bionemo/2.esm1nv_pretrain.slurm b/3.test_cases/megatron/bionemo/2.esm1nv_pretrain.slurm index 8470b0ae9..b85a5561f 100644 --- a/3.test_cases/megatron/bionemo/2.esm1nv_pretrain.slurm +++ b/3.test_cases/megatron/bionemo/2.esm1nv_pretrain.slurm @@ -5,7 +5,37 @@ #SBATCH --exclusive # exclusive node access #SBATCH --output slurm-esm1nv-train-%j.out -export FI_EFA_USE_HUGE_PAGE=0 +########################### +###### Instance Profile ### +########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, MICRO_BATCH_SIZE, EFA/NCCL vars. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details.
+ +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, p4de-style)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, p4de-style)." +fi + +# Fallback defaults when no profile is loaded (assumes P4de-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} + +# EFA — configured by profile or legacy default +if [[ "$PROFILE_LOADED" != "1" ]]; then + export FI_EFA_USE_HUGE_PAGE=0 +fi ########################### @@ -26,7 +56,7 @@ declare -a ARGS=( # Training parameters # ========================= -MICRO_BATCH_SIZE=256 # micro batch size per GPU, for best efficiency should be set to occupy ~85% of GPU memory. Suggested value for A100 80GB is 256 +MICRO_BATCH_SIZE=${MICRO_BATCH_SIZE:-256} # micro batch size per GPU, for best efficiency should be set to occupy ~85% of GPU memory. 
Suggested value for A100 80GB is 256 ACCUMULATE_GRAD_BATCHES=1 # gradient accumulation TENSOR_MODEL_PARALLEL_SIZE=1 # tensor model parallel size VAL_CHECK_INTERVAL=500 # how often validation step is performed, including downstream task validation diff --git a/3.test_cases/megatron/bionemo/README.md b/3.test_cases/megatron/bionemo/README.md index be54a5e7e..21e515d85 100644 --- a/3.test_cases/megatron/bionemo/README.md +++ b/3.test_cases/megatron/bionemo/README.md @@ -17,6 +17,18 @@ NVIDIA BioNeMo is a domain-specific machine learning framework for training and This project provides a guide to run [Nvidia's BioNemo](https://docs.nvidia.com/bionemo-framework/latest/index.html) on AWS ParallelCluster and pretrain the popular [ESM models](https://github.com/facebookresearch/esm), specifically the [ESM1nv](https://docs.nvidia.com/bionemo-framework/latest/notebooks/model_training_esm1nv.html) model. +## Tested Configurations + +| Instance | GPUs | Status | Notes | +|----------|------|--------|-------| +| p4de.24xlarge | 8 x A100 80 GB | Tested | Primary target (4 nodes) | +| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work | +| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work | +| g5.12xlarge | 4 x A10G 24 GB | Untested | May need smaller model or offloading | + +> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md) +> for parameter adjustments needed across instance types. + ## 0. Prerequisites 0. You have access to the bionemo container. To get access to BioNeMo, visit the [information website](https://www.nvidia.com/en-us/clara/bionemo/).
diff --git a/3.test_cases/megatron/bionemo/bionemo_2.5/train-esm.sbatch b/3.test_cases/megatron/bionemo/bionemo_2.5/train-esm.sbatch index e231741e5..42a3bfb66 100644 --- a/3.test_cases/megatron/bionemo/bionemo_2.5/train-esm.sbatch +++ b/3.test_cases/megatron/bionemo/bionemo_2.5/train-esm.sbatch @@ -5,9 +5,39 @@ #SBATCH --exclusive # exclusive node access #SBATCH --output slurm-esm2-train-%j.out -#export FI_EFA_USE_HUGE_PAGE=0 #Uncomment if you get os.fork() memory error -export FI_PROVIDER=efa -export NCCL_DEBUG=INFO +########################### +###### Instance Profile ### +########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, EFA/NCCL vars. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See ../profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." 
+fi + +# Fallback defaults when no profile is loaded +GPUS_PER_NODE=${GPUS_PER_NODE:-8} + +# EFA — configured by profile or legacy defaults +if [[ "$PROFILE_LOADED" != "1" ]]; then + #export FI_EFA_USE_HUGE_PAGE=0 #Uncomment if you get os.fork() memory error + export FI_PROVIDER=efa + export NCCL_DEBUG=INFO +fi #Path to store data and checkpoints export DATA_HOME_DIR=/fsxl/awsankur/bionemo @@ -36,8 +66,8 @@ srun -l "${ARGS[@]}" python3 /workspace/bionemo2/sub-packages/bionemo-esm2/src/ --valid-cluster-path ${DATA_DIR}/2024_03_sanity/valid_clusters.parquet \ --valid-database-path ${DATA_DIR}/2024_03_sanity/validation.db \ --precision="bf16-mixed" \ - --num-gpus 8 \ - --num-nodes 2 \ + --num-gpus ${GPUS_PER_NODE} \ + --num-nodes ${SLURM_JOB_NUM_NODES} \ --num-steps 100 \ --val-check-interval 25 \ --max-seq-length 1024 \ diff --git a/3.test_cases/megatron/bionemo/profiles/README.md b/3.test_cases/megatron/bionemo/profiles/README.md new file mode 100644 index 000000000..80ce10555 --- /dev/null +++ b/3.test_cases/megatron/bionemo/profiles/README.md @@ -0,0 +1,46 @@ +# BioNeMo Instance Profiles + +Instance profiles configure GPU count, micro-batch size, and EFA/NCCL +networking variables for each supported EC2 instance type. Model architecture +parameters (num_layers, hidden_size, etc.) are handled by the training scripts +or BioNeMo config files. + +## Auto-detection + +The training scripts auto-detect the running instance type and source the +matching `.env` profile. Override with: + +```bash +export INSTANCE_PROFILE=g5-12xlarge +``` + +See [docs/instance-compatibility.md](../../../docs/instance-compatibility.md) +for full details. 
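For illustration, the override-then-convert resolution that `_detect.sh` performs can be sketched as a self-contained function. This is a sketch only: the helper name `resolve_profile_name` is ours, and the real script additionally queries IMDSv2 and `nvidia-smi` when `INSTANCE_TYPE` is unset.

```shell
# Sketch of the profile-name resolution order used by _detect.sh:
# 1) INSTANCE_PROFILE wins outright;
# 2) otherwise INSTANCE_TYPE is converted "g5.12xlarge" -> "g5-12xlarge"
#    (dots become dashes, matching the .env filenames).
resolve_profile_name() {
  if [ -n "${INSTANCE_PROFILE:-}" ]; then
    echo "${INSTANCE_PROFILE}"
  else
    local itype="${INSTANCE_TYPE:?set INSTANCE_TYPE or INSTANCE_PROFILE}"
    echo "${itype//./-}"
  fi
}

INSTANCE_TYPE="p4de.24xlarge"
resolve_profile_name                                   # -> p4de-24xlarge
INSTANCE_PROFILE="g5-12xlarge" resolve_profile_name    # -> g5-12xlarge
```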
+ +## Available Profiles + +| Profile | Instance | GPUs | VRAM | EFA | Default MBS | Status | +|---------|----------|------|------|-----|-------------|--------| +| `p5en-48xlarge.env` | p5en.48xlarge | 8x H200 | 141 GB | 32 adapters | 256 | Supported | +| `p5-48xlarge.env` | p5.48xlarge | 8x H100 | 80 GB | 32 adapters | 256 | Supported | +| `p4de-24xlarge.env` | p4de.24xlarge | 8x A100 | 80 GB | 4 adapters | 256 | Supported (original target) | +| `g6e-12xlarge.env` | g6e.12xlarge | 4x L40S | 48 GB | None | 128 | Experimental | +| `g5-12xlarge.env` | g5.12xlarge | 4x A10G | 24 GB | None | 64 | Experimental | + +## Model Compatibility + +### ESM-1nv (BioNeMo 1.2, `2.esm1nv_pretrain.slurm`) + +The key tunable is `MICRO_BATCH_SIZE`; a value of 256 occupies ~85% of GPU +memory on an A100 80 GB. Profile-sourced MBS values: + +| Instance | VRAM | Profile MBS | Notes | +|----------|------|-------------|-------| +| p5en/p5/p4de | 80-141 GB | 256 | Original documented value | +| g6e | 48 GB | 128 | Estimated; tune based on actual usage | +| g5 | 24 GB | 64 | Estimated; may need further reduction | + +### ESM-2 (BioNeMo 2.5, `bionemo_2.5/train-esm.sbatch`) + +Uses a fixed MBS=2 with the 650M-parameter model. Fits on all instance types. +The profile's `GPUS_PER_NODE` adjusts `--num-gpus` and SBATCH `--gpus-per-node`. diff --git a/3.test_cases/megatron/bionemo/profiles/_detect.sh b/3.test_cases/megatron/bionemo/profiles/_detect.sh new file mode 100755 index 000000000..896664395 --- /dev/null +++ b/3.test_cases/megatron/bionemo/profiles/_detect.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# _detect.sh — Auto-detect EC2 instance type and resolve a profile. +# +# CANONICAL SOURCE: This is the single source of truth for instance detection. +# Copies exist at profiles/_detect.sh in each test case.
To update +# all copies, edit this file and run: ./sync_profiles.sh +# +# Usage (from a training script): +# PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles") +# source "$PROFILE_ENV" +# +# Detection order: +# 1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge") +# 2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge") +# 3. EC2 instance metadata API (works on bare metal and K8s with host networking) +# 4. GPU name from nvidia-smi (fallback when metadata is unavailable) +# +# Outputs the path to the profile .env file on stdout. +# Exits non-zero if no profile can be resolved. +# --------------------------------------------------------------------------- +set -euo pipefail + +PROFILES_DIR="${1:-.}" + +# --- Step 1: Check for explicit INSTANCE_PROFILE override ------------------- +if [[ -n "${INSTANCE_PROFILE:-}" ]]; then + PROFILE_NAME="$INSTANCE_PROFILE" + echo "Instance profile override: ${PROFILE_NAME}" >&2 +else + # --- Step 2: Try INSTANCE_TYPE from env_vars ---------------------------- + INSTANCE_TYPE="${INSTANCE_TYPE:-}" + + # --- Step 3: Try EC2 instance metadata API ------------------------------ + if [[ -z "$INSTANCE_TYPE" ]]; then + # IMDSv2: get a token first, then query + TOKEN=$(curl -s --connect-timeout 2 -X PUT \ + "http://169.254.169.254/latest/api/token" \ + -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true + if [[ -n "$TOKEN" ]]; then + INSTANCE_TYPE=$(curl -s --connect-timeout 2 \ + -H "X-aws-ec2-metadata-token: $TOKEN" \ + "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true + fi + fi + + # --- Step 4: Fallback — detect from GPU name ---------------------------- + if [[ -z "$INSTANCE_TYPE" ]]; then + if command -v nvidia-smi &>/dev/null; then + GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) || true + case "${GPU_NAME:-}" in + *A10G*) INSTANCE_TYPE="g5.12xlarge" ;; + *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;; + *A100*40*) 
INSTANCE_TYPE="p4d.24xlarge" ;; + *H100*) INSTANCE_TYPE="p5.48xlarge" ;; + *H200*) INSTANCE_TYPE="p5en.48xlarge" ;; + *L40S*) INSTANCE_TYPE="g6e.12xlarge" ;; + *L4*) INSTANCE_TYPE="g6.12xlarge" ;; + *) + echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + ;; + esac + echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2 + else + echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + fi + fi + + # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge" + PROFILE_NAME="${INSTANCE_TYPE//./-}" + echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2 +fi + +# --- Resolve profile file path ---------------------------------------------- +PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env" + +if [[ ! -f "$PROFILE_PATH" ]]; then + echo "ERROR: No profile found at ${PROFILE_PATH}" >&2 + echo "" >&2 + echo "Available profiles:" >&2 + ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\// /' >&2 || echo " (none)" >&2 + echo "" >&2 + echo "To create a new profile, copy an existing one and adjust the values:" >&2 + echo " cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2 + exit 1 +fi + +# Output the resolved path (recipe script will source it) +echo "$PROFILE_PATH" diff --git a/3.test_cases/megatron/bionemo/profiles/g5-12xlarge.env b/3.test_cases/megatron/bionemo/profiles/g5-12xlarge.env new file mode 100644 index 000000000..3e148a724 --- /dev/null +++ b/3.test_cases/megatron/bionemo/profiles/g5-12xlarge.env @@ -0,0 +1,26 @@ +# g5.12xlarge — 4x A10G 24GB, no EFA, no NVLink, no GPUDirect RDMA +# Severely memory-constrained for BioNeMo. ESM-1nv with MBS=256 will OOM. +# +# MODEL COMPATIBILITY (g5.12xlarge, 4x A10G 24GB each): +# - ESM-1nv (pretrain_small): Must reduce MBS dramatically (try MBS=32-64). 
+# The original script says "A100 80GB → 256". 24GB is ~3.3x less VRAM, +# so MBS ~64-80 may fit. Start with 64 and adjust. +# - ESM-2 (650M, BioNeMo 2.5): MBS=2 should fit (small model). +# +# Key differences from p4de/p5/p5en: +# - 4 GPUs instead of 8 +# - No EFA +# - 24GB VRAM → ESM-1nv micro batch size must be reduced + +# --- Hardware --- +export GPUS_PER_NODE=4 + +# --- Training defaults --- +# Reduced MBS for 24GB VRAM. This is a starting point — tune based on +# actual memory usage. ESM-1nv may still OOM; try reducing further to 32. +export MICRO_BATCH_SIZE=64 + +# --- EFA / NCCL --- +# No EFA on g5 — do NOT set FI_PROVIDER or FI_EFA_* variables. +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" +export NCCL_DEBUG=INFO diff --git a/3.test_cases/megatron/bionemo/profiles/g6e-12xlarge.env b/3.test_cases/megatron/bionemo/profiles/g6e-12xlarge.env new file mode 100644 index 000000000..5bd19d385 --- /dev/null +++ b/3.test_cases/megatron/bionemo/profiles/g6e-12xlarge.env @@ -0,0 +1,20 @@ +# g6e.12xlarge — 4x L40S 48GB, no EFA, no NVLink, no GPUDirect RDMA +# Moderate VRAM; ESM-1nv MBS can be higher than g5 but lower than p4de. +# +# MODEL COMPATIBILITY (g6e.12xlarge, 4x L40S 48GB each): +# - ESM-1nv (pretrain_small): MBS ~128-160 may fit (48GB vs 80GB). +# Start with 128 and adjust upward. +# - ESM-2 (650M, BioNeMo 2.5): MBS=2 fits easily. + +# --- Hardware --- +export GPUS_PER_NODE=4 + +# --- Training defaults --- +# Scaled MBS for 48GB VRAM (60% of A100's 80GB → ~60% of 256 ≈ 150). +# Start conservative at 128. +export MICRO_BATCH_SIZE=128 + +# --- EFA / NCCL --- +# No EFA on g6e — do NOT set FI_PROVIDER or FI_EFA_* variables. 
+export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" +export NCCL_DEBUG=INFO diff --git a/3.test_cases/megatron/bionemo/profiles/p4de-24xlarge.env b/3.test_cases/megatron/bionemo/profiles/p4de-24xlarge.env new file mode 100644 index 000000000..d7aea774f --- /dev/null +++ b/3.test_cases/megatron/bionemo/profiles/p4de-24xlarge.env @@ -0,0 +1,19 @@ +# p4de.24xlarge — 8x A100 80GB, 4 EFA, NVLink, GPUDirect RDMA +# This is the primary target instance for BioNeMo. The original scripts +# were written for 4x p4de.24xlarge nodes. +# +# MODEL ASSUMPTIONS: +# ESM-1nv: "Suggested value for A100 80GB is 256" (micro batch size) +# ESM-2 (650M): MBS=2 is the documented value + +# --- Hardware --- +export GPUS_PER_NODE=8 + +# --- Training defaults --- +export MICRO_BATCH_SIZE=256 + +# --- EFA / NCCL --- +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export NCCL_SOCKET_IFNAME="^docker,lo,veth" +export NCCL_DEBUG=INFO diff --git a/3.test_cases/megatron/bionemo/profiles/p5-48xlarge.env b/3.test_cases/megatron/bionemo/profiles/p5-48xlarge.env new file mode 100644 index 000000000..bcec0eae4 --- /dev/null +++ b/3.test_cases/megatron/bionemo/profiles/p5-48xlarge.env @@ -0,0 +1,18 @@ +# p5.48xlarge — 8x H100 80GB, 32 EFA, NVLink, GPUDirect RDMA +# +# MODEL ASSUMPTIONS: +# ESM-1nv: MBS=256 fits (same 80GB VRAM as A100) +# ESM-2 (650M): MBS=2 + +# --- Hardware --- +export GPUS_PER_NODE=8 + +# --- Training defaults --- +export MICRO_BATCH_SIZE=256 + +# --- EFA / NCCL --- +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME="^docker,lo,veth" +export NCCL_DEBUG=INFO diff --git a/3.test_cases/megatron/bionemo/profiles/p5en-48xlarge.env b/3.test_cases/megatron/bionemo/profiles/p5en-48xlarge.env new file mode 100644 index 000000000..562d18b5c --- /dev/null +++ b/3.test_cases/megatron/bionemo/profiles/p5en-48xlarge.env @@ -0,0 +1,21 @@ +# p5en.48xlarge — 8x H200 141GB, 32 EFA, NVLink, GPUDirect RDMA +# +# MODEL ASSUMPTIONS: 
+# Instance-driven: GPUS_PER_NODE, EFA/NCCL vars +# Coupling: MICRO_BATCH_SIZE depends on GPU VRAM +# ESM-1nv (BioNeMo 1.2): MBS=256 fits comfortably (80GB+ VRAM) +# ESM-2 (BioNeMo 2.5): MBS=2 is typical for 650M model + +# --- Hardware --- +export GPUS_PER_NODE=8 + +# --- Training defaults --- +# ESM-1nv: MBS=256 occupies ~85% of 80GB VRAM (per original inline comment) +# H200 141GB allows even larger MBS, but 256 is already efficient. +export MICRO_BATCH_SIZE=256 + +# --- EFA / NCCL --- +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export NCCL_SOCKET_IFNAME="^docker,lo,veth" +export NCCL_DEBUG=INFO diff --git a/3.test_cases/megatron/megatron-lm/README.md b/3.test_cases/megatron/megatron-lm/README.md index 9fe90e3fd..48f406ceb 100755 --- a/3.test_cases/megatron/megatron-lm/README.md +++ b/3.test_cases/megatron/megatron-lm/README.md @@ -1,5 +1,17 @@ # MegatronLM Test Case +## Tested Configurations + +| Instance | GPUs | Status | Notes | +|----------|------|--------|-------| +| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work | +| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work | +| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work | +| g5.12xlarge | 4 x A10G 24 GB | Untested | May need adjusted TP/PP for smaller VRAM | + +> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md) +> for parameter adjustments needed across instance types. + [MegatronLM](https://github.com/NVIDIA/Megatron-LM) is a framework from Nvidia designed for training large language models (LLMs).
We recommend reading the following papers to understand the various tuning options available: - [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) diff --git a/3.test_cases/megatron/megatron-lm/profiles/README.md b/3.test_cases/megatron/megatron-lm/profiles/README.md new file mode 100644 index 000000000..0b0cd7ae8 --- /dev/null +++ b/3.test_cases/megatron/megatron-lm/profiles/README.md @@ -0,0 +1,88 @@ +# Megatron-LM Instance Profiles + +Instance profiles configure GPU count, parallelism defaults (TP/PP), micro-batch +size, and EFA/NCCL networking variables for each supported EC2 instance type. + +**Note on TP/PP coupling:** Megatron-LM's tensor and pipeline parallelism must +divide evenly into the available GPUs. The profiles set conservative defaults, +but you should tune TP/PP for your specific model size and node count. The GPT3 +training script has built-in conditional logic that overrides these defaults +based on node count. + +## Auto-detection + +The training scripts auto-detect the running instance type and source the +matching `.env` profile. Detection order: + +1. `INSTANCE_PROFILE` env var (explicit override, e.g. `g5-12xlarge`) +2. `INSTANCE_TYPE` env var +3. EC2 instance metadata API (IMDSv2) +4. GPU name from `nvidia-smi` (fallback) + +To override auto-detection: + +```bash +export INSTANCE_PROFILE=g5-12xlarge +``` + +See [docs/instance-compatibility.md](../../../docs/instance-compatibility.md) +for full details. 
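The TP/PP coupling rule above (TP x PP must divide the available GPUs) can be checked up front before launching a job. A minimal sketch, reusing the variable names from the profile `.env` files and assuming `NUM_NODES` is supplied by the user or scheduler:

```shell
# Sanity-check a profile's parallelism defaults against the world size
# and derive the resulting data-parallel size.
GPUS_PER_NODE=4 NUM_NODES=2
TENSOR_PARALLEL=2 PIPELINE_PARALLEL=2

WORLD_SIZE=$(( GPUS_PER_NODE * NUM_NODES ))
MODEL_PARALLEL=$(( TENSOR_PARALLEL * PIPELINE_PARALLEL ))

if (( TENSOR_PARALLEL > GPUS_PER_NODE )); then
  # TP groups should not span nodes (no NVLink across nodes)
  echo "invalid: TP exceeds GPUs per node"
elif (( WORLD_SIZE % MODEL_PARALLEL != 0 )); then
  echo "invalid: TP*PP=${MODEL_PARALLEL} does not divide ${WORLD_SIZE} GPUs"
else
  echo "data-parallel size: $(( WORLD_SIZE / MODEL_PARALLEL ))"
fi
# -> data-parallel size: 2
```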
+ +## Available Profiles + +| Profile | Instance | GPUs | VRAM | EFA | Default TP/PP | Status | +|---------|----------|------|------|-----|---------------|--------| +| `p5en-48xlarge.env` | p5en.48xlarge | 8x H200 | 141 GB | 32 adapters | TP=8, PP=1 | Supported | +| `p5-48xlarge.env` | p5.48xlarge | 8x H100 | 80 GB | 32 adapters | TP=8, PP=1 | Supported | +| `p4de-24xlarge.env` | p4de.24xlarge | 8x A100 | 80 GB | 4 adapters | TP=4, PP=2 | Supported (original target) | +| `g6e-12xlarge.env` | g6e.12xlarge | 4x L40S | 48 GB | None | TP=4, PP=1 | Supported (medium models) | +| `g5-12xlarge.env` | g5.12xlarge | 4x A10G | 24 GB | None | TP=4, PP=1 | Supported (small models only) | + +## Model Compatibility Matrix + +### GPT3 (2.distributed-training.sbatch) + +| Model | p5en | p5 | p4de | g6e | g5 | +|-------|------|----|------|-----|----| +| 1.7B | Yes | Yes | Yes | Yes | Yes | +| 3.6B | Yes | Yes | Yes | Yes | Yes | +| 7.5B (default) | Yes | Yes | Yes | Yes | Tight | +| 18.4B | Yes | Yes | Yes | Tight | No | +| 39.1B | Yes | Yes | Yes | No | No | +| 76.1B+ | Yes | Yes | Yes | No | No | + +### Llama2 (pretrain-llama2.sbatch) + +| Model | p5en | p5 | p4de | g6e | g5 | +|-------|------|----|------|-----|----| +| 7B (TP=1,PP=1) | Yes | Yes | Yes | Yes | Tight | +| 13B (TP=2,PP=1) | Yes | Yes | Yes | Tight | No | +| 70B (TP=4,PP=4) | Yes | Yes | Yes | No | No | + +**Notes:** +- "Tight" means it may work but needs `--recompute-activations` and MBS=1 +- "No" means the model will not fit in the available VRAM +- g5/g6e have 4 GPUs, so Llama2-70B's TP=4,PP=4 preset (requiring 16 GPUs) + cannot run without adjusting to multi-node configurations + +## Kubernetes Integration + +The K8s `pytorchjob.yaml-template` already uses `envsubst` placeholders for +`GPU_PER_NODE`, `EFA_PER_NODE`, `TENSOR_PARALLEL`, etc. 
Source the profile +before running `envsubst` to set these variables: + +```bash +# Detect instance and source profile +PROFILES_DIR="$(pwd)/../../profiles" +PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}") +source "$PROFILE_ENV" + +# Set remaining model-specific variables +export NUM_LAYERS=36 HIDDEN_SIZE=4096 NUM_ATTENTION_HEADS=32 +export SEQ_LENGTH=2048 MAX_POSITION_EMBEDDINGS=2048 +export MICRO_BATCH_SIZE=1 GLOBAL_BATCH_SIZE=288 +export NUM_NODES=2 FI_PROVIDER=${FI_PROVIDER:-efa} + +# Generate K8s manifest +envsubst < pytorchjob.yaml-template > pytorchjob.yaml +``` diff --git a/3.test_cases/megatron/megatron-lm/profiles/_detect.sh b/3.test_cases/megatron/megatron-lm/profiles/_detect.sh new file mode 100755 index 000000000..896664395 --- /dev/null +++ b/3.test_cases/megatron/megatron-lm/profiles/_detect.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# _detect.sh — Auto-detect EC2 instance type and resolve a profile. +# +# CANONICAL SOURCE: This is the single source of truth for instance detection. +# Copies exist at profiles/_detect.sh in each test case. To update +# all copies, edit this file and run: ./sync_profiles.sh +# +# Usage (from a training script): +# PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles") +# source "$PROFILE_ENV" +# +# Detection order: +# 1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge") +# 2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge") +# 3. EC2 instance metadata API (works on bare metal and K8s with host networking) +# 4. GPU name from nvidia-smi (fallback when metadata is unavailable) +# +# Outputs the path to the profile .env file on stdout. +# Exits non-zero if no profile can be resolved.
+# --------------------------------------------------------------------------- +set -euo pipefail + +PROFILES_DIR="${1:-.}" + +# --- Step 1: Check for explicit INSTANCE_PROFILE override ------------------- +if [[ -n "${INSTANCE_PROFILE:-}" ]]; then + PROFILE_NAME="$INSTANCE_PROFILE" + echo "Instance profile override: ${PROFILE_NAME}" >&2 +else + # --- Step 2: Try INSTANCE_TYPE from env_vars ---------------------------- + INSTANCE_TYPE="${INSTANCE_TYPE:-}" + + # --- Step 3: Try EC2 instance metadata API ------------------------------ + if [[ -z "$INSTANCE_TYPE" ]]; then + # IMDSv2: get a token first, then query + TOKEN=$(curl -s --connect-timeout 2 -X PUT \ + "http://169.254.169.254/latest/api/token" \ + -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true + if [[ -n "$TOKEN" ]]; then + INSTANCE_TYPE=$(curl -s --connect-timeout 2 \ + -H "X-aws-ec2-metadata-token: $TOKEN" \ + "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true + fi + fi + + # --- Step 4: Fallback — detect from GPU name ---------------------------- + if [[ -z "$INSTANCE_TYPE" ]]; then + if command -v nvidia-smi &>/dev/null; then + GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) || true + case "${GPU_NAME:-}" in + *A10G*) INSTANCE_TYPE="g5.12xlarge" ;; + *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;; + *A100*40*) INSTANCE_TYPE="p4d.24xlarge" ;; + *H100*) INSTANCE_TYPE="p5.48xlarge" ;; + *H200*) INSTANCE_TYPE="p5en.48xlarge" ;; + *L40S*) INSTANCE_TYPE="g6e.12xlarge" ;; + *L4*) INSTANCE_TYPE="g6.12xlarge" ;; + *) + echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + ;; + esac + echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2 + else + echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." 
>&2 + exit 1 + fi + fi + + # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge" + PROFILE_NAME="${INSTANCE_TYPE//./-}" + echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2 +fi + +# --- Resolve profile file path ---------------------------------------------- +PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env" + +if [[ ! -f "$PROFILE_PATH" ]]; then + echo "ERROR: No profile found at ${PROFILE_PATH}" >&2 + echo "" >&2 + echo "Available profiles:" >&2 + ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\// /' >&2 || echo " (none)" >&2 + echo "" >&2 + echo "To create a new profile, copy an existing one and adjust the values:" >&2 + echo " cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2 + exit 1 +fi + +# Output the resolved path (recipe script will source it) +echo "$PROFILE_PATH" diff --git a/3.test_cases/megatron/megatron-lm/profiles/g5-12xlarge.env b/3.test_cases/megatron/megatron-lm/profiles/g5-12xlarge.env new file mode 100644 index 000000000..cd4a8f053 --- /dev/null +++ b/3.test_cases/megatron/megatron-lm/profiles/g5-12xlarge.env @@ -0,0 +1,41 @@ +# g5.12xlarge — 4x A10G 24GB, no EFA, no NVLink, no GPUDirect RDMA +# Severely memory-constrained for Megatron-LM workloads. +# +# MODEL ASSUMPTIONS: +# Instance-driven: GPUS_PER_NODE (4), no EFA vars +# Coupling: TP/PP must divide into 4 GPUs; 24GB VRAM limits model size +# +# MODEL COMPATIBILITY (g5.12xlarge, 4x A10G 24GB each): +# - GPT3 7.5B (default): TIGHT — TP=4, PP=1, MBS=1. May need +# --recompute-activations (already enabled). 
+# - GPT3 1.7B/3.6B: WORKS — TP=2 or TP=1, PP=1, MBS=1 +# - GPT3 18B+: WILL NOT FIT +# - Llama2-7B (TP=1,PP=1): TIGHT — needs --recompute-activations, MBS=1 +# - Llama2-13B: WILL NOT FIT without multi-node PP +# - Llama2-70B: WILL NOT FIT +# +# Key differences from p4de/p5/p5en: +# - 4 GPUs instead of 8 → TP*PP must be <= 4 +# - No EFA → must not set FI_* env vars +# - 24GB VRAM → only small models fit; use --recompute-activations +# - Keep --fp16 (already the default); A10G (Ampere) also supports bf16 + +# --- Hardware --- +export GPUS_PER_NODE=4 + +# --- Default parallelism --- +# TP=4, PP=1 uses all GPUs for tensor parallelism (best for single-node). +# For multi-node, consider TP=2, PP=2. +: "${TENSOR_PARALLEL:=4}" +: "${PIPELINE_PARALLEL:=1}" +: "${MICRO_BATCH_SIZE:=1}" + +# --- EFA / NCCL --- +# No EFA on g5 — do NOT set FI_PROVIDER or FI_EFA_* variables. +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" +# Disable NVLS (not supported on A10G) +export NCCL_NVLS_ENABLE=0 + +# --- Kubernetes resource requests --- +export GPU_PER_NODE=4 +export EFA_PER_NODE=0 diff --git a/3.test_cases/megatron/megatron-lm/profiles/g6e-12xlarge.env b/3.test_cases/megatron/megatron-lm/profiles/g6e-12xlarge.env new file mode 100644 index 000000000..66c5757f6 --- /dev/null +++ b/3.test_cases/megatron/megatron-lm/profiles/g6e-12xlarge.env @@ -0,0 +1,37 @@ +# g6e.12xlarge — 4x L40S 48GB, no EFA, no NVLink, no GPUDirect RDMA +# Moderate VRAM; can fit medium models that won't fit on g5.
+# +# MODEL ASSUMPTIONS: +# Instance-driven: GPUS_PER_NODE (4), no EFA vars +# Coupling: TP/PP must divide into 4 GPUs; 48GB allows medium models +# +# MODEL COMPATIBILITY (g6e.12xlarge, 4x L40S 48GB each): +# - GPT3 7.5B (default): WORKS — TP=4, PP=1, MBS=1 +# - GPT3 1.7B/3.6B: WORKS — TP=2 or TP=1, PP=1 +# - GPT3 18B: TIGHT — TP=4, PP=1 may work with recompute +# - GPT3 39B+: WILL NOT FIT +# - Llama2-7B (TP=2,PP=1): WORKS — 48GB per GPU is sufficient +# - Llama2-13B (TP=4,PP=1): TIGHT — may need --recompute-activations +# - Llama2-70B: WILL NOT FIT +# +# Key differences from p4de/p5/p5en: +# - 4 GPUs instead of 8 → TP*PP must be <= 4 +# - No EFA → must not set FI_* env vars +# - 48GB VRAM → medium models fit; --recompute-activations helps + +# --- Hardware --- +export GPUS_PER_NODE=4 + +# --- Default parallelism --- +: "${TENSOR_PARALLEL:=4}" +: "${PIPELINE_PARALLEL:=1}" +: "${MICRO_BATCH_SIZE:=1}" + +# --- EFA / NCCL --- +# No EFA on g6e — do NOT set FI_PROVIDER or FI_EFA_* variables. +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" +export NCCL_NVLS_ENABLE=0 + +# --- Kubernetes resource requests --- +export GPU_PER_NODE=4 +export EFA_PER_NODE=0 diff --git a/3.test_cases/megatron/megatron-lm/profiles/p4de-24xlarge.env b/3.test_cases/megatron/megatron-lm/profiles/p4de-24xlarge.env new file mode 100644 index 000000000..666f8981f --- /dev/null +++ b/3.test_cases/megatron/megatron-lm/profiles/p4de-24xlarge.env @@ -0,0 +1,28 @@ +# p4de.24xlarge — 8x A100 80GB, 4 EFA, NVLink, GPUDirect RDMA +# This is the instance type assumed by the original scripts ("2 p4d(e) = 16 A100 GPUs"). +# +# MODEL ASSUMPTIONS: +# Instance-driven: GPUS_PER_NODE, EFA/NCCL vars +# Model-driven: NUM_LAYERS, HIDDEN_SIZE, NUM_ATTENTION_HEADS, SEQ_LENGTH +# Coupling: TP, PP, MBS, GBS +# +# The original GPT3 script defaults (TP=4, PP=2 for <=4 nodes) were designed +# for this instance type. 
+ +# --- Hardware --- +export GPUS_PER_NODE=8 + +# --- Default parallelism --- +: "${TENSOR_PARALLEL:=4}" +: "${PIPELINE_PARALLEL:=2}" +: "${MICRO_BATCH_SIZE:=1}" + +# --- EFA / NCCL --- +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export NCCL_SOCKET_IFNAME="^docker,lo,veth" +export NCCL_NVLS_ENABLE=0 + +# --- Kubernetes resource requests --- +export GPU_PER_NODE=8 +export EFA_PER_NODE=4 diff --git a/3.test_cases/megatron/megatron-lm/profiles/p5-48xlarge.env b/3.test_cases/megatron/megatron-lm/profiles/p5-48xlarge.env new file mode 100644 index 000000000..53badd04b --- /dev/null +++ b/3.test_cases/megatron/megatron-lm/profiles/p5-48xlarge.env @@ -0,0 +1,26 @@ +# p5.48xlarge — 8x H100 80GB, 32 EFA, NVLink, GPUDirect RDMA +# +# MODEL ASSUMPTIONS: +# Instance-driven: GPUS_PER_NODE, EFA/NCCL vars +# Model-driven: NUM_LAYERS, HIDDEN_SIZE, NUM_ATTENTION_HEADS, SEQ_LENGTH +# Coupling: TP, PP, MBS, GBS depend on both instance VRAM and model size + +# --- Hardware --- +export GPUS_PER_NODE=8 + +# --- Default parallelism --- +: "${TENSOR_PARALLEL:=8}" +: "${PIPELINE_PARALLEL:=1}" +: "${MICRO_BATCH_SIZE:=1}" + +# --- EFA / NCCL --- +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME="^docker,lo,veth" +# H100 supports NVLS but it can be unstable in some configs — default off +export NCCL_NVLS_ENABLE=0 + +# --- Kubernetes resource requests --- +export GPU_PER_NODE=8 +export EFA_PER_NODE=32 diff --git a/3.test_cases/megatron/megatron-lm/profiles/p5en-48xlarge.env b/3.test_cases/megatron/megatron-lm/profiles/p5en-48xlarge.env new file mode 100644 index 000000000..02fe8a1eb --- /dev/null +++ b/3.test_cases/megatron/megatron-lm/profiles/p5en-48xlarge.env @@ -0,0 +1,35 @@ +# p5en.48xlarge — 8x H200 141GB, 32 EFA, NVLink, GPUDirect RDMA +# Primary target for Megatron-LM training at scale. 
+# +# MODEL ASSUMPTIONS: +# Instance-driven (set by this profile): +# GPUS_PER_NODE, EFA/NCCL vars, NCCL_NVLS_ENABLE +# Model-driven (set by the training script or user): +# NUM_LAYERS, HIDDEN_SIZE, NUM_ATTENTION_HEADS, SEQ_LENGTH +# Coupling (depend on both instance and model): +# TENSOR_PARALLEL, PIPELINE_PARALLEL — must divide GPUS_PER_NODE +# and fit the model in VRAM. Defaults below suit most models up to 70B. +# MICRO_BATCH_SIZE — 141GB allows MBS=2 for most models +# GLOBAL_BATCH_SIZE — scales with data parallelism + +# --- Hardware --- +export GPUS_PER_NODE=8 + +# --- Default parallelism (user/script can override) --- +# These are profile defaults; the GPT3 script has its own conditional logic +# that will take precedence if TP/PP are already set. +: "${TENSOR_PARALLEL:=8}" +: "${PIPELINE_PARALLEL:=1}" +: "${MICRO_BATCH_SIZE:=2}" + +# --- EFA / NCCL --- +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME="^docker,lo,veth" +# H200 supports NVLink SHARP (NVLS) — enable it +export NCCL_NVLS_ENABLE=1 + +# --- Kubernetes resource requests --- +export GPU_PER_NODE=8 +export EFA_PER_NODE=32 diff --git a/3.test_cases/megatron/megatron-lm/slurm/gpt3/2.distributed-training.sbatch b/3.test_cases/megatron/megatron-lm/slurm/gpt3/2.distributed-training.sbatch index a1392debe..e5141895c 100755 --- a/3.test_cases/megatron/megatron-lm/slurm/gpt3/2.distributed-training.sbatch +++ b/3.test_cases/megatron/megatron-lm/slurm/gpt3/2.distributed-training.sbatch @@ -10,11 +10,41 @@ set -ex; +########################### +###### Instance Profile ### +########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, TENSOR_PARALLEL, PIPELINE_PARALLEL, +# MICRO_BATCH_SIZE, EFA vars, NCCL settings. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. 
+ +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, p4de-style)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, p4de-style)." +fi + +# Fallback defaults when no profile is loaded (assumes P4de-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} + ########################### ###### User Variables ##### ########################### -# configure TP/PP/GBS based on node count +NODES=${SLURM_JOB_NUM_NODES:-2} + +# configure TP/PP/GBS based on node count and GPU count if [ $NODES -le 4 ]; then : "${TENSOR_PARALLEL:=4}" : "${PIPELINE_PARALLEL:=2}" @@ -22,7 +52,7 @@ if [ $NODES -le 4 ]; then elif [ $NODES -ge 8 ]; then : "${TENSOR_PARALLEL:=8}" : "${PIPELINE_PARALLEL:=4}" - TOTAL_GPUS=$((NODES * 8)) + TOTAL_GPUS=$((NODES * GPUS_PER_NODE)) DATA_PARALLEL=$((TOTAL_GPUS / (TENSOR_PARALLEL * PIPELINE_PARALLEL))) : "${GLOBAL_BATCH_SIZE:=$((DATA_PARALLEL * 576))}" fi @@ -47,11 +77,21 @@ fi ## Environment Variables ## ########################### +# EFA networking — configured by profile or defaults. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA/NCCL settings. + true +else + # No profile — use legacy defaults (P4de) + export FI_PROVIDER=efa + export FI_EFA_USE_HUGE_PAGE=0 +fi + # https://discuss.pytorch.org/t/nccl-network-is-unreachable-connection-refused-when-initializing-ddp/137352 # https://github.com/pytorch/pytorch/issues/68893 -#export NCCL_SOCKET_IFNAME=ens export NCCL_ASYNC_ERROR_HANDLING=1 -export NCCL_DEBUG=INFO +export NCCL_DEBUG=${NCCL_DEBUG:-INFO} +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth"} # async runtime error ... 
export CUDA_DEVICE_MAX_CONNECTIONS=1 @@ -66,8 +106,7 @@ declare -a ARGS=( ) declare -a TORCHRUN_ARGS=( - # change this to match the number of gpus per node: - --nproc_per_node=8 + --nproc_per_node=$GPUS_PER_NODE --nnodes=$SLURM_JOB_NUM_NODES --rdzv_id=$SLURM_JOB_ID --rdzv_backend=c10d diff --git a/3.test_cases/megatron/megatron-lm/slurm/llama2/pretrain-llama2.sbatch b/3.test_cases/megatron/megatron-lm/slurm/llama2/pretrain-llama2.sbatch index e9ab80d14..8707cf761 100755 --- a/3.test_cases/megatron/megatron-lm/slurm/llama2/pretrain-llama2.sbatch +++ b/3.test_cases/megatron/megatron-lm/slurm/llama2/pretrain-llama2.sbatch @@ -10,6 +10,34 @@ set -exuo pipefail +########################### +###### Instance Profile ### +########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, EFA vars, NCCL settings, and default TP/PP. +# Note: The model architecture presets below override the profile's TP/PP. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, p4de-style)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, p4de-style)." +fi + +# Fallback defaults when no profile is loaded (assumes P4de-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} + ################################################## ###### Model architectures (example presets) ##### @@ -96,14 +124,23 @@ MEGATRON_ARGS+=( ## Environment Variables ## ########################### +# EFA networking — configured by profile or defaults. 
+if [[ "$PROFILE_LOADED" == "1" ]]; then
+    # Profile was sourced — trust its EFA/NCCL settings.
+    true
+else
+    # No profile — use legacy defaults (P4de)
+    export FI_PROVIDER=efa
+    export FI_EFA_USE_HUGE_PAGE=0
+fi
+
 # https://discuss.pytorch.org/t/nccl-network-is-unreachable-connection-refused-when-initializing-ddp/137352
 # https://github.com/pytorch/pytorch/issues/68893
-#export NCCL_SOCKET_IFNAME=ens
 export NCCL_ASYNC_ERROR_HANDLING=1
-export NCCL_NVLS_ENABLE=0
-#export NCCL_DEBUG=INFO
+export NCCL_NVLS_ENABLE=${NCCL_NVLS_ENABLE:-0}
 export NCCL_AVOID_RECORD_STREAMS=1 # torch<2.2
 export TORCH_NCCL_AVOID_RECORD_STREAMS=1 # torch>=2.2
+export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth"}
 
 # async runtime error ...
 export CUDA_DEVICE_MAX_CONNECTIONS=1
@@ -119,8 +156,7 @@ declare -a ARGS=(
 )
 
 declare -a TORCHRUN_ARGS=(
-    # change this to match the number of gpus per node:
-    --nproc_per_node=8
+    --nproc_per_node=$GPUS_PER_NODE
     --nnodes=$SLURM_JOB_NUM_NODES
     --rdzv_id=$SLURM_JOB_ID
     --rdzv_backend=c10d
diff --git a/3.test_cases/megatron/nemo/README.md b/3.test_cases/megatron/nemo/README.md
index 00b1dd7b5..f2ca85d13 100644
--- a/3.test_cases/megatron/nemo/README.md
+++ b/3.test_cases/megatron/nemo/README.md
@@ -2,6 +2,19 @@
 This test case contains examples and configurations for running distributed training with NVIDIA NeMo 2.0.
 
+## Tested Configurations
+
+| Instance | GPUs | Status | Notes |
+|----------|------|--------|-------|
+| p5en.48xlarge | 8 x H200 141 GB | Tested | Primary target; see PERFORMANCE.md in slurm/ |
+| p5.48xlarge | 8 x H100 80 GB | Tested | |
+| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work |
+| g5.12xlarge | 4 x A10G 24 GB | Untested | May need smaller model configs |
+| g6e.12xlarge | 4 x L40S 48 GB | Untested | |
+
+> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
+> for parameter adjustments needed across instance types.
+
 ## Overview
 
 [NVIDIA NeMo](https://developer.nvidia.com/nemo-framework) is a cloud-native framework for training and deploying generative AI models, optimized for architectures ranging from billions to trillions of parameters. NeMo 2.0 introduces a Python-based configuration system, providing enhanced flexibility, better IDE integration, and streamlined customization for large language model training.
diff --git a/3.test_cases/megatron/nemo/profiles/README.md b/3.test_cases/megatron/nemo/profiles/README.md
new file mode 100644
index 000000000..7edcfbe0f
--- /dev/null
+++ b/3.test_cases/megatron/nemo/profiles/README.md
@@ -0,0 +1,70 @@
+# NeMo Instance Profiles
+
+NeMo uses Python-based NeMo-Run scripts (`run.py`, K8s scripts) that accept
+environment variables via `--env_vars_file`. This directory provides per-instance
+`env_vars_<instance>.json` files with the correct EFA/NCCL networking settings.
+
+**Key difference from other test cases:** NeMo profiles only control networking
+variables. Training parameters (TP, PP, MBS, GBS, etc.) are managed by NeMo
+recipes and are passed as Python CLI arguments to the launch scripts, not via
+the env vars file.
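The profile filenames follow the EC2 instance family prefix (`env_vars_p5en.json`, `env_vars_g5.json`, and so on). As a hedged illustration only — the `FAMILY` and `ENV_FILE` variable names and the default instance type below are ours, not part of the repo — selecting the matching file can be scripted:

```shell
#!/usr/bin/env bash
# Illustrative only: derive the env_vars profile path from an instance type.
set -euo pipefail

INSTANCE_TYPE="${INSTANCE_TYPE:-g5.12xlarge}"   # e.g. p5en.48xlarge, g6e.12xlarge
FAMILY="${INSTANCE_TYPE%%.*}"                   # "g5.12xlarge" -> "g5"
ENV_FILE="profiles/env_vars_${FAMILY}.json"

echo "Would pass: --env_vars_file ${ENV_FILE}"
```

This works because every profile in this directory is keyed by family alone (the GPU count per family is fixed), so no size suffix is needed in the lookup.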
+ +## Usage + +### Slurm + +```bash +# EFA instances (p5en, p5, p4de) +python run.py --container_image ~/aws-nemo.sqsh --nodes 2 \ + --env_vars_file ../profiles/env_vars_p5en.json + +# Non-EFA instances (g5, g6e) +python run.py --container_image ~/aws-nemo.sqsh --nodes 2 \ + --ntasks_per_node 4 \ + --env_vars_file ../profiles/env_vars_g5.json +``` + +### Kubernetes (SkyPilot) + +```bash +# EFA instances +python pretrain_mock_dataset.py --nodes 2 --gpu-devices 8 \ + --efa-devices 32 --env_vars_file ../profiles/env_vars_p5en.json + +# Non-EFA instances +python pretrain_mock_dataset.py --nodes 1 --gpu-devices 4 \ + --env_vars_file ../profiles/env_vars_g6e.json +``` + +## Available Profiles + +| Profile | Instance | GPUs | EFA | FI_PROVIDER | NVLS | +|---------|----------|------|-----|-------------|------| +| `env_vars_p5en.json` | p5en.48xlarge | 8x H200 | 32 | efa | 1 | +| `env_vars_p5.json` | p5.48xlarge | 8x H100 | 32 | efa | 0 | +| `env_vars_p4de.json` | p4de.24xlarge | 8x A100 | 4 | efa | 0 | +| `env_vars_g6e.json` | g6e.12xlarge | 4x L40S | 0 | (unset) | 0 | +| `env_vars_g5.json` | g5.12xlarge | 4x A10G | 0 | (unset) | 0 | + +## Instance-Specific CLI Arguments + +The `env_vars.json` profile handles networking, but you must also adjust the +CLI arguments to match the instance: + +| Instance | `--ntasks_per_node` (Slurm) | `--gpu-devices` (K8s) | `--efa-devices` (K8s) | +|----------|---------------------------|----------------------|---------------------| +| p5en.48xlarge | 8 (default) | 8 (default) | 32 | +| p5.48xlarge | 8 (default) | 8 (default) | 32 | +| p4de.24xlarge | 8 (default) | 8 (default) | 4 | +| g6e.12xlarge | 4 | 4 | (omit) | +| g5.12xlarge | 4 | 4 | (omit) | + +## Performance Recipes + +The `PERFORMANCE.md` in the parent directory documents validated TP/PP/GBS +configurations for various models on H100/H200/B200. Use those recipes as-is +on p5en/p5 instances. 
For g5/g6e, smaller model configurations should be +selected (e.g., smaller batch sizes, fewer pipeline stages). + +See [docs/instance-compatibility.md](../../../docs/instance-compatibility.md) +for full instance reference. diff --git a/3.test_cases/megatron/nemo/profiles/env_vars_g5.json b/3.test_cases/megatron/nemo/profiles/env_vars_g5.json new file mode 100644 index 000000000..8e3b7d476 --- /dev/null +++ b/3.test_cases/megatron/nemo/profiles/env_vars_g5.json @@ -0,0 +1,9 @@ +{ + "TORCH_NCCL_AVOID_RECORD_STREAMS": "1", + "NVTE_DP_AMAX_REDUCE_INTERVAL": "0", + "NVTE_ASYNC_AMAX_REDUCTION": "1", + "NVTE_FUSED_ATTN": "0", + "NCCL_SOCKET_IFNAME": "^docker,lo,veth,eth", + "NCCL_NVLS_ENABLE": "0", + "NCCL_DEBUG": "INFO" +} diff --git a/3.test_cases/megatron/nemo/profiles/env_vars_g6e.json b/3.test_cases/megatron/nemo/profiles/env_vars_g6e.json new file mode 100644 index 000000000..8e3b7d476 --- /dev/null +++ b/3.test_cases/megatron/nemo/profiles/env_vars_g6e.json @@ -0,0 +1,9 @@ +{ + "TORCH_NCCL_AVOID_RECORD_STREAMS": "1", + "NVTE_DP_AMAX_REDUCE_INTERVAL": "0", + "NVTE_ASYNC_AMAX_REDUCTION": "1", + "NVTE_FUSED_ATTN": "0", + "NCCL_SOCKET_IFNAME": "^docker,lo,veth,eth", + "NCCL_NVLS_ENABLE": "0", + "NCCL_DEBUG": "INFO" +} diff --git a/3.test_cases/megatron/nemo/profiles/env_vars_p4de.json b/3.test_cases/megatron/nemo/profiles/env_vars_p4de.json new file mode 100644 index 000000000..7c6267596 --- /dev/null +++ b/3.test_cases/megatron/nemo/profiles/env_vars_p4de.json @@ -0,0 +1,11 @@ +{ + "TORCH_NCCL_AVOID_RECORD_STREAMS": "1", + "NVTE_DP_AMAX_REDUCE_INTERVAL": "0", + "NVTE_ASYNC_AMAX_REDUCTION": "1", + "NVTE_FUSED_ATTN": "0", + "FI_EFA_USE_HUGE_PAGE": "0", + "FI_PROVIDER": "efa", + "NCCL_SOCKET_IFNAME": "^docker,lo,veth", + "NCCL_NVLS_ENABLE": "0", + "NCCL_DEBUG": "INFO" +} diff --git a/3.test_cases/megatron/nemo/profiles/env_vars_p5.json b/3.test_cases/megatron/nemo/profiles/env_vars_p5.json new file mode 100644 index 000000000..7c6267596 --- /dev/null +++ 
b/3.test_cases/megatron/nemo/profiles/env_vars_p5.json
@@ -0,0 +1,11 @@
+{
+    "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
+    "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
+    "NVTE_ASYNC_AMAX_REDUCTION": "1",
+    "NVTE_FUSED_ATTN": "0",
+    "FI_EFA_USE_HUGE_PAGE": "0",
+    "FI_PROVIDER": "efa",
+    "NCCL_SOCKET_IFNAME": "^docker,lo,veth",
+    "NCCL_NVLS_ENABLE": "0",
+    "NCCL_DEBUG": "INFO"
+}
diff --git a/3.test_cases/megatron/nemo/profiles/env_vars_p5en.json b/3.test_cases/megatron/nemo/profiles/env_vars_p5en.json
new file mode 100644
index 000000000..0e5bd24cd
--- /dev/null
+++ b/3.test_cases/megatron/nemo/profiles/env_vars_p5en.json
@@ -0,0 +1,12 @@
+{
+    "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
+    "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
+    "NVTE_ASYNC_AMAX_REDUCTION": "1",
+    "NVTE_FUSED_ATTN": "0",
+    "FI_EFA_USE_HUGE_PAGE": "0",
+    "FI_PROVIDER": "efa",
+    "FI_EFA_SET_CUDA_SYNC_MEMOPS": "0",
+    "NCCL_SOCKET_IFNAME": "^docker,lo,veth",
+    "NCCL_NVLS_ENABLE": "1",
+    "NCCL_DEBUG": "INFO"
+}
diff --git a/3.test_cases/megatron/nemo1.0/README.md b/3.test_cases/megatron/nemo1.0/README.md
index 6bddc87fe..5ff472319 100644
--- a/3.test_cases/megatron/nemo1.0/README.md
+++ b/3.test_cases/megatron/nemo1.0/README.md
@@ -17,6 +17,18 @@ Table of contents:
 - [8. References](#8-references)
 - [9. Authors / Reviewers](#9-authors--reviewers)
 
+## Tested Configurations
+
+| Instance | GPUs | Status | Notes |
+|----------|------|--------|-------|
+| p4de.24xlarge | 8 x A100 80 GB | Tested | Primary target |
+| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work |
+| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work |
+| g5.12xlarge | 4 x A10G 24 GB | Untested | Likely needs smaller model sizes |
+
+> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
+> for parameter adjustments needed across instance types.
+
 ## 1.
Pre-requisites
 
 The following pre-requisites are needed to run this example:
diff --git a/3.test_cases/pytorch/FSDP/README.md b/3.test_cases/pytorch/FSDP/README.md
index 3d48bf79a..39210da76 100644
--- a/3.test_cases/pytorch/FSDP/README.md
+++ b/3.test_cases/pytorch/FSDP/README.md
@@ -3,6 +3,30 @@
 This content provides a quickstart with multinode PyTorch [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) training on Slurm and Kubernetes. It is designed to be simple with no data preparation or tokenizer to download, and uses Python virtual environment.
 
+## Tested Configurations
+
+| Instance | GPUs | Models | Nodes | Status |
+|----------|------|--------|-------|--------|
+| p5en.48xlarge | 8 x H200 141 GB | Llama 2/3, Mixtral 8x7B | Various | Tested (CI) |
+| p5.48xlarge | 8 x H100 80 GB | Llama 2/3, Mixtral 8x7B | Various | Tested (CI) |
+| p4de.24xlarge | 8 x A100 80 GB | Llama 2/3 | Various | Tested |
+| g5.12xlarge | 4 x A10G 24 GB | Various | Various | Tested |
+| g5.xlarge | 1 x A10G 24 GB | Various | 1 | Tested |
+| g4dn | Various | Various | Various | Tested |
+| g6e.12xlarge | 4 x L40S 48 GB | — | — | Untested |
+
+> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
+> for parameter adjustments needed across instance types, and
+> [instance profiles](../../../docs/instance-profiles/) for hardware details.
+
+### Instance Profiles
+
+This test case includes an [instance profile system](profiles/) that auto-detects
+your EC2 instance type and configures GPU count, EFA networking, NCCL settings,
+and FSDP memory optimizations automatically. The Slurm scripts source the
+matching profile at runtime — no manual editing of `GPUS_PER_NODE` or EFA
+variables needed. See [profiles/README.md](profiles/README.md) for details.
+ ## Prerequisites To run FSDP training, you will need to create a training cluster based on Slurm or Kubermetes with an [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html) diff --git a/3.test_cases/pytorch/FSDP/kubernetes/README.md b/3.test_cases/pytorch/FSDP/kubernetes/README.md index 2561bcc47..a4bf7294b 100644 --- a/3.test_cases/pytorch/FSDP/kubernetes/README.md +++ b/3.test_cases/pytorch/FSDP/kubernetes/README.md @@ -87,33 +87,57 @@ If you'd like to instead use your own dataset, you can do so by [formatting it a Generate the Kubernetes manifest and apply it to the cluster. -Create environment variables: +Create environment variables. + +You can use the [instance profiles](../profiles/) to set GPU count, EFA, and +other instance-specific values automatically, or set them manually: + +**Option A: Use an instance profile** (recommended) + +Source the profile for your instance type to set `GPU_PER_NODE`, `EFA_PER_NODE`, +and EFA variables automatically: ``` bash +# Auto-detect or override with INSTANCE_TYPE +export INSTANCE_TYPE=p5.48xlarge +PROFILE_ENV=$(../profiles/_detect.sh ../profiles/) +source "$PROFILE_ENV" + cat << EOF > env_vars export IMAGE_URI=${REGISTRY}fsdp:pytorch2.7.1 -export INSTANCE_TYPE= -export NUM_NODES= -export GPU_PER_NODE= -export EFA_PER_NODE= -export FI_PROVIDER=efa +export INSTANCE_TYPE=${INSTANCE_TYPE} +export NUM_NODES=4 +export GPU_PER_NODE=${GPU_PER_NODE} +export EFA_PER_NODE=${EFA_PER_NODE} +export FI_PROVIDER=${FI_PROVIDER:-} export HF_TOKEN= EOF ``` -For reference, we are running the Llama 3.1 8B model on 4 x p5.48xlarge instances and below is the configuration of our environment variables: +**Option B: Set values manually** + ``` bash cat << EOF > env_vars export IMAGE_URI=${REGISTRY}fsdp:pytorch2.7.1 -export INSTANCE_TYPE=p5.48xlarge -export NUM_NODES=4 -export GPU_PER_NODE=8 -export EFA_PER_NODE=32 +export INSTANCE_TYPE= +export NUM_NODES= +export GPU_PER_NODE= +export EFA_PER_NODE= export 
FI_PROVIDER=efa export HF_TOKEN= EOF ``` +Quick reference for common instance types: + +| Instance | GPU_PER_NODE | EFA_PER_NODE | FI_PROVIDER | +|----------|-------------|-------------|-------------| +| p5en.48xlarge | 8 | 32 | efa | +| p5.48xlarge | 8 | 32 | efa | +| p4de.24xlarge | 8 | 4 | efa | +| g5.12xlarge | 4 | 0 | (unset) | +| g6e.12xlarge | 4 | 0 | (unset) | + Fill in `env_vars` and then source variables: ``` bash @@ -125,9 +149,9 @@ Apply yaml: envsubst < llama3_1_8b-fsdp.yaml | kubectl apply -f - ``` -EFA level variables are available for adjustment in fsdp.yaml-template -Keep FI_* values commented out for non-efa instances (G5, G4d, P3) or P5 -Uncomment FI_* values for P4d instances +> **Note on EFA variables:** The FI_* env vars in the YAML templates are commented +> out by default. For EFA-enabled instances (p4de, p5, p5en), uncomment them. +> For non-EFA instances (g5, g6e), leave them commented out and set `EFA_PER_NODE=0`. You can also adjust the training parameters in `TRAINING_ARGS` (for example, to train Llama 3.1 70B). Additional parameters can be found in `src/model_utils/arguments.py`. Note that we use the same directory for both `--checkpoint_dir` and `--resume_from_checkpoint`. If there are multiple checkpoints, `--resume_from_checkpoint` will automatically select the most recent one. This way if our training is interupted for any reason, it will automatically pick up the most recent checkpoint. diff --git a/3.test_cases/pytorch/FSDP/profiles/README.md b/3.test_cases/pytorch/FSDP/profiles/README.md new file mode 100644 index 000000000..812db33f3 --- /dev/null +++ b/3.test_cases/pytorch/FSDP/profiles/README.md @@ -0,0 +1,74 @@ +# FSDP Instance Profiles + +Instance profiles configure GPU count, EFA networking, NCCL settings, and FSDP +memory optimizations per EC2 instance type. The profile is auto-detected at +runtime so the same sbatch script works across different clusters. 
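The detect-then-source pattern the sbatch scripts follow can be sketched as below. This is an illustrative stand-in, not the repo's actual flow: the temporary directory and the two-variable profile are ours, and the `INSTANCE_PROFILE` resolution step is emulated inline rather than by calling the real `_detect.sh`.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the profile sourcing flow used by the sbatch scripts.
set -euo pipefail

# Stand-in profiles dir with one minimal profile (real ones set many more vars).
PROFILES_DIR="$(mktemp -d)"
trap 'rm -rf "$PROFILES_DIR"' EXIT
cat > "${PROFILES_DIR}/g5-12xlarge.env" <<'EOF'
export GPUS_PER_NODE=4
export EFA_PER_NODE=0
EOF

# _detect.sh honors an explicit INSTANCE_PROFILE override first; emulate
# that resolution step here instead of probing instance metadata.
INSTANCE_PROFILE="g5-12xlarge"
PROFILE_ENV="${PROFILES_DIR}/${INSTANCE_PROFILE}.env"

# Sourcing the profile exports the hardware facts into the current shell,
# so later torchrun arguments can use $GPUS_PER_NODE instead of a literal 8.
source "$PROFILE_ENV"
echo "GPUS_PER_NODE=${GPUS_PER_NODE} EFA_PER_NODE=${EFA_PER_NODE}"
```

Because the profile only exports environment variables, sourcing it is idempotent and safe to repeat; a script that runs without any profile simply falls back to its own `${VAR:-default}` values.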
+ +## How Detection Works + +The detection script (`_detect.sh`) selects the profile automatically: + +1. **`INSTANCE_PROFILE`** env var — explicit override (e.g., `g5-12xlarge`) +2. **`INSTANCE_TYPE`** env var — from your `env_vars` (e.g., `g5.12xlarge`) +3. **EC2 Instance Metadata API** — works on bare metal and K8s with host networking +4. **GPU name from nvidia-smi** — fallback when metadata is unavailable + +To override: `export INSTANCE_PROFILE=g5-12xlarge` before running sbatch. + +## Available Profiles + +| Profile | Instance | GPU | VRAM | Compatible Models | Status | +|---------|----------|-----|------|------------------|--------| +| [p5en-48xlarge.env](p5en-48xlarge.env) | p5en.48xlarge | 8x H200 | 141 GB | All 9 models | Tested | +| [p5-48xlarge.env](p5-48xlarge.env) | p5.48xlarge | 8x H100 | 80 GB | All 9 models | Tested | +| [p4de-24xlarge.env](p4de-24xlarge.env) | p4de.24xlarge | 8x A100 | 80 GB | All 9 models | Tested | +| [g5-12xlarge.env](g5-12xlarge.env) | g5.12xlarge | 4x A10G | 24 GB | 1B, 3B; tight for 7B-8B | Tested | +| [g6e-12xlarge.env](g6e-12xlarge.env) | g6e.12xlarge | 4x L40S | 48 GB | 1B-8B; tight for 13B | Untested | + +## What the Profile Controls + +Settings that change per instance (set by profile): + +| Setting | Description | +|---------|-------------| +| `GPUS_PER_NODE` | Number of GPUs (4 for g5/g6e, 8 for p4de/p5/p5en) | +| `FI_PROVIDER`, `FI_EFA_*` | EFA networking (set for EFA instances, unset for g5/g6e) | +| `EFA_PER_NODE` | EFA adapter count for K8s resource requests | +| `NCCL_SOCKET_IFNAME` | NCCL network interface filter | +| `FSDP_CPU_OFFLOAD` | Whether to offload FSDP parameters to CPU | +| `FSDP_CPU_OFFLOAD_MIN_LAYERS` | Only enable cpu_offload if model has >= this many layers. Avoids the ~30% offloading overhead on small models that fit without it (e.g., 1B/3B on g5). 
|
+
+Settings that stay the same across instances (set in `models/*.txt`):
+
+| Setting | Description |
+|---------|-------------|
+| Model architecture args | `hidden_width`, `num_layers`, `num_heads`, etc. |
+| `--train_batch_size` | Per-GPU micro batch size |
+| `--offload_activations` | Activation offloading (always 1) |
+| `--sharding_strategy` | FSDP sharding (always `full`) |
+
+## Model Compatibility by Instance
+
+| Model | p5en/p5 (80-141 GB) | p4de (80 GB) | g6e (48 GB) | g5 (24 GB) |
+|-------|---------------------|--------------|-------------|------------|
+| Llama 3.2 1B | OK | OK | OK | OK |
+| Llama 3.2 3B | OK | OK | OK | OK |
+| Llama 2 7B | OK | OK | OK | Tight (cpu_offload) |
+| Llama 3.1 8B | OK | OK | OK | Tight (cpu_offload) |
+| Mathstral 7B | OK | OK | OK | Tight (cpu_offload) |
+| Llama 2 13B | OK | OK | Tight | Won't fit |
+| Llama 2 70B | OK | OK | Won't fit | Won't fit |
+| Llama 3.1 70B | OK | OK | Won't fit | Won't fit |
+| Mixtral 8x7B | OK | OK | Won't fit | Won't fit |
+
+## Creating a New Profile
+
+1. Copy the closest existing profile:
+   ```bash
+   cp profiles/p5en-48xlarge.env profiles/p4d-24xlarge.env
+   ```
+2. Adjust GPU count, EFA settings, and FSDP overrides
+3. Run detection test:
+   ```bash
+   INSTANCE_TYPE=p4d.24xlarge bash profiles/_detect.sh profiles/
+   ```
diff --git a/3.test_cases/pytorch/FSDP/profiles/_detect.sh b/3.test_cases/pytorch/FSDP/profiles/_detect.sh
new file mode 100755
index 000000000..896664395
--- /dev/null
+++ b/3.test_cases/pytorch/FSDP/profiles/_detect.sh
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+#
+# _detect.sh — Auto-detect EC2 instance type and resolve a profile.
+#
+# CANONICAL SOURCE: This is the single source of truth for instance detection.
+# Copies exist as profiles/_detect.sh in each test case.
To update +# all copies, edit this file and run: ./sync_profiles.sh +# +# Usage (from a training script): +# PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles") +# source "$PROFILE_ENV" +# +# Detection order: +# 1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge") +# 2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge") +# 3. EC2 instance metadata API (works on bare metal and K8s with host networking) +# 4. GPU name from nvidia-smi (fallback when metadata is unavailable) +# +# Outputs the path to the profile .env file on stdout. +# Exits non-zero if no profile can be resolved. +# --------------------------------------------------------------------------- +set -euo pipefail + +PROFILES_DIR="${1:-.}" + +# --- Step 1: Check for explicit INSTANCE_PROFILE override ------------------- +if [[ -n "${INSTANCE_PROFILE:-}" ]]; then + PROFILE_NAME="$INSTANCE_PROFILE" + echo "Instance profile override: ${PROFILE_NAME}" >&2 +else + # --- Step 2: Try INSTANCE_TYPE from env_vars ---------------------------- + INSTANCE_TYPE="${INSTANCE_TYPE:-}" + + # --- Step 3: Try EC2 instance metadata API ------------------------------ + if [[ -z "$INSTANCE_TYPE" ]]; then + # IMDSv2: get a token first, then query + TOKEN=$(curl -s --connect-timeout 2 -X PUT \ + "http://169.254.169.254/latest/api/token" \ + -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true + if [[ -n "$TOKEN" ]]; then + INSTANCE_TYPE=$(curl -s --connect-timeout 2 \ + -H "X-aws-ec2-metadata-token: $TOKEN" \ + "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true + fi + fi + + # --- Step 4: Fallback — detect from GPU name ---------------------------- + if [[ -z "$INSTANCE_TYPE" ]]; then + if command -v nvidia-smi &>/dev/null; then + GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) || true + case "${GPU_NAME:-}" in + *A10G*) INSTANCE_TYPE="g5.12xlarge" ;; + *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;; + *A100*40*) 
INSTANCE_TYPE="p4d.24xlarge" ;; + *H100*) INSTANCE_TYPE="p5.48xlarge" ;; + *H200*) INSTANCE_TYPE="p5en.48xlarge" ;; + *L40S*) INSTANCE_TYPE="g6e.12xlarge" ;; + *L4*) INSTANCE_TYPE="g6.12xlarge" ;; + *) + echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + ;; + esac + echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2 + else + echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + fi + fi + + # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge" + PROFILE_NAME="${INSTANCE_TYPE//./-}" + echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2 +fi + +# --- Resolve profile file path ---------------------------------------------- +PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env" + +if [[ ! -f "$PROFILE_PATH" ]]; then + echo "ERROR: No profile found at ${PROFILE_PATH}" >&2 + echo "" >&2 + echo "Available profiles:" >&2 + ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\// /' >&2 || echo " (none)" >&2 + echo "" >&2 + echo "To create a new profile, copy an existing one and adjust the values:" >&2 + echo " cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2 + exit 1 +fi + +# Output the resolved path (recipe script will source it) +echo "$PROFILE_PATH" diff --git a/3.test_cases/pytorch/FSDP/profiles/g5-12xlarge.env b/3.test_cases/pytorch/FSDP/profiles/g5-12xlarge.env new file mode 100644 index 000000000..67cdcf6e6 --- /dev/null +++ b/3.test_cases/pytorch/FSDP/profiles/g5-12xlarge.env @@ -0,0 +1,40 @@ +# g5.12xlarge — 4x A10G 24GB, no EFA, no NVLink, no GPUDirect RDMA +# Limited to small/medium models due to 24 GB VRAM per GPU. 
+# +# Validated with: Llama 3.2 1B, Llama 3.2 3B on 4 nodes +# +# MODEL COMPATIBILITY (g5.12xlarge, 4x A10G 24GB each): +# - Llama 3.2 1B: WORKS — fits easily, no offload needed +# - Llama 3.2 3B: WORKS — fits with activation offload +# - Llama 2 7B: TIGHT — may need cpu_offload=1 with 4 GPUs +# - Llama 3.1 8B: TIGHT — may need cpu_offload=1 with 4 GPUs +# - Mathstral 7B: TIGHT — may need cpu_offload=1 with 4 GPUs +# - Llama 2 13B: WILL NOT FIT — even with offloading (24 GB too small) +# - Llama 2 70B: WILL NOT FIT +# - Llama 3.1 70B: WILL NOT FIT +# - Mixtral 8x7B: WILL NOT FIT +# +# Key differences from p5/p5en: +# - 4 GPUs instead of 8 → fewer shards → more memory per shard +# - No EFA → must remove all FI_* env vars +# - 24 GB VRAM → large models won't fit even with FSDP sharding +# - cpu_offload is the main lever for fitting 7B-8B models + +# --- Hardware --- +export GPUS_PER_NODE=4 + +# --- EFA / NCCL --- +# No EFA on g5 — do NOT set FI_PROVIDER or FI_EFA_* variables. +# The profile-aware template will skip EFA vars when these are unset. +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" + +# --- Kubernetes resource requests --- +export GPU_PER_NODE=4 +export EFA_PER_NODE=0 + +# --- FSDP training overrides --- +# cpu_offload=1 needed for 7B+ models on 24 GB GPUs. +# MIN_LAYERS threshold skips offloading for small models (1B=16, 3B=28 layers) +# that fit without it, avoiding the ~30% offloading overhead. +export FSDP_CPU_OFFLOAD=1 +export FSDP_CPU_OFFLOAD_MIN_LAYERS=32 diff --git a/3.test_cases/pytorch/FSDP/profiles/g6e-12xlarge.env b/3.test_cases/pytorch/FSDP/profiles/g6e-12xlarge.env new file mode 100644 index 000000000..3a8f65ac8 --- /dev/null +++ b/3.test_cases/pytorch/FSDP/profiles/g6e-12xlarge.env @@ -0,0 +1,37 @@ +# g6e.12xlarge — 4x L40S 48GB, no EFA, no NVLink, no GPUDirect RDMA +# Handles most models up to ~13B comfortably; 70B models will not fit. 
+# +# Status: UNTESTED (extrapolated from g5 and p4de profiles) +# +# MODEL COMPATIBILITY (g6e.12xlarge, 4x L40S 48GB each): +# - Llama 3.2 1B: WORKS — fits easily +# - Llama 3.2 3B: WORKS — fits easily +# - Llama 2 7B: WORKS — 48 GB provides comfortable headroom +# - Llama 3.1 8B: WORKS — 48 GB provides comfortable headroom +# - Mathstral 7B: WORKS — 48 GB provides comfortable headroom +# - Llama 2 13B: TIGHT — may need cpu_offload=1 with 4 GPUs +# - Llama 2 70B: WILL NOT FIT +# - Llama 3.1 70B: WILL NOT FIT +# - Mixtral 8x7B: WILL NOT FIT +# +# Key differences from g5: +# - 48 GB VRAM vs 24 GB → much more headroom for 7B-8B models +# - Still no EFA, no NVLink +# - L40S supports bf16 natively (good throughput) + +# --- Hardware --- +export GPUS_PER_NODE=4 + +# --- EFA / NCCL --- +# No EFA on g6e. +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" + +# --- Kubernetes resource requests --- +export GPU_PER_NODE=4 +export EFA_PER_NODE=0 + +# --- FSDP training overrides --- +# 48 GB is enough for 7B-8B without cpu_offload. +# Enable cpu_offload for 13B+ models (13B=40 layers, 7B/8B=32 layers). +export FSDP_CPU_OFFLOAD=1 +export FSDP_CPU_OFFLOAD_MIN_LAYERS=40 diff --git a/3.test_cases/pytorch/FSDP/profiles/p4de-24xlarge.env b/3.test_cases/pytorch/FSDP/profiles/p4de-24xlarge.env new file mode 100644 index 000000000..bbad36542 --- /dev/null +++ b/3.test_cases/pytorch/FSDP/profiles/p4de-24xlarge.env @@ -0,0 +1,28 @@ +# p4de.24xlarge — 8x A100 80GB, 4 EFA, NVLink, GPUDirect RDMA +# All models in models/ work as-is. +# +# Validated with: Llama 2 7B on 4 nodes +# +# MODEL NOTES: +# All 9 model configs should work on p4de. The 80 GB VRAM is +# sufficient for all models including 70B with FSDP sharding. +# EFA count is lower (4 vs 32) so multi-node all-reduce is slower +# than p5/p5en but still functional. 
+ +# --- Hardware --- +export GPUS_PER_NODE=8 + +# --- EFA / NCCL --- +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +# GPUDirect RDMA is available on p4de +export FI_EFA_USE_DEVICE_RDMA=1 +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" + +# --- Kubernetes resource requests --- +export GPU_PER_NODE=8 +export EFA_PER_NODE=4 + +# --- FSDP training overrides --- +export FSDP_CPU_OFFLOAD=0 diff --git a/3.test_cases/pytorch/FSDP/profiles/p5-48xlarge.env b/3.test_cases/pytorch/FSDP/profiles/p5-48xlarge.env new file mode 100644 index 000000000..7bbceff5f --- /dev/null +++ b/3.test_cases/pytorch/FSDP/profiles/p5-48xlarge.env @@ -0,0 +1,24 @@ +# p5.48xlarge — 8x H100 80GB, 32 EFA, NVLink, GPUDirect RDMA +# All models in models/ work as-is. +# +# Validated with: Llama 3.1 8B, Llama 2 70B, Mixtral 8x7B on 4 nodes +# +# MODEL NOTES: +# All 9 model configs run on p5 without changes. Same as p5en but +# with 80 GB VRAM instead of 141 GB — still sufficient for all models. + +# --- Hardware --- +export GPUS_PER_NODE=8 + +# --- EFA / NCCL --- +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" + +# --- Kubernetes resource requests --- +export GPU_PER_NODE=8 +export EFA_PER_NODE=32 + +# --- FSDP training overrides --- +export FSDP_CPU_OFFLOAD=0 diff --git a/3.test_cases/pytorch/FSDP/profiles/p5en-48xlarge.env b/3.test_cases/pytorch/FSDP/profiles/p5en-48xlarge.env new file mode 100644 index 000000000..147bb8f24 --- /dev/null +++ b/3.test_cases/pytorch/FSDP/profiles/p5en-48xlarge.env @@ -0,0 +1,26 @@ +# p5en.48xlarge — 8x H200 141GB, 32 EFA, NVLink, GPUDirect RDMA +# Reference instance for FSDP training. All models in models/ work as-is. +# +# Validated with: Llama 3.1 8B, Llama 2 70B, Mixtral 8x7B on 4 nodes +# +# MODEL NOTES: +# All 9 model configs (models/*.txt) run on p5en without changes. 
+# The large GPU VRAM (141 GB) and high interconnect bandwidth mean
+# no cpu_offload or batch size reduction is needed for any model.
+
+# --- Hardware ---
+export GPUS_PER_NODE=8
+
+# --- EFA / NCCL ---
+export FI_PROVIDER=efa
+export FI_EFA_USE_HUGE_PAGE=0
+export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
+export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth"
+
+# --- Kubernetes resource requests (used by envsubst in K8s manifests) ---
+export GPU_PER_NODE=8
+export EFA_PER_NODE=32
+
+# --- FSDP training overrides ---
+# No overrides needed; model configs work as-is on p5en.
+export FSDP_CPU_OFFLOAD=0
diff --git a/3.test_cases/pytorch/FSDP/slurm/README.md b/3.test_cases/pytorch/FSDP/slurm/README.md
index 203798d49..b027bf111 100644
--- a/3.test_cases/pytorch/FSDP/slurm/README.md
+++ b/3.test_cases/pytorch/FSDP/slurm/README.md
@@ -65,9 +65,16 @@ If you are using a container image, you need to uncomment the line below in the
 #export CONTAINER_IMAGE=$(pwd)/pytorch-fsdp.sqsh
 ```
 
-If you are using non-EFA enabled instances, such as G4dn, or single GPU g5 nodes, comment out all EFA environment variables on lines 24-25.
+If you are using non-EFA enabled instances, such as G4dn, or single-GPU g5 nodes, the [instance profile system](../profiles/) handles this automatically — the matching profile disables EFA variables for non-EFA instances. If you're not using profiles, comment out all EFA environment variables on lines 24-25.
 
-Also, under `User Variables` make sure to adjust `GPUS_PER_NODE` to match the number of GPUs on your instance type (8 for P4d(e)/P5/P6-B200), 4 for G5.12xlarge, 1 for G5.xlarge).
+The instance profile also sets `GPUS_PER_NODE` automatically. If you're not using profiles, under `User Variables` make sure to adjust `GPUS_PER_NODE` to match the number of GPUs on your instance type (8 for P4d(e)/P5/P6-B200, 4 for G5.12xlarge, 1 for G5.xlarge).
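+The sbatch scripts rely on a simple contract with `profiles/_detect.sh`: the matching profile path is the only thing printed to stdout, diagnostics go to stderr, and a non-zero exit means no profile matched. The sketch below illustrates that contract under stated assumptions (the `detect_profile` function and its resolution order are hypothetical; the shipped `_detect.sh` may differ):
+
+```bash
+# Hypothetical sketch of the detection contract, not the shipped script:
+# stdout = profile path, stderr = diagnostics, non-zero exit = no match.
+detect_profile() {
+  local profiles_dir="$1"
+
+  # An explicit INSTANCE_PROFILE override wins outright.
+  if [[ -n "${INSTANCE_PROFILE:-}" ]]; then
+    echo "using INSTANCE_PROFILE override" >&2
+    echo "${profiles_dir}/${INSTANCE_PROFILE}.env"
+    return 0
+  fi
+
+  # Otherwise resolve the instance type: INSTANCE_TYPE override first,
+  # then EC2 instance metadata (IMDSv2).
+  local instance_type="${INSTANCE_TYPE:-}"
+  if [[ -z "$instance_type" ]]; then
+    local token
+    token=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
+      -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || true
+    instance_type=$(curl -sf -H "X-aws-ec2-metadata-token: $token" \
+      "http://169.254.169.254/latest/meta-data/instance-type") || true
+  fi
+
+  # Map "g5.12xlarge" to "g5-12xlarge.env" and confirm the profile exists.
+  local profile="${profiles_dir}/${instance_type/./-}.env"
+  if [[ -n "$instance_type" && -f "$profile" ]]; then
+    echo "detected instance type: ${instance_type}" >&2
+    echo "$profile"
+  else
+    echo "no profile for instance type '${instance_type:-unknown}'" >&2
+    return 1
+  fi
+}
+```
+
+Because the path is the only stdout output, callers can safely capture it with `PROFILE_ENV=$(...)` and `source` the result, while the stderr diagnostics still reach the Slurm log.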
+ +To override the profile, set `INSTANCE_PROFILE` or `INSTANCE_TYPE` before running sbatch: + +```bash +export INSTANCE_TYPE=g5.12xlarge +sbatch llama3_2_1b-training.sbatch +``` You can also adjust the training parameters in `TRAINING_ARGS` (for example, to increase batch size). Additional parameters can be found in `src/model_utils/arguments.py`. Note that we use the same directory for both `--checkpoint_dir` and `--resume_from_checkpoint`. If there are multiple checkpoints, `--resume_from_checkpoint` will automatically select the most recent one. This way if our training is interupted for any reason, it will automatically pick up the most recent checkpoint. diff --git a/3.test_cases/pytorch/FSDP/slurm/llama2_13b-training.sbatch b/3.test_cases/pytorch/FSDP/slurm/llama2_13b-training.sbatch index 23de071cf..c544d0e9c 100644 --- a/3.test_cases/pytorch/FSDP/slurm/llama2_13b-training.sbatch +++ b/3.test_cases/pytorch/FSDP/slurm/llama2_13b-training.sbatch @@ -12,15 +12,37 @@ set -ex; ########################### -###### User Variables ##### +###### Instance Profile ### ########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + # _detect.sh prints diagnostics to stderr and the profile path to stdout. + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." 
+fi -GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5 +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} ############################### ###### Container Variable ##### ############################### -# Uncomment if you want to use a container instea of Virtual Environment. +# Uncomment if you want to use a container instead of Virtual Environment. #export CONTAINER_IMAGE=$(pwd)/pytorch-fsdp.sqsh export DATA_PATH=/fsx export FSX_MOUNT=$(pwd):$DATA_PATH @@ -29,22 +51,27 @@ export FSX_MOUNT=$(pwd):$DATA_PATH ## Environment Variables ## ########################### -## Plenty of EFA level variables -## For G4dn and other G5, comment out all -#export FI_LOG_LEVEL=warn +# EFA networking — configured by profile or defaults. +# Non-EFA instances (g5, g6e) have EFA_PER_NODE=0 in their profile, +# which skips these variables. EFA instances set FI_PROVIDER in the profile. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. + # If profile set FI_PROVIDER, EFA is enabled. Otherwise, it's a non-EFA instance. + true +else + # No profile — use legacy EFA defaults (P4/P5) + export FI_PROVIDER=efa + export FI_EFA_USE_HUGE_PAGE=0 + export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +fi + export NCCL_DEBUG=INFO -export FI_PROVIDER=efa -export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. 
-## Switching SYNC_MEMOPS to zero can boost throughput with FSDP -## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS -## Reduces memory synchronizations -## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html -export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth,eth"} + # LD_PRELOAD is required for PyTorch to find the NCCL library # This path assumes you are using the Deep Learning AMI # If you are not using the DLAMI, you may need to update this path export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so -export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth ## Set HuggingFace metadata timeout (in seconds) for large clusters export HF_HUB_ETAG_TIMEOUT=60 @@ -105,9 +132,30 @@ declare -a TRAINING_ARGS=( --offload_activations=1 ) +# Conditionally append cpu_offload from instance profile. +# The profile sets FSDP_CPU_OFFLOAD=1 for memory-constrained instances (e.g., g5) +# and FSDP_CPU_OFFLOAD_MIN_LAYERS as a threshold so small models (1B, 3B) that +# fit without offloading aren't penalized. Offloading adds ~30% overhead. +if [[ "${FSDP_CPU_OFFLOAD:-0}" == "1" ]]; then + NUM_LAYERS="" + for arg in "${TRAINING_ARGS[@]}"; do + if [[ "$arg" == --num_layers=* ]]; then + NUM_LAYERS="${arg#*=}" + break + fi + done + MIN_LAYERS="${FSDP_CPU_OFFLOAD_MIN_LAYERS:-0}" + if [[ -n "$NUM_LAYERS" && "$NUM_LAYERS" -ge "$MIN_LAYERS" ]]; then + echo "Enabling --cpu_offload=1 (num_layers=$NUM_LAYERS >= threshold=$MIN_LAYERS)" + TRAINING_ARGS+=(--cpu_offload=1) + else + echo "Skipping cpu_offload (num_layers=${NUM_LAYERS:-unknown} < threshold=$MIN_LAYERS)" + fi +fi + AUTO_RESUME="" if [ -d "/opt/sagemaker_cluster" ]; then echo "Detected Hyperpod cluster.. 
enabling --auto-resume=1" AUTO_RESUME="--auto-resume=1" fi -srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" +srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" \ No newline at end of file diff --git a/3.test_cases/pytorch/FSDP/slurm/llama2_70b-training.sbatch b/3.test_cases/pytorch/FSDP/slurm/llama2_70b-training.sbatch index e1c8c1aa1..5c809928a 100644 --- a/3.test_cases/pytorch/FSDP/slurm/llama2_70b-training.sbatch +++ b/3.test_cases/pytorch/FSDP/slurm/llama2_70b-training.sbatch @@ -12,15 +12,37 @@ set -ex; ########################### -###### User Variables ##### +###### Instance Profile ### ########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + # _detect.sh prints diagnostics to stderr and the profile path to stdout. + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi -GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5 +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} ############################### ###### Container Variable ##### ############################### -# Uncomment if you want to use a container instea of Virtual Environment. +# Uncomment if you want to use a container instead of Virtual Environment. 
#export CONTAINER_IMAGE=$(pwd)/pytorch-fsdp.sqsh export DATA_PATH=/fsx export FSX_MOUNT=$(pwd):$DATA_PATH @@ -29,22 +51,27 @@ export FSX_MOUNT=$(pwd):$DATA_PATH ## Environment Variables ## ########################### -## Plenty of EFA level variables -## For G4dn and other G5, comment out all -#export FI_LOG_LEVEL=warn +# EFA networking — configured by profile or defaults. +# Non-EFA instances (g5, g6e) have EFA_PER_NODE=0 in their profile, +# which skips these variables. EFA instances set FI_PROVIDER in the profile. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. + # If profile set FI_PROVIDER, EFA is enabled. Otherwise, it's a non-EFA instance. + true +else + # No profile — use legacy EFA defaults (P4/P5) + export FI_PROVIDER=efa + export FI_EFA_USE_HUGE_PAGE=0 + export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +fi + export NCCL_DEBUG=INFO -export FI_PROVIDER=efa -export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. -## Switching SYNC_MEMOPS to zero can boost throughput with FSDP -## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS -## Reduces memory synchronizations -## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html -export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth,eth"} + # LD_PRELOAD is required for PyTorch to find the NCCL library # This path assumes you are using the Deep Learning AMI # If you are not using the DLAMI, you may need to update this path export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so -export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth ## Set HuggingFace metadata timeout (in seconds) for large clusters export HF_HUB_ETAG_TIMEOUT=60 @@ -105,9 +132,30 @@ declare -a TRAINING_ARGS=( --offload_activations=1 ) +# Conditionally append cpu_offload from instance profile. 
+# The profile sets FSDP_CPU_OFFLOAD=1 for memory-constrained instances (e.g., g5) +# and FSDP_CPU_OFFLOAD_MIN_LAYERS as a threshold so small models (1B, 3B) that +# fit without offloading aren't penalized. Offloading adds ~30% overhead. +if [[ "${FSDP_CPU_OFFLOAD:-0}" == "1" ]]; then + NUM_LAYERS="" + for arg in "${TRAINING_ARGS[@]}"; do + if [[ "$arg" == --num_layers=* ]]; then + NUM_LAYERS="${arg#*=}" + break + fi + done + MIN_LAYERS="${FSDP_CPU_OFFLOAD_MIN_LAYERS:-0}" + if [[ -n "$NUM_LAYERS" && "$NUM_LAYERS" -ge "$MIN_LAYERS" ]]; then + echo "Enabling --cpu_offload=1 (num_layers=$NUM_LAYERS >= threshold=$MIN_LAYERS)" + TRAINING_ARGS+=(--cpu_offload=1) + else + echo "Skipping cpu_offload (num_layers=${NUM_LAYERS:-unknown} < threshold=$MIN_LAYERS)" + fi +fi + AUTO_RESUME="" if [ -d "/opt/sagemaker_cluster" ]; then echo "Detected Hyperpod cluster.. enabling --auto-resume=1" AUTO_RESUME="--auto-resume=1" fi -srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" +srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" \ No newline at end of file diff --git a/3.test_cases/pytorch/FSDP/slurm/llama2_7b-training.sbatch b/3.test_cases/pytorch/FSDP/slurm/llama2_7b-training.sbatch index 81870db75..5f5df0c5b 100644 --- a/3.test_cases/pytorch/FSDP/slurm/llama2_7b-training.sbatch +++ b/3.test_cases/pytorch/FSDP/slurm/llama2_7b-training.sbatch @@ -12,15 +12,37 @@ set -ex; ########################### -###### User Variables ##### +###### Instance Profile ### ########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. 
+ +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + # _detect.sh prints diagnostics to stderr and the profile path to stdout. + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi -GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5 +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} ############################### ###### Container Variable ##### ############################### -# Uncomment if you want to use a container instea of Virtual Environment. +# Uncomment if you want to use a container instead of Virtual Environment. #export CONTAINER_IMAGE=$(pwd)/pytorch-fsdp.sqsh export DATA_PATH=/fsx export FSX_MOUNT=$(pwd):$DATA_PATH @@ -29,22 +51,27 @@ export FSX_MOUNT=$(pwd):$DATA_PATH ## Environment Variables ## ########################### -## Plenty of EFA level variables -## For G4dn and other G5, comment out all -#export FI_LOG_LEVEL=warn +# EFA networking — configured by profile or defaults. +# Non-EFA instances (g5, g6e) have EFA_PER_NODE=0 in their profile, +# which skips these variables. EFA instances set FI_PROVIDER in the profile. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. + # If profile set FI_PROVIDER, EFA is enabled. Otherwise, it's a non-EFA instance. 
+ true +else + # No profile — use legacy EFA defaults (P4/P5) + export FI_PROVIDER=efa + export FI_EFA_USE_HUGE_PAGE=0 + export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +fi + export NCCL_DEBUG=INFO -export FI_PROVIDER=efa -export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. -## Switching SYNC_MEMOPS to zero can boost throughput with FSDP -## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS -## Reduces memory synchronizations -## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html -export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth,eth"} + # LD_PRELOAD is required for PyTorch to find the NCCL library # This path assumes you are using the Deep Learning AMI # If you are not using the DLAMI, you may need to update this path export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so -export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth ## Set HuggingFace metadata timeout (in seconds) for large clusters export HF_HUB_ETAG_TIMEOUT=60 @@ -105,9 +132,30 @@ declare -a TRAINING_ARGS=( --offload_activations=1 ) +# Conditionally append cpu_offload from instance profile. +# The profile sets FSDP_CPU_OFFLOAD=1 for memory-constrained instances (e.g., g5) +# and FSDP_CPU_OFFLOAD_MIN_LAYERS as a threshold so small models (1B, 3B) that +# fit without offloading aren't penalized. Offloading adds ~30% overhead. 
+if [[ "${FSDP_CPU_OFFLOAD:-0}" == "1" ]]; then + NUM_LAYERS="" + for arg in "${TRAINING_ARGS[@]}"; do + if [[ "$arg" == --num_layers=* ]]; then + NUM_LAYERS="${arg#*=}" + break + fi + done + MIN_LAYERS="${FSDP_CPU_OFFLOAD_MIN_LAYERS:-0}" + if [[ -n "$NUM_LAYERS" && "$NUM_LAYERS" -ge "$MIN_LAYERS" ]]; then + echo "Enabling --cpu_offload=1 (num_layers=$NUM_LAYERS >= threshold=$MIN_LAYERS)" + TRAINING_ARGS+=(--cpu_offload=1) + else + echo "Skipping cpu_offload (num_layers=${NUM_LAYERS:-unknown} < threshold=$MIN_LAYERS)" + fi +fi + AUTO_RESUME="" if [ -d "/opt/sagemaker_cluster" ]; then echo "Detected Hyperpod cluster.. enabling --auto-resume=1" AUTO_RESUME="--auto-resume=1" fi -srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" +srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" \ No newline at end of file diff --git a/3.test_cases/pytorch/FSDP/slurm/llama3_1_70b-training.sbatch b/3.test_cases/pytorch/FSDP/slurm/llama3_1_70b-training.sbatch index 031988dc2..566cfb453 100644 --- a/3.test_cases/pytorch/FSDP/slurm/llama3_1_70b-training.sbatch +++ b/3.test_cases/pytorch/FSDP/slurm/llama3_1_70b-training.sbatch @@ -12,15 +12,37 @@ set -ex; ########################### -###### User Variables ##### +###### Instance Profile ### ########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + # _detect.sh prints diagnostics to stderr and the profile path to stdout. 
+ if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi -GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5 +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} ############################### ###### Container Variable ##### ############################### -# Uncomment if you want to use a container instea of Virtual Environment. +# Uncomment if you want to use a container instead of Virtual Environment. #export CONTAINER_IMAGE=$(pwd)/pytorch-fsdp.sqsh export DATA_PATH=/fsx export FSX_MOUNT=$(pwd):$DATA_PATH @@ -29,22 +51,27 @@ export FSX_MOUNT=$(pwd):$DATA_PATH ## Environment Variables ## ########################### -## Plenty of EFA level variables -## For G4dn and other G5, comment out all -#export FI_LOG_LEVEL=warn +# EFA networking — configured by profile or defaults. +# Non-EFA instances (g5, g6e) have EFA_PER_NODE=0 in their profile, +# which skips these variables. EFA instances set FI_PROVIDER in the profile. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. + # If profile set FI_PROVIDER, EFA is enabled. Otherwise, it's a non-EFA instance. + true +else + # No profile — use legacy EFA defaults (P4/P5) + export FI_PROVIDER=efa + export FI_EFA_USE_HUGE_PAGE=0 + export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +fi + export NCCL_DEBUG=INFO -export FI_PROVIDER=efa -export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. 
-## Switching SYNC_MEMOPS to zero can boost throughput with FSDP -## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS -## Reduces memory synchronizations -## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html -export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth,eth"} + # LD_PRELOAD is required for PyTorch to find the NCCL library # This path assumes you are using the Deep Learning AMI # If you are not using the DLAMI, you may need to update this path export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so -export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth ## Set HuggingFace metadata timeout (in seconds) for large clusters export HF_HUB_ETAG_TIMEOUT=60 @@ -105,9 +132,30 @@ declare -a TRAINING_ARGS=( --offload_activations=1 ) +# Conditionally append cpu_offload from instance profile. +# The profile sets FSDP_CPU_OFFLOAD=1 for memory-constrained instances (e.g., g5) +# and FSDP_CPU_OFFLOAD_MIN_LAYERS as a threshold so small models (1B, 3B) that +# fit without offloading aren't penalized. Offloading adds ~30% overhead. +if [[ "${FSDP_CPU_OFFLOAD:-0}" == "1" ]]; then + NUM_LAYERS="" + for arg in "${TRAINING_ARGS[@]}"; do + if [[ "$arg" == --num_layers=* ]]; then + NUM_LAYERS="${arg#*=}" + break + fi + done + MIN_LAYERS="${FSDP_CPU_OFFLOAD_MIN_LAYERS:-0}" + if [[ -n "$NUM_LAYERS" && "$NUM_LAYERS" -ge "$MIN_LAYERS" ]]; then + echo "Enabling --cpu_offload=1 (num_layers=$NUM_LAYERS >= threshold=$MIN_LAYERS)" + TRAINING_ARGS+=(--cpu_offload=1) + else + echo "Skipping cpu_offload (num_layers=${NUM_LAYERS:-unknown} < threshold=$MIN_LAYERS)" + fi +fi + AUTO_RESUME="" if [ -d "/opt/sagemaker_cluster" ]; then echo "Detected Hyperpod cluster.. 
enabling --auto-resume=1" AUTO_RESUME="--auto-resume=1" fi -srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" +srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" \ No newline at end of file diff --git a/3.test_cases/pytorch/FSDP/slurm/llama3_1_8b-training.sbatch b/3.test_cases/pytorch/FSDP/slurm/llama3_1_8b-training.sbatch index 650388d43..f961c4e4e 100644 --- a/3.test_cases/pytorch/FSDP/slurm/llama3_1_8b-training.sbatch +++ b/3.test_cases/pytorch/FSDP/slurm/llama3_1_8b-training.sbatch @@ -12,15 +12,37 @@ set -ex; ########################### -###### User Variables ##### +###### Instance Profile ### ########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + # _detect.sh prints diagnostics to stderr and the profile path to stdout. + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi -GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5 +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} ############################### ###### Container Variable ##### ############################### -# Uncomment if you want to use a container instea of Virtual Environment. +# Uncomment if you want to use a container instead of Virtual Environment. 
#export CONTAINER_IMAGE=$(pwd)/pytorch-fsdp.sqsh export DATA_PATH=/fsx export FSX_MOUNT=$(pwd):$DATA_PATH @@ -29,22 +51,27 @@ export FSX_MOUNT=$(pwd):$DATA_PATH ## Environment Variables ## ########################### -## Plenty of EFA level variables -## For G4dn and other G5, comment out all -#export FI_LOG_LEVEL=warn +# EFA networking — configured by profile or defaults. +# Non-EFA instances (g5, g6e) have EFA_PER_NODE=0 in their profile, +# which skips these variables. EFA instances set FI_PROVIDER in the profile. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. + # If profile set FI_PROVIDER, EFA is enabled. Otherwise, it's a non-EFA instance. + true +else + # No profile — use legacy EFA defaults (P4/P5) + export FI_PROVIDER=efa + export FI_EFA_USE_HUGE_PAGE=0 + export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +fi + export NCCL_DEBUG=INFO -export FI_PROVIDER=efa -export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. -## Switching SYNC_MEMOPS to zero can boost throughput with FSDP -## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS -## Reduces memory synchronizations -## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html -export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth,eth"} + # LD_PRELOAD is required for PyTorch to find the NCCL library # This path assumes you are using the Deep Learning AMI # If you are not using the DLAMI, you may need to update this path export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so -export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth ## Set HuggingFace metadata timeout (in seconds) for large clusters export HF_HUB_ETAG_TIMEOUT=60 @@ -105,9 +132,30 @@ declare -a TRAINING_ARGS=( --offload_activations=1 ) +# Conditionally append cpu_offload from instance profile. 
+# The profile sets FSDP_CPU_OFFLOAD=1 for memory-constrained instances (e.g., g5) +# and FSDP_CPU_OFFLOAD_MIN_LAYERS as a threshold so small models (1B, 3B) that +# fit without offloading aren't penalized. Offloading adds ~30% overhead. +if [[ "${FSDP_CPU_OFFLOAD:-0}" == "1" ]]; then + NUM_LAYERS="" + for arg in "${TRAINING_ARGS[@]}"; do + if [[ "$arg" == --num_layers=* ]]; then + NUM_LAYERS="${arg#*=}" + break + fi + done + MIN_LAYERS="${FSDP_CPU_OFFLOAD_MIN_LAYERS:-0}" + if [[ -n "$NUM_LAYERS" && "$NUM_LAYERS" -ge "$MIN_LAYERS" ]]; then + echo "Enabling --cpu_offload=1 (num_layers=$NUM_LAYERS >= threshold=$MIN_LAYERS)" + TRAINING_ARGS+=(--cpu_offload=1) + else + echo "Skipping cpu_offload (num_layers=${NUM_LAYERS:-unknown} < threshold=$MIN_LAYERS)" + fi +fi + AUTO_RESUME="" if [ -d "/opt/sagemaker_cluster" ]; then echo "Detected Hyperpod cluster.. enabling --auto-resume=1" AUTO_RESUME="--auto-resume=1" fi -srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" +srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" \ No newline at end of file diff --git a/3.test_cases/pytorch/FSDP/slurm/llama3_2_1b-training.sbatch b/3.test_cases/pytorch/FSDP/slurm/llama3_2_1b-training.sbatch index f4c93ffb7..ec26891b4 100644 --- a/3.test_cases/pytorch/FSDP/slurm/llama3_2_1b-training.sbatch +++ b/3.test_cases/pytorch/FSDP/slurm/llama3_2_1b-training.sbatch @@ -12,15 +12,37 @@ set -ex; ########################### -###### User Variables ##### +###### Instance Profile ### ########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. 
+ +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + # _detect.sh prints diagnostics to stderr and the profile path to stdout. + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi -GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5 +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} ############################### ###### Container Variable ##### ############################### -# Uncomment if you want to use a container instea of Virtual Environment. +# Uncomment if you want to use a container instead of Virtual Environment. #export CONTAINER_IMAGE=$(pwd)/pytorch-fsdp.sqsh export DATA_PATH=/fsx export FSX_MOUNT=$(pwd):$DATA_PATH @@ -29,22 +51,27 @@ export FSX_MOUNT=$(pwd):$DATA_PATH ## Environment Variables ## ########################### -## Plenty of EFA level variables -## For G4dn and other G5, comment out all -#export FI_LOG_LEVEL=warn +# EFA networking — configured by profile or defaults. +# Non-EFA instances (g5, g6e) have EFA_PER_NODE=0 in their profile, +# which skips these variables. EFA instances set FI_PROVIDER in the profile. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. + # If profile set FI_PROVIDER, EFA is enabled. Otherwise, it's a non-EFA instance. 
+ true +else + # No profile — use legacy EFA defaults (P4/P5) + export FI_PROVIDER=efa + export FI_EFA_USE_HUGE_PAGE=0 + export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +fi + export NCCL_DEBUG=INFO -export FI_PROVIDER=efa -export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. -## Switching SYNC_MEMOPS to zero can boost throughput with FSDP -## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS -## Reduces memory synchronizations -## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html -export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth,eth"} + # LD_PRELOAD is required for PyTorch to find the NCCL library # This path assumes you are using the Deep Learning AMI # If you are not using the DLAMI, you may need to update this path export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so -export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth ## Set HuggingFace metadata timeout (in seconds) for large clusters export HF_HUB_ETAG_TIMEOUT=60 @@ -105,9 +132,30 @@ declare -a TRAINING_ARGS=( --offload_activations=1 ) +# Conditionally append cpu_offload from instance profile. +# The profile sets FSDP_CPU_OFFLOAD=1 for memory-constrained instances (e.g., g5) +# and FSDP_CPU_OFFLOAD_MIN_LAYERS as a threshold so small models (1B, 3B) that +# fit without offloading aren't penalized. Offloading adds ~30% overhead. 
+if [[ "${FSDP_CPU_OFFLOAD:-0}" == "1" ]]; then + NUM_LAYERS="" + for arg in "${TRAINING_ARGS[@]}"; do + if [[ "$arg" == --num_layers=* ]]; then + NUM_LAYERS="${arg#*=}" + break + fi + done + MIN_LAYERS="${FSDP_CPU_OFFLOAD_MIN_LAYERS:-0}" + if [[ -n "$NUM_LAYERS" && "$NUM_LAYERS" -ge "$MIN_LAYERS" ]]; then + echo "Enabling --cpu_offload=1 (num_layers=$NUM_LAYERS >= threshold=$MIN_LAYERS)" + TRAINING_ARGS+=(--cpu_offload=1) + else + echo "Skipping cpu_offload (num_layers=${NUM_LAYERS:-unknown} < threshold=$MIN_LAYERS)" + fi +fi + AUTO_RESUME="" if [ -d "/opt/sagemaker_cluster" ]; then echo "Detected Hyperpod cluster.. enabling --auto-resume=1" AUTO_RESUME="--auto-resume=1" fi -srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" +srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" \ No newline at end of file diff --git a/3.test_cases/pytorch/FSDP/slurm/llama3_2_3b-training.sbatch b/3.test_cases/pytorch/FSDP/slurm/llama3_2_3b-training.sbatch index 9c9ab745a..11d10f298 100644 --- a/3.test_cases/pytorch/FSDP/slurm/llama3_2_3b-training.sbatch +++ b/3.test_cases/pytorch/FSDP/slurm/llama3_2_3b-training.sbatch @@ -12,15 +12,37 @@ set -ex; ########################### -###### User Variables ##### +###### Instance Profile ### ########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + # _detect.sh prints diagnostics to stderr and the profile path to stdout. 
+ if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi -GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5 +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} ############################### ###### Container Variable ##### ############################### -# Uncomment if you want to use a container instea of Virtual Environment. +# Uncomment if you want to use a container instead of Virtual Environment. #export CONTAINER_IMAGE=$(pwd)/pytorch-fsdp.sqsh export DATA_PATH=/fsx export FSX_MOUNT=$(pwd):$DATA_PATH @@ -29,22 +51,27 @@ export FSX_MOUNT=$(pwd):$DATA_PATH ## Environment Variables ## ########################### -## Plenty of EFA level variables -## For G4dn and other G5, comment out all -#export FI_LOG_LEVEL=warn +# EFA networking — configured by profile or defaults. +# Non-EFA instances (g5, g6e) have EFA_PER_NODE=0 in their profile, +# which skips these variables. EFA instances set FI_PROVIDER in the profile. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. + # If profile set FI_PROVIDER, EFA is enabled. Otherwise, it's a non-EFA instance. + true +else + # No profile — use legacy EFA defaults (P4/P5) + export FI_PROVIDER=efa + export FI_EFA_USE_HUGE_PAGE=0 + export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +fi + export NCCL_DEBUG=INFO -export FI_PROVIDER=efa -export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. 
-## Switching SYNC_MEMOPS to zero can boost throughput with FSDP -## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS -## Reduces memory synchronizations -## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html -export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth,eth"} + # LD_PRELOAD is required for PyTorch to find the NCCL library # This path assumes you are using the Deep Learning AMI # If you are not using the DLAMI, you may need to update this path export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so -export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth ## Set HuggingFace metadata timeout (in seconds) for large clusters export HF_HUB_ETAG_TIMEOUT=60 @@ -105,9 +132,30 @@ declare -a TRAINING_ARGS=( --offload_activations=1 ) +# Conditionally append cpu_offload from instance profile. +# The profile sets FSDP_CPU_OFFLOAD=1 for memory-constrained instances (e.g., g5) +# and FSDP_CPU_OFFLOAD_MIN_LAYERS as a threshold so small models (1B, 3B) that +# fit without offloading aren't penalized. Offloading adds ~30% overhead. +if [[ "${FSDP_CPU_OFFLOAD:-0}" == "1" ]]; then + NUM_LAYERS="" + for arg in "${TRAINING_ARGS[@]}"; do + if [[ "$arg" == --num_layers=* ]]; then + NUM_LAYERS="${arg#*=}" + break + fi + done + MIN_LAYERS="${FSDP_CPU_OFFLOAD_MIN_LAYERS:-0}" + if [[ -n "$NUM_LAYERS" && "$NUM_LAYERS" -ge "$MIN_LAYERS" ]]; then + echo "Enabling --cpu_offload=1 (num_layers=$NUM_LAYERS >= threshold=$MIN_LAYERS)" + TRAINING_ARGS+=(--cpu_offload=1) + else + echo "Skipping cpu_offload (num_layers=${NUM_LAYERS:-unknown} < threshold=$MIN_LAYERS)" + fi +fi + AUTO_RESUME="" if [ -d "/opt/sagemaker_cluster" ]; then echo "Detected Hyperpod cluster.. 
enabling --auto-resume=1" AUTO_RESUME="--auto-resume=1" fi -srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" +srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" \ No newline at end of file diff --git a/3.test_cases/pytorch/FSDP/slurm/mathstral_7b-training.sbatch b/3.test_cases/pytorch/FSDP/slurm/mathstral_7b-training.sbatch index 7934abb86..53b45a434 100644 --- a/3.test_cases/pytorch/FSDP/slurm/mathstral_7b-training.sbatch +++ b/3.test_cases/pytorch/FSDP/slurm/mathstral_7b-training.sbatch @@ -12,15 +12,37 @@ set -ex; ########################### -###### User Variables ##### +###### Instance Profile ### ########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + # _detect.sh prints diagnostics to stderr and the profile path to stdout. + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi -GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5 +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} ############################### ###### Container Variable ##### ############################### -# Uncomment if you want to use a container instea of Virtual Environment. +# Uncomment if you want to use a container instead of Virtual Environment. 
#export CONTAINER_IMAGE=$(pwd)/pytorch-fsdp.sqsh export DATA_PATH=/fsx export FSX_MOUNT=$(pwd):$DATA_PATH @@ -29,22 +51,27 @@ export FSX_MOUNT=$(pwd):$DATA_PATH ## Environment Variables ## ########################### -## Plenty of EFA level variables -## For G4dn and other G5, comment out all -#export FI_LOG_LEVEL=warn +# EFA networking — configured by profile or defaults. +# Non-EFA instances (g5, g6e) have EFA_PER_NODE=0 in their profile, +# which skips these variables. EFA instances set FI_PROVIDER in the profile. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. + # If profile set FI_PROVIDER, EFA is enabled. Otherwise, it's a non-EFA instance. + true +else + # No profile — use legacy EFA defaults (P4/P5) + export FI_PROVIDER=efa + export FI_EFA_USE_HUGE_PAGE=0 + export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +fi + export NCCL_DEBUG=INFO -export FI_PROVIDER=efa -export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. -## Switching SYNC_MEMOPS to zero can boost throughput with FSDP -## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS -## Reduces memory synchronizations -## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html -export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth,eth"} + # LD_PRELOAD is required for PyTorch to find the NCCL library # This path assumes you are using the Deep Learning AMI # If you are not using the DLAMI, you may need to update this path export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so -export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth ## Set HuggingFace metadata timeout (in seconds) for large clusters export HF_HUB_ETAG_TIMEOUT=60 @@ -127,9 +154,30 @@ declare -a TRAINING_ARGS=( --offload_activations=1 ) +# Conditionally append cpu_offload from instance profile. 
+# The profile sets FSDP_CPU_OFFLOAD=1 for memory-constrained instances (e.g., g5) +# and FSDP_CPU_OFFLOAD_MIN_LAYERS as a threshold so small models (1B, 3B) that +# fit without offloading aren't penalized. Offloading adds ~30% overhead. +if [[ "${FSDP_CPU_OFFLOAD:-0}" == "1" ]]; then + NUM_LAYERS="" + for arg in "${TRAINING_ARGS[@]}"; do + if [[ "$arg" == --num_layers=* ]]; then + NUM_LAYERS="${arg#*=}" + break + fi + done + MIN_LAYERS="${FSDP_CPU_OFFLOAD_MIN_LAYERS:-0}" + if [[ -n "$NUM_LAYERS" && "$NUM_LAYERS" -ge "$MIN_LAYERS" ]]; then + echo "Enabling --cpu_offload=1 (num_layers=$NUM_LAYERS >= threshold=$MIN_LAYERS)" + TRAINING_ARGS+=(--cpu_offload=1) + else + echo "Skipping cpu_offload (num_layers=${NUM_LAYERS:-unknown} < threshold=$MIN_LAYERS)" + fi +fi + AUTO_RESUME="" if [ -d "/opt/sagemaker_cluster" ]; then echo "Detected Hyperpod cluster.. enabling --auto-resume=1" AUTO_RESUME="--auto-resume=1" fi -srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" +srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" \ No newline at end of file diff --git a/3.test_cases/pytorch/FSDP/slurm/mistral_8x7b-training.sbatch b/3.test_cases/pytorch/FSDP/slurm/mistral_8x7b-training.sbatch index cee9bb8f3..7f000a99a 100644 --- a/3.test_cases/pytorch/FSDP/slurm/mistral_8x7b-training.sbatch +++ b/3.test_cases/pytorch/FSDP/slurm/mistral_8x7b-training.sbatch @@ -12,15 +12,37 @@ set -ex; ########################### -###### User Variables ##### +###### Instance Profile ### ########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. 
+ +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + # _detect.sh prints diagnostics to stderr and the profile path to stdout. + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi -GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5 +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} ############################### ###### Container Variable ##### ############################### -# Uncomment if you want to use a container instea of Virtual Environment. +# Uncomment if you want to use a container instead of Virtual Environment. #export CONTAINER_IMAGE=$(pwd)/pytorch-fsdp.sqsh export DATA_PATH=/fsx export FSX_MOUNT=$(pwd):$DATA_PATH @@ -29,22 +51,27 @@ export FSX_MOUNT=$(pwd):$DATA_PATH ## Environment Variables ## ########################### -## Plenty of EFA level variables -## For G4dn and other G5, comment out all -#export FI_LOG_LEVEL=warn +# EFA networking — configured by profile or defaults. +# Non-EFA instances (g5, g6e) have EFA_PER_NODE=0 in their profile, +# which skips these variables. EFA instances set FI_PROVIDER in the profile. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. + # If profile set FI_PROVIDER, EFA is enabled. Otherwise, it's a non-EFA instance. 
+ true +else + # No profile — use legacy EFA defaults (P4/P5) + export FI_PROVIDER=efa + export FI_EFA_USE_HUGE_PAGE=0 + export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +fi + export NCCL_DEBUG=INFO -export FI_PROVIDER=efa -export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. -## Switching SYNC_MEMOPS to zero can boost throughput with FSDP -## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS -## Reduces memory synchronizations -## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html -export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth,eth"} + # LD_PRELOAD is required for PyTorch to find the NCCL library # This path assumes you are using the Deep Learning AMI # If you are not using the DLAMI, you may need to update this path export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so -export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth ## Set HuggingFace metadata timeout (in seconds) for large clusters export HF_HUB_ETAG_TIMEOUT=60 @@ -124,9 +151,30 @@ declare -a TRAINING_ARGS=( --offload_activations=1 ) +# Conditionally append cpu_offload from instance profile. +# The profile sets FSDP_CPU_OFFLOAD=1 for memory-constrained instances (e.g., g5) +# and FSDP_CPU_OFFLOAD_MIN_LAYERS as a threshold so small models (1B, 3B) that +# fit without offloading aren't penalized. Offloading adds ~30% overhead. 
+if [[ "${FSDP_CPU_OFFLOAD:-0}" == "1" ]]; then + NUM_LAYERS="" + for arg in "${TRAINING_ARGS[@]}"; do + if [[ "$arg" == --num_layers=* ]]; then + NUM_LAYERS="${arg#*=}" + break + fi + done + MIN_LAYERS="${FSDP_CPU_OFFLOAD_MIN_LAYERS:-0}" + if [[ -n "$NUM_LAYERS" && "$NUM_LAYERS" -ge "$MIN_LAYERS" ]]; then + echo "Enabling --cpu_offload=1 (num_layers=$NUM_LAYERS >= threshold=$MIN_LAYERS)" + TRAINING_ARGS+=(--cpu_offload=1) + else + echo "Skipping cpu_offload (num_layers=${NUM_LAYERS:-unknown} < threshold=$MIN_LAYERS)" + fi +fi + AUTO_RESUME="" if [ -d "/opt/sagemaker_cluster" ]; then echo "Detected Hyperpod cluster.. enabling --auto-resume=1" AUTO_RESUME="--auto-resume=1" fi -srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" +srun ${AUTO_RESUME} -l "${ARGS[@]}" ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" \ No newline at end of file diff --git a/3.test_cases/pytorch/FSDP/slurm/training-sub.template b/3.test_cases/pytorch/FSDP/slurm/training-sub.template index a9c59737c..60d5524a6 100644 --- a/3.test_cases/pytorch/FSDP/slurm/training-sub.template +++ b/3.test_cases/pytorch/FSDP/slurm/training-sub.template @@ -12,15 +12,37 @@ set -ex; ########################### -###### User Variables ##### +###### Instance Profile ### ########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + # _detect.sh prints diagnostics to stderr and the profile path to stdout. 
+ if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi -GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5 +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} ############################### ###### Container Variable ##### ############################### -# Uncomment if you want to use a container instea of Virtual Environment. +# Uncomment if you want to use a container instead of Virtual Environment. #export CONTAINER_IMAGE=$(pwd)/pytorch-fsdp.sqsh export DATA_PATH=/fsx export FSX_MOUNT=$(pwd):$DATA_PATH @@ -29,22 +51,27 @@ export FSX_MOUNT=$(pwd):$DATA_PATH ## Environment Variables ## ########################### -## Plenty of EFA level variables -## For G4dn and other G5, comment out all -#export FI_LOG_LEVEL=warn +# EFA networking — configured by profile or defaults. +# Non-EFA instances (g5, g6e) have EFA_PER_NODE=0 in their profile, +# which skips these variables. EFA instances set FI_PROVIDER in the profile. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. + # If profile set FI_PROVIDER, EFA is enabled. Otherwise, it's a non-EFA instance. + true +else + # No profile — use legacy EFA defaults (P4/P5) + export FI_PROVIDER=efa + export FI_EFA_USE_HUGE_PAGE=0 + export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +fi + export NCCL_DEBUG=INFO -export FI_PROVIDER=efa -export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. 
-## Switching SYNC_MEMOPS to zero can boost throughput with FSDP -## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS -## Reduces memory synchronizations -## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html -export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth,eth"} + # LD_PRELOAD is required for PyTorch to find the NCCL library # This path assumes you are using the Deep Learning AMI # If you are not using the DLAMI, you may need to update this path export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so -export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth ## Set HuggingFace metadata timeout (in seconds) for large clusters export HF_HUB_ETAG_TIMEOUT=60 @@ -87,6 +114,27 @@ declare -a TRAINING_ARGS=( {{ MODEL_PARAMETERS | indent(4) }} ) +# Conditionally append cpu_offload from instance profile. +# The profile sets FSDP_CPU_OFFLOAD=1 for memory-constrained instances (e.g., g5) +# and FSDP_CPU_OFFLOAD_MIN_LAYERS as a threshold so small models (1B, 3B) that +# fit without offloading aren't penalized. Offloading adds ~30% overhead. +if [[ "${FSDP_CPU_OFFLOAD:-0}" == "1" ]]; then + NUM_LAYERS="" + for arg in "${TRAINING_ARGS[@]}"; do + if [[ "$arg" == --num_layers=* ]]; then + NUM_LAYERS="${arg#*=}" + break + fi + done + MIN_LAYERS="${FSDP_CPU_OFFLOAD_MIN_LAYERS:-0}" + if [[ -n "$NUM_LAYERS" && "$NUM_LAYERS" -ge "$MIN_LAYERS" ]]; then + echo "Enabling --cpu_offload=1 (num_layers=$NUM_LAYERS >= threshold=$MIN_LAYERS)" + TRAINING_ARGS+=(--cpu_offload=1) + else + echo "Skipping cpu_offload (num_layers=${NUM_LAYERS:-unknown} < threshold=$MIN_LAYERS)" + fi +fi + AUTO_RESUME="" if [ -d "/opt/sagemaker_cluster" ]; then echo "Detected Hyperpod cluster.. 
enabling --auto-resume=1" diff --git a/3.test_cases/pytorch/ddp/README.md b/3.test_cases/pytorch/ddp/README.md index 35fec1f61..6948dd0da 100644 --- a/3.test_cases/pytorch/ddp/README.md +++ b/3.test_cases/pytorch/ddp/README.md @@ -9,6 +9,19 @@ This example showcases [PyTorch DDP](https://pytorch.org/tutorials/beginner/ddp_ - **CPU Training**: Uses the GLOO backend for distributed training on CPU nodes - **GPU Training**: Automatically switches to NCCL backend when GPUs are available, providing optimized multi-GPU training +## Tested Configurations + +| Instance | GPUs | Status | Notes | +|----------|------|--------|-------| +| p5en.48xlarge | 8 x H200 80 GB | Untested | Expected to work | +| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work | +| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work | +| g5.12xlarge | 4 x A10G 24 GB | Untested | Expected to work | + +> DDP requires the full model to fit in each GPU. See the +> [Instance Compatibility Guide](../../../docs/instance-compatibility.md) +> for guidance on which models fit on which instances. + ## Training ### Basic Usage diff --git a/3.test_cases/pytorch/deepspeed/README.md b/3.test_cases/pytorch/deepspeed/README.md index fd2ef7524..9eafa0389 100644 --- a/3.test_cases/pytorch/deepspeed/README.md +++ b/3.test_cases/pytorch/deepspeed/README.md @@ -2,6 +2,18 @@ [DeepSpeed](https://github.com/microsoft/DeepSpeed) enables world's most powerful language models like MT-530B and BLOOM. It is an easy-to-use deep learning optimization software suite that powers unprecedented scale and speed for both training and inference. `deepspeed` illustrates several example test cases for DeepSpeed training on AWS. 
+## Tested Configurations + +| Instance | GPUs | Status | Notes | +|----------|------|--------|-------| +| p5en.48xlarge | 8 x H200 80 GB | Untested | Expected to work | +| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work | +| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work | +| g5.12xlarge | 4 x A10G 24 GB | Untested | ZeRO-3 offloading may be needed for large models | + +> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md) +> for parameter adjustments needed across instance types. + ## 1. Preparation This guide assumes that you have the following: diff --git a/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/finetune_hf_llama/scripts/finetune_llama.sbatch b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/finetune_hf_llama/scripts/finetune_llama.sbatch index 3d0a4a54b..36292a2f9 100644 --- a/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/finetune_hf_llama/scripts/finetune_llama.sbatch +++ b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/finetune_hf_llama/scripts/finetune_llama.sbatch @@ -13,26 +13,55 @@ set -euxo pipefail : "${CONTAINER_MOUNT:=$FSX_PATH:$FSX_PATH}" : "${HF_LLAMA_PATH:=/fsx/deepspeed/Llama2-7b-hf}" +########################### +###### Instance Profile ### +########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: NUM_GPUS_PER_NODE, TP, PP, MICRO_BATCH_SIZE, +# GLOBAL_BATCH_SIZE, EFA/NCCL vars. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See ../profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. 
Using defaults (8 GPU, p4de-style)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, p4de-style)." +fi + export NODES=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) ) export NODES_ARRAY=($NODES) export HEAD_NODE=${NODES_ARRAY[0]} export MASTER_ADDR=$(hostname --ip-address) export MASTER_PORT=$((RANDOM + 10000)) export NNODES=$SLURM_JOB_NUM_NODES -export NUM_GPUS_PER_NODE=8 -## EFA settings -export FI_LOG_LEVEL=1 -export FI_PROVIDER=efa # change to eth if you want to use ENA for comparisons -export FI_EFA_USE_HUGE_PAGE=0 -# https://discuss.pytorch.org/t/nccl-network-is-unreachable-connection-refused-when-initializing-ddp/137352 -# https://github.com/pytorch/pytorch/issues/68893 -export NCCL_SOCKET_IFNAME=en -export NCCL_ASYNC_ERROR_HANDLING=1 -export OMPI_MCA_plm=^slurm -export MICRO_BATCH_SIZE=16 -export GLOBAL_BATCH_SIZE=256 -export TP=4 -export PP=2 + +# Fallback defaults when no profile is loaded (assumes P4de-class instance) +export NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-8} + +## EFA settings — configured by profile or defaults +if [[ "$PROFILE_LOADED" != "1" ]]; then + export FI_LOG_LEVEL=1 + export FI_PROVIDER=efa # change to eth if you want to use ENA for comparisons + export FI_EFA_USE_HUGE_PAGE=0 + export NCCL_SOCKET_IFNAME=en + export NCCL_ASYNC_ERROR_HANDLING=1 +fi + +export OMPI_MCA_plm=^slurm +export MICRO_BATCH_SIZE=${MICRO_BATCH_SIZE:-16} +export GLOBAL_BATCH_SIZE=${GLOBAL_BATCH_SIZE:-256} +export TP=${TP:-4} +export PP=${PP:-2} # require to align with weight dimensions export HIDDEN_SIZE=4096 export FFN_HIDDEN_SIZE=11008 diff --git a/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/README.md b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/README.md new file mode 100644 index 000000000..3f8d99ef4 --- /dev/null +++ b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/README.md @@ -0,0 +1,36 @@ +# Megatron-DeepSpeed Instance Profiles + +Instance profiles 
configure GPU count, parallelism (TP/PP), batch sizes, and +EFA/NCCL networking variables for each supported EC2 instance type. + +## Auto-detection + +The training script auto-detects the running instance type and sources the +matching `.env` profile. Override with: + +```bash +export INSTANCE_PROFILE=g5-12xlarge +``` + +See [docs/instance-compatibility.md](../../../../docs/instance-compatibility.md) +for full details. + +## Available Profiles + +| Profile | Instance | GPUs | VRAM | EFA | TP | PP | MBS | GBS | Status | +|---------|----------|------|------|-----|----|----|-----|-----|--------| +| `p5en-48xlarge.env` | p5en.48xlarge | 8x H200 | 141 GB | 32 | 4 | 2 | 16 | 256 | Supported | +| `p5-48xlarge.env` | p5.48xlarge | 8x H100 | 80 GB | 32 | 4 | 2 | 16 | 256 | Supported | +| `p4de-24xlarge.env` | p4de.24xlarge | 8x A100 | 80 GB | 4 | 4 | 2 | 16 | 256 | Supported (original) | +| `g6e-12xlarge.env` | g6e.12xlarge | 4x L40S | 48 GB | 0 | 2 | 2 | 8 | 128 | Experimental | +| `g5-12xlarge.env` | g5.12xlarge | 4x A10G | 24 GB | 0 | 2 | 2 | 4 | 64 | Experimental | + +## Notes + +- **TP * PP must equal NUM_GPUS_PER_NODE** for single-node training +- The weight conversion step (`convert`) also uses TP/PP to shard weights. + If you change TP/PP, re-run conversion: `bash 2.convert-weights-to-mega-ds.sh` +- The `ds_config.json` is auto-generated by the training script from + `GLOBAL_BATCH_SIZE` and `MICRO_BATCH_SIZE` profile values +- g5/g6e profiles may need ZeRO stage adjustment if OOM occurs (edit `--zero-stage` + in the sbatch script or profile) diff --git a/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/_detect.sh b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/_detect.sh new file mode 100755 index 000000000..896664395 --- /dev/null +++ b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/_detect.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
+# SPDX-License-Identifier: MIT-0
+#
+# _detect.sh — Auto-detect EC2 instance type and resolve a profile.
+#
+# CANONICAL SOURCE: This is the single source of truth for instance detection.
+# Copies of this script live in each test case's profiles/ directory as
+# _detect.sh. To update all copies, edit this file and run: ./sync_profiles.sh
+#
+# Usage (from a training script):
+#   PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles")
+#   source "$PROFILE_ENV"
+#
+# Detection order:
+#   1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge")
+#   2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge")
+#   3. EC2 instance metadata API (works on bare metal and K8s with host networking)
+#   4. GPU name from nvidia-smi (fallback when metadata is unavailable)
+#
+# Outputs the path to the profile .env file on stdout.
+# Exits non-zero if no profile can be resolved.
+# ---------------------------------------------------------------------------
+set -euo pipefail
+
+PROFILES_DIR="${1:-.}"
+
+# --- Step 1: Check for explicit INSTANCE_PROFILE override -------------------
+if [[ -n "${INSTANCE_PROFILE:-}" ]]; then
+    PROFILE_NAME="$INSTANCE_PROFILE"
+    echo "Instance profile override: ${PROFILE_NAME}" >&2
+else
+    # --- Step 2: Try INSTANCE_TYPE from env_vars ----------------------------
+    INSTANCE_TYPE="${INSTANCE_TYPE:-}"
+
+    # --- Step 3: Try EC2 instance metadata API ------------------------------
+    if [[ -z "$INSTANCE_TYPE" ]]; then
+        # IMDSv2: get a token first, then query
+        TOKEN=$(curl -s --connect-timeout 2 -X PUT \
+            "http://169.254.169.254/latest/api/token" \
+            -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true
+        if [[ -n "$TOKEN" ]]; then
+            INSTANCE_TYPE=$(curl -s --connect-timeout 2 \
+                -H "X-aws-ec2-metadata-token: $TOKEN" \
+                "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true
+        fi
+    fi
+
+    # --- Step 4: Fallback — detect from GPU name ----------------------------
+    if [[ -z "$INSTANCE_TYPE" ]]; then
+        if command
-v nvidia-smi &>/dev/null; then + GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) || true + case "${GPU_NAME:-}" in + *A10G*) INSTANCE_TYPE="g5.12xlarge" ;; + *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;; + *A100*40*) INSTANCE_TYPE="p4d.24xlarge" ;; + *H100*) INSTANCE_TYPE="p5.48xlarge" ;; + *H200*) INSTANCE_TYPE="p5en.48xlarge" ;; + *L40S*) INSTANCE_TYPE="g6e.12xlarge" ;; + *L4*) INSTANCE_TYPE="g6.12xlarge" ;; + *) + echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + ;; + esac + echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2 + else + echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + fi + fi + + # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge" + PROFILE_NAME="${INSTANCE_TYPE//./-}" + echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2 +fi + +# --- Resolve profile file path ---------------------------------------------- +PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env" + +if [[ ! 
-f "$PROFILE_PATH" ]]; then + echo "ERROR: No profile found at ${PROFILE_PATH}" >&2 + echo "" >&2 + echo "Available profiles:" >&2 + ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\// /' >&2 || echo " (none)" >&2 + echo "" >&2 + echo "To create a new profile, copy an existing one and adjust the values:" >&2 + echo " cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2 + exit 1 +fi + +# Output the resolved path (recipe script will source it) +echo "$PROFILE_PATH" diff --git a/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/g5-12xlarge.env b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/g5-12xlarge.env new file mode 100644 index 000000000..bdb00cc21 --- /dev/null +++ b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/g5-12xlarge.env @@ -0,0 +1,27 @@ +# g5.12xlarge — 4x A10G 24GB, no EFA, no NVLink, no GPUDirect RDMA +# +# MODEL ASSUMPTIONS (Llama2-7B finetuning): +# TP=2, PP=2 → 4 GPUs (TP*PP must equal NUM_GPUS_PER_NODE for single-node) +# MBS reduced to 4 for 24GB VRAM (from 16 on A100 80GB) +# GBS reduced proportionally; adjust for your data parallel degree +# May need ZeRO stage 1 or 2 if MBS=4 still OOMs; update ds_config.json +# +# Key differences from p4de/p5/p5en: +# - 4 GPUs → TP*PP must be ≤ 4 +# - 24GB VRAM → reduce MBS significantly +# - No EFA → remove all FI_* env vars +# - A10G supports bf16 via Ampere tensor cores + +# --- Hardware --- +export NUM_GPUS_PER_NODE=4 + +# --- Parallelism --- +export TP=2 +export PP=2 +export MICRO_BATCH_SIZE=4 +export GLOBAL_BATCH_SIZE=64 + +# --- EFA / NCCL --- +# No EFA on g5 +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" +export NCCL_ASYNC_ERROR_HANDLING=1 diff --git a/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/g6e-12xlarge.env b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/g6e-12xlarge.env new file mode 100644 index 000000000..ab8029120 --- /dev/null +++ 
b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/g6e-12xlarge.env @@ -0,0 +1,20 @@ +# g6e.12xlarge — 4x L40S 48GB, no EFA, no NVLink, no GPUDirect RDMA +# +# MODEL ASSUMPTIONS (Llama2-7B finetuning): +# TP=2, PP=2 → 4 GPUs +# MBS=8 fits in 48GB VRAM (half of the 80GB A100 value) +# ZeRO stage 0 should work for Llama2-7B on 48GB GPUs + +# --- Hardware --- +export NUM_GPUS_PER_NODE=4 + +# --- Parallelism --- +export TP=2 +export PP=2 +export MICRO_BATCH_SIZE=8 +export GLOBAL_BATCH_SIZE=128 + +# --- EFA / NCCL --- +# No EFA on g6e +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" +export NCCL_ASYNC_ERROR_HANDLING=1 diff --git a/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/p4de-24xlarge.env b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/p4de-24xlarge.env new file mode 100644 index 000000000..9a63657c6 --- /dev/null +++ b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/p4de-24xlarge.env @@ -0,0 +1,18 @@ +# p4de.24xlarge — 8x A100 80GB, 4 EFA, NVLink, GPUDirect RDMA +# This is the original target instance for this test case. 
+ +# --- Hardware --- +export NUM_GPUS_PER_NODE=8 + +# --- Parallelism --- +export TP=4 +export PP=2 +export MICRO_BATCH_SIZE=16 +export GLOBAL_BATCH_SIZE=256 + +# --- EFA / NCCL --- +export FI_LOG_LEVEL=1 +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export NCCL_SOCKET_IFNAME=en +export NCCL_ASYNC_ERROR_HANDLING=1 diff --git a/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/p5-48xlarge.env b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/p5-48xlarge.env new file mode 100644 index 000000000..ea7219929 --- /dev/null +++ b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/p5-48xlarge.env @@ -0,0 +1,17 @@ +# p5.48xlarge — 8x H100 80GB, 32 EFA, NVLink, GPUDirect RDMA + +# --- Hardware --- +export NUM_GPUS_PER_NODE=8 + +# --- Parallelism --- +export TP=4 +export PP=2 +export MICRO_BATCH_SIZE=16 +export GLOBAL_BATCH_SIZE=256 + +# --- EFA / NCCL --- +export FI_LOG_LEVEL=1 +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export NCCL_SOCKET_IFNAME="^docker,lo,veth" +export NCCL_ASYNC_ERROR_HANDLING=1 diff --git a/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/p5en-48xlarge.env b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/p5en-48xlarge.env new file mode 100644 index 000000000..3ec591d64 --- /dev/null +++ b/3.test_cases/pytorch/deepspeed/examples_megatron_deepspeed/profiles/p5en-48xlarge.env @@ -0,0 +1,22 @@ +# p5en.48xlarge — 8x H200 141GB, 32 EFA, NVLink, GPUDirect RDMA +# +# MODEL ASSUMPTIONS (Llama2-7B finetuning): +# TP=4, PP=2 → 8 GPUs fully utilized +# MBS=16 fits comfortably in 141GB VRAM +# ZeRO stage 0 (no optimizer sharding needed) + +# --- Hardware --- +export NUM_GPUS_PER_NODE=8 + +# --- Parallelism --- +export TP=4 +export PP=2 +export MICRO_BATCH_SIZE=16 +export GLOBAL_BATCH_SIZE=256 + +# --- EFA / NCCL --- +export FI_LOG_LEVEL=1 +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export NCCL_SOCKET_IFNAME="^docker,lo,veth" +export 
NCCL_ASYNC_ERROR_HANDLING=1 diff --git a/3.test_cases/pytorch/distillation/README.md b/3.test_cases/pytorch/distillation/README.md index b5b149ced..2d003f7da 100644 --- a/3.test_cases/pytorch/distillation/README.md +++ b/3.test_cases/pytorch/distillation/README.md @@ -14,6 +14,20 @@ This walkthrough demonstrates how to set up and run large language model (LLM) k └── setup.sh # Environment setup script for PyTorch and dependencies ``` +## Tested Configurations + +| Instance | GPUs | Status | Notes | +|----------|------|--------|-------| +| p5en.48xlarge | 8 x H200 141 GB | Tested | Primary target | +| p5.48xlarge | 8 x H100 80 GB | Tested | | +| p4d.24xlarge | 8 x A100 40 GB | Tested | | +| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work | +| g5.12xlarge | 4 x A10G 24 GB | Untested | May need smaller teacher/student models | +| g6e.12xlarge | 4 x L40S 48 GB | Untested | | + +> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md) +> for parameter adjustments needed across instance types. + ## Getting Started **Compatible Instance Types**: The architectures and test cases in this repository are designed to work with GPU instances including P4d, P5, and P5en instance types for optimal distributed training performance.
diff --git a/3.test_cases/pytorch/mosaicml-composer/mpt/README.md b/3.test_cases/pytorch/mosaicml-composer/mpt/README.md index ae4daa09e..88c959076 100644 --- a/3.test_cases/pytorch/mosaicml-composer/mpt/README.md +++ b/3.test_cases/pytorch/mosaicml-composer/mpt/README.md @@ -1,5 +1,17 @@ # Mosaic Pretrained Transformers (MPT) Test Case +## Tested Configurations + +| Instance | GPUs | Status | Notes | +|----------|------|--------|-------| +| p5en.48xlarge | 8 x H200 80 GB | Untested | Expected to work | +| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work | +| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work | +| g5.12xlarge | 4 x A10G 24 GB | Untested | May need reduced model size | + +> See the [Instance Compatibility Guide](../../../../docs/instance-compatibility.md) +> for parameter adjustments needed across instance types. + MPT are GPT-style models in [llm-foundry](https://github.com/mosaicml/llm-foundry/tree/main) with some special features -- [Flash Attention](https://arxiv.org/abs/2205.14135) for efficiency, [ALiBi](https://arxiv.org/abs/2108.12409) for context length extrapolation, and stability improvements to mitigate loss spikes. 
This project contains: diff --git a/3.test_cases/pytorch/mosaicml-composer/stable-diffusion/README.md b/3.test_cases/pytorch/mosaicml-composer/stable-diffusion/README.md index b2c9cf1ae..363dd1ce7 100644 --- a/3.test_cases/pytorch/mosaicml-composer/stable-diffusion/README.md +++ b/3.test_cases/pytorch/mosaicml-composer/stable-diffusion/README.md @@ -1,5 +1,17 @@ # Stable Diffusion Test Case +## Tested Configurations + +| Instance | GPUs | Status | Notes | +|----------|------|--------|-------| +| p5.48xlarge | 8 x H100 80 GB | Tested | Scaling benchmarks available below | +| p4de.24xlarge | 8 x A100 80 GB | Tested | Performance comparison available below | +| p5en.48xlarge | 8 x H200 80 GB | Untested | Expected to work | +| g5.12xlarge | 4 x A10G 24 GB | Untested | May need reduced batch size | + +> See the [Instance Compatibility Guide](../../../../docs/instance-compatibility.md) +> for parameter adjustments needed across instance types. + We will follow MosaicML's stable diffusion benchmarking scripts provided [here](https://github.com/mosaicml/diffusion-benchmark/tree/main). It uses the `'stabilityai/stable-diffusion-2-base'` model. You can check the number of parameters by executing: ```bash diff --git a/3.test_cases/pytorch/nanoVLM/README.md b/3.test_cases/pytorch/nanoVLM/README.md index 46d9a7a0a..d3d92dce9 100644 --- a/3.test_cases/pytorch/nanoVLM/README.md +++ b/3.test_cases/pytorch/nanoVLM/README.md @@ -3,6 +3,19 @@ This test case demonstrates distributed training of [NanoVLM](https://github.com/huggingface/nanoVLM/), a repository for training/finetuning a small sized Vision-Language Model with a lightweight implementation in pure PyTorch. 
+## Tested Configurations + +| Instance | GPUs | Status | Notes | +|----------|------|--------|-------| +| g5.12xlarge | 4 x A10G 24 GB | Tested | Requires config changes (see step 7 optional section) | +| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work with default config | +| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work with default config | +| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work with default config | + +> **g5 users**: See step 7 below for required configuration changes to avoid OOM. +> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md) +> for general guidance on running across instance types. + ## 1. Prerequisites This guide assumes that you have the following: diff --git a/3.test_cases/pytorch/nanoVLM/profiles/README.md b/3.test_cases/pytorch/nanoVLM/profiles/README.md new file mode 100644 index 000000000..4c795fb61 --- /dev/null +++ b/3.test_cases/pytorch/nanoVLM/profiles/README.md @@ -0,0 +1,29 @@ +# nanoVLM Instance Profiles + +Instance profiles configure GPU count and EFA networking variables for each +supported EC2 instance type. Training-specific parameters (model config, batch +size, etc.) are handled by the nanoVLM training scripts directly. + +## Auto-detection + +The launch script auto-detects the running instance type via the EC2 instance +metadata service and sources the matching `.env` profile. Detection order +follows the same logic used by the FSDP and veRL profiles. + +To override auto-detection: + +```bash +export INSTANCE_PROFILE=g5-12xlarge +``` + +See [docs/instance-compatibility.md](../../../../docs/instance-compatibility.md) +for full details on the detection mechanism and supported instances.
+ +## Available Profiles + +| Profile | Instance | GPUs | VRAM | EFA | Status | +|---------|----------|------|------|-----|--------| +| `p5en-48xlarge.env` | p5en.48xlarge | 8x H200 | 141 GB | 32 adapters | Supported | +| `p4de-24xlarge.env` | p4de.24xlarge | 8x A100 | 80 GB | 4 adapters | Supported | +| `g6e-12xlarge.env` | g6e.12xlarge | 4x L40S | 48 GB | None | Supported | +| `g5-12xlarge.env` | g5.12xlarge | 4x A10G | 24 GB | None | Supported | diff --git a/3.test_cases/pytorch/nanoVLM/profiles/_detect.sh b/3.test_cases/pytorch/nanoVLM/profiles/_detect.sh new file mode 100755 index 000000000..896664395 --- /dev/null +++ b/3.test_cases/pytorch/nanoVLM/profiles/_detect.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# instance_detect.sh — Auto-detect EC2 instance type and resolve a profile. +# +# CANONICAL SOURCE: This is the single source of truth for instance detection. +# Copies exist in each test case's profiles/_detect.sh directory. To update +# all copies, edit this file and run: ./sync_profiles.sh +# +# Usage (from a training script): +# PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles") +# source "$PROFILE_ENV" +# +# Detection order: +# 1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge") +# 2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge") +# 3. EC2 instance metadata API (works on bare metal and K8s with host networking) +# 4. GPU name from nvidia-smi (fallback when metadata is unavailable) +# +# Outputs the path to the profile .env file on stdout. +# Exits non-zero if no profile can be resolved. 
+# --------------------------------------------------------------------------- +set -euo pipefail + +PROFILES_DIR="${1:-.}" + +# --- Step 1: Check for explicit INSTANCE_PROFILE override ------------------- +if [[ -n "${INSTANCE_PROFILE:-}" ]]; then + PROFILE_NAME="$INSTANCE_PROFILE" + echo "Instance profile override: ${PROFILE_NAME}" >&2 +else + # --- Step 2: Try INSTANCE_TYPE from env_vars ---------------------------- + INSTANCE_TYPE="${INSTANCE_TYPE:-}" + + # --- Step 3: Try EC2 instance metadata API ------------------------------ + if [[ -z "$INSTANCE_TYPE" ]]; then + # IMDSv2: get a token first, then query + TOKEN=$(curl -s --connect-timeout 2 -X PUT \ + "http://169.254.169.254/latest/api/token" \ + -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true + if [[ -n "$TOKEN" ]]; then + INSTANCE_TYPE=$(curl -s --connect-timeout 2 \ + -H "X-aws-ec2-metadata-token: $TOKEN" \ + "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true + fi + fi + + # --- Step 4: Fallback — detect from GPU name ---------------------------- + if [[ -z "$INSTANCE_TYPE" ]]; then + if command -v nvidia-smi &>/dev/null; then + GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) || true + case "${GPU_NAME:-}" in + *A10G*) INSTANCE_TYPE="g5.12xlarge" ;; + *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;; + *A100*40*) INSTANCE_TYPE="p4d.24xlarge" ;; + *H100*) INSTANCE_TYPE="p5.48xlarge" ;; + *H200*) INSTANCE_TYPE="p5en.48xlarge" ;; + *L40S*) INSTANCE_TYPE="g6e.12xlarge" ;; + *L4*) INSTANCE_TYPE="g6.12xlarge" ;; + *) + echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + ;; + esac + echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2 + else + echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." 
>&2 + exit 1 + fi + fi + + # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge" + PROFILE_NAME="${INSTANCE_TYPE//./-}" + echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2 +fi + +# --- Resolve profile file path ---------------------------------------------- +PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env" + +if [[ ! -f "$PROFILE_PATH" ]]; then + echo "ERROR: No profile found at ${PROFILE_PATH}" >&2 + echo "" >&2 + echo "Available profiles:" >&2 + ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\// /' >&2 || echo " (none)" >&2 + echo "" >&2 + echo "To create a new profile, copy an existing one and adjust the values:" >&2 + echo " cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2 + exit 1 +fi + +# Output the resolved path (recipe script will source it) +echo "$PROFILE_PATH" diff --git a/3.test_cases/pytorch/nanoVLM/profiles/g5-12xlarge.env b/3.test_cases/pytorch/nanoVLM/profiles/g5-12xlarge.env new file mode 100644 index 000000000..d3fbc2e36 --- /dev/null +++ b/3.test_cases/pytorch/nanoVLM/profiles/g5-12xlarge.env @@ -0,0 +1,5 @@ +# g5.12xlarge — 4x A10G 24GB, no EFA, no NVLink, no GPUDirect RDMA +# nanoVLM's small model should fit on A10G GPUs. 
+export GPUS_PER_NODE=4 +# No EFA on g5 — FI_PROVIDER intentionally unset +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" diff --git a/3.test_cases/pytorch/nanoVLM/profiles/g6e-12xlarge.env b/3.test_cases/pytorch/nanoVLM/profiles/g6e-12xlarge.env new file mode 100644 index 000000000..69fcdbc4e --- /dev/null +++ b/3.test_cases/pytorch/nanoVLM/profiles/g6e-12xlarge.env @@ -0,0 +1,4 @@ +# g6e.12xlarge — 4x L40S 48GB, no EFA, no NVLink, no GPUDirect RDMA +export GPUS_PER_NODE=4 +# No EFA on g6e — FI_PROVIDER intentionally unset +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" diff --git a/3.test_cases/pytorch/nanoVLM/profiles/p4de-24xlarge.env b/3.test_cases/pytorch/nanoVLM/profiles/p4de-24xlarge.env new file mode 100644 index 000000000..c95ceca84 --- /dev/null +++ b/3.test_cases/pytorch/nanoVLM/profiles/p4de-24xlarge.env @@ -0,0 +1,6 @@ +# p4de.24xlarge — 8x A100 80GB, 4 EFA, NVLink, GPUDirect RDMA +export GPUS_PER_NODE=8 +export FI_PROVIDER=efa +export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export FI_EFA_USE_DEVICE_RDMA=1 +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" diff --git a/3.test_cases/pytorch/nanoVLM/profiles/p5en-48xlarge.env b/3.test_cases/pytorch/nanoVLM/profiles/p5en-48xlarge.env new file mode 100644 index 000000000..9557a03d6 --- /dev/null +++ b/3.test_cases/pytorch/nanoVLM/profiles/p5en-48xlarge.env @@ -0,0 +1,5 @@ +# p5en.48xlarge — 8x H200 141GB, 32 EFA, NVLink, GPUDirect RDMA +export GPUS_PER_NODE=8 +export FI_PROVIDER=efa +export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth" diff --git a/3.test_cases/pytorch/nanoVLM/slurm/launch_training.sbatch b/3.test_cases/pytorch/nanoVLM/slurm/launch_training.sbatch index ae09ee7d1..8d5c33508 100644 --- a/3.test_cases/pytorch/nanoVLM/slurm/launch_training.sbatch +++ b/3.test_cases/pytorch/nanoVLM/slurm/launch_training.sbatch @@ -6,7 +6,34 @@ #SBATCH --nodes=4 #SBATCH --partition=p5en -GPUS_PER_NODE=8 #set to 1 for g5.8xlarge +set -ex; + +########################### +###### Instance 
Profile ### +########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, EFA vars, NCCL settings. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi + +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} cd .. @@ -14,20 +41,27 @@ export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh export FSX_MOUNT=$(pwd):$(pwd) +########################### +## Environment Variables ## +########################### + +# EFA networking — configured by profile or defaults. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. + true +else + # No profile — use legacy EFA defaults (P5) + export FI_PROVIDER=efa + export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +fi export NCCL_DEBUG=INFO -export FI_PROVIDER=efa -#export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. 
-## Switching SYNC_MEMOPS to zero can boost throughput with FSDP -## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS -## Reduces memory synchronizations -## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html -export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth,eth"} + # LD_PRELOAD is required for PyTorch to find the NCCL library # This path assumes you are using the Deep Learning AMI # If you are not using the DLAMI, you may need to update this path export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so -export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth declare -a ARGS=( --container-image $CONTAINER_IMAGE @@ -54,4 +88,4 @@ if [ -d "/opt/sagemaker_cluster" ]; then AUTO_RESUME="--auto-resume=1" fi -srun ${AUTO_RESUME} -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" \ No newline at end of file +srun ${AUTO_RESUME} -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" diff --git a/3.test_cases/pytorch/neuronx-distributed/llama3/kubernetes/README.md b/3.test_cases/pytorch/neuronx-distributed/llama3/kubernetes/README.md index a777cbc80..9e4c30104 100644 --- a/3.test_cases/pytorch/neuronx-distributed/llama3/kubernetes/README.md +++ b/3.test_cases/pytorch/neuronx-distributed/llama3/kubernetes/README.md @@ -1,5 +1,17 @@ ## Train Llama 3 8B model on Kubernetes +### Tested Configurations + +| Instance | NeuronCores | Status | Notes | +|----------|-------------|--------|-------| +| trn1.32xlarge | 32 | Tested | | +| trn1n.32xlarge | 32 | Tested | | +| trn2.48xlarge | 64 | Untested | Expected to work | + +> See the [Instance Compatibility Guide](../../../../../docs/instance-compatibility.md) +> and [Trainium instance profile](../../../../../docs/instance-profiles/trn1.md) +> for details on Trainium hardware. 
+ In this section, we showcase how to pre-train Llama3-8B, Llama3 8B model using Trn1.32xlarge/Trn1n.32xlarge instances using the Neuron Distributed library. To train the LLama model in this example, we will apply the following optimizations using the Neuron Distributed library: 1. [Tensor Parallelism](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tensor_parallelism_overview.html#tensor-parallelism-overview) diff --git a/3.test_cases/pytorch/neuronx-distributed/llama3/slurm/README.md b/3.test_cases/pytorch/neuronx-distributed/llama3/slurm/README.md index bd650f120..f22b9f0b4 100644 --- a/3.test_cases/pytorch/neuronx-distributed/llama3/slurm/README.md +++ b/3.test_cases/pytorch/neuronx-distributed/llama3/slurm/README.md @@ -1,5 +1,17 @@ # How to run continual pretraining of Llama3 using Amazon Trainium on Slurm +## Tested Configurations + +| Instance | NeuronCores | Status | Notes | +|----------|-------------|--------|-------| +| trn1.32xlarge | 32 | Tested | 16 nodes used in example | +| trn1n.32xlarge | 32 | Untested | Expected to work (more EFA adapters) | +| trn2.48xlarge | 64 | Untested | Expected to work | + +> See the [Instance Compatibility Guide](../../../../../docs/instance-compatibility.md) +> and [Trainium instance profile](../../../../../docs/instance-profiles/trn1.md) +> for details on Trainium hardware. 
+ ## Prerequisites diff --git a/3.test_cases/pytorch/optimum-neuron/llama3/kubernetes/fine-tuning/README.md b/3.test_cases/pytorch/optimum-neuron/llama3/kubernetes/fine-tuning/README.md index 7d79af877..182be8951 100644 --- a/3.test_cases/pytorch/optimum-neuron/llama3/kubernetes/fine-tuning/README.md +++ b/3.test_cases/pytorch/optimum-neuron/llama3/kubernetes/fine-tuning/README.md @@ -1,5 +1,17 @@ ## PEFT Fine Tuning of Llama 3 on Amazon EKS with AWS Trainium +### Tested Configurations + +| Instance | NeuronCores | TP Degree | Status | Notes | +|----------|-------------|-----------|--------|-------| +| trn1.32xlarge | 32 | 8 | Tested | | +| trn1n.32xlarge | 32 | 8 | Tested | 16 EFA adapters | +| trn2.48xlarge | 64 | 4 | Tested | | + +> See the [Instance Compatibility Guide](../../../../../../docs/instance-compatibility.md) +> and [Trainium instance profile](../../../../../../docs/instance-profiles/trn1.md) +> for details on Trainium hardware. + This example demonstrates how to perform supervised fine tuning for Meta Llama 3.1 using Parameter-Efficient Fine Tuning (PEFT) on AWS Trainium with EKS. It uses [Hugging Face Optimum Neuron](https://huggingface.co/docs/optimum-neuron) to apply Low-Rank Adaptation (LoRA) for distributed training on Trainium. **Supported instances:** trn1.32xlarge, trn1n.32xlarge, trn2.48xlarge. Set the `INSTANCE_TYPE` variable in `generate-jobspec.sh` to match your cluster's instance type. Parallelism settings (tensor parallel degree, NeuronCore count) are derived automatically. 
diff --git a/3.test_cases/pytorch/optimum-neuron/llama3/slurm/fine-tuning/README.md b/3.test_cases/pytorch/optimum-neuron/llama3/slurm/fine-tuning/README.md index 3ee375ee1..788100e31 100644 --- a/3.test_cases/pytorch/optimum-neuron/llama3/slurm/fine-tuning/README.md +++ b/3.test_cases/pytorch/optimum-neuron/llama3/slurm/fine-tuning/README.md @@ -1,5 +1,18 @@ ## PEFT Fine Tuning of Llama 3 on Slurm Cluster (trn1/trn2) +### Tested Configurations + +| Instance | NeuronCores | TP Degree | Status | Notes | +|----------|-------------|-----------|--------|-------| +| trn1.32xlarge | 32 | 8 | Tested | 4 DP workers | +| trn1n.32xlarge | 32 | 8 | Tested | 4 DP workers; 16 EFA adapters | +| trn2.48xlarge | 64 | 4 | Tested | 16 DP workers | +| trn2.3xlarge | 4 | 4 | Tested | 1 DP worker | + +> See the [Instance Compatibility Guide](../../../../../../docs/instance-compatibility.md) +> and [Trainium instance profile](../../../../../../docs/instance-profiles/trn1.md) +> for details on Trainium hardware. + This example showcases how to fine tune Llama 3 models using AWS Trainium instances and [Hugging Face Optimum Neuron](https://huggingface.co/docs/optimum-neuron). Optimum Neuron is the interface between the Transformers library and AWS Accelerators including AWS Trainium and AWS Inferentia. It provides tools for model loading, training, and inference on single- and multi-accelerator settings. **Supported instances:** trn1.32xlarge, trn1n.32xlarge, trn2.48xlarge, trn2.3xlarge. The training script auto-detects the instance type and sets tensor parallelism accordingly. diff --git a/3.test_cases/pytorch/picotron/README.md b/3.test_cases/pytorch/picotron/README.md index 49d56e107..79b420caf 100644 --- a/3.test_cases/pytorch/picotron/README.md +++ b/3.test_cases/pytorch/picotron/README.md @@ -3,6 +3,18 @@ This test case demonstrates distributed training of [Picotron](https://github.com/huggingface/picotron), a distributed training framework for education and research experimentation. 
Picotron is designed for training speed, scalability, and memory efficiency. +## Tested Configurations + +| Instance | GPUs | Status | Notes | +|----------|------|--------|-------| +| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work | +| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work | +| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work | +| g5.12xlarge | 4 x A10G 24 GB | Untested | May need smaller model sizes | + +> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md) +> for parameter adjustments needed across instance types. + ## Build Environment The provided Dockerfile (`picotron.Dockerfile`) will set up the environment with all required dependencies: diff --git a/3.test_cases/pytorch/torchtitan/README.md b/3.test_cases/pytorch/torchtitan/README.md index 3eba72480..7a5057ac2 100644 --- a/3.test_cases/pytorch/torchtitan/README.md +++ b/3.test_cases/pytorch/torchtitan/README.md @@ -2,6 +2,18 @@ [torchtitan](https://github.com/pytorch/torchtitan) is a reference architecture for large-scale LLM training using native PyTorch. It aims to showcase PyTorch's latest distributed training features in a clean, minimal code base. The library is designed to be simple to understand, use, and extend for different training purposes, with minimal changes required to the model code when applying various parallel processing techniques. +## Tested Configurations + +| Instance | GPUs | Status | Notes | +|----------|------|--------|-------| +| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work | +| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work | +| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work | +| g5.12xlarge | 4 x A10G 24 GB | Untested | FSDP2 with offloading may be needed | + +> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md) +> for parameter adjustments needed across instance types.
+ ## Key Features torchtitan offers several advanced capabilities: diff --git a/3.test_cases/pytorch/torchtitan/profiles/README.md b/3.test_cases/pytorch/torchtitan/profiles/README.md new file mode 100644 index 000000000..faa5c759f --- /dev/null +++ b/3.test_cases/pytorch/torchtitan/profiles/README.md @@ -0,0 +1,29 @@ +# torchtitan Instance Profiles + +Instance profiles configure GPU count and EFA networking variables for each +supported EC2 instance type. Training-specific parameters (model config, batch +size, etc.) are handled by torchtitan's TOML configuration files. + +## Auto-detection + +The launch script auto-detects the running instance type via the EC2 instance +metadata service and sources the matching `.env` profile. Detection order +follows the same logic used by the FSDP and veRL profiles. + +To override auto-detection: + +```bash +export INSTANCE_PROFILE=g5-12xlarge +``` + +See [docs/instance-compatibility.md](../../../../docs/instance-compatibility.md) +for full details on the detection mechanism and supported instances. + +## Available Profiles + +| Profile | Instance | GPUs | VRAM | EFA | Status | +|---------|----------|------|------|-----|--------| +| `p5en-48xlarge.env` | p5en.48xlarge | 8x H200 | 141 GB | 32 adapters | Supported | +| `p4de-24xlarge.env` | p4de.24xlarge | 8x A100 | 80 GB | 4 adapters | Supported | +| `g6e-12xlarge.env` | g6e.12xlarge | 4x L40S | 48 GB | None | Supported | +| `g5-12xlarge.env` | g5.12xlarge | 4x A10G | 24 GB | None | Supported (small configs) | diff --git a/3.test_cases/pytorch/torchtitan/profiles/_detect.sh b/3.test_cases/pytorch/torchtitan/profiles/_detect.sh new file mode 100755 index 000000000..896664395 --- /dev/null +++ b/3.test_cases/pytorch/torchtitan/profiles/_detect.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# instance_detect.sh — Auto-detect EC2 instance type and resolve a profile. 
+# +# CANONICAL SOURCE: This is the single source of truth for instance detection. +# Copies exist in each test case's profiles/_detect.sh directory. To update +# all copies, edit this file and run: ./sync_profiles.sh +# +# Usage (from a training script): +# PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles") +# source "$PROFILE_ENV" +# +# Detection order: +# 1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge") +# 2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge") +# 3. EC2 instance metadata API (works on bare metal and K8s with host networking) +# 4. GPU name from nvidia-smi (fallback when metadata is unavailable) +# +# Outputs the path to the profile .env file on stdout. +# Exits non-zero if no profile can be resolved. +# --------------------------------------------------------------------------- +set -euo pipefail + +PROFILES_DIR="${1:-.}" + +# --- Step 1: Check for explicit INSTANCE_PROFILE override ------------------- +if [[ -n "${INSTANCE_PROFILE:-}" ]]; then + PROFILE_NAME="$INSTANCE_PROFILE" + echo "Instance profile override: ${PROFILE_NAME}" >&2 +else + # --- Step 2: Try INSTANCE_TYPE from env_vars ---------------------------- + INSTANCE_TYPE="${INSTANCE_TYPE:-}" + + # --- Step 3: Try EC2 instance metadata API ------------------------------ + if [[ -z "$INSTANCE_TYPE" ]]; then + # IMDSv2: get a token first, then query + TOKEN=$(curl -s --connect-timeout 2 -X PUT \ + "http://169.254.169.254/latest/api/token" \ + -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true + if [[ -n "$TOKEN" ]]; then + INSTANCE_TYPE=$(curl -s --connect-timeout 2 \ + -H "X-aws-ec2-metadata-token: $TOKEN" \ + "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true + fi + fi + + # --- Step 4: Fallback — detect from GPU name ---------------------------- + if [[ -z "$INSTANCE_TYPE" ]]; then + if command -v nvidia-smi &>/dev/null; then + GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head 
-1) || true + case "${GPU_NAME:-}" in + *A10G*) INSTANCE_TYPE="g5.12xlarge" ;; + *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;; + *A100*40*) INSTANCE_TYPE="p4d.24xlarge" ;; + *H100*) INSTANCE_TYPE="p5.48xlarge" ;; + *H200*) INSTANCE_TYPE="p5en.48xlarge" ;; + *L40S*) INSTANCE_TYPE="g6e.12xlarge" ;; + *L4*) INSTANCE_TYPE="g6.12xlarge" ;; + *) + echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + ;; + esac + echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2 + else + echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + fi + fi + + # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge" + PROFILE_NAME="${INSTANCE_TYPE//./-}" + echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2 +fi + +# --- Resolve profile file path ---------------------------------------------- +PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env" + +if [[ ! -f "$PROFILE_PATH" ]]; then + echo "ERROR: No profile found at ${PROFILE_PATH}" >&2 + echo "" >&2 + echo "Available profiles:" >&2 + ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\// /' >&2 || echo " (none)" >&2 + echo "" >&2 + echo "To create a new profile, copy an existing one and adjust the values:" >&2 + echo " cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2 + exit 1 +fi + +# Output the resolved path (recipe script will source it) +echo "$PROFILE_PATH" diff --git a/3.test_cases/pytorch/torchtitan/profiles/g5-12xlarge.env b/3.test_cases/pytorch/torchtitan/profiles/g5-12xlarge.env new file mode 100644 index 000000000..4db69aa52 --- /dev/null +++ b/3.test_cases/pytorch/torchtitan/profiles/g5-12xlarge.env @@ -0,0 +1,7 @@ +# g5.12xlarge — 4x A10G 24GB, no EFA, no NVLink, no GPUDirect RDMA +# Note: torchtitan model configs may need adjustment for 24GB GPUs. 
+# Use smaller model configs (e.g., llama3_8b.toml) or enable CPU offloading +# in the TOML config if running into OOM issues. +export GPUS_PER_NODE=4 +# No EFA on g5 — FI_PROVIDER intentionally unset +export NCCL_SOCKET_IFNAME="^docker,lo,veth" diff --git a/3.test_cases/pytorch/torchtitan/profiles/g6e-12xlarge.env b/3.test_cases/pytorch/torchtitan/profiles/g6e-12xlarge.env new file mode 100644 index 000000000..a6aba7c0b --- /dev/null +++ b/3.test_cases/pytorch/torchtitan/profiles/g6e-12xlarge.env @@ -0,0 +1,5 @@ +# g6e.12xlarge — 4x L40S 48GB, no EFA, no NVLink, no GPUDirect RDMA +# 48GB VRAM handles most torchtitan model configs without changes. +export GPUS_PER_NODE=4 +# No EFA on g6e — FI_PROVIDER intentionally unset +export NCCL_SOCKET_IFNAME="^docker,lo,veth" diff --git a/3.test_cases/pytorch/torchtitan/profiles/p4de-24xlarge.env b/3.test_cases/pytorch/torchtitan/profiles/p4de-24xlarge.env new file mode 100644 index 000000000..283f3d3d9 --- /dev/null +++ b/3.test_cases/pytorch/torchtitan/profiles/p4de-24xlarge.env @@ -0,0 +1,7 @@ +# p4de.24xlarge — 8x A100 80GB, 4 EFA, NVLink, GPUDirect RDMA +export GPUS_PER_NODE=8 +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export FI_EFA_USE_DEVICE_RDMA=1 +export NCCL_SOCKET_IFNAME="^docker,lo,veth" diff --git a/3.test_cases/pytorch/torchtitan/profiles/p5en-48xlarge.env b/3.test_cases/pytorch/torchtitan/profiles/p5en-48xlarge.env new file mode 100644 index 000000000..1eca3610a --- /dev/null +++ b/3.test_cases/pytorch/torchtitan/profiles/p5en-48xlarge.env @@ -0,0 +1,6 @@ +# p5en.48xlarge — 8x H200 141GB, 32 EFA, NVLink, GPUDirect RDMA +export GPUS_PER_NODE=8 +export FI_PROVIDER=efa +export FI_EFA_USE_HUGE_PAGE=0 +export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 +export NCCL_SOCKET_IFNAME="^docker,lo,veth" diff --git a/3.test_cases/pytorch/torchtitan/slurm/1.llama_3_8b_torchtitan.sh b/3.test_cases/pytorch/torchtitan/slurm/1.llama_3_8b_torchtitan.sh index faf164403..ead9f1626 100644 --- 
a/3.test_cases/pytorch/torchtitan/slurm/1.llama_3_8b_torchtitan.sh +++ b/3.test_cases/pytorch/torchtitan/slurm/1.llama_3_8b_torchtitan.sh @@ -11,28 +11,54 @@ set -ex; ########################### -###### User Variables ##### +###### Instance Profile ### ########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, EFA vars, NCCL settings. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. -GPUS_PER_NODE=8 +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi + +# Fallback defaults when no profile is loaded (assumes P5-class instance) +GPUS_PER_NODE=${GPUS_PER_NODE:-8} ########################### ## Environment Variables ## ########################### +# EFA networking — configured by profile or defaults. +if [[ "$PROFILE_LOADED" == "1" ]]; then + # Profile was sourced — trust its EFA settings. 
+    true
+else
+    # No profile — use legacy EFA defaults (P4/P5)
+    export FI_PROVIDER=efa
+    export FI_EFA_USE_HUGE_PAGE=0
+    export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
+fi
+
 export NCCL_DEBUG=INFO
-export FI_PROVIDER=efa
-export FI_EFA_USE_HUGE_PAGE=0
-## Switching SYNC_MEMOPS to zero can boost throughput with FSDP
-## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS
-## Reduces memory synchronizations
-## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html
-export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
+export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-"^docker,lo,veth"}
+
 # LD_PRELOAD is required for PyTorch to find the NCCL library
 # This path assumes you are using the Deep Learning AMI
 # If you are not using the DLAMI, you may need to update this path
 export LD_PRELOAD=/usr/local/cuda-12.1/lib/libnccl.so
-export NCCL_SOCKET_IFNAME=^docker,lo,veth

 ## Set HuggingFace metadata timeout (in seconds) for large clusters
 export HF_HUB_ETAG_TIMEOUT=60
diff --git a/3.test_cases/pytorch/trl/README.md b/3.test_cases/pytorch/trl/README.md
index 573e8a53a..63be24bc7 100644
--- a/3.test_cases/pytorch/trl/README.md
+++ b/3.test_cases/pytorch/trl/README.md
@@ -2,6 +2,18 @@
 This directory contains test cases for distributed training with [Hugging Face TRL](https://huggingface.co/docs/trl), a library for post-training LLMs using reinforcement learning techniques such as GRPO, PPO, DPO, and SFT.
+## Tested Configurations
+
+| Instance | GPUs | Status | Notes |
+|----------|------|--------|-------|
+| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work |
+| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work |
+| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work |
+| g5.12xlarge | 4 x A10G 24 GB | Untested | May need offloading for large models; see GRPO sub-cases |
+
+> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
+> for parameter adjustments needed across instance types.
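
The mapping from instance type to profile file is the dot-to-dash substitution used by `profiles/_detect.sh`; a minimal sketch of that conversion (values shown are illustrative):

```shell
# Mirror of the profile-name conversion in profiles/_detect.sh:
# an instance type like "g5.12xlarge" maps to the file "g5-12xlarge.env".
INSTANCE_TYPE="g5.12xlarge"
PROFILE_NAME="${INSTANCE_TYPE//./-}"   # bash substitution: dots -> dashes
echo "${PROFILE_NAME}.env"
```

To bypass detection entirely, export `INSTANCE_PROFILE` (already in profile-name form, e.g. `g5-12xlarge`) before submitting the job, as noted in the script headers.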
+ ## Base Docker Image All test cases share a common base Docker image defined in [`Dockerfile`](Dockerfile). It includes Python 3.12, PyTorch 2.6.0, TRL with vLLM backend, Flash Attention, FlashInfer, and common training dependencies. diff --git a/3.test_cases/pytorch/trl/grpo-math-reasoning/train.sbatch b/3.test_cases/pytorch/trl/grpo-math-reasoning/train.sbatch index 559b14c11..def10f882 100644 --- a/3.test_cases/pytorch/trl/grpo-math-reasoning/train.sbatch +++ b/3.test_cases/pytorch/trl/grpo-math-reasoning/train.sbatch @@ -5,9 +5,39 @@ set -ex -## Set libfabric flags to use EFA -export FI_PROVIDER=efa -export FI_EFA_FORK_SAFE=1 +########################### +###### Instance Profile ### +########################### +# Auto-detect instance type and source the matching profile. +# Profiles set: GPUS_PER_NODE, EFA vars, NCCL tuning. +# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch) +# See profiles/README.md for details. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROFILES_DIR="${SCRIPT_DIR}/../profiles" +PROFILE_LOADED=0 + +if [[ -d "$PROFILES_DIR" ]]; then + if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then + echo "Sourcing instance profile: $PROFILE_ENV" + source "$PROFILE_ENV" + PROFILE_LOADED=1 + else + echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)." + fi +else + echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)." +fi + +# EFA networking — configured by profile or defaults. 
+if [[ "$PROFILE_LOADED" != "1" ]]; then + # No profile — use legacy EFA defaults + export FI_PROVIDER=efa + export FI_EFA_FORK_SAFE=1 + export NCCL_BUFFSIZE=8388608 + export NCCL_P2P_NET_CHUNKSIZE=524288 + export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so +fi ## Set this flag for debugging EFA #export FI_LOG_LEVEL=warn @@ -15,24 +45,12 @@ export FI_EFA_FORK_SAFE=1 ## NCCL Environment variables # export NCCL_DEBUG=INFO -### Increase the send queue depth and can turn NCCL communications into non-blocking. -### https://www.usenix.org/system/files/atc23-choi.pdf -export NCCL_BUFFSIZE=8388608 -### Improve performance by increasing buffer size for Send/Recv, Gather, Scatter and Alltoall communications -### https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/p2p.html -export NCCL_P2P_NET_CHUNKSIZE=524288 - -### Improve performance for AllReduce by selecting specific protocol and algorithm for specific -### message size and number of ranks. -### More information https://github.com/aws/aws-ofi-nccl/wiki/Algorithm-and-Protocol-Tuner-for-AWS. 
-export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so - NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST)) TRAIN_NODES_NUM=$((SLURM_NNODES - 1)) TRAIN_NODES="${NODELIST[@]:0:$TRAIN_NODES_NUM}" VLLM_NODE="${NODELIST[$TRAIN_NODES_NUM]}" head_node_ip=${NODELIST[0]} -GPUS_PER_NODE=8 +GPUS_PER_NODE=${GPUS_PER_NODE:-8} LAUNCHER="accelerate launch \ --config_file /grpo-math-reasoning/deepspeed_zero3.yaml \ @@ -78,4 +96,4 @@ srun -l --mpi=pmix --cpu-bind=none --container-image ./trl-base.sqsh \ --nodes=1 --ntasks=1 --nodelist="${VLLM_NODE}" \ trl vllm-serve --model $MODEL --tensor_parallel_size $TENSOR_PARALLEL & -wait \ No newline at end of file +wait diff --git a/3.test_cases/pytorch/trl/profiles/README.md b/3.test_cases/pytorch/trl/profiles/README.md new file mode 100644 index 000000000..baa572da0 --- /dev/null +++ b/3.test_cases/pytorch/trl/profiles/README.md @@ -0,0 +1,30 @@ +# TRL Instance Profiles + +Instance profiles configure GPU count and EFA networking per EC2 instance type. +The profile is auto-detected at runtime by `_detect.sh`. + +## Detection Order + +1. `INSTANCE_PROFILE` env var (explicit override) +2. `INSTANCE_TYPE` env var +3. EC2 Instance Metadata API +4. GPU name from nvidia-smi + +## Available Profiles + +| Profile | Instance | GPU | VRAM | Status | +|---------|----------|-----|------|--------| +| [p5en-48xlarge.env](p5en-48xlarge.env) | p5en.48xlarge | 8x H200 | 141 GB | Tested | +| [p4de-24xlarge.env](p4de-24xlarge.env) | p4de.24xlarge | 8x A100 | 80 GB | Untested | +| [g5-12xlarge.env](g5-12xlarge.env) | g5.12xlarge | 4x A10G | 24 GB | Untested | +| [g6e-12xlarge.env](g6e-12xlarge.env) | g6e.12xlarge | 4x L40S | 48 GB | Untested | + +## Notes + +The `grpo-math-reasoning` script splits nodes between training (accelerate + +DeepSpeed) and vLLM serving. 
The `GPUS_PER_NODE` from the profile drives: +- `--num_processes` for accelerate (TRAIN_NODES * GPUS_PER_NODE) +- `TENSOR_PARALLEL` for vLLM (auto-computed from model's attention heads) + +The `gpt-oss-lora-grpo` sub-case uses K8s manifests that are currently +hardcoded to g6e.12xlarge. See its README for manual adjustment instructions. diff --git a/3.test_cases/pytorch/trl/profiles/_detect.sh b/3.test_cases/pytorch/trl/profiles/_detect.sh new file mode 100755 index 000000000..896664395 --- /dev/null +++ b/3.test_cases/pytorch/trl/profiles/_detect.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# instance_detect.sh — Auto-detect EC2 instance type and resolve a profile. +# +# CANONICAL SOURCE: This is the single source of truth for instance detection. +# Copies exist in each test case's profiles/_detect.sh directory. To update +# all copies, edit this file and run: ./sync_profiles.sh +# +# Usage (from a training script): +# PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles") +# source "$PROFILE_ENV" +# +# Detection order: +# 1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge") +# 2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge") +# 3. EC2 instance metadata API (works on bare metal and K8s with host networking) +# 4. GPU name from nvidia-smi (fallback when metadata is unavailable) +# +# Outputs the path to the profile .env file on stdout. +# Exits non-zero if no profile can be resolved. 
+# --------------------------------------------------------------------------- +set -euo pipefail + +PROFILES_DIR="${1:-.}" + +# --- Step 1: Check for explicit INSTANCE_PROFILE override ------------------- +if [[ -n "${INSTANCE_PROFILE:-}" ]]; then + PROFILE_NAME="$INSTANCE_PROFILE" + echo "Instance profile override: ${PROFILE_NAME}" >&2 +else + # --- Step 2: Try INSTANCE_TYPE from env_vars ---------------------------- + INSTANCE_TYPE="${INSTANCE_TYPE:-}" + + # --- Step 3: Try EC2 instance metadata API ------------------------------ + if [[ -z "$INSTANCE_TYPE" ]]; then + # IMDSv2: get a token first, then query + TOKEN=$(curl -s --connect-timeout 2 -X PUT \ + "http://169.254.169.254/latest/api/token" \ + -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true + if [[ -n "$TOKEN" ]]; then + INSTANCE_TYPE=$(curl -s --connect-timeout 2 \ + -H "X-aws-ec2-metadata-token: $TOKEN" \ + "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true + fi + fi + + # --- Step 4: Fallback — detect from GPU name ---------------------------- + if [[ -z "$INSTANCE_TYPE" ]]; then + if command -v nvidia-smi &>/dev/null; then + GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) || true + case "${GPU_NAME:-}" in + *A10G*) INSTANCE_TYPE="g5.12xlarge" ;; + *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;; + *A100*40*) INSTANCE_TYPE="p4d.24xlarge" ;; + *H100*) INSTANCE_TYPE="p5.48xlarge" ;; + *H200*) INSTANCE_TYPE="p5en.48xlarge" ;; + *L40S*) INSTANCE_TYPE="g6e.12xlarge" ;; + *L4*) INSTANCE_TYPE="g6.12xlarge" ;; + *) + echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + ;; + esac + echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2 + else + echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." 
>&2 + exit 1 + fi + fi + + # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge" + PROFILE_NAME="${INSTANCE_TYPE//./-}" + echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2 +fi + +# --- Resolve profile file path ---------------------------------------------- +PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env" + +if [[ ! -f "$PROFILE_PATH" ]]; then + echo "ERROR: No profile found at ${PROFILE_PATH}" >&2 + echo "" >&2 + echo "Available profiles:" >&2 + ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\// /' >&2 || echo " (none)" >&2 + echo "" >&2 + echo "To create a new profile, copy an existing one and adjust the values:" >&2 + echo " cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2 + exit 1 +fi + +# Output the resolved path (recipe script will source it) +echo "$PROFILE_PATH" diff --git a/3.test_cases/pytorch/trl/profiles/g5-12xlarge.env b/3.test_cases/pytorch/trl/profiles/g5-12xlarge.env new file mode 100644 index 000000000..2b5fc21bf --- /dev/null +++ b/3.test_cases/pytorch/trl/profiles/g5-12xlarge.env @@ -0,0 +1,5 @@ +# g5.12xlarge — 4x A10G 24GB, no EFA, no NVLink, no GPUDirect RDMA +# Note: vLLM server will use fewer GPUs for tensor parallelism. +# The script auto-computes TENSOR_PARALLEL from GPUS_PER_NODE and model config. 
+export GPUS_PER_NODE=4 +# No EFA on g5 — FI_PROVIDER intentionally unset diff --git a/3.test_cases/pytorch/trl/profiles/g6e-12xlarge.env b/3.test_cases/pytorch/trl/profiles/g6e-12xlarge.env new file mode 100644 index 000000000..e4debc578 --- /dev/null +++ b/3.test_cases/pytorch/trl/profiles/g6e-12xlarge.env @@ -0,0 +1,3 @@ +# g6e.12xlarge — 4x L40S 48GB, no EFA, no NVLink, no GPUDirect RDMA +export GPUS_PER_NODE=4 +# No EFA on g6e — FI_PROVIDER intentionally unset diff --git a/3.test_cases/pytorch/trl/profiles/p4de-24xlarge.env b/3.test_cases/pytorch/trl/profiles/p4de-24xlarge.env new file mode 100644 index 000000000..385fe128d --- /dev/null +++ b/3.test_cases/pytorch/trl/profiles/p4de-24xlarge.env @@ -0,0 +1,8 @@ +# p4de.24xlarge — 8x A100 80GB, 4 EFA, NVLink, GPUDirect RDMA +export GPUS_PER_NODE=8 +export FI_PROVIDER=efa +export FI_EFA_FORK_SAFE=1 +export FI_EFA_USE_DEVICE_RDMA=1 +export NCCL_BUFFSIZE=8388608 +export NCCL_P2P_NET_CHUNKSIZE=524288 +export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so diff --git a/3.test_cases/pytorch/trl/profiles/p5en-48xlarge.env b/3.test_cases/pytorch/trl/profiles/p5en-48xlarge.env new file mode 100644 index 000000000..ad8acee22 --- /dev/null +++ b/3.test_cases/pytorch/trl/profiles/p5en-48xlarge.env @@ -0,0 +1,7 @@ +# p5en.48xlarge — 8x H200 141GB, 32 EFA, NVLink, GPUDirect RDMA +export GPUS_PER_NODE=8 +export FI_PROVIDER=efa +export FI_EFA_FORK_SAFE=1 +export NCCL_BUFFSIZE=8388608 +export NCCL_P2P_NET_CHUNKSIZE=524288 +export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so diff --git a/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/README.md b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/README.md index 8bf187c2f..b05083488 100644 --- a/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/README.md +++ b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/README.md @@ -10,6 +10,39 @@ This repository provides a complete setup for running reinforcement learning fro [Reinforcement Learning from Verifiable 
Rewards (RLVR)](https://arxiv.org/abs/2506.14245) is a training approach where models learn from tasks with objectively verifiable outcomes, such as math problems or code execution. Unlike human preference-based RL, RLVR uses ground-truth correctness as the reward signal, making it particularly effective for reasoning tasks.
+## Tested Configurations
+
+| Instance | GPUs | Model | Nodes | Key Settings | Status |
+|----------|------|-------|-------|-------------|--------|
+| p5en.48xlarge | 8 x H200 141 GB | Qwen3-8B | 4 | FSDP1, TP=2, ref_offload only | Tested |
+| g5.12xlarge | 4 x A10G 24 GB | gpt-oss-20b (MoE) | 3 workers + 1 head | FSDP2, full offload, TP=4, bf16 | Tested |
+| p4de.24xlarge | 8 x A100 80 GB | Qwen3-8B | 4 | FSDP1, TP=2 | Untested |
+| g6e.12xlarge | 4 x L40S 48 GB | — | — | — | Untested |
+
+> **Running on a different instance type?** See the
+> [Instance Compatibility Guide](../../../../../docs/instance-compatibility.md)
+> for the parameter changes needed when moving between instance families, and
+> the [instance profiles](../../../../../docs/instance-profiles/) for
+> per-instance hardware details and NCCL/EFA settings.
+>
+> **g5 users**: Running on g5.12xlarge (A10G 24 GB) requires significant
+> parameter changes including FSDP2 with full CPU offloading, TP=4,
+> enforce_eager=True, and NCCL_PROTO=simple. 
The key differences are: +> +> | Parameter | p5en (80 GB) | g5 (24 GB) | Why | +> |-----------|-------------|-----------|-----| +> | FSDP strategy | `fsdp` (FSDP1) | `fsdp2` | FSDP1 disables CPUOffload for actor | +> | `offload_policy` | not set | `True` | Enables FSDP2 CPU offloading | +> | `model_dtype` | default (fp32) | `bf16` | Explicit bf16 halves memory | +> | `enforce_eager` | `False` | `True` | CUDA graphs OOM on 24 GB | +> | `tensor_parallel_size` | 2 | 4 | Shard across all 4 GPUs | +> | `param_offload` | `False` | `True` | Offload params to CPU | +> | `optimizer_offload` | `False` | `True` | Offload optimizer to CPU | +> | `NCCL_PROTO` | default | `simple` | No GPUDirect RDMA on g5 | +> | `save_freq` | 1 | 20+ | 117 GB/ckpt for 20B; fills disk fast | +> | `WORKER_MEMORY` | 200 Gi+ | 150 Gi | g5.12xl allocatable ~168 Gi | +> | `nnodes` | node count | worker count only | Head pod without GPUs causes NCCL hang | + ## Getting started ### Prerequisites diff --git a/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/README.md b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/README.md new file mode 100644 index 000000000..69f8040b2 --- /dev/null +++ b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/README.md @@ -0,0 +1,102 @@ +# Instance Profiles for veRL RLVR Recipes + +This directory contains instance-specific configuration profiles that +override the hardware-dependent parameters in the recipe scripts. + +## How It Works + +1. The recipe script (e.g., `run_grpo_configurable.sh`) calls `_detect.sh` + to determine which profile to load +2. `_detect.sh` resolves the instance type (from env var, EC2 metadata, or + GPU detection) and returns the path to the matching `.env` file +3. The recipe sources the `.env` file, overriding default values with + instance-specific settings + +## Detection Order + +The profile is selected by the first method that succeeds: + +1. 
**`INSTANCE_PROFILE` env var** — explicit override (e.g., `g5-12xlarge`)
+2. **`INSTANCE_TYPE` env var** — from `setup/env_vars` (e.g., `g5.12xlarge`)
+3. **EC2 instance metadata API** — works on bare metal and K8s with host networking
+4. **GPU name from `nvidia-smi`** — fallback when metadata is unavailable
+
+## Available Profiles
+
+| Profile | Instance | GPU | VRAM | Tested Model | Status |
+|---------|----------|-----|------|-------------|--------|
+| [p5en-48xlarge.env](p5en-48xlarge.env) | p5en.48xlarge | 8x H200 | 141 GB | Qwen3-8B (dense) | Tested |
+| [g5-12xlarge.env](g5-12xlarge.env) | g5.12xlarge | 4x A10G | 24 GB | gpt-oss-20b (MoE) | Tested |
+| [p4de-24xlarge.env](p4de-24xlarge.env) | p4de.24xlarge | 8x A100 | 80 GB | — | Untested |
+
+## About Model Assumptions
+
+Each profile is tested with a specific model and documents what to change
+for other model sizes. The instance-dependent settings (NCCL, EFA, GPU
+count) stay the same regardless of model. What changes per model:
+
+| Setting | Driven by... |
+|---------|-------------|
+| `TENSOR_PARALLEL_SIZE` | Model size relative to per-GPU VRAM |
+| `PARAM_OFFLOAD`, `OPTIMIZER_OFFLOAD` | Whether model + optimizer fit in GPU VRAM |
+| `GPU_MEMORY_UTILIZATION` | Model shard size relative to per-GPU VRAM |
+| `LOG_PROB_MICRO_BSZ_PER_GPU` | Activation memory during log-prob computation |
+| `MAX_RESPONSE_LENGTH` | KV cache size (longer = more VRAM) |
+| `SAVE_FREQ`, `MAX_ACTOR_CKPT_TO_KEEP` | Checkpoint size (scales with total params) |
+
+Settings that do NOT change per model (only per instance):
+
+| Setting | Driven by... 
| +|---------|-------------| +| `NCCL_PROTO`, `FI_EFA_USE_DEVICE_RDMA` | Whether instance has GPUDirect RDMA | +| `NUM_GPU_PER_NODE`, `NUM_EFA_PER_NODE` | Instance hardware | +| `WORKER_MEMORY`, `WORKER_CPU` | Instance allocatable resources | +| `ENFORCE_EAGER` | GPU VRAM headroom (always True on 24GB) | + +See the `MODEL ASSUMPTIONS` comment block at the top of each `.env` file +for guidance on adjusting settings for different models. + +## What's in a Profile + +Profiles contain **only instance-dependent parameters** — settings that +change based on hardware. Algorithm-specific settings (KL loss, reward +function, dataset, learning rate, etc.) stay in the recipe script. + +Instance-dependent parameters include: + +| Category | Parameters | +|----------|-----------| +| Cluster geometry | `NUM_GPU_PER_NODE`, `NUM_EFA_PER_NODE` | +| FSDP strategy | `ACTOR_STRATEGY`, `PARAM_OFFLOAD`, `OPTIMIZER_OFFLOAD`, `OFFLOAD_POLICY`, `MODEL_DTYPE`, `RESHARD_AFTER_FORWARD` | +| vLLM rollout | `TENSOR_PARALLEL_SIZE`, `GPU_MEMORY_UTILIZATION`, `ENFORCE_EAGER`, `ROLLOUT_DTYPE` | +| NCCL / EFA | `NCCL_PROTO`, `FI_EFA_USE_DEVICE_RDMA` | +| Training | `MAX_RESPONSE_LENGTH`, `LOG_PROB_MICRO_BSZ_PER_GPU` | +| Checkpoints | `SAVE_FREQ`, `MAX_ACTOR_CKPT_TO_KEEP`, `TEST_FREQ` | +| K8s resources | `WORKER_MEMORY`, `WORKER_CPU` | + +## Creating a New Profile + +1. Copy the closest existing profile: + ```bash + cp p5en-48xlarge.env g6e-12xlarge.env + ``` + +2. Adjust the parameters based on the target instance's hardware. + See [docs/instance-profiles/](../../../../../../docs/instance-profiles/) + for hardware specs. + +3. Key questions when creating a profile: + - **GPU VRAM < 48 GB?** → Likely need FSDP2 + offloading + - **No GPUDirect RDMA?** → Set `NCCL_PROTO=simple`, `FI_EFA_USE_DEVICE_RDMA=0` + - **4 GPUs per node?** → Set `TENSOR_PARALLEL_SIZE=4` + - **Less than 200 Gi allocatable RAM?** → Reduce `WORKER_MEMORY` + +4. 
Test with a 2-step smoke run before committing: + ```bash + export INSTANCE_PROFILE=g6e-12xlarge + export TOTAL_EPOCHS=1 + ./recipe/run_grpo_configurable.sh + ``` + +5. Update the Tested Configurations table in the README and the + [central compatibility matrix](../../../../../../docs/instance-compatibility.md). diff --git a/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/_detect.sh b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/_detect.sh new file mode 100755 index 000000000..896664395 --- /dev/null +++ b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/_detect.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# instance_detect.sh — Auto-detect EC2 instance type and resolve a profile. +# +# CANONICAL SOURCE: This is the single source of truth for instance detection. +# Copies exist in each test case's profiles/_detect.sh directory. To update +# all copies, edit this file and run: ./sync_profiles.sh +# +# Usage (from a training script): +# PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles") +# source "$PROFILE_ENV" +# +# Detection order: +# 1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge") +# 2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge") +# 3. EC2 instance metadata API (works on bare metal and K8s with host networking) +# 4. GPU name from nvidia-smi (fallback when metadata is unavailable) +# +# Outputs the path to the profile .env file on stdout. +# Exits non-zero if no profile can be resolved. 
+# --------------------------------------------------------------------------- +set -euo pipefail + +PROFILES_DIR="${1:-.}" + +# --- Step 1: Check for explicit INSTANCE_PROFILE override ------------------- +if [[ -n "${INSTANCE_PROFILE:-}" ]]; then + PROFILE_NAME="$INSTANCE_PROFILE" + echo "Instance profile override: ${PROFILE_NAME}" >&2 +else + # --- Step 2: Try INSTANCE_TYPE from env_vars ---------------------------- + INSTANCE_TYPE="${INSTANCE_TYPE:-}" + + # --- Step 3: Try EC2 instance metadata API ------------------------------ + if [[ -z "$INSTANCE_TYPE" ]]; then + # IMDSv2: get a token first, then query + TOKEN=$(curl -s --connect-timeout 2 -X PUT \ + "http://169.254.169.254/latest/api/token" \ + -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true + if [[ -n "$TOKEN" ]]; then + INSTANCE_TYPE=$(curl -s --connect-timeout 2 \ + -H "X-aws-ec2-metadata-token: $TOKEN" \ + "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true + fi + fi + + # --- Step 4: Fallback — detect from GPU name ---------------------------- + if [[ -z "$INSTANCE_TYPE" ]]; then + if command -v nvidia-smi &>/dev/null; then + GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) || true + case "${GPU_NAME:-}" in + *A10G*) INSTANCE_TYPE="g5.12xlarge" ;; + *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;; + *A100*40*) INSTANCE_TYPE="p4d.24xlarge" ;; + *H100*) INSTANCE_TYPE="p5.48xlarge" ;; + *H200*) INSTANCE_TYPE="p5en.48xlarge" ;; + *L40S*) INSTANCE_TYPE="g6e.12xlarge" ;; + *L4*) INSTANCE_TYPE="g6.12xlarge" ;; + *) + echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + ;; + esac + echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2 + else + echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." 
>&2 + exit 1 + fi + fi + + # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge" + PROFILE_NAME="${INSTANCE_TYPE//./-}" + echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2 +fi + +# --- Resolve profile file path ---------------------------------------------- +PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env" + +if [[ ! -f "$PROFILE_PATH" ]]; then + echo "ERROR: No profile found at ${PROFILE_PATH}" >&2 + echo "" >&2 + echo "Available profiles:" >&2 + ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\// /' >&2 || echo " (none)" >&2 + echo "" >&2 + echo "To create a new profile, copy an existing one and adjust the values:" >&2 + echo " cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2 + exit 1 +fi + +# Output the resolved path (recipe script will source it) +echo "$PROFILE_PATH" diff --git a/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/g5-12xlarge.env b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/g5-12xlarge.env new file mode 100644 index 000000000..6cb214e0c --- /dev/null +++ b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/g5-12xlarge.env @@ -0,0 +1,70 @@ +# g5.12xlarge — 4x A10G 24GB, 1 EFA, no NVLink, no GPUDirect RDMA +# Requires aggressive memory optimization for models >10B parameters. +# +# Validated with: openai/gpt-oss-20b (MoE, ~3B active params) GRPO on +# 3 worker + 1 head node, Multilingual-Thinking dataset +# +# MODEL ASSUMPTIONS (gpt-oss-20b MoE, ~40GB bf16 total, ~3B active): +# This profile assumes full offloading. The settings are conservative +# and should work for most models up to ~20B on g5. For other models: +# - <7B dense (e.g., Qwen3-8B): may not need full offloading. Try +# PARAM_OFFLOAD=False, OPTIMIZER_OFFLOAD=False first. Keep bf16. +# - 7B-20B dense: full offloading likely needed. These settings should +# work, but reduce TRAIN_BATCH_SIZE if OOM during backward. +# - >20B dense: will NOT fit on g5 even with offloading. Use p4de/p5. 
+# - MoE models: active params determine GPU memory. A 20B MoE with +# ~3B active params fits; a 40B MoE with 10B active may not. +# - Checkpoint sizes scale with total params: 20B = ~117GB/ckpt. +# Adjust SAVE_FREQ and MAX_ACTOR_CKPT_TO_KEEP for your model. +# +# Key lessons from 11 OOM iterations: +# - FSDP2 required (FSDP1 disables CPUOffload for actor role) +# - offload_policy=True required (FSDP2-specific CPU offloading flag) +# - model_dtype=bf16 required (veRL defaults actor to fp32) +# - enforce_eager=True required (CUDA graphs OOM on 24GB) +# - TP=4 required (shard across all GPUs per node) +# - save_freq=20+ required (117GB/ckpt for 20B model fills disk fast) +# - WORKER_MEMORY=150Gi (g5.12xl allocatable is ~168Gi, NOT 200Gi) +# - nnodes must exclude non-GPU head pod (Ray head has no GPUs in K8s) +# --------------------------------------------------------------------------- + +# --- Cluster geometry ------------------------------------------------------- +export NUM_GPU_PER_NODE=4 +export NUM_EFA_PER_NODE=1 + +# --- FSDP strategy ---------------------------------------------------------- +# FSDP2 with full CPU offloading — required for 24GB GPUs with >10B models +export ACTOR_STRATEGY=fsdp2 +export PARAM_OFFLOAD=True +export OPTIMIZER_OFFLOAD=True +export OFFLOAD_POLICY=True # FSDP2-specific: enables proper offload +export MODEL_DTYPE=bf16 # veRL defaults actor to fp32 — force bf16 +export RESHARD_AFTER_FORWARD=True # Free GPU memory after each forward pass + +# Ref model also offloaded +export REF_PARAM_OFFLOAD=True + +# --- vLLM rollout ----------------------------------------------------------- +export TENSOR_PARALLEL_SIZE=4 # Shard across all 4 GPUs (no NVLink) +export GPU_MEMORY_UTILIZATION=0.6 # 0.6 x 23GB = 13.8GB for vLLM +export ENFORCE_EAGER=True # CUDA graphs OOM on 24GB +export ROLLOUT_DTYPE=bfloat16 # Explicit dtype for vLLM + +# --- NCCL / EFA networking -------------------------------------------------- +export NCCL_PROTO=simple # Required: 
no GPUDirect RDMA on g5 +export FI_EFA_USE_DEVICE_RDMA=0 + +# --- Training defaults ------------------------------------------------------ +export MAX_RESPONSE_LENGTH=256 # Shorter to reduce KV cache pressure +export LOG_PROB_MICRO_BSZ_PER_GPU=2 # Keep low to avoid OOM during log-prob + +# --- Checkpoint management -------------------------------------------------- +# Full FSDP state for 20B model across 12 GPUs = ~117GB per checkpoint. +# save_freq=1 on 1.2TB FSx fills disk in 9 steps. +export SAVE_FREQ=20 +export MAX_ACTOR_CKPT_TO_KEEP=3 # Keep max 3 checkpoints (~351GB) +export TEST_FREQ=20 + +# --- K8s resource requests -------------------------------------------------- +export WORKER_MEMORY=150Gi # g5.12xl allocatable ~168Gi, NOT 200Gi +export WORKER_CPU=40 diff --git a/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/p4de-24xlarge.env b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/p4de-24xlarge.env new file mode 100644 index 000000000..56a102a97 --- /dev/null +++ b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/p4de-24xlarge.env @@ -0,0 +1,52 @@ +# p4de.24xlarge — 8x A100 80GB, 4 EFA, NVSwitch, GPUDirect RDMA +# Same VRAM as p5/p5en but fewer EFA adapters (4 vs 32). +# +# Status: Untested — expected to work with these settings. +# Based on p5en profile with EFA count adjusted. +# +# MODEL ASSUMPTIONS (same as p5en — 80GB VRAM): +# Same model fitting characteristics as p5en. The only difference is +# inter-node bandwidth (4 EFA vs 32). For large multi-node runs with +# big models, gradient sync may be slower. 
Consider:
+# - Increasing gradient_accumulation_steps to reduce sync frequency
+# - Using gradient compression if available
+# ---------------------------------------------------------------------------
+
+# --- Cluster geometry -------------------------------------------------------
+export NUM_GPU_PER_NODE=8
+export NUM_EFA_PER_NODE=4
+
+# --- FSDP strategy ----------------------------------------------------------
+# Same as p5en — 80GB VRAM, no offloading needed
+export ACTOR_STRATEGY=fsdp
+export PARAM_OFFLOAD=False
+export OPTIMIZER_OFFLOAD=False
+export OFFLOAD_POLICY= # Not set — FSDP1 doesn't use this
+export MODEL_DTYPE= # Not set — use veRL default
+export RESHARD_AFTER_FORWARD= # Not set — use veRL default
+
+export REF_PARAM_OFFLOAD=True
+
+# --- vLLM rollout -----------------------------------------------------------
+export TENSOR_PARALLEL_SIZE=2 # NVSwitch available
+export GPU_MEMORY_UTILIZATION=0.6
+export ENFORCE_EAGER=False # CUDA graphs are fine on 80GB
+export ROLLOUT_DTYPE= # Not set — use veRL default
+
+# --- NCCL / EFA networking --------------------------------------------------
+export NCCL_PROTO= # Default — RDMA available
+export FI_EFA_USE_DEVICE_RDMA=1
+
+# --- Training defaults ------------------------------------------------------
+export MAX_RESPONSE_LENGTH=1024
+export LOG_PROB_MICRO_BSZ_PER_GPU=32
+
+# --- Checkpoint management --------------------------------------------------
+export SAVE_FREQ=1
+export MAX_ACTOR_CKPT_TO_KEEP= # Not set — use veRL default
+export TEST_FREQ=2
+
+# --- K8s resource requests --------------------------------------------------
+# p4de has ~1100Gi allocatable; 900Gi leaves headroom for system pods
+export WORKER_MEMORY=900Gi
+export WORKER_CPU=90
diff --git a/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/p5en-48xlarge.env b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/p5en-48xlarge.env
new file mode 100644
index 000000000..a95c1c546
--- /dev/null
+++ 
b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/p5en-48xlarge.env @@ -0,0 +1,55 @@ +# p5en.48xlarge — 8x H200 80GB, 32 EFA, NVSwitch, GPUDirect RDMA +# This is the default/baseline profile. Most parameters stay at veRL defaults. +# +# Validated with: Qwen3-8B GRPO on 4 nodes (32 GPUs), GSM8K dataset +# +# MODEL ASSUMPTIONS (Qwen3-8B, ~8B dense params): +# This profile assumes a model that fits comfortably in 80GB without +# offloading. For larger models, adjust: +# - 30B-70B dense: increase TP to 4 or 8, may need PARAM_OFFLOAD=True +# - 70B+ dense: likely need FSDP2 + offloading even on 80GB +# - MoE models: active params matter more than total; 20B MoE (like +# gpt-oss-20b with ~3B active) fits at TP=2 on 80GB +# - Batch sizes: larger models need smaller TRAIN_BATCH_SIZE and +# LOG_PROB_MICRO_BSZ_PER_GPU to avoid OOM during backward pass +# --------------------------------------------------------------------------- + +# --- Cluster geometry ------------------------------------------------------- +export NUM_GPU_PER_NODE=8 +export NUM_EFA_PER_NODE=32 + +# --- FSDP strategy ---------------------------------------------------------- +# FSDP1 is fine on 80GB GPUs — no need for FSDP2 or full offloading +export ACTOR_STRATEGY=fsdp +export PARAM_OFFLOAD=False +export OPTIMIZER_OFFLOAD=False +export OFFLOAD_POLICY= # Not set — FSDP1 doesn't use this +export MODEL_DTYPE= # Not set — use veRL default +export RESHARD_AFTER_FORWARD= # Not set — use veRL default + +# Ref model offload to CPU is a good default even on 80GB +export REF_PARAM_OFFLOAD=True + +# --- vLLM rollout ----------------------------------------------------------- +export TENSOR_PARALLEL_SIZE=2 # NVSwitch makes TP=2 efficient +export GPU_MEMORY_UTILIZATION=0.6 +export ENFORCE_EAGER=False # CUDA graphs are fine on 80GB +export ROLLOUT_DTYPE= # Not set — use veRL default + +# --- NCCL / EFA networking -------------------------------------------------- +export NCCL_PROTO= # Default (LL/LL128) — 
RDMA available +export FI_EFA_USE_DEVICE_RDMA=1 + +# --- Training defaults ------------------------------------------------------ +export MAX_RESPONSE_LENGTH=1024 +export LOG_PROB_MICRO_BSZ_PER_GPU=32 + +# --- Checkpoint management -------------------------------------------------- +export SAVE_FREQ=1 +export MAX_ACTOR_CKPT_TO_KEEP= # Not set — use veRL default +export TEST_FREQ=2 + +# --- K8s resource requests -------------------------------------------------- +# These are informational — used by raycluster.yaml via envsubst +export WORKER_MEMORY=600Gi +export WORKER_CPU=180 diff --git a/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/run_grpo_configurable.sh b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/run_grpo_configurable.sh index b2a1e31db..868168425 100755 --- a/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/run_grpo_configurable.sh +++ b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/recipe/run_grpo_configurable.sh @@ -1,11 +1,34 @@ #!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 set -xeuo pipefail -# Project configuration +# --------------------------------------------------------------------------- +# veRL GRPO training — instance-aware configurable recipe +# +# This script auto-detects the instance type and loads the appropriate +# hardware profile before submitting a GRPO training job to Ray. +# +# Profile loading order: +# 1. env_vars (cluster config, model paths, tokens) +# 2. Instance profile (FSDP strategy, offloading, TP, NCCL settings) +# 3. Explicit env var overrides (anything set after sourcing profile wins) +# +# See recipe/profiles/README.md for available profiles and how to create new ones. 
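The loading order above can be sketched in isolation. This is a minimal illustration, not part of the recipe; the temp file stands in for a `profiles/*.env` file:

```shell
# Minimal sketch of the precedence: a profile is sourced (step 2), then any
# value exported afterwards (step 3) wins over the profile's export.
profile=$(mktemp)
cat > "$profile" <<'EOF'
export TENSOR_PARALLEL_SIZE=2
export GPU_MEMORY_UTILIZATION=0.6
EOF
source "$profile"               # step 2: instance profile sets defaults
export TENSOR_PARALLEL_SIZE=4   # step 3: explicit override wins
echo "TP=${TENSOR_PARALLEL_SIZE} gpu_mem=${GPU_MEMORY_UTILIZATION}"
rm -f "$profile"
```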
+# --------------------------------------------------------------------------- + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# --- Load instance profile -------------------------------------------------- +PROFILE_ENV=$("${SCRIPT_DIR}/profiles/_detect.sh" "${SCRIPT_DIR}/profiles") +echo "Loading instance profile: ${PROFILE_ENV}" +source "$PROFILE_ENV" + +# --- Project configuration -------------------------------------------------- project_name='GRPO' exp_name="GRPO-${MODEL_NAME}" -# GRPO Algorithm parameters +# --- GRPO Algorithm parameters (task-specific, not instance-dependent) ------ adv_estimator=grpo use_kl_in_reward=False use_kl_loss=True @@ -13,101 +36,166 @@ kl_loss_coef=0.001 kl_loss_type=low_var_kl entropy_coeff=0 -# Token length configuration +# --- Token length configuration --------------------------------------------- max_prompt_length=512 -max_response_length=1024 +max_response_length=${MAX_RESPONSE_LENGTH:-1024} filter_overlong_prompts=True truncation='error' -# Training configuration -train_prompt_bsz=${TRAIN_BATCH_SIZE:-32} # Reduced from 256 for faster testing +# --- Training configuration ------------------------------------------------- +train_prompt_bsz=${TRAIN_BATCH_SIZE:-32} gen_prompt_bsz=${GEN_BATCH_SIZE:-$train_prompt_bsz} -n_resp_per_prompt=${N_RESP_PER_PROMPT:-2} # Reduced from 5 for faster testing +n_resp_per_prompt=${N_RESP_PER_PROMPT:-2} train_prompt_mini_bsz=16 # Must be <= train_prompt_bsz train_prompt_micro_bsz_per_gpu=1 -# Ray configuration from env_vars +# --- Ray configuration from env_vars ---------------------------------------- RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"} WORKING_DIR=${WORKING_DIR:-"${PWD}"} -# RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"} -# Cluster configuration from env_vars +# --- Cluster configuration (from profile, overridable by env_vars) ---------- NNODES=${NUM_NODES:-4} GPUS_PER_NODE=${NUM_GPU_PER_NODE:-8} -# Model and data paths from env_vars 
+# --- Model and data paths from env_vars ------------------------------------- MODEL_NAME=${MODEL_NAME:-"Qwen3-8B"} MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-8B"} RAY_DATA_HOME=${RAY_DATA_HOME:-"/fsx/verl"} CKPTS_DIR="${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}" -# Data files - using GSM8K dataset +# --- Data files -------------------------------------------------------------- TRAIN_FILE="${RAY_DATA_HOME}/data/gsm8k/train.parquet" TEST_FILE="${RAY_DATA_HOME}/data/gsm8k/test.parquet" -# Performance parameters -gen_tp=2 -log_prob_micro_bsz_per_gpu=32 -gpu_memory_utilization=0.6 +# --- Performance parameters (from profile, overridable) --------------------- +gen_tp=${TENSOR_PARALLEL_SIZE:-2} +log_prob_micro_bsz_per_gpu=${LOG_PROB_MICRO_BSZ_PER_GPU:-32} +gpu_memory_utilization=${GPU_MEMORY_UTILIZATION:-0.6} +enforce_eager=${ENFORCE_EAGER:-False} + +# --- FSDP / memory optimization (from profile) ------------------------------ +actor_strategy=${ACTOR_STRATEGY:-fsdp} +model_dtype=${MODEL_DTYPE:-} +param_offload=${PARAM_OFFLOAD:-False} +optimizer_offload=${OPTIMIZER_OFFLOAD:-False} +offload_policy=${OFFLOAD_POLICY:-} +reshard_after_forward=${RESHARD_AFTER_FORWARD:-} +ref_param_offload=${REF_PARAM_OFFLOAD:-True} +rollout_dtype=${ROLLOUT_DTYPE:-} -# Memory optimization -param_offload=False -optimizer_offload=False -ref_param_offload=True +# --- Checkpoint management (from profile) ------------------------------------ +save_freq=${SAVE_FREQ:-1} +test_freq=${TEST_FREQ:-2} +max_actor_ckpt_to_keep=${MAX_ACTOR_CKPT_TO_KEEP:-} +total_epochs=${TOTAL_EPOCHS:-2} +resume_mode=${RESUME_MODE:-} -# Print configuration for verification +# --- Print configuration for verification ----------------------------------- echo "=== GRPO Training Configuration ===" -echo "Project: ${project_name}" -echo "Experiment: ${exp_name}" -echo "Model: ${MODEL_NAME} (${MODEL_PATH})" -echo "Nodes: ${NNODES}" -echo "GPUs per node: ${GPUS_PER_NODE}" -echo "Total GPUs: $((NNODES * GPUS_PER_NODE))" -echo 
"Data home: ${RAY_DATA_HOME}" -echo "Checkpoints: ${CKPTS_DIR}" -echo "Ray address: ${RAY_ADDRESS}" +echo "Project : ${project_name}" +echo "Experiment : ${exp_name}" +echo "Model : ${MODEL_NAME} (${MODEL_PATH})" +echo "Profile : ${PROFILE_ENV}" +echo "Nodes : ${NNODES}" +echo "GPUs/node : ${GPUS_PER_NODE}" +echo "Total GPUs : $((NNODES * GPUS_PER_NODE))" +echo "Strategy : ${actor_strategy}" +echo "Model dtype : ${model_dtype:-default}" +echo "TP : ${gen_tp}" +echo "gpu_mem_util : ${gpu_memory_utilization}" +echo "enforce_eager : ${enforce_eager}" +echo "param_offload : ${param_offload}" +echo "optim_offload : ${optimizer_offload}" +echo "offload_policy: ${offload_policy:-not set}" +echo "ref_offload : ${ref_param_offload}" +echo "NCCL_PROTO : ${NCCL_PROTO:-default}" +echo "EFA RDMA : ${FI_EFA_USE_DEVICE_RDMA:-not set}" +echo "save_freq : ${save_freq}" +echo "Data home : ${RAY_DATA_HOME}" +echo "Checkpoints : ${CKPTS_DIR}" +echo "Ray address : ${RAY_ADDRESS}" echo "==================================" -# Submit Ray job -ray job submit --no-wait \ - --working-dir "${WORKING_DIR}" \ - -- python3 -m verl.trainer.main_ppo \ - algorithm.adv_estimator=${adv_estimator} \ - data.train_files="${TRAIN_FILE}" \ - data.val_files="${TEST_FILE}" \ - data.prompt_key=question \ - data.train_batch_size=${train_prompt_bsz} \ - data.max_prompt_length=${max_prompt_length} \ - data.max_response_length=${max_response_length} \ - data.filter_overlong_prompts=${filter_overlong_prompts} \ - data.truncation=${truncation} \ - actor_rollout_ref.model.path="${MODEL_PATH}" \ - actor_rollout_ref.model.use_remove_padding=True \ - actor_rollout_ref.model.enable_gradient_checkpointing=True \ - actor_rollout_ref.actor.optim.lr=1e-6 \ - actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \ - actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${train_prompt_micro_bsz_per_gpu} \ - actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \ - 
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \ - actor_rollout_ref.actor.kl_loss_type=${kl_loss_type} \ - actor_rollout_ref.actor.entropy_coeff=${entropy_coeff} \ - actor_rollout_ref.actor.fsdp_config.param_offload=${param_offload} \ - actor_rollout_ref.actor.fsdp_config.optimizer_offload=${optimizer_offload} \ - actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${log_prob_micro_bsz_per_gpu} \ - actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \ - actor_rollout_ref.rollout.name=vllm \ - actor_rollout_ref.rollout.gpu_memory_utilization=${gpu_memory_utilization} \ - actor_rollout_ref.rollout.n=${n_resp_per_prompt} \ - actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${log_prob_micro_bsz_per_gpu} \ - actor_rollout_ref.ref.fsdp_config.param_offload=${ref_param_offload} \ - algorithm.use_kl_in_reward=${use_kl_in_reward} \ - trainer.critic_warmup=0 \ - trainer.logger='["console"]' \ - trainer.project_name="${project_name}" \ - trainer.experiment_name="${exp_name}" \ - trainer.n_gpus_per_node=${GPUS_PER_NODE} \ - trainer.nnodes=${NNODES} \ - trainer.default_local_dir="${CKPTS_DIR}" \ - trainer.save_freq=1 \ - trainer.test_freq=2 \ - trainer.total_epochs=2 +# --- Build ray job submit command dynamically -------------------------------- +# Start with required arguments +RAY_CMD=( + ray job submit --no-wait + --working-dir "${WORKING_DIR}" + -- python3 -m verl.trainer.main_ppo + algorithm.adv_estimator=${adv_estimator} + data.train_files="${TRAIN_FILE}" + data.val_files="${TEST_FILE}" + data.prompt_key=question + data.train_batch_size=${train_prompt_bsz} + data.max_prompt_length=${max_prompt_length} + data.max_response_length=${max_response_length} + data.filter_overlong_prompts=${filter_overlong_prompts} + data.truncation=${truncation} + actor_rollout_ref.model.path="${MODEL_PATH}" + actor_rollout_ref.model.use_remove_padding=True + actor_rollout_ref.model.enable_gradient_checkpointing=True + 
actor_rollout_ref.actor.strategy=${actor_strategy} + actor_rollout_ref.actor.optim.lr=1e-6 + actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} + actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${train_prompt_micro_bsz_per_gpu} + actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} + actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} + actor_rollout_ref.actor.kl_loss_type=${kl_loss_type} + actor_rollout_ref.actor.entropy_coeff=${entropy_coeff} + actor_rollout_ref.actor.fsdp_config.param_offload=${param_offload} + actor_rollout_ref.actor.fsdp_config.optimizer_offload=${optimizer_offload} + actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${log_prob_micro_bsz_per_gpu} + actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} + actor_rollout_ref.rollout.name=vllm + actor_rollout_ref.rollout.gpu_memory_utilization=${gpu_memory_utilization} + actor_rollout_ref.rollout.n=${n_resp_per_prompt} + actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${log_prob_micro_bsz_per_gpu} + actor_rollout_ref.ref.fsdp_config.param_offload=${ref_param_offload} + algorithm.use_kl_in_reward=${use_kl_in_reward} + trainer.critic_warmup=0 + "trainer.logger=[\"console\"]" + trainer.project_name="${project_name}" + trainer.experiment_name="${exp_name}" + trainer.n_gpus_per_node=${GPUS_PER_NODE} + trainer.nnodes=${NNODES} + trainer.default_local_dir="${CKPTS_DIR}" + trainer.save_freq=${save_freq} + trainer.test_freq=${test_freq} + trainer.total_epochs=${total_epochs} +) + +# --- Conditionally add profile-specific overrides ---------------------------- +# Only pass these if the profile set them (avoids sending empty/default values +# that might conflict with veRL's own defaults) + +if [[ -n "${offload_policy}" ]]; then + RAY_CMD+=(actor_rollout_ref.actor.fsdp_config.offload_policy=${offload_policy}) +fi + +if [[ -n "${model_dtype}" ]]; then + RAY_CMD+=(actor_rollout_ref.actor.fsdp_config.model_dtype=${model_dtype}) + 
RAY_CMD+=(actor_rollout_ref.ref.fsdp_config.model_dtype=${model_dtype}) +fi + +if [[ -n "${reshard_after_forward}" ]]; then + RAY_CMD+=(actor_rollout_ref.actor.fsdp_config.reshard_after_forward=${reshard_after_forward}) +fi + +if [[ "${enforce_eager}" == "True" ]]; then + RAY_CMD+=(actor_rollout_ref.rollout.enforce_eager=True) +fi + +if [[ -n "${rollout_dtype}" ]]; then + RAY_CMD+=(actor_rollout_ref.rollout.dtype=${rollout_dtype}) +fi + +if [[ -n "${max_actor_ckpt_to_keep}" ]]; then + RAY_CMD+=(trainer.max_actor_ckpt_to_keep=${max_actor_ckpt_to_keep}) +fi + +if [[ -n "${resume_mode}" ]]; then + RAY_CMD+=(trainer.resume_mode=${resume_mode}) +fi + +# --- Submit ------------------------------------------------------------------ +"${RAY_CMD[@]}" diff --git a/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/setup/env_vars.example b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/setup/env_vars.example index 86ba4cc34..198bfd846 100644 --- a/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/setup/env_vars.example +++ b/3.test_cases/pytorch/verl/hyperpod-eks/rlvr/setup/env_vars.example @@ -9,8 +9,14 @@ export TAG=ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0 export EKS_CLUSTER_NAME="" export INSTANCE_TYPE="" # Example: "p5en.48xlarge" export NUM_NODES=4 # Single source of truth for number of nodes -export NUM_GPU_PER_NODE=8 -export NUM_EFA_PER_NODE=16 +# NUM_GPU_PER_NODE and NUM_EFA_PER_NODE are set by the instance profile. +# Override here only if your setup differs from the standard profile. +# export NUM_GPU_PER_NODE=8 +# export NUM_EFA_PER_NODE=16 + +# Instance profile (optional — auto-detected from INSTANCE_TYPE if not set) +# See recipe/profiles/ for available profiles. 
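Profile names are the instance type with the dots replaced by dashes — the same conversion `_detect.sh` performs — so a forced profile must use the dashed form:

```shell
# "g5.12xlarge" maps to profile file "g5-12xlarge.env" (dots -> dashes),
# mirroring the conversion done in recipe/profiles/_detect.sh.
INSTANCE_TYPE="g5.12xlarge"
INSTANCE_PROFILE="${INSTANCE_TYPE//./-}"
echo "$INSTANCE_PROFILE"   # g5-12xlarge
```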
+# export INSTANCE_PROFILE="g5-12xlarge" # Uncomment to force a specific profile export PRIVATE_SUBNET_ID="subnet-xxxxxxxxxxxxxxxxx" export SECURITY_GROUP_ID="sg-xxxxxxxxxxxxxxxxx" diff --git a/3.test_cases/pytorch/verl/kubernetes/rlvr/README.md b/3.test_cases/pytorch/verl/kubernetes/rlvr/README.md index e4f0233f7..e555dc0bb 100644 --- a/3.test_cases/pytorch/verl/kubernetes/rlvr/README.md +++ b/3.test_cases/pytorch/verl/kubernetes/rlvr/README.md @@ -10,6 +10,39 @@ This repository provides a complete setup for running reinforcement learning fro [Reinforcement Learning from Verifiable Rewards (RLVR)](https://arxiv.org/abs/2506.14245) is a training approach where models learn from tasks with objectively verifiable outcomes, such as math problems or code execution. Unlike human preference-based RL, RLVR uses ground-truth correctness as the reward signal, making it particularly effective for reasoning tasks. +## Tested Configurations + +| Instance | GPUs | Model | Nodes | Key Settings | Status | +|----------|------|-------|-------|-------------|--------| +| p5en.48xlarge | 8 x H200 80 GB | Qwen3-8B | 4 | FSDP1, TP=2, ref_offload only | Tested | +| g5.12xlarge | 4 x A10G 24 GB | gpt-oss-20b (MoE) | 3 workers + 1 head | FSDP2, full offload, TP=4, bf16 | Tested | +| p4de.24xlarge | 8 x A100 80 GB | Qwen3-8B | 4 | FSDP1, TP=2 | Untested | +| g6e.12xlarge | 4 x L40S 48 GB | — | — | — | Untested | + +> **Running on a different instance type?** See the +> [Instance Compatibility Guide](../../../../../docs/instance-compatibility.md) +> for the parameter changes needed when moving between instance families, and +> the [instance profiles](../../../../../docs/instance-profiles/) for +> per-instance hardware details and NCCL/EFA settings. +> +> **g5 users**: Running on g5.12xlarge (A10G 24 GB) requires significant +> parameter changes including FSDP2 with full CPU offloading, TP=4, +> enforce_eager=True, and NCCL_PROTO=simple. 
The key differences are: +> +> | Parameter | p5en (80 GB) | g5 (24 GB) | Why | +> |-----------|-------------|-----------|-----| +> | FSDP strategy | `fsdp` (FSDP1) | `fsdp2` | FSDP1 disables CPUOffload for actor | +> | `offload_policy` | not set | `True` | Enables FSDP2 CPU offloading | +> | `model_dtype` | default (fp32) | `bf16` | Explicit bf16 halves memory | +> | `enforce_eager` | `False` | `True` | CUDA graphs OOM on 24 GB | +> | `tensor_parallel_size` | 2 | 4 | Shard across all 4 GPUs | +> | `param_offload` | `False` | `True` | Offload params to CPU | +> | `optimizer_offload` | `False` | `True` | Offload optimizer to CPU | +> | `NCCL_PROTO` | default | `simple` | No GPUDirect RDMA on g5 | +> | `save_freq` | 1 | 20+ | 117 GB/ckpt for 20B; fills disk fast | +> | `WORKER_MEMORY` | 200 Gi+ | 150 Gi | g5.12xl allocatable ~168 Gi | +> | `nnodes` | node count | worker count only | Head pod without GPUs causes NCCL hang | + ## Getting started ### Prerequisites diff --git a/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/README.md b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/README.md new file mode 100644 index 000000000..69f8040b2 --- /dev/null +++ b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/README.md @@ -0,0 +1,102 @@ +# Instance Profiles for veRL RLVR Recipes + +This directory contains instance-specific configuration profiles that +override the hardware-dependent parameters in the recipe scripts. + +## How It Works + +1. The recipe script (e.g., `run_grpo_configurable.sh`) calls `_detect.sh` + to determine which profile to load +2. `_detect.sh` resolves the instance type (from env var, EC2 metadata, or + GPU detection) and returns the path to the matching `.env` file +3. The recipe sources the `.env` file, overriding default values with + instance-specific settings + +## Detection Order + +The profile is selected by the first method that succeeds: + +1. 
**`INSTANCE_PROFILE` env var** — explicit override (e.g., `g5-12xlarge`) +2. **`INSTANCE_TYPE` env var** — from `setup/env_vars` (e.g., `g5.12xlarge`) +3. **EC2 instance metadata API** — works on bare metal and K8s with host networking +4. **GPU name from `nvidia-smi`** — fallback when metadata is unavailable + +## Available Profiles + +| Profile | Instance | GPU | VRAM | Tested Model | Status | +|---------|----------|-----|------|-------------|--------| +| [p5en-48xlarge.env](p5en-48xlarge.env) | p5en.48xlarge | 8x H200 | 80 GB | Qwen3-8B (dense) | Tested | +| [g5-12xlarge.env](g5-12xlarge.env) | g5.12xlarge | 4x A10G | 24 GB | gpt-oss-20b (MoE) | Tested | +| [p4de-24xlarge.env](p4de-24xlarge.env) | p4de.24xlarge | 8x A100 | 80 GB | — | Untested | + +## About Model Assumptions + +Each profile is tested with a specific model and documents what to change +for other model sizes. The instance-dependent settings (NCCL, EFA, GPU +count) stay the same regardless of model. What changes per model: + +| Setting | Driven by... | +|---------|-------------| +| `TENSOR_PARALLEL_SIZE` | Model size relative to per-GPU VRAM | +| `PARAM_OFFLOAD`, `OPTIMIZER_OFFLOAD` | Whether model + optimizer fit in GPU VRAM | +| `GPU_MEMORY_UTILIZATION` | Model shard size relative to per-GPU VRAM | +| `LOG_PROB_MICRO_BSZ_PER_GPU` | Activation memory during log-prob computation | +| `MAX_RESPONSE_LENGTH` | KV cache size (longer = more VRAM) | +| `SAVE_FREQ`, `MAX_ACTOR_CKPT_TO_KEEP` | Checkpoint size (scales with total params) | + +Settings that do NOT change per model (only per instance): + +| Setting | Driven by... 
| +|---------|-------------| +| `NCCL_PROTO`, `FI_EFA_USE_DEVICE_RDMA` | Whether instance has GPUDirect RDMA | +| `NUM_GPU_PER_NODE`, `NUM_EFA_PER_NODE` | Instance hardware | +| `WORKER_MEMORY`, `WORKER_CPU` | Instance allocatable resources | +| `ENFORCE_EAGER` | GPU VRAM headroom (always True on 24GB) | + +See the `MODEL ASSUMPTIONS` comment block at the top of each `.env` file +for guidance on adjusting settings for different models. + +## What's in a Profile + +Profiles contain **only instance-dependent parameters** — settings that +change based on hardware. Algorithm-specific settings (KL loss, reward +function, dataset, learning rate, etc.) stay in the recipe script. + +Instance-dependent parameters include: + +| Category | Parameters | +|----------|-----------| +| Cluster geometry | `NUM_GPU_PER_NODE`, `NUM_EFA_PER_NODE` | +| FSDP strategy | `ACTOR_STRATEGY`, `PARAM_OFFLOAD`, `OPTIMIZER_OFFLOAD`, `OFFLOAD_POLICY`, `MODEL_DTYPE`, `RESHARD_AFTER_FORWARD` | +| vLLM rollout | `TENSOR_PARALLEL_SIZE`, `GPU_MEMORY_UTILIZATION`, `ENFORCE_EAGER`, `ROLLOUT_DTYPE` | +| NCCL / EFA | `NCCL_PROTO`, `FI_EFA_USE_DEVICE_RDMA` | +| Training | `MAX_RESPONSE_LENGTH`, `LOG_PROB_MICRO_BSZ_PER_GPU` | +| Checkpoints | `SAVE_FREQ`, `MAX_ACTOR_CKPT_TO_KEEP`, `TEST_FREQ` | +| K8s resources | `WORKER_MEMORY`, `WORKER_CPU` | + +## Creating a New Profile + +1. Copy the closest existing profile: + ```bash + cp p5en-48xlarge.env g6e-12xlarge.env + ``` + +2. Adjust the parameters based on the target instance's hardware. + See [docs/instance-profiles/](../../../../../../docs/instance-profiles/) + for hardware specs. + +3. Key questions when creating a profile: + - **GPU VRAM < 48 GB?** → Likely need FSDP2 + offloading + - **No GPUDirect RDMA?** → Set `NCCL_PROTO=simple`, `FI_EFA_USE_DEVICE_RDMA=0` + - **4 GPUs per node?** → Set `TENSOR_PARALLEL_SIZE=4` + - **Less than 200 Gi allocatable RAM?** → Reduce `WORKER_MEMORY` + +4. 
Test with a short smoke run (one epoch) before committing:
+   ```bash
+   export INSTANCE_PROFILE=g6e-12xlarge
+   export TOTAL_EPOCHS=1
+   ./recipe/run_grpo_configurable.sh
+   ```
+
+5. Update the Tested Configurations table in the README and the
+   [central compatibility matrix](../../../../../../docs/instance-compatibility.md).
diff --git a/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/_detect.sh b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/_detect.sh
new file mode 100755
index 000000000..896664395
--- /dev/null
+++ b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/_detect.sh
+#!/usr/bin/env bash
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+#
+# _detect.sh — Auto-detect EC2 instance type and resolve a profile.
+#
+# CANONICAL SOURCE: This is the single source of truth for instance detection.
+# Copies exist in each test case's profiles/ directory as _detect.sh. To
+# update all copies, edit this file and run: ./sync_profiles.sh
+#
+# Usage (from a training script):
+# PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles")
+# source "$PROFILE_ENV"
+#
+# Detection order:
+# 1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge")
+# 2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge")
+# 3. EC2 instance metadata API (works on bare metal and K8s with host networking)
+# 4. GPU name from nvidia-smi (fallback when metadata is unavailable)
+#
+# Outputs the path to the profile .env file on stdout.
+# Exits non-zero if no profile can be resolved.
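That stdout/stderr split is the calling contract: diagnostics go to stderr, so command substitution captures only the path. A self-contained sketch of the contract — the `resolve_profile` function and the temp directory are stand-ins for this script and a real `profiles/` directory:

```shell
# Stand-in for the _detect.sh contract: log to stderr, print the resolved
# profile path on stdout, return non-zero when no matching .env exists.
PROFILES_DIR=$(mktemp -d)
printf 'export NUM_GPU_PER_NODE=4\n' > "${PROFILES_DIR}/g5-12xlarge.env"

resolve_profile() {
  local dir=$1 name=$2
  echo "Resolving profile: ${name}" >&2       # diagnostics -> stderr
  [ -f "${dir}/${name}.env" ] || return 1     # no profile -> non-zero exit
  echo "${dir}/${name}.env"                   # result -> stdout only
}

PROFILE_ENV=$(resolve_profile "$PROFILES_DIR" "g5-12xlarge")
source "$PROFILE_ENV"
echo "GPUs per node: ${NUM_GPU_PER_NODE}"
```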
+# --------------------------------------------------------------------------- +set -euo pipefail + +PROFILES_DIR="${1:-.}" + +# --- Step 1: Check for explicit INSTANCE_PROFILE override ------------------- +if [[ -n "${INSTANCE_PROFILE:-}" ]]; then + PROFILE_NAME="$INSTANCE_PROFILE" + echo "Instance profile override: ${PROFILE_NAME}" >&2 +else + # --- Step 2: Try INSTANCE_TYPE from env_vars ---------------------------- + INSTANCE_TYPE="${INSTANCE_TYPE:-}" + + # --- Step 3: Try EC2 instance metadata API ------------------------------ + if [[ -z "$INSTANCE_TYPE" ]]; then + # IMDSv2: get a token first, then query + TOKEN=$(curl -s --connect-timeout 2 -X PUT \ + "http://169.254.169.254/latest/api/token" \ + -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true + if [[ -n "$TOKEN" ]]; then + INSTANCE_TYPE=$(curl -s --connect-timeout 2 \ + -H "X-aws-ec2-metadata-token: $TOKEN" \ + "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true + fi + fi + + # --- Step 4: Fallback — detect from GPU name ---------------------------- + if [[ -z "$INSTANCE_TYPE" ]]; then + if command -v nvidia-smi &>/dev/null; then + GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) || true + case "${GPU_NAME:-}" in + *A10G*) INSTANCE_TYPE="g5.12xlarge" ;; + *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;; + *A100*40*) INSTANCE_TYPE="p4d.24xlarge" ;; + *H100*) INSTANCE_TYPE="p5.48xlarge" ;; + *H200*) INSTANCE_TYPE="p5en.48xlarge" ;; + *L40S*) INSTANCE_TYPE="g6e.12xlarge" ;; + *L4*) INSTANCE_TYPE="g6.12xlarge" ;; + *) + echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + ;; + esac + echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2 + else + echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." 
>&2 + exit 1 + fi + fi + + # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge" + PROFILE_NAME="${INSTANCE_TYPE//./-}" + echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2 +fi + +# --- Resolve profile file path ---------------------------------------------- +PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env" + +if [[ ! -f "$PROFILE_PATH" ]]; then + echo "ERROR: No profile found at ${PROFILE_PATH}" >&2 + echo "" >&2 + echo "Available profiles:" >&2 + ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\// /' >&2 || echo " (none)" >&2 + echo "" >&2 + echo "To create a new profile, copy an existing one and adjust the values:" >&2 + echo " cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2 + exit 1 +fi + +# Output the resolved path (recipe script will source it) +echo "$PROFILE_PATH" diff --git a/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/g5-12xlarge.env b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/g5-12xlarge.env new file mode 100644 index 000000000..6cb214e0c --- /dev/null +++ b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/g5-12xlarge.env @@ -0,0 +1,70 @@ +# g5.12xlarge — 4x A10G 24GB, 1 EFA, no NVLink, no GPUDirect RDMA +# Requires aggressive memory optimization for models >10B parameters. +# +# Validated with: openai/gpt-oss-20b (MoE, ~3B active params) GRPO on +# 3 worker + 1 head node, Multilingual-Thinking dataset +# +# MODEL ASSUMPTIONS (gpt-oss-20b MoE, ~40GB bf16 total, ~3B active): +# This profile assumes full offloading. The settings are conservative +# and should work for most models up to ~20B on g5. For other models: +# - <7B dense (e.g., Qwen3-8B): may not need full offloading. Try +# PARAM_OFFLOAD=False, OPTIMIZER_OFFLOAD=False first. Keep bf16. +# - 7B-20B dense: full offloading likely needed. These settings should +# work, but reduce TRAIN_BATCH_SIZE if OOM during backward. +# - >20B dense: will NOT fit on g5 even with offloading. Use p4de/p5. 
+# - MoE models: active params determine GPU memory. A 20B MoE with +# ~3B active params fits; a 40B MoE with 10B active may not. +# - Checkpoint sizes scale with total params: 20B = ~117GB/ckpt. +# Adjust SAVE_FREQ and MAX_ACTOR_CKPT_TO_KEEP for your model. +# +# Key lessons from 11 OOM iterations: +# - FSDP2 required (FSDP1 disables CPUOffload for actor role) +# - offload_policy=True required (FSDP2-specific CPU offloading flag) +# - model_dtype=bf16 required (veRL defaults actor to fp32) +# - enforce_eager=True required (CUDA graphs OOM on 24GB) +# - TP=4 required (shard across all GPUs per node) +# - save_freq=20+ required (117GB/ckpt for 20B model fills disk fast) +# - WORKER_MEMORY=150Gi (g5.12xl allocatable is ~168Gi, NOT 200Gi) +# - nnodes must exclude non-GPU head pod (Ray head has no GPUs in K8s) +# --------------------------------------------------------------------------- + +# --- Cluster geometry ------------------------------------------------------- +export NUM_GPU_PER_NODE=4 +export NUM_EFA_PER_NODE=1 + +# --- FSDP strategy ---------------------------------------------------------- +# FSDP2 with full CPU offloading — required for 24GB GPUs with >10B models +export ACTOR_STRATEGY=fsdp2 +export PARAM_OFFLOAD=True +export OPTIMIZER_OFFLOAD=True +export OFFLOAD_POLICY=True # FSDP2-specific: enables proper offload +export MODEL_DTYPE=bf16 # veRL defaults actor to fp32 — force bf16 +export RESHARD_AFTER_FORWARD=True # Free GPU memory after each forward pass + +# Ref model also offloaded +export REF_PARAM_OFFLOAD=True + +# --- vLLM rollout ----------------------------------------------------------- +export TENSOR_PARALLEL_SIZE=4 # Shard across all 4 GPUs (no NVLink) +export GPU_MEMORY_UTILIZATION=0.6 # 0.6 x 23GB = 13.8GB for vLLM +export ENFORCE_EAGER=True # CUDA graphs OOM on 24GB +export ROLLOUT_DTYPE=bfloat16 # Explicit dtype for vLLM + +# --- NCCL / EFA networking -------------------------------------------------- +export NCCL_PROTO=simple # Required: 
no GPUDirect RDMA on g5 +export FI_EFA_USE_DEVICE_RDMA=0 + +# --- Training defaults ------------------------------------------------------ +export MAX_RESPONSE_LENGTH=256 # Shorter to reduce KV cache pressure +export LOG_PROB_MICRO_BSZ_PER_GPU=2 # Keep low to avoid OOM during log-prob + +# --- Checkpoint management -------------------------------------------------- +# Full FSDP state for 20B model across 12 GPUs = ~117GB per checkpoint. +# save_freq=1 on 1.2TB FSx fills disk in 9 steps. +export SAVE_FREQ=20 +export MAX_ACTOR_CKPT_TO_KEEP=3 # Keep max 3 checkpoints (~351GB) +export TEST_FREQ=20 + +# --- K8s resource requests -------------------------------------------------- +export WORKER_MEMORY=150Gi # g5.12xl allocatable ~168Gi, NOT 200Gi +export WORKER_CPU=40 diff --git a/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/p4de-24xlarge.env b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/p4de-24xlarge.env new file mode 100644 index 000000000..56a102a97 --- /dev/null +++ b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/p4de-24xlarge.env @@ -0,0 +1,52 @@ +# p4de.24xlarge — 8x A100 80GB, 4 EFA, NVSwitch, GPUDirect RDMA +# Same VRAM as p5/p5en but fewer EFA adapters (4 vs 32). +# +# Status: Untested — expected to work with these settings. +# Based on p5en profile with EFA count adjusted. +# +# MODEL ASSUMPTIONS (same as p5en — 80GB VRAM): +# Same model fitting characteristics as p5en. The only difference is +# inter-node bandwidth (4 EFA vs 32). For large multi-node runs with +# big models, gradient sync may be slower. 
Consider: +# - Increasing gradient_accumulation_steps to reduce sync frequency +# - Using gradient compression if available +# --------------------------------------------------------------------------- + +# --- Cluster geometry ------------------------------------------------------- +export NUM_GPU_PER_NODE=8 +export NUM_EFA_PER_NODE=4 + +# --- FSDP strategy ---------------------------------------------------------- +# Same as p5en — 80GB VRAM, no offloading needed +export ACTOR_STRATEGY=fsdp +export PARAM_OFFLOAD=False +export OPTIMIZER_OFFLOAD=False +export OFFLOAD_POLICY= # Not set — FSDP1 doesn't use this +export MODEL_DTYPE= # Not set — use veRL default +export RESHARD_AFTER_FORWARD= # Not set — use veRL default + +export REF_PARAM_OFFLOAD=True + +# --- vLLM rollout ----------------------------------------------------------- +export TENSOR_PARALLEL_SIZE=2 # NVSwitch available +export GPU_MEMORY_UTILIZATION=0.6 +export ENFORCE_EAGER=False # CUDA graphs are fine on 80GB +export ROLLOUT_DTYPE= # Not set — use veRL default + +# --- NCCL / EFA networking -------------------------------------------------- +export NCCL_PROTO= # Default — RDMA available +export FI_EFA_USE_DEVICE_RDMA=1 + +# --- Training defaults ------------------------------------------------------ +export MAX_RESPONSE_LENGTH=1024 +export LOG_PROB_MICRO_BSZ_PER_GPU=32 + +# --- Checkpoint management -------------------------------------------------- +export SAVE_FREQ=1 +export MAX_ACTOR_CKPT_TO_KEEP= # Not set — use veRL default +export TEST_FREQ=2 + +# --- K8s resource requests -------------------------------------------------- +# p4de has ~1100Gi allocatable — more than p5en +export WORKER_MEMORY=900Gi +export WORKER_CPU=90 diff --git a/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/p5en-48xlarge.env b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/p5en-48xlarge.env new file mode 100644 index 000000000..a95c1c546 --- /dev/null +++ 
b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/profiles/p5en-48xlarge.env @@ -0,0 +1,55 @@ +# p5en.48xlarge — 8x H200 80GB, 32 EFA, NVSwitch, GPUDirect RDMA +# This is the default/baseline profile. Most parameters stay at veRL defaults. +# +# Validated with: Qwen3-8B GRPO on 4 nodes (32 GPUs), GSM8K dataset +# +# MODEL ASSUMPTIONS (Qwen3-8B, ~8B dense params): +# This profile assumes a model that fits comfortably in 80GB without +# offloading. For larger models, adjust: +# - 30B-70B dense: increase TP to 4 or 8, may need PARAM_OFFLOAD=True +# - 70B+ dense: likely need FSDP2 + offloading even on 80GB +# - MoE models: active params matter more than total; 20B MoE (like +# gpt-oss-20b with ~3B active) fits at TP=2 on 80GB +# - Batch sizes: larger models need smaller TRAIN_BATCH_SIZE and +# LOG_PROB_MICRO_BSZ_PER_GPU to avoid OOM during backward pass +# --------------------------------------------------------------------------- + +# --- Cluster geometry ------------------------------------------------------- +export NUM_GPU_PER_NODE=8 +export NUM_EFA_PER_NODE=32 + +# --- FSDP strategy ---------------------------------------------------------- +# FSDP1 is fine on 80GB GPUs — no need for FSDP2 or full offloading +export ACTOR_STRATEGY=fsdp +export PARAM_OFFLOAD=False +export OPTIMIZER_OFFLOAD=False +export OFFLOAD_POLICY= # Not set — FSDP1 doesn't use this +export MODEL_DTYPE= # Not set — use veRL default +export RESHARD_AFTER_FORWARD= # Not set — use veRL default + +# Ref model offload to CPU is a good default even on 80GB +export REF_PARAM_OFFLOAD=True + +# --- vLLM rollout ----------------------------------------------------------- +export TENSOR_PARALLEL_SIZE=2 # NVSwitch makes TP=2 efficient +export GPU_MEMORY_UTILIZATION=0.6 +export ENFORCE_EAGER=False # CUDA graphs are fine on 80GB +export ROLLOUT_DTYPE= # Not set — use veRL default + +# --- NCCL / EFA networking -------------------------------------------------- +export NCCL_PROTO= # Default (LL/LL128) — RDMA 
available +export FI_EFA_USE_DEVICE_RDMA=1 + +# --- Training defaults ------------------------------------------------------ +export MAX_RESPONSE_LENGTH=1024 +export LOG_PROB_MICRO_BSZ_PER_GPU=32 + +# --- Checkpoint management -------------------------------------------------- +export SAVE_FREQ=1 +export MAX_ACTOR_CKPT_TO_KEEP= # Not set — use veRL default +export TEST_FREQ=2 + +# --- K8s resource requests -------------------------------------------------- +# These are informational — used by raycluster.yaml via envsubst +export WORKER_MEMORY=600Gi +export WORKER_CPU=180 diff --git a/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/run_grpo_configurable.sh b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/run_grpo_configurable.sh index e3992e176..95e03328c 100755 --- a/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/run_grpo_configurable.sh +++ b/3.test_cases/pytorch/verl/kubernetes/rlvr/recipe/run_grpo_configurable.sh @@ -1,11 +1,34 @@ #!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 set -xeuo pipefail -# Project configuration +# --------------------------------------------------------------------------- +# veRL GRPO training — instance-aware configurable recipe (Kubernetes/EKS) +# +# This script auto-detects the instance type and loads the appropriate +# hardware profile before submitting a GRPO training job to Ray. +# +# Profile loading order: +# 1. env_vars (cluster config, model paths, tokens) +# 2. Instance profile (FSDP strategy, offloading, TP, NCCL settings) +# 3. Explicit env var overrides (anything set after sourcing profile wins) +# +# See recipe/profiles/README.md for available profiles and how to create new ones. 
+# --------------------------------------------------------------------------- + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# --- Load instance profile -------------------------------------------------- +PROFILE_ENV=$("${SCRIPT_DIR}/profiles/_detect.sh" "${SCRIPT_DIR}/profiles") +echo "Loading instance profile: ${PROFILE_ENV}" +source "$PROFILE_ENV" + +# --- Project configuration -------------------------------------------------- project_name='GRPO' exp_name="GRPO-${MODEL_NAME}" -# GRPO Algorithm parameters +# --- GRPO Algorithm parameters (task-specific, not instance-dependent) ------ adv_estimator=grpo use_kl_in_reward=False use_kl_loss=True @@ -13,110 +36,169 @@ kl_loss_coef=0.001 kl_loss_type=low_var_kl entropy_coeff=0 -# Token length configuration +# --- Token length configuration --------------------------------------------- max_prompt_length=512 -max_response_length=1024 +max_response_length=${MAX_RESPONSE_LENGTH:-1024} filter_overlong_prompts=True truncation='error' -# Training configuration -train_prompt_bsz=${TRAIN_BATCH_SIZE:-32} # Reduced from 256 for faster testing +# --- Training configuration ------------------------------------------------- +train_prompt_bsz=${TRAIN_BATCH_SIZE:-32} gen_prompt_bsz=${GEN_BATCH_SIZE:-$train_prompt_bsz} -n_resp_per_prompt=${N_RESP_PER_PROMPT:-2} # Reduced from 5 for faster testing +n_resp_per_prompt=${N_RESP_PER_PROMPT:-2} train_prompt_mini_bsz=16 # Must be <= train_prompt_bsz train_prompt_micro_bsz_per_gpu=1 -# Ray configuration from env_vars +# --- Ray configuration from env_vars ---------------------------------------- RAY_ADDRESS=${RAY_ADDRESS:-"http://localhost:8265"} WORKING_DIR=${WORKING_DIR:-"${PWD}"} -# RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"} -# Cluster configuration from env_vars +# --- Cluster configuration (from profile, overridable by env_vars) ---------- NNODES=${NUM_NODES:-4} GPUS_PER_NODE=${NUM_GPU_PER_NODE:-8} -# Model and data paths from env_vars 
+# --- Model and data paths from env_vars ------------------------------------- MODEL_NAME=${MODEL_NAME:-"Qwen3-8B"} MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-8B"} RAY_DATA_HOME=${RAY_DATA_HOME:-"/fsx/verl"} CKPTS_DIR="${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}" -# Data files - using GSM8K dataset +# --- Data files -------------------------------------------------------------- TRAIN_FILE="${RAY_DATA_HOME}/data/gsm8k/train.parquet" TEST_FILE="${RAY_DATA_HOME}/data/gsm8k/test.parquet" -# S3 checkpoint configuration (for managed tiered checkpointing) +# --- S3 checkpoint configuration (for managed tiered checkpointing) ---------- S3_CHECKPOINT_BASE=${S3_CHECKPOINT_BASE:-"s3://sagemaker-mvincig-rlvr-e66849d3-bucket/checkpoints"} CHECKPOINT_NAMESPACE="${exp_name}-$(date +%s)" +CHECKPOINT_ASYNC_SAVE=True +CHECKPOINT_SAVE_TO_S3_FREQ=5 -# Checkpoint configuration -CHECKPOINT_ASYNC_SAVE=True # Enable async checkpointing -CHECKPOINT_SAVE_TO_S3_FREQ=5 # Save to S3 every N steps (in addition to in-memory) +# --- Performance parameters (from profile, overridable) --------------------- +gen_tp=${TENSOR_PARALLEL_SIZE:-2} +log_prob_micro_bsz_per_gpu=${LOG_PROB_MICRO_BSZ_PER_GPU:-32} +gpu_memory_utilization=${GPU_MEMORY_UTILIZATION:-0.6} +enforce_eager=${ENFORCE_EAGER:-False} -# Performance parameters -gen_tp=2 -log_prob_micro_bsz_per_gpu=32 -gpu_memory_utilization=0.6 +# --- FSDP / memory optimization (from profile) ------------------------------ +actor_strategy=${ACTOR_STRATEGY:-fsdp} +model_dtype=${MODEL_DTYPE:-} +param_offload=${PARAM_OFFLOAD:-False} +optimizer_offload=${OPTIMIZER_OFFLOAD:-False} +offload_policy=${OFFLOAD_POLICY:-} +reshard_after_forward=${RESHARD_AFTER_FORWARD:-} +ref_param_offload=${REF_PARAM_OFFLOAD:-True} +rollout_dtype=${ROLLOUT_DTYPE:-} -# Memory optimization -param_offload=False -optimizer_offload=False -ref_param_offload=True +# --- Checkpoint management (from profile) ------------------------------------ +save_freq=${SAVE_FREQ:-1} 
+test_freq=${TEST_FREQ:-2} +max_actor_ckpt_to_keep=${MAX_ACTOR_CKPT_TO_KEEP:-} +total_epochs=${TOTAL_EPOCHS:-2} +resume_mode=${RESUME_MODE:-} -# Print configuration for verification +# --- Print configuration for verification ----------------------------------- echo "=== GRPO Training Configuration ===" -echo "Project: ${project_name}" -echo "Experiment: ${exp_name}" -echo "Model: ${MODEL_NAME} (${MODEL_PATH})" -echo "Nodes: ${NNODES}" -echo "GPUs per node: ${GPUS_PER_NODE}" -echo "Total GPUs: $((NNODES * GPUS_PER_NODE))" -echo "Data home: ${RAY_DATA_HOME}" -echo "Checkpoints: ${CKPTS_DIR}" +echo "Project : ${project_name}" +echo "Experiment : ${exp_name}" +echo "Model : ${MODEL_NAME} (${MODEL_PATH})" +echo "Profile : ${PROFILE_ENV}" +echo "Nodes : ${NNODES}" +echo "GPUs/node : ${GPUS_PER_NODE}" +echo "Total GPUs : $((NNODES * GPUS_PER_NODE))" +echo "Strategy : ${actor_strategy}" +echo "Model dtype : ${model_dtype:-default}" +echo "TP : ${gen_tp}" +echo "gpu_mem_util : ${gpu_memory_utilization}" +echo "enforce_eager : ${enforce_eager}" +echo "param_offload : ${param_offload}" +echo "optim_offload : ${optimizer_offload}" +echo "offload_policy: ${offload_policy:-not set}" +echo "ref_offload : ${ref_param_offload}" +echo "NCCL_PROTO : ${NCCL_PROTO:-default}" +echo "EFA RDMA : ${FI_EFA_USE_DEVICE_RDMA:-not set}" +echo "save_freq : ${save_freq}" +echo "Data home : ${RAY_DATA_HOME}" +echo "Checkpoints : ${CKPTS_DIR}" echo "S3 Checkpoints: ${S3_CHECKPOINT_BASE}" -echo "Ray address: ${RAY_ADDRESS}" +echo "Ray address : ${RAY_ADDRESS}" echo "==================================" -# Submit Ray job -ray job submit --no-wait \ - --working-dir "${WORKING_DIR}" \ - -- python3 -m verl.trainer.main_ppo \ - algorithm.adv_estimator=${adv_estimator} \ - data.train_files="${TRAIN_FILE}" \ - data.val_files="${TEST_FILE}" \ - data.prompt_key=question \ - data.train_batch_size=${train_prompt_bsz} \ - data.max_prompt_length=${max_prompt_length} \ - 
data.max_response_length=${max_response_length} \ - data.filter_overlong_prompts=${filter_overlong_prompts} \ - data.truncation=${truncation} \ - actor_rollout_ref.model.path="${MODEL_PATH}" \ - actor_rollout_ref.model.use_remove_padding=True \ - actor_rollout_ref.model.enable_gradient_checkpointing=True \ - actor_rollout_ref.actor.optim.lr=1e-6 \ - actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \ - actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${train_prompt_micro_bsz_per_gpu} \ - actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \ - actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \ - actor_rollout_ref.actor.kl_loss_type=${kl_loss_type} \ - actor_rollout_ref.actor.entropy_coeff=${entropy_coeff} \ - actor_rollout_ref.actor.fsdp_config.param_offload=${param_offload} \ - actor_rollout_ref.actor.fsdp_config.optimizer_offload=${optimizer_offload} \ - actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${log_prob_micro_bsz_per_gpu} \ - actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \ - actor_rollout_ref.rollout.name=vllm \ - actor_rollout_ref.rollout.gpu_memory_utilization=${gpu_memory_utilization} \ - actor_rollout_ref.rollout.n=${n_resp_per_prompt} \ - actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${log_prob_micro_bsz_per_gpu} \ - actor_rollout_ref.ref.fsdp_config.param_offload=${ref_param_offload} \ - algorithm.use_kl_in_reward=${use_kl_in_reward} \ - trainer.critic_warmup=0 \ - trainer.logger='["console"]' \ - trainer.project_name="${project_name}" \ - trainer.experiment_name="${exp_name}" \ - trainer.n_gpus_per_node=${GPUS_PER_NODE} \ - trainer.nnodes=${NNODES} \ - trainer.default_local_dir="${CKPTS_DIR}" \ - trainer.save_freq=1 \ - trainer.test_freq=2 \ - trainer.total_epochs=2 +# --- Build ray job submit command dynamically -------------------------------- +RAY_CMD=( + ray job submit --no-wait + --working-dir "${WORKING_DIR}" + -- python3 -m verl.trainer.main_ppo + 
algorithm.adv_estimator=${adv_estimator} + data.train_files="${TRAIN_FILE}" + data.val_files="${TEST_FILE}" + data.prompt_key=question + data.train_batch_size=${train_prompt_bsz} + data.max_prompt_length=${max_prompt_length} + data.max_response_length=${max_response_length} + data.filter_overlong_prompts=${filter_overlong_prompts} + data.truncation=${truncation} + actor_rollout_ref.model.path="${MODEL_PATH}" + actor_rollout_ref.model.use_remove_padding=True + actor_rollout_ref.model.enable_gradient_checkpointing=True + actor_rollout_ref.actor.strategy=${actor_strategy} + actor_rollout_ref.actor.optim.lr=1e-6 + actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} + actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${train_prompt_micro_bsz_per_gpu} + actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} + actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} + actor_rollout_ref.actor.kl_loss_type=${kl_loss_type} + actor_rollout_ref.actor.entropy_coeff=${entropy_coeff} + actor_rollout_ref.actor.fsdp_config.param_offload=${param_offload} + actor_rollout_ref.actor.fsdp_config.optimizer_offload=${optimizer_offload} + actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${log_prob_micro_bsz_per_gpu} + actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} + actor_rollout_ref.rollout.name=vllm + actor_rollout_ref.rollout.gpu_memory_utilization=${gpu_memory_utilization} + actor_rollout_ref.rollout.n=${n_resp_per_prompt} + actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${log_prob_micro_bsz_per_gpu} + actor_rollout_ref.ref.fsdp_config.param_offload=${ref_param_offload} + algorithm.use_kl_in_reward=${use_kl_in_reward} + trainer.critic_warmup=0 + "trainer.logger=[\"console\"]" + trainer.project_name="${project_name}" + trainer.experiment_name="${exp_name}" + trainer.n_gpus_per_node=${GPUS_PER_NODE} + trainer.nnodes=${NNODES} + trainer.default_local_dir="${CKPTS_DIR}" + trainer.save_freq=${save_freq} + trainer.test_freq=${test_freq} + 
trainer.total_epochs=${total_epochs} +) + +# --- Conditionally add profile-specific overrides ---------------------------- +if [[ -n "${offload_policy}" ]]; then + RAY_CMD+=(actor_rollout_ref.actor.fsdp_config.offload_policy=${offload_policy}) +fi + +if [[ -n "${model_dtype}" ]]; then + RAY_CMD+=(actor_rollout_ref.actor.fsdp_config.model_dtype=${model_dtype}) + RAY_CMD+=(actor_rollout_ref.ref.fsdp_config.model_dtype=${model_dtype}) +fi + +if [[ -n "${reshard_after_forward}" ]]; then + RAY_CMD+=(actor_rollout_ref.actor.fsdp_config.reshard_after_forward=${reshard_after_forward}) +fi + +if [[ "${enforce_eager}" == "True" ]]; then + RAY_CMD+=(actor_rollout_ref.rollout.enforce_eager=True) +fi + +if [[ -n "${rollout_dtype}" ]]; then + RAY_CMD+=(actor_rollout_ref.rollout.dtype=${rollout_dtype}) +fi + +if [[ -n "${max_actor_ckpt_to_keep}" ]]; then + RAY_CMD+=(trainer.max_actor_ckpt_to_keep=${max_actor_ckpt_to_keep}) +fi + +if [[ -n "${resume_mode}" ]]; then + RAY_CMD+=(trainer.resume_mode=${resume_mode}) +fi + +# --- Submit ------------------------------------------------------------------ +"${RAY_CMD[@]}" diff --git a/3.test_cases/pytorch/verl/kubernetes/rlvr/setup/env_vars.example b/3.test_cases/pytorch/verl/kubernetes/rlvr/setup/env_vars.example index fb60fe4cd..bcff7ee4f 100644 --- a/3.test_cases/pytorch/verl/kubernetes/rlvr/setup/env_vars.example +++ b/3.test_cases/pytorch/verl/kubernetes/rlvr/setup/env_vars.example @@ -9,8 +9,14 @@ export TAG=vllm011.latest export EKS_CLUSTER_NAME="" export INSTANCE_TYPE="" # Example: "p5en.48xlarge" export NUM_NODES=4 # Single source of truth for number of nodes -export NUM_GPU_PER_NODE=8 -export NUM_EFA_PER_NODE=16 +# NUM_GPU_PER_NODE and NUM_EFA_PER_NODE are set by the instance profile. +# Override here only if your setup differs from the standard profile. 
+# export NUM_GPU_PER_NODE=8 +# export NUM_EFA_PER_NODE=16 + +# Instance profile (optional — auto-detected from INSTANCE_TYPE if not set) +# See recipe/profiles/ for available profiles. +# export INSTANCE_PROFILE="g5-12xlarge" # Uncomment to force a specific profile # Ray configs diff --git a/3.test_cases/shared/README.md b/3.test_cases/shared/README.md new file mode 100644 index 000000000..f7989edb6 --- /dev/null +++ b/3.test_cases/shared/README.md @@ -0,0 +1,41 @@ +# Shared Utilities + +Shared scripts and utilities used across multiple test cases. + +## Instance Detection (`instance_detect.sh`) + +Auto-detects the EC2 instance type and resolves a matching instance profile +(`.env` file) for the current test case. Each test case has its own +`profiles/` directory with instance-specific `.env` files, and a copy of the +detection script at `profiles/_detect.sh`. + +### Canonical Source + +**`instance_detect.sh`** in this directory is the single source of truth. +Copies exist in each test case's `profiles/_detect.sh`. To update all copies +after editing the canonical source: + +```bash +bash 3.test_cases/shared/sync_profiles.sh +``` + +### Detection Order + +1. `INSTANCE_PROFILE` env var — explicit override (e.g., `g5-12xlarge`) +2. `INSTANCE_TYPE` env var — from env_vars (e.g., `g5.12xlarge`) +3. EC2 Instance Metadata API (IMDSv2) — works on bare metal and K8s +4. GPU name from `nvidia-smi` — fallback mapping (A10G -> g5, H100 -> p5, etc.) 
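Steps 1 and 2 of the detection order reduce to a small precedence rule; a minimal sketch (the function name is illustrative — the real logic, including the IMDS and `nvidia-smi` fallbacks of steps 3-4, lives in `instance_detect.sh`):

```bash
# Sketch of the profile-name precedence (steps 1-2 only).
resolve_profile_name() {
  if [[ -n "${INSTANCE_PROFILE:-}" ]]; then
    echo "$INSTANCE_PROFILE"          # step 1: explicit override always wins
  elif [[ -n "${INSTANCE_TYPE:-}" ]]; then
    echo "${INSTANCE_TYPE//./-}"      # step 2: "g5.12xlarge" -> "g5-12xlarge"
  else
    return 1                          # steps 3-4 (IMDS / nvidia-smi) omitted
  fi
}

INSTANCE_TYPE="g5.12xlarge"
resolve_profile_name                  # prints: g5-12xlarge
```

The dot-to-dash conversion is what maps an instance type onto a profile filename such as `profiles/g5-12xlarge.env`.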
+ +### Test Cases Using Profiles + +| Test Case | Profiles Directory | +|-----------|-------------------| +| [FSDP](../pytorch/FSDP/) | `pytorch/FSDP/profiles/` | +| [veRL (HyperPod EKS)](../pytorch/verl/hyperpod-eks/rlvr/) | `pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/` | +| [veRL (Kubernetes)](../pytorch/verl/kubernetes/rlvr/) | `pytorch/verl/kubernetes/rlvr/recipe/profiles/` | +| [torchtitan](../pytorch/torchtitan/) | `pytorch/torchtitan/profiles/` | +| [nanoVLM](../pytorch/nanoVLM/) | `pytorch/nanoVLM/profiles/` | +| [TRL](../pytorch/trl/) | `pytorch/trl/profiles/` | +| [Megatron-LM](../megatron/megatron-lm/) | `megatron/megatron-lm/profiles/` | +| [BioNeMo](../megatron/bionemo/) | `megatron/bionemo/profiles/` | +| [Megatron-DeepSpeed](../pytorch/deepspeed/examples_megatron_deepspeed/) | `pytorch/deepspeed/examples_megatron_deepspeed/profiles/` | diff --git a/3.test_cases/shared/instance_detect.sh b/3.test_cases/shared/instance_detect.sh new file mode 100755 index 000000000..896664395 --- /dev/null +++ b/3.test_cases/shared/instance_detect.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# instance_detect.sh — Auto-detect EC2 instance type and resolve a profile. +# +# CANONICAL SOURCE: This is the single source of truth for instance detection. +# Copies exist in each test case's profiles/_detect.sh directory. To update +# all copies, edit this file and run: ./sync_profiles.sh +# +# Usage (from a training script): +# PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles") +# source "$PROFILE_ENV" +# +# Detection order: +# 1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge") +# 2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge") +# 3. EC2 instance metadata API (works on bare metal and K8s with host networking) +# 4. 
GPU name from nvidia-smi (fallback when metadata is unavailable) +# +# Outputs the path to the profile .env file on stdout. +# Exits non-zero if no profile can be resolved. +# --------------------------------------------------------------------------- +set -euo pipefail + +PROFILES_DIR="${1:-.}" + +# --- Step 1: Check for explicit INSTANCE_PROFILE override ------------------- +if [[ -n "${INSTANCE_PROFILE:-}" ]]; then + PROFILE_NAME="$INSTANCE_PROFILE" + echo "Instance profile override: ${PROFILE_NAME}" >&2 +else + # --- Step 2: Try INSTANCE_TYPE from env_vars ---------------------------- + INSTANCE_TYPE="${INSTANCE_TYPE:-}" + + # --- Step 3: Try EC2 instance metadata API ------------------------------ + if [[ -z "$INSTANCE_TYPE" ]]; then + # IMDSv2: get a token first, then query + TOKEN=$(curl -s --connect-timeout 2 -X PUT \ + "http://169.254.169.254/latest/api/token" \ + -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true + if [[ -n "$TOKEN" ]]; then + INSTANCE_TYPE=$(curl -s --connect-timeout 2 \ + -H "X-aws-ec2-metadata-token: $TOKEN" \ + "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true + fi + fi + + # --- Step 4: Fallback — detect from GPU name ---------------------------- + if [[ -z "$INSTANCE_TYPE" ]]; then + if command -v nvidia-smi &>/dev/null; then + GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) || true + case "${GPU_NAME:-}" in + *A10G*) INSTANCE_TYPE="g5.12xlarge" ;; + *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;; + *A100*40*) INSTANCE_TYPE="p4d.24xlarge" ;; + *H100*) INSTANCE_TYPE="p5.48xlarge" ;; + *H200*) INSTANCE_TYPE="p5en.48xlarge" ;; + *L40S*) INSTANCE_TYPE="g6e.12xlarge" ;; + *L4*) INSTANCE_TYPE="g6.12xlarge" ;; + *) + echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." 
>&2 + exit 1 + ;; + esac + echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2 + else + echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2 + echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2 + exit 1 + fi + fi + + # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge" + PROFILE_NAME="${INSTANCE_TYPE//./-}" + echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2 +fi + +# --- Resolve profile file path ---------------------------------------------- +PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env" + +if [[ ! -f "$PROFILE_PATH" ]]; then + echo "ERROR: No profile found at ${PROFILE_PATH}" >&2 + echo "" >&2 + echo "Available profiles:" >&2 + ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\// /' >&2 || echo " (none)" >&2 + echo "" >&2 + echo "To create a new profile, copy an existing one and adjust the values:" >&2 + echo " cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2 + exit 1 +fi + +# Output the resolved path (recipe script will source it) +echo "$PROFILE_PATH" diff --git a/3.test_cases/shared/sync_profiles.sh b/3.test_cases/shared/sync_profiles.sh new file mode 100755 index 000000000..4da8aea1b --- /dev/null +++ b/3.test_cases/shared/sync_profiles.sh @@ -0,0 +1,50 @@ +#!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# sync_profiles.sh — Copy the canonical instance_detect.sh to all test cases. +# +# Run this after editing 3.test_cases/shared/instance_detect.sh to propagate +# changes to all test case profile directories. +# --------------------------------------------------------------------------- +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +CANONICAL="${SCRIPT_DIR}/instance_detect.sh" + +if [[ ! 
-f "$CANONICAL" ]]; then + echo "ERROR: Canonical source not found: $CANONICAL" >&2 + exit 1 +fi + +# All test case profile directories that contain _detect.sh +TARGETS=( + "${SCRIPT_DIR}/../pytorch/verl/hyperpod-eks/rlvr/recipe/profiles/_detect.sh" + "${SCRIPT_DIR}/../pytorch/verl/kubernetes/rlvr/recipe/profiles/_detect.sh" + "${SCRIPT_DIR}/../pytorch/FSDP/profiles/_detect.sh" + "${SCRIPT_DIR}/../pytorch/torchtitan/profiles/_detect.sh" + "${SCRIPT_DIR}/../pytorch/nanoVLM/profiles/_detect.sh" + "${SCRIPT_DIR}/../pytorch/trl/profiles/_detect.sh" + "${SCRIPT_DIR}/../megatron/megatron-lm/profiles/_detect.sh" + "${SCRIPT_DIR}/../megatron/bionemo/profiles/_detect.sh" + "${SCRIPT_DIR}/../pytorch/deepspeed/examples_megatron_deepspeed/profiles/_detect.sh" +) + +SYNCED=0 +SKIPPED=0 + +for target in "${TARGETS[@]}"; do + target_resolved="$(cd "$(dirname "$target")" 2>/dev/null && pwd)/$(basename "$target")" 2>/dev/null || true + if [[ -d "$(dirname "$target")" ]]; then + cp "$CANONICAL" "$target" + chmod +x "$target" + echo " Synced: $target" + ((SYNCED++)) + else + echo " Skipped (dir not found): $target" + ((SKIPPED++)) + fi +done + +echo "" +echo "Done: $SYNCED synced, $SKIPPED skipped." diff --git a/docs/instance-compatibility.md b/docs/instance-compatibility.md new file mode 100644 index 000000000..8a72509cd --- /dev/null +++ b/docs/instance-compatibility.md @@ -0,0 +1,187 @@ +# Instance Compatibility Guide + +This guide helps you choose the right EC2 instance type for each test case +in this repository and understand the parameter changes required when moving +between instance families. + +Most test cases are developed and tested on **p5en.48xlarge** (8 x H200 80GB). +Running them on smaller or differently-configured instances (g5, p4de, g6e) +requires adjustments to FSDP strategy, offloading, tensor parallelism, NCCL +flags, and checkpoint frequency. This document captures those differences +systematically. 
+ +## Quick Reference: Instance Hardware Profiles + +| Instance | GPU | VRAM | GPUs | GPUDirect RDMA | EFA | NVLink | Node RAM | Notes | +|----------|-----|------|------|----------------|-----|--------|----------|-------| +| [p5en.48xlarge](instance-profiles/p5en.md) | H200 | 80 GB | 8 | Yes | 32 | NVSwitch | ~700 Gi | Current primary target | +| [p5.48xlarge](instance-profiles/p5.md) | H100 | 80 GB | 8 | Yes | 32 | NVSwitch | ~700 Gi | Same profile as p5en for most workloads | +| [p4de.24xlarge](instance-profiles/p4de.md) | A100 | 80 GB | 8 | Yes | 4 | NVSwitch | ~1100 Gi | Fewer EFA adapters than p5 | +| [g5.12xlarge](instance-profiles/g5.md) | A10G | 24 GB | 4 | No | 1 | None | ~168 Gi | Requires aggressive offloading for >10B models | +| [g6e.12xlarge](instance-profiles/g6e.md) | L40S | 48 GB | 4 | No | 1 | None | ~168 Gi | Middle ground between g5 and p4de | +| [trn1.32xlarge](instance-profiles/trn1.md) | Trainium v1 | 32 GB HBM | 16 | Yes | 8 | NeuronLink | ~480 Gi | Neuron SDK only | + +> See [docs/instance-profiles/](instance-profiles/) for detailed per-instance +> specifications, NCCL settings, and tuning recommendations. 
+ +## The 6 Dimensions That Differ Across Instances + +When porting a workload between instance types, these are the six hardware +dimensions that drive parameter changes: + +| # | Dimension | Why It Matters | Example Impact | +|---|-----------|---------------|----------------| +| 1 | **GPU VRAM** | Determines FSDP strategy, offloading needs, TP degree, batch sizes | A10G 24 GB needs FSDP2 + full offload; H200 80 GB does not | +| 2 | **GPUDirect RDMA** | Controls NCCL transport protocol and EFA device flags | g5: `RDMA=0, PROTO=simple`; p5: `RDMA=1, PROTO=default` | +| 3 | **EFA Count** | Inter-node bandwidth; affects multi-node scaling efficiency | g5: 1 EFA adapter; p5en: 32 EFA adapters | +| 4 | **NVLink Topology** | Intra-node GPU-to-GPU bandwidth; affects TP efficiency | g5: no NVLink (PCIe only); p5: NVSwitch full-mesh | +| 5 | **Node CPU Memory** | Feasibility of CPU offloading for optimizer states and params | g5: ~168 Gi allocatable; p5en: ~700 Gi | +| 6 | **Storage Size** | Checkpoint frequency limits; large models produce huge checkpoints | 117 GB/checkpoint for 20B model; save_freq=1 fills 1.2 TB in 9 steps | + +## Test Case Compatibility Matrix + +The table below shows which instance types have been tested with each test case. +Status key: **Tested** = validated end-to-end, **Untested** = expected to work +with the listed profile but not yet validated, **N/A** = not applicable +(e.g., Neuron test cases on NVIDIA instances). 
+ +### PyTorch Test Cases + +| Test Case | p5en.48xl (H200) | p5.48xl (H100) | p4de.24xl (A100) | g5.12xl (A10G) | g6e.12xl (L40S) | trn1/trn2 | +|-----------|:-:|:-:|:-:|:-:|:-:|:-:| +| [FSDP](../3.test_cases/pytorch/FSDP/) | Tested | Tested | Tested | Tested | Untested | N/A | +| [veRL GRPO](../3.test_cases/pytorch/verl/) | Tested | Untested | Untested | Tested | Untested | N/A | +| [DDP](../3.test_cases/pytorch/ddp/) | Untested | Untested | Untested | Untested | Untested | N/A | +| [DeepSpeed](../3.test_cases/pytorch/deepspeed/) | Untested | Untested | Untested | Untested | Untested | N/A | +| [torchtitan](../3.test_cases/pytorch/torchtitan/) | Untested | Untested | Untested | Untested | Untested | N/A | +| [TRL](../3.test_cases/pytorch/trl/) | Untested | Untested | Untested | Untested | Untested | N/A | +| [distillation](../3.test_cases/pytorch/distillation/) | Tested | Untested | Tested | Untested | Untested | N/A | +| [nanoVLM](../3.test_cases/pytorch/nanoVLM/) | Untested | Untested | Untested | Tested | Untested | N/A | +| [MosaicML Composer](../3.test_cases/pytorch/mosaicml-composer/) | Untested | Untested | Untested | Untested | Untested | N/A | +| [Picotron](../3.test_cases/pytorch/picotron/) | Untested | Untested | Untested | Untested | Untested | N/A | + +### Megatron Test Cases + +| Test Case | p5en.48xl (H200) | p5.48xl (H100) | p4de.24xl (A100) | g5.12xl (A10G) | g6e.12xl (L40S) | trn1/trn2 | +|-----------|:-:|:-:|:-:|:-:|:-:|:-:| +| [NeMo 2.0](../3.test_cases/megatron/nemo/) | Tested | Tested | Untested | Untested | Untested | N/A | +| [NeMo 1.0](../3.test_cases/megatron/nemo1.0/) | Untested | Untested | Tested | Untested | Untested | N/A | +| [Megatron-LM](../3.test_cases/megatron/megatron-lm/) | Untested | Untested | Untested | Untested | Untested | N/A | +| [BioNeMo](../3.test_cases/megatron/bionemo/) | Untested | Untested | Tested | Untested | Untested | N/A | + +### Neuron Test Cases + +| Test Case | trn1.32xl | trn1n.32xl | trn2.48xl | 
trn2.3xl | NVIDIA | +|-----------|:-:|:-:|:-:|:-:|:-:| +| [optimum-neuron](../3.test_cases/pytorch/optimum-neuron/) | Tested | Tested | Tested | Tested | N/A | +| [neuronx-distributed](../3.test_cases/pytorch/neuronx-distributed/) | Untested | Untested | Untested | Untested | N/A | + +### Other Test Cases + +| Test Case | p5en.48xl (H200) | p5.48xl (H100) | p4de.24xl (A100) | g5.12xl (A10G) | +|-----------|:-:|:-:|:-:|:-:| +| [JAX/Paxml](../3.test_cases/jax/) | Untested | Untested | Untested | Untested | +| [ESM2](../3.test_cases/23.SMHP-esm2/) | Untested | Untested | Untested | Tested | + +## Common Parameter Adjustments by Instance Type + +This section summarizes the most frequently needed changes when moving from +the default p5en configuration to other instance types. + +### Moving to g5.12xlarge (A10G 24 GB) + +The most impactful change. Requires aggressive memory optimization: + +| Parameter | p5en Value | g5 Value | Rationale | +|-----------|-----------|----------|-----------| +| FSDP strategy | `fsdp` (FSDP1) | `fsdp2` | FSDP1 explicitly disables CPUOffload for actor role | +| `offload_policy` | not set | `True` | FSDP2-specific flag; enables proper CPU offloading | +| `model_dtype` | default (fp32) | `bf16` | veRL defaults to fp32; 24 GB requires explicit bf16 | +| `gpu_memory_utilization` | 0.6 | 0.6 | Fraction of TOTAL GPU; 0.3 x 23 GB = 6.9 GB < model shard | +| `enforce_eager` | `False` | `True` | CUDA graphs need extra workspace; OOM on 24 GB | +| `tensor_parallel_size` | 2 | 4 | Shard model across all 4 GPUs per node | +| `param_offload` | `False` | `True` | Offload params to CPU to fit in 24 GB | +| `optimizer_offload` | `False` | `True` | Offload optimizer states to CPU | +| `NCCL_PROTO` | default | `simple` | Required when `FI_EFA_USE_DEVICE_RDMA=0` | +| `FI_EFA_USE_DEVICE_RDMA` | `1` | `0` | g5 does not support GPUDirect RDMA | +| `save_freq` | 1 | 20+ | 117 GB/ckpt for 20B model; save_freq=1 fills disk fast | +| `WORKER_MEMORY` | 200 Gi+ | 150 
Gi | g5.12xl allocatable is ~168 Gi, not 200 Gi | +| `nnodes` | node count | worker count only | Head pod without GPUs causes NCCL hang | + +### Moving to p4de.24xlarge (A100 80 GB) + +Moderate changes. Same VRAM as p5 but fewer EFA adapters: + +| Parameter | p5en Value | p4de Value | Rationale | +|-----------|-----------|-----------|-----------| +| EFA adapter count | 32 | 4 | Lower inter-node bandwidth | +| `FI_EFA_USE_DEVICE_RDMA` | `1` | `1` | p4de supports GPUDirect RDMA | +| `NCCL_PROTO` | default | default | RDMA available | +| Batch sizes | as-is | may need reduction | Less inter-node bandwidth for gradient sync | + +### Moving to g6e.12xlarge (L40S 48 GB) + +Moderate changes. More VRAM than g5 but still no RDMA: + +| Parameter | p5en Value | g6e Value | Rationale | +|-----------|-----------|----------|-----------| +| FSDP strategy | `fsdp` | `fsdp` or `fsdp2` | 48 GB may fit without offloading for models <30B | +| `NCCL_PROTO` | default | `simple` | No GPUDirect RDMA on g6e | +| `FI_EFA_USE_DEVICE_RDMA` | `1` | `0` | g6e does not support GPUDirect RDMA | +| `tensor_parallel_size` | 2 | 4 | 4 GPUs per node | +| `gpu_memory_utilization` | 0.6 | 0.6-0.7 | More headroom than g5 | + +## Lessons Learned: veRL GRPO on g5 (11 OOM iterations) + +The Instance Compatibility Framework was motivated by the experience of +porting veRL GRPO training from p5en.48xlarge to g5.12xlarge. It required +11 iterations to resolve cascading OOM failures. Each failure mapped to a +parameter that differs by instance type. + +Key findings: + +1. **FSDP2 vs FSDP1**: FSDP1 explicitly disables `CPUOffload` for the actor + role in veRL. On 24 GB GPUs, this is fatal for models >10B params. + +2. **offload_policy=True**: FSDP2-specific flag. Without it, the actor model + stays on GPU even when FSDP2 is selected. + +3. **model_dtype=bf16**: veRL defaults the actor to fp32. On 80 GB GPUs this + wastes space; on 24 GB GPUs it causes instant OOM. + +4. 
**gpu_memory_utilization**: Fraction of TOTAL GPU memory, not just KV + cache. `0.3 x 23 GB = 6.9 GB` which is less than a 10 GB model shard. + +5. **enforce_eager=True**: CUDA graphs require extra workspace memory that + pushes 24 GB GPUs into OOM. + +6. **NCCL_PROTO=simple**: Required when `FI_EFA_USE_DEVICE_RDMA=0` (g5, g6e). + Without it, NCCL hangs on collective operations. + +7. **Checkpoint size**: Full FSDP state for a 20B model across 12 GPUs + produces ~117 GB per checkpoint. `save_freq=1` on 1.2 TB FSx fills the + disk in 9 training steps. + +8. **nnodes must exclude the non-GPU head pod**: In Ray on K8s, the head pod + often has no GPUs. Including it in the `nnodes` count causes NCCL hangs. + +9. **WORKER_MEMORY**: g5.12xlarge allocatable memory is ~168 Gi, not 200 Gi. + Requesting 200 Gi causes pod scheduling failures. + +10. **expandable_segments incompatible with vLLM**: Setting + `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` causes + `CuMemAllocator` assertion failures in vLLM. + +## Contributing + +When you validate a test case on a new instance type: + +1. Add a row to the test case's "Tested Configurations" table in its README +2. Update the [compatibility matrix](#test-case-compatibility-matrix) above +3. If you needed new parameters, document them in the per-test-case README +4. Consider adding an [instance profile](instance-profiles/) if the instance + family is not yet documented + +See the [implementation plan](plans/instance-compatibility-framework.md) for +the full roadmap including parameterized config profiles (Tier 2) and +automated multi-instance CI validation (Tier 3). diff --git a/docs/instance-profiles/README.md b/docs/instance-profiles/README.md new file mode 100644 index 000000000..e572cc83b --- /dev/null +++ b/docs/instance-profiles/README.md @@ -0,0 +1,31 @@ +# Instance Profiles + +This directory contains hardware specifications and tuning guidance for each +EC2 instance family used in this repository's test cases. 
+ +## NVIDIA GPU Instances + +| Profile | GPU | VRAM | GPUDirect RDMA | EFA | Primary Use | +|---------|-----|------|----------------|-----|-------------| +| [g5.md](g5.md) | A10G | 24 GB | No | 1 | Cost-effective experimentation | +| [g6e.md](g6e.md) | L40S | 48 GB | No | 1 | Mid-range development/testing | +| [p4de.md](p4de.md) | A100 | 80 GB | Yes | 4 | Production training | +| [p5.md](p5.md) | H100 | 80 GB | Yes | 32 | Production training | +| [p5en.md](p5en.md) | H200 | 80 GB | Yes | 32 | Primary target (most test cases) | + +## AWS Trainium Instances + +| Profile | Accelerator | Memory | EFA | Primary Use | +|---------|-------------|--------|-----|-------------| +| [trn1.md](trn1.md) | Trainium v1/v2 | 32 GB HBM/core | 8-16 | Neuron SDK training | + +## How to Use These Profiles + +1. Identify which instance type you will be running on +2. Read the corresponding profile to understand the hardware constraints +3. Check the [compatibility matrix](../instance-compatibility.md) to see if + your test case has been validated on that instance +4. Apply the NCCL/EFA settings and memory optimization strategies from the + profile +5. 
If you validate a new test case + instance combination, update both the + profile's "Tested Workloads" table and the central compatibility matrix diff --git a/docs/instance-profiles/g5.md b/docs/instance-profiles/g5.md new file mode 100644 index 000000000..f6b0c7f6c --- /dev/null +++ b/docs/instance-profiles/g5.md @@ -0,0 +1,80 @@ +# g5 Instance Family — A10G GPUs + +> Covers: g5.xlarge, g5.2xlarge, g5.4xlarge, g5.8xlarge, g5.12xlarge, +> g5.16xlarge, g5.24xlarge, g5.48xlarge + +## Hardware Summary + +| Dimension | g5.12xlarge | g5.48xlarge | +|-----------|-------------|-------------| +| GPU | NVIDIA A10G | NVIDIA A10G | +| GPU VRAM | 24 GB GDDR6 | 24 GB GDDR6 | +| GPU Count | 4 | 8 | +| GPUDirect RDMA | **No** | **No** | +| EFA Adapters | 1 | 1 | +| NVLink | **None** (PCIe only) | **None** (PCIe only) | +| Node CPU Memory | ~192 GB (~168 Gi allocatable) | ~768 GB | +| Local Storage | 1 x 3.8 TB NVMe | 2 x 3.8 TB NVMe | + +## Key Characteristics + +- **Lowest-cost GPU instances** for training experimentation +- **No NVLink**: GPU-to-GPU communication goes over PCIe; tensor parallelism + is less efficient than on p4de/p5 +- **No GPUDirect RDMA**: EFA traffic goes through CPU, requiring + `NCCL_PROTO=simple` and `FI_EFA_USE_DEVICE_RDMA=0` +- **Single EFA adapter**: Inter-node bandwidth is significantly lower than + p5 (1 vs 32 adapters); multi-node scaling is limited +- **24 GB VRAM**: Requires aggressive memory optimization for models >10B + parameters + +## Required NCCL / EFA Settings + +```bash +# g5 does not support GPUDirect RDMA +export FI_EFA_USE_DEVICE_RDMA=0 +export NCCL_PROTO=simple + +# Optional: helpful for debugging on g5 +export NCCL_DEBUG=INFO +export FI_LOG_LEVEL=info +``` + +## Memory Optimization Strategies + +For models >10B parameters on g5.12xlarge (4 x 24 GB): + +1. **Use FSDP2** instead of FSDP1 — FSDP1 may disable CPU offloading for + certain roles (e.g., actor in veRL) +2. 
**Enable full CPU offloading**: `param_offload=True`, + `optimizer_offload=True`, `offload_policy=True` +3. **Force bf16**: Many frameworks default to fp32; explicit bf16 halves memory +4. **Use eager mode**: `enforce_eager=True` — CUDA graphs require extra + workspace that causes OOM on 24 GB +5. **Set TP=4**: Shard model across all 4 GPUs per node +6. **Reduce checkpoint frequency**: Full FSDP state for 20B model = ~117 GB; + save_freq=1 fills disk quickly. Use save_freq=20+ +7. **Avoid expandable_segments**: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` + is incompatible with vLLM's CuMemAllocator + +## Resource Requests (Kubernetes) + +```yaml +# g5.12xlarge — 4 GPUs, ~168 Gi allocatable +resources: + limits: + nvidia.com/gpu: 4 + vpc.amazonaws.com/efa: 1 + requests: + memory: "150Gi" # NOT 200Gi — allocatable is ~168Gi + cpu: "40" +``` + +## Tested Workloads + +| Test Case | Model | Nodes | Status | Notes | +|-----------|-------|-------|--------|-------| +| veRL GRPO | gpt-oss-20b (MoE) | 3+1 head | Tested | FSDP2, full offload, TP=4 | +| FSDP | Various | Various | Tested | See FSDP test case README | +| nanoVLM | Various | 1 | Tested | See nanoVLM README | +| ESM2 | ESM2 | 1-2 | Tested | g5.24xl and g5.12xl | diff --git a/docs/instance-profiles/g6e.md b/docs/instance-profiles/g6e.md new file mode 100644 index 000000000..18c0632c1 --- /dev/null +++ b/docs/instance-profiles/g6e.md @@ -0,0 +1,85 @@ +# g6e Instance Family — L40S GPUs + +> Covers: g6e.xlarge, g6e.2xlarge, g6e.4xlarge, g6e.8xlarge, g6e.12xlarge, +> g6e.16xlarge, g6e.24xlarge, g6e.48xlarge + +## Hardware Summary + +| Dimension | g6e.12xlarge | g6e.48xlarge | +|-----------|--------------|--------------| +| GPU | NVIDIA L40S | NVIDIA L40S | +| GPU VRAM | 48 GB GDDR6 | 48 GB GDDR6 | +| GPU Count | 4 | 8 | +| GPUDirect RDMA | **No** | **No** | +| EFA Adapters | 1 | 1 | +| NVLink | **None** (PCIe only) | **None** (PCIe only) | +| Node CPU Memory | ~168 Gi allocatable | ~768 Gi allocatable | +| Local 
Storage | 1 x 7.6 TB NVMe | 2 x 7.6 TB NVMe | + +## Key Characteristics + +- **Middle ground** between g5 (24 GB) and p4de (80 GB) +- **48 GB VRAM**: Many models that OOM on g5 (24 GB) will fit on g6e without + full offloading — but still need offloading for >40B models +- **No NVLink**: Same as g5; GPU-to-GPU over PCIe only +- **No GPUDirect RDMA**: Same NCCL settings as g5 +- **Single EFA adapter**: Same inter-node bandwidth limitations as g5 +- **Good for development and testing** workloads that don't require + multi-node high-bandwidth communication + +## Required NCCL / EFA Settings + +```bash +# Same as g5 — no GPUDirect RDMA +export FI_EFA_USE_DEVICE_RDMA=0 +export NCCL_PROTO=simple + +export FI_PROVIDER=efa +export NCCL_DEBUG=WARN +``` + +## Differences from g5 + +| Aspect | g6e.12xlarge | g5.12xlarge | +|--------|--------------|-------------| +| GPU | L40S 48 GB | A10G 24 GB | +| VRAM | 48 GB | 24 GB | +| Offloading needed | Models >30B | Models >10B | +| FSDP strategy | FSDP1 may work | Must use FSDP2 | +| enforce_eager | May not be needed | Required | +| NCCL settings | Identical | Identical | +| EFA | Identical | Identical | + +## Memory Optimization Strategies + +With 48 GB per GPU, optimization requirements are less aggressive than g5: + +1. **FSDP1 may suffice** for models up to ~30B parameters +2. **Offloading**: May only need `ref_param_offload=True` instead of full + offloading +3. **bf16**: Still recommended to enable explicitly if the framework defaults + to fp32 +4. **CUDA graphs**: May work for smaller models; test on a case-by-case basis +5. 
**TP=4**: Same as g5 (4 GPUs per node on g6e.12xlarge) + +## Resource Requests (Kubernetes) + +```yaml +# g6e.12xlarge — 4 GPUs, ~168 Gi allocatable +resources: + limits: + nvidia.com/gpu: 4 + vpc.amazonaws.com/efa: 1 + requests: + memory: "150Gi" + cpu: "40" +``` + +## Tested Workloads + +| Test Case | Model | Nodes | Status | Notes | +|-----------|-------|-------|--------|-------| +| — | — | — | Untested | No test cases validated on g6e yet | + +> If you validate a workload on g6e, please update this table and the +> [compatibility matrix](../instance-compatibility.md). diff --git a/docs/instance-profiles/p4de.md b/docs/instance-profiles/p4de.md new file mode 100644 index 000000000..11c431e97 --- /dev/null +++ b/docs/instance-profiles/p4de.md @@ -0,0 +1,76 @@ +# p4de Instance Family — A100 GPUs + +> Covers: p4d.24xlarge, p4de.24xlarge + +## Hardware Summary + +| Dimension | p4de.24xlarge | p4d.24xlarge | +|-----------|---------------|--------------| +| GPU | NVIDIA A100 | NVIDIA A100 | +| GPU VRAM | 80 GB HBM2e | 40 GB HBM2e | +| GPU Count | 8 | 8 | +| GPUDirect RDMA | **Yes** | **Yes** | +| EFA Adapters | 4 | 4 | +| NVLink | NVSwitch (600 GB/s bisection) | NVSwitch (600 GB/s bisection) | +| Node CPU Memory | ~1100 Gi allocatable | ~1100 Gi allocatable | +| Local Storage | 8 x 1 TB NVMe | 8 x 1 TB NVMe | + +## Key Characteristics + +- **GPUDirect RDMA supported**: Same NCCL settings as p5/p5en +- **Fewer EFA adapters** (4 vs 32): Inter-node bandwidth is lower than p5; + may need to reduce batch sizes for large multi-node runs +- **NVSwitch**: Full-mesh GPU-to-GPU connectivity (slightly lower bandwidth + than p5's NVSwitch generation) +- **80 GB VRAM (p4de)**: Same fitting characteristics as p5 for most models +- **40 GB VRAM (p4d)**: May require offloading for models >30B; intermediate + between g5 (24 GB) and p4de (80 GB) +- **Higher CPU memory** (~1100 Gi): More headroom for CPU offloading than p5 + +## Required NCCL / EFA Settings + +```bash +# p4de supports 
GPUDirect RDMA +export FI_EFA_USE_DEVICE_RDMA=1 +export FI_PROVIDER=efa +export NCCL_DEBUG=WARN +``` + +## Differences from p5en + +| Aspect | p4de.24xlarge | p5en.48xlarge | +|--------|---------------|---------------| +| GPU | A100 80 GB | H200 80 GB | +| EFA adapters | 4 | 32 | +| NVSwitch bandwidth | 600 GB/s | 900 GB/s | +| Node CPU memory | ~1100 Gi | ~700 Gi | +| Training parameters | Usually identical | Baseline | + +Most p5en profiles work on p4de without modification. For large multi-node +runs, the lower EFA count (4 vs 32) may become a bottleneck — consider: +- Reducing gradient accumulation batch size +- Using gradient compression +- Overlapping communication with computation + +## Resource Requests (Kubernetes) + +```yaml +# p4de.24xlarge — 8 GPUs, ~1100 Gi allocatable +resources: + limits: + nvidia.com/gpu: 8 + vpc.amazonaws.com/efa: 4 + requests: + memory: "900Gi" + cpu: "90" +``` + +## Tested Workloads + +| Test Case | Model | Nodes | Status | Notes | +|-----------|-------|-------|--------|-------| +| FSDP | Llama 2/3 | Various | Tested | See FSDP README | +| NeMo 1.0 | Various | Various | Tested | Primary target for NeMo 1.0 | +| BioNeMo | Various | Various | Tested | See BioNeMo README | +| distillation | Various | Various | Tested | Explicitly listed | +| Stable Diffusion | SD models | Various | Tested | Performance comparison available | diff --git a/docs/instance-profiles/p5.md b/docs/instance-profiles/p5.md new file mode 100644 index 000000000..b99f65a32 --- /dev/null +++ b/docs/instance-profiles/p5.md @@ -0,0 +1,69 @@ +# p5 Instance Family — H100 GPUs + +> Covers: p5.48xlarge + +## Hardware Summary + +| Dimension | p5.48xlarge | +|-----------|-------------| +| GPU | NVIDIA H100 | +| GPU VRAM | 80 GB HBM3 | +| GPU Count | 8 | +| GPUDirect RDMA | **Yes** | +| EFA Adapters | 32 | +| NVLink | NVSwitch (full-mesh, 900 GB/s bisection) | +| Node CPU Memory | ~700 Gi allocatable | +| Local Storage | 8 x 3.84 TB NVMe | + +## Key Characteristics + +- 
**Functionally identical to p5en** for most training workloads +- H100 vs H200 difference is primarily in HBM bandwidth (HBM3 vs HBM3e) + and minor microarchitectural improvements +- **Same parallelism and offloading settings** as p5en for nearly all test + cases +- Profiles written for p5en.48xlarge should work on p5.48xlarge without + modification + +## Required NCCL / EFA Settings + +```bash +# Same as p5en — GPUDirect RDMA supported +export FI_EFA_USE_DEVICE_RDMA=1 +export FI_PROVIDER=efa +export NCCL_DEBUG=WARN +``` + +## Differences from p5en + +| Aspect | p5.48xlarge | p5en.48xlarge | +|--------|-------------|---------------| +| GPU | H100 80 GB HBM3 | H200 80 GB HBM3e | +| Memory bandwidth | 3.35 TB/s | 4.8 TB/s | +| Training parameters | Identical | Identical | +| NCCL settings | Identical | Identical | + +For practical purposes, use the same configuration profiles. Performance +will differ slightly due to memory bandwidth, but correctness and memory +fitting are identical. + +## Resource Requests (Kubernetes) + +```yaml +# p5.48xlarge — 8 GPUs, ~700 Gi allocatable +resources: + limits: + nvidia.com/gpu: 8 + vpc.amazonaws.com/efa: 32 + requests: + memory: "600Gi" + cpu: "180" +``` + +## Tested Workloads + +| Test Case | Model | Nodes | Status | Notes | +|-----------|-------|-------|--------|-------| +| FSDP | Llama 2/3, Mixtral | Various | Tested | CI target | +| NeMo 2.0 | Various | Various | Tested | See NeMo test case | +| Stable Diffusion | SD models | Various | Tested | Performance benchmarks available | diff --git a/docs/instance-profiles/p5en.md b/docs/instance-profiles/p5en.md new file mode 100644 index 000000000..62e5735dd --- /dev/null +++ b/docs/instance-profiles/p5en.md @@ -0,0 +1,74 @@ +# p5en Instance Family — H200 GPUs + +> Covers: p5en.48xlarge + +## Hardware Summary + +| Dimension | p5en.48xlarge | +|-----------|---------------| +| GPU | NVIDIA H200 | +| GPU VRAM | 80 GB HBM3e | +| GPU Count | 8 | +| GPUDirect RDMA | **Yes** | +| EFA 
Adapters | 32 | +| NVLink | NVSwitch (full-mesh, 900 GB/s bisection) | +| Node CPU Memory | ~700 Gi allocatable | +| Local Storage | 8 x 3.84 TB NVMe | + +## Key Characteristics + +- **Current primary target** for most test cases in this repository +- **NVSwitch full-mesh**: All 8 GPUs have equal bandwidth to each other; + tensor parallelism is highly efficient +- **32 EFA adapters**: Maximum inter-node bandwidth; ideal for large-scale + multi-node training +- **GPUDirect RDMA**: GPU memory can be read/written directly by the NIC, + bypassing the CPU for collective operations +- **80 GB HBM3e**: Most models up to 70B fit without offloading when using + appropriate parallelism + +## Required NCCL / EFA Settings + +```bash +# p5en supports GPUDirect RDMA — use defaults +export FI_EFA_USE_DEVICE_RDMA=1 +# NCCL_PROTO can be left at default (LL/LL128) + +# Standard EFA settings +export FI_PROVIDER=efa +export NCCL_DEBUG=WARN +``` + +## Memory Optimization Strategies + +With 80 GB per GPU, offloading is typically not needed: + +1. **FSDP1 is sufficient** for most workloads +2. **Offloading optional**: Only needed for very large models (>100B) or + when running with small TP degree +3. **`ref_param_offload=True`**: Common optimization — offload reference + model to CPU since it's only used for KL divergence computation +4. **TP=2**: Typical for 8B-70B models; NVSwitch makes TP very efficient +5. 
**Standard save_freq**: Disk space is ample with 8 x 3.84 TB NVMe + +## Resource Requests (Kubernetes) + +```yaml +# p5en.48xlarge — 8 GPUs, ~700 Gi allocatable +resources: + limits: + nvidia.com/gpu: 8 + vpc.amazonaws.com/efa: 32 + requests: + memory: "600Gi" + cpu: "180" +``` + +## Tested Workloads + +| Test Case | Model | Nodes | Status | Notes | +|-----------|-------|-------|--------|-------| +| veRL GRPO | Qwen3-8B | 4 | Tested | FSDP1, TP=2, ref_offload only | +| FSDP | Llama 2/3, Mixtral | Various | Tested | Primary CI target | +| NeMo 2.0 | Various | Various | Tested | See NeMo test case | +| distillation | Various | Various | Tested | P5en supported | diff --git a/docs/instance-profiles/trn1.md b/docs/instance-profiles/trn1.md new file mode 100644 index 000000000..bab006027 --- /dev/null +++ b/docs/instance-profiles/trn1.md @@ -0,0 +1,74 @@ +# trn1 / trn2 Instance Family — AWS Trainium + +> Covers: trn1.2xlarge, trn1.32xlarge, trn1n.32xlarge, trn2.3xlarge, +> trn2.48xlarge + +## Hardware Summary + +| Dimension | trn1.32xlarge | trn1n.32xlarge | trn2.48xlarge | trn2.3xlarge | +|-----------|---------------|----------------|---------------|--------------| +| Accelerator | Trainium v1 | Trainium v1 | Trainium v2 | Trainium v2 | +| Accelerator Memory | 32 GB HBM per core | 32 GB HBM per core | 32 GB HBM per core | 32 GB HBM per core | +| NeuronCores | 32 (16 devices x 2) | 32 (16 devices x 2) | 64 (32 devices x 2) | 4 (2 devices x 2) | +| EFA Adapters | 8 | 16 | 16 | 1 | +| NeuronLink | Yes (intra-node) | Yes (intra-node) | Yes (intra-node) | No | +| Node CPU Memory | ~480 Gi | ~480 Gi | ~960 Gi | ~30 Gi | + +## Key Characteristics + +- **Neuron SDK only**: These instances use the AWS Neuron compiler and + runtime, not CUDA. 
NVIDIA-targeted test cases are **not applicable** +- **Different software stack**: Uses `neuronx-distributed`, `optimum-neuron`, + or `neuronx-nemo-megatron` instead of PyTorch FSDP / Megatron-LM +- **NeuronLink**: Intra-node device-to-device communication (analogous to + NVLink for NVIDIA) +- **EFA**: Inter-node communication; trn1n and trn2 have more adapters +- **Compiler-driven optimization**: Memory optimization is handled primarily + by the Neuron compiler rather than manual FSDP/offloading configuration + +## Required Settings + +```bash +# Neuron-specific environment variables +export NEURON_RT_NUM_CORES=32 # Adjust per instance type +export NEURON_CC_FLAGS="--model-type=transformer" + +# EFA for multi-node +export FI_PROVIDER=efa +export FI_EFA_USE_DEVICE_RDMA=1 # Trainium supports EFA direct +``` + +## Differences from NVIDIA Instances + +Trainium instances use a fundamentally different software stack. The +parameter adjustment patterns described in the +[main compatibility guide](../instance-compatibility.md) (FSDP strategy, +NCCL settings, etc.) do not apply. 
Instead: + +| NVIDIA Concept | Trainium Equivalent | +|----------------|-------------------| +| FSDP / DeepSpeed | `neuronx-distributed` tensor/pipeline parallelism | +| NCCL | `xla` collective communication | +| CUDA graphs | Neuron compiler graph extraction | +| `nvidia-smi` | `neuron-top`, `neuron-monitor` | +| TP / PP configuration | Set via Neuron distributed config | + +## Resource Requests (Kubernetes) + +```yaml +# trn1.32xlarge — 16 Neuron devices +resources: + limits: + aws.amazon.com/neuron: 16 + vpc.amazonaws.com/efa: 8 + requests: + memory: "400Gi" + cpu: "120" +``` + +## Tested Workloads + +| Test Case | Model | Instance | Status | Notes | +|-----------|-------|----------|--------|-------| +| optimum-neuron | Various | trn1.32xl, trn1n, trn2.48xl, trn2.3xl | Tested | Multiple instance types validated | +| neuronx-distributed | Various | Various | Untested | Expected to work | diff --git a/docs/plans/instance-compatibility-framework.md b/docs/plans/instance-compatibility-framework.md new file mode 100644 index 000000000..a99c1e6e9 --- /dev/null +++ b/docs/plans/instance-compatibility-framework.md @@ -0,0 +1,297 @@ +# Instance Compatibility Framework — Implementation Plan + +> Proposed: 2026-03-10 | Status: Approved, not yet started +> Branch: `feat/instance-compatibility-framework` (to be created) +> Context: After 11 OOM iterations getting veRL GRPO to work on g5.12xlarge (A10G 24GB), +> it became clear the repo lacks systematic guidance for running test cases across +> different GPU instance types. + +## Problem Statement + +The `awsome-distributed-training` repo has ~20 test cases across PyTorch, Megatron, JAX, +and NeuronX. Most are written and tested for a single instance type (usually p5en.48xlarge +with H200 80GB GPUs). 
When users try to run these on different instances (g5, p4de, g6e), +they hit undocumented failures: + +- **7 of ~20 test cases have zero instance type guidance** in their READMEs +- **No centralized compatibility matrix** exists +- **CI only tests on P5 instances** — g5, p4de, trn1 are never validated +- **Scripts hardcode values** for one instance type without parameterized alternatives + +Real-world impact: Running veRL GRPO on g5.12xlarge required 11 iterations to resolve +OOM failures caused by parameters that work fine on p5en but break on 24GB GPUs. Every +failure mapped to a parameter that differs by instance type (FSDP strategy, offload policy, +dtype, gpu_memory_utilization, TP degree, NCCL settings, checkpoint frequency). + +## Current State (from repo analysis) + +### Test Cases with Instance Type Documentation +| Test Case | Documented Instances | +|-----------|---------------------| +| FSDP | P4d(e), P5, P6-B200, G5.12xlarge, G5.xlarge, G4dn | +| NeMo 2.0 | H100 (p5en), H200, B200 (p6) — has PERFORMANCE.md | +| NeMo 1.0 | p4de.24xlarge (A100 80GB) | +| bionemo | p4de.24xlarge, P5 | +| distillation | P4d, P5, P5en (explicit list) | +| nanoVLM | g5 (optional section) | +| optimum-neuron | trn1.32xlarge, trn1n, trn2.48xlarge, trn2.3xlarge | +| ESM2 | g5.24xlarge, g5.12xlarge, p5.48xlarge | +| Stable Diffusion | P4de, P5 (perf comparison table) | +| veRL (rlvr) | p5en.48xlarge, g5.12xlarge (after our work) | + +### Test Cases with NO Instance Type Guidance +JAX/Paxml, PyTorch DDP, DeepSpeed, torchtitan, TRL, MPT, Picotron + +### CI Coverage +- FSDP: Full regression (Slurm container, Slurm venv, EKS) — P5 only +- Megatron-LM: Container build only (no training) +- Everything else: No CI + +## Tiered Implementation Plan + +### Tier 1: Instance Profile Documentation (LOW effort, HIGH ROI) + +**Goal**: Every test case README has a "Tested Configurations" table. A central +document maps instance types to the key parameters that differ. + +**Deliverables**: +1. 
`docs/instance-compatibility.md` at repo root — master reference +2. Per-test-case "Tested Configurations" table in each README +3. `docs/instance-profiles/` with one file per instance family + +**Instance profile contents** (the 6 dimensions that matter): + +| Dimension | Why It Matters | Example Difference | +|-----------|---------------|-------------------| +| GPU VRAM | FSDP strategy, offloading, TP, batch sizes | A10G 24GB needs FSDP2+offload; H100 80GB doesn't | +| GPUDirect RDMA | NCCL_PROTO, FI_EFA_USE_DEVICE_RDMA | g5: RDMA=0, PROTO=simple; p5: RDMA=1, PROTO=default | +| EFA count | Inter-node bandwidth | g5: 1 EFA; p5en: 32 EFA | +| NVLink topology | Intra-node TP efficiency | g5: no NVLink; p5: NVSwitch | +| Node CPU memory | Offloading feasibility | g5: 168Gi allocatable; p5en: ~700Gi | +| Storage size | Checkpoint frequency | 117GB/ckpt × save_freq=1 fills 1.2TB in 9 steps | + +**Per-test-case table format**: +```markdown +## Tested Configurations + +| Instance | GPUs | Model | Nodes | Key Settings | Status | +|----------|------|-------|-------|-------------|--------| +| p5en.48xlarge | 8×H200 80GB | Qwen3-8B | 4 | FSDP1, TP=2 | ✅ Tested | +| g5.12xlarge | 4×A10G 24GB | gpt-oss-20b | 3+1 head | FSDP2, offload, TP=4 | ✅ Tested | +| p4de.24xlarge | 8×A100 80GB | Qwen3-8B | 4 | FSDP1, TP=2 | 🔲 Untested | +``` + +**Effort**: ~2-3 days. Pure documentation, no code changes. + +### Tier 2: Parameterized Config Profiles (MEDIUM effort, MEDIUM ROI) + +**Goal**: Scripts auto-detect instance type and source the right profile. Users +can also explicitly select a profile. + +**Deliverables**: +1. `profiles/` directory per test case with `.env` files per instance type +2. Auto-detection helper script using EC2 instance metadata +3. 
Updated recipe scripts that source profiles
+
+**Profile structure** (using veRL as the template):
+```
+recipe/
+├── run_grpo.sh            # Main script — sources profile
+├── profiles/
+│   ├── _detect.sh         # Auto-detect instance type
+│   ├── p5en-48xlarge.env  # FSDP1, no offload, TP=2
+│   ├── g5-12xlarge.env    # FSDP2, full offload, TP=4
+│   ├── p4de-24xlarge.env  # FSDP1, ref offload, TP=2
+│   └── README.md          # How to create a new profile
+```
+
+**Auto-detection** (`_detect.sh`):
+```bash
+#!/bin/bash
+# Detect from EC2 metadata (works on bare metal and K8s with host networking).
+# Use an IMDSv2 token: on instances that enforce IMDSv2, an unauthenticated
+# GET returns an error body, which would defeat the empty-string check below.
+TOKEN=$(curl -s -X PUT --connect-timeout 2 \
+  -H "X-aws-ec2-metadata-token-ttl-seconds: 60" \
+  http://169.254.169.254/latest/api/token 2>/dev/null)
+INSTANCE_TYPE=$(curl -s --connect-timeout 2 \
+  -H "X-aws-ec2-metadata-token: $TOKEN" \
+  http://169.254.169.254/latest/meta-data/instance-type 2>/dev/null)
+
+# Fallback: detect from GPU type (guesses a representative size per family)
+if [[ -z "$INSTANCE_TYPE" || "$INSTANCE_TYPE" != *xlarge* ]]; then
+  GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -1)
+  case "$GPU_NAME" in
+    *A10G*) INSTANCE_TYPE="g5.12xlarge" ;;
+    *A100*) INSTANCE_TYPE="p4de.24xlarge" ;;
+    *H100*) INSTANCE_TYPE="p5.48xlarge" ;;
+    *H200*) INSTANCE_TYPE="p5en.48xlarge" ;;
+    *L40S*) INSTANCE_TYPE="g6e.12xlarge" ;;
+  esac
+fi
+
+PROFILE_NAME="${INSTANCE_TYPE//./-}"
+echo "$PROFILE_NAME"
+```
+
+**Profile file** (`profiles/g5-12xlarge.env`):
+```bash
+# g5.12xlarge — 4× A10G 24GB, 1 EFA, no GPUDirect RDMA
+# Requires aggressive CPU offloading for models >10B params
+
+# FSDP
+export ACTOR_STRATEGY=fsdp2
+export OFFLOAD_POLICY=True
+export MODEL_DTYPE=bf16
+export PARAM_OFFLOAD=True
+export OPTIMIZER_OFFLOAD=True
+export RESHARD_AFTER_FORWARD=True
+
+# vLLM
+export GPU_MEMORY_UTILIZATION=0.6
+export ENFORCE_EAGER=True
+export TENSOR_PARALLEL_SIZE=4
+
+# Network
+export NCCL_PROTO=simple
+export FI_EFA_USE_DEVICE_RDMA=0
+
+# Training
+export NUM_GPU_PER_NODE=4
+export WORKER_MEMORY=150Gi
+export MAX_RESPONSE_LENGTH=256
+
+# Checkpoints (24GB GPUs → full offload → large checkpoints)
+export SAVE_FREQ=20
+export MAX_CKPT_TO_KEEP=3
+```
+
+**Effort**: ~1 week. Requires touching each test case's run scripts.
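The sourcing side of Tier 2 is small. Below is a minimal sketch of how a run script such as `run_grpo.sh` could pick a profile — the `INSTANCE_PROFILE` override and the `g5-12xlarge.env` filename come from the profile structure above, while the temp directory and the three example variables are stand-ins so the sketch is self-contained:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for a real profiles/ directory (so this sketch runs anywhere)
PROFILE_DIR="$(mktemp -d)"
cat > "$PROFILE_DIR/g5-12xlarge.env" <<'EOF'
export ACTOR_STRATEGY=fsdp2
export TENSOR_PARALLEL_SIZE=4
export NCCL_PROTO=simple
EOF

# Explicit override wins; otherwise fall back to auto-detection (_detect.sh)
PROFILE_NAME="${INSTANCE_PROFILE:-g5-12xlarge}"
PROFILE_FILE="$PROFILE_DIR/$PROFILE_NAME.env"
[ -f "$PROFILE_FILE" ] || { echo "unknown profile: $PROFILE_NAME" >&2; exit 1; }

# shellcheck disable=SC1090
source "$PROFILE_FILE"
echo "profile=$PROFILE_NAME strategy=$ACTOR_STRATEGY tp=$TENSOR_PARALLEL_SIZE"
```

Failing fast on an unknown profile name (rather than silently running with p5en defaults) is deliberate: on a 24 GB instance, the wrong profile means an OOM several minutes into the run.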
+ +### Tier 3: Automated Multi-Instance Validation (HIGH effort, LONG-TERM ROI) + +**Goal**: CI validates test cases across multiple instance types. Catches +regressions before they reach users. + +**Deliverables**: +1. Smoke test framework — 2-step train + checkpoint for each config +2. CI matrix expansion (instance type × model size × framework) +3. Memory profiling artifacts (peak GPU at init/fwd/bwd/ckpt) +4. Performance baseline tracking + +**Smoke test spec**: +```yaml +# .github/test-matrix.yml +smoke_tests: + - test_case: pytorch/verl + configs: + - instance: p5en.48xlarge + model: Qwen3-8B + profile: p5en-48xlarge + steps: 2 + expected_peak_gpu_gb: 45 + - instance: g5.12xlarge + model: gpt-oss-20b + profile: g5-12xlarge + steps: 2 + expected_peak_gpu_gb: 22 + - test_case: pytorch/FSDP + configs: + - instance: p5en.48xlarge + model: llama3_1_70b + steps: 5 + - instance: g5.12xlarge + model: llama3_1_8b + steps: 5 +``` + +**CI expansion**: Extend `fsdp-regression-test-container.yml` pattern: +```yaml +strategy: + matrix: + cluster: [p5-eks, g5-eks, p4de-slurm] + model: [llama3_1_8b, llama3_1_70b] + exclude: + - cluster: g5-eks + model: llama3_1_70b # Won't fit without offloading +``` + +**Memory profiling**: Capture at each phase and store as artifact: +```python +# Inserted at key points in training loop +torch.cuda.synchronize() +peak = torch.cuda.max_memory_allocated() / 1e9 +print(f"MEMORY_PROFILE phase=init peak_gb={peak:.2f}") +torch.cuda.reset_peak_memory_stats() +``` + +**Effort**: ~2-4 weeks. Requires CI infra for non-P5 clusters. 
+ +## Implementation Order + +| Phase | Tier | Scope | Effort | Prereq | +|-------|------|-------|--------|--------| +| 1 | Tier 1 | Central docs + veRL READMEs | 2-3 days | None | +| 2 | Tier 1 | Remaining test case READMEs | 2-3 days | Phase 1 | +| 3 | Tier 2 | veRL profile system (template) | 2-3 days | Phase 1 | +| 4 | Tier 2 | FSDP profile system | 2-3 days | Phase 3 | +| 5 | Tier 2 | Remaining test cases | 1 week | Phase 4 | +| 6 | Tier 3 | Smoke test framework | 1 week | Phase 3 | +| 7 | Tier 3 | CI matrix expansion | 1-2 weeks | Phase 6 | + +## Key Lessons from veRL g5 Experience (informing the profiles) + +These are the specific parameters that differ by instance type, discovered +through 11 OOM iterations: + +1. **FSDP2 vs FSDP1**: FSDP1 explicitly disables CPUOffload for actor role. + On 24GB GPUs, this is fatal for models >10B params. + +2. **offload_policy=True**: FSDP2-specific flag. Without it, actor stays on GPU. + +3. **model_dtype=bf16**: veRL defaults actor to fp32. On 80GB GPUs this wastes + space; on 24GB GPUs it's an instant OOM. + +4. **gpu_memory_utilization**: Fraction of TOTAL GPU, not just KV cache. + 0.3 × 23GB = 6.9GB < 10GB model shard → OOM. + +5. **enforce_eager=True**: CUDA graphs need extra workspace → OOM on 24GB. + +6. **NCCL_PROTO=simple**: Required when FI_EFA_USE_DEVICE_RDMA=0 (g5, g6e). + +7. **Checkpoint size**: Full FSDP state for 20B model across 12 GPUs = 117GB. + save_freq=1 on 1.2TB FSx → disk full in 9 steps. + +8. **nnodes must exclude non-GPU head pod**: Ray head without GPUs in K8s + causes NCCL hang if included in nnodes count. + +9. **WORKER_MEMORY**: g5.12xlarge allocatable is ~168Gi, not 200Gi. Requesting + 200Gi causes pod scheduling failure. + +10. **expandable_segments incompatible with vLLM**: CuMemAllocator asserts. + +## How to Resume This Work in a New Session + +Paste this into a new OpenCode session: + +``` +I'm implementing the Instance Compatibility Framework for the +awsome-distributed-training repo. 
The full plan is at:
+- In-repo: docs/plans/instance-compatibility-framework.md
+- Local: /tmp/instance-compatibility-plan.md
+
+Start by reading the plan, then create branch
+feat/instance-compatibility-framework and begin Phase 1 (Tier 1):
+central docs + veRL README updates.
+
+Key context:
+- Repo: /Users/nchkumar/Code/smml-work/awsome-distributed-training/
+- The veRL test case at 3.test_cases/pytorch/verl/ already has g5 guidance
+  in its README (added this session)
+- Training job raysubmit_CQxnr9aau2TFZ3E2 may still be running on the
+  HyperPod cluster (check with kubectl)
+- The 6 key dimensions for instance profiles: GPU VRAM, GPUDirect RDMA,
+  EFA count, NVLink topology, Node CPU memory, Storage size
+```
+
+## Active Training Job (for monitoring)
+
+- Job ID: raysubmit_CQxnr9aau2TFZ3E2 (v12)
+- Config: FSDP2 + offload_policy + save_freq=20 + resume from step 9
+- Check: kubectl exec rayml-efa-head-hlgcf -- bash -c "ray job logs raysubmit_CQxnr9aau2TFZ3E2 2>&1 | grep -E 'step:' | tail -5"
+- Cluster cost: ~$28/hr (4× ml.g5.12xlarge)