13 changes: 13 additions & 0 deletions 3.test_cases/23.SMHP-esm2/README.md
@@ -1,5 +1,18 @@
# How to finetune ESM2 with SageMaker Hyperpod using Amazon G5 instances

## Tested Configurations

| Instance | GPUs | Model | Status | Notes |
|----------|------|-------|--------|-------|
| g5.24xlarge | 4 x A10G 24 GB | ESM2 150M | Tested | Primary target |
| g5.12xlarge | 4 x A10G 24 GB | ESM2 150M | Tested | See benchmark tables below |
| p5.48xlarge | 8 x H100 80 GB | ESM2 150M | Tested | See benchmark tables below |
| p4de.24xlarge | 8 x A100 80 GB | ESM2 | Untested | Expected to work |
| p5en.48xlarge | 8 x H200 141 GB | ESM2 | Untested | Expected to work |

> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
> for parameter adjustments needed across instance types.
Comment on lines +13 to +14 (Collaborator):
Broken relative link — one ../ too many

This file is at depth 2 under the repo root (3.test_cases/23.SMHP-esm2/), so it needs 2 ../ segments to reach the root. The current link uses 3, which resolves to the parent of the repo root.

The same issue likely affects 3.test_cases/jax/README.md. I'd suggest sweeping all 22 READMEs to verify each relative link depth.

Suggested change
> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
> for parameter adjustments needed across instance types.
> See the [Instance Compatibility Guide](../../docs/instance-compatibility.md)
> for parameter adjustments needed across instance types.
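
The suggested sweep could be sketched as a small script (hypothetical helper; assumes `README.md` files with repo-root-relative `../` link targets):

```python
import os
import re

# Matches markdown link targets that start with one or more "../"
LINK_RE = re.compile(r'\]\((\.\./[^)#\s]+)')

def broken_relative_links(repo_root):
    """Return (readme_path, link) pairs whose relative target does not exist."""
    bad = []
    for dirpath, _dirs, files in os.walk(repo_root):
        for name in files:
            if name != "README.md":
                continue
            readme = os.path.join(dirpath, name)
            with open(readme, encoding="utf-8") as fh:
                text = fh.read()
            for link in LINK_RE.findall(text):
                target = os.path.normpath(os.path.join(dirpath, link))
                if not os.path.exists(target):
                    bad.append((readme, link))
    return bad
```

Run from a checkout root, this would flag links like the one above whose resolved path escapes the repository.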


## What is SageMaker Hyperpod?
[Amazon SageMaker Hyperpod](https://aws.amazon.com/sagemaker/hyperpod/) offers advanced training tools to help you accelerate scalable, reliable, and secure generative AI application development. It removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure for training foundation models (FMs), significantly reducing training time. SageMaker Hyperpod ensures customers can continue FM training uninterrupted by periodically saving checkpoints. When a hardware failure occurs during training, SageMaker Hyperpod automatically detects the failure, repairs or replaces the faulty instance, and resumes training from the last saved checkpoint, removing the need for customers to manage this process manually and helping them train for weeks or months in a distributed setting without disruption.

12 changes: 12 additions & 0 deletions 3.test_cases/jax/README.md
@@ -2,6 +2,18 @@

This directory contains a sample Dockerfile, `jax_paxml.Dockerfile`, to run [JAX](https://github.com/google/jax) and [Paxml](https://github.com/google/paxml) on AWS.

## Tested Configurations

| Instance | GPUs | Status | Notes |
|----------|------|--------|-------|
| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work |
| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work |
| p4de.24xlarge | 8 x A100 80 GB | Untested | Expected to work |
| g5.12xlarge | 4 x A10G 24 GB | Untested | May need smaller model configs |

> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
> for parameter adjustments needed across instance types.

## Container description

In principle, the reference `Dockerfile` does the following:
34 changes: 32 additions & 2 deletions 3.test_cases/megatron/bionemo/2.esm1nv_pretrain.slurm
@@ -5,7 +5,37 @@
#SBATCH --exclusive # exclusive node access
#SBATCH --output slurm-esm1nv-train-%j.out

export FI_EFA_USE_HUGE_PAGE=0
###########################
###### Instance Profile ###
###########################
# Auto-detect instance type and source the matching profile.
# Profiles set: GPUS_PER_NODE, MICRO_BATCH_SIZE, EFA/NCCL vars.
# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch)
# See profiles/README.md for details.

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROFILES_DIR="${SCRIPT_DIR}/profiles"
PROFILE_LOADED=0

if [[ -d "$PROFILES_DIR" ]]; then
    if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then
        echo "Sourcing instance profile: $PROFILE_ENV"
        source "$PROFILE_ENV"
        PROFILE_LOADED=1
    else
        echo "WARNING: Profile detection failed. Using defaults (8 GPU, p4de-style)."
    fi
else
    echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, p4de-style)."
fi

# Fallback defaults when no profile is loaded (assumes P4de-class instance)
GPUS_PER_NODE=${GPUS_PER_NODE:-8}

# EFA — configured by profile or legacy default
if [[ "$PROFILE_LOADED" != "1" ]]; then
    export FI_EFA_USE_HUGE_PAGE=0
fi


###########################
@@ -26,7 +56,7 @@ declare -a ARGS=(

# Training parameters
# =========================
MICRO_BATCH_SIZE=256 # micro batch size per GPU, for best efficiency should be set to occupy ~85% of GPU memory. Suggested value for A100 80GB is 256
MICRO_BATCH_SIZE=${MICRO_BATCH_SIZE:-256} # micro batch size per GPU, for best efficiency should be set to occupy ~85% of GPU memory. Suggested value for A100 80GB is 256
ACCUMULATE_GRAD_BATCHES=1 # gradient accumulation
TENSOR_MODEL_PARALLEL_SIZE=1 # tensor model parallel size
VAL_CHECK_INTERVAL=500 # how often validation step is performed, including downstream task validation
12 changes: 12 additions & 0 deletions 3.test_cases/megatron/bionemo/README.md
@@ -17,6 +17,18 @@ NVIDIA BioNeMo is a domain-specific machine learning framework for training and
This project provides a guide to run [Nvidia's BioNemo](https://docs.nvidia.com/bionemo-framework/latest/index.html) on AWS ParallelCluster and pretrain the popular [ESM models](https://github.com/facebookresearch/esm), specifically the [ESM1nv](https://docs.nvidia.com/bionemo-framework/latest/notebooks/model_training_esm1nv.html) model.


## Tested Configurations

| Instance | GPUs | Status | Notes |
|----------|------|--------|-------|
| p4de.24xlarge | 8 x A100 80 GB | Tested | Primary target (4 nodes) |
| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work |
| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work |
| g5.12xlarge | 4 x A10G 24 GB | Untested | May need smaller model or offloading |

> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
> for parameter adjustments needed across instance types.

## 0. Prerequisites

0. You have access to the BioNeMo container. To get access to BioNeMo, visit the [information website](https://www.nvidia.com/en-us/clara/bionemo/).
40 changes: 35 additions & 5 deletions 3.test_cases/megatron/bionemo/bionemo_2.5/train-esm.sbatch
@@ -5,9 +5,39 @@
#SBATCH --exclusive # exclusive node access
#SBATCH --output slurm-esm2-train-%j.out

#export FI_EFA_USE_HUGE_PAGE=0 #Uncomment if you get os.fork() memory error
export FI_PROVIDER=efa
export NCCL_DEBUG=INFO
###########################
###### Instance Profile ###
###########################
# Auto-detect instance type and source the matching profile.
# Profiles set: GPUS_PER_NODE, EFA/NCCL vars.
# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch)
# See ../profiles/README.md for details.

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROFILES_DIR="${SCRIPT_DIR}/../profiles"
PROFILE_LOADED=0

if [[ -d "$PROFILES_DIR" ]]; then
    if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then
        echo "Sourcing instance profile: $PROFILE_ENV"
        source "$PROFILE_ENV"
        PROFILE_LOADED=1
    else
        echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)."
    fi
else
    echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)."
fi

# Fallback defaults when no profile is loaded
GPUS_PER_NODE=${GPUS_PER_NODE:-8}

# EFA — configured by profile or legacy defaults
if [[ "$PROFILE_LOADED" != "1" ]]; then
    #export FI_EFA_USE_HUGE_PAGE=0 #Uncomment if you get os.fork() memory error
    export FI_PROVIDER=efa
    export NCCL_DEBUG=INFO
fi

#Path to store data and checkpoints
export DATA_HOME_DIR=/fsxl/awsankur/bionemo
@@ -36,8 +66,8 @@ srun -l "${ARGS[@]}" python3 /workspace/bionemo2/sub-packages/bionemo-esm2/src/
--valid-cluster-path ${DATA_DIR}/2024_03_sanity/valid_clusters.parquet \
--valid-database-path ${DATA_DIR}/2024_03_sanity/validation.db \
--precision="bf16-mixed" \
--num-gpus 8 \
--num-nodes 2 \
--num-gpus ${GPUS_PER_NODE} \
Collaborator comment:
Good fix — hardcoded values replaced with variables

Replacing --num-gpus 8 and --num-nodes 2 with ${GPUS_PER_NODE} and ${SLURM_JOB_NUM_NODES} is a valuable standalone fix. Like the $NODES fix above, I'd suggest extracting this into a separate PR.

--num-nodes ${SLURM_JOB_NUM_NODES} \
--num-steps 100 \
--val-check-interval 25 \
--max-seq-length 1024 \
46 changes: 46 additions & 0 deletions 3.test_cases/megatron/bionemo/profiles/README.md
@@ -0,0 +1,46 @@
# BioNeMo Instance Profiles

Instance profiles configure GPU count, micro-batch size, and EFA/NCCL
networking variables for each supported EC2 instance type. Model architecture
parameters (num_layers, hidden_size, etc.) are handled by the training scripts
or BioNeMo config files.

## Auto-detection

The training scripts auto-detect the running instance type and source the
matching `.env` profile. Override with:

```bash
export INSTANCE_PROFILE=g5-12xlarge
```

See [docs/instance-compatibility.md](../../../docs/instance-compatibility.md)
for full details.
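
A minimal consumer of this mechanism, mirroring the pattern used in the training scripts (a sketch, not the exact script code; the fallback of 8 GPUs assumes a p4de-class node):

```shell
#!/usr/bin/env bash
# Resolve and source an instance profile; fall back to p4de-style
# defaults when the profiles/ directory or detection is unavailable.
PROFILES_DIR="./profiles"
PROFILE_LOADED=0

if [[ -d "$PROFILES_DIR" ]] && PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "$PROFILES_DIR" 2>/dev/null); then
    source "$PROFILE_ENV"
    PROFILE_LOADED=1
fi

# The profile (if one was sourced) wins; otherwise assume 8 GPUs per node.
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
echo "GPUS_PER_NODE=${GPUS_PER_NODE} (profile loaded: ${PROFILE_LOADED})"
```

Because the profile only `export`s variables, any value set before this block (e.g. in `env_vars`) is overridden by the profile, while the `${VAR:-default}` fallback applies only when nothing set it.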

## Available Profiles

| Profile | Instance | GPUs | VRAM | EFA | Default MBS | Status |
|---------|----------|------|------|-----|-------------|--------|
| `p5en-48xlarge.env` | p5en.48xlarge | 8x H200 | 141 GB | 32 adapters | 256 | Supported |
| `p5-48xlarge.env` | p5.48xlarge | 8x H100 | 80 GB | 32 adapters | 256 | Supported |
| `p4de-24xlarge.env` | p4de.24xlarge | 8x A100 | 80 GB | 4 adapters | 256 | Supported (original target) |
| `g6e-12xlarge.env` | g6e.12xlarge | 4x L40S | 48 GB | None | 128 | Experimental |
| `g5-12xlarge.env` | g5.12xlarge | 4x A10G | 24 GB | None | 64 | Experimental |

## Model Compatibility

### ESM-1nv (BioNeMo 1.2, `2.esm1nv_pretrain.slurm`)

The key tunable is `MICRO_BATCH_SIZE`, which for best efficiency should occupy
~85% of GPU memory; the suggested value on A100 80GB is 256. Profile-sourced MBS values:

| Instance | VRAM | Profile MBS | Notes |
|----------|------|-------------|-------|
| p5en/p5/p4de | 80-141 GB | 256 | Original documented value |
| g6e | 48 GB | 128 | Estimated; tune based on actual usage |
| g5 | 24 GB | 64 | Estimated; may need further reduction |
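
The rule behind these estimates can be sketched as linear-in-VRAM scaling from the documented A100 value, rounded down to a power of two (a heuristic consistent with the table above, not a guarantee against OOM):

```python
def suggest_mbs(vram_gb, ref_mbs=256, ref_vram_gb=80.0):
    """Scale micro-batch size linearly with per-GPU VRAM, rounded down
    to a power of two as a conservative starting point."""
    scaled = ref_mbs * (vram_gb / ref_vram_gb)
    mbs = 1
    while mbs * 2 <= scaled:
        mbs *= 2
    return mbs

# A10G 24 GB -> 64, L40S 48 GB -> 128, A100/H100 80 GB -> 256
```

Actual memory use depends on sequence length and activation checkpointing, so treat the result as a starting point and tune from observed utilization.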

### ESM-2 (BioNeMo 2.5, `bionemo_2.5/train-esm.sbatch`)

ESM-2 uses a fixed MBS=2 with the 650M-parameter model, which fits on all
instance types. The profile's `GPUS_PER_NODE` adjusts `--num-gpus` and the
SBATCH `--gpus-per-node` directive.
95 changes: 95 additions & 0 deletions 3.test_cases/megatron/bionemo/profiles/_detect.sh
@@ -0,0 +1,95 @@
#!/usr/bin/env bash
Collaborator comment:
Remove _detect.sh and all profiles/ directories

This 95-line auto-detection script (IMDSv2 → env var → nvidia-smi fallback) is copied identically into 9 test case profiles/ directories, with a sync_profiles.sh to keep them in sync. This is infrastructure for infrastructure.

The detection logic is well-engineered, but it adds complexity that doesn't belong in a repo of reference training scripts. Users should be editing their scripts directly — that's the point of reference architectures.

I'd suggest removing:

  • All profiles/ directories (9 test cases)
  • All _detect.sh copies
  • 3.test_cases/shared/instance_detect.sh
  • 3.test_cases/shared/sync_profiles.sh
  • All .env profile files

The useful content currently in .env files (NCCL flags, model fit notes, VRAM estimates) should be moved to README docs and inline script comments.

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
# instance_detect.sh — Auto-detect EC2 instance type and resolve a profile.
#
# CANONICAL SOURCE: This is the single source of truth for instance detection.
# Copies exist as profiles/_detect.sh in each test case. To update
# all copies, edit this file and run: ./sync_profiles.sh
#
# Usage (from a training script):
# PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles")
# source "$PROFILE_ENV"
#
# Detection order:
# 1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge")
# 2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge")
# 3. EC2 instance metadata API (works on bare metal and K8s with host networking)
# 4. GPU name from nvidia-smi (fallback when metadata is unavailable)
#
# Outputs the path to the profile .env file on stdout.
# Exits non-zero if no profile can be resolved.
# ---------------------------------------------------------------------------
set -euo pipefail

PROFILES_DIR="${1:-.}"

# --- Step 1: Check for explicit INSTANCE_PROFILE override -------------------
if [[ -n "${INSTANCE_PROFILE:-}" ]]; then
    PROFILE_NAME="$INSTANCE_PROFILE"
    echo "Instance profile override: ${PROFILE_NAME}" >&2
else
    # --- Step 2: Try INSTANCE_TYPE from env_vars ----------------------------
    INSTANCE_TYPE="${INSTANCE_TYPE:-}"

    # --- Step 3: Try EC2 instance metadata API ------------------------------
    if [[ -z "$INSTANCE_TYPE" ]]; then
        # IMDSv2: get a token first, then query
        TOKEN=$(curl -s --connect-timeout 2 -X PUT \
            "http://169.254.169.254/latest/api/token" \
            -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true
        if [[ -n "$TOKEN" ]]; then
            INSTANCE_TYPE=$(curl -s --connect-timeout 2 \
                -H "X-aws-ec2-metadata-token: $TOKEN" \
                "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true
        fi
    fi

    # --- Step 4: Fallback — detect from GPU name ----------------------------
    if [[ -z "$INSTANCE_TYPE" ]]; then
        if command -v nvidia-smi &>/dev/null; then
            GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) || true
            case "${GPU_NAME:-}" in
                *A10G*)    INSTANCE_TYPE="g5.12xlarge" ;;
                *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;;
                *A100*40*) INSTANCE_TYPE="p4d.24xlarge" ;;
                *H100*)    INSTANCE_TYPE="p5.48xlarge" ;;
                *H200*)    INSTANCE_TYPE="p5en.48xlarge" ;;
                *L40S*)    INSTANCE_TYPE="g6e.12xlarge" ;;
                *L4*)      INSTANCE_TYPE="g6.12xlarge" ;;
                *)
                    echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2
                    echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2
                    exit 1
                    ;;
            esac
            echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2
        else
            echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2
            echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2
            exit 1
        fi
    fi

    # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge"
    PROFILE_NAME="${INSTANCE_TYPE//./-}"
    echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2
fi

# --- Resolve profile file path ----------------------------------------------
PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env"

if [[ ! -f "$PROFILE_PATH" ]]; then
    echo "ERROR: No profile found at ${PROFILE_PATH}" >&2
    echo "" >&2
    echo "Available profiles:" >&2
    ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\//  /' >&2 || echo "  (none)" >&2
    echo "" >&2
    echo "To create a new profile, copy an existing one and adjust the values:" >&2
    echo "  cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2
    exit 1
fi

# Output the resolved path (recipe script will source it)
echo "$PROFILE_PATH"
26 changes: 26 additions & 0 deletions 3.test_cases/megatron/bionemo/profiles/g5-12xlarge.env
@@ -0,0 +1,26 @@
# g5.12xlarge — 4x A10G 24GB, no EFA, no NVLink, no GPUDirect RDMA
# Severely memory-constrained for BioNeMo. ESM-1nv with MBS=256 will OOM.
#
# MODEL COMPATIBILITY (g5.12xlarge, 4x A10G 24GB each):
# - ESM-1nv (pretrain_small): Must reduce MBS dramatically (try MBS=32-64).
# The original script says "A100 80GB → 256". 24GB is ~3.3x less VRAM,
# so MBS ~64-80 may fit. Start with 64 and adjust.
# - ESM-2 (650M, BioNeMo 2.5): MBS=2 should fit (small model).
#
# Key differences from p4de/p5/p5en:
# - 4 GPUs instead of 8
# - No EFA
# - 24GB VRAM → ESM-1nv micro batch size must be reduced

# --- Hardware ---
export GPUS_PER_NODE=4

# --- Training defaults ---
# Reduced MBS for 24GB VRAM. This is a starting point — tune based on
# actual memory usage. ESM-1nv may still OOM; try reducing further to 32.
export MICRO_BATCH_SIZE=64

# --- EFA / NCCL ---
# No EFA on g5 — do NOT set FI_PROVIDER or FI_EFA_* variables.
export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth"
export NCCL_DEBUG=INFO
20 changes: 20 additions & 0 deletions 3.test_cases/megatron/bionemo/profiles/g6e-12xlarge.env
@@ -0,0 +1,20 @@
# g6e.12xlarge — 4x L40S 48GB, no EFA, no NVLink, no GPUDirect RDMA
# Moderate VRAM; ESM-1nv MBS can be higher than g5 but lower than p4de.
#
# MODEL COMPATIBILITY (g6e.12xlarge, 4x L40S 48GB each):
# - ESM-1nv (pretrain_small): MBS ~128-160 may fit (48GB vs 80GB).
# Start with 128 and adjust upward.
# - ESM-2 (650M, BioNeMo 2.5): MBS=2 fits easily.

# --- Hardware ---
export GPUS_PER_NODE=4

# --- Training defaults ---
# Scaled MBS for 48GB VRAM (60% of A100's 80GB → ~60% of 256 ≈ 150).
# Start conservative at 128.
export MICRO_BATCH_SIZE=128

# --- EFA / NCCL ---
# No EFA on g6e — do NOT set FI_PROVIDER or FI_EFA_* variables.
export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth"
export NCCL_DEBUG=INFO
19 changes: 19 additions & 0 deletions 3.test_cases/megatron/bionemo/profiles/p4de-24xlarge.env
@@ -0,0 +1,19 @@
# p4de.24xlarge — 8x A100 80GB, 4 EFA, NVLink, GPUDirect RDMA
# This is the primary target instance for BioNeMo. The original scripts
# were written for 4x p4de.24xlarge nodes.
#
# MODEL ASSUMPTIONS:
# ESM-1nv: "Suggested value for A100 80GB is 256" (micro batch size)
# ESM-2 (650M): MBS=2 is the documented value

# --- Hardware ---
export GPUS_PER_NODE=8

# --- Training defaults ---
export MICRO_BATCH_SIZE=256

# --- EFA / NCCL ---
export FI_PROVIDER=efa
export FI_EFA_USE_HUGE_PAGE=0
export NCCL_SOCKET_IFNAME="^docker,lo,veth"
export NCCL_DEBUG=INFO
18 changes: 18 additions & 0 deletions 3.test_cases/megatron/bionemo/profiles/p5-48xlarge.env
@@ -0,0 +1,18 @@
# p5.48xlarge — 8x H100 80GB, 32 EFA, NVLink, GPUDirect RDMA
#
# MODEL ASSUMPTIONS:
# ESM-1nv: MBS=256 fits (same 80GB VRAM as A100)
# ESM-2 (650M): MBS=2

# --- Hardware ---
export GPUS_PER_NODE=8

# --- Training defaults ---
export MICRO_BATCH_SIZE=256

# --- EFA / NCCL ---
export FI_PROVIDER=efa
export FI_EFA_USE_HUGE_PAGE=0
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
export NCCL_SOCKET_IFNAME="^docker,lo,veth"
export NCCL_DEBUG=INFO