feat: Instance Compatibility Framework — multi-instance profiles and documentation for all test cases #1015
base: main
Changes from all commits: `9b75ae9`, `e140a26`, `80f7437`, `8e51ec7`, `e75d50a`, `1a6bd23`, `54f44df`
```diff
@@ -5,9 +5,39 @@
 #SBATCH --exclusive # exclusive node access
 #SBATCH --output slurm-esm2-train-%j.out
 
-#export FI_EFA_USE_HUGE_PAGE=0 #Uncomment if you get os.fork() memory error
-export FI_PROVIDER=efa
-export NCCL_DEBUG=INFO
+###########################
+###### Instance Profile ###
+###########################
+# Auto-detect instance type and source the matching profile.
+# Profiles set: GPUS_PER_NODE, EFA/NCCL vars.
+# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch)
+# See ../profiles/README.md for details.
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROFILES_DIR="${SCRIPT_DIR}/../profiles"
+PROFILE_LOADED=0
+
+if [[ -d "$PROFILES_DIR" ]]; then
+    if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then
+        echo "Sourcing instance profile: $PROFILE_ENV"
+        source "$PROFILE_ENV"
+        PROFILE_LOADED=1
+    else
+        echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)."
+    fi
+else
+    echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)."
+fi
+
+# Fallback defaults when no profile is loaded
+GPUS_PER_NODE=${GPUS_PER_NODE:-8}
+
+# EFA — configured by profile or legacy defaults
+if [[ "$PROFILE_LOADED" != "1" ]]; then
+    #export FI_EFA_USE_HUGE_PAGE=0 #Uncomment if you get os.fork() memory error
+    export FI_PROVIDER=efa
+    export NCCL_DEBUG=INFO
+fi
 
 #Path to store data and checkpoints
 export DATA_HOME_DIR=/fsxl/awsankur/bionemo
```
```diff
@@ -36,8 +66,8 @@ srun -l "${ARGS[@]}" python3 /workspace/bionemo2/sub-packages/bionemo-esm2/src/
     --valid-cluster-path ${DATA_DIR}/2024_03_sanity/valid_clusters.parquet \
     --valid-database-path ${DATA_DIR}/2024_03_sanity/validation.db \
     --precision="bf16-mixed" \
-    --num-gpus 8 \
-    --num-nodes 2 \
+    --num-gpus ${GPUS_PER_NODE} \
+    --num-nodes ${SLURM_JOB_NUM_NODES} \
     --num-steps 100 \
     --val-check-interval 25 \
     --max-seq-length 1024 \
```

> **Collaborator:** Good fix — hardcoded values replaced with variables. Replacing …
@@ -0,0 +1,46 @@

# BioNeMo Instance Profiles

Instance profiles configure GPU count, micro-batch size, and EFA/NCCL
networking variables for each supported EC2 instance type. Model architecture
parameters (num_layers, hidden_size, etc.) are handled by the training scripts
or BioNeMo config files.

## Auto-detection

The training scripts auto-detect the running instance type and source the
matching `.env` profile. Override with:

```bash
export INSTANCE_PROFILE=g5-12xlarge
```

See [docs/instance-compatibility.md](../../../docs/instance-compatibility.md)
for full details.
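The override path in particular is easy to demonstrate. Below is a minimal, self-contained sketch: the temporary directory and the profile contents are illustrative stand-ins, not the shipped files.

```bash
# Simulate the profiles/ directory with a throwaway copy, then resolve a
# profile purely from the INSTANCE_PROFILE override (no metadata lookup).
set -euo pipefail

PROFILES_DIR="$(mktemp -d)"
printf 'export GPUS_PER_NODE=4\nexport MICRO_BATCH_SIZE=64\n' \
  > "${PROFILES_DIR}/g5-12xlarge.env"

export INSTANCE_PROFILE=g5-12xlarge
PROFILE_PATH="${PROFILES_DIR}/${INSTANCE_PROFILE}.env"
# The real _detect.sh echoes this path; the training script then sources it.
source "$PROFILE_PATH"
echo "GPUS_PER_NODE=${GPUS_PER_NODE} MICRO_BATCH_SIZE=${MICRO_BATCH_SIZE}"
```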
## Available Profiles

| Profile | Instance | GPUs | VRAM | EFA | Default MBS | Status |
|---------|----------|------|------|-----|-------------|--------|
| `p5en-48xlarge.env` | p5en.48xlarge | 8x H200 | 141 GB | 32 adapters | 256 | Supported |
| `p5-48xlarge.env` | p5.48xlarge | 8x H100 | 80 GB | 32 adapters | 256 | Supported |
| `p4de-24xlarge.env` | p4de.24xlarge | 8x A100 | 80 GB | 4 adapters | 256 | Supported (original target) |
| `g6e-12xlarge.env` | g6e.12xlarge | 4x L40S | 48 GB | None | 128 | Experimental |
| `g5-12xlarge.env` | g5.12xlarge | 4x A10G | 24 GB | None | 64 | Experimental |

## Model Compatibility

### ESM-1nv (BioNeMo 1.2, `2.esm1nv_pretrain.slurm`)

The key tunable is `MICRO_BATCH_SIZE`, which occupies ~85% of GPU memory at 256
on A100 80GB. Profile-sourced MBS values:

| Instance | VRAM | Profile MBS | Notes |
|----------|------|-------------|-------|
| p5en/p5/p4de | 80-141 GB | 256 | Original documented value |
| g6e | 48 GB | 128 | Estimated; tune based on actual usage |
| g5 | 24 GB | 64 | Estimated; may need further reduction |
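The scaling in the table above can be sketched as a back-of-envelope rule. The power-of-two rounding is an assumption of this sketch (the table itself only documents 256/128/64); treat the result as a starting point, not a guarantee against OOM.

```bash
# Scale the documented A100-80GB micro batch size (256) by the VRAM ratio,
# then round down to a power of two. Heuristic only -- verify with real runs.
scale_mbs() {
  local vram_gb=$1
  local ref_mbs=256 ref_vram=80
  local raw=$(( ref_mbs * vram_gb / ref_vram ))
  local p=1
  while (( p * 2 <= raw )); do p=$(( p * 2 )); done
  echo "$p"
}

scale_mbs 80   # A100/H100 80 GB -> 256
scale_mbs 48   # L40S 48 GB      -> 128
scale_mbs 24   # A10G 24 GB      -> 64
```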
### ESM-2 (BioNeMo 2.5, `bionemo_2.5/train-esm.sbatch`)

Uses a fixed MBS=2 with the 650M-parameter model, which fits on all supported
instance types. The profile's `GPUS_PER_NODE` adjusts `--num-gpus` and the
SBATCH `--gpus-per-node` directive.
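The flag wiring described above can be sketched end to end. The fallback of 8 mirrors the original hardcoded value, and `SLURM_JOB_NUM_NODES` is only set by Slurm inside a job; the 2 here is a demo default matching the old `--num-nodes 2`.

```bash
# Build the launch flags from profile-sourced values, with the original
# hardcoded numbers as fallbacks. Unset first so the demo is deterministic.
unset GPUS_PER_NODE SLURM_JOB_NUM_NODES
GPUS_PER_NODE=${GPUS_PER_NODE:-8}      # a g5/g6e profile would export 4
NUM_NODES=${SLURM_JOB_NUM_NODES:-2}    # provided by Slurm inside a job
echo "--num-gpus ${GPUS_PER_NODE} --num-nodes ${NUM_NODES}"
```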
@@ -0,0 +1,95 @@

> **Collaborator** (on the shebang line): Remove …

```bash
#!/usr/bin/env bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
# instance_detect.sh — Auto-detect EC2 instance type and resolve a profile.
#
# CANONICAL SOURCE: This is the single source of truth for instance detection.
# Copies exist in each test case's profiles/_detect.sh directory. To update
# all copies, edit this file and run: ./sync_profiles.sh
#
# Usage (from a training script):
#   PROFILE_ENV=$("path/to/profiles/_detect.sh" "path/to/profiles")
#   source "$PROFILE_ENV"
#
# Detection order:
#   1. INSTANCE_PROFILE env var (explicit override, e.g. "g5-12xlarge")
#   2. INSTANCE_TYPE env var (from env_vars, e.g. "g5.12xlarge")
#   3. EC2 instance metadata API (works on bare metal and K8s with host networking)
#   4. GPU name from nvidia-smi (fallback when metadata is unavailable)
#
# Outputs the path to the profile .env file on stdout.
# Exits non-zero if no profile can be resolved.
# ---------------------------------------------------------------------------
set -euo pipefail

PROFILES_DIR="${1:-.}"

# --- Step 1: Check for explicit INSTANCE_PROFILE override -------------------
if [[ -n "${INSTANCE_PROFILE:-}" ]]; then
    PROFILE_NAME="$INSTANCE_PROFILE"
    echo "Instance profile override: ${PROFILE_NAME}" >&2
else
    # --- Step 2: Try INSTANCE_TYPE from env_vars ----------------------------
    INSTANCE_TYPE="${INSTANCE_TYPE:-}"

    # --- Step 3: Try EC2 instance metadata API ------------------------------
    if [[ -z "$INSTANCE_TYPE" ]]; then
        # IMDSv2: get a token first, then query
        TOKEN=$(curl -s --connect-timeout 2 -X PUT \
            "http://169.254.169.254/latest/api/token" \
            -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true
        if [[ -n "$TOKEN" ]]; then
            INSTANCE_TYPE=$(curl -s --connect-timeout 2 \
                -H "X-aws-ec2-metadata-token: $TOKEN" \
                "http://169.254.169.254/latest/meta-data/instance-type" 2>/dev/null) || true
        fi
    fi

    # --- Step 4: Fallback — detect from GPU name ----------------------------
    if [[ -z "$INSTANCE_TYPE" ]]; then
        if command -v nvidia-smi &>/dev/null; then
            GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) || true
            case "${GPU_NAME:-}" in
                *A10G*)    INSTANCE_TYPE="g5.12xlarge" ;;
                *A100*80*) INSTANCE_TYPE="p4de.24xlarge" ;;
                *A100*40*) INSTANCE_TYPE="p4d.24xlarge" ;;
                *H100*)    INSTANCE_TYPE="p5.48xlarge" ;;
                *H200*)    INSTANCE_TYPE="p5en.48xlarge" ;;
                *L40S*)    INSTANCE_TYPE="g6e.12xlarge" ;;
                *L4*)      INSTANCE_TYPE="g6.12xlarge" ;;
                *)
                    echo "ERROR: Could not determine instance type from GPU: '${GPU_NAME:-unknown}'" >&2
                    echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2
                    exit 1
                    ;;
            esac
            echo "Detected GPU '${GPU_NAME}' -> assuming ${INSTANCE_TYPE}" >&2
        else
            echo "ERROR: Cannot detect instance type (no metadata, no nvidia-smi)." >&2
            echo "Set INSTANCE_TYPE or INSTANCE_PROFILE in env_vars." >&2
            exit 1
        fi
    fi

    # Convert instance type to profile name: "g5.12xlarge" -> "g5-12xlarge"
    PROFILE_NAME="${INSTANCE_TYPE//./-}"
    echo "Detected instance type: ${INSTANCE_TYPE} -> profile: ${PROFILE_NAME}" >&2
fi

# --- Resolve profile file path ----------------------------------------------
PROFILE_PATH="${PROFILES_DIR}/${PROFILE_NAME}.env"

if [[ ! -f "$PROFILE_PATH" ]]; then
    echo "ERROR: No profile found at ${PROFILE_PATH}" >&2
    echo "" >&2
    echo "Available profiles:" >&2
    ls -1 "${PROFILES_DIR}"/*.env 2>/dev/null | sed 's/.*\//  /' >&2 || echo "  (none)" >&2
    echo "" >&2
    echo "To create a new profile, copy an existing one and adjust the values:" >&2
    echo "  cp ${PROFILES_DIR}/p5en-48xlarge.env ${PROFILE_PATH}" >&2
    exit 1
fi

# Output the resolved path (recipe script will source it)
echo "$PROFILE_PATH"
```
@@ -0,0 +1,26 @@

```bash
# g5.12xlarge — 4x A10G 24GB, no EFA, no NVLink, no GPUDirect RDMA
# Severely memory-constrained for BioNeMo. ESM-1nv with MBS=256 will OOM.
#
# MODEL COMPATIBILITY (g5.12xlarge, 4x A10G 24GB each):
#   - ESM-1nv (pretrain_small): Must reduce MBS dramatically (try MBS=32-64).
#     The original script says "A100 80GB → 256". 24GB is ~3.3x less VRAM,
#     so MBS ~64-80 may fit. Start with 64 and adjust.
#   - ESM-2 (650M, BioNeMo 2.5): MBS=2 should fit (small model).
#
# Key differences from p4de/p5/p5en:
#   - 4 GPUs instead of 8
#   - No EFA
#   - 24GB VRAM → ESM-1nv micro batch size must be reduced

# --- Hardware ---
export GPUS_PER_NODE=4

# --- Training defaults ---
# Reduced MBS for 24GB VRAM. This is a starting point — tune based on
# actual memory usage. ESM-1nv may still OOM; try reducing further to 32.
export MICRO_BATCH_SIZE=64

# --- EFA / NCCL ---
# No EFA on g5 — do NOT set FI_PROVIDER or FI_EFA_* variables.
export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth"
export NCCL_DEBUG=INFO
```
@@ -0,0 +1,20 @@

```bash
# g6e.12xlarge — 4x L40S 48GB, no EFA, no NVLink, no GPUDirect RDMA
# Moderate VRAM; ESM-1nv MBS can be higher than g5 but lower than p4de.
#
# MODEL COMPATIBILITY (g6e.12xlarge, 4x L40S 48GB each):
#   - ESM-1nv (pretrain_small): MBS ~128-160 may fit (48GB vs 80GB).
#     Start with 128 and adjust upward.
#   - ESM-2 (650M, BioNeMo 2.5): MBS=2 fits easily.

# --- Hardware ---
export GPUS_PER_NODE=4

# --- Training defaults ---
# Scaled MBS for 48GB VRAM (60% of A100's 80GB → ~60% of 256 ≈ 150).
# Start conservative at 128.
export MICRO_BATCH_SIZE=128

# --- EFA / NCCL ---
# No EFA on g6e — do NOT set FI_PROVIDER or FI_EFA_* variables.
export NCCL_SOCKET_IFNAME="^docker,lo,veth,eth"
export NCCL_DEBUG=INFO
```
@@ -0,0 +1,19 @@

```bash
# p4de.24xlarge — 8x A100 80GB, 4 EFA, NVLink, GPUDirect RDMA
# This is the primary target instance for BioNeMo. The original scripts
# were written for 4x p4de.24xlarge nodes.
#
# MODEL ASSUMPTIONS:
#   ESM-1nv: "Suggested value for A100 80GB is 256" (micro batch size)
#   ESM-2 (650M): MBS=2 is the documented value

# --- Hardware ---
export GPUS_PER_NODE=8

# --- Training defaults ---
export MICRO_BATCH_SIZE=256

# --- EFA / NCCL ---
export FI_PROVIDER=efa
export FI_EFA_USE_HUGE_PAGE=0
export NCCL_SOCKET_IFNAME="^docker,lo,veth"
export NCCL_DEBUG=INFO
```
@@ -0,0 +1,18 @@

```bash
# p5.48xlarge — 8x H100 80GB, 32 EFA, NVLink, GPUDirect RDMA
#
# MODEL ASSUMPTIONS:
#   ESM-1nv: MBS=256 fits (same 80GB VRAM as A100)
#   ESM-2 (650M): MBS=2

# --- Hardware ---
export GPUS_PER_NODE=8

# --- Training defaults ---
export MICRO_BATCH_SIZE=256

# --- EFA / NCCL ---
export FI_PROVIDER=efa
export FI_EFA_USE_HUGE_PAGE=0
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
export NCCL_SOCKET_IFNAME="^docker,lo,veth"
export NCCL_DEBUG=INFO
```
> **Collaborator:** Broken relative link — one `../` too many. This file is at
> depth 2 under the repo root (`3.test_cases/23.SMHP-esm2/`), so it needs 2
> `../` segments to reach the root. The current link uses 3, which resolves to
> the parent of the repo root. The same issue likely affects
> `3.test_cases/jax/README.md`. I'd suggest sweeping all 22 READMEs to verify
> each relative link depth.
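The suggested sweep could start from a one-liner like the following. This is a hypothetical helper, not part of the PR, and the demo tree merely stands in for the real repository layout.

```bash
# List every "../" chain found in README.md files together with its length,
# so each link's depth can be compared against the file's actual depth.
tmp=$(mktemp -d)
mkdir -p "$tmp/3.test_cases/23.SMHP-esm2/profiles"
printf 'See [docs](../../../docs/instance-compatibility.md)\n' \
  > "$tmp/3.test_cases/23.SMHP-esm2/profiles/README.md"

grep -RonE '(\.\./)+' --include='README.md' "$tmp" \
  | awk '{s=$0; n=gsub(/\.\.\//,"",s); print $0 " depth=" n}'
```

Each output line ends in `depth=N`; a link is suspect when `N` does not match the README's distance from the repo root.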