
feat: Instance Compatibility Framework — multi-instance profiles and documentation for all test cases #1015

Open
nkumaraws wants to merge 7 commits into main from feat/instance-compatibility-framework

Conversation

@nkumaraws
Contributor

Summary

Adds a comprehensive Instance Compatibility Framework that enables test cases to run across different EC2 accelerated instance types (p5en, p5, p4de, g5, g6e, trn1) with correct parameters. Previously, most test cases documented only P5 configurations, leaving users to guess parameters for other instances.

  • Central documentation: Hardware compatibility matrix, 6-dimension tuning guide (GPU VRAM, GPUDirect RDMA, EFA count, NVLink topology, CPU memory, storage), per-instance profile pages, and lessons learned
  • "Tested Configurations" tables: Added to 22 READMEs across all test cases showing which instance types are supported and at what validation level
  • Parameterized instance profiles: Auto-detecting .env profile system for 9 test cases (veRL, FSDP, torchtitan, nanoVLM, TRL, Megatron-LM, BioNeMo, NeMo, Megatron-DeepSpeed) that adjusts GPU count, EFA config, TP/PP, batch sizes, memory settings, and NCCL tuning per instance
  • Shared utilities: Canonical instance_detect.sh in 3.test_cases/shared/, with a sync script keeping the 9 per-test-case _detect.sh copies in sync
  • Live validated: Megatron-LM GPT3 1.7B successfully trained on g5.12xlarge EKS cluster (2 nodes, 8 GPUs, TP=4, 5 iterations, no OOM)

Motivation

10 of ~20 test cases had zero instance type guidance. Users attempting to run on non-P5 instances (especially g5 and g6e with less VRAM and no NVLink) would hit OOM errors, EFA misconfigurations, or silent performance degradation with no documentation to help them.

Changes

Tier 1 — Documentation (Phases 1-2)

  • docs/instance-compatibility.md — central reference with hardware matrix and full test case compatibility table
  • docs/instance-profiles/ — per-instance guides (g5, g6e, p4de, p5, p5en, trn1)
  • docs/plans/instance-compatibility-framework.md — implementation plan
  • 22 README files updated with "Tested Configurations" tables

Tier 2 — Parameterized Profiles (Phases 3-5)

  • veRL: 3 profiles (p5en, p4de, g5) with Ray actor scaling, VRAM-aware batch sizes, conditional --enforce-eager for g5
  • FSDP: 5 profiles with conditional cpu_offload based on model layer count thresholds (g5=32, g6e=40 layers)
  • torchtitan, nanoVLM, TRL: 4 profiles each with EFA/NCCL tuner settings
  • Megatron-LM: 5 profiles with TP/PP defaults, fixed $NODES bug ($SLURM_JOB_NUM_NODES)
  • BioNeMo: 5 profiles with MICRO_BATCH_SIZE scaled per VRAM (256→128/64)
  • NeMo: 5 env_vars_*.json profiles (no script modifications — uses existing --env_vars_file CLI arg)
  • Megatron-DeepSpeed: 5 profiles with TP/PP/MBS/GBS scaled for Llama2-7B

Shared Infrastructure

  • 3.test_cases/shared/instance_detect.sh — canonical auto-detection script (IMDSv2 → INSTANCE_TYPE env → nvidia-smi GPU name)
  • 3.test_cases/shared/sync_profiles.sh — keeps 9 test case copies in sync
  • 3.test_cases/shared/README.md — documents the shared utilities
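
The detection chain described above (IMDSv2 → INSTANCE_TYPE env var → nvidia-smi GPU name) can be sketched roughly as follows. This is a hedged illustration, not the actual instance_detect.sh — the IMDS_HOST override and the GPU-name-to-instance map are assumptions added for the example:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the three-step detection order; the real
# shared script may differ in names and mappings.
detect_instance_type() {
  local imds="${IMDS_HOST:-169.254.169.254}" token itype
  # 1. IMDSv2 session-token flow; short timeouts so it fails fast off-EC2
  if command -v curl >/dev/null 2>&1; then
    token=$(curl -sS -m 2 -X PUT "http://${imds}/latest/api/token" \
              -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null) || true
    if [ -n "$token" ]; then
      itype=$(curl -sS -m 2 -H "X-aws-ec2-metadata-token: $token" \
                "http://${imds}/latest/meta-data/instance-type" 2>/dev/null) || true
      [ -n "$itype" ] && { echo "$itype"; return 0; }
    fi
  fi
  # 2. Explicit override, e.g. containers without IMDS access
  [ -n "${INSTANCE_TYPE:-}" ] && { echo "$INSTANCE_TYPE"; return 0; }
  # 3. Last resort: infer the family from the GPU name (illustrative map)
  if command -v nvidia-smi >/dev/null 2>&1; then
    case "$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1)" in
      *A10G*) echo "g5.12xlarge"   ;;
      *L40S*) echo "g6e.12xlarge"  ;;
      *H200*) echo "p5en.48xlarge" ;;
      *H100*) echo "p5.48xlarge"   ;;
      *A100*) echo "p4de.24xlarge" ;;
      *) return 1 ;;
    esac
    return 0
  fi
  return 1
}
```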

Design Decisions

  • Per-instance profiles (not per model×instance matrix) — model assumptions documented in .env comments
  • Backward compatible — all scripts fall back to legacy defaults when profiles/ directory is absent
  • No symlinks — copies kept in sync via sync_profiles.sh for cross-platform compatibility
  • 5 test cases intentionally skipped for profiles: ddp (too simple), picotron (config externalized), distillation (already templated), deepspeed/qlora (already self-parameterized with MIG), nemo1.0 (Hydra config makes it awkward)

Testing

  • Dry-run tested all profile detection paths (4+ instances × 9 test cases)
  • Live tested Megatron-LM GPT3 1.7B on HyperPod EKS g5.12xlarge cluster:
    • 2 nodes, 8 GPUs, TP=4/PP=1/DP=2
    • 5 training iterations completed successfully
    • GPU memory: 5.1 GB peak / 24 GB (21% utilization)
    • Iteration time: ~3.2s steady state
    • Confirmed EFA config (FI_EFA_USE_DEVICE_RDMA=0) works correctly
    • Discovered /mnt/k8s-disks/0 doesn't exist on HyperPod g5 nodes (template needs /tmp)

…ration tables

Add central documentation mapping EC2 instance types to the parameter
changes required for each test case. Motivated by 11 OOM iterations
porting veRL GRPO from p5en (H200 80GB) to g5 (A10G 24GB).

New files:
- docs/instance-compatibility.md: master reference with compatibility
  matrix, the 6 hardware dimensions that differ, and common parameter
  adjustment tables for g5/p4de/g6e
- docs/instance-profiles/{g5,g6e,p4de,p5,p5en,trn1}.md: per-instance
  hardware specs, NCCL/EFA settings, memory strategies, K8s resources
- docs/plans/instance-compatibility-framework.md: implementation plan

Modified 22 READMEs to add 'Tested Configurations' tables covering all
test cases across PyTorch, Megatron, JAX, Neuron, and MosaicML.

Add auto-detection and per-instance config profiles so the same recipe
script works across p5en, g5, and p4de without manual parameter tuning.

New files (both hyperpod-eks and kubernetes variants):
- recipe/profiles/_detect.sh: auto-detect instance via IMDSv2, INSTANCE_TYPE
  env var, or nvidia-smi GPU name; resolves to matching .env profile
- recipe/profiles/p5en-48xlarge.env: baseline (FSDP1, TP=2, no offload)
- recipe/profiles/g5-12xlarge.env: aggressive (FSDP2, full offload, TP=4,
  bf16, enforce_eager, NCCL_PROTO=simple) — from 11 OOM iterations
- recipe/profiles/p4de-24xlarge.env: like p5en with 4 EFA adapters
- recipe/profiles/README.md: how profiles work, how to create new ones
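
To make the shape of these profiles concrete, here is an illustrative sketch of what a g5-12xlarge .env might contain. Variable names follow the commit message; the specific values are assumptions drawn from the description (4x A10G 24 GB, no EFA, FSDP2 with offload), not the actual file:

```shell
# Hypothetical g5-12xlarge.env sketch — values are illustrative only.
export GPUS_PER_NODE=4          # g5.12xlarge exposes 4x A10G
export NUM_EFA_PER_NODE=0       # g5.12xlarge has no EFA adapters
export FSDP_VERSION=2           # "aggressive" profile: FSDP2 + full offload
export FSDP_CPU_OFFLOAD=1
export TENSOR_PARALLEL=4
export ENFORCE_EAGER=1          # skip CUDA graph capture to reclaim VRAM
export NCCL_PROTO=simple        # per the g5 profile notes in this commit
```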

Modified:
- run_grpo_configurable.sh: sources detected profile, builds ray job
  command dynamically with conditional FSDP2/offload/eager args
- env_vars.example: documents INSTANCE_PROFILE override, defers
  NUM_GPU_PER_NODE and NUM_EFA_PER_NODE to profile defaults

Add MODEL ASSUMPTIONS comment blocks to all three profile .env files
(g5-12xlarge, p4de-24xlarge, p5en-48xlarge) documenting which settings
are model-dependent vs instance-dependent and how to adjust for
different model sizes. Update profiles README with a reference table
distinguishing model-driven vs instance-driven settings.

Synced across both hyperpod-eks and kubernetes variants.

Add auto-detecting instance profiles for the FSDP test case, following
the same pattern established in the veRL test case (Phase 4 of the
Instance Compatibility Framework).

New files:
- profiles/_detect.sh: auto-detects EC2 instance type via IMDS, env
  vars, or nvidia-smi GPU name (reused from veRL)
- profiles/{p5en-48xlarge,p5-48xlarge,p4de-24xlarge,g5-12xlarge,
  g6e-12xlarge}.env: instance-specific GPU count, EFA config, NCCL
  settings, and FSDP cpu_offload flag
- profiles/README.md: profile guide with model compatibility matrix

Modified files:
- slurm/training-sub.template: sources detected profile at runtime,
  conditionally sets EFA vars and appends --cpu_offload from profile.
  Falls back to legacy defaults (8 GPU, EFA) when no profile found.
- All 9 regenerated .sbatch files reflect the new template.
- kubernetes/README.md: adds profile-aware env_vars setup with
  instance type reference table.
- slurm/README.md: documents profile override workflow.
- README.md: links to profile system.

Backward compatible: scripts work unchanged on P5-class instances
without any profile directory present.
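
The source-then-fall-back pattern this commit describes can be sketched as below. The profile path and variable names follow the commit message; the command construction is a hedged illustration, not the actual template:

```shell
# Hedged sketch: source the detected profile if present, otherwise keep
# legacy defaults, and append --cpu_offload only when the profile asks for it.
PROFILES_DIR="profiles"
if [ -x "${PROFILES_DIR}/_detect.sh" ] \
   && PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "$PROFILES_DIR" 2>/dev/null); then
  source "$PROFILE_ENV"          # sets GPUS_PER_NODE, FSDP_CPU_OFFLOAD, ...
fi

build_train_cmd() {
  # Legacy fallback: behave exactly like the original script (8 GPUs, no offload)
  local gpus=${GPUS_PER_NODE:-8}
  local extra=""
  [ "${FSDP_CPU_OFFLOAD:-0}" = "1" ] && extra=" --cpu_offload"
  echo "torchrun --nproc_per_node=${gpus} train.py${extra}"
}
```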

Extend the instance profile system (Phase 5) to three more test cases:

torchtitan (slurm/1.llama_3_8b_torchtitan.sh):
- Added profiles/ with _detect.sh and 4 profiles (p5en, p4de, g5, g6e)
- Sbatch script now auto-detects instance and sources profile for GPU
  count and EFA settings. Falls back to legacy defaults.

nanoVLM (slurm/launch_training.sbatch):
- Same profile structure. Original script had GPUS_PER_NODE=8 and
  hardcoded EFA vars; now auto-detected via profile.

TRL grpo-math-reasoning (grpo-math-reasoning/train.sbatch):
- Added profiles/ with _detect.sh and 4 profiles including NCCL tuner
  settings (NCCL_BUFFSIZE, NCCL_P2P_NET_CHUNKSIZE, NCCL_TUNER_PLUGIN)
  for EFA instances. Non-EFA profiles skip all EFA and NCCL tuner vars.
- GPUS_PER_NODE drives both accelerate --num_processes and the
  auto-computed TENSOR_PARALLEL for vLLM serving.
- The gpt-oss-lora-grpo K8s sub-case is not modified (hardcoded to
  g6e in YAML manifests; documented in profiles README).

All profiles dry-run tested: 4 instance types x 3 test cases = 12/12 pass.
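
The single-knob relationship described for the TRL case can be sketched as follows. The derivation rule (vLLM tensor parallelism spanning all local GPUs unless overridden) is an assumption based on the description, not the exact script:

```shell
# Hedged sketch: one profile variable drives both launch-time settings.
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
TENSOR_PARALLEL=${TENSOR_PARALLEL:-$GPUS_PER_NODE}   # assumed rule: TP = local GPUs
echo "accelerate: --num_processes=${GPUS_PER_NODE}"
echo "vLLM serving: tensor_parallel=${TENSOR_PARALLEL}"
```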

Move the instance detection logic to 3.test_cases/shared/instance_detect.sh
as the canonical source, with sync_profiles.sh to propagate changes to all
6 test case copies. This eliminates drift risk when updating detection logic
(e.g., adding new instance types or IMDSv2 improvements).

All 6 copies updated and verified identical via md5 hash.
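
A sync-and-verify flow like the one described might look roughly like this. The real sync_profiles.sh may differ; the function name and verification-by-hash detail here are illustrative:

```shell
# Hedged sketch: copy the canonical detector into each test case directory,
# then confirm every copy hashes identically to the source.
sync_profiles() {
  local canon=$1; shift
  local ref d sum
  ref=$(md5sum "$canon" | cut -d' ' -f1)
  for d in "$@"; do
    cp "$canon" "$d/_detect.sh"
    sum=$(md5sum "$d/_detect.sh" | cut -d' ' -f1)
    [ "$sum" = "$ref" ] || { echo "drift detected in $d" >&2; return 1; }
  done
}
```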

…ron-DeepSpeed

Phase 5b of the Instance Compatibility Framework. Adds parameterized
instance profiles to the 4 remaining complex test cases:

- megatron/megatron-lm: 5 profiles (p5en, p5, p4de, g5, g6e) with
  TP/PP defaults, modified GPT3 and Llama2 sbatch scripts to auto-detect
  instance and source profiles. Fixed NODES variable bug (now uses
  SLURM_JOB_NUM_NODES). GPUS_PER_NODE replaces hardcoded nproc_per_node=8.

- megatron/bionemo: 5 profiles with MICRO_BATCH_SIZE scaled per instance
  VRAM (256 for 80GB+, 128 for 48GB, 64 for 24GB). Modified both ESM-1nv
  (BioNeMo 1.2) and ESM-2 (BioNeMo 2.5) training scripts.

- megatron/nemo: 5 env_vars JSON profiles (p5en, p5, p4de, g5, g6e).
  Clean swap via existing --env_vars_file CLI arg — no script modifications
  needed. Non-EFA profiles omit FI_PROVIDER and FI_EFA_* variables.

- deepspeed/megatron-deepspeed: 5 profiles with TP/PP/MBS/GBS scaled for
  Llama2-7B finetuning. 4-GPU instances use TP=2,PP=2 (vs TP=4,PP=2 on
  8-GPU instances). Modified finetune_llama.sbatch to source profiles.

Also updates shared/sync_profiles.sh and shared/README.md to track the
3 new _detect.sh copies (9 total across the repo).
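
The VRAM-tiered MICRO_BATCH_SIZE rule described for BioNeMo can be encoded as a small helper. The tier thresholds come from the commit message; the function name is hypothetical:

```shell
# Illustrative encoding of the tiers: 256 for 80 GB+ VRAM, 128 for 48 GB, 64 below.
micro_batch_for_vram() {
  local vram_gb=$1
  if   [ "$vram_gb" -ge 80 ]; then echo 256   # H100/H200/A100 80 GB class
  elif [ "$vram_gb" -ge 48 ]; then echo 128   # L40S 48 GB (g6e)
  else                             echo 64    # A10G 24 GB (g5)
  fi
}
```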
Collaborator

@KeitaW KeitaW left a comment


Review Batch 1/3 — Structure & Repository Hygiene

This is the core feedback on this PR. While the motivation is solid — users genuinely need guidance when running test cases on non-P5 instances — the current approach over-engineers what should remain simple, self-contained setup scripts.

Each test case in this repo is meant to be a simple, easy-to-follow setup script that a user can read top-to-bottom and understand. The _detect.sh / .env / profile-loading machinery adds a layer of general-purpose utility complexity that conflicts with this design philosophy.

I'd suggest splitting this into a documentation-only PR (README tables, instance profile reference docs) and deferring the script modifications until e2e tests validate each test case on the target instance types.

Comment on lines +15 to +40
###### Instance Profile ###
###########################
# Auto-detect instance type and source the matching profile.
# Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc.
# Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch)
# See profiles/README.md for details.

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROFILES_DIR="${SCRIPT_DIR}/../profiles"
PROFILE_LOADED=0

if [[ -d "$PROFILES_DIR" ]]; then
  # _detect.sh prints diagnostics to stderr and the profile path to stdout.
  if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then
    echo "Sourcing instance profile: $PROFILE_ENV"
    source "$PROFILE_ENV"
    PROFILE_LOADED=1
  else
    echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)."
  fi
else
  echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)."
fi

GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5
# Fallback defaults when no profile is loaded (assumes P5-class instance)
GPUS_PER_NODE=${GPUS_PER_NODE:-8}

Remove the profile-loading boilerplate — it overcomplicates test cases

This ~25-line profile detection block is copy-pasted into every sbatch script across 9 test cases (~20 scripts total). A user reading this file now needs to understand the profile system, _detect.sh detection order, and which .env file is being sourced before they can reason about what the script actually does.

The original approach was simpler and more aligned with how users interact with these scripts:

GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5

I'd suggest reverting the sbatch scripts to their original form. The valuable information about per-instance settings (NCCL flags, batch sizes, model fit) should go into the README documentation and inline comments instead.

@@ -0,0 +1,95 @@
#!/usr/bin/env bash

Remove _detect.sh and all profiles/ directories

This 95-line auto-detection script (IMDSv2 → env var → nvidia-smi fallback) is copied identically into 9 test case profiles/ directories, with a sync_profiles.sh to keep them in sync. This is infrastructure for infrastructure.

The detection logic is well-engineered, but it adds complexity that doesn't belong in a repo of reference training scripts. Users should be editing their scripts directly — that's the point of reference architectures.

I'd suggest removing:

  • All profiles/ directories (9 test cases)
  • All _detect.sh copies
  • 3.test_cases/shared/instance_detect.sh
  • 3.test_cases/shared/sync_profiles.sh
  • All .env profile files

The useful content currently in .env files (NCCL flags, model fit notes, VRAM estimates) should be moved to README docs and inline script comments.

@@ -0,0 +1,50 @@
#!/usr/bin/env bash

shared/sync_profiles.sh is not maintainable

Maintaining a sync script that copies _detect.sh across 9 directories adds ongoing maintenance burden. If the detection logic ever diverges (which is likely as test cases evolve independently), this sync mechanism will either silently break things or become a friction point.

This reinforces the recommendation to remove the entire profile system. The shared/ directory should not exist for this purpose.

@@ -0,0 +1,297 @@
# Instance Compatibility Framework — Implementation Plan

Remove docs/plans/ — implementation plans should not be in the repo

This is internal project planning material, not user-facing documentation. I'd suggest removing this file entirely. If there's value in preserving the design rationale, it can go in the PR description or a wiki page.

@@ -0,0 +1,80 @@
# g5 Instance Family — A10G GPUs

Scope instance profile pages to hardware reference only

If we're going to include instance profile pages in docs/, they should contain general hardware reference information useful across all test cases — not test-case-specific tuning knobs.

Good content for these pages:

  • GPU memory (VRAM), FLOPS (FP16/BF16/FP8), memory bandwidth
  • GDRCopy support
  • NVLink/NVSwitch topology and bandwidth
  • EFA adapter count and bandwidth
  • GPUDirect RDMA support
  • Node CPU/RAM specs

Content that should NOT be here: specific FSDP strategies, per-test-case batch sizes, offloading recommendations, or .env variable references. That belongs in each test case's own README, and only if validated with e2e tests.


@KeitaW KeitaW left a comment


Review Batch 2/3 — Documentation Consistency

Several documentation issues to address, regardless of whether the profile system stays or goes.

Comment on lines +13 to +14
> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
> for parameter adjustments needed across instance types.

Broken relative link — one ../ too many

This file is at depth 2 under the repo root (3.test_cases/23.SMHP-esm2/), so it needs 2 ../ segments to reach the root. The current link uses 3, which resolves to the parent of the repo root.

The same issue likely affects 3.test_cases/jax/README.md. I'd suggest sweeping all 22 READMEs to verify each relative link depth.

Suggested change
> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
> for parameter adjustments needed across instance types.
> See the [Instance Compatibility Guide](../../docs/instance-compatibility.md)
> for parameter adjustments needed across instance types.

| Instance | GPUs | Status | Notes |
|----------|------|--------|-------|
| p5en.48xlarge | 8 x H200 80 GB | Untested | Expected to work |
| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work |

H200 VRAM should be 141 GB, not 80 GB

The NVIDIA H200 SXM on p5en.48xlarge has 141 GB HBM3e. The "80 GB" figure appears to be carried over from H100 (p5.48xlarge). This error is repeated across most of the 22 README tables and in docs/instance-compatibility.md.

Suggested change
| p5en.48xlarge | 8 x H200 80 GB | Untested | Expected to work |
| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work |

Comment on lines +7 to +25

| Instance | GPUs | Models | Nodes | Status |
|----------|------|--------|-------|--------|
| p5en.48xlarge | 8 x H200 80 GB | Llama 2/3, Mixtral 8x7B | Various | Tested (CI) |
| p5.48xlarge | 8 x H100 80 GB | Llama 2/3, Mixtral 8x7B | Various | Tested (CI) |
| p4de.24xlarge | 8 x A100 80 GB | Llama 2/3 | Various | Tested |
| g5.12xlarge | 4 x A10G 24 GB | Various | Various | Tested |
| g5.xlarge | 1 x A10G 24 GB | Various | 1 | Tested |
| g4dn | Various | Various | Various | Tested |
| g6e.12xlarge | 4 x L40S 48 GB | — | — | Untested |

> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
> for parameter adjustments needed across instance types, and
> [instance profiles](../../../docs/instance-profiles/) for hardware details.

### Instance Profiles

This test case includes an [instance profile system](profiles/) that auto-detects
your EC2 instance type and configures GPU count, EFA networking, NCCL settings,

"Tested Configurations" tables include untested entries — risk of misleading users

Many rows in this table are marked "Untested | Expected to work". Adding untested configurations to a compatibility table risks misleading users — "expected to work" with the wrong NCCL flags or batch size can lead to silent failures or OOM errors.

I'd suggest either:

  1. Only including rows for configurations that have been validated end-to-end
  2. Moving "Untested" entries to a separate section clearly labeled as speculative

This applies to all 22 READMEs with Tested Configurations tables.


@KeitaW KeitaW left a comment


Review Batch 3/3 — Testing Requirements & Things That Look Great

E2e tests required for each modified test case

This PR modifies the core training scripts of 9 test cases — changing how GPUS_PER_NODE is set, restructuring EFA environment variable handling, and adding conditional logic around NCCL configuration. These are not cosmetic changes; they alter the runtime behavior of the scripts on every instance type.

Before merging script modifications, I'd expect e2e test results for each modified test case demonstrating that:

  1. The scripts still work correctly on the primary instance type (the one they were originally written for)
  2. The fallback path (no profiles/ directory) produces identical behavior to the original scripts

The PR body mentions live testing only for Megatron-LM GPT3 on g5.12xlarge EKS. The other 8 test cases appear to have been dry-run tested only. For changes to production training scripts, dry-run testing is not sufficient.

Could you provide e2e test results (or at least single-iteration smoke test logs) for each modified test case on its primary instance type?


Things That Look Great

  • "Tested Configurations" tables in READMEs: The concept of adding structured compatibility information to each test case README is valuable. Users genuinely need this. The table format is clean and scannable.
  • $NODES bug fix in Megatron-LM: Replacing the undefined $NODES with $SLURM_JOB_NUM_NODES in the GPT3 script (2.distributed-training.sbatch) is a real bug fix. This should be extracted and merged independently.
  • Hardcoded --num-nodes 2 and --num-gpus 8 fix in BioNeMo: Replacing hardcoded values with ${SLURM_JOB_NUM_NODES} and ${GPUS_PER_NODE} in train-esm.sbatch makes the script work with arbitrary cluster sizes. This is also a standalone fix worth merging separately.
  • Central docs/instance-compatibility.md: The hardware comparison table and the "6 Dimensions" tuning guide are genuinely useful reference material. This documentation (with the VRAM corrections) would be a great standalone contribution.
  • Detailed g5 veRL debugging notes: The "Key lessons from 11 OOM iterations" knowledge in the veRL g5 profile is extremely valuable. This content should be preserved — perhaps as a section in the veRL README or a troubleshooting guide — even if the .env profile mechanism is removed.

: "${TENSOR_PARALLEL:=4}"
: "${PIPELINE_PARALLEL:=2}"
: "${GLOBAL_BATCH_SIZE:=288}"
elif [ $NODES -ge 8 ]; then

Good bug fix — extract and merge independently

Replacing the undefined $NODES with $SLURM_JOB_NUM_NODES is a genuine bug fix. This was likely silently broken before. I'd suggest extracting this fix (and the TOTAL_GPUS fix on the same line) into a separate small PR so it can be merged without waiting for the larger profile system discussion to resolve.
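
A minimal sketch of the corrected lines, assuming Slurm's standard job environment (the surrounding script context is not shown in this diff hunk):

```shell
# Inside a Slurm job SLURM_JOB_NUM_NODES is always defined; $NODES never was.
NODES=${SLURM_JOB_NUM_NODES:-1}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
TOTAL_GPUS=$(( NODES * GPUS_PER_NODE ))
```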

--precision="bf16-mixed" \
--num-gpus 8 \
--num-nodes 2 \
--num-gpus ${GPUS_PER_NODE} \

Good fix — hardcoded values replaced with variables

Replacing --num-gpus 8 and --num-nodes 2 with ${GPUS_PER_NODE} and ${SLURM_JOB_NUM_NODES} is a valuable standalone fix. Like the $NODES fix above, I'd suggest extracting this into a separate PR.

