feat: Instance Compatibility Framework — multi-instance profiles and documentation for all test cases#1015
…ration tables
Add central documentation mapping EC2 instance types to the parameter
changes required for each test case. Motivated by 11 OOM iterations
porting veRL GRPO from p5en (H200 80GB) to g5 (A10G 24GB).
New files:
- docs/instance-compatibility.md: master reference with compatibility
matrix, the 6 hardware dimensions that differ, and common parameter
adjustment tables for g5/p4de/g6e
- docs/instance-profiles/{g5,g6e,p4de,p5,p5en,trn1}.md: per-instance
hardware specs, NCCL/EFA settings, memory strategies, K8s resources
- docs/plans/instance-compatibility-framework.md: implementation plan
Modified 22 READMEs to add 'Tested Configurations' tables covering all
test cases across PyTorch, Megatron, JAX, Neuron, and MosaicML.
Add auto-detection and per-instance config profiles so the same recipe script works across p5en, g5, and p4de without manual parameter tuning.
New files (both hyperpod-eks and kubernetes variants):
- recipe/profiles/_detect.sh: auto-detects the instance via IMDSv2, the INSTANCE_TYPE env var, or the nvidia-smi GPU name; resolves to the matching .env profile
- recipe/profiles/p5en-48xlarge.env: baseline (FSDP1, TP=2, no offload)
- recipe/profiles/g5-12xlarge.env: aggressive (FSDP2, full offload, TP=4, bf16, enforce_eager, NCCL_PROTO=simple), distilled from 11 OOM iterations
- recipe/profiles/p4de-24xlarge.env: like p5en with 4 EFA adapters
- recipe/profiles/README.md: how profiles work and how to create new ones
Modified:
- run_grpo_configurable.sh: sources the detected profile and builds the ray job command dynamically with conditional FSDP2/offload/eager args
- env_vars.example: documents the INSTANCE_PROFILE override; defers NUM_GPU_PER_NODE and NUM_EFA_PER_NODE to profile defaults
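The detection order described above (IMDSv2, then the INSTANCE_TYPE env var, then the nvidia-smi GPU name) can be sketched roughly as below. The function name and the GPU-to-instance mapping are illustrative assumptions, not the actual `_detect.sh` contents:

```shell
# Hedged sketch of the detection order; names and mappings are illustrative.
detect_instance_type() {
  local token itype
  # 1. IMDSv2: token-authenticated instance metadata query
  token=$(curl -sf --connect-timeout 1 -X PUT \
            "http://169.254.169.254/latest/api/token" \
            -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || true
  if [ -n "${token:-}" ]; then
    itype=$(curl -sf --connect-timeout 1 \
              -H "X-aws-ec2-metadata-token: $token" \
              "http://169.254.169.254/latest/meta-data/instance-type") || true
    [ -n "${itype:-}" ] && { echo "$itype"; return 0; }
  fi
  # 2. Explicit override via environment variable
  if [ -n "${INSTANCE_TYPE:-}" ]; then
    echo "$INSTANCE_TYPE"
    return 0
  fi
  # 3. Map the GPU name reported by nvidia-smi to a likely instance profile
  case "$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -n1)" in
    *A10G*) echo "g5-12xlarge" ;;
    *H200*) echo "p5en-48xlarge" ;;
    *A100*) echo "p4de-24xlarge" ;;
    *)      return 1 ;;
  esac
}
```

Off EC2, step 1 fails fast (the 1-second connect timeout) and the override or nvidia-smi path takes over.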
Add MODEL ASSUMPTIONS comment blocks to all three profile .env files (g5-12xlarge, p4de-24xlarge, p5en-48xlarge) documenting which settings are model-dependent vs instance-dependent and how to adjust for different model sizes. Update profiles README with a reference table distinguishing model-driven vs instance-driven settings. Synced across both hyperpod-eks and kubernetes variants.
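An illustrative shape for such a comment block in a profile .env file; the variable names and values below are hypothetical, not copied from the actual profiles:

```shell
### MODEL ASSUMPTIONS ##############################################
# Instance-dependent (leave alone when swapping models):
#   GPU count, EFA adapter config, NCCL transport settings.
# Model-dependent (re-tune for a different model size):
TENSOR_PARALLEL=2        # hypothetical: scale up for wider models
OFFLOAD_PARAMS=false     # hypothetical: enable on 24 GB cards for larger models
```

The point of the split is that a user swapping the model only touches the variables under the model-dependent header.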
Add auto-detecting instance profiles for the FSDP test case, following
the same pattern established in the veRL test case (Phase 4 of the
Instance Compatibility Framework).
New files:
- profiles/_detect.sh: auto-detects EC2 instance type via IMDS, env
vars, or nvidia-smi GPU name (reused from veRL)
- profiles/{p5en-48xlarge,p5-48xlarge,p4de-24xlarge,g5-12xlarge,
g6e-12xlarge}.env: instance-specific GPU count, EFA config, NCCL
settings, and FSDP cpu_offload flag
- profiles/README.md: profile guide with model compatibility matrix
Modified files:
- slurm/training-sub.template: sources detected profile at runtime,
conditionally sets EFA vars and appends --cpu_offload from profile.
Falls back to legacy defaults (8 GPU, EFA) when no profile found.
- All 9 regenerated .sbatch files reflect the new template.
- kubernetes/README.md: adds profile-aware env_vars setup with
instance type reference table.
- slurm/README.md: documents profile override workflow.
- README.md: links to profile system.
Backward compatible: scripts work unchanged on P5-class instances
without any profile directory present.
Extend the instance profile system (Phase 5) to three more test cases:
torchtitan (slurm/1.llama_3_8b_torchtitan.sh):
- Added profiles/ with _detect.sh and 4 profiles (p5en, p4de, g5, g6e)
- The sbatch script now auto-detects the instance and sources the profile for GPU count and EFA settings, falling back to legacy defaults.
nanoVLM (slurm/launch_training.sbatch):
- Same profile structure. The original script hardcoded GPUS_PER_NODE=8 and the EFA vars; both are now set via the detected profile.
TRL grpo-math-reasoning (grpo-math-reasoning/train.sbatch):
- Added profiles/ with _detect.sh and 4 profiles, including NCCL tuner settings (NCCL_BUFFSIZE, NCCL_P2P_NET_CHUNKSIZE, NCCL_TUNER_PLUGIN) for EFA instances. Non-EFA profiles skip all EFA and NCCL tuner vars.
- GPUS_PER_NODE drives both accelerate --num_processes and the auto-computed TENSOR_PARALLEL for vLLM serving.
- The gpt-oss-lora-grpo K8s sub-case is not modified (hardcoded to g6e in YAML manifests; documented in the profiles README).
All profiles dry-run tested: 4 instance types x 3 test cases = 12/12 pass.
Move the instance detection logic to 3.test_cases/shared/instance_detect.sh as the canonical source, with sync_profiles.sh to propagate changes to all 6 test case copies. This eliminates drift risk when updating detection logic (e.g., adding new instance types or IMDSv2 improvements). All 6 copies updated and verified identical via md5 hash.
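The sync-then-verify flow described here could look like the sketch below. The canonical path matches the commit; the helper name and copy layout are illustrative, and the demo runs against a temp directory rather than the repo:

```shell
set -euo pipefail

# Copy a canonical script to each destination, then verify every copy is
# byte-identical via md5 (as the commit describes). Prints the reference hash.
sync_and_verify() {
  local canonical=$1; shift
  local ref copy
  ref=$(md5sum "$canonical" | awk '{print $1}')
  for copy in "$@"; do
    mkdir -p "$(dirname "$copy")"
    cp "$canonical" "$copy"
    [ "$(md5sum "$copy" | awk '{print $1}')" = "$ref" ] \
      || { echo "DRIFT: $copy" >&2; return 1; }
  done
  echo "$ref"
}

# Demo in a temp directory with an illustrative canonical script
tmp=$(mktemp -d)
printf 'echo detect\n' > "$tmp/instance_detect.sh"
sync_and_verify "$tmp/instance_detect.sh" \
  "$tmp/fsdp/profiles/_detect.sh" "$tmp/verl/profiles/_detect.sh"
```

In the repo this would be invoked with `3.test_cases/shared/instance_detect.sh` as the canonical source and the 6 test case copies as destinations.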
…ron-DeepSpeed
Phase 5b of the Instance Compatibility Framework. Adds parameterized instance profiles to the 4 remaining complex test cases:
- megatron/megatron-lm: 5 profiles (p5en, p5, p4de, g5, g6e) with TP/PP defaults; modified the GPT3 and Llama2 sbatch scripts to auto-detect the instance and source profiles. Fixed the NODES variable bug (now uses SLURM_JOB_NUM_NODES). GPUS_PER_NODE replaces the hardcoded nproc_per_node=8.
- megatron/bionemo: 5 profiles with MICRO_BATCH_SIZE scaled per instance VRAM (256 for 80 GB+, 128 for 48 GB, 64 for 24 GB). Modified both the ESM-1nv (BioNeMo 1.2) and ESM-2 (BioNeMo 2.5) training scripts.
- megatron/nemo: 5 env_vars JSON profiles (p5en, p5, p4de, g5, g6e). Clean swap via the existing --env_vars_file CLI arg; no script modifications needed. Non-EFA profiles omit FI_PROVIDER and the FI_EFA_* variables.
- deepspeed/megatron-deepspeed: 5 profiles with TP/PP/MBS/GBS scaled for Llama2-7B finetuning. 4-GPU instances use TP=2,PP=2 (vs TP=4,PP=2 on 8-GPU instances). Modified finetune_llama.sbatch to source profiles.
Also updates shared/sync_profiles.sh and shared/README.md to track the 3 new _detect.sh copies (9 total across the repo).
KeitaW
left a comment
Review Batch 1/3 — Structure & Repository Hygiene
This is the core feedback on this PR. While the motivation is solid — users genuinely need guidance when running test cases on non-P5 instances — the current approach over-engineers what should remain simple, self-contained setup scripts.
Each test case in this repo is meant to be a simple, easy-to-follow setup script that a user can read top-to-bottom and understand. The _detect.sh / .env / profile-loading machinery adds a layer of general-purpose utility complexity that conflicts with this design philosophy.
I'd suggest splitting this into a documentation-only PR (README tables, instance profile reference docs) and deferring the script modifications until e2e tests validate each test case on the target instance types.
```diff
 ###### Instance Profile ###
 ###########################
 # Auto-detect instance type and source the matching profile.
 # Profiles set: GPUS_PER_NODE, FI_PROVIDER, EFA_PER_NODE, FSDP_CPU_OFFLOAD, etc.
 # Override with: export INSTANCE_PROFILE=g5-12xlarge (before sbatch)
 # See profiles/README.md for details.

 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 PROFILES_DIR="${SCRIPT_DIR}/../profiles"
 PROFILE_LOADED=0

 if [[ -d "$PROFILES_DIR" ]]; then
   # _detect.sh prints diagnostics to stderr and the profile path to stdout.
   if PROFILE_ENV=$("${PROFILES_DIR}/_detect.sh" "${PROFILES_DIR}"); then
     echo "Sourcing instance profile: $PROFILE_ENV"
     source "$PROFILE_ENV"
     PROFILE_LOADED=1
   else
     echo "WARNING: Profile detection failed. Using defaults (8 GPU, EFA enabled)."
   fi
 else
   echo "WARNING: No profiles/ directory found. Using defaults (8 GPU, EFA enabled)."
 fi

-GPUS_PER_NODE=8  # 4 for G5.12x, 8 for P4/P5
+# Fallback defaults when no profile is loaded (assumes P5-class instance)
+GPUS_PER_NODE=${GPUS_PER_NODE:-8}
```
Remove the profile-loading boilerplate — it overcomplicates test cases
This ~25-line profile detection block is copy-pasted into every sbatch script across 9 test cases (~20 scripts total). A user reading this file now needs to understand the profile system, _detect.sh detection order, and which .env file is being sourced before they can reason about what the script actually does.
The original approach was simpler and more aligned with how users interact with these scripts:
```sh
GPUS_PER_NODE=8  # 4 for G5.12x, 8 for P4/P5
```

I'd suggest reverting the sbatch scripts to their original form. The valuable information about per-instance settings (NCCL flags, batch sizes, model fit) should go into the README documentation and inline comments instead.
```diff
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
```
Remove _detect.sh and all profiles/ directories
This 95-line auto-detection script (IMDSv2 → env var → nvidia-smi fallback) is copied identically into 9 test case profiles/ directories, with a sync_profiles.sh to keep them in sync. This is infrastructure for infrastructure.
The detection logic is well-engineered, but it adds complexity that doesn't belong in a repo of reference training scripts. Users should be editing their scripts directly — that's the point of reference architectures.
I'd suggest removing:
- All `profiles/` directories (9 test cases)
- All `_detect.sh` copies
- `3.test_cases/shared/instance_detect.sh`
- `3.test_cases/shared/sync_profiles.sh`
- All `.env` profile files
The useful content currently in .env files (NCCL flags, model fit notes, VRAM estimates) should be moved to README docs and inline script comments.
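For illustration, the inline-comment style being suggested might look like this inside an sbatch script. The GPU counts match the tables in this PR; the offload variable and its threshold are assumptions:

```shell
# Per-instance guidance inline, instead of sourcing a .env profile.
GPUS_PER_NODE=8          # 4 on g5.12xlarge / g6e.12xlarge
FSDP_CPU_OFFLOAD=0       # assumption: set to 1 on 24 GB A10G (g5) for large models
export FI_PROVIDER=efa   # omit entirely on non-EFA instances (g5, g6e)
```

A user on a non-default instance edits the values in place, with the comment telling them what to change, and no profile machinery is needed.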
```diff
@@ -0,0 +1,50 @@
+#!/usr/bin/env bash
```
shared/sync_profiles.sh is not maintainable
Maintaining a sync script that copies _detect.sh across 9 directories adds ongoing maintenance burden. If the detection logic ever diverges (which is likely as test cases evolve independently), this sync mechanism will either silently break things or become a friction point.
This reinforces the recommendation to remove the entire profile system. The shared/ directory should not exist for this purpose.
```diff
@@ -0,0 +1,297 @@
+# Instance Compatibility Framework — Implementation Plan
```
Remove docs/plans/ — implementation plans should not be in the repo
This is internal project planning material, not user-facing documentation. I'd suggest removing this file entirely. If there's value in preserving the design rationale, it can go in the PR description or a wiki page.
```diff
@@ -0,0 +1,80 @@
+# g5 Instance Family — A10G GPUs
```
Scope instance profile pages to hardware reference only
If we're going to include instance profile pages in docs/, they should contain general hardware reference information useful across all test cases — not test-case-specific tuning knobs.
Good content for these pages:
- GPU memory (VRAM), FLOPS (FP16/BF16/FP8), memory bandwidth
- GDRCopy support
- NVLink/NVSwitch topology and bandwidth
- EFA adapter count and bandwidth
- GPUDirect RDMA support
- Node CPU/RAM specs
Content that should NOT be here: specific FSDP strategies, per-test-case batch sizes, offloading recommendations, or .env variable references. That belongs in each test case's own README, and only if validated with e2e tests.
KeitaW
left a comment
Review Batch 2/3 — Documentation Consistency
Several documentation issues to address, regardless of whether the profile system stays or goes.
```markdown
> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
> for parameter adjustments needed across instance types.
```
Broken relative link — one ../ too many
This file is at depth 2 under the repo root (3.test_cases/23.SMHP-esm2/), so it needs 2 ../ segments to reach the root. The current link uses 3, which resolves to the parent of the repo root.
The same issue likely affects 3.test_cases/jax/README.md. I'd suggest sweeping all 22 READMEs to verify each relative link depth.
Suggested change:

```diff
-> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
-> for parameter adjustments needed across instance types.
+> See the [Instance Compatibility Guide](../../docs/instance-compatibility.md)
+> for parameter adjustments needed across instance types.
```
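The README sweep suggested here can be partly automated: resolve each relative link against the README's directory and flag targets that don't exist. A rough sketch (the grep pattern is simplistic and ignores URLs and anchors):

```shell
# Check that ../-relative markdown links in a README resolve to real paths.
check_links() {
  local readme=$1 dir target bad=0
  dir=$(dirname "$readme")
  while IFS= read -r target; do
    [ -e "$dir/$target" ] || { echo "BROKEN: $readme -> $target"; bad=1; }
  done < <(grep -oE '\]\((\.\./)+[^)#]+' "$readme" | sed 's/^](//')
  return $bad
}
```

Run over all 22 READMEs with something like `find . -name README.md -exec bash -c 'check_links "$1"' _ {} \;` after sourcing the function.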
| Instance | GPUs | Status | Notes |
|----------|------|--------|-------|
| p5en.48xlarge | 8 x H200 80 GB | Untested | Expected to work |
| p5.48xlarge | 8 x H100 80 GB | Untested | Expected to work |
H200 VRAM should be 141 GB, not 80 GB
The NVIDIA H200 SXM on p5en.48xlarge has 141 GB HBM3e. The "80 GB" figure appears to be carried over from H100 (p5.48xlarge). This error is repeated across most of the 22 README tables and in docs/instance-compatibility.md.
Suggested change:

```diff
-| p5en.48xlarge | 8 x H200 80 GB | Untested | Expected to work |
+| p5en.48xlarge | 8 x H200 141 GB | Untested | Expected to work |
```
| Instance | GPUs | Models | Nodes | Status |
|----------|------|--------|-------|--------|
| p5en.48xlarge | 8 x H200 80 GB | Llama 2/3, Mixtral 8x7B | Various | Tested (CI) |
| p5.48xlarge | 8 x H100 80 GB | Llama 2/3, Mixtral 8x7B | Various | Tested (CI) |
| p4de.24xlarge | 8 x A100 80 GB | Llama 2/3 | Various | Tested |
| g5.12xlarge | 4 x A10G 24 GB | Various | Various | Tested |
| g5.xlarge | 1 x A10G 24 GB | Various | 1 | Tested |
| g4dn | Various | Various | Various | Tested |
| g6e.12xlarge | 4 x L40S 48 GB | — | — | Untested |

> See the [Instance Compatibility Guide](../../../docs/instance-compatibility.md)
> for parameter adjustments needed across instance types, and
> [instance profiles](../../../docs/instance-profiles/) for hardware details.

### Instance Profiles

This test case includes an [instance profile system](profiles/) that auto-detects
your EC2 instance type and configures GPU count, EFA networking, NCCL settings,
"Tested Configurations" tables include untested entries — risk of misleading users
Many rows in this table are marked "Untested | Expected to work". Adding untested configurations to a compatibility table risks misleading users — "expected to work" with the wrong NCCL flags or batch size can lead to silent failures or OOM errors.
I'd suggest either:
- Only including rows for configurations that have been validated end-to-end
- Moving "Untested" entries to a separate section clearly labeled as speculative
This applies to all 22 READMEs with Tested Configurations tables.
KeitaW
left a comment
Review Batch 3/3 — Testing Requirements & Things That Look Great
E2e tests required for each modified test case
This PR modifies the core training scripts of 9 test cases — changing how GPUS_PER_NODE is set, restructuring EFA environment variable handling, and adding conditional logic around NCCL configuration. These are not cosmetic changes; they alter the runtime behavior of the scripts on every instance type.
Before merging script modifications, I'd expect e2e test results for each modified test case demonstrating that:
- The scripts still work correctly on the primary instance type (the one they were originally written for)
- The fallback path (no `profiles/` directory) produces identical behavior to the original scripts
The PR body mentions live testing only for Megatron-LM GPT3 on g5.12xlarge EKS. The other 8 test cases appear to have been dry-run tested only. For changes to production training scripts, dry-run testing is not sufficient.
Could you provide e2e test results (or at least single-iteration smoke test logs) for each modified test case on its primary instance type?
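As a lightweight stand-in for full e2e runs, a pass/fail check over a captured training log could look like this; the grep patterns are illustrative and would need adjusting per framework (Megatron, FSDP, etc.):

```shell
# Minimal smoke check on a single-iteration training log (hypothetical patterns):
# pass if at least one iteration was logged and no CUDA OOM appears.
smoke_check() {
  local log=$1
  grep -q "iteration" "$log" || { echo "FAIL: no training iteration logged"; return 1; }
  if grep -qi "out of memory" "$log"; then
    echo "FAIL: OOM detected"
    return 1
  fi
  echo "PASS: one iteration completed without OOM"
}
```

Running this against one short job per modified test case on its primary instance type would cover the first bullet above at low cost.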
Things That Look Great
- "Tested Configurations" tables in READMEs: The concept of adding structured compatibility information to each test case README is valuable. Users genuinely need this. The table format is clean and scannable.
- `$NODES` bug fix in Megatron-LM: Replacing the undefined `$NODES` with `$SLURM_JOB_NUM_NODES` in the GPT3 script (2.distributed-training.sbatch) is a real bug fix. This should be extracted and merged independently.
- Hardcoded `--num-nodes 2` and `--num-gpus 8` fix in BioNeMo: Replacing hardcoded values with `${SLURM_JOB_NUM_NODES}` and `${GPUS_PER_NODE}` in train-esm.sbatch makes the script work with arbitrary cluster sizes. This is also a standalone fix worth merging separately.
- Central `docs/instance-compatibility.md`: The hardware comparison table and the "6 Dimensions" tuning guide are genuinely useful reference material. This documentation (with the VRAM corrections) would be a great standalone contribution.
- Detailed g5 veRL debugging notes: The "Key lessons from 11 OOM iterations" knowledge in the veRL g5 profile is extremely valuable. This content should be preserved (perhaps as a section in the veRL README or a troubleshooting guide) even if the `.env` profile mechanism is removed.
```sh
    : "${TENSOR_PARALLEL:=4}"
    : "${PIPELINE_PARALLEL:=2}"
    : "${GLOBAL_BATCH_SIZE:=288}"
elif [ $NODES -ge 8 ]; then
```
Good bug fix — extract and merge independently
Replacing the undefined $NODES with $SLURM_JOB_NUM_NODES is a genuine bug fix. This was likely silently broken before. I'd suggest extracting this fix (and the TOTAL_GPUS fix on the same line) into a separate small PR so it can be merged without waiting for the larger profile system discussion to resolve.
```diff
     --precision="bf16-mixed" \
-    --num-gpus 8 \
-    --num-nodes 2 \
+    --num-gpus ${GPUS_PER_NODE} \
+    --num-nodes ${SLURM_JOB_NUM_NODES} \
```
Good fix — hardcoded values replaced with variables
Replacing --num-gpus 8 and --num-nodes 2 with ${GPUS_PER_NODE} and ${SLURM_JOB_NUM_NODES} is a valuable standalone fix. Like the $NODES fix above, I'd suggest extracting this into a separate PR.
Summary
Adds a comprehensive Instance Compatibility Framework that enables test cases to run across different EC2 GPU instance types (p5en, p5, p4de, g5, g6e, trn1) with correct parameters. Previously, most test cases only documented P5 configurations, leaving users to guess parameters for other instances.
- `.env` profile system for 9 test cases (veRL, FSDP, torchtitan, nanoVLM, TRL, Megatron-LM, BioNeMo, NeMo, Megatron-DeepSpeed) that adjusts GPU count, EFA config, TP/PP, batch sizes, memory settings, and NCCL tuning per instance
- Canonical `instance_detect.sh` in `3.test_cases/shared/` with a sync script to keep the 9 test case copies in sync

Motivation
10 of ~20 test cases had zero instance type guidance. Users attempting to run on non-P5 instances (especially g5 and g6e with less VRAM and no NVLink) would hit OOM errors, EFA misconfigurations, or silent performance degradation with no documentation to help them.
Changes
Tier 1 — Documentation (Phase 1-2)
- `docs/instance-compatibility.md` — central reference with hardware matrix and full test case compatibility table
- `docs/instance-profiles/` — per-instance guides (g5, g6e, p4de, p5, p5en, trn1)
- `docs/plans/instance-compatibility-framework.md` — implementation plan

Tier 2 — Parameterized Profiles (Phases 3-5)
- `--enforce-eager` for g5
- `cpu_offload` based on model layer count thresholds (g5=32, g6e=40 layers)
- `$NODES` bug fix (`$SLURM_JOB_NUM_NODES`)
- `MICRO_BATCH_SIZE` scaled per VRAM (256→128/64)
- `env_vars_*.json` profiles (no script modifications — uses existing `--env_vars_file` CLI arg)

Shared Infrastructure
- `3.test_cases/shared/instance_detect.sh` — canonical auto-detection script (IMDSv2 → INSTANCE_TYPE env → nvidia-smi GPU name)
- `3.test_cases/shared/sync_profiles.sh` — keeps 9 test case copies in sync
- `3.test_cases/shared/README.md` — documents the shared utilities

Design Decisions
- Model-dependent vs instance-dependent settings documented in `.env` comments
- Scripts fall back to legacy defaults when the `profiles/` directory is absent
- … `sync_profiles.sh` for cross-platform compatibility

Testing
- Verified the g5 non-EFA networking config (… `FI_EFA_USE_DEVICE_RDMA=0`) works correctly
- `/mnt/k8s-disks/0` doesn't exist on HyperPod g5 nodes (the template needs `/tmp`)
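The "4 instance types x 3 test cases = 12/12 pass" dry-run matrix mentioned in the commits could be driven by a loop of the following shape. `resolve_profile` here is a self-contained stub standing in for invoking each test case's `_detect.sh` with `INSTANCE_TYPE` forced:

```shell
# Sketch of the dry-run matrix (4 instance types x 3 test cases).
# resolve_profile is a stub; a real run would call each test case's
# profiles/_detect.sh with INSTANCE_TYPE exported.
resolve_profile() { echo "$2/profiles/$1.env"; }

pass=0; total=0
for instance in p5en-48xlarge p4de-24xlarge g5-12xlarge g6e-12xlarge; do
  for tc in torchtitan nanoVLM grpo-math-reasoning; do
    total=$((total + 1))
    resolve_profile "$instance" "$tc" >/dev/null && pass=$((pass + 1))
  done
done
echo "${pass}/${total} pass"
```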