docs: add Instance Compatibility Guide with per-test-case configuration tables #1017

Open
nkumaraws wants to merge 2 commits into main from docs/instance-compatibility

@nkumaraws
Contributor

Summary

Adds documentation to help users run test cases across different EC2 accelerated instance types (p5en, p5, p4de, g5, g6e, trn1). Previously, most test cases documented only P5 configurations.

  • Central compatibility guide (docs/instance-compatibility.md): Hardware comparison matrix, 6-dimension tuning guide (GPU VRAM, GPUDirect RDMA, EFA count, NVLink topology, CPU memory, storage), full test case compatibility matrix, parameter adjustment tables for g5/p4de/g6e, and lessons learned from veRL g5 porting
  • Instance hardware profiles (docs/instance-profiles/): Per-instance hardware reference pages for g5, g6e, p4de, p5, p5en, and trn1 — covering VRAM, FLOPS, NVLink topology, EFA count, NCCL/EFA settings
  • "Tested Configurations" tables: Added to 22 test case READMEs showing validated instance types. The tables include only configurations that have been tested end-to-end and link to the central guide for untested instances.

Changes

Central Documentation

  • docs/instance-compatibility.md — master reference with hardware matrix, 6-dimension tuning guide, cross-test-case compatibility matrix, and parameter adjustment tables
  • docs/instance-profiles/ — 7 files: g5.md, g6e.md, p4de.md, p5.md, p5en.md, trn1.md, README.md (hardware specs only)

README Updates (22 files)

Each test case README gets a "Tested Configurations" table with validated instance types. Test cases without any validated non-default configurations get a simple link to the central guide instead.

What This Does NOT Include

Script modifications and parameterized profiles are being developed separately on the feat/instance-profiles branch with e2e testing. This PR is documentation-only — no training scripts are modified.

Bug fixes for hardcoded variables in Megatron-LM and BioNeMo scripts are in a separate PR: #1016.

Review Feedback Addressed

This PR incorporates all documentation feedback from the initial PR #1015:

  • H200 VRAM corrected from 80 GB to 141 GB (26 occurrences)
  • Broken relative links fixed in 23.SMHP-esm2 and jax READMEs
  • "Untested | Expected to work" rows removed from all tables
  • docs/plans/ removed (implementation plans don't belong in the repo)
  • Instance profile pages scoped to hardware reference only
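
The broken-link fix above could be guarded against regressions with a small check. The following is a minimal sketch, not part of this PR: the function name `broken_relative_links` and the link regex are assumptions for illustration, and the regex only handles plain inline Markdown links.

```python
import re
from pathlib import Path

# Matches inline Markdown links and captures the target: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Targets with a URL scheme (https:, mailto:, ...) are not file paths
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(markdown_file: Path) -> list[str]:
    """Return relative link targets in markdown_file that do not resolve on disk."""
    text = markdown_file.read_text(encoding="utf-8")
    broken = []
    for target in LINK_RE.findall(text):
        # Skip external URLs and in-page anchors
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue
        # Drop any "#section" suffix before checking the file system
        path_part = target.split("#", 1)[0]
        if not (markdown_file.parent / path_part).exists():
            broken.append(target)
    return broken
```

Running this over each modified README (for example via `Path(".").rglob("README.md")`) would flag a link that climbs three directory levels where two are needed.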

…ration tables

Add central documentation mapping EC2 instance types to the parameter
changes required for each test case. Motivated by 11 OOM iterations
porting veRL GRPO from p5en (H200 80GB) to g5 (A10G 24GB).

New files:
- docs/instance-compatibility.md: master reference with compatibility
  matrix, the 6 hardware dimensions that differ, and common parameter
  adjustment tables for g5/p4de/g6e
- docs/instance-profiles/{g5,g6e,p4de,p5,p5en,trn1}.md: per-instance
  hardware specs, NCCL/EFA settings, memory strategies, K8s resources
- docs/plans/instance-compatibility-framework.md: implementation plan

Modified 22 READMEs to add 'Tested Configurations' tables covering all
test cases across PyTorch, Megatron, JAX, Neuron, and MosaicML.
…scope to hardware-only

Address PR #1015 review feedback:
- Fix H200 VRAM from 80 GB to 141 GB across all 24 affected files
- Fix broken relative links in 23.SMHP-esm2 and jax READMEs (3 -> 2 levels)
- Remove 'Untested | Expected to work' rows from all README tables; keep
  only validated configurations
- For 8 test cases with entirely untested tables, replace with a simple
  link to the central Instance Compatibility Guide
- Remove docs/plans/ — implementation plans don't belong in the repo
- Scope docs/instance-profiles/*.md to hardware reference only (remove
  FSDP strategies, batch size recommendations, tested workloads tables)
- Replace 'Untested' with '—' in central compatibility matrix
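
A stale VRAM figure like the one corrected in this commit could be caught with a simple text scan. This is a heuristic sketch under stated assumptions: the helper name `stale_h200_lines` and the pattern are hypothetical, and the pattern only flags "80 GB" when it directly follows "H200" (allowing up to three separator characters).

```python
import re

# Heuristic: "H200" followed within a few separator characters by "80GB"/"80 GB".
# Lines such as "H200 141 GB" or "H100 80 GB" are not matched.
STALE_H200_RE = re.compile(r"H200\W{0,3}80\s?GB")


def stale_h200_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that still pair H200 with 80 GB."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if STALE_H200_RE.search(line):
            hits.append((number, line))
    return hits
```

Applied to the text of each documentation file, an empty result suggests no remaining occurrences of the old figure in the forms this pattern covers.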
Collaborator

@KeitaW KeitaW left a comment

Thank you! I'll update with the comments I had in the previous PR.
