docs: add Instance Compatibility Guide with per-test-case configuration tables #1017
Open
Add central documentation mapping EC2 instance types to the parameter
changes required for each test case. Motivated by 11 OOM iterations
porting veRL GRPO from p5en (H200 141GB) to g5 (A10G 24GB).
New files:
- docs/instance-compatibility.md: master reference with compatibility
matrix, the 6 hardware dimensions that differ, and common parameter
adjustment tables for g5/p4de/g6e
- docs/instance-profiles/{g5,g6e,p4de,p5,p5en,trn1}.md: per-instance
hardware specs, NCCL/EFA settings, memory strategies, K8s resources
- docs/plans/instance-compatibility-framework.md: implementation plan
Modified 22 READMEs to add 'Tested Configurations' tables covering all
test cases across PyTorch, Megatron, JAX, Neuron, and MosaicML.
…scope to hardware-only
Address PR #1015 review feedback:
- Fix H200 VRAM from 80 GB to 141 GB across all 24 affected files
- Fix broken relative links in 23.SMHP-esm2 and jax READMEs (3 -> 2 levels)
- Remove 'Untested | Expected to work' rows from all README tables; keep
only validated configurations
- For 8 test cases with entirely untested tables, replace with a simple
link to the central Instance Compatibility Guide
- Remove docs/plans/ (implementation plans don't belong in the repo)
- Scope docs/instance-profiles/*.md to hardware reference only (remove
FSDP strategies, batch size recommendations, tested workloads tables)
- Replace 'Untested' with '—' in central compatibility matrix
KeitaW (Collaborator) requested changes on Mar 12, 2026:
Thank you! I'll update with the comments I had in the previous PR.
Summary
Adds documentation to help users run test cases across different EC2 accelerated instance types (p5en, p5, p4de, g5, g6e, trn1). Previously, most test cases only documented P5 configurations.
- docs/instance-compatibility.md: Hardware comparison matrix, 6-dimension tuning guide (GPU VRAM, GPUDirect RDMA, EFA count, NVLink topology, CPU memory, storage), full test case compatibility matrix, parameter adjustment tables for g5/p4de/g6e, and lessons learned from veRL g5 porting
- docs/instance-profiles/: Per-instance hardware reference pages for g5, g6e, p4de, p5, p5en, and trn1, covering VRAM, FLOPS, NVLink topology, EFA count, and NCCL/EFA settings
Changes
Central Documentation
- docs/instance-compatibility.md: master reference with hardware matrix, 6-dimension tuning guide, cross-test-case compatibility matrix, and parameter adjustment tables
- docs/instance-profiles/: 7 files (g5.md, g6e.md, p4de.md, p5.md, p5en.md, trn1.md, README.md), hardware specs only
README Updates (22 files)
Each test case README gets a "Tested Configurations" table with validated instance types. Test cases without any validated non-default configurations get a simple link to the central guide instead.
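To illustrate why GPU VRAM is the first dimension to check when moving a test case between instance types, here is a minimal back-of-the-envelope sketch. It is not part of the PR. The g5 and p5en capacities are the ones quoted in this PR; the other capacities are the published figures for the L40S, A100 80GB, and H100. The 16 bytes per parameter figure assumes mixed-precision Adam with fully sharded states, and the function names are hypothetical. Activations, KV caches, and any extra model copies (for example, in RL workloads) are ignored, so real headroom is smaller.

```python
# Per-GPU memory in GB. g5 and p5en values are quoted in this PR;
# the rest are the published capacities of the respective GPUs.
VRAM_GB = {
    "g5": 24,     # A10G
    "g6e": 48,    # L40S
    "p4de": 80,   # A100 80GB
    "p5": 80,     # H100
    "p5en": 141,  # H200
}

# Assumption: mixed-precision Adam.
# 2 (bf16 weights) + 2 (bf16 grads) + 12 (fp32 master weights, momentum, variance)
BYTES_PER_PARAM = 16


def sharded_state_gb(num_params: float, num_gpus: int) -> float:
    """Weights + gradients + optimizer state per GPU under full sharding, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM / num_gpus / 1e9


def headroom_gb(instance: str, num_params: float, num_gpus: int) -> float:
    """VRAM left per GPU for activations and everything else (may be negative)."""
    return VRAM_GB[instance] - sharded_state_gb(num_params, num_gpus)


if __name__ == "__main__":
    # Hypothetical workload: a 7B-parameter model sharded over 8 GPUs.
    for name in ("g5", "p5en"):
        print(f"{name}: {headroom_gb(name, 7e9, 8):.1f} GB headroom per GPU")
```

Under these assumptions the same 7B model leaves far less room on a g5 than on a p5en, which is consistent with the batch size and offloading adjustments the guide tabulates per instance type.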
What This Does NOT Include
Script modifications and parameterized profiles are being developed separately on the feat/instance-profiles branch with e2e testing. This PR is documentation-only: no training scripts are modified.
Bug fixes for hardcoded variables in Megatron-LM and BioNeMo scripts are in a separate PR: #1016.
Review Feedback Addressed
This PR incorporates all documentation feedback from the initial PR #1015:
- Broken relative links fixed in 23.SMHP-esm2 and jax READMEs
- docs/plans/ removed (implementation plans don't belong in the repo)
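The relative link fix (3 -> 2 levels) mentioned in the commit message is the kind of error a small check can catch. The sketch below is illustrative only and not part of the PR: the README path is a hypothetical layout, and the helper names are invented for this example. It resolves a relative link against the README's directory and flags links that climb above the repository root.

```python
import posixpath


def resolve_link(readme_path: str, link: str) -> str:
    """Resolve a relative link against the README's directory (repo-root-relative paths)."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(readme_path), link))


def escapes_repo(resolved: str) -> bool:
    """True if the resolved path climbs above the repository root."""
    return resolved == ".." or resolved.startswith("../")


# Hypothetical layout: a README two directories below the repo root.
readme = "test_cases/example/README.md"

too_deep = resolve_link(readme, "../../../docs/instance-compatibility.md")
correct = resolve_link(readme, "../../docs/instance-compatibility.md")

if __name__ == "__main__":
    print(too_deep, escapes_repo(too_deep))
    print(correct, escapes_repo(correct))
```

A fuller check would also confirm that the resolved target exists on disk; that step is omitted here to keep the sketch self-contained.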