Add DeepSpeed 103B GPT pretraining benchmark and standardize containers for B200 #1009
- Update Dockerfile: pytorch:25.04-py3 base, EFA 1.47, NCCL 2.29.3, GDRCopy 2.5.1
- Add 103B GPT pretraining sbatch script with parameterized parallelism, ZeRO stages, fusion ops, and correct NCCL/EFA flags
- Add sweep runners (v1: 20 configs, v2: 10 configs) covering TP/PP/ZeRO/fusion/memory variations
- Add results parser and S3 upload script with CloudWatch metric publishing
- Best result: 476.6 TFLOPS/GPU (TP=8, PP=8, ZeRO=0, fusions enabled) on 8x B200 nodes
- Rewrite README as use-case-focused guide: GPT-103B pretraining, QLoRA fine-tuning, and Llama2 fine-tuning with best practices and proper configuration docs (no benchmark numbers)
- Simplify Makefile: best-config train target, remove sweep/upload targets
- Standardize QLoRA Dockerfile: pytorch:25.04-py3 base, EFA 1.47, NCCL 2.29.3, GDRCopy 2.5.1, OFI-NCCL symlinks, proper NCCL/EFA env vars
- Remove sweep runners and upload script from tracked files (internal tooling)
KeitaW
left a comment
Review Batch 1/4 — Structure & Repository Hygiene
New pretraining files should live in their own subdirectory with platform-specific layout
The repo convention requires test cases to follow 3.test_cases/<framework>/<library>/<test_case>/ with platform-specific subdirectories (slurm/, kubernetes/, hyperpod-eks/). Right now the three new pretraining files (pretrain_gpt_103b.sbatch, parse_results.py, configs/ds_config_103b_template.json) are placed directly in the deepspeed/ directory alongside shared infrastructure (0.deepspeed.dockerfile, 1.build-image.sbatch, Makefile). This mixes the shared container build with a specific benchmark, and doesn't match how the sibling qlora/ directory is organized.
I'd suggest moving the GPT-103B content into its own subdirectory, for example:
```
3.test_cases/pytorch/deepspeed/
├── 0.deepspeed.dockerfile        # shared container (stays here)
├── 1.build-image.sbatch          # shared build script (stays here)
├── Makefile                      # shared targets (stays here)
├── README.md                     # index page (stays here)
├── gpt/                          # NEW: GPT-103B benchmark
│   ├── README.md                 # benchmark-specific docs
│   ├── slurm/
│   │   └── pretrain_gpt_103b.sbatch
│   ├── configs/
│   │   └── ds_config_103b_template.json
│   └── parse_results.py
├── qlora/                        # existing
└── examples_megatron_deepspeed/  # existing
```
This keeps the directory layout consistent with qlora/ and with the broader repo convention. The Makefile train target would just need its path updated to gpt/slurm/pretrain_gpt_103b.sbatch.
```dockerfile
RUN pip3 install --no-cache-dir \
    awscli pynvml \
    transformers==${TRANSFORMERS_VERSION} \
    sentencepiece python-etcd \
    deepspeed accelerate
```
Unpinned Python packages
deepspeed and accelerate are installed without version pins. Per repo conventions, dependencies should have at least an upper-bound pin to prevent silent breakage from major version changes. For example:
Suggested change:

```diff
 RUN pip3 install --no-cache-dir \
     awscli pynvml \
     transformers==${TRANSFORMERS_VERSION} \
     sentencepiece python-etcd \
-    deepspeed accelerate
+    'deepspeed>=0.16,<1.0' 'accelerate>=1.0,<2.0'
```

(The pins are quoted so the shell running the `RUN` step doesn't treat `>` and `<` as redirection operators.)
I'd also double-check the QLoRA requirements.txt for the same issue — any packages there without upper bounds should get them too.
Fixed. Pinned deepspeed>=0.16,<1.0 and accelerate>=1.0,<2.0 in the Dockerfile. QLoRA requirements.txt already had proper upper bounds — no changes needed there.
```
matching the existing benchmark-results schema at:
s3://paragao-new-nemo-squash-container/benchmark-results/b200/
```
Personal S3 bucket reference
This looks like a personal bucket path in the module docstring. Since this will be visible in the public repo, I'd suggest either removing it or replacing it with a generic path.
Suggested change:

```diff
-matching the existing benchmark-results schema at:
-s3://paragao-new-nemo-squash-container/benchmark-results/b200/
+matching the benchmark-results schema.
 Usage:
```
Fixed. Replaced with generic placeholder s3://<YOUR_BUCKET>/benchmark-results/<instance_type>/. Also parameterized cluster and instance_type in build_result_json() — they now accept CLI args (--cluster, --instance-type) and fall back to $CLUSTER_NAME/$INSTANCE_TYPE env vars.
KeitaW
left a comment
Review Batch 2/4 — Deployment Pipeline
```bash
mkdir -p ${APPS_PATH}
mkdir -p logs
```
Unquoted variable expansions
The repo convention calls for quoting all variable expansions to guard against word splitting. While these are unlikely to contain spaces in practice, it's good hygiene and consistent with the rest of the codebase.
Suggested change:

```diff
-mkdir -p ${APPS_PATH}
+mkdir -p "${APPS_PATH}"
 mkdir -p logs
```
The same applies throughout pretrain_gpt_103b.sbatch — several ${IMAGE}, ${HOSTFILE}, and other expansions are unquoted, especially in the srun arguments and the bash -c string interpolation at the bottom of the script.
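A minimal, self-contained illustration of the failure mode the quoting guards against (the path is hypothetical, not from the PR):

```shell
# Hypothetical path containing a space, to show why quoting matters.
APPS_PATH="$(mktemp -d)/my apps"

# Unquoted, ${APPS_PATH} would be split into two words and mkdir would
# create two wrong directories. Quoted, it stays a single argument:
mkdir -p "${APPS_PATH}"
[ -d "${APPS_PATH}" ] && echo "created one directory"
```

The same quoting applies to `srun` arguments; the one deliberate exception is host-side expansion inside `bash -c` strings, which is worth a comment in the script.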
Fixed. Quoted all variable expansions in 1.build-image.sbatch and pretrain_gpt_103b.sbatch ($SLURM_JOB_NODELIST, ${HOSTFILE}, ${IMAGE}, ${APPS_PATH}, ${ENROOT_IMAGE}). For the srun bash -c line, variables are intentionally expanded on the host side before passing to the container — added a comment explaining this.
```bash
#SBATCH --job-name=deepspeed-pretrain-103b
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --partition=b200
```
Hardcoded partition name
The b200 partition is specific to your cluster; other users won't have it. A command-line `sbatch --partition=...` does override the in-script `#SBATCH` directive, so `make train PARTITION=my-partition` still works, but anyone submitting the script directly without an override will hit a nonexistent partition. Since the Makefile already passes --partition=$(PARTITION) at submission time, I'd suggest removing this directive entirely and letting users specify the partition when submitting.
Alternatively, add a comment noting this must be changed for other clusters.
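A sketch of submit-time partition selection under that suggestion (the `dev` default and the script path are illustrative):

```shell
# Resolve the partition at submit time, with an overridable default.
PARTITION="${PARTITION:-dev}"
CMD="sbatch --partition=${PARTITION} gpt/slurm/pretrain_gpt_103b.sbatch"
# Echo instead of submitting, since this is only a sketch:
echo "would submit: ${CMD}"
```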
Fixed. Removed the #SBATCH --partition=b200 directive entirely. The partition is now passed exclusively at submit time via --partition= in the Makefile. Also changed the Makefile default partition from b200 to dev.
KeitaW
left a comment
Review Batch 3/4 — Infrastructure & NCCL Configuration
```bash
export FI_PROVIDER=efa
export FI_EFA_USE_HUGE_PAGE=0
export NCCL_SOCKET_IFNAME=^docker,lo,veth
export NCCL_P2P_NET_CHUNKSIZE=2048576
```
NCCL_P2P_NET_CHUNKSIZE value looks off
This value is 2048576 (2,048,576). If 2 MB was intended, the conventional power-of-two value would be 2097152 (2×1024×1024). If it's an intentional non-power-of-two tuning choice that works well in your benchmarks, a brief comment explaining why would be helpful.
Suggested change:

```diff
-export NCCL_P2P_NET_CHUNKSIZE=2048576
+export NCCL_P2P_NET_CHUNKSIZE=2097152
```
Fixed. Changed from 2048576 to 2097152 (2 x 1024 x 1024). The original value was a typo, not an intentional tuning choice.
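A quick arithmetic check of the corrected value (nothing here is specific to the PR; 2097152 is simply 2 MiB):

```shell
# 2 MiB as a power of two:
echo $((2 * 1024 * 1024))    # prints 2097152

# A power of two has exactly one set bit; the typo'd value has several:
python3 -c "print(bin(2097152).count('1'), bin(2048576).count('1'))"
```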
KeitaW
left a comment
Review Batch 4/4 — Documentation Consistency
README environment variable table defaults vs recommended config
The README's env vars table correctly shows that PP defaults to 2 and ZERO_STAGE to 1 (matching the sbatch script), but the recommended "best config" uses TP=8, PP=8, ZERO_STAGE=0. Users who run the sbatch directly without make train would get the baseline config rather than the documented best. I'd suggest either aligning the sbatch defaults with the recommended config, or adding a note clarifying that the defaults are a safe baseline and make train uses the optimized settings.
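The defaults in question presumably use the standard `${VAR:-default}` guard; a sketch with the variable names from the README table (the guard pattern is an assumption, not the script's verbatim code):

```shell
# Safe-baseline defaults, overridable from the environment
# (make train would export TP=8 PP=8 ZERO_STAGE=0 before submitting).
TP="${TP:-8}"
PP="${PP:-2}"
ZERO_STAGE="${ZERO_STAGE:-1}"
echo "TP=${TP} PP=${PP} ZERO_STAGE=${ZERO_STAGE}"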
Missing newline at end of README
The diff shows \ No newline at end of file on the last line of README.md. Per .editorconfig conventions (insert_final_newline = true), this should have a trailing newline.
Things That Look Great
- Excellent Dockerfile modernization: The switch from manual aws-ofi-nccl source builds to the EFA 1.47 bundled plugin with proper symlinks is a significant improvement. The `ld.so.conf.d` approach for library discovery is more robust than relying solely on `LD_LIBRARY_PATH`.
- Thorough best practices documentation: The README's best practices section is clearly informed by real experimentation: the ZeRO-2/3 + pipeline parallelism incompatibility, the `NCCL_ALGO=Tree` warning, and the `expandable_segments:True` case sensitivity are all valuable tribal knowledge that will save users hours of debugging.
- Well-parameterized sbatch script: The `pretrain_gpt_103b.sbatch` script is cleanly organized with sensible defaults and full override capability via environment variables. The automatic ZeRO-stage-aware pipeline parallel handling is a nice touch.
- Consistent infrastructure stack: Standardizing both Dockerfiles on the same EFA/NCCL/GDRCopy versions reduces maintenance burden and eliminates "works in one container but not the other" issues.
- Dynamic DeepSpeed config generation: Generating the DS config at runtime from environment variables (rather than requiring users to edit JSON files) is much more ergonomic for parameter sweeps.
- Clean log parser: `parse_results.py` is well-structured with clear regex patterns and proper warmup-step handling for steady-state metrics.
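The dynamic-config idea can be sketched as a heredoc over env vars; the JSON field names follow DeepSpeed's config schema, but the variable names, values, and structure here are illustrative, not the PR's actual script:

```shell
ZERO_STAGE="${ZERO_STAGE:-1}"
MICRO_BATCH="${MICRO_BATCH:-1}"
GLOBAL_BATCH="${GLOBAL_BATCH:-256}"
DS_CONFIG="$(mktemp)"

# Generate the DeepSpeed JSON config at runtime from the env vars above.
cat > "${DS_CONFIG}" <<EOF
{
  "train_batch_size": ${GLOBAL_BATCH},
  "train_micro_batch_size_per_gpu": ${MICRO_BATCH},
  "zero_optimization": { "stage": ${ZERO_STAGE} },
  "bf16": { "enabled": true }
}
EOF

# Validate that the generated file is well-formed JSON.
python3 -m json.tool "${DS_CONFIG}"
```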
```markdown
| Variable | Default | Description |
|----------|---------|-------------|
| `TP`     | 8       | Tensor parallel size |
| `PP`     | 2       | Pipeline parallel size |
```
Default PP doesn't match recommended config
The table shows PP defaults to 2, but the recommended "best config" (in the Makefile and the make train example above) uses PP=8. Users running the sbatch directly without make train will get PP=2. Consider either changing the default to 8 or adding a note here that these are safe baselines and make train uses the optimized settings.
Suggested change:

```diff
-| `PP` | 2 | Pipeline parallel size |
+| `PP` | 2 | Pipeline parallel size (use PP=8 for best throughput, see `make train`) |
```
Fixed. Updated the README table row to: | PP | 2 | Pipeline parallel size (best throughput with PP=8, see make train) |
KeitaW
left a comment
Good update, few comments.
- Remove personal S3 bucket name from parse_results.py and upload_results.sh
- Parameterize cluster name and instance type in parse_results.py (via CLI args and env vars)
- Replace hardcoded S3 bucket/regions in upload_results.sh with required env vars (S3_BUCKET, S3_REGION, CW_REGION)
- Remove hardcoded --partition=b200 from sbatch script
- Make PARTITION overridable in sweep_runner.sh and sweep_runner_v2.sh
- Change default partition from 'b200' to 'dev' in Makefile and sweep scripts
- Add sweep_runner.sh, sweep_runner_v2.sh, and upload_results.sh to tracking
…e vars
- Move GPT-103B pretraining files into gpt/ subdirectory with slurm/ and configs/ sub-dirs to match repo conventions
- Pin deepspeed>=0.16,<1.0 and accelerate>=1.0,<2.0 in Dockerfile
- Quote all variable expansions in build and training scripts
- Fix NCCL_P2P_NET_CHUNKSIZE from 2048576 to 2097152 (2MB power-of-two)
- Add PP note to README env vars table clarifying best config uses PP=8
- Add trailing newline to README.md
- Update all path references for new directory structure
Addressed all review feedback.

Directory restructure (Batch 1/4): Moved GPT-103B files into the `gpt/` subdirectory and updated all path references in Makefile and README.

Documentation (Batch 4/4): Added trailing newline to README.md. Updated the PP row with a note about the best-throughput config.

See individual inline replies for details on each item.
KeitaW
left a comment
Review Batch 1/3 — Structure & Repository Hygiene
Re-review of PR #1009 (2026-03-11). Good progress since the initial review — the gpt/ subdirectory restructuring, version pins on deepspeed/accelerate, and NCCL chunk size fix all look great.
One cross-cutting note: Steps 1–7 of 0.deepspeed.dockerfile and qlora/Dockerfile are nearly identical (~80 lines of duplicated EFA/NCCL/GDRCopy infrastructure). This isn't blocking since self-contained Dockerfiles are the norm in this repo, but it's worth considering a shared base image in a follow-up to reduce maintenance burden when version bumps are needed.
```bash
enroot import -o ${ENROOT_IMAGE} dockerd://deepspeed:latest
git clone https://github.com/microsoft/Megatron-DeepSpeed /fsx/deepspeed/Megatron-DeepSpeed
```
Megatron-DeepSpeed clone is not pinned to a version
The previous Dockerfile had ARG MEGATRON_LM_VERSION=core_r0.8.0, but this was removed. Per the repo convention, external dependencies must be pinned to a version/tag/commit — never HEAD. I'd suggest pinning to whatever commit or tag was validated during testing on the B200 cluster.
Suggested change:

```diff
-git clone https://github.com/microsoft/Megatron-DeepSpeed /fsx/deepspeed/Megatron-DeepSpeed
+git clone --branch <validated_tag> --depth 1 https://github.com/microsoft/Megatron-DeepSpeed /fsx/deepspeed/Megatron-DeepSpeed
```
KeitaW
left a comment
Review Batch 2/3 — Deployment Pipeline
Two items in the container build pipeline.
```diff
-enroot import -o ${ENROOT_IMAGE}.sqsh dockerd://${ENROOT_IMAGE}:latest
-mv ${ENROOT_IMAGE}.sqsh ${IMAGE}
+enroot import -o "${ENROOT_IMAGE}.sqsh" "dockerd://${ENROOT_IMAGE}:latest"
+mv "${ENROOT_IMAGE}.sqsh" "${IMAGE}"
\ No newline at end of file
```
Missing final newline
The diff shows \ No newline at end of file here. Per .editorconfig (insert_final_newline = true), this should end with a trailing newline.
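A small check for this class of problem (the temp file is illustrative): `tail -c1` of a file is empty, after command-substitution newline stripping, only when the file ends with a newline.

```shell
f="$(mktemp)"
printf 'last line' > "$f"                  # written without a trailing newline
[ -n "$(tail -c1 "$f")" ] && echo "missing final newline"

printf '\n' >> "$f"                        # append the newline
[ -z "$(tail -c1 "$f")" ] && echo "final newline present"
```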
```diff
 # Install PyTorch with CUDA 12.9 support
 # Note: torch 2.10+ has a breaking LR scheduler change (strict zip) that is
 # incompatible with some DeepSpeed/transformers versions. Pin to <2.10 until
 # upstream libraries catch up.
-RUN pip install --no-cache-dir 'torch>=2.7.0,<2.10.0' --index-url https://download.pytorch.org/whl/cu128
+RUN pip install --no-cache-dir 'torch>=2.7.0,<2.10.0' --index-url https://download.pytorch.org/whl/cu129
```
Reinstalling PyTorch on top of NGC base image
The base image nvcr.io/nvidia/pytorch:25.04-py3 already ships with a CUDA 12.9-optimized PyTorch. This line overwrites it with the generic PyPI wheel, which may lack NGC-specific optimizations (cuDNN auto-tuning, NVTX annotations, etc.) and increases image size.
The main Dockerfile (0.deepspeed.dockerfile) correctly relies on the base image's PyTorch. I'd suggest either removing this line (the base image's torch satisfies >=2.7.0,<2.10.0) or adding a comment explaining why the override is intentional (e.g., a specific version requirement for QLoRA compatibility).
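A sketch of verifying that the base image's torch already satisfies the pin, using `sort -V` for version comparison. The version string "2.9.1" is an illustrative stand-in; in the container you would read it from `python -c "import torch; print(torch.__version__)"`:

```shell
# Check that a torch version satisfies >=2.7.0,<2.10.0 without reinstalling.
v="2.9.1"; lo="2.7.0"; hi="2.10.0"

# v >= lo when lo sorts first; v < hi when v sorts first and differs from hi.
ge_lo=$([ "$(printf '%s\n' "$lo" "$v" | sort -V | head -n1)" = "$lo" ] && echo yes)
lt_hi=$([ "$(printf '%s\n' "$v" "$hi" | sort -V | head -n1)" = "$v" ] && [ "$v" != "$hi" ] && echo yes)

[ "$ge_lo" = yes ] && [ "$lt_hi" = yes ] && echo "base torch satisfies the pin; reinstall unnecessary"
```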
KeitaW
left a comment
Review Batch 3/3 — Documentation Consistency
Things That Look Great
- Directory structure fixed: The `gpt/` subdirectory with `slurm/` and `configs/` follows the repo convention nicely. Good restructuring since the last review.
- ZeRO-2/3 pipeline parallelism guard: The automatic detection of `ZERO_STAGE >= 2` to disable pipeline parallelism and add `--no-pipeline-parallel` prevents a confusing runtime assertion. The inline comment explaining the Megatron-DeepSpeed PipelineEngine constraint is exactly the kind of documentation that saves hours of debugging.
- Dynamic DeepSpeed config generation: Generating the JSON config at runtime from env vars (with ZeRO-3 specific parameters) is much more maintainable than managing multiple static configs.
- Comprehensive best practices from real experiments: The documented findings (ZeRO-0 > ZeRO-1 when DP=1, capital T in `expandable_segments:True`, `NCCL_ALGO=Tree` hangs on EFA, negligible impact of NCCL buffer tuning) are hard-won knowledge that will save users significant debugging time.
- Proper NCCL_SOCKET_IFNAME exclusion pattern: Both Dockerfiles use `^docker,lo,veth`, the correct exclusion-based approach for EFA clusters.
- NCCL plugin symlink documentation: The comments explaining why symlinks are needed (EFA installer naming vs. what NCCL expects, causing silent TCP socket fallback) prevent a subtle and hard-to-diagnose failure mode.
- Well-parameterized sbatch: Everything configurable via env vars with sensible defaults enables both quick benchmarking and detailed parameter sweeps.
- Clean log parser: `parse_results.py` is well-structured with clear regex patterns, warmup exclusion, TFLOPS computation fallback, and both single-file and batch CSV modes.
- Improved shell quoting: `1.build-image.sbatch` now quotes all variable expansions properly.
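The ZeRO-stage guard praised above can be sketched as follows; the variable names are illustrative and the PR's script may differ in detail:

```shell
ZERO_STAGE=2   # illustrative value that triggers the guard
PP=8
EXTRA_ARGS=""
if [ "${ZERO_STAGE}" -ge 2 ]; then
  # Megatron-DeepSpeed's PipelineEngine cannot combine ZeRO-2/3 with
  # pipeline parallelism, so force PP=1 and pass the explicit flag.
  PP=1
  EXTRA_ARGS="--no-pipeline-parallel"
fi
echo "ZERO_STAGE=${ZERO_STAGE} PP=${PP} EXTRA_ARGS=${EXTRA_ARGS}"
```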
KeitaW
left a comment
Few minor comments. Generally looks good. Thanks for addressing the comments!
Summary
Changes
New files
- `pretrain_gpt_103b.sbatch`
- `parse_results.py`
- `configs/ds_config_103b_template.json`

Modified files
- `0.deepspeed.dockerfile`: `pytorch:25.04-py3` base (CUDA 12.9), EFA 1.47.0, NCCL 2.29.3, GDRCopy v2.5.1, OFI-NCCL symlinks, `ld.so.conf.d` library discovery
- `1.build-image.sbatch`: `mkdir -p /fsx/apps`, fixed job name
- `Makefile`: `train` target uses best config (TP=8, PP=8, fusions), `parse` target for log parsing, `help` with descriptions
- `README.md`
- `qlora/Dockerfile`: `pytorch:25.04-py3` base, EFA 1.47, NCCL 2.29.3, GDRCopy 2.5.1, proper NCCL/EFA env vars, `expandable_segments:True`
- `qlora/requirements.txt`

Key best practices documented
- `--no-pipeline-parallel` with ZeRO-2/3 (Megatron-DeepSpeed PipelineEngine limitation)
- Avoid `NCCL_ALGO=Tree` on EFA clusters (causes hangs)
- `expandable_segments:True` (capital T) in pytorch:25.04 containers
- `python3 -m torch.distributed.run` instead of `torchrun` (shebang compatibility)

Testing
All configurations were validated on an 8-node B200 HyperPod cluster (64 GPUs total) with 50 training steps each.