
Expand projection.md with memory projection and performance details#577

Closed
araina-amd wants to merge 10 commits into release/v26.2 from araina/update-projection-docs

Conversation

@araina-amd
Contributor

kailashg26 and others added 10 commits February 19, 2026 07:21
…32B Configs for MI300X & MI355X (#556)

YF: Only SFT-related config and doc changes, bypassing unit CI tests

## Summary

This PR introduces post-training documentation and updates Qwen3 32B
model configuration files to support AMD MI300X and MI355X accelerators.

---

## Changes

### 📘 Documentation

- **Added `posttraining.md`**
  - New comprehensive guide for post-training workflows
  - Covers setup instructions, configuration details, and usage examples

- **Updated `docs/README.md`**
  - Added a new section referencing post-training documentation
  - Improved documentation organization and navigation

---

### ⚙️ Configuration Updates

- **Updated Qwen3_32B model YAML configs**
  - Added/modified configurations optimized for:
    - MI300X
    - MI355X
  - Adjusted parameters for compatibility and stable execution

---

## Validation

- Verified updated configs load and execute successfully on MI300X and
MI355X environments
- Confirmed documentation links and structure render correctly

---

## Checklist

- [x] Added `posttraining.md`
- [x] Updated `docs/README.md`
- [x] Modified Qwen3_32B YAML configs
- [x] Verified changes locally
Adds a patch to fix Megatron FSDP compatibility with PyTorch 2.10+. The
patch updates get_mesh_names to use the new DeviceMesh API
(_get_root_mesh() and _flatten_mapping) instead of the deprecated
_mesh_resources.child_to_root_mapping removed in PyTorch 2.10. The patch
is automatically applied when use_megatron_fsdp is enabled.

Co-authored-by: WangLingxun <linxwang@amd.com>
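The version dispatch described above can be sketched as follows. `FakeMeshOld`, `FakeMeshNew`, and the public attribute names are hypothetical stand-ins, since the real patch touches private `DeviceMesh` internals (`_get_root_mesh()`, `_flatten_mapping`, `_mesh_resources.child_to_root_mapping`):

```python
# Hedged sketch of the compatibility-shim pattern; mesh classes and
# attribute names are hypothetical stand-ins for private PyTorch APIs.
class FakeMeshOld:
    # Stands in for _mesh_resources.child_to_root_mapping (PyTorch < 2.10).
    child_to_root_mapping = {"dp": "root", "tp": "root"}

class FakeMeshNew:
    # Stands in for _get_root_mesh() / _flatten_mapping (PyTorch >= 2.10).
    flatten_mapping = {"dp": "root", "tp": "root"}

def get_mesh_names(mesh, torch_version):
    """Return mesh dimension names, dispatching on the PyTorch version."""
    major, minor = (int(p) for p in torch_version.split(".")[:2])
    if (major, minor) >= (2, 10):
        return sorted(mesh.flatten_mapping)   # new DeviceMesh API path
    return sorted(mesh.child_to_root_mapping)  # deprecated pre-2.10 path
```

Gating on the parsed version tuple rather than the raw string avoids the classic `"2.10" < "2.9"` lexicographic-comparison bug.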
Adds support for CPU initialization in Primus Turbo linear layers
(RowParallelLinear, ColumnParallelLinear, and LayerNormLinear). When
use_cpu_initialization is enabled, the patch disables custom init
methods by passing a no-op lambda, allowing Megatron's CPU
initialization to work correctly with Primus Turbo's custom layer
implementations.

Co-authored-by: WangLingxun <linxwang@amd.com>
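The no-op-lambda trick can be illustrated with a minimal sketch; `build_linear` and the initializers below are hypothetical, not the actual Primus Turbo signatures:

```python
# Hedged sketch: disabling a custom init method so the framework's own
# CPU initialization path can fill the weights instead.
def custom_gpu_init(weight):
    # Stand-in for a device-side initializer that would conflict
    # with CPU initialization.
    for row in weight:
        for j in range(len(row)):
            row[j] = 1.0

def build_linear(rows, cols, init_method, use_cpu_initialization=False):
    if use_cpu_initialization:
        # Replace the custom initializer with a no-op, mirroring the
        # patch's "pass a no-op lambda" approach.
        init_method = lambda w: None
    weight = [[0.0] * cols for _ in range(rows)]
    init_method(weight)
    return weight
```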
Previously, the evaluation loss was computed per iteration and
overwritten, leading to incorrect averaging when multiple eval
iterations are used.
This fix accumulates the numerator and denominator separately across all
eval iterations and computes the final average at the end.
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
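The accumulate-then-divide fix amounts to the following (a minimal sketch; the function name and the `(loss_sum, token_count)` pair shape are illustrative, not the project's actual API):

```python
def average_eval_loss(iterations):
    """iterations: (loss_sum, token_count) pairs, one per eval iteration.

    Accumulates numerator and denominator separately across all eval
    iterations, then divides once at the end -- instead of overwriting
    a per-iteration average, which weights iterations incorrectly.
    """
    total_loss = 0.0
    total_tokens = 0
    for loss_sum, tokens in iterations:
        total_loss += loss_sum      # accumulate numerator
        total_tokens += tokens      # accumulate denominator
    return total_loss / total_tokens
```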
…odes (#554)

### Changes:
Only flag imbalance if the COUNT of GPUs on each node differs.
Example:
- 4 GPUs on Node 0, 4 on Node 1 -> counts=[4,4] -> set={4} -> len=1 -> NOT imbalanced.
- 7 GPUs on Node 0, 1 on Node 1 -> counts=[7,1] -> set={7,1} -> len=2 -> imbalanced.

### Reason for changes:
The previous logic issued a NUMA imbalance warning whenever the GPUs were
not all attached to the same node, producing false positives on
multi-socket systems where an even split across nodes is expected.

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
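The count-based check described above can be sketched directly (function name is illustrative):

```python
from collections import Counter

def numa_imbalanced(gpu_nodes):
    """Flag imbalance only when per-node GPU counts differ.

    gpu_nodes: one NUMA node id per GPU, e.g. [0, 0, 0, 0, 1, 1, 1, 1].
    counts=[4,4] -> set={4} -> len=1 -> balanced;
    counts=[7,1] -> set={7,1} -> len=2 -> imbalanced.
    """
    counts = Counter(gpu_nodes).values()
    return len(set(counts)) > 1
```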
Updates the AINIC Docker build inputs and adjusts the pretrain launcher
to disable HipBLASLt tuning by default (to avoid profiler/TE issues),
while also extending CI to build an additional v25.09 AINIC image
variant.

Changes:

- Disable HipBLASLt tuning by default in run_pretrain.sh, requiring an
explicit opt-in env var to enable it.
- Bump the AINIC bundle used by the AINIC Docker image from a-38 to
a-56.
- Update CI to use the new bundle and add a new -v25.09-ainic image
build/push step.
- Introduced a new class, ElapsedAverageExtension, to calculate and
inject the running average of elapsed time per iteration (ms) into
training logs.
- Updated TrainingLogInfo to include elapsed_index for tracking elapsed
time segments.
- Enhanced log parsing to support the new elapsed time metrics.
- Modified patch_training_log_unified to integrate the new extension
alongside existing memory and throughput statistics.

---------

Co-authored-by: HuangWei-95 <weihuan@amd.com>
Co-authored-by: wenxie-amd <wen.xie@amd.com>
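The running-average injection can be reduced to a small sketch; the class below mirrors only the arithmetic, not the actual `ElapsedAverageExtension` interface or its log-patching hooks:

```python
class ElapsedAverage:
    """Running average of elapsed time per iteration, in milliseconds.

    Hedged sketch: keeps a cumulative sum and count so the average can
    be injected into each training-log line without storing history.
    """
    def __init__(self):
        self.total_ms = 0.0
        self.count = 0

    def update(self, elapsed_ms):
        self.count += 1
        self.total_ms += elapsed_ms
        return self.total_ms / self.count
```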
Ensure --key=value and --key value are parsed consistently so runtime
config overrides apply correctly.
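One way such normalization is commonly done is sketched below; the function name and the simplifying assumption that every `--key` takes exactly one value are mine, not the project's parser:

```python
def normalize_overrides(argv):
    """Parse --key=value and --key value forms into one dict.

    Hedged sketch: assumes every --key option carries exactly one value,
    so both spellings produce identical override dicts.
    """
    out = {}
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok.startswith("--") and "=" in tok:
            key, val = tok[2:].split("=", 1)   # --key=value form
            i += 1
        elif tok.startswith("--"):
            key, val = tok[2:], argv[i + 1]    # --key value form
            i += 2
        else:
            i += 1                             # skip stray positional
            continue
        out[key] = val
    return out
```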