
Expand projection.md with memory projection and performance details#577

Closed
araina-amd wants to merge 10 commits into release/v26.2 from araina/update-projection-docs

Conversation

@araina-amd
Contributor

kailashg26 and others added 10 commits February 19, 2026 07:21
…32B Configs for MI300X & MI355X (#556)

YF: Only SFT-related config and doc changes, bypassing unit CI tests

## Summary

This PR introduces post-training documentation and updates Qwen3 32B
model configuration files to support AMD MI300X and MI355X accelerators.

---

## Changes

### 📘 Documentation

- **Added `posttraining.md`**
  - New comprehensive guide for post-training workflows
  - Covers setup instructions, configuration details, and usage examples

- **Updated `docs/README.md`**
  - Added a new section referencing post-training documentation
  - Improved documentation organization and navigation

---

### ⚙️ Configuration Updates

- **Updated Qwen3_32B model YAML configs**
  - Added/modified configurations optimized for:
    - MI300X
    - MI355X
  - Adjusted parameters for compatibility and stable execution

---

## Validation

- Verified updated configs load and execute successfully on MI300X and
MI355X environments
- Confirmed documentation links and structure render correctly

---

## Checklist

- [x] Added `posttraining.md`
- [x] Updated `docs/README.md`
- [x] Modified Qwen3_32B YAML configs
- [x] Verified changes locally
Adds a patch to fix Megatron FSDP compatibility with PyTorch 2.10+. The
patch updates get_mesh_names to use the new DeviceMesh API
(_get_root_mesh() and _flatten_mapping) instead of the deprecated
_mesh_resources.child_to_root_mapping removed in PyTorch 2.10. The patch
is automatically applied when use_megatron_fsdp is enabled.

Co-authored-by: WangLingxun <linxwang@amd.com>
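The version dispatch described above can be sketched as follows. `FakeMeshOld`, `FakeMeshNew`, and the public attribute names are hypothetical stand-ins, since the real patch touches private `DeviceMesh` internals (`_get_root_mesh()`, `_flatten_mapping`, `_mesh_resources.child_to_root_mapping`):

```python
# Hedged sketch of the compatibility-shim pattern; mesh classes and
# attribute names are hypothetical stand-ins for private PyTorch APIs.
class FakeMeshOld:
    # Stands in for _mesh_resources.child_to_root_mapping (PyTorch < 2.10).
    child_to_root_mapping = {"dp": "root", "tp": "root"}

class FakeMeshNew:
    # Stands in for _get_root_mesh() / _flatten_mapping (PyTorch >= 2.10).
    flatten_mapping = {"dp": "root", "tp": "root"}

def get_mesh_names(mesh, torch_version):
    """Return mesh dimension names, dispatching on the PyTorch version."""
    major, minor = (int(p) for p in torch_version.split(".")[:2])
    if (major, minor) >= (2, 10):
        return sorted(mesh.flatten_mapping)   # new DeviceMesh API path
    return sorted(mesh.child_to_root_mapping)  # deprecated pre-2.10 path
```

Gating on the parsed version tuple rather than the raw string avoids the classic `"2.10" < "2.9"` lexicographic-comparison bug.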
Adds support for CPU initialization in Primus Turbo linear layers
(RowParallelLinear, ColumnParallelLinear, and LayerNormLinear). When
use_cpu_initialization is enabled, the patch disables custom init
methods by passing a no-op lambda, allowing Megatron's CPU
initialization to work correctly with Primus Turbo's custom layer
implementations.

Co-authored-by: WangLingxun <linxwang@amd.com>
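The no-op-lambda trick can be illustrated with a minimal sketch; `build_linear` and the initializers below are hypothetical, not the actual Primus Turbo signatures:

```python
# Hedged sketch: disabling a custom init method so the framework's own
# CPU initialization path can fill the weights instead.
def custom_gpu_init(weight):
    # Stand-in for a device-side initializer that would conflict
    # with CPU initialization.
    for row in weight:
        for j in range(len(row)):
            row[j] = 1.0

def build_linear(rows, cols, init_method, use_cpu_initialization=False):
    if use_cpu_initialization:
        # Replace the custom initializer with a no-op, mirroring the
        # patch's "pass a no-op lambda" approach.
        init_method = lambda w: None
    weight = [[0.0] * cols for _ in range(rows)]
    init_method(weight)
    return weight
```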
Previously, the evaluation loss was computed per iteration and
overwritten, leading to incorrect averaging when multiple eval
iterations are used.
This fix accumulates the numerator and denominator separately across all
eval iterations and computes the final average at the end.
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
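The accumulate-then-divide fix amounts to the following (a minimal sketch; the function name and the `(loss_sum, token_count)` pair shape are illustrative, not the project's actual API):

```python
def average_eval_loss(iterations):
    """iterations: (loss_sum, token_count) pairs, one per eval iteration.

    Accumulates numerator and denominator separately across all eval
    iterations, then divides once at the end -- instead of overwriting
    a per-iteration average, which weights iterations incorrectly.
    """
    total_loss = 0.0
    total_tokens = 0
    for loss_sum, tokens in iterations:
        total_loss += loss_sum      # accumulate numerator
        total_tokens += tokens      # accumulate denominator
    return total_loss / total_tokens
```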
…odes (#554)

### Changes:
Only flag imbalance if the COUNT of GPUs on each node differs.
Example:
- 4 GPUs on Node 0, 4 on Node 1 -> counts=[4,4] -> set={4} -> len=1 -> NOT imbalanced.
- 7 GPUs on Node 0, 1 on Node 1 -> counts=[7,1] -> set={7,1} -> len=2 -> imbalanced.

### Reason for changes:
The previous logic issued a NUMA imbalance warning whenever the GPUs were
not all attached to the same node, producing false positives on
multi-socket systems where an even split across nodes is expected.

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
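The count-based check described above can be sketched directly (function name is illustrative):

```python
from collections import Counter

def numa_imbalanced(gpu_nodes):
    """Flag imbalance only when per-node GPU counts differ.

    gpu_nodes: one NUMA node id per GPU, e.g. [0, 0, 0, 0, 1, 1, 1, 1].
    counts=[4,4] -> set={4} -> len=1 -> balanced;
    counts=[7,1] -> set={7,1} -> len=2 -> imbalanced.
    """
    counts = Counter(gpu_nodes).values()
    return len(set(counts)) > 1
```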
Updates the AINIC Docker build inputs and adjusts the pretrain launcher
to disable HipBLASLt tuning by default (to avoid profiler/TE issues),
while also extending CI to build an additional v25.09 AINIC image
variant.

Changes:

- Disable HipBLASLt tuning by default in run_pretrain.sh, requiring an
explicit opt-in env var to enable it.
- Bump the AINIC bundle used by the AINIC Docker image from a-38 to
a-56.
- Update CI to use the new bundle and add a new -v25.09-ainic image
build/push step.
- Introduced a new class, ElapsedAverageExtension, to calculate and
inject the running average of elapsed time per iteration (ms) into
training logs.
- Updated TrainingLogInfo to include elapsed_index for tracking elapsed
time segments.
- Enhanced log parsing to support the new elapsed time metrics.
- Modified patch_training_log_unified to integrate the new extension
alongside existing memory and throughput statistics.

---------

Co-authored-by: HuangWei-95 <weihuan@amd.com>
Co-authored-by: wenxie-amd <wen.xie@amd.com>
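The running-average injection can be reduced to a small sketch; the class below mirrors only the arithmetic, not the actual `ElapsedAverageExtension` interface or its log-patching hooks:

```python
class ElapsedAverage:
    """Running average of elapsed time per iteration, in milliseconds.

    Hedged sketch: keeps a cumulative sum and count so the average can
    be injected into each training-log line without storing history.
    """
    def __init__(self):
        self.total_ms = 0.0
        self.count = 0

    def update(self, elapsed_ms):
        self.count += 1
        self.total_ms += elapsed_ms
        return self.total_ms / self.count
```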
Ensure --key=value and --key value are parsed consistently so runtime
config overrides apply correctly.
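One way such normalization is commonly done is sketched below; the function name and the simplifying assumption that every `--key` takes exactly one value are mine, not the project's parser:

```python
def normalize_overrides(argv):
    """Parse --key=value and --key value forms into one dict.

    Hedged sketch: assumes every --key option carries exactly one value,
    so both spellings produce identical override dicts.
    """
    out = {}
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok.startswith("--") and "=" in tok:
            key, val = tok[2:].split("=", 1)   # --key=value form
            i += 1
        elif tok.startswith("--"):
            key, val = tok[2:], argv[i + 1]    # --key value form
            i += 2
        else:
            i += 1                             # skip stray positional
            continue
        out[key] = val
    return out
```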