Problem
MatFormer tiering currently hard-fails when tensor parallelism (TP) is enabled. Both the LLaMA and NanoGPT MLPs assert `!is_tensor_parallel`, which blocks tiered training for large models that require TP to fit (e.g., multi-GPU A100/H100). This prevents heterogeneous setups where smaller nodes train smaller tiers while larger nodes train full tiers.
Refs:
- shared/modeling/src/models/llama.rs (assert in MLP forward)
- shared/modeling/src/models/nanogpt.rs (assert in MLP forward)
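The assertion sites themselves are not reproduced here, so the following is a minimal sketch of how the hard `assert !is_tensor_parallel` could become a recoverable check with an actionable message. All names (`MlpConfig`, its fields, `validate_tiered_tp`) are illustrative assumptions, not the actual types in llama.rs / nanogpt.rs:

```rust
// Hypothetical sketch: replace a hard assert with a descriptive error.
// `MlpConfig` and its field names are assumptions for illustration only.

#[derive(Debug)]
struct MlpConfig {
    intermediate_size: usize,
    tp_size: usize,      // tensor-parallel world size (1 = no TP)
    matformer_tier: u32, // 0 = full width; tier t keeps intermediate_size / 2^t
}

fn validate_tiered_tp(cfg: &MlpConfig) -> Result<usize, String> {
    let tiered = cfg.intermediate_size >> cfg.matformer_tier;
    // Instead of `assert!(!is_tensor_parallel)`, fail only when the tiered
    // width cannot be sharded evenly across TP ranks.
    if tiered % cfg.tp_size != 0 {
        return Err(format!(
            "MatFormer tier {} gives FFN width {}, not divisible by tp_size {}; \
             pick a tier/TP combination with tiered_width % tp_size == 0",
            cfg.matformer_tier, tiered, cfg.tp_size
        ));
    }
    Ok(tiered / cfg.tp_size) // per-rank tiered hidden size
}

fn main() {
    // 16384-wide FFN, tier 1 (half width), 4-way TP -> 2048 columns per rank.
    let ok = MlpConfig { intermediate_size: 16384, tp_size: 4, matformer_tier: 1 };
    assert_eq!(validate_tiered_tp(&ok), Ok(2048));

    // Indivisible case: returns an error instead of aborting the process.
    let bad = MlpConfig { intermediate_size: 100, tp_size: 4, matformer_tier: 1 };
    assert!(validate_tiered_tp(&bad).is_err());
    println!("validation ok");
}
```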
Expected
Tiered FFN slicing should be compatible with TP shards when dimensions are divisible by TP size.
Possible Approach
- Define matformer_hidden_size_per_rank = (intermediate_size / 2^tier) / tp_size.
- In TP, each rank already holds a contiguous shard; slice within each shard for tiered widths.
- For helper mode: sample indices in global space, then map to per-rank indices (or sample per-rank deterministically).
- Add shape/consistency checks and a small TP+MatFormer test (tiny model) to guard behavior.
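The per-rank sizing, within-shard slicing, and global-to-per-rank index mapping above can be sketched as follows. This is a hedged sketch, not the repo's implementation; `matformer_hidden_size_per_rank`, `local_tier_slice`, and `global_to_per_rank` are hypothetical helper names:

```rust
// Sketch of per-rank tiered slicing and helper-mode index mapping.
// Function names are illustrative assumptions, not existing APIs.

/// Per-rank tiered width: (intermediate_size / 2^tier) / tp_size.
fn matformer_hidden_size_per_rank(
    intermediate_size: usize,
    tier: u32,
    tp_size: usize,
) -> Option<usize> {
    let tiered = intermediate_size >> tier;
    // Shape/consistency check: tiered width must shard evenly.
    if tiered % tp_size != 0 {
        return None;
    }
    Some(tiered / tp_size)
}

/// Each TP rank holds a contiguous shard of the FFN hidden dim; per the
/// approach above, tiering keeps the leading `per_rank` columns *within*
/// each rank's own shard, so the local slice is simply 0..per_rank.
fn local_tier_slice(per_rank: usize) -> std::ops::Range<usize> {
    0..per_rank
}

/// Helper mode: an index sampled in the global hidden dim maps to
/// (owning rank, local index within that rank's shard).
fn global_to_per_rank(global_idx: usize, shard_size: usize) -> (usize, usize) {
    (global_idx / shard_size, global_idx % shard_size)
}

fn main() {
    // 16384-wide FFN, tier 1, tp_size = 4: each rank keeps 8192/4 = 2048 columns.
    let per_rank = matformer_hidden_size_per_rank(16384, 1, 4).unwrap();
    assert_eq!(per_rank, 2048);
    assert_eq!(local_tier_slice(per_rank), 0..2048);

    // Full shard size is 16384/4 = 4096, so global index 5000
    // lives on rank 1 at local offset 904.
    assert_eq!(global_to_per_rank(5000, 4096), (1, 904));
    println!("per-rank tiered width: {per_rank}");
}
```

The alternative mentioned above (sampling per-rank deterministically, e.g. from a seed shared across ranks) avoids the remapping step entirely, at the cost of constraining how many helper indices land on each shard.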
Acceptance Criteria
- Remove the TP assert for MatFormer tiers.
- Tiered training works with TP>1 on a small model test.
- Clear error message when tiered dims aren’t divisible by TP.