Support MatFormer tiering with tensor parallelism #1

@plugyawn

Description

Problem

MatFormer tiering currently hard-fails when tensor parallelism (TP) is enabled: both the LLaMA and NanoGPT MLPs assert `!is_tensor_parallel`, which blocks tiered training for large models that need TP to fit in GPU memory (e.g., multi-GPU A100/H100 setups). It also rules out heterogeneous setups where smaller nodes train smaller tiers while larger nodes train full tiers.

Refs:

  • shared/modeling/src/models/llama.rs (assert in MLP forward)
  • shared/modeling/src/models/nanogpt.rs (assert in MLP forward)

Expected

Tiered FFN slicing should be compatible with TP shards when dimensions are divisible by TP size.

Possible Approach

  • Define `matformer_hidden_size_per_rank = (intermediate_size / 2^tier) / tp_size`.
  • In TP, each rank already holds a contiguous shard; slice within each shard for tiered widths.
  • For helper mode: sample indices in global space, then map to per-rank indices (or sample per-rank deterministically).
  • Add shape/consistency checks and a small TP+MatFormer test (tiny model) to guard behavior.
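A minimal sketch of the sizing and index-mapping steps above (function names, signatures, and error handling are illustrative assumptions, not the actual shared/modeling API):

```rust
/// Hypothetical helper: per-rank FFN hidden size for a MatFormer tier
/// under tensor parallelism. Tier t keeps intermediate_size / 2^t
/// channels; each TP rank holds a contiguous 1/tp_size shard of those.
fn matformer_hidden_size_per_rank(
    intermediate_size: usize,
    tier: u32,
    tp_size: usize,
) -> Result<usize, String> {
    let divisor = 1usize << tier; // 2^tier
    if intermediate_size % divisor != 0 {
        return Err(format!(
            "intermediate_size {intermediate_size} not divisible by 2^tier = {divisor}"
        ));
    }
    let tiered = intermediate_size / divisor;
    if tiered % tp_size != 0 {
        return Err(format!(
            "tiered hidden size {tiered} not divisible by tp_size {tp_size}"
        ));
    }
    Ok(tiered / tp_size)
}

/// Hypothetical helper-mode mapping: a channel index sampled in global
/// space maps to (tp_rank, local index) under the contiguous-shard layout.
fn global_to_local(global_idx: usize, shard_size: usize) -> (usize, usize) {
    (global_idx / shard_size, global_idx % shard_size)
}

fn main() {
    // e.g. intermediate_size = 16384; tier 1 halves it to 8192,
    // and tp_size = 4 leaves 2048 channels per rank.
    assert_eq!(matformer_hidden_size_per_rank(16384, 1, 4), Ok(2048));
    assert!(matformer_hidden_size_per_rank(16384, 1, 3).is_err());
    // Global channel 9000 lands on rank 4 at local offset 808.
    assert_eq!(global_to_local(9000, 2048), (4, 808));
}
```

Returning an error instead of asserting is what would give the clear divisibility message rather than a hard failure.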

Acceptance Criteria

  • Remove the TP assert for MatFormer tiers.
  • Tiered training works with TP>1 on a small model test.
  • Clear error message when tiered dimensions aren’t divisible by the TP size.
