Support MatFormer tiering with tensor parallelism #1

@plugyawn

Description

Problem

MatFormer tiering currently hard-fails when tensor parallelism (TP) is enabled: both the LLaMA and NanoGPT MLPs assert `!is_tensor_parallel`, which blocks tiered training for large models that need TP to fit in GPU memory (e.g., multi-GPU A100/H100 setups). It also rules out heterogeneous setups where smaller nodes train smaller tiers while larger nodes train full tiers.

Refs:

  • shared/modeling/src/models/llama.rs (assert in MLP forward)
  • shared/modeling/src/models/nanogpt.rs (assert in MLP forward)

Expected

Tiered FFN slicing should be compatible with TP shards when dimensions are divisible by TP size.

Possible Approach

  • Define `matformer_hidden_size_per_rank = (intermediate_size / 2^tier) / tp_size`.
  • In TP, each rank already holds a contiguous shard; slice within each shard for tiered widths.
  • For helper mode: sample indices in global space, then map to per-rank indices (or sample per-rank deterministically).
  • Add shape/consistency checks and a small TP+MatFormer test (tiny model) to guard behavior.
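A minimal sketch of the sizing and index-mapping steps above (function names, signatures, and error handling are illustrative assumptions, not the actual shared/modeling API):

```rust
/// Hypothetical helper: per-rank FFN hidden size for a MatFormer tier
/// under tensor parallelism. Tier t keeps intermediate_size / 2^t
/// channels; each TP rank holds a contiguous 1/tp_size shard of those.
fn matformer_hidden_size_per_rank(
    intermediate_size: usize,
    tier: u32,
    tp_size: usize,
) -> Result<usize, String> {
    let divisor = 1usize << tier; // 2^tier
    if intermediate_size % divisor != 0 {
        return Err(format!(
            "intermediate_size {intermediate_size} not divisible by 2^tier = {divisor}"
        ));
    }
    let tiered = intermediate_size / divisor;
    if tiered % tp_size != 0 {
        return Err(format!(
            "tiered hidden size {tiered} not divisible by tp_size {tp_size}"
        ));
    }
    Ok(tiered / tp_size)
}

/// Hypothetical helper-mode mapping: a channel index sampled in global
/// space maps to (tp_rank, local index) under the contiguous-shard layout.
fn global_to_local(global_idx: usize, shard_size: usize) -> (usize, usize) {
    (global_idx / shard_size, global_idx % shard_size)
}

fn main() {
    // e.g. intermediate_size = 16384; tier 1 halves it to 8192,
    // and tp_size = 4 leaves 2048 channels per rank.
    assert_eq!(matformer_hidden_size_per_rank(16384, 1, 4), Ok(2048));
    assert!(matformer_hidden_size_per_rank(16384, 1, 3).is_err());
    // Global channel 9000 lands on rank 4 at local offset 808.
    assert_eq!(global_to_local(9000, 2048), (4, 808));
}
```

Returning an error instead of asserting is what would give the clear divisibility message rather than a hard failure.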

Acceptance Criteria

  • Remove the TP assert for MatFormer tiers.
  • Tiered training works with TP>1 on a small model test.
  • Clear error message when tiered dimensions aren’t divisible by the TP size.
