This is a research fork of Nous Research's Psyche, supported by a very generous compute grant from the same team, and an Emergent Ventures grant.
*MatFormer: Nested Transformer for Elastic Inference* and the subsequent Gemma-3n models show a way to develop nested Transformers whose submodels can be trained collaboratively.
This means that, even though a device cannot hold the full model, it can still train a nested submodel of it and contribute to the shared weights.

This property can be used to develop a distributed training method that slices GPT models width-wise (for FFNs) and head-wise (for attention), distributes the smaller slices to devices with less VRAM while larger devices hold the full model, and intelligently gathers gradients. It also produces strong small models that are verifiably more powerful than models distilled from the larger one afterwards on an equivalent budget (effectively for free, since low-VRAM devices otherwise simply could not join the training run).
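As a rough illustration of the "intelligently gather gradients" step, here is a toy sketch (hypothetical; the real aggregation lives in the client code, and the normalization scheme here is an assumption): every tier's gradient covers a prefix of the full FFN width, so shared prefix coordinates receive contributions from all tiers and are averaged over however many clients actually touched them.

```python
# Toy sketch (NOT the actual Psyche/Aftermath code) of width-wise gradient
# aggregation across MatFormer tiers. Each client's gradient is a prefix of
# the full FFN width; each coordinate is normalized by its contributor count.

FULL_WIDTH = 8  # toy full FFN intermediate width

def aggregate_prefix_grads(grads, full_width=FULL_WIDTH):
    summed = [0.0] * full_width
    counts = [0] * full_width
    for g in grads:
        for i, v in enumerate(g):
            summed[i] += v
            counts[i] += 1
    # average each coordinate over the clients that contributed to it,
    # so heavily shared prefix weights are not over-weighted
    return [s / c if c else 0.0 for s, c in zip(summed, counts)]

# tier 0 sees all 8 columns, tier 1 the first 4, tier 2 the first 2
agg = aggregate_prefix_grads([[1.0] * 8, [2.0] * 4, [3.0] * 2])
# agg[0:2] averaged over 3 clients, agg[2:4] over 2, agg[4:] over 1
```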
Highlighted in red above is a submodel on a "satellite", trained in tandem with a larger model.
Both the larger model and the smaller submodel outperform models trained independently on the same token/time budget.
The smaller model occupies only ~70% of the larger model's VRAM, and the difference grows with scale. These models are in the ~124M-parameter range, trained on a small slice of FineWeb.
Here is a little infographic I made about my dreams with Aftermath:

Aftermath is built on top of Psyche (PsycheFoundation/psyche) and extends it with MatFormer tiered checkpoints, manifest-based slicing, and practical tooling for mixed‑hardware training. For canonical Psyche documentation, see https://docs.psyche.network (upstream).
This README adds the elastic workflow and fork-specific details. For more extensive documentation, build and host psyche-book (available in this repo) and refer to the new MatFormer training details.
Aftermath offers, out of the box:
- Support for heterogeneous training:
  - smaller devices train smaller tiers; large devices train full tiers; all contribute to shared weights.
  - tiered checkpoints pull only the files needed for a given tier (esp. from HF).
  - schema canonicalization + double-slicing protection prevent mismatched runs.
  - explicit manifests + metadata give reliable tier detection and validation.
This is made possible by the implementation of a Matryoshka Transformer (MatFormer). Some details to note as you look through the code/documentation:
- Tier slices are prefixes of the FFN width: tier 1 = 1/2, tier 2 = 1/4, etc.
- Clients at smaller tiers update only prefix weights; shared prefixes receive gradients from all tiers.
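A minimal sketch of what prefix slicing means for a single FFN layer (toy nested lists standing in for real tensors; `slice_ffn` is a hypothetical helper, not the repo's API):

```python
def tier_width(base_width, tier):
    # tier 0 = full width, tier 1 = 1/2, tier 2 = 1/4, ...
    return base_width >> tier

def slice_ffn(w_up, w_down, tier):
    """w_up: intermediate_size rows of d_model values;
    w_down: d_model rows of intermediate_size values.
    Returns prefix-sliced copies for the given tier."""
    width = tier_width(len(w_up), tier)
    up = [row[:] for row in w_up[:width]]   # keep the first `width` output rows
    down = [row[:width] for row in w_down]  # keep the matching input columns
    return up, down

# toy FFN: base intermediate width 4, d_model 2
w_up = [[i, i] for i in range(4)]
w_down = [[0, 1, 2, 3], [0, 1, 2, 3]]
up1, down1 = slice_ffn(w_up, w_down, tier=1)  # tier 1 = half width
```

Because the slice is a true prefix, the tier-1 weights are literally the first half of the tier-0 weights, which is what lets all tiers share gradients on the common prefix.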
- Rust toolchain
- Python (3.11/3.12 recommended) with PyTorch installed
- `tmux` (optional, for the local testnet UI)
Use the repo helper to point tch-rs to your Python torch install:
```
source scripts/psyche-env.sh
```

Then start a local testnet with mixed tiers:

```
just local-testnet \
  --num-clients 3 \
  --config-path ./config/consilience-match-llama2-20m-fineweb-pretrain-dev/ \
  --client-matformer-tiers 0,1,2
```

If you don’t have `just`, use:
```
cargo run -p psyche-centralized-local-testnet -- start \
  --num-clients 3 \
  --config-path ./config/consilience-match-llama2-20m-fineweb-pretrain-dev/ \
  --client-matformer-tiers 0,1,2
```

To export tier slices from a checkpoint:

```
python scripts/export_matformer_tiers.py --src checkpoints/my-model --tiers 1 2
```

Produces:
- `checkpoints/my-model-tier1/`, `-tier2/` (sliced weights)
- `matformer_manifest.json` in the universal directory
- Tier configs with:
  - `matformer_tier`
  - `matformer_base_intermediate_size`
  - `intermediate_size` (sliced width)
Example manifest:

```json
{
  "schema_version": 1,
  "matformer_base_intermediate_size": 1024,
  "common_files": ["tokenizer.json"],
  "tiers": [
    {"tier": 1, "intermediate_size": 512, "files": ["../my-model-tier1/config.json", "../my-model-tier1/model.safetensors"]}
  ],
  "sha256": {"...": "..."}
}
```

Paths are relative to the manifest location; the loader normalizes them for HF and rejects absolute paths.
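That path rule could be sketched like this (hypothetical helper; `posixpath` used for illustration, not the loader's actual implementation):

```python
import posixpath

def resolve_manifest_path(manifest_dir, rel):
    # reject absolute paths outright, as the loader described above does
    if posixpath.isabs(rel):
        raise ValueError(f"absolute path not allowed: {rel}")
    # normalize ".." segments relative to the manifest's own directory
    return posixpath.normpath(posixpath.join(manifest_dir, rel))

p = resolve_manifest_path("checkpoints/my-model", "../my-model-tier1/config.json")
# -> "checkpoints/my-model-tier1/config.json"
```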
- `auto` (default): use the sliced checkpoint if complete, otherwise fall back to universal.
- `sliced`: require the tier slice; fail fast if missing.
- `universal`: always load the full checkpoint.
The loader prevents double‑slicing if a checkpoint is already tiered.
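The mode selection and double-slicing guard amount to a small policy, sketched here with hypothetical function names (the real logic lives in the loader):

```python
def choose_source(mode, tier_complete):
    # mode: "auto" | "sliced" | "universal"
    if mode == "sliced":
        if not tier_complete:
            raise FileNotFoundError("tier slice missing; failing fast")
        return "sliced"
    if mode == "universal":
        return "universal"
    # auto: prefer the sliced checkpoint, fall back to the universal one
    return "sliced" if tier_complete else "universal"

def check_not_double_sliced(config):
    # a config that already carries a nonzero matformer_tier is itself
    # a slice, so slicing it again would corrupt the nesting
    if config.get("matformer_tier", 0) > 0:
        raise ValueError("checkpoint already tiered; refusing to slice again")

src = choose_source("auto", tier_complete=False)  # falls back to "universal"
```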
- MatFormer tier slices, manifests, and hub selective downloads
- Metadata inference + schema canonicalization for mixed tiers
- Heterogeneous gradient aggregation normalization
- Parameter tying, aggressive NanoGPT support (ongoing)
- Optimizer and kernel work (Muon, Polar Express) with tests (ongoing)
- System metrics logging + fault injection tools
- Packaging and local testnet improvements
- Tensor parallelism with tier > 0 is not supported yet.
- Helper mode (suffix sampling) is disabled until sparse alignment/rotation is complete.
- Tier export currently assumes a single `.safetensors` shard.
