This repository contains experiments in applying Hierarchical Reasoning Model (HRM) style architectures to language modeling tasks, with a focus on efficient training on Apple Silicon.
For a detailed writeup of this work, see *Language Modeling with Hierarchical Reasoning Models*.
We explore several architectural variants for language modeling on the TinyStories dataset:
- Flat Transformer: Standard transformer baseline for comparison
- TRM (Tiny Recursive Model): Simplified recursive reasoning with L-level and H-level states
- MoEUT (Mixture-of-Experts Universal Transformer): Combining MoE with recursive, layer-shared processing
- Adaptive Computation Time (ACT): Dynamic depth via learned halting
All models are trained from scratch on TinyStories (~2.1M sequences) and evaluated on their ability to generate coherent children's stories.
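To make the recursion concrete, here is a minimal, illustrative sketch of a TRM-style two-level update. The class and the linear stand-ins are hypothetical; the real blocks in `models/recursive_reasoning/trm.py` are transformer layers:

```python
import torch
import torch.nn as nn

class TRMSketch(nn.Module):
    """Sketch of TRM-style recursion: a fast low-level state z_L is refined
    several times per update of a slow high-level state z_H."""

    def __init__(self, dim: int, n_inner: int = 3, n_outer: int = 2):
        super().__init__()
        self.l_net = nn.Linear(dim * 3, dim)  # stand-in for the L-level block
        self.h_net = nn.Linear(dim * 2, dim)  # stand-in for the H-level block
        self.n_inner, self.n_outer = n_inner, n_outer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z_l = torch.zeros_like(x)
        z_h = torch.zeros_like(x)
        for _ in range(self.n_outer):
            for _ in range(self.n_inner):
                # L-level: refine the fast state from the input and both states
                z_l = self.l_net(torch.cat([x, z_l, z_h], dim=-1))
            # H-level: fold the refined fast state into the slow state
            z_h = self.h_net(torch.cat([z_h, z_l], dim=-1))
        return z_h
```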
Requirements:

- Python 3.10+
- Apple Silicon with Metal Performance Shaders (MPS)
- `uv` for dependency management
Install dependencies:
```bash
uv sync
```

Training uses Hydra configs located in `config/`. The main entrypoint is `pretrain_text.py`.
For a quick smoke test:

```bash
uv run pretrain_text.py --config-name cfg_tinystories_1m \
  max_sequences=64 \
  global_batch_size=32 \
  streaming=true \
  wandb_mode=disabled
```

Flat Transformer (~1M params):
```bash
uv run pretrain_text.py --config-name cfg_tinystories_1m_flat
```

TRM with MoEUT (~1M params):
```bash
uv run pretrain_text.py --config-name cfg_tinystories_1m_moeut
```

TRM with MoEUT and ACT (~1M params):
```bash
uv run pretrain_text.py --config-name cfg_tinystories_1m_moeut_act
```

Larger MoEUT variant (~10M params):
```bash
uv run pretrain_text.py --config-name cfg_tinystories_10m_moeut
```

Generate stories from trained checkpoints:
```bash
uv run python generate_stories.py \
  --checkpoint-path runs/checkpoints/trm_tinystories_1m/your-run-name/step_12345.pt \
  --prompt "Once upon a time" \
  --max-new-tokens 512
```

Run comprehensive generation sweeps across all checkpoints:
```bash
uv run python sweep_generation.py
```

This tests each checkpoint with:
- 10 diverse prompts (in- and out-of-distribution)
- 8 generation configs (varying temperature, top-k, ACT settings)
- Results saved as `generation_sweep_results.json` in each checkpoint directory
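As a follow-up to the results file above, a hypothetical snippet for collecting every sweep output under `runs/` (the file name comes from this README; the JSON schema itself is assumed, not documented here):

```python
import json
from pathlib import Path

# Walk every checkpoint directory and load its sweep results.
for path in Path("runs/checkpoints").rglob("generation_sweep_results.json"):
    results = json.loads(path.read_text())
    n = len(results) if isinstance(results, (list, dict)) else "?"
    print(f"{path.parent}: {n} entries")
```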
Dataset:

- Default: streams from Hugging Face (`roneneldan/TinyStories`)
- Offline: download with `datasets-cli download roneneldan/TinyStories` to `data/hf/TinyStories`
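For reference, streaming the dataset directly with the `datasets` library looks like this (illustrative only; the repo's own loading is driven by `pretrain_text.py` via the `streaming` flag):

```python
from datasets import load_dataset

# Stream TinyStories from the Hugging Face Hub without a full download
ds = load_dataset("roneneldan/TinyStories", split="train", streaming=True)
print(next(iter(ds))["text"][:80])
```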
Training metrics are logged to Weights & Biases:
- Local server: `http://localhost:8080` (default)
- Set `wandb_mode=disabled` to skip logging
- Set `wandb_mode=offline` for local-only logging
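The `wandb_mode` override maps onto the standard `wandb.init(mode=...)` switch; a minimal illustration (the project name here is hypothetical):

```python
import wandb

# mode="offline" logs to ./wandb locally; sync later with `wandb sync`.
# mode="disabled" turns all wandb calls into no-ops.
run = wandb.init(project="hrm-tinystories", mode="offline")
run.log({"loss": 0.0})
run.finish()
```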
Key MoEUT implementation details:

- Sigmoid-based routing: sparse expert selection in FFN and attention layers (sketched after this list)
- Bias balancing: DeepSeek-style loss-free load balancing via routing bias adjustment
- Post-normalization: RMS norm after attention/FFN for numerical stability
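A minimal sketch of the first two items, assuming hypothetical names (`SigmoidRouter`, `route_bias`) and a plain linear gate; the actual implementation lives in this repo's MoEUT layers:

```python
import torch
import torch.nn as nn

class SigmoidRouter(nn.Module):
    """Sketch: sigmoid top-k routing with DeepSeek-style loss-free balancing.

    A non-learned bias is nudged after each step so that over-used experts
    become less likely to be selected; no auxiliary loss is required.
    """

    def __init__(self, dim: int, n_experts: int, k: int, bias_lr: float = 1e-3):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.register_buffer("route_bias", torch.zeros(n_experts))
        self.k, self.bias_lr = k, bias_lr

    def forward(self, x: torch.Tensor):
        scores = torch.sigmoid(self.gate(x))              # (tokens, n_experts)
        # The bias only influences which experts get picked, not the weights
        # used to mix their outputs.
        _, idx = torch.topk(scores + self.route_bias, self.k, dim=-1)
        weights = scores.gather(-1, idx)                  # (tokens, k)
        if self.training:
            # Loss-free balancing: count how often each expert was chosen,
            # then step the bias toward the mean load (sign update, no grad).
            load = torch.zeros_like(self.route_bias)
            load.scatter_add_(0, idx.reshape(-1),
                              torch.ones(idx.numel(), device=x.device))
            self.route_bias += self.bias_lr * torch.sign(load.mean() - load)
        return idx, weights
```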
In `models/recursive_reasoning/trm.py:152`, we always apply post-normalization for MoEUT to prevent activation explosions:

```python
# Always use post-norm for MoEUT to prevent activation explosions
self._use_moeut_post_norm = self.config.use_moeut
```

This fix was critical for stable generation when `halt_max_steps == 1`.
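For context, the two residual orderings differ only in where the norm sits; a minimal sketch with illustrative function names:

```python
import torch
import torch.nn as nn

def post_norm(h: torch.Tensor, sublayer: nn.Module, norm: nn.Module) -> torch.Tensor:
    # Normalize *after* the residual add: the stream is re-bounded at every
    # step, which is what keeps repeated MoEUT passes from exploding.
    return norm(h + sublayer(h))

def pre_norm(h: torch.Tensor, sublayer: nn.Module, norm: nn.Module) -> torch.Tensor:
    # Normalize only the sublayer input: the residual stream itself is never
    # rescaled, so activations can grow across recursive steps.
    return h + sublayer(norm(h))
```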
Run the test suite:

```bash
uv run --group dev pytest
```

This code is based on:
- Tiny Recursive Model (TRM) by Alexia Jolicoeur-Martineau
- Hierarchical Reasoning Model (HRM) by Wang et al.
- Original HRM codebase: https://github.com/sapientinc/HRM
Citations:

```bibtex
@misc{jolicoeurmartineau2025morerecursivereasoningtiny,
  title={Less is More: Recursive Reasoning with Tiny Networks},
  author={Alexia Jolicoeur-Martineau},
  year={2025},
  eprint={2510.04871},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.04871},
}

@misc{wang2025hierarchicalreasoningmodel,
  title={Hierarchical Reasoning Model},
  author={Guan Wang and Jin Li and Yuhao Sun and Xing Chen and Changling Liu and Yue Wu and Meng Lu and Sen Song and Yasin Abbasi Yadkori},
  year={2025},
  eprint={2506.21734},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.21734},
}
```