cat-state/modded-nanogpt-moe

MoE fork of NanoGPT speedrun

This repository is a fork of the NanoGPT speedrun, demonstrating the use of mixture-of-experts (MoE) for fun and profit. The fork is taken from the 10.8-minute record. In the original speedrun, the goal is to reach a validation cross-entropy loss below 3.28 on the FineWeb dataset.


Running the dense baseline

This is a modernized but simple causal transformer architecture, using rotary position embeddings, padded and untied token embeddings, RMSNorm, the ReLU^2 activation, QK-norm, and the Muon optimizer. It achieves 3.276 validation loss when trained on 2.4B tokens of the FineWeb dataset.
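The ReLU^2 activation mentioned above is simply the square of a ReLU. A minimal NumPy sketch of the feed-forward block it is used in (function and variable names here are illustrative, not the repo's):

```python
import numpy as np

def relu2(x):
    # ReLU^2: zero out negatives, then square.
    return np.maximum(x, 0.0) ** 2

def mlp(x, W_in, W_out):
    # Feed-forward block with the ReLU^2 nonlinearity.
    # Weight shapes are placeholders; biases are omitted as in the repo.
    return relu2(x @ W_in) @ W_out

# toy usage: 2 tokens, model dim 3, hidden dim 5
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))
y = mlp(x, rng.standard_normal((3, 5)), rng.standard_normal((5, 3)))
```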

git clone https://github.com/cat-state/modded-nanogpt-moe.git && cd modded-nanogpt-moe
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
python data/cached_fineweb10B.py 24  # download 24 x 100M-token FineWeb shards (~2.4B tokens)
./run_dense.sh

Running the MoE

This trains a 4-expert MoE with top-1 routing. Each expert has the same architecture as the dense model, so the overall MoE has more total parameters than the dense baseline but the same number of activated parameters per token. It achieves a 3.218 validation loss - an improvement of 1.8%. It reaches <3.28 validation loss in 4000 steps, compared to 4500 for the baseline - an improvement of 11%.
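Top-1 routing means each token is sent to exactly one expert, which is why activated parameters stay fixed as experts are added. A minimal NumPy sketch of the routing step (names are illustrative, not the repo's):

```python
import numpy as np

def top1_route(x, W_gate):
    """Top-1 routing sketch: each token goes to its single
    highest-scoring expert, so per-token compute matches the
    dense model regardless of the number of experts."""
    logits = x @ W_gate                      # (tokens, n_experts)
    expert = logits.argmax(axis=-1)          # chosen expert per token
    # softmax gate weight of the chosen expert, used to scale its output
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    gate = probs[np.arange(len(x)), expert]
    return expert, gate

# toy usage: 8 tokens, model dim 16, 4 experts as in run_moe.sh
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
W_gate = rng.standard_normal((16, 4))
expert, gate = top1_route(x, W_gate)
```

Each token's chosen expert output would then be multiplied by its `gate` weight, keeping the router differentiable.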

git clone https://github.com/cat-state/modded-nanogpt-moe.git && cd modded-nanogpt-moe
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
python data/cached_fineweb10B.py 24  # download 24 x 100M-token FineWeb shards (~2.4B tokens)
./run_moe.sh

References

  1. Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. "modded-nanogpt: Speedrunning the NanoGPT baseline." GitHub repository (2024).
  2. Guilherme Penedo et al. "The FineWeb datasets: Decanting the web for the finest text data at scale." arXiv preprint arXiv:2406.17557 (2024).

Citation

@misc{modded_nanogpt_2024,
  author       = {Keller Jordan and Jeremy Bernstein and Brendan Rappazzo and
                  @fernbear.bsky.social and Boza Vlado and You Jiacheng and
                  Franz Cesista and Braden Koszarsky and @Grad62304977},
  title        = {modded-nanogpt: Speedrunning the NanoGPT baseline},
  year         = {2024},
  url          = {https://github.com/KellerJordan/modded-nanogpt}
}
