cat-state/modded-nanogpt-moe

MoE fork of NanoGPT speedrun

This repository is a fork of the NanoGPT speedrun, demonstrating the use of mixture-of-experts (MoE) for fun and profit. The fork is taken from the 10.8-minute record. In the original speedrun, the goal is to reach a validation cross-entropy loss below 3.28 on the FineWeb dataset.


Running the dense baseline

This is a modernized but simple causal transformer architecture, using rotary position embeddings, padded and untied token embeddings, RMSNorm, the ReLU^2 activation, QK-norm, and the Muon optimizer. It achieves 3.276 validation loss when trained on 2.4B tokens of the FineWeb dataset.
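The ReLU^2 activation mentioned above is simply the square of a ReLU. A minimal NumPy sketch of the feed-forward block it is used in (function and variable names here are illustrative, not the repo's):

```python
import numpy as np

def relu2(x):
    # ReLU^2: zero out negatives, then square.
    return np.maximum(x, 0.0) ** 2

def mlp(x, W_in, W_out):
    # Feed-forward block with the ReLU^2 nonlinearity.
    # Weight shapes are placeholders; biases are omitted as in the repo.
    return relu2(x @ W_in) @ W_out

# toy usage: 2 tokens, model dim 3, hidden dim 5
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))
y = mlp(x, rng.standard_normal((3, 5)), rng.standard_normal((5, 3)))
```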

git clone https://github.com/cat-state/modded-nanogpt-moe.git && cd modded-nanogpt-moe
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
python data/cached_fineweb10B.py 24  # download 24 x 100M-token FineWeb shards (~2.4B tokens)
./run_dense.sh

Running the MoE

This trains a 4-expert MoE with top-1 routing. Each expert has the same architecture as the dense model, so the overall MoE has more total parameters than the dense baseline but the same number of activated parameters per token. It achieves a 3.218 validation loss - an improvement of 1.8%. It reaches <3.28 validation loss in 4000 steps, compared to 4500 for the baseline - an improvement of 11%.
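Top-1 routing means each token is sent to exactly one expert, which is why activated parameters stay fixed as experts are added. A minimal NumPy sketch of the routing step (names are illustrative, not the repo's):

```python
import numpy as np

def top1_route(x, W_gate):
    """Top-1 routing sketch: each token goes to its single
    highest-scoring expert, so per-token compute matches the
    dense model regardless of the number of experts."""
    logits = x @ W_gate                      # (tokens, n_experts)
    expert = logits.argmax(axis=-1)          # chosen expert per token
    # softmax gate weight of the chosen expert, used to scale its output
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    gate = probs[np.arange(len(x)), expert]
    return expert, gate

# toy usage: 8 tokens, model dim 16, 4 experts as in run_moe.sh
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
W_gate = rng.standard_normal((16, 4))
expert, gate = top1_route(x, W_gate)
```

Each token's chosen expert output would then be multiplied by its `gate` weight, keeping the router differentiable.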

git clone https://github.com/cat-state/modded-nanogpt-moe.git && cd modded-nanogpt-moe
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
python data/cached_fineweb10B.py 24  # download 24 x 100M-token FineWeb shards (~2.4B tokens)
./run_moe.sh

References

  1. Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. "modded-nanogpt: Speedrunning the NanoGPT baseline." GitHub repository (2024).
  2. Guilherme Penedo et al. "The FineWeb datasets: Decanting the web for the finest text data at scale." arXiv preprint arXiv:2406.17557 (2024).

Citation

@misc{modded_nanogpt_2024,
  author       = {Keller Jordan and Jeremy Bernstein and Brendan Rappazzo and
                  @fernbear.bsky.social and Boza Vlado and You Jiacheng and
                  Franz Cesista and Braden Koszarsky and @Grad62304977},
  title        = {modded-nanogpt: Speedrunning the NanoGPT baseline},
  year         = {2024},
  url          = {https://github.com/KellerJordan/modded-nanogpt}
}
