SimpleStories is a collection of model-generated short stories. It was made to train small, interpretable language models. SimpleStories is inspired by TinyStories.
See also the demo website for an interactive preview. We provide code for model training as well.
When using SimpleStories in your work, please cite the SimpleStories data paper:
@article{finke2025parameterized,
title={Parameterized Synthetic Text Generation with SimpleStories},
author={Finke, Lennart and Dooms, Thomas and Allen, Mat and Rodriguez, Juan Diego and Nabeshima, Noa and Braun, Dan},
journal={arXiv preprint arXiv:2504.09184},
year={2025}
}
oai_batch.py provides functionality to recreate the dataset cost effectively with the OpenAI Batch API.
- Story annotation with high-level concepts:
theme,topic,style, etc. - Higher semantic and syntactic diversity through seeded story generation
- Generated by 2024 models
- Several NLP-metrics pre-computed to aid filtering
- ASCII-only guarantee for the English dataset
- Multilingual, with versions available in:
We have trained a model family on this dataset, available here: