A minimal implementation of GPT (Generative Pre-trained Transformer) for training and sampling from language models. This project provides a (hopefully) easy-to-follow codebase for training GPT models from scratch or fine-tuning pre-trained models.
```
tinyGPT
├── config/
│   ├── eval_gpt2.py              # Evaluate GPT-2 models
│   ├── finetune_shakespeare.py   # Fine-tune on Shakespeare dataset
│   ├── train_gpt2.py             # Train GPT-2 (124M) on OpenWebText
│   └── train_shakespeare_char.py # Train character-level model on Shakespeare
├── data/
│   ├── openwebtext/
│   │   └── prepare.py            # Download and tokenize OpenWebText
│   ├── shakespeare/
│   │   └── prepare.py            # Prepare Shakespeare dataset (word-level)
│   └── shakespeare_char/
│       ├── input.txt             # Raw Shakespeare text
│       ├── meta.pkl
│       └── prepare.py            # Prepare Shakespeare dataset (character-level)
├── src/
│   ├── model.py                  # GPT model architecture
│   ├── train.py                  # Training script with DDP support
│   ├── sample.py                 # Sample/generate text from trained models
│   ├── bench.py                  # Benchmarking and profiling utilities
│   └── configurator.py           # Configuration override system
├── *.sh                          # Shell scripts for common workflows
├── README.md
└── .gitignore
```
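The `shakespeare_char` pipeline works directly on characters rather than subword tokens. As a rough sketch of what character-level tokenization involves (the variable and function names here are illustrative, not the repo's exact API):

```python
# Minimal sketch of character-level tokenization, the idea behind
# data/shakespeare_char/prepare.py (names are illustrative).
text = "First Citizen: Speak."

chars = sorted(set(text))                      # vocabulary = unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for i, ch in enumerate(chars)}   # integer id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode(text)
assert decode(ids) == text  # round-trip is lossless
print(len(chars), "characters in vocab")
```

The real prepare script additionally writes the encoded ids to `train.bin`/`val.bin` and saves the vocabulary mappings so sampling can decode model output back to text.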
- `model.py`: GPT implementation, including:
  - Transformer architecture (attention, MLP, blocks)
  - Model initialization and weight loading
  - Pre-trained GPT-2 checkpoint loading
  - Text generation with top-k sampling
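Top-k sampling keeps only the k highest-scoring tokens at each step before sampling, which cuts off the long tail of unlikely tokens. A pure-Python sketch of the idea (the actual generation code works on torch tensors; `top_k_sample` is a hypothetical helper, not the repo's API):

```python
import math
import random

def top_k_sample(logits, k=50, temperature=1.0, rng=random):
    """Sample a token index from `logits` (a list of floats),
    keeping only the k highest-scoring tokens."""
    scaled = [l / temperature for l in logits]
    # Find the k-th largest logit and mask out everything below it.
    cutoff = sorted(scaled, reverse=True)[min(k, len(scaled)) - 1]
    masked = [s if s >= cutoff else -math.inf for s in scaled]
    # Softmax over the surviving logits (subtract max for stability).
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index from the resulting distribution.
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# With k=1 only the argmax survives, so sampling is deterministic:
print(top_k_sample([0.1, 0.5, 3.0, 0.2], k=1))  # 2
```

Lower `temperature` sharpens the distribution toward the top tokens; higher values flatten it and make samples more diverse.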
- `train.py`: Training script, with:
  - Single-GPU and multi-GPU (DDP) training
  - Learning rate scheduling (cosine decay with warmup)
  - Checkpointing and resuming
  - Integration with Weights & Biases logging
  - Support for mixed-precision training (float16/bfloat16)
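The learning-rate schedule (cosine decay with warmup) can be sketched as a small function of the iteration number. The default values below are illustrative, not necessarily the repo's exact settings:

```python
import math

def get_lr(it, learning_rate=6e-4, min_lr=6e-5,
           warmup_iters=2000, lr_decay_iters=600000):
    """Cosine decay with linear warmup (illustrative defaults)."""
    if it < warmup_iters:
        # Linear warmup from ~0 up to the peak learning rate.
        return learning_rate * (it + 1) / warmup_iters
    if it > lr_decay_iters:
        # After the decay horizon, hold at the floor.
        return min_lr
    # Cosine decay from learning_rate down to min_lr.
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(0))       # tiny LR at the start of warmup
print(get_lr(2000))    # peak LR once warmup ends
print(get_lr(600000))  # decayed to min_lr
```

In the training loop, `get_lr(it)` is evaluated each iteration and written into the optimizer's parameter groups.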
- `sample.py`: Inference and text generation:
  - Load trained checkpoints or pre-trained GPT-2 models
  - Generate text samples with configurable parameters
  - Support for custom prompts from text files
- `bench.py`: A simplified training loop for benchmarking performance
Pre-configured training scenarios:

- `train_shakespeare_char.py`: Train a small character-level GPT on Shakespeare
- `finetune_shakespeare.py`: Fine-tune GPT-2 on the Shakespeare dataset
- `train_gpt2.py`: Reproduce GPT-2 (124M) training on OpenWebText
- `eval_gpt2.py`: Evaluate pre-trained GPT-2 models on OpenWebText
```sh
pip install torch numpy transformers datasets tiktoken wandb tqdm
```

```sh
# Prepare the Shakespeare dataset
python data/shakespeare_char/prepare.py

# Train the model
python src/train.py config/train_shakespeare_char.py

# Generate samples
python src/sample.py --out_dir=out-shakespeare-char
```

```sh
# Prepare OpenWebText
python data/openwebtext/prepare.py

# Train on 8 GPUs
torchrun --standalone --nproc_per_node=8 src/train.py config/train_gpt2.py
```

```sh
# Basic training
python src/train.py config/train_shakespeare_char.py

# Override config parameters
python src/train.py config/train_shakespeare_char.py --batch_size=32 --learning_rate=1e-4

# Train on CPU (slower)
python src/train.py config/train_shakespeare_char.py --device=cpu --compile=False
```

```sh
# Sample from your trained model
python src/sample.py --out_dir=out-shakespeare-char

# Sample from pre-trained GPT-2
python src/sample.py --init_from=gpt2-xl --start="Once upon a time" --num_samples=5

# Use a prompt from a file
python src/sample.py --start=FILE:prompt.txt
```

```sh
# Single node, 8 GPUs
torchrun --standalone --nproc_per_node=8 src/train.py config/train_gpt2.py
```

In this section I'll talk a bit about results from two specific experiments with this project.
All of my training was done on a computing cluster running Slurm. Training used 8 L40S GPUs in a single node (training GPT from scratch on the OpenWebText dataset took about 3 to 4 days; training on the Shakespeare dataset was much, much faster), and sampling used 1 L40S GPU. The scripts train-gpt-example.sh and sample-gpt-example.sh show how I launched the training and sampling jobs. Depending on your environment (e.g. a regular desktop vs. a computing cluster), you will need different commands.
First, I trained a character-level GPT on the Shakespeare dataset (the fastest way to get started). A sample after training for 5000 iterations:

```
MENENIUS:
I must thou answer thy son, and so much
To be sent truth to thee; in this fault of
One that free these that looks till them perjury.
Lay not be thou to deckarted offer'd meeting.

VOLUMNIA:
What's the Lord Sir Citizens?

VOLUMNIA:
My father Coriolanus bear the world to what sad with you
Are the scold by my life, come I say;
Now I know the name of Capulet.

FRIAR LAURENCE:
By the centurious protectors!

CORIOLANUS:
Alas, my lord,
I can have truth a man to this officer, for I
To know our
```

It does indeed look like Shakespearean text - in structure, at least!
Second, I trained a 124M parameter GPT model from scratch on the OpenWebText dataset. Using a starting prompt of "Once upon a time", here's a piece of what it spits out:
> Once upon a time in the late 1800s, he was seen as a “dumb” and “intellectual.” But a survey of his writings shows that he was actually an accomplished writer and the best writer of his generation. He was an innovator and a pioneer in the field of literature. He was also a man of tremendous skill and intelligence, a man of deep faith and tenacity, a man who was so good at what he did that it was easy to call him a genius. And yet, he was also a public intellectual who came from obscurity, and he was a man who had great talent.
>
> Gothic is a title that can only be applied to a man who made such a big impression on an age. It is a title that the public can only use to describe the man who gave us Gothic poetry, the man who brought Gothic literature to the stage of popular life.
>
> In spite of our name, though, we know more of the man than we do of the man who called himself Gothic. And so, we are called to reflect on the man who decided that he wanted to be known as Gothic, or was called Gothic, or was written Gothic, or was a Gothic poet, or was a Gothic poet? It is a question that seems to be the question of the day not only for the man who gave us Gothic poetry, but also for the man who made Gothic literature a living fiction.
>
> Therefore, the man who was a Gothic poet, or who was a Gothic poet
Some observations:
- The model learns strong local fluency and syntax, producing grammatically correct and stylistically coherent text over short spans
- Generations closely match OpenWebText / news-article style, including journalistic tone, quotations, and authoritative phrasing
- The model exhibits topic drift and weak long-range coherence, especially in longer generations
- Outputs may contain hallucinated facts or inconsistent details
- Training appears stable: no token corruption, broken grammar, or catastrophic degeneration observed in samples
In general, that is close to what I'd expect from a GPT-2-style model with 124M parameters trained on this dataset. I'm pretty sure there are many improvements that could be made to the training procedure, and I'll certainly try those as I have time.
It was very nice to see some decent text being generated :)
Inspired by Karpathy's lessons/materials - learned a ton from him!