
tinyGPT

A minimal implementation of GPT (Generative Pre-trained Transformer) for training language models and sampling from them. This project provides a (hopefully) easy-to-follow codebase for training GPT models from scratch or fine-tuning pre-trained models.

Project Structure

tinyGPT
├── config/                      
│   ├── eval_gpt2.py            # Evaluate GPT-2 models
│   ├── finetune_shakespeare.py # Fine-tune on Shakespeare dataset
│   ├── train_gpt2.py           # Train GPT-2 (124M) on OpenWebText
│   └── train_shakespeare_char.py # Train character-level model on Shakespeare
├── data/                        
│   ├── openwebtext/
│   │   └── prepare.py          # Download and tokenize OpenWebText
│   ├── shakespeare/
│   │   └── prepare.py          # Prepare Shakespeare dataset (word-level)
│   └── shakespeare_char/
│       ├── input.txt           # Raw Shakespeare text
│       ├── meta.pkl            
│       └── prepare.py          # Prepare Shakespeare dataset (character-level)
├── src/                         
│   ├── model.py                # GPT model architecture 
│   ├── train.py                # Training script with DDP support
│   ├── sample.py               # Sample/generate text from trained models
│   ├── bench.py                # Benchmarking and profiling utilities
│   └── configurator.py         # Configuration override system
├── *.sh                         # Shell scripts for common workflows
├── README.md                    
└── .gitignore

Source Files (src/)

  • model.py: GPT implementation including:

    • Transformer architecture (attention, MLP, blocks)
    • Model initialization and weight loading
    • Pre-trained GPT-2 checkpoint loading
    • Text generation with top-k sampling (see the sketch after this list)
  • train.py: Training script with:

    • Single GPU and multi-GPU (DDP) training
    • Learning rate scheduling (cosine decay with warmup; see the sketch after this list)
    • Checkpointing and resuming
    • Integration with Weights & Biases logging
    • Support for mixed precision training (float16/bfloat16)
  • sample.py: Inference and text generation:

    • Load trained checkpoints or pre-trained GPT-2 models
    • Generate text samples with configurable parameters
    • Support for custom prompts from text files
  • bench.py: Simplified training loop for benchmarking performance
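
To make the top-k sampling mentioned under model.py concrete, here is a minimal sketch of how autoregressive generation with top-k filtering typically works in a GPT-style model. The function and parameter names are illustrative, not copied from model.py, and the model is assumed to return logits of shape (batch, seq_len, vocab_size).

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
    # Illustrative sampling loop; `idx` holds the token ids generated so far
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :] / temperature            # logits for the last position
        if top_k is not None:
            # Keep only the top_k most likely tokens; mask everything else out
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float("inf")
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample one token
        idx = torch.cat((idx, next_token), dim=1)               # append and continue
    return idx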
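
The cosine-decay-with-warmup schedule used by train.py can be summarized with a small helper like the one below. This is a sketch of the standard schedule only; the default values and parameter names (warmup_iters, lr_decay_iters, min_lr) are assumptions, not necessarily the ones in the script.

import math

def get_lr(it, learning_rate=6e-4, warmup_iters=2000, lr_decay_iters=600000, min_lr=6e-5):
    # 1) Linear warmup from 0 to learning_rate over warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) After the decay horizon, hold the learning rate at min_lr
    if it > lr_decay_iters:
        return min_lr
    # 3) Cosine decay from learning_rate down to min_lr in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))   # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)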

Configuration Files (config/)

Pre-configured training scenarios:

  • train_shakespeare_char.py: Train a small character-level GPT on Shakespeare
  • finetune_shakespeare.py: Fine-tune GPT-2 on Shakespeare dataset
  • train_gpt2.py: Reproduce GPT-2 (124M) training on OpenWebText
  • eval_gpt2.py: Evaluate pre-trained GPT-2 models on OpenWebText

Quick Start

Install Dependencies

pip install torch numpy transformers datasets tiktoken wandb tqdm

Train a Character-Level Model (Fast)

# Prepare the Shakespeare dataset
python data/shakespeare_char/prepare.py

# Train the model 
python src/train.py config/train_shakespeare_char.py

# Generate samples
python src/sample.py --out_dir=out-shakespeare-char

Train GPT-2 from Scratch

# Prepare OpenWebText 
python data/openwebtext/prepare.py

# Train on 8 GPUs 
torchrun --standalone --nproc_per_node=8 src/train.py config/train_gpt2.py

Usage Examples

Training

# Basic training
python src/train.py config/train_shakespeare_char.py

# Override config parameters
python src/train.py config/train_shakespeare_char.py --batch_size=32 --learning_rate=1e-4

# Train on CPU (slower)
python src/train.py config/train_shakespeare_char.py --device=cpu --compile=False
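
The override mechanism comes from src/configurator.py: positional arguments are treated as config files, and --key=value flags overwrite individual settings on top of the defaults defined in train.py. The sketch below shows the general pattern such a system follows; it is an assumption based on similar projects, not a copy of configurator.py.

import sys
from ast import literal_eval

# Defaults like these would normally be defined in train.py before overrides run
batch_size = 64
learning_rate = 1e-3

for arg in sys.argv[1:]:
    if not arg.startswith("--"):
        # Positional argument: treat it as a config file and execute it,
        # letting it overwrite the default values above
        exec(open(arg).read())
    else:
        # --key=value flag: parse the value and override the matching global
        key, val = arg[2:].split("=", 1)
        try:
            val = literal_eval(val)    # numbers, booleans, lists, ...
        except (ValueError, SyntaxError):
            pass                       # keep the raw string
        assert key in globals(), f"unknown config key: {key}"
        globals()[key] = val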

Sampling

# Sample from your trained model
python src/sample.py --out_dir=out-shakespeare-char

# Sample from pre-trained GPT-2
python src/sample.py --init_from=gpt2-xl --start="Once upon a time" --num_samples=5

# Use a prompt from a file
python src/sample.py --start=FILE:prompt.txt

Distributed Training

# Single node, 8 GPUs
torchrun --standalone --nproc_per_node=8 src/train.py config/train_gpt2.py
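
Under torchrun, every process reads its rank from environment variables and wraps the model in DistributedDataParallel. The snippet below is a minimal sketch of the setup a training script like train.py typically performs; build_model() is a placeholder, not a function from this repository.

import os
import torch
from torch.distributed import init_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process it launches
ddp = int(os.environ.get("RANK", -1)) != -1
if ddp:
    init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = f"cuda:{local_rank}"
    torch.cuda.set_device(device)
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"

model = build_model().to(device)   # placeholder for actual model construction
if ddp:
    model = DDP(model, device_ids=[local_rank])
# Inside the training loop, DDP averages gradients across processes automatically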

Some Results

In this section I'll talk a bit about results from two specific experiments with this project.

All my training was done on a computing cluster running Slurm. Training used 8 L40S GPUs on a single node (training GPT from scratch on the OpenWebText dataset took about 3-4 days; training on the Shakespeare dataset was much, much faster), and sampling used 1 L40S GPU. The scripts train-gpt-example.sh and sample-gpt-example.sh show how I launched the training and sampling jobs. Depending on your environment (e.g. a regular desktop vs. a computing cluster) you will need to use different commands.

First, I trained a character-level GPT on the Shakespeare dataset (the fastest way to get started). Here is a sample after training for 5000 iterations:

MENENIUS:
I must thou answer thy son, and so much
To be sent truth to thee; in this fault of
One that free these that looks till them perjury.
Lay not be thou to deckarted offer'd meeting.

VOLUMNIA:
What's the Lord Sir Citizens?

VOLUMNIA:
My father Coriolanus bear the world to what sad with you
Are the scold by my life, come I say;
Now I know the name of Capulet.

FRIAR LAURENCE:
By the centurious protectors!

CORIOLANUS:
Alas, my lord,
I can have truth a man to this officer, for I
To know our

It does indeed look like Shakespearean text - in structure, at least!

Second, I trained a 124M-parameter GPT model from scratch on the OpenWebText dataset. Using a starting prompt of "Once upon a time", here's a piece of what it spits out:

Once upon a time in the late 1800s, he was seen as a “dumb” and “intellectual.” But a survey of his writings shows that he was actually an accomplished writer and the best writer of his generation. He was an innovator and a pioneer in the field of literature. He was also a man of tremendous skill and intelligence, a man of deep faith and tenacity, a man who was so good at what he did that it was easy to call him a genius. And yet, he was also a public intellectual who came from obscurity, and he was a man who had great talent.

Gothic is a title that can only be applied to a man who made such a big impression on an age. It is a title that the public can only use to describe the man who gave us Gothic poetry, the man who brought Gothic literature to the stage of popular life.

In spite of our name, though, we know more of the man than we do of the man who called himself Gothic. And so, we are called to reflect on the man who decided that he wanted to be known as Gothic, or was called Gothic, or was written Gothic, or was a Gothic poet, or was a Gothic poet? It is a question that seems to be the question of the day not only for the man who gave us Gothic poetry, but also for the man who made Gothic literature a living fiction.

Therefore, the man who was a Gothic poet, or who was a Gothic poet

Some observations:

  • The model learns strong local fluency and syntax, producing grammatically correct and stylistically coherent text over short spans
  • Generations closely match OpenWebText / news-article style, including journalistic tone, quotations, and authoritative phrasing
  • The model exhibits topic drift and weak long-range coherence, especially in longer generations
  • Outputs may contain hallucinated facts or inconsistent details
  • Training appears stable: no token corruption, broken grammar, or catastrophic degeneration observed in samples

In general, that is close to what I'd expect from a GPT-2-style model with 124M parameters trained on this dataset. I'm sure there are many improvements that could be made to the training procedure, and I'll certainly try them as I have time.

It was very nice to see some decent text being generated :)

Acknowledgements

Inspired by Andrej Karpathy's lessons and materials - learned a ton from him!
