A complete educational implementation of GPT (Generative Pre-trained Transformer) in pure Python
Features • Quick Start • Documentation • Examples • Study Guide
This repository contains three progressive versions of a GPT model implementation, each designed for different learning needs:
| Version | Lines | Description | Best For |
|---|---|---|---|
| Original | 243 | Andrej Karpathy's minimal implementation | Quick reference |
| Refactored | 850 | Well-structured with Mermaid diagrams | Understanding architecture |
| Educational | 1,200 | Professor-style teaching with detailed prints | Learning from scratch |
All versions maintain 100% functional equivalence while progressively improving readability and educational value.
- Complete GPT Implementation: Multi-head attention, transformer layers, autograd
- Educational Focus: Every component explained with intuition and math
- Visual Diagrams: 11 Mermaid diagrams for visual understanding
- Workflow Visualization: Detailed prints showing data flow during training
- Study Guide: Comprehensive component deep dives with examples
- Pure Python: No dependencies beyond the standard library
- Well-Documented: Extensive comments and docstrings
By studying this code, you will understand:
- How neural networks compute (forward pass)
- How they learn (backward pass & automatic differentiation)
- How to optimize them (the Adam optimizer)
- The transformer architecture ("Attention Is All You Need")
- Language modeling and text generation
- Why each design decision was made
```shell
# Clone the repository
git clone https://github.com/andresveraf/Build-GPT-model-with-Python.git
cd Build-GPT-model-with-Python

# No additional dependencies needed! (Pure Python)
python3 script_gpt_educational.py
```

Expected output:
```
================================================================================
DEEP LEARNING FROM SCRATCH: GPT Implementation
================================================================================
Welcome! Let's build a GPT model step by step, understanding every detail.
Random seed set to 42 for reproducibility
================================================================================
PART 2: CONFIGURING THE MODEL - HYPERPARAMETERS
================================================================================
MODEL ARCHITECTURE:
  • Embedding dimension: 16
  • Attention heads: 4 (each with 4 dimensions)
  • Transformer layers: 1
  • Context window: 16 tokens
...
GENERATING 20 SAMPLES:
  Sample 1: kamon
  Sample 2: ann
  Sample 3: karai
  Sample 4: jaire
  Sample 5: vialan
...
```
```shell
# Refactored version (clean, documented)
python3 script_gpt_refactored.py

# Original version (compact)
python3 script_gpt.py
```
- REFACTORING_SUMMARY.md - Complete study guide with:
  - Component deep dives (11 major components)
  - Step-by-step examples with actual numbers
  - Mathematical formulations
  - Formula cheat sheet
  - Dimension tracking guide
  - Study checklist (4 levels)
- This README - Quick start and overview
- Inline Documentation - Each Python file contains extensive comments
Each component is thoroughly explained:
- Multi-Head Attention: How the model learns relationships between tokens
- RMS Normalization: Why we normalize activations
- Softmax: Converting logits to probabilities (with numerical stability)
- Adam Optimizer: Adaptive learning with momentum
- Matrix Multiplication: The fundamental operation in neural networks
- Loss Calculation: Cross-entropy and perplexity
- Training Loop: How the model learns from data
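As a taste of the explanations inside, here is a minimal pure-Python sketch of the numerically stable softmax described above (illustrative only; the actual function in the scripts may differ in detail):

```python
import math

def softmax(logits):
    """Convert logits to probabilities.

    Subtracting the maximum logit before exponentiating leaves the
    result unchanged mathematically, but keeps math.exp() from
    overflowing on large logits (numerical stability).
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Probabilities sum to 1, and the largest logit gets the largest share.
probs = softmax([2.0, 1.0, 0.1])
```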
```python
# Multi-head attention allows the model to focus on different aspects:
# Head 0: Previous character dependency
# Head 1: Position-based patterns
# Head 2: Consonant clusters
# Head 3: Vowel patterns
```

```
Step    1 / 1000 | Loss: 3.3660 | Perplexity: 28.94
Step  100 / 1000 | Loss: 2.8945 | Perplexity: 18.07
Step  200 / 1000 | Loss: 2.7123 | Perplexity: 15.07
Step  500 / 1000 | Loss: 2.6543 | Perplexity: 14.22
Step 1000 / 1000 | Loss: 2.6501 | Perplexity: 14.16
```
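The perplexity column in the training log is simply the exponential of the cross-entropy loss, roughly the model's effective number of equally likely next-character choices. Checking the final step:

```python
import math

# Perplexity = exp(cross-entropy loss). A perplexity of ~14 means the
# model is about as uncertain as choosing uniformly among ~14 characters.
loss = 2.6501                  # final loss from the training log
perplexity = math.exp(loss)
print(f"{perplexity:.2f}")     # matches the 14.16 in the log
```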
After training on 32,033 names, the model generates realistic names:
kamon, ann, karai, jaire, vialan, mari, jalen, etc.
Level 1: Beginner (1-2 days)
- Read the README
- Run the educational version
- Understand the basic flow
- Read hyperparameter explanations
Level 2: Intermediate (1 week)
- Study component deep dives in REFACTORING_SUMMARY.md
- Understand attention mechanism
- Learn about normalization and softmax
- Follow the training loop
Level 3: Advanced (2-3 weeks)
- Implement components from scratch
- Experiment with hyperparameters
- Debug training issues
- Modify the architecture
Level 4: Expert (ongoing)
- Read original papers (Attention Is All You Need, GPT-2, GPT-3)
- Implement from memory
- Design experiments
- Contribute to research
| Concept | Importance | Difficulty |
|---|---|---|
| Tokenization | ★★★ | Easy |
| Embeddings | ★★★★ | Medium |
| Attention | ★★★★★ | Hard |
| Normalization | ★★★★ | Medium |
| Backpropagation | ★★★★★ | Hard |
| Optimization | ★★★★ | Medium |
```
Build-GPT-model-with-Python/
│
├── README.md                   # This file
├── REFACTORING_SUMMARY.md      # Complete study guide
├── input.txt                   # Training data (names)
│
├── script_gpt.py               # Original (243 lines)
├── script_gpt_refactored.py    # Refactored (850 lines)
└── script_gpt_educational.py   # Educational (1,200 lines)
```
```python
# In script_gpt_educational.py, modify:
N_EMBD = 16           # Try: 32, 64, 128
N_HEAD = 4            # Try: 2, 8
N_LAYER = 1           # Try: 2, 3, 4
BLOCK_SIZE = 16       # Try: 32, 64
LEARNING_RATE = 0.01  # Try: 0.001, 0.005, 0.02
NUM_STEPS = 1000      # Try: 500, 2000, 5000
TEMPERATURE = 0.5     # Try: 0.3 (conservative), 0.8 (creative)
```

Replace input.txt with your own text file (one item per line):

```
word1
word2
word3
...
```
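The `TEMPERATURE` setting above divides the logits before softmax: lower values sharpen the distribution (conservative samples), higher values flatten it (creative samples). A minimal sketch of the idea, not the scripts' exact sampling code:

```python
import math
import random

def sample_with_temperature(logits, temperature=0.5):
    """Sample an index from temperature-scaled logits.

    Dividing by a small temperature exaggerates differences between
    logits (greedy-ish); a large temperature washes them out (random-ish).
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # stability shift before exp()
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

random.seed(42)
idx = sample_with_temperature([2.0, 1.0, 0.1], temperature=0.5)
```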
```
Input Token
     │
┌──────────────────────────────────────┐
│           Embedding Layer            │
│  • Token Embedding                   │
│  • Position Embedding                │
│  • RMS Normalization                 │
└──────────────────────────────────────┘
     │
┌──────────────────────────────────────┐
│          Transformer Layer           │
│  • Multi-Head Self-Attention         │
│  • Residual Connection               │
│  • Feed-Forward Network (MLP)        │
│  • Residual Connection               │
└──────────────────────────────────────┘
     │
┌──────────────────────────────────────┐
│          Output Projection           │
│  • Linear to Vocabulary Size         │
└──────────────────────────────────────┘
     │
Logits → Softmax → Probabilities
```
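The RMS Normalization step in the diagram can also be sketched in pure Python. This simplified version omits the learned gain parameter that a full RMSNorm layer typically carries:

```python
import math

def rms_norm(x, eps=1e-5):
    """Scale a vector so its root-mean-square is ~1.

    Simplified sketch: a full RMSNorm layer usually also multiplies
    by a learned per-dimension gain.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

# After normalization the vector's RMS is ~1.0, regardless of its scale.
normed = rms_norm([1.0, 2.0, 3.0, 4.0])
```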
- Attention Is All You Need (Transformer)
- Language Models are Few-Shot Learners (GPT-3)
- Improving Language Understanding by Generative Pre-Training (GPT-1)
This is an educational project. Contributions are welcome!
- Add More Examples: Create new training datasets
- Improve Documentation: Clarify explanations
- Add Visualizations: Create more diagrams
- Fix Bugs: Report and fix issues
- Share Your Learning: Write blog posts or tutorials
```shell
# Run tests (if you add them)
python3 -m pytest tests/

# Format code
black script_gpt_*.py

# Check style
flake8 script_gpt_*.py
```

This project is open source and available under the MIT License.
- Andrej Karpathy - Original minimal GPT implementation
- OpenAI - GPT architecture and research
- Google Brain - Transformer architecture
- DeepLearning.AI - Educational resources
Have questions? Feel free to:
- Open an issue on GitHub
- Start a discussion
- Contact me directly
Made with ❤️ for educational purposes