The original 243-line GPT implementation has been enhanced through three progressive versions, each serving different educational needs:
- Original (
script_gpt.py): 243 lines - Andrej Karpathy's minimal implementation - Refactored (
script_gpt_refactored.py): ~850 lines - Well-structured with Mermaid diagrams - Educational (
script_gpt_educational.py): ~1,200 lines - Professor-style teaching with detailed workflow prints
All versions maintain 100% functional equivalence while progressively improving readability and educational value.
- Clear Section Separators: Each major component is separated by visual markers (====)
- Logical Flow: Code organized into distinct sections:
- Configuration & Hyperparameters
- Data Loading & Preprocessing
- Tokenizer
- Autograd Engine
- Neural Network Primitives
- Model Architecture
- Adam Optimizer
- Training Functions
- Training Loop
- Inference/Generation
- Module-level docstring explaining the entire system
- Function docstrings with parameters, returns, and descriptions
- Inline comments explaining the "why" behind complex operations
- Mathematical formulas documented with proper notation (e.g., ∂L/∂x)
Embedded Mermaid diagrams for visual understanding:
- Overall architecture flow
- Training pipeline
- Tokenization process
- Computation graph for autograd
- Transformer layer structure
- Multi-head attention mechanism
- RMS normalization flow
- Softmax computation
- Adam optimizer algorithm
- Training step flow
- Text generation process
n_embd→N_EMBD(constants)li→layer_idxhs→head_start- More descriptive names throughout
Tokenizer: Handles encoding/decodingValue: Enhanced with detailed docstringsAdamOptimizer: Encapsulates optimization logic
load_dataset(): Data loadinglinear(): Matrix multiplicationsoftmax(): Probability computationrmsnorm(): Normalizationinitialize_parameters(): Parameter setupgpt_forward(): Model forward passcompute_loss(): Loss calculationtrain_step(): Single training iterationgenerate_text(): Text generation
All hyperparameters moved to the top with clear explanations:
# Model architecture
N_EMBD = 16 # Embedding dimension
N_HEAD = 4 # Number of attention heads
N_LAYER = 1 # Number of transformer layers
BLOCK_SIZE = 16 # Maximum sequence length
# Training hyperparameters
LEARNING_RATE = 0.01
BETA1 = 0.85
BETA2 = 0.99# Compact but hard to follow
matrix = lambda nout, nin, std=0.08: [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]
state_dict = {'wte': matrix(vocab_size, n_embd), 'wpe': matrix(block_size, n_embd), 'lm_head': matrix(vocab_size, n_embd)}def initialize_parameters(vocab_size, n_embd, n_head, n_layer):
"""
Initialize all model parameters with Gaussian random values.
The GPT model has the following parameter groups:
1. Token embeddings (wte): Map token IDs to vectors
2. Position embeddings (wpe): Add position information
3. Attention weights: Query, Key, Value, Output projections
4. MLP weights: Feed-forward network
5. Language model head: Output projection
"""
# Clear, documented implementationgraph LR
A[Dataset] --> B[Sample Document]
B --> C[Tokenize]
C --> D[Forward Pass]
D --> E[Compute Loss]
E --> F[Backward Pass]
F --> G[Compute Gradients]
G --> H[Update Parameters]
H --> D
graph TB
A[Input x] --> B[Linear: Wq → Q]
A --> C[Linear: Wk → K]
A --> D[Linear: Wv → V]
E[Stored Keys] --> F[Attention Scores]
B --> F
F --> G[Attention Weights]
G --> H[Weighted Values]
D --> H
H --> I[Linear: Wo → Output]
The refactored code produces identical results to the original:
- ✅ Dataset loaded: 32,033 names
- ✅ Vocabulary size: 27 characters
- ✅ Parameters: 4,192
- ✅ Training completed: 1000 steps
- ✅ Loss decreased: 3.37 → 2.65
- ✅ Generated 20 realistic name samples
Example outputs:
Sample 1: kamon
Sample 2: ann
Sample 3: karai
Sample 4: jaire
Sample 5: vialan
...
- Educational: Much easier to learn how GPT works
- Maintainable: Clear structure for future modifications
- Documented: Every component explained
- Visual: Mermaid diagrams show data flow
- Professional: Follows Python best practices (PEP 8)
- Extensible: Easy to add features or modify architecture
script_gpt.py: Original compact version (243 lines)script_gpt_refactored.py: Refactored documented version (850 lines)script_gpt_educational.py: Professor-style teaching version (~1,200 lines)REFACTORING_SUMMARY.md: This document
The Educational Version (script_gpt_educational.py) takes a senior computer science professor's approach to teaching deep learning concepts. It includes:
Every hyperparameter is explained with:
- Intuitive explanation: What it does in plain English
- Why this value: The reasoning behind the choice
- What happens if changed: Effects of increasing/decreasing
- Historical context: Research papers and standard practices
Example:
Why BETA1 = 0.85 and BETA2 = 0.99?
- These are for the Adam optimizer's momentum
- β1 controls the "momentum" (first moment - mean of gradients)
* 0.85 = moderately fast adaptation
* Closer to 1 = more smoothing, but slower to adapt
* Standard is 0.9, we use 0.85 for faster learning
- β2 controls the "adaptive learning rate" (second moment - variance)
* 0.99 = very slow adaptation (moving average of squared gradients)
* This makes the learning rate more stable
* Standard is 0.999, we use 0.99 for faster adaptationThe training loop includes verbose output showing:
- Data flow: What data looks like at each stage
- Tensor shapes: How dimensions transform through the network
- Gradient statistics: Actual gradient values being computed
- Parameter updates: Before/after values for parameters
- Attention weights: Visualization of attention patterns
- Loss components: Breakdown of what contributes to loss
Example verbose output:
TRAINING STEP 1
📄 Sample document: 'emma'
Tokens: [26, 5, 13, 13, 1, 26]
Sequence length: 5
🎯 Training task:
Given: emma
Predict: BOS
🔍 FORWARD PASS - Position 0
Token ID: 26 ('')
Embeddings combined and normalized
Head 0 attention weights: [0.250, 0.250, 0.250, 0.250]
Output logits computed
📊 Loss Computation:
• Final loss value: 3.3660
• Perplexity: 28.94
(Perplexity = exp(loss), lower is better)
🔮 Prediction Examples:
Position 0: Predict 'e' with prob 0.038
Top predictions: 'a'(0.041), 'e'(0.038), 'i'(0.038)
📈 Gradient Statistics:
• Parameters with gradients: 4032
• Average gradient magnitude: 0.004523
• Maximum gradient magnitude: 0.123456
⚙️ OPTIMIZER STEP 0:
• Decayed learning rate: 0.010000
• Avg gradient magnitude: 0.004523
• Avg parameter update: 0.000234
Educational blocks explaining:
- The intuition behind backpropagation (chain rule in practice)
- Why softmax needs the max trick (numerical stability)
- The purpose of residual connections (gradient flow)
- Why we normalize (stable training)
- Temperature sampling (creativity vs determinism)
Example lecture note:
🧠 THE CORE IDEA: COMPUTATION GRAPHS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Every computation builds a graph. During backpropagation, we traverse
this graph backwards, computing gradients using the chain rule.
Example: Computing loss for a single prediction
Input x → Linear layer → Activation → Softmax → Cross-entropy → Loss
Forward pass (compute output):
We calculate the loss by going left to right
Backward pass (compute gradients):
We calculate ∂Loss/∂∂weight by going right to left
Chain rule: ∂L/∂w = ∂L/∂a × ∂a/∂z × ∂z/∂w"Pause and think" moments throughout:
🤔 PAUSE AND THINK:
Why do we divide by sqrt(d_k) in attention?
Answer: To prevent dot products from growing too large,
which would push softmax into regions with tiny gradients.
This is called "scaled dot-product attention."Structured like a textbook:
- Level 1: High-level overview (what)
- Level 2: Mathematical details (why)
- Level 3: Implementation nuances (how)
- Level 4: Advanced optimization tips (deep dive)
Proper mathematical notation throughout:
- Gradient notation: ∂L/∂x, ∂²L/∂x²
- Summation: Σ(i), product: Π(i)
- Expected value: E[x], variance: Var[x)
- Logarithms: log(x), exp(x)
- Normalization: x̄ = Σ(x)/n
Complete explanation with:
- Why Adam: Problems with plain SGD
- Momentum intuition: Heavy ball rolling down hill
- Adaptive learning rates: Different LR per parameter
- Bias correction: Why it matters early in training
- Step-by-step algorithm: Each component explained
Progress indicators showing:
- Current step and total steps
- Loss value and perplexity
- Estimated progress
- Learning rate decay
- Parameter update statistics
python3 script_gpt_educational.pyThe output will show:
- Initial configuration details
- Data loading with examples
- Tokenization visualization
- Model architecture explanation
- Training progress with detailed prints at key steps
- Gradient and optimizer statistics
- Text generation with probability distributions
- Final summary and learning outcomes
| Feature | Original | Refactored | Educational |
|---|---|---|---|
| Lines of code | 243 | 850 | 1,200 |
| Documentation | Minimal | Comprehensive | Professor-level |
| Mermaid diagrams | None | 11 embedded | 11 embedded |
| Workflow prints | Basic | Moderate | Extensive |
| Hyperparameter explanations | None | Basic | Deep dives |
| Mathematical notation | Minimal | Good | Extensive |
| Learning outcomes | Implicit | Clear | Explicit objectives |
| Best for | Quick reference | Understanding | Learning from scratch |
The Mermaid diagrams in the code can be viewed:
- VS Code: Install "Markdown Preview Mermaid Support" extension
- GitHub: Automatically renders in README.md files
- Online: Use mermaid.live or similar tools
- IDEs: Many support Mermaid rendering with plugins
This section provides comprehensive explanations of each GPT component, designed for thorough understanding and study.
Analogy: Multiple Experts Analyzing the Same Data
Imagine you're reading a sentence and trying to understand it. You might:
- Expert 1 (Grammar): Focus on sentence structure and syntax
- Expert 2 (Semantics): Focus on meaning and definitions
- Expert 3 (Context): Focus on how words relate to each other
- Expert 4 (Pattern): Focus on common letter combinations
Each expert looks at the SAME input but pays attention to DIFFERENT aspects. This is multi-head attention!
In GPT:
- Each head learns to focus on different relationships
- Head 1 might learn: "pay attention to previous letter" (bigrams)
- Head 2 might learn: "pay attention to letter position" (patterns)
- Head 3 might learn: "detect consonant clusters" (phonetics)
- Head 4 might learn: "detect vowel patterns" (syllables)
By combining all heads, the model understands MULTIPLE aspects simultaneously!
For each head h (where h = 0, 1, 2, 3):
Step 1: Project input to Q, K, V
Q = x · W_q^h (Query: what am I looking for?)
K = x · W_k^h (Key: what do I contain?)
V = x · W_v^h (Value: what information do I offer?)
Step 2: Compute attention scores (scaled dot-product)
score_t = (Q · K_t) / √d_k
Where:
- Q is the query for current position
- K_t is the key from position t
- d_k is the head dimension (4 in our model)
- Division by √d_k prevents scores from growing too large
Step 3: Convert scores to probabilities (softmax)
α_t = exp(score_t) / Σ(exp(score_i))
Step 4: Weight values by attention weights
output = Σ(α_t · V_t)
Let's walk through a concrete example with actual numbers.
Input:
- Sequence: "cat"
- Position 2: processing 't'
- Head dimension: 4
Step 1: Q, K, V Projections
Input embedding for 't': [0.5, 0.2, -0.1, 0.8]
After projection (simplified):
Q_t = [0.3, 0.4, -0.2, 0.6]
K_t = [0.1, 0.5, 0.3, 0.2]
V_t = [0.7, -0.1, 0.4, 0.5]
Step 2: Compute Attention Scores
For position 0 ('c'):
score_0 = (0.3×0.2 + 0.4×0.1 + (-0.2)×(-0.3) + 0.6×0.4) / √4
= (0.06 + 0.04 + 0.06 + 0.24) / 2
= 0.40 / 2 = 0.20
For position 1 ('a'):
score_1 = (0.3×0.4 + 0.4×0.3 + (-0.2)×0.5 + 0.6×(-0.2)) / 2
= (0.12 + 0.12 - 0.10 - 0.12) / 2
= 0.02 / 2 = 0.01
For position 2 ('t') - self:
score_2 = (Q_t · K_t) / 2 = 0.15 / 2 = 0.075
Scores: [0.20, 0.01, 0.075]
Step 3: Softmax to Probabilities
exp([0.20, 0.01, 0.075]) = [1.221, 1.010, 1.078]
Sum = 3.309
α = [1.221/3.309, 1.010/3.309, 1.078/3.309]
= [0.369, 0.305, 0.326]
Interpretation:
- 36.9% attention to 'c'
- 30.5% attention to 'a'
- 32.6% attention to self ('t')
Step 4: Weighted Value Aggregation
output = 0.369×V_0 + 0.305×V_1 + 0.326×V_2
Assuming:
V_0 = [0.6, 0.3, 0.2, 0.1]
V_1 = [0.2, -0.4, 0.5, 0.3]
V_2 = [0.7, -0.1, 0.4, 0.5]
output[0] = 0.369×0.6 + 0.305×0.2 + 0.326×0.7
= 0.221 + 0.061 + 0.228
= 0.510
Repeat for all 4 dimensions...
Final output: [0.510, -0.037, 0.381, 0.311]
With 4 heads, each with dimension 4, we get 16 total dimensions (same as N_EMBD):
Head 0 (dims 0-3): [0.510, -0.037, 0.381, 0.311]
Head 1 (dims 4-7): [0.234, 0.567, -0.123, 0.445]
Head 2 (dims 8-11): [-0.234, 0.189, 0.567, -0.345]
Head 3 (dims 12-15): [0.678, -0.234, 0.123, 0.456]
Concatenated: [0.510, -0.037, 0.381, 0.311, 0.234, 0.567, ...]
Each head learns DIFFERENT attention patterns:
- Head 0: Previous character dependency
- Head 1: Position-based patterns
- Head 2: Consonant clusters
- Head 3: Vowel patterns
- Parallel Processing: All 4 heads process simultaneously (speed!)
- Diverse Learning: Each head can specialize
- Representation: Combined output is richer than single-head
- Dimensionality: Total output dimension = N_EMBD (16)
- Too Many Heads: Each head needs enough data to learn. 4 heads × 4 dims = 16 total ✓
- Uneven Heads: Some heads might learn nothing if initialization is bad
- No Scaling: Without dividing by √d_k, gradients become too small
Problem: Unstable Activations
Without normalization, activations can grow or shrink exponentially through layers:
- Layer 1: Values range [-1, 1] ✓
- Layer 2: Values range [-10, 10]
⚠️ - Layer 3: Values range [-100, 100] ❌
- Layer 4: Values explode to infinity! 💥
Solution: Normalize to Unit Variance
RMSNorm scales values so they have consistent magnitude:
- After normalization: Values always ≈ range [-2, 2]
- Gradients flow smoothly
- Training is stable
Step 1: Compute mean of squares
mean_square = (1/n) × Σ(x_i²)
Step 2: Compute scaling factor
scale = (mean_square + ε)^(-1/2)
Where ε = 1e-5 (tiny value to prevent division by zero)
Step 3: Scale all values
output_i = x_i × scale
Input:
x = [2.0, 4.0, 6.0]
Step 1: Mean of Squares
x² = [4.0, 16.0, 36.0]
mean_square = (4.0 + 16.0 + 36.0) / 3
= 56.0 / 3
= 18.667
Step 2: Compute Scale
scale = (18.667 + 0.00001)^(-0.5)
= (18.66701)^(-0.5)
= 0.231
Step 3: Scale Values
output[0] = 2.0 × 0.231 = 0.462
output[1] = 4.0 × 0.231 = 0.924
output[2] = 6.0 × 0.231 = 1.386
Result: [0.462, 0.924, 1.386]
Verification:
New mean of squares:
(0.462² + 0.924² + 1.386²) / 3
= (0.213 + 0.854 + 1.921) / 3
= 2.988 / 3
= 0.996 ≈ 1.0 ✓
The output now has unit variance!
LayerNorm:
mean = (1/n) × Σ(x_i)
variance = (1/n) × Σ((x_i - mean)²)
output = (x - mean) / √(variance + ε)
- Subtracts mean (centers at 0)
- Divides by standard deviation
- More computation (2 passes)
RMSNorm:
mean_square = (1/n) × Σ(x_i²)
output = x / √(mean_square + ε)
- Doesn't subtract mean (faster!)
- Only divides by RMS
- Single pass
- Works just as well in practice!
- Prevents Exploding Gradients: Large activations → large gradients → instability
- Prevents Vanishing Gradients: Small activations → small gradients → no learning
- Consistent Scale: Every layer sees similarly-scaled inputs
- Faster Convergence: Optimization landscape is smoother
- Forgetting Epsilon: Division by zero when all inputs are 0
- Wrong Epsilon Value: Too large → bad normalization, too small → numerical instability
- Normalizing Wrong Dimension: Should normalize per-token, not per-batch
Goal: Convert Numbers to Probabilities
Input: [2.0, 1.0, 0.1] Output: [0.659, 0.242, 0.099] (sums to 1.0)
Softmax takes ANY numbers and converts them to:
- Non-negative values (≥ 0)
- Sum to 1.0 (valid probability distribution)
- Preserves order (largest input → largest output)
Why "Soft"?
- Hard max: [2.0, 1.0, 0.1] → [1, 0, 0] (winner takes all)
- Soft max: [2.0, 1.0, 0.1] → [0.66, 0.24, 0.10] (probabilities)
"Soft" because everyone gets something, but higher values get more!
softmax(x_i) = exp(x_i) / Σ(exp(x_j))
Where:
- exp(x) = e^x (exponential function)
- Sum is over all j (all elements)
Direct computation fails:
Input: [1000, 999, 998]
exp(1000) = ∞ (overflow!)
exp(999) = ∞
exp(998) = ∞
softmax = [∞/∞, ∞/∞, ∞/∞] = [NaN, NaN, NaN] ❌
Key Insight:
exp(x - max) / Σ(exp(x - max))
= exp(x) / exp(max) / Σ(exp(x) / exp(max))
= exp(x) / Σ(exp(x))
Same result! But now the largest value is exp(0) = 1
With max trick:
Input: [1000, 999, 998]
Max: 1000
Shifted: [0, -1, -2]
exp([0, -1, -2]) = [1.0, 0.368, 0.135]
Sum = 1.503
softmax = [1.0/1.503, 0.368/1.503, 0.135/1.503]
= [0.665, 0.245, 0.090] ✓
Input:
logits = [2.0, 1.0, 0.1]
Step 1: Find Max
max_val = 2.0
Step 2: Subtract Max (Numerical Stability)
shifted = [2.0-2.0, 1.0-2.0, 0.1-2.0]
= [0.0, -1.0, -1.9]
Step 3: Exponentiate
exp([0.0, -1.0, -1.9]) = [1.0, 0.368, 0.150]
Step 4: Sum
total = 1.0 + 0.368 + 0.150 = 1.518
Step 5: Normalize
output[0] = 1.0 / 1.518 = 0.659
output[1] = 0.368 / 1.518 = 0.242
output[2] = 0.150 / 1.518 = 0.099
Result: [0.659, 0.242, 0.099]
Sum: 1.0 ✓
- Output Range: Always (0, 1) for finite inputs
- Sum: Always equals 1.0
- Monotonic: Preserves order (if x_i > x_j, then softmax(x_i) > softmax(x_j))
- Sensitive to Differences: Large differences get amplified
- Temperature Scaling: softmax(x/T) controls sharpness
T = 1.0: Standard softmax
T < 1.0: Sharper distribution (more confident)
T > 1.0: Softer distribution (more uniform)
Example with T=0.5:
softmax([2.0, 1.0, 0.1] / 0.5)
= softmax([4.0, 2.0, 0.2])
≈ [0.876, 0.116, 0.008] (much sharper!)
- Forgetting Max Trick: Overflow with large values
- Wrong Dimension: Applying softmax to wrong axis
- Log-Softmax Confusion: log(softmax(x)) ≠ softmax(x) numerically
Problem with Plain SGD:
SGD: θ = θ - α × gradient
Issues:
1. Fixed learning rate (α)
- Too large: Diverges or oscillates
- Too small: Takes forever
2. No momentum
- Gets stuck in local minima
- Slow in flat regions
3. Same LR for all parameters
- Some parameters need small LR
- Some parameters need large LR
Adam Solution:
- Momentum: Like a heavy ball rolling down hill
- Adaptive LR: Different learning rate per parameter
- Bias Correction: Fixes early-training bias
For each parameter θ with gradient g:
Initialize:
m = 0 (first moment - momentum)
v = 0 (second moment - variance)
t = 0 (timestep)
For each iteration:
t = t + 1
# Update first moment (momentum)
m = β₁ × m + (1 - β₁) × g
# Update second moment (variance)
v = β₂ × v + (1 - β₂) × g²
# Bias correction
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)
# Update parameter
θ = θ - α × m̂ / (√v̂ + ε)
1. First Moment (Momentum):
m = β₁ × m + (1 - β₁) × g
β₁ = 0.85 means:
- Keep 85% of past momentum
- Add 15% of current gradient
Analogy: Heavy ball
- Past gradients (momentum) keep it moving
- Current gradient adjusts direction slightly
2. Second Moment (Variance):
v = β₂ × v + (1 - β₂) × g²
β₂ = 0.99 means:
- Keep 99% of past variance
- Add 1% of current squared gradient
Purpose:
- Large gradient (high variance) → small step (noisy)
- Consistent gradient (low variance) → large step (trustworthy)
3. Bias Correction:
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)
Why needed:
Early in training, m and v are initialized to 0
They're biased toward 0, so we correct this
Example:
t=1: m = 0.15 × g
m̂ = 0.15 × g / (1 - 0.85) = 0.15 × g / 0.15 = g ✓
4. Adaptive Update:
θ = θ - α × m̂ / (√v̂ + ε)
Signal-to-noise ratio:
- m̂ = signal (direction)
- √v̂ = noise (uncertainty)
- m̂/√v̂ = how much to trust this gradient
Large v̂ (noisy) → small step
Small v̂ (consistent) → large step
Initial State:
Parameter: θ = 5.0
Gradient: g = 0.1
m = 0, v = 0, t = 0
α = 0.01, β₁ = 0.85, β₂ = 0.99, ε = 1e-8
Iteration 1:
t = 1
# Update momentum
m = 0.85 × 0 + 0.15 × 0.1 = 0.015
# Update variance
v = 0.99 × 0 + 0.01 × 0.01 = 0.0001
# Bias correction
m̂ = 0.015 / (1 - 0.85¹) = 0.015 / 0.15 = 0.1
v̂ = 0.0001 / (1 - 0.99¹) = 0.0001 / 0.01 = 0.01
# Update parameter
update = 0.01 × 0.1 / (√0.01 + 1e-8)
= 0.001 / (0.1 + 1e-8)
≈ 0.01
θ = 5.0 - 0.01 = 4.99
Iteration 2:
t = 2
Gradient: g = 0.12 (slightly larger)
# Update momentum (accumulates past gradients)
m = 0.85 × 0.015 + 0.15 × 0.12
= 0.01275 + 0.018 = 0.03075
# Update variance
v = 0.99 × 0.0001 + 0.01 × 0.0144
= 0.000099 + 0.000144 = 0.000243
# Bias correction (t=2)
m̂ = 0.03075 / (1 - 0.85²) = 0.03075 / 0.2775 = 0.111
v̂ = 0.000243 / (1 - 0.99²) = 0.000243 / 0.0199 = 0.0122
# Update parameter
update = 0.01 × 0.111 / (√0.0122 + 1e-8)
≈ 0.01 × 0.111 / 0.11
≈ 0.01
θ = 4.99 - 0.01 = 4.98
β₁ = 0.85 (Momentum):
- Standard: 0.9
- We use 0.85: Faster adaptation
- Higher → smoother but slower
- Lower → faster but noisier
β₂ = 0.99 (Variance):
- Standard: 0.999
- We use 0.99: Faster adaptation
- Should be close to 1 (slow moving average)
- Estimates long-term variance
ε = 1e-8 (Epsilon):
- Prevents division by zero
- Tiny enough not to affect learning
- Safety net for numerical stability
α = 0.01 (Learning Rate):
- Base learning rate
- Multiplied by m̂/√v̂ for each parameter
- Too large → unstable
- Too small → slow
- Wrong β Values: β₂ too low → unstable adaptive LR
- Forgetting Bias Correction: Early training is wrong
- Wrong Epsilon: Too large → affects learning, too small → numerical issues
- Not Decaying LR: Learning rate should decrease over time
Transform input features to output features through learned weights.
Analogy: Mixing Desk
Think of each weight as a "fader" on a mixing desk:
- Input: Multiple audio channels (features)
- Weights: Volume controls for each channel
- Output: Mixed audio (combination of inputs)
Given:
x: Input vector of size n_in
W: Weight matrix of size n_out × n_in
Output:
y[j] = Σ(i) x[i] × W[j][i]
For each output dimension j:
Sum over all input dimensions i
Input:
x = [1, 2, 3] (n_in = 3)
W = [[4, 5, 6], (n_out = 2)
[7, 8, 9]]
Computation:
y[0] = x[0]×W[0][0] + x[1]×W[0][1] + x[2]×W[0][2]
= 1×4 + 2×5 + 3×6
= 4 + 10 + 18
= 32
y[1] = x[0]×W[1][0] + x[1]×W[1][1] + x[2]×W[1][2]
= 1×7 + 2×8 + 3×9
= 7 + 16 + 27
= 50
Output: y = [32, 50]
Matrix multiplication is a linear transformation:
- Rotation: Rotates the input space
- Scaling: Stretches or shrinks
- Shearing: Slants the space
Multiple linear layers → learn complex patterns
- Layer 1: Simple rotations
- Layer 2: Rotations of rotations
- Layer 3: Can approximate any function!
def linear(x, w):
"""
Compute y = xW (matrix multiplication)
Args:
x: Input vector [n_in]
w: Weight matrix [n_out][n_in]
Returns:
y: Output vector [n_out]
"""
return [
sum(wi * xi for wi, xi in zip(w_row, x))
for w_row in w
]How it works:
- For each row in W (each output dimension)
- Multiply corresponding elements of row and x
- Sum all products
- Result is one element of output
- Wrong Dimension Order: xW vs Wx matters!
- Shape Mismatch: Dimensions must align
- No Bias Term: This is pure linear (y = xW)
- Numerical Overflow: Large values can overflow
(Already covered in detail in Section 3 above)
Key Points:
- Converts logits to probabilities
- Uses max trick for numerical stability
- Outputs sum to 1.0
- Preserves order
(Already covered in detail in Section 2 above)
Key Points:
- Normalizes to unit variance
- Prevents exploding/vanishing activations
- Faster than LayerNorm
- Uses epsilon for numerical stability
Create all learnable parameters with random initial values.
Why Random?
- Breaks symmetry: All neurons learn different things
- If all zeros: Every neuron computes the same thing!
- If all same: No diversity in learning
1. Token Embeddings (wte):
Shape: [vocab_size] × [n_embd]
Purpose: Map token ID → vector
Example: Token 5 → [0.2, -0.1, 0.5, ...]
2. Position Embeddings (wpe):
Shape: [block_size] × [n_embd]
Purpose: Add position information
Example: Position 3 → [0.1, 0.3, -0.2, ...]
3. Attention Weights (per layer):
Wq, Wk, Wv: [n_embd] × [n_embd] - Query, Key, Value projections
Wo: [n_embd] × [n_embd] - Output projection
4. MLP Weights (per layer):
FC1: [4 × n_embd] × [n_embd] - Expansion
FC2: [n_embd] × [4 × n_embd] - Contraction
5. Language Model Head:
Shape: [vocab_size] × [n_embd]
Purpose: Project to vocabulary logits
Gaussian with std=0.08:
param = random.gauss(0, 0.08)Why std=0.08?
- Too small (0.001): Vanishing gradients
- Too large (1.0): Exploding gradients
- Rule of thumb: 1/√fan_in = 1/√256 ≈ 0.06
- 0.08 is slightly larger for better gradient flow
For our model (n_embd=16, n_head=4, n_layer=1, vocab=27):
Token embeddings: 27 × 16 = 432
Position embeddings: 16 × 16 = 256
Attention (×4): 4 × 16 × 16 = 1,024
MLP: 64×16 + 16×64 = 2,048
LM head: 27 × 16 = 432
Total: 4,192 parameters
- Wrong std: Too large/small causes training issues
- Not Random: Symmetry prevents learning
- Forgetting Bias: Some layers need bias terms
- Wrong Shape: Dimension mismatches cause errors
Process a single token through the entire GPT model to predict the next token.
INPUT: token_id, pos_id, cached keys/values
Step 1: Embedding
tok_emb = wte[token_id] # Look up token
pos_emb = wpe[pos_id] # Look up position
x = tok_emb + pos_emb # Combine
x = rmsnorm(x) # Normalize
Step 2: Transformer Layers (repeat N_LAYER times)
For each layer:
a. Residual: x_res = x
b. Multi-Head Attention:
- Q, K, V projections
- Attention computation
- Output projection
c. Residual: x = x + x_res
d. MLP:
- Expand (4×)
- ReLU
- Contract (1/4×)
e. Residual: x = x + x_res
Step 3: Output Projection
logits = linear(x, lm_head)
OUTPUT: Logits over vocabulary
Input:
token_id = 5 (letter 'e')
pos_id = 2
Step 1: Embedding
tok_emb = [0.2, -0.1, 0.5, 0.3, ...] # 16 dims
pos_emb = [0.1, 0.0, -0.2, 0.4, ...] # 16 dims
x = [0.3, -0.1, 0.3, 0.7, ...] # Element-wise sum
x = rmsnorm(x) # Normalize
Step 2: Attention
Q = linear(x, Wq) # Project to query
K = linear(x, Wk) # Project to key
V = linear(x, Wv) # Project to value
# For each head (4 heads, 4 dims each)
For head in range(4):
q_head = Q[head*4:(head+1)*4]
k_head = [K[head*4:(head+1)*4] for K in cached_keys]
v_head = [V[head*4:(head+1)*4] for V in cached_values]
# Compute attention
scores = [dot(q_head, k) for k in k_head] / √4
weights = softmax(scores)
output = weighted_sum(weights, v_head)
# Concatenate heads and project
x = linear(concatenated, Wo)
Step 3: MLP
x = linear(x, fc1) # 16 → 64
x = [relu(xi) for xi in x] # Activate
x = linear(x, fc2) # 64 → 16
Step 4: Output
logits = linear(x, lm_head) # 16 → 27
- Sequential Processing: One token at a time
- Caching: Keys/values stored for future positions
- Residuals: Critical for gradient flow
- Normalization: Before attention and MLP
- Wrong Cache Index: Using wrong position
- Forgetting Residual: Gradients don't flow
- Missing Normalization: Unstable training
- Wrong Head Slicing: Dimension errors
Measure how wrong the model's predictions are using cross-entropy loss.
For each position i:
Loss_i = -log(prob_i[target_i])
Total Loss:
L = (1/N) × Σ(Loss_i)
Where:
- prob_i: Predicted probability distribution
- target_i: Correct token ID
- N: Number of predictions
Goal: Maximize probability of correct tokens
Problem: Optimization minimizes functions
Solution: Minimize negative log probability
-max(log(p)) = min(-log(p))
Why Penalize Low Probability?
If p = 0.99 (very confident): -log(0.99) = 0.01 ✓
If p = 0.50 (uncertain): -log(0.50) = 0.69 ⚠️
If p = 0.01 (very wrong): -log(0.01) = 4.61 ❌
Low probability → High penalty
Predictions:
Position 0: probs[5] = 0.038, target = 5 ('e')
Position 1: probs[13] = 0.012, target = 13 ('m')
Position 2: probs[13] = 0.025, target = 13 ('m')
Compute Losses:
Loss_0 = -log(0.038) = 3.270
Loss_1 = -log(0.012) = 4.423
Loss_2 = -log(0.025) = 3.689
Average:
L = (3.270 + 4.423 + 3.689) / 3
= 11.382 / 3
= 3.794
Perplexity = exp(Loss)
Interpretation:
- How "surprised" is the model?
- Lower = better predictions
- Perplexity of 10 = model is 10× uncertain
Example:
Loss = 3.794
Perplexity = exp(3.794) = 44.4
Model is, on average, considering 44.4 possibilities
- Log of Zero: -log(0) = ∞ (need tiny epsilon)
- Wrong Average: Mean vs sum matters for gradients
- Ignoring Padding: Should mask padded positions
Execute one complete training iteration: sample data, forward pass, compute loss, backward pass, update parameters.
STEP 1: Sample Data
doc = docs[step % len(docs)]
tokens = tokenize(doc)
STEP 2: Forward Pass (position by position)
Initialize: keys = [[]], values = [[]]
For pos in range(seq_len):
token = tokens[pos]
target = tokens[pos + 1]
logits = gpt_forward(token, pos, keys, values)
probs = softmax(logits)
outputs.append(probs)
STEP 3: Compute Loss
targets = [tokens[pos + 1] for pos in range(seq_len)]
loss = compute_loss(outputs, targets)
STEP 4: Backward Pass
loss.backward()
# Gradients now computed for all parameters
STEP 5: Update Parameters
optimizer.step()
# All parameters updated
Step 1: Sample
doc = "emma"
tokens = [26, 5, 13, 13, 1, 26] # BOS, e, m, m, a, BOS
seq_len = 5
Step 2: Forward
Position 0:
Input: token=26 (BOS), pos=0
Keys/values: []
Output: logits[0] → probs[0]
Position 1:
Input: token=5 ('e'), pos=1
Keys/values: [[k0], [v0]] # Cached from pos 0
Output: logits[1] → probs[1]
Position 2:
Input: token=13 ('m'), pos=2
Keys/values: [[k0, k1], [v0, v1]] # Cached from pos 0,1
Output: logits[2] → probs[2]
... continue for all positions
Step 3: Loss
targets = [5, 13, 13, 1, 26] # e, m, m, a, BOS
outputs = [probs[0], probs[1], probs[2], probs[3], probs[4]]
loss = average(-log(probs[i][targets[i]]))
Step 4: Backward
loss.grad = 1.0
# Backprop through computation graph
# All parameters get gradients
Step 5: Update
For each parameter:
m = β₁ × m + (1-β₁) × grad
v = β₂ × v + (1-β₂) × grad²
param -= lr × m̂ / (√v̂ + ε)
- Autoregressive: Each position sees all previous positions
- Caching: Keys/values accumulated through sequence
- Gradient Flow: Loss → parameters through all operations
- Parameter Update: Adam optimizer handles all 4,192 parameters
- Wrong Targets: Off-by-one errors
- Cache Clearing: Forgetting to reset between samples
- Gradient Reset: Not clearing old gradients
- Sequence Length: Exceeding block_size
Linear:
y = xW
y[j] = Σ(i) x[i] × W[j][i]
Softmax:
softmax(x[i]) = exp(x[i]) / Σ(j) exp(x[j])
RMSNorm:
RMSNorm(x[i]) = x[i] / √(mean(x²) + ε)
mean(x²) = (1/n) × Σ(x[i]²)
Attention:
score = Q · K^T / √d_k
α = softmax(score)
output = α · V
Cross-Entropy:
L = -Σ(i) log(p[i])
Perplexity:
PPL = exp(L)
Adam:
m = β₁ × m + (1-β₁) × g
v = β₂ × v + (1-β₂) × g²
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)
θ = θ - α × m̂ / (√v̂ + ε)
ReLU:
ReLU(x) = max(0, x)
dReLU/dx = 1 if x > 0 else 0
Token ID: scalar (0 to vocab_size-1)
Position ID: scalar (0 to block_size-1)
Token embedding: [n_embd] = [16]
Position embedding: [n_embd] = [16]
Combined: [n_embd] = [16]
After RMSNorm: [n_embd] = [16]
Input: [n_embd] = [16]
Q, K, V (each): [n_embd] = [16]
Per head (4 heads, 4 dims each):
Q_head: [head_dim] = [4]
K_head: [head_dim] = [4]
V_head: [head_dim] = [4]
Attention scores (for T positions): [T]
Attention weights: [T]
Head output: [head_dim] = [4]
Concatenated (4 heads): [n_embd] = [16]
After output projection: [n_embd] = [16]
After residual: [n_embd] = [16]
Input: [n_embd] = [16]
After FC1 (expand 4×): [4 × n_embd] = [64]
After ReLU: [64]
After FC2 (contract): [n_embd] = [16]
After residual: [n_embd] = [16]
Input: [n_embd] = [16]
After LM head: [vocab_size] = [27]
Token ID: [1]
↓ Embedding
Embedding: [16]
↓ Attention
Attention output: [16]
↓ MLP
MLP output: [16]
↓ LM Head
Logits: [27]
↓ Softmax
Probabilities: [27]
- What is a GPT model?
- What does "autoregressive" mean?
- Why do we need tokenization?
- What is an embedding?
- What does softmax do?
- Why do we compute loss?
- What is gradient descent?
- How does the model generate text?
- How does multi-head attention work?
- Why do we need positional embeddings?
- What is the purpose of residual connections?
- How does RMSNorm stabilize training?
- What is cross-entropy loss?
- How does backpropagation work?
- What is the chain rule?
- Why do we use Adam instead of SGD?
- Implement attention from scratch
- Implement softmax with numerical stability
- Implement RMSNorm
- Implement Adam optimizer
- Debug gradient flow issues
- Choose appropriate hyperparameters
- Optimize training speed
- Handle edge cases (empty sequences, etc.)
- Prove why attention is O(n²)
- Derive the Adam update equations
- Explain why we divide by √d_k in attention
- Analyze gradient flow through residuals
- Understand capacity vs overfitting trade-offs
- Design experiments to test hypotheses
- Read and understand original papers
The refactored code maintains the elegance and efficiency of Andrej Karpathy's original implementation while making it significantly more accessible and maintainable. The addition of comprehensive documentation, clear structure, and visual diagrams makes this an excellent educational resource for understanding how GPT models work under the hood.
This study guide provides complete coverage from high-level intuition to mathematical details to implementation specifics, making it suitable for learners at all levels.