Build-GPT-model-with-Python/script_gpt_educational.py at main · andresveraf/Build-GPT-model-with-Python · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
"""
================================================================================
DEEP LEARNING FROM SCRATCH: Building a GPT in Pure Python
================================================================================

📚 EDUCATIONAL IMPLEMENTATION OF A GPT MODEL

This code is designed as a complete learning journey through deep learning.
Every concept is explained with intuition, mathematics, and implementation.

🎯 LEARNING OBJECTIVES:
    After studying this code, you will understand:
    1. How neural networks compute (forward pass)
    2. How they learn (backward pass & automatic differentiation)
    3. How to optimize them (Adam optimizer & hyperparameters)
    4. The transformer architecture (attention is all you need)
    5. Language modeling and text generation

👨‍🏫 PROFESSOR'S TEACHING PHILOSOPHY:
    "The best way to learn deep learning is to understand the WHY behind
     every decision. We don't just implement algorithms - we understand
     the intuition, the mathematics, and the practical trade-offs."

Based on Andrej Karpathy's minimal GPT implementation
Enhanced with educational explanations and workflow visualization
"""

print("=" * 80)
print("DEEP LEARNING FROM SCRATCH: GPT Implementation")
print("=" * 80)
print("\n📚 Welcome! Let's build a GPT model step by step, understanding every detail.")
print("   This implementation includes detailed workflow prints to help you")
print("   visualize what's happening at each stage of training and inference.\n")

# =============================================================================
# PART 1: FOUNDATIONS - IMPORTS AND RANDOM SEED
# =============================================================================

import os       # For file operations
import math     # For mathematical functions (log, exp, sqrt)
import random   # For randomness in initialization and sampling

"""
🎲 WHY SET A RANDOM SEED?
━━━━━━━━━━━━━━━━━━━━━━━━━━
Reproducibility! In research and debugging, we want the same results
every time we run the code. This is crucial for:
  1. Debugging: If there's a bug, we can reproduce it
  2. Comparing experiments: Same initialization = fair comparison
  3. Educational consistency: Everyone sees the same learning process

The number 42 is arbitrary (a reference to "The Hitchhiker's Guide to the Galaxy")
but any fixed number works.
"""

random.seed(42)

print(f"✓ Random seed set to 42 for reproducibility")

# =============================================================================
# PART 2: HYPERPARAMETERS - THE "KNOBS" OF DEEP LEARNING
# =============================================================================

print("\n" + "=" * 80)
print("PART 2: CONFIGURING THE MODEL - HYPERPARAMETERS")
print("=" * 80)

"""
📐 ARCHITECTURE HYPERPARAMETERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
These define the STRUCTURE of our neural network. Think of them as the
blueprint for building the model.

Why N_EMBD = 16?
  - In practice: 512, 1024, or more for real models
  - Here: Small for educational purposes (faster training)
  - Each token becomes a 16-dimensional vector
  - Higher = more capacity to learn patterns, but slower

Why N_HEAD = 4?
  - Multi-head attention lets the model focus on different aspects
  - Head 1 might learn: "pay attention to previous letter"
  - Head 2 might learn: "pay attention to letter position"
  - Head 3 might learn: "detect consonant patterns"
  - Head 4 might learn: "detect vowel patterns"
  - Each head is independent, so they can learn different things

Why N_LAYER = 1?
  - More layers = deeper understanding (more abstraction)
  - Layer 1: Simple patterns (bigrams)
  - Layer 2: Medium patterns (trigrams, phonemes)
  - Layer 3+: Complex patterns (morphology, semantics)
  - We use 1 for simplicity, but GPT-2 uses 12, GPT-3 uses 96!

Why BLOCK_SIZE = 16?
  - Maximum context window: how many previous tokens we can "remember"
  - Shorter = faster training, but limited context
  - Longer = richer context, but more computation (quadratic!)
  - Real models use 1024, 2048, or more

📊 TRAINING HYPERPARAMETERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━
These control HOW the model learns.

Why LEARNING_RATE = 0.01?
  - Learning rate is the "step size" in gradient descent
  - Too large (0.1): Might overshoot the minimum, diverge
  - Too small (0.0001): Takes forever to learn
  - 0.01 is moderate for this small model
  - In practice: Use learning rate scheduling (warmup + decay)

Why BETA1 = 0.85 and BETA2 = 0.99?
  - These are for the Adam optimizer's momentum
  - β1 controls the "momentum" (first moment - mean of gradients)
    * 0.85 = moderately fast adaptation
    * Closer to 1 = more smoothing, but slower to adapt
    * Standard is 0.9, we use 0.85 for faster learning
  - β2 controls the "adaptive learning rate" (second moment - variance)
    * 0.99 = very slow adaptation (moving average of squared gradients)
    * This makes the learning rate more stable
    * Standard is 0.999, we use 0.99 for faster adaptation

Why EPS_ADAM = 1e-8?
  - Tiny value to prevent division by zero
  - In Adam, we divide by sqrt(variance + eps)
  - Without eps: If variance is 0, we'd crash!
  - 1e-8 is small enough not to affect learning but prevents crashes

Why WEIGHT_INIT_STD = 0.08?
  - Standard deviation for random weight initialization
  - Small enough to prevent exploding activations
  - Large enough to prevent vanishing gradients
  - Rule of thumb: 1/sqrt(fan_in) = 1/sqrt(256) ≈ 0.06
  - 0.08 is slightly larger for better gradient flow

Why TEMPERATURE = 0.5?
  - Controls randomness in text generation
  - Temperature = 1.0: Sample from model's exact distribution
  - Temperature < 1.0: More deterministic, conservative
  - Temperature > 1.0: More random, creative (but risky)
  - 0.5 gives us reasonable diversity without too much nonsense
"""

# Model Architecture
N_EMBD = 16          # Embedding dimension
N_HEAD = 4           # Number of attention heads
N_LAYER = 1          # Number of transformer layers
BLOCK_SIZE = 16      # Maximum sequence length (context window)
HEAD_DIM = N_EMBD // N_HEAD  # Dimension per head = 4

# Training Configuration
LEARNING_RATE = 0.01
BETA1 = 0.85         # Adam momentum parameter (first moment)
BETA2 = 0.99         # Adam momentum parameter (second moment)
EPS_ADAM = 1e-8      # Numerical stability constant
NUM_STEPS = 1000     # Training iterations
TEMPERATURE = 0.5    # Sampling temperature

# Initialization
WEIGHT_INIT_STD = 0.08

print(f"\n📐 MODEL ARCHITECTURE:")
print(f"   • Embedding dimension: {N_EMBD}")
print(f"   • Attention heads: {N_HEAD} (each with {HEAD_DIM} dimensions)")
print(f"   • Transformer layers: {N_LAYER}")
print(f"   • Context window: {BLOCK_SIZE} tokens")

print(f"\n📊 TRAINING CONFIGURATION:")
print(f"   • Learning rate: {LEARNING_RATE}")
print(f"   • Adam β1 (momentum): {BETA1}")
print(f"   • Adam β2 (variance): {BETA2}")
print(f"   • Training steps: {NUM_STEPS}")
print(f"   • Sampling temperature: {TEMPERATURE}")

# =============================================================================
# PART 3: DATA LOADING - THE TRAINING SET
# =============================================================================

print("\n" + "=" * 80)
print("PART 3: DATA LOADING - WHERE KNOWLEDGE COMES FROM")
print("=" * 80)

def load_dataset(filename='input.txt', url=None):
    """
    Load or download the training dataset.

    📚 WHY THIS DATASET?
    ━━━━━━━━━━━━━━━━━━━
    We're using a dataset of names (makemore). This is perfect for learning because:
    1. Small enough to train quickly
    2. Clear structure (names follow patterns)
    3. Fun to see what the model generates

    The model will learn:
    - Common letter combinations (th, ch, sh)
    - Name patterns (vowel-consonant structure)
    - Probabilistic rules (q is almost always followed by u)
    """
    # Download if needed
    if not os.path.exists(filename):
        print(f"\n📥 Dataset not found. Downloading from GitHub...")
        import urllib.request
        if url is None:
            url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
        urllib.request.urlretrieve(url, filename)
        print(f"✓ Downloaded to {filename}")

    # Load and parse
    print(f"\n📖 Loading dataset from {filename}...")
    docs = [
        line.strip()
        for line in open(filename).read().strip().split('\n')
        if line.strip()
    ]

    # Shuffle for randomness
    random.shuffle(docs)

    # Show examples
    print(f"✓ Loaded {len(docs)} names")
    print(f"\n📝 Example names from dataset:")
    for i, name in enumerate(docs[:5]):
        print(f"   {i+1}. {name}")

    return docs

docs = load_dataset()

# =============================================================================
# PART 4: TOKENIZATION - CONVERTING TEXT TO NUMBERS
# =============================================================================

print("\n" + "=" * 80)
print("PART 4: TOKENIZATION - BRIDGING HUMANS AND MACHINES")
print("=" * 80)

class Tokenizer:
    """
    Converts between text strings and integer token IDs.

    🔄 WHY TOKENIZE?
    ━━━━━━━━━━━━━━━━
    Neural networks work with NUMBERS, not text. We need to convert:
    1. Text → Tokens: "hello" → [8, 5, 12, 12, 15]
    2. Tokens → Text: [8, 5, 12, 12, 15] → "hello"

    🔢 CHARACTER-LEVEL TOKENIZATION
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    We use character-level tokenization (not word-level or subword):
    - Simpler for educational purposes
    - Each character becomes a unique ID
    - Vocabulary size = number of unique characters + 1

    🏷️ SPECIAL TOKEN: BOS
    ━━━━━━━━━━━━━━━━━━━━━━━
    BOS = Beginning of Sequence
    - Marks the start and end of sequences
    - Helps the model learn when to stop generating
    - Also serves as EOS (End of Sequence) token

    Example:
    Input: "emma"
    With BOS: 26, 5, 13, 13, 1, 26
    Decode: "emma" (removes BOS tokens)
    """

    def __init__(self, docs):
        """Build vocabulary from training data."""
        # Extract unique characters
        self.uchars = sorted(set(''.join(docs)))

        # BOS token ID (last one)
        self.bos_token_id = len(self.uchars)

        # Total vocabulary size
        self.vocab_size = len(self.uchars) + 1

        print(f"\n📚 BUILT VOCABULARY:")
        print(f"   • Vocabulary size: {self.vocab_size} tokens")
        print(f"   • Characters: {''.join(self.uchars)}")
        print(f"   • BOS token ID: {self.bos_token_id} (special marker)")

        # Show encoding examples
        print(f"\n🔄 TOKENIZATION EXAMPLES:")
        example_names = docs[:3]
        for name in example_names:
            encoded = self.encode(name)
            print(f"   '{name}' → {encoded}")

    def encode(self, text):
        """Convert text to token IDs with BOS markers."""
        tokens = [self.uchars.index(ch) for ch in text]
        return [self.bos_token_id] + tokens + [self.bos_token_id]

    def decode(self, token_ids):
        """Convert token IDs back to text (removes BOS)."""
        return ''.join([
            self.uchars[tid]
            for tid in token_ids
            if tid != self.bos_token_id
        ])

tokenizer = Tokenizer(docs)

# =============================================================================
# PART 5: AUTOGRAD - THE MAGIC OF AUTOMATIC DIFFERENTIATION
# =============================================================================

print("\n" + "=" * 80)
print("PART 5: AUTOGRAD - HOW NEURAL NETWORKS LEARN")
print("=" * 80)

print("""
🧠 THE CORE IDEA: COMPUTATION GRAPHS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Every computation builds a graph. During backpropagation, we traverse
this graph backwards, computing gradients using the chain rule.

Example: Computing loss for a single prediction
    Input x → Linear layer → Activation → Softmax → Cross-entropy → Loss

Forward pass (compute output):
    We calculate the loss by going left to right

Backward pass (compute gradients):
    We calculate ∂Loss/∂∂weight by going right to left
    Chain rule: ∂L/∂w = ∂L/∂a × ∂a/∂z × ∂z/∂w

🎯 WHY THIS MATTERS:
    Without autograd, we'd have to manually compute derivatives for
    millions of parameters. Autograd automates this perfectly!

📊 CHAIN RULE IN PRACTICE:
    If y = f(x) and L = g(y), then:
    ∂L/∂x = ∂L/∂y × ∂y/∂x

    This lets us propagate gradients backward through any computation!
""")

class Value:
    """
    A scalar value that supports automatic differentiation.

    🎮 THINK OF IT AS:
    Each Value is a node in a computation graph. It knows:
    1. Its value (data)
    2. Which operations created it (_children)
    3. How to compute gradients (_local_grads)

    💡 KEY INSIGHT:
    When we compute z = x + y, we create a new Value z that:
    - Stores the result: z.data = x.data + y.data
    - Remembers its parents: z._children = (x, y)
    - Knows local gradients: z._local_grads = (1, 1)
      (because ∂z/∂x = 1 and ∂z/∂y = 1)

    Later, when we call z.backward(), we can compute gradients for x and y!
    """

    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                # Forward pass value
        self.grad = 0                   # Backward pass gradient (initialized to 0)
        self._children = children       # Dependencies
        self._local_grads = local_grads # Local derivatives

    def __add__(self, other):
        """
        Addition: z = x + y
        Local gradients: ∂z/∂x = 1, ∂z/∂y = 1

        💡 WHY 1?
        If z = x + y, then:
        ∂z/∂x = 1 (derivative of x is 1, y is constant)
        ∂z/∂y = 1 (derivative of y is 1, x is constant)
        """
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        """
        Multiplication: z = x * y
        Local gradients: ∂z/∂x = y, ∂z/∂y = x

        💡 WHY y AND x?
        If z = x * y, then:
        ∂z/∂x = y (derivative of x*y with respect to x is y)
        ∂z/∂y = x (derivative of x*y with respect to y is x)

        This is why we need to remember the values during forward pass!
        """
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, other):
        """
        Power: z = x^n
        Local gradient: ∂z/∂x = n * x^(n-1)

        📐 DERIVATIVE RULE:
        d/dx(x^n) = n * x^(n-1)
        Example: d/dx(x^2) = 2x, d/dx(x^3) = 3x^2
        """
        return Value(self.data ** other, (self,), (other * self.data ** (other - 1),))

    def log(self):
        """
        Natural logarithm: z = log(x)
        Local gradient: ∂z/∂x = 1/x

        📐 DERIVATIVE RULE:
        d/dx(log(x)) = 1/x

        💡 WHY LOG IN LOSS?
        Log converts products to sums: log(a*b) = log(a) + log(b)
        This makes gradients more stable and prevents underflow.
        Also: -log(p) penalizes low probability heavily
        """
        return Value(math.log(self.data), (self,), (1 / self.data,))

    def exp(self):
        """
        Exponential: z = exp(x)
        Local gradient: ∂z/∂x = exp(x)

        📐 DERIVATIVE RULE:
        d/dx(exp(x)) = exp(x)

        The exponential function is its own derivative!
        """
        return Value(math.exp(self.data), (self,), (math.exp(self.data),))

    def relu(self):
        """
        ReLU activation: z = max(0, x)
        Local gradient: ∂z/∂x = 1 if x > 0 else 0

        💡 WHY RELU?
        ReLU = Rectified Linear Unit
        - Simple: max(0, x)
        - Non-linear: Allows learning complex patterns
        - Sparse: Many neurons output 0 (efficient)
        - Gradient: 1 for positive inputs (no vanishing gradient!)

        📉 DERIVATIVE:
        If x > 0: ∂/∂x(max(0, x)) = 1
        If x < 0: ∂/∂x(max(0, x)) = 0
        At x = 0: Undefined (we use 0 by convention)
        """
        return Value(max(0, self.data), (self,), (float(self.data > 0),))

    # Operator overloads for reverse operations
    def __neg__(self): return self * -1
    def __radd__(self, other): return self + other
    def __sub__(self, other): return self + (-other)
    def __rsub__(self, other): return other + (-self)
    def __rmul__(self, other): return self * other
    def __truediv__(self, other): return self * other ** -1
    def __rtruediv__(self, other): return other * self ** -1

    def backward(self):
        """
        Compute gradients using backpropagation.

        🔄 THE ALGORITHM:
        1. Build topological order of all nodes (parents before children)
        2. Set gradient of loss node to 1 (∂L/∂L = 1, trivial)
        3. Traverse backwards (children to parents)
        4. For each node, propagate gradient to its children using chain rule

        📐 CHAIN RULE:
        ∂L/∂child = ∂L/∂node × ∂node/∂child

        Example:
        If L = f(g(h(x))), then:
        ∂L/∂x = ∂L/∂f × ∂f/∂g × ∂g/∂h × ∂h/∂x

        We compute this by traversing the graph backwards!
        """
        # Step 1: Build topological ordering
        topo = []
        visited = set()

        def build_topological(v):
            """Recursive DFS to build order."""
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topological(child)
                topo.append(v)

        build_topological(self)

        # Step 2: Initialize gradient (∂L/∂L = 1)
        self.grad = 1

        # Step 3: Backpropagate
        for node in reversed(topo):
            for child, local_grad in zip(node._children, node._local_grads):
                # Chain rule: accumulate gradients
                child.grad += local_grad * node.grad

print(f"✓ Autograd engine ready - can compute gradients automatically!")

# =============================================================================
# PART 6: NEURAL NETWORK PRIMITIVES
# =============================================================================

print("\n" + "=" * 80)
print("PART 6: BUILDING BLOCKS - LINEAR, SOFTMAX, NORMALIZATION")
print("=" * 80)

def linear(x, w):
    """
    Linear transformation: y = xW (matrix multiplication)

    📐 MATHEMATICAL DEFINITION:
        y[j] = Σ(i) x[i] × W[j][i]

    💡 WHY "LINEAR"?
        - It's a linear transformation (scaling + rotation)
        - No non-linearity yet (that comes later with ReLU)
        - This is the fundamental operation in neural networks

    🎯 PURPOSE:
        - Transform input features to output features
        - Learn patterns through weights W
        - Each weight represents connection strength

    📊 EXAMPLE:
        If x = [1, 2, 3] and W = [[4, 5, 6], [7, 8, 9]]:
        y[0] = 1×4 + 2×5 + 3×6 = 32
        y[1] = 1×7 + 2×8 + 3×9 = 50
        y = [32, 50]
    """
    return [sum(wi * xi for wi, xi in zip(w_row, x)) for w_row in w]

def softmax(logits):
    """
    Convert logits to probabilities using softmax.

    📐 MATHEMATICAL DEFINITION:
        softmax(x[i]) = exp(x[i]) / Σ(j) exp(x[j])

    💡 WHY SOFTMAX?
        - Converts any numbers to probabilities (sum to 1)
        - Preserves order: larger input → larger output
        - Exaggerates differences: winner gets higher probability

    🔢 NUMERICAL STABILITY TRICK:
        We subtract max before exponentiation:
        softmax(x[i]) = exp(x[i] - max(x)) / Σ exp(x[j] - max(x))

        WHY? Because exp(1000) = ∞ (overflow!), but exp(0) = 1
        By subtracting max, the largest value becomes exp(0) = 1
        This prevents overflow while preserving probabilities!

    📊 EXAMPLE:
        Input: [2.0, 1.0, 0.1]
        Max: 2.0
        Shifted: [0.0, -1.0, -1.9]
        Exp: [1.0, 0.368, 0.150]
        Sum: 1.518
        Output: [0.659, 0.242, 0.099]  (sums to 1.0)
    """
    # Numerical stability: subtract max
    max_val = max(val.data for val in logits)

    # Exponentiate (safe now because values are ≤ 0)
    exps = [(val - max_val).exp() for val in logits]

    # Normalize to sum to 1
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x):
    """
    RMS (Root Mean Square) Normalization.

    📐 MATHEMATICAL DEFINITION:
        RMSNorm(x) = x × (mean(x²) + ε)^(-1/2)

    💡 WHY NORMALIZE?
        Problem: Without normalization, activations can grow or shrink
                 exponentially through layers, causing:
                 - Exploding gradients (too large)
                 - Vanishing gradients (too small)

        Solution: Normalize to unit variance at each layer
                 - Keeps values in reasonable range
                 - Stabilizes training
                 - Allows deeper networks

    🎯 RMS vs LAYER NORM:
        RMSNorm is simpler (doesn't subtract mean) but works similarly.
        It's preferred because it's faster and empirically works as well.

    📊 EXAMPLE:
        Input: [2, 4, 6]
        Squared: [4, 16, 36]
        Mean: 18.67
        Scale: (18.67 + 0.00001)^(-0.5) = 0.231
        Output: [0.462, 0.924, 1.386]  (now normalized!)
    """
    # Compute mean of squares
    mean_square = sum(xi * xi for xi in x) / len(x)

    # Compute scaling factor (add tiny epsilon to prevent div/0)
    scale = (mean_square + 1e-5) ** -0.5

    # Scale all values
    return [xi * scale for xi in x]

print(f"✓ Neural network primitives defined")
print(f"   • Linear: Matrix multiplication")
print(f"   • Softmax: Logits → Probabilities")
print(f"   • RMSNorm: Normalize activations")

# =============================================================================
# PART 7: MODEL ARCHITECTURE - THE TRANSFORMER
# =============================================================================

print("\n" + "=" * 80)
print("PART 7: THE TRANSFORMER - ATTENTION IS ALL YOU NEED")
print("=" * 80)

print("""
🧠 THE TRANSFORMER REVOLUTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Transformers (2017) revolutionized AI because:
1. They can process all positions in parallel (unlike RNNs)
2. Self-attention lets them learn relationships between any positions
3. They scale incredibly well (GPT-3 has 175 BILLION parameters)

🎯 KEY INNOVATION: SELF-ATTENTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Traditional: Process left-to-right (like reading)
Transformer: Process all at once, with attention to what matters

Example: "The cat sat on the mat"
- When processing "cat", attention looks at "The" (article)
- When processing "sat", attention looks at "cat" (subject)
- When processing "mat", attention looks at "sat" (action)
- When processing "the" (second one), attention looks at "mat" (object)

This lets the model learn grammar and semantics!

🏗️ TRANSFORMER BLOCK STRUCTURE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input → RMSNorm → Multi-Head Attention → Residual → RMSNorm → MLP → Residual → Output

💡 RESIDUAL CONNECTIONS:
    x = x + Sublayer(x)

    WHY? They provide "gradient highways" for backpropagation.
    Without residuals, deep networks can't train (vanishing gradients).
    With residuals, gradients flow straight through!

📊 MULTI-HEAD ATTENTION:
    - Multiple "heads" process input in parallel
    - Each head learns different attention patterns
    - Results are concatenated and projected

    Analogy: Multiple experts looking at the same data,
              each focusing on different aspects.
""")

def initialize_parameters(vocab_size, n_embd, n_head, n_layer):
    """
    Initialize model parameters.

    🔢 PARAMETER GROUPS:
    ━━━━━━━━━━━━━━━━━━━━━
    1. Token embeddings (wte): Map token IDs → vectors
       - Each token gets a unique vector representation
       - Model learns semantic relationships (similar tokens ≈ similar vectors)

    2. Position embeddings (wpe): Add position information
       - "I am" vs "am I" - same words, different meaning!
       - Tells the model "this token is at position 5"

    3. Attention weights (attn_wq, attn_wk, attn_wv, attn_wo):
       - Wq: Query projection (what am I looking for?)
       - Wk: Key projection (what do I contain?)
       - Wv: Value projection (what information do I offer?)
       - Wo: Output projection (combine heads)

    4. MLP weights (mlp_fc1, mlp_fc2):
       - Feed-forward network for non-linear transformations
       - Expands by 4x, then contracts back (bottleneck)
       - Learns complex patterns

    5. Language model head (lm_head):
       - Projects back to vocabulary size
       - Predicts probability of next token

    🎲 INITIALIZATION STRATEGY:
    ━━━━━━━━━━━━━━━━━━━━━━━━━━
    Gaussian with std=0.08
    - Small enough: Prevents exploding activations
    - Large enough: Prevents vanishing gradients
    - Random: Breaks symmetry (all neurons learn different things)
    """
    def matrix(n_out, n_in, std=WEIGHT_INIT_STD):
        """Create weight matrix with Gaussian initialization."""
        return [
            [Value(random.gauss(0, std)) for _ in range(n_in)]
            for _ in range(n_out)
        ]

    # Initialize all parameters
    state_dict = {
        'wte': matrix(vocab_size, n_embd),
        'wpe': matrix(BLOCK_SIZE, n_embd),
        'lm_head': matrix(vocab_size, n_embd),
    }

    for layer_idx in range(n_layer):
        state_dict[f'layer{layer_idx}.attn_wq'] = matrix(n_embd, n_embd)
        state_dict[f'layer{layer_idx}.attn_wk'] = matrix(n_embd, n_embd)
        state_dict[f'layer{layer_idx}.attn_wv'] = matrix(n_embd, n_embd)
        state_dict[f'layer{layer_idx}.attn_wo'] = matrix(n_embd, n_embd)
        state_dict[f'layer{layer_idx}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
        state_dict[f'layer{layer_idx}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)

    # Flatten into single list
    params = [
        param
        for matrix in state_dict.values()
        for row in matrix
        for param in row
    ]

    print(f"\n🎲 INITIALIZED MODEL PARAMETERS:")
    print(f"   • Total parameters: {len(params):,}")
    print(f"   • Token embeddings: {vocab_size} × {n_embd} = {vocab_size * n_embd:,}")
    print(f"   • Position embeddings: {BLOCK_SIZE} × {n_embd} = {BLOCK_SIZE * n_embd:,}")
    print(f"   • Per-layer attention: 4 × {n_embd} × {n_embd} = {4 * n_embd * n_embd:,}")
    print(f"   • Per-layer MLP: {4 * n_embd} × {n_embd} + {n_embd} × {4 * n_embd} = {8 * n_embd * n_embd:,}")

    return state_dict, params

state_dict, params = initialize_parameters(
    tokenizer.vocab_size, N_EMBD, N_HEAD, N_LAYER
)

def gpt_forward(token_id, pos_id, keys, values, verbose=False):
    """
    Forward pass through the GPT model.

    🌊 THE FORWARD PASS FLOW:
    ━━━━━━━━━━━━━━━━━━━━━━━━━━
    1. Embedding: Look up token and position vectors
    2. Process through transformer layers
       a. Multi-head self-attention (learn relationships)
       b. Feed-forward MLP (process information)
    3. Project to vocabulary (predict next token)

    📍 MULTI-HEAD ATTENTION IN DETAIL:
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    For each head h:
        1. Project input to Q, K, V (Query, Key, Value)
        2. Compute attention scores: score = Q · K / √d
        3. Convert to probabilities: weights = softmax(score)
        4. Weight values: output = weights · V

    💡 WHY Q, K, V?
    ━━━━━━━━━━━━━━━━
    Analogy: Library search
    - Query (Q): What you're looking for
    - Key (K): What each book contains (title, tags)
    - Value (V): The book's content

    Attention score = similarity(Query, Key)
    If score is high → retrieve that Value

    📏 WHY DIVIDE BY √d?
    ━━━━━━━━━━━━━━━━━━
    Prevents dot products from growing too large with dimension.
    Large scores → softmax saturates (all near 0 or 1) → no gradients!
    Dividing by √d keeps scores in reasonable range.

    Args:
        token_id: Current token to process
        pos_id: Current position in sequence
        keys: Cached keys from previous positions
        values: Cached values from previous positions
        verbose: Print detailed information (for learning)

    Returns:
        Logits over vocabulary for next token prediction
    """
    if verbose and pos_id == 0:
        print(f"\n🔍 FORWARD PASS - Position {pos_id}")
        print(f"   Token ID: {token_id} ('{tokenizer.decode([token_id])}')")

    # 1. EMBEDDING LAYER
    # ──────────────────
    # Look up embeddings
    tok_emb = state_dict['wte'][token_id]
    pos_emb = state_dict['wpe'][pos_id]

    # Combine them (addition merges information)
    x = [t + p for t, p in zip(tok_emb, pos_emb)]

    # Normalize
    x = rmsnorm(x)

    if verbose and pos_id == 0:
        print(f"   Embeddings combined and normalized")

    # 2. TRANSFORMER LAYERS
    # ─────────────────────
    for layer_idx in range(N_LAYER):
        # === RESIDUAL CONNECTION ===
        x_residual = x

        # === MULTI-HEAD ATTENTION ===
        x = rmsnorm(x)

        # Project to Q, K, V
        q = linear(x, state_dict[f'layer{layer_idx}.attn_wq'])
        k = linear(x, state_dict[f'layer{layer_idx}.attn_wk'])
        v = linear(x, state_dict[f'layer{layer_idx}.attn_wv'])

        # Store for future positions
        keys[layer_idx].append(k)
        values[layer_idx].append(v)

        # Process each head
        x_attn = []
        for head_idx in range(N_HEAD):
            head_start = head_idx * HEAD_DIM
            head_end = head_start + HEAD_DIM

            # Extract head-specific Q, K, V
            q_head = q[head_start:head_end]
            k_head = [k_vec[head_start:head_end] for k_vec in keys[layer_idx]]
            v_head = [v_vec[head_start:head_end] for v_vec in values[layer_idx]]

            # Attention scores
            attn_logits = [
                sum(
                    q_head[h] * k_head[t][h]
                    for h in range(HEAD_DIM)
                ) / (HEAD_DIM ** 0.5)
                for t in range(len(k_head))
            ]

            # Attention weights
            attn_weights = softmax(attn_logits)

            if verbose and pos_id == 0 and head_idx == 0:
                weights_list = [w.data for w in attn_weights[:5]]
                print(f"   Head {head_idx} attention weights: {weights_list}")

            # Weighted sum of values
            head_out = [
                sum(
                    attn_weights[t] * v_head[t][h]
                    for t in range(len(v_head))
                )
                for h in range(HEAD_DIM)
            ]

            x_attn.extend(head_out)

        # Project attention output
        x = linear(x_attn, state_dict[f'layer{layer_idx}.attn_wo'])

        # Residual
        x = [a + b for a, b in zip(x, x_residual)]

        # === MLP ===
        x_residual = x
        x = rmsnorm(x)
        x = linear(x, state_dict[f'layer{layer_idx}.mlp_fc1'])
        x = [xi.relu() for xi in x]
        x = linear(x, state_dict[f'layer{layer_idx}.mlp_fc2'])
        x = [a + b for a, b in zip(x, x_residual)]

    # 3. OUTPUT PROJECTION
    # ────────────────────
    logits = linear(x, state_dict['lm_head'])

    if verbose and pos_id == 0:
        print(f"   Output logits computed")

    return logits

print(f"✓ GPT architecture defined")

# =============================================================================
# PART 8: THE ADAM OPTIMIZER
# =============================================================================

print("\n" + "=" * 80)
print("PART 8: THE ADAM OPTIMIZER - ADAPTIVE LEARNING")
print("=" * 80)

print("""
🎯 WHY ADAM? (ADAPTIVE MOMENT ESTIMATION)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Plain SGD (Stochastic Gradient Descent) problems:
1. Fixed learning rate: Too slow in some directions, too fast in others
2. No momentum: Gets stuck in local minima, slow in flat regions

Adam solves both:
1. Momentum: Like a heavy ball rolling down a hill
   - Accumulates past gradients
   - Helps escape local minima
   - Speeds up convergence

2. Adaptive learning rates: Different LR for each parameter
   - Frequently updated parameters → smaller LR
   - Rarely updated parameters → larger LR
   - Automatically balances learning

📊 ADAM VS SGD EXAMPLE:
━━━━━━━━━━━━━━━━━━━━━━━
Problem: Minimize f(x,y) = 1000x² + y²
- SGD gets stuck (needs tiny LR for x, but then y learns slowly)
- Adam adapts: large LR for y, small LR for x
- Adam converges faster!

🧮 THE ALGORITHM:
━━━━━━━━━━━━━━━━━
For each parameter θ with gradient g:

1. Update first moment (momentum):
   m_t = β1 × m_{t-1} + (1-β1) × g_t

   β1 = 0.85 means: keep 85% of past momentum, add 15% of current gradient

2. Update second moment (uncentered variance):
   v_t = β2 × v_{t-1} + (1-β2) × g_t²

   β2 = 0.99 means: keep 99% of past variance, add 1% of current squared gradient

3. Bias correction (important early in training):
   m̂_t = m_t / (1 - β1^t)
   v̂_t = v_t / (1 - β2^t)

   WHY? Early in training, m and v are initialized to 0
        They're biased toward 0, so we correct this bias

4. Update parameter:
   θ_t = θ_{t-1} - α × m̂_t / (√v̂_t + ε)

   α is learning rate (we decay it linearly)
   ε = 1e-8 prevents division by zero

💡 KEY INSIGHTS:
━━━━━━━━━━━━━━━━━
- m̂_t / √v̂_t is like: signal / noise
  - High gradient (high variance) → small step (noisy)
  - Consistent gradient (low variance) → large step (trustworthy)

- This is why Adam works so well in practice!
""")

class AdamOptimizer:
    """
    Adam optimizer with momentum and adaptive learning rates.

    🎯 HYPERPARAMETER EXPLANATIONS:
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    lr = 0.01:
        Base learning rate. Adam multiplies this by m̂_t/√v̂_t for each parameter.
        Smaller = more stable but slower
        Larger = faster but might diverge

    beta1 = 0.85:
        Momentum decay rate. Controls how fast we forget past gradients.
        Higher (0.9, 0.99) = smoother momentum, slower to adapt
        Lower (0.8, 0.85) = faster adaptation, less smooth
        Standard is 0.9, we use 0.85 for faster learning

    beta2 = 0.99:
        Variance decay rate. Controls how fast we forget past squared gradients.
        Should be close to 1 (0.999 in paper, 0.99 here for speed)
        Higher = more stable adaptive LR, slower to adapt

    eps = 1e-8:
        Tiny constant to prevent division by zero.
        In practice: gradients are rarely exactly 0, so this is a safety net.

    📊 WHY BIAS CORRECTION?
    ━━━━━━━━━━━━━━━━━━━━━━
    m_t = β1 × m_{t-1} + (1-β1) × g_t

    If β1 = 0.85 and m_0 = 0:
    m_1 = 0.85 × 0 + 0.15 × g_1 = 0.15 × g_1
    m_2 = 0.85 × 0.15×g_1 + 0.15 × g_2 ≈ 0.1275×g_1 + 0.15×g_2

    Early m values are biased toward 0 (because we started at 0)
    Bias correction: m̂_t = m_t / (1 - β1^t)

    For t=1: m̂_1 = 0.15×g_1 / (1 - 0.85) = 0.15×g_1 / 0.15 = g_1 ✓