Skip to content

Latest commit

 

History

History
503 lines (382 loc) · 20 KB

File metadata and controls

503 lines (382 loc) · 20 KB

GPU Validation Results

Hardware: Quadro RTX 8000 (46GB VRAM), CUDA 12.9, bfloat16 Date: March 2026 Total GPU time: ~3.5 hours

Summary

All 4 GPU-dependent capabilities validated successfully on real hardware.

Item Status Duration
SFT Training (2000 iters) PASSED ~35 min
GRPO RL Training (500 iters) PASSED ~2 hr
GSM8K Evaluation (200 problems, k=16) PASSED ~50 min
MEMIT Batch Evaluation (50 edits) PASSED ~2 min

1. SFT Training

Supervised fine-tuning on GSM8K chain-of-thought data with completion-only loss masking.

Config: 2000 iterations, batch_size=4, grad_accum=8, lr=3e-4, cosine schedule, GPT-2 124M base

Loss Progression

Step Train Loss Val Loss
0 3.7935 3.5871
200 2.1233 2.0798
400 1.6287 1.6516
600 1.3880 1.4401
800 1.3583 1.4566
1000 1.2812 1.4125
1200 1.2700 1.3986
1400 1.2405 1.3786
1600 1.1899 1.2973
1800 1.2060 1.3358
2000 1.2093 1.3459

Best validation loss 1.2973 at step 1600. Train loss dropped 68% (3.79 to 1.21). Mild overfitting after step 1600 but overall healthy convergence.

Checkpoint: out-sft/ckpt.pt


2. GRPO RL Training

Group Relative Policy Optimization with DAPO stability tricks, starting from the SFT checkpoint.

Config: 500 iterations, group_size=8, lr=5e-6, KL penalty, DAPO tricks active throughout

DAPO Stability Tricks (all verified working)

Trick Config Observation
Clip-Higher (asymmetric clipping) eps_low=0.2, eps_high=0.28 Allows larger upward policy updates
Entropy bonus (decaying) 0.01 -> 0.001 Entropy stable 1.2 -> 2.0, no collapse
Dynamic sampling Skip zero-variance groups 25 of 500 iterations skipped
Token-level loss normalization Active Prevents length bias

Reward Progression

Iteration Avg Reward
0 0.600
50 0.587
100 0.592
150 0.592
200 0.574
250 0.616
300 0.577
350 0.592
400 0.584
450 0.592
499 0.541

Peak reward 0.686 at step 79. Peak accuracy 9.4%. Average clip fraction 0.177.

Checkpoint: out-grpo/ckpt.pt


3. GSM8K Evaluation

200 GSM8K test problems, 16 completions per problem (3200 generations per checkpoint), temperature=0.7, top_k=50. Bootstrap 95% confidence intervals.

Results

Metric SFT Checkpoint GRPO Checkpoint Delta
pass@1 1.50% [0.00, 3.50] 2.00% [0.50, 4.00] +0.5%
pass@16 22.00% [16.00, 28.00] 22.00% [16.00, 27.50] 0%
majority@16 1.50% [0.00, 3.50] 3.00% [1.00, 5.50] +1.5%
format_compliance 98.91% [97.94, 99.56] 99.38% [98.75, 99.78] +0.5%
mean_cot_length 56.77 tokens 37.12 tokens -35%

Sanity Arithmetic (50 simple problems)

Metric SFT Checkpoint GRPO Checkpoint
pass@1 0.00% 2.00%
pass@16 4.00% 6.00%
format_compliance 90.75% 83.75%
mean_cot_length 24.78 tokens 17.14 tokens

Analysis

  • Format compliance is excellent (>98.9%) for both checkpoints, confirming the SFT pipeline successfully taught the <think>...</think><answer>...</answer> format
  • GRPO improves over SFT: pass@1 1.5% -> 2.0%, majority@16 1.5% -> 3.0%
  • GRPO produces shorter CoT (37 vs 57 tokens) - RL learned that concise reasoning is more effective
  • 22% pass@16 means for 1 in 5 problems, at least one of 16 sampled completions is correct
  • Accuracy is low overall - expected for a 124M parameter model on grade-school math

4. MEMIT Batch Evaluation

50 factual edits applied simultaneously using MEMIT (Mass-Editing Memory In a Transformer) on GPT-2 124M.

Config: layers 3-8, 50 edit requests, CUDA

Overall Results

Metric Value
Success rate 62% (31/50)
Mean efficacy (p_target) 0.2550
Pre-edit mean p_target 0.0094
Duration 112.6 seconds

Successful Edits (31/50) - Selected Examples

Prompt Old Answer New Answer Efficacy Rank
The capital of Egypt is Cairo Alexandria 0.895 1
The lightest element in the periodic table is hydrogen helium 0.703 1
Beethoven was famous for playing the piano violin 0.701 1
The Eiffel Tower is located in Paris London 0.669 1
The light bulb was invented by Thomas Nikola 0.631 1
The longest river in the world is the Nile Amazon 0.614 1
The currency of Germany is the euro pound 0.495 1
The Earth orbits around the Sun Moon 0.451 1
The largest desert in the world is the Sahara Arctic 0.421 1
The capital of Russia is Moscow St (Petersburg) 0.426 1
Albert Einstein was born in Germany London 0.396 1
The capital of Canada is Ottawa Toronto 0.382 1
The fastest land animal is the cheetah lion 0.384 1
The most popular sport in the world is soccer basketball 0.380 1
Shakespeare was born in Stratford London 0.375 1
The official language of Brazil is Portuguese Spanish 0.372 1
The Sun is a type of star planet 0.342 1

Failed Edits (19/50) - Selected Examples

Prompt Old New Efficacy Rank Why
Facebook was created by Mark Jack 0.001 121 Relational knowledge hard to edit
Amazon was founded by Jeff Elon 0.001 141 Founder associations deeply embedded
Twitter was created by Jack Mark 0.002 34 Same pattern - creator associations
Apple was founded by Steve Bill 0.006 11 Founder knowledge resistant
The capital of Germany is Berlin Paris 0.026 6 Strong prior on Berlin

Analysis

  • Geography and simple facts edit well: capital cities, scientific facts, superlatives (efficacy 0.2-0.9)
  • Company founder/creator associations resist editing: all 6 "founded by" / "created by" prompts failed (efficacy <0.02)
  • Close-but-not-rank-1 edits: some edits moved the target to rank 2-5 but couldn't overcome the original fact (Great Wall -> India at rank 2, Mount Everest -> Africa at rank 2)
  • Pre-edit to post-edit improvement: mean p_target increased 27x (0.0094 -> 0.2550)

Bug Fix Applied

nanogpt_edit/edit_core.py: Added case-insensitive fallback in find_subject_tokens() to handle 27 of 50 cases where the subject string didn't exactly match the prompt capitalization.


5. Generation Examples

Token-Level Transcript (GRPO checkpoint)

The reasoning model generates structured output using 4 special tokens in GPT-2's padding range:

Token 50257: <think>    (start chain-of-thought)
Token 50258: </think>   (end chain-of-thought)
Token 50259: <answer>   (start answer)
Token 50260: </answer>  (end answer, triggers stop)

Example: "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?"

--- PROMPT TOKENS (24 tokens) ---
[  0] id= 1532  'If'
[  1] id=  612  ' there'
[  2] id=  389  ' are'
[  3] id=  513  ' 3'
[  4] id= 5006  ' cars'
[  5] id=  287  ' in'
[  6] id=  262  ' the'
[  7] id= 7647  ' parking'
[  8] id= 1256  ' lot'
[  9] id=  290  ' and'
[ 10] id=  362  ' 2'
[ 11] id=  517  ' more'
[ 12] id= 5006  ' cars'
[ 13] id= 9240  ' arrive'
[ 14] id=   11  ','
[ 15] id=  703  ' how'
[ 16] id=  867  ' many'
[ 17] id= 5006  ' cars'
[ 18] id=  389  ' are'
[ 19] id=  287  ' in'
[ 20] id=  262  ' the'
[ 21] id= 7647  ' parking'
[ 22] id= 1256  ' lot'
[ 23] id=   30  '?'

--- GENERATED TOKENS (32 tokens) ---
[ 24] id=50257  *** <think> ***
[ 25] id= 1858  'There'
[ 26] id=  389  ' are'
[ 27] id=  513  ' 3'
[ 28] id= 1635  ' *'
[ 29] id=  362  ' 2'
[ 30] id=  796  ' ='
[ 31] id=  718  ' 6'
[ 32] id= 5006  ' cars'
[ 33] id=  198  '\n'
[ 34] id= 1858  'There'
[ 35] id=  389  ' are'
[ 36] id=  718  ' 6'
[ 37] id= 1635  ' *'
[ 38] id=  604  ' 4'
[ 39] id=  796  ' ='
[ 40] id= 1987  ' 24'
[ 41] id= 5006  ' cars'
[ 42] id=  198  '\n'
[ 43] id= 1858  'There'
[ 44] id=  389  ' are'
[ 45] id= 1987  ' 24'
[ 46] id=  532  ' -'
[ 47] id=  718  ' 6'
[ 48] id=  796  ' ='
[ 49] id=  678  ' 19'
[ 50] id= 5006  ' cars'
[ 51] id= 1364  ' left'
[ 52] id=50258  *** </think> ***
[ 53] id=50259  *** <answer> ***
[ 54] id= 1129  '19'
[ 55] id=50260  *** </answer> ***

Decoded: <think>There are 3 * 2 = 6 cars\nThere are 6 * 4 = 24 cars\nThere are 24 - 6 = 19 cars left</think><answer>19</answer>

Stats: 32 generated tokens total, 27 thinking tokens, 1 answer token.

Sampled Completions (3 samples each, temperature=0.7)

SFT Checkpoint

Prompt: "What is 12 + 15?"

[1] <think>12/2 = 14
So we have 15 - 12 = 4</think><answer>4</answer>

[2] <think>If there were 12 + 15 in total there were 12 + 15 = 21
The number of friends was 21-12 = 5
The number of friends is 5 x 12 = 80</think><answer>80</answer>

[3] <think>First find the total number of points made by the player: 12 * 15 = 240 points
Then subtract the number of points from the total to find the total number of points
from the player: 240 + 15 = 245 points</think><answer>245</answer>

Prompt: "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?"

[1] <think>There are 3 + 2 = 7 cars in the parking lot.
There are 7 - 2 = 3 cars in the parking lot.
There are 3 x 4 = 12 cars in the parking lot.</think><answer>12</answer>

[2] <think>First find the number of cars that are in the parking lot: 3 cars * 2 = 6 cars
Then divide the number of cars by the number of cars remaining to find the total number
of cars: 6 cars * 3 = 21 cars</think><answer>21</answer>

[3] <think>If there are 3 cars in the parking lot, there are 3*1=3 cars in the parking lot.
If there are 2 more cars arrive, there are 2+3=5 cars in the parking lot.</think><answer>5</answer>

GRPO Checkpoint

Prompt: "What is 7 * 8?"

[1] <think>7 * 8 = 72
7 = 72</think><answer>72</answer>

[2] <think>7 = 72
There are 72-7 = 77
Therefore, 7 + 78 = 82
Therefore, 82 - 7 = 81</think><answer>81</answer>

[3] <think>Seven is 8 x 7 = 108
So it has 108 - 108 = 108
So it will be 108 - 7 = 108</think><answer>108</answer>

Prompt: "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?"

[1] <think>At first glance, the total number of cars in the parking lot is
3 cars + 2 cars = 5 cars</think><answer>5</answer>

[2] <think>If there are 3 cars in the parking lot and 2 more cars arrive,
there are 3*2 = 6 cars in the parking lot.
There are 6-2 = 5 cars in the parking lot</think><answer>5</answer>

[3] <think>First find the number of cars in the parking lot: 3 cars * 2 = 6 cars
Then subtract the number of cars from the total number: 6 cars - 2 cars = 5 cars</think><answer>5</answer>

Prompt: "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"

[1] <think>Natalia sold 48 / 2 = 26 clips.
Thus, Natalia sold 26/2 = 12 clips altogether.</think><answer>12</answer>

[2] <think>Natalia sold 48/2 = 22 clips.
Natalia sold 22/2 = 13 clips in May.</think><answer>13</answer>

[3] <think>Natalia sold 48 / 2 = 18 clips
Natalia sold 18 / 1 = 9 clips</think><answer>9</answer>

Key Observations from Examples

  1. Format is perfectly learned: Every sample from both checkpoints uses the <think>...</think><answer>...</answer> structure correctly. The 4 special tokens always appear in the right order.

  2. GRPO produces more consistent answers: On the "3 cars + 2 cars" problem, GRPO gets 5 on all 3 samples (correct!), while SFT gives 12, 21, and 5 (1/3 correct).

  3. GRPO is more concise: Average CoT length dropped from 57 to 37 tokens after RL training. The model learned shorter reasoning traces are rewarded.

  4. Math accuracy is low (expected for 124M): The model hallucinates calculations (e.g., "7 * 8 = 72" is close but wrong in 2 of 3 GRPO samples). A 124M parameter model was never designed for arithmetic - the point is that the full reasoning pipeline works end-to-end.

  5. GRPO sample 1 for "7 * 8" shows near-correct reasoning: It computed "7 * 8 = 72" then answered 72. While wrong (56 is correct), 72 = 8 * 9, showing the model is in the right ballpark. More training iterations and a larger base model would improve this.


6. Non-Math Prompts

The model was fine-tuned exclusively on GSM8K math word problems. When given non-math prompts, it reveals how deeply the math-problem prior is baked in: it tries to turn everything into arithmetic.

General Knowledge

Prompt: "What is the capital of France?"

GRPO: <think>The capital of France is 50000€.
      So we can write ÷ 50000/10000 = 800000€
      X is the capital of France
      Since X is the capital of France we get 800000/2 = 400000€</think><answer>400000</answer>

SFT:  <think>First find the total capital of France: $1,000
      Then add the total capital of France to find the capital of other countries:
      $1,000-$000 = $1000
      Finally, multiply the initial number of capital by the number of countries
      to find the total capital: $1000/1000 = $1,000</think><answer>1,000</answer>

The model interprets "capital" as a monetary quantity and starts calculating.

Prompt: "Who wrote Romeo and Juliet?"

GRPO: <think>Romeo is originally from a family of seven people.
      Since we know that Romeo's family is seven people, we can calculate the number
      of people in the family: 7 people * 7 people/person = 150 people</think><answer>150</answer>

SFT:  <think>The first step is to find the total number of words in the book: 100 words
      The second step is to find the total number of words in Romeo and Juliet:
      100+100 = 150 words</think><answer>195</answer>

Instead of answering "Shakespeare", it counts words or family members.

Science

Prompt: "Why is the sky blue?"

GRPO: <think>The sky is blue because it is blue because of the equation:
      2 + 1 = 3
      3 = 4</think><answer>4</answer>

SFT:  <think>There are 5 stars in the sky, and there are 5 stars each of each color.
      So the sky blue is 5*5=20 stars.</think><answer>20</answer>

Prompt: "How does photosynthesis work?"

GRPO: <think>There are 4 types of photosynthesis: photosynthesis with red light,
      photosynthesis with green light, photosynthesis with blue light,
      photosynthesis with yellow light.
      The ratio of photosynthesis to red light is 1/2*50 = 25</think>

SFT:  <think>The photosynthesis process takes place in two ways: first, by taking in air,
      converting air into gas, then by converting it into liquid.
      Each step in the process requires 2*2 = 4 steps in total.</think><answer>4</answer>

SFT's attempt is closer to real knowledge (air -> gas -> liquid is vaguely related to gas exchange) but still forces a numeric answer.

Logic / Common Sense

Prompt: "If all dogs are animals, and Buddy is a dog, what is Buddy?"

GRPO: <think>Buddy is a dog because: 1/2
      Buddy is a dog because: 1/2
      Buddy is a dog because: 1/2</think><answer>1</answer>

SFT:  <think>Buddy is a dog because 2 x 2 = 4.
      Buddy is a dog because 4 x 2 = 6.
      Buddy is a dog because 6 x 3 = 11.</think><answer>11</answer>

The model recognizes "Buddy is a dog" as relevant but can't express it as a non-numeric answer.

Prompt: "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?" (correct answer: $0.05)

GRPO: <think>The bat and a ball cost 1.10 * $1.10 = $1.40
      The ball is $1.40 - $1.10 = $9</think><answer>12</answer>

SFT:  <think>The bat costs 1.10*1.10=$1.20 in total
      The ball costs 1.20*1.4=$1.20 in total</think><answer>1.20</answer>

This is actually a math problem the model should be able to handle, but neither gets it right. The CRT (Cognitive Reflection Test) bat-and-ball problem is notoriously tricky even for humans.

Creative / Open-Ended

Prompt: "Write a haiku about the ocean."

GRPO: </think>First find the total amount of the ocean: 1 gallon
      Then subtract the amount of water which is water from the total: 150 gallons
      Then multiply by 100% to find the overall amount: 150/100% = 4 gallons
      Then divide that number by the number of gallons: 4/100% = 20 gallons</think><answer>20</answer>

SFT:  </think>Since we only need to write a haiku once, we can have 2 haiku that use
      the same method of writing: one that uses the same answer to find the answer,
      and one that uses another answer to find the answer.</think><answer>1</answer>

No poetry was produced. The ocean becomes "1 gallon" and haiku becomes a counting exercise.

Prompt: "What would happen if the moon disappeared?"

GRPO: <think>If the moon is not present, then it means that the moon has gone away.
      There are 5 days in a week and the moon is not present.
      Therefore, the moon has disappeared.</think><answer>5</answer>

The model grasps the semantic content ("the moon has gone away") but forces a numeric conclusion.

Continuation (No Question Mark)

Prompt: "The history of the internet began in"

GRPO: 1960, with the first internet.
      Let x be the number of years of the internet.
      If x is the number of years before the internet, then
      (x + 1) = 1960
      since we know that the internet started in 1960, then
      the internet was in 1960/x = 1980</think><answer>1980</answer>

SFT:  the first decade of the twentieth century, when the internet was first created.
      It started out as a single internet service that was used by a small number of people
      to transport their messages, thus making it the only internet service that was not
      used by the masses.
      Today, the internet is used by over 200 million people around the world...

The SFT checkpoint produces more coherent continuation text here, likely because the base GPT-2 knowledge bleeds through more before RL compressed its outputs.

Prompt: "Once upon a time, there was a"

GRPO: large change in the way in which people viewed its history.
      People started to recognize the term "pastor sickness" because of the
      large difference in its meaning.</think>

SFT:  large number of people.
      Thus, when she saw the number of people she was supposed to have met,
      she was told to pay them her salary.
      Therefore, at the end, there were 60 days.</think><answer>130</answer>

Non-Math Conclusions

  1. Total domain collapse to arithmetic: The model converts every prompt into a math word problem, regardless of the actual question. "Capital of France" becomes a currency calculation. "Who wrote Romeo and Juliet" becomes a word count. "Write a haiku" becomes gallon measurements.

  2. Format still works: Even on nonsensical prompts, the <think>...</think><answer>...</answer> structure is maintained. The special tokens are robust.

  3. Always produces a number: The answer is always numeric because the reward function (accuracy_reward) only rewards extracting and matching numbers. The model learned that the answer must be a number to get any reward.

  4. Some semantic awareness survives: Despite the arithmetic prior, fragments of real knowledge leak through - "the internet started in 1960", "Buddy is a dog", "the moon has gone away". The base GPT-2 knowledge is there but suppressed by the math fine-tuning.

  5. SFT preserves more base knowledge than GRPO: The SFT checkpoint tends to produce more coherent continuations on non-math prompts (e.g., "the internet" prompt). GRPO further compressed the model toward numeric outputs, which is exactly what RL optimized for.

  6. Expected behavior: This is a well-known phenomenon in fine-tuned models. Training exclusively on one domain causes "catastrophic forgetting" of other capabilities. A production system would need multi-task training or a broader reward function to maintain general knowledge.