Hardware: Quadro RTX 8000 (46GB VRAM), CUDA 12.9, bfloat16 Date: March 2026 Total GPU time: ~3.5 hours
All 4 GPU-dependent capabilities validated successfully on real hardware.
| Item | Status | Duration |
|---|---|---|
| SFT Training (2000 iters) | PASSED | ~35 min |
| GRPO RL Training (500 iters) | PASSED | ~2 hr |
| GSM8K Evaluation (200 problems, k=16) | PASSED | ~50 min |
| MEMIT Batch Evaluation (50 edits) | PASSED | ~2 min |
Supervised fine-tuning on GSM8K chain-of-thought data with completion-only loss masking.
Config: 2000 iterations, batch_size=4, grad_accum=8, lr=3e-4, cosine schedule, GPT-2 124M base
| Step | Train Loss | Val Loss |
|---|---|---|
| 0 | 3.7935 | 3.5871 |
| 200 | 2.1233 | 2.0798 |
| 400 | 1.6287 | 1.6516 |
| 600 | 1.3880 | 1.4401 |
| 800 | 1.3583 | 1.4566 |
| 1000 | 1.2812 | 1.4125 |
| 1200 | 1.2700 | 1.3986 |
| 1400 | 1.2405 | 1.3786 |
| 1600 | 1.1899 | 1.2973 |
| 1800 | 1.2060 | 1.3358 |
| 2000 | 1.2093 | 1.3459 |
Best validation loss 1.2973 at step 1600. Train loss dropped 68% (3.79 to 1.21). Mild overfitting after step 1600 but overall healthy convergence.
Checkpoint: out-sft/ckpt.pt
Group Relative Policy Optimization with DAPO stability tricks, starting from the SFT checkpoint.
Config: 500 iterations, group_size=8, lr=5e-6, KL penalty, DAPO tricks active throughout
| Trick | Config | Observation |
|---|---|---|
| Clip-Higher (asymmetric clipping) | eps_low=0.2, eps_high=0.28 | Allows larger upward policy updates |
| Entropy bonus (decaying) | 0.01 -> 0.001 | Entropy stable 1.2 -> 2.0, no collapse |
| Dynamic sampling | Skip zero-variance groups | 25 of 500 iterations skipped |
| Token-level loss normalization | Active | Prevents length bias |
| Iteration | Avg Reward |
|---|---|
| 0 | 0.600 |
| 50 | 0.587 |
| 100 | 0.592 |
| 150 | 0.592 |
| 200 | 0.574 |
| 250 | 0.616 |
| 300 | 0.577 |
| 350 | 0.592 |
| 400 | 0.584 |
| 450 | 0.592 |
| 499 | 0.541 |
Peak reward 0.686 at step 79. Peak accuracy 9.4%. Average clip fraction 0.177.
Checkpoint: out-grpo/ckpt.pt
200 GSM8K test problems, 16 completions per problem (3200 generations per checkpoint), temperature=0.7, top_k=50. Bootstrap 95% confidence intervals.
| Metric | SFT Checkpoint | GRPO Checkpoint | Delta |
|---|---|---|---|
| pass@1 | 1.50% [0.00, 3.50] | 2.00% [0.50, 4.00] | +0.5% |
| pass@16 | 22.00% [16.00, 28.00] | 22.00% [16.00, 27.50] | 0% |
| majority@16 | 1.50% [0.00, 3.50] | 3.00% [1.00, 5.50] | +1.5% |
| format_compliance | 98.91% [97.94, 99.56] | 99.38% [98.75, 99.78] | +0.5% |
| mean_cot_length | 56.77 tokens | 37.12 tokens | -35% |
| Metric | SFT Checkpoint | GRPO Checkpoint |
|---|---|---|
| pass@1 | 0.00% | 2.00% |
| pass@16 | 4.00% | 6.00% |
| format_compliance | 90.75% | 83.75% |
| mean_cot_length | 24.78 tokens | 17.14 tokens |
- Format compliance is excellent (>98.9%) for both checkpoints, confirming the SFT pipeline successfully taught the
<think>...</think><answer>...</answer>format - GRPO improves over SFT: pass@1 1.5% -> 2.0%, majority@16 1.5% -> 3.0%
- GRPO produces shorter CoT (37 vs 57 tokens) - RL learned that concise reasoning is more effective
- 22% pass@16 means for 1 in 5 problems, at least one of 16 sampled completions is correct
- Accuracy is low overall - expected for a 124M parameter model on grade-school math
50 factual edits applied simultaneously using MEMIT (Mass-Editing Memory In a Transformer) on GPT-2 124M.
Config: layers 3-8, 50 edit requests, CUDA
| Metric | Value |
|---|---|
| Success rate | 62% (31/50) |
| Mean efficacy (p_target) | 0.2550 |
| Pre-edit mean p_target | 0.0094 |
| Duration | 112.6 seconds |
| Prompt | Old Answer | New Answer | Efficacy | Rank |
|---|---|---|---|---|
| The capital of Egypt is | Cairo | Alexandria | 0.895 | 1 |
| The lightest element in the periodic table is | hydrogen | helium | 0.703 | 1 |
| Beethoven was famous for playing the | piano | violin | 0.701 | 1 |
| The Eiffel Tower is located in | Paris | London | 0.669 | 1 |
| The light bulb was invented by | Thomas | Nikola | 0.631 | 1 |
| The longest river in the world is the | Nile | Amazon | 0.614 | 1 |
| The currency of Germany is the | euro | pound | 0.495 | 1 |
| The Earth orbits around the | Sun | Moon | 0.451 | 1 |
| The largest desert in the world is the | Sahara | Arctic | 0.421 | 1 |
| The capital of Russia is | Moscow | St (Petersburg) | 0.426 | 1 |
| Albert Einstein was born in | Germany | London | 0.396 | 1 |
| The capital of Canada is | Ottawa | Toronto | 0.382 | 1 |
| The fastest land animal is the | cheetah | lion | 0.384 | 1 |
| The most popular sport in the world is | soccer | basketball | 0.380 | 1 |
| Shakespeare was born in | Stratford | London | 0.375 | 1 |
| The official language of Brazil is | Portuguese | Spanish | 0.372 | 1 |
| The Sun is a type of | star | planet | 0.342 | 1 |
| Prompt | Old | New | Efficacy | Rank | Why |
|---|---|---|---|---|---|
| Facebook was created by | Mark | Jack | 0.001 | 121 | Relational knowledge hard to edit |
| Amazon was founded by | Jeff | Elon | 0.001 | 141 | Founder associations deeply embedded |
| Twitter was created by | Jack | Mark | 0.002 | 34 | Same pattern - creator associations |
| Apple was founded by | Steve | Bill | 0.006 | 11 | Founder knowledge resistant |
| The capital of Germany is | Berlin | Paris | 0.026 | 6 | Strong prior on Berlin |
- Geography and simple facts edit well: capital cities, scientific facts, superlatives (efficacy 0.2-0.9)
- Company founder/creator associations resist editing: all 6 "founded by" / "created by" prompts failed (efficacy <0.02)
- Close-but-not-rank-1 edits: some edits moved the target to rank 2-5 but couldn't overcome the original fact (Great Wall -> India at rank 2, Mount Everest -> Africa at rank 2)
- Pre-edit to post-edit improvement: mean p_target increased 27x (0.0094 -> 0.2550)
nanogpt_edit/edit_core.py: Added case-insensitive fallback in find_subject_tokens() to handle 27 of 50 cases where the subject string didn't exactly match the prompt capitalization.
The reasoning model generates structured output using 4 special tokens in GPT-2's padding range:
Token 50257: <think> (start chain-of-thought)
Token 50258: </think> (end chain-of-thought)
Token 50259: <answer> (start answer)
Token 50260: </answer> (end answer, triggers stop)
Example: "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?"
--- PROMPT TOKENS (24 tokens) ---
[ 0] id= 1532 'If'
[ 1] id= 612 ' there'
[ 2] id= 389 ' are'
[ 3] id= 513 ' 3'
[ 4] id= 5006 ' cars'
[ 5] id= 287 ' in'
[ 6] id= 262 ' the'
[ 7] id= 7647 ' parking'
[ 8] id= 1256 ' lot'
[ 9] id= 290 ' and'
[ 10] id= 362 ' 2'
[ 11] id= 517 ' more'
[ 12] id= 5006 ' cars'
[ 13] id= 9240 ' arrive'
[ 14] id= 11 ','
[ 15] id= 703 ' how'
[ 16] id= 867 ' many'
[ 17] id= 5006 ' cars'
[ 18] id= 389 ' are'
[ 19] id= 287 ' in'
[ 20] id= 262 ' the'
[ 21] id= 7647 ' parking'
[ 22] id= 1256 ' lot'
[ 23] id= 30 '?'
--- GENERATED TOKENS (32 tokens) ---
[ 24] id=50257 *** <think> ***
[ 25] id= 1858 'There'
[ 26] id= 389 ' are'
[ 27] id= 513 ' 3'
[ 28] id= 1635 ' *'
[ 29] id= 362 ' 2'
[ 30] id= 796 ' ='
[ 31] id= 718 ' 6'
[ 32] id= 5006 ' cars'
[ 33] id= 198 '\n'
[ 34] id= 1858 'There'
[ 35] id= 389 ' are'
[ 36] id= 718 ' 6'
[ 37] id= 1635 ' *'
[ 38] id= 604 ' 4'
[ 39] id= 796 ' ='
[ 40] id= 1987 ' 24'
[ 41] id= 5006 ' cars'
[ 42] id= 198 '\n'
[ 43] id= 1858 'There'
[ 44] id= 389 ' are'
[ 45] id= 1987 ' 24'
[ 46] id= 532 ' -'
[ 47] id= 718 ' 6'
[ 48] id= 796 ' ='
[ 49] id= 678 ' 19'
[ 50] id= 5006 ' cars'
[ 51] id= 1364 ' left'
[ 52] id=50258 *** </think> ***
[ 53] id=50259 *** <answer> ***
[ 54] id= 1129 '19'
[ 55] id=50260 *** </answer> ***
Decoded: <think>There are 3 * 2 = 6 cars\nThere are 6 * 4 = 24 cars\nThere are 24 - 6 = 19 cars left</think><answer>19</answer>
Stats: 32 generated tokens total, 27 thinking tokens, 1 answer token.
Prompt: "What is 12 + 15?"
[1] <think>12/2 = 14
So we have 15 - 12 = 4</think><answer>4</answer>
[2] <think>If there were 12 + 15 in total there were 12 + 15 = 21
The number of friends was 21-12 = 5
The number of friends is 5 x 12 = 80</think><answer>80</answer>
[3] <think>First find the total number of points made by the player: 12 * 15 = 240 points
Then subtract the number of points from the total to find the total number of points
from the player: 240 + 15 = 245 points</think><answer>245</answer>
Prompt: "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?"
[1] <think>There are 3 + 2 = 7 cars in the parking lot.
There are 7 - 2 = 3 cars in the parking lot.
There are 3 x 4 = 12 cars in the parking lot.</think><answer>12</answer>
[2] <think>First find the number of cars that are in the parking lot: 3 cars * 2 = 6 cars
Then divide the number of cars by the number of cars remaining to find the total number
of cars: 6 cars * 3 = 21 cars</think><answer>21</answer>
[3] <think>If there are 3 cars in the parking lot, there are 3*1=3 cars in the parking lot.
If there are 2 more cars arrive, there are 2+3=5 cars in the parking lot.</think><answer>5</answer>
Prompt: "What is 7 * 8?"
[1] <think>7 * 8 = 72
7 = 72</think><answer>72</answer>
[2] <think>7 = 72
There are 72-7 = 77
Therefore, 7 + 78 = 82
Therefore, 82 - 7 = 81</think><answer>81</answer>
[3] <think>Seven is 8 x 7 = 108
So it has 108 - 108 = 108
So it will be 108 - 7 = 108</think><answer>108</answer>
Prompt: "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?"
[1] <think>At first glance, the total number of cars in the parking lot is
3 cars + 2 cars = 5 cars</think><answer>5</answer>
[2] <think>If there are 3 cars in the parking lot and 2 more cars arrive,
there are 3*2 = 6 cars in the parking lot.
There are 6-2 = 5 cars in the parking lot</think><answer>5</answer>
[3] <think>First find the number of cars in the parking lot: 3 cars * 2 = 6 cars
Then subtract the number of cars from the total number: 6 cars - 2 cars = 5 cars</think><answer>5</answer>
Prompt: "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
[1] <think>Natalia sold 48 / 2 = 26 clips.
Thus, Natalia sold 26/2 = 12 clips altogether.</think><answer>12</answer>
[2] <think>Natalia sold 48/2 = 22 clips.
Natalia sold 22/2 = 13 clips in May.</think><answer>13</answer>
[3] <think>Natalia sold 48 / 2 = 18 clips
Natalia sold 18 / 1 = 9 clips</think><answer>9</answer>
-
Format is perfectly learned: Every sample from both checkpoints uses the
<think>...</think><answer>...</answer>structure correctly. The 4 special tokens always appear in the right order. -
GRPO produces more consistent answers: On the "3 cars + 2 cars" problem, GRPO gets 5 on all 3 samples (correct!), while SFT gives 12, 21, and 5 (1/3 correct).
-
GRPO is more concise: Average CoT length dropped from 57 to 37 tokens after RL training. The model learned shorter reasoning traces are rewarded.
-
Math accuracy is low (expected for 124M): The model hallucinates calculations (e.g., "7 * 8 = 72" is close but wrong in 2 of 3 GRPO samples). A 124M parameter model was never designed for arithmetic - the point is that the full reasoning pipeline works end-to-end.
-
GRPO sample 1 for "7 * 8" shows near-correct reasoning: It computed "7 * 8 = 72" then answered 72. While wrong (56 is correct), 72 = 8 * 9, showing the model is in the right ballpark. More training iterations and a larger base model would improve this.
The model was fine-tuned exclusively on GSM8K math word problems. When given non-math prompts, it reveals how deeply the math-problem prior is baked in: it tries to turn everything into arithmetic.
Prompt: "What is the capital of France?"
GRPO: <think>The capital of France is 50000€.
So we can write ÷ 50000/10000 = 800000€
X is the capital of France
Since X is the capital of France we get 800000/2 = 400000€</think><answer>400000</answer>
SFT: <think>First find the total capital of France: $1,000
Then add the total capital of France to find the capital of other countries:
$1,000-$000 = $1000
Finally, multiply the initial number of capital by the number of countries
to find the total capital: $1000/1000 = $1,000</think><answer>1,000</answer>
The model interprets "capital" as a monetary quantity and starts calculating.
Prompt: "Who wrote Romeo and Juliet?"
GRPO: <think>Romeo is originally from a family of seven people.
Since we know that Romeo's family is seven people, we can calculate the number
of people in the family: 7 people * 7 people/person = 150 people</think><answer>150</answer>
SFT: <think>The first step is to find the total number of words in the book: 100 words
The second step is to find the total number of words in Romeo and Juliet:
100+100 = 150 words</think><answer>195</answer>
Instead of answering "Shakespeare", it counts words or family members.
Prompt: "Why is the sky blue?"
GRPO: <think>The sky is blue because it is blue because of the equation:
2 + 1 = 3
3 = 4</think><answer>4</answer>
SFT: <think>There are 5 stars in the sky, and there are 5 stars each of each color.
So the sky blue is 5*5=20 stars.</think><answer>20</answer>
Prompt: "How does photosynthesis work?"
GRPO: <think>There are 4 types of photosynthesis: photosynthesis with red light,
photosynthesis with green light, photosynthesis with blue light,
photosynthesis with yellow light.
The ratio of photosynthesis to red light is 1/2*50 = 25</think>
SFT: <think>The photosynthesis process takes place in two ways: first, by taking in air,
converting air into gas, then by converting it into liquid.
Each step in the process requires 2*2 = 4 steps in total.</think><answer>4</answer>
SFT's attempt is closer to real knowledge (air -> gas -> liquid is vaguely related to gas exchange) but still forces a numeric answer.
Prompt: "If all dogs are animals, and Buddy is a dog, what is Buddy?"
GRPO: <think>Buddy is a dog because: 1/2
Buddy is a dog because: 1/2
Buddy is a dog because: 1/2</think><answer>1</answer>
SFT: <think>Buddy is a dog because 2 x 2 = 4.
Buddy is a dog because 4 x 2 = 6.
Buddy is a dog because 6 x 3 = 11.</think><answer>11</answer>
The model recognizes "Buddy is a dog" as relevant but can't express it as a non-numeric answer.
Prompt: "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?" (correct answer: $0.05)
GRPO: <think>The bat and a ball cost 1.10 * $1.10 = $1.40
The ball is $1.40 - $1.10 = $9</think><answer>12</answer>
SFT: <think>The bat costs 1.10*1.10=$1.20 in total
The ball costs 1.20*1.4=$1.20 in total</think><answer>1.20</answer>
This is actually a math problem the model should be able to handle, but neither gets it right. The CRT (Cognitive Reflection Test) bat-and-ball problem is notoriously tricky even for humans.
Prompt: "Write a haiku about the ocean."
GRPO: </think>First find the total amount of the ocean: 1 gallon
Then subtract the amount of water which is water from the total: 150 gallons
Then multiply by 100% to find the overall amount: 150/100% = 4 gallons
Then divide that number by the number of gallons: 4/100% = 20 gallons</think><answer>20</answer>
SFT: </think>Since we only need to write a haiku once, we can have 2 haiku that use
the same method of writing: one that uses the same answer to find the answer,
and one that uses another answer to find the answer.</think><answer>1</answer>
No poetry was produced. The ocean becomes "1 gallon" and haiku becomes a counting exercise.
Prompt: "What would happen if the moon disappeared?"
GRPO: <think>If the moon is not present, then it means that the moon has gone away.
There are 5 days in a week and the moon is not present.
Therefore, the moon has disappeared.</think><answer>5</answer>
The model grasps the semantic content ("the moon has gone away") but forces a numeric conclusion.
Prompt: "The history of the internet began in"
GRPO: 1960, with the first internet.
Let x be the number of years of the internet.
If x is the number of years before the internet, then
(x + 1) = 1960
since we know that the internet started in 1960, then
the internet was in 1960/x = 1980</think><answer>1980</answer>
SFT: the first decade of the twentieth century, when the internet was first created.
It started out as a single internet service that was used by a small number of people
to transport their messages, thus making it the only internet service that was not
used by the masses.
Today, the internet is used by over 200 million people around the world...
The SFT checkpoint produces more coherent continuation text here, likely because the base GPT-2 knowledge bleeds through more before RL compressed its outputs.
Prompt: "Once upon a time, there was a"
GRPO: large change in the way in which people viewed its history.
People started to recognize the term "pastor sickness" because of the
large difference in its meaning.</think>
SFT: large number of people.
Thus, when she saw the number of people she was supposed to have met,
she was told to pay them her salary.
Therefore, at the end, there were 60 days.</think><answer>130</answer>
-
Total domain collapse to arithmetic: The model converts every prompt into a math word problem, regardless of the actual question. "Capital of France" becomes a currency calculation. "Who wrote Romeo and Juliet" becomes a word count. "Write a haiku" becomes gallon measurements.
-
Format still works: Even on nonsensical prompts, the
<think>...</think><answer>...</answer>structure is maintained. The special tokens are robust. -
Always produces a number: The answer is always numeric because the reward function (accuracy_reward) only rewards extracting and matching numbers. The model learned that the answer must be a number to get any reward.
-
Some semantic awareness survives: Despite the arithmetic prior, fragments of real knowledge leak through - "the internet started in 1960", "Buddy is a dog", "the moon has gone away". The base GPT-2 knowledge is there but suppressed by the math fine-tuning.
-
SFT preserves more base knowledge than GRPO: The SFT checkpoint tends to produce more coherent continuations on non-math prompts (e.g., "the internet" prompt). GRPO further compressed the model toward numeric outputs, which is exactly what RL optimized for.
-
Expected behavior: This is a well-known phenomenon in fine-tuned models. Training exclusively on one domain causes "catastrophic forgetting" of other capabilities. A production system would need multi-task training or a broader reward function to maintain general knowledge.