A personal project by juli. It explores how language models (LLMs) solve reasoning problems by analyzing which parts of the model actually do the work.
Instead of treating models as black boxes, we systematically break the analysis into steps:
- Test baseline accuracy on math and logic problems
- Zero out attention heads one at a time and measure performance drops
- Identify which heads matter most
- Analyze what kinds of errors the model makes
This tells us which parts of the model are responsible for reasoning and where failures occur.
Install dependencies:
pip install -r requirements.txt
Run fast test (2-3 minutes):
python -m src.experiment_lite --model gpt2 --dataset arithmetic --output results/test
View results:
python inspect_results.py results/test/results.json
Load any HuggingFace model (gpt2, mistral, llama, etc.).
Generate responses to questions and measure accuracy (a sketch follows the example questions below).
Example questions:
- "What is 5 + 3?" (expected: 8)
- "Alice has 10 dollars, spends 4. How much left?" (expected: 6)
- "All dogs are animals. Max is a dog. Is Max an animal?" (expected: Yes)
For each attention head in the model (an ablation sketch follows this list):
- Zero it out
- Test accuracy again
- Measure how much performance dropped
- Rank heads by importance
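A minimal sketch of zero-ablating a single GPT-2 head. It hooks the input of attn.c_proj (where the per-head outputs are still concatenated side by side) and zeroes that head's slice; this hook point and head indexing are assumptions about the transformers GPT-2 implementation, not necessarily how src.experiment does it:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
n_heads = model.config.n_head
head_dim = model.config.n_embd // n_heads

def ablate_head(layer, head):
    # Zero this head's slice of the concatenated head outputs before the output projection.
    def pre_hook(module, args):
        hidden = args[0].clone()
        hidden[..., head * head_dim:(head + 1) * head_dim] = 0.0
        return (hidden,)
    return model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(pre_hook)

# Example: ablate head 2 in layer 5, re-run the accuracy loop above, then restore the model.
handle = ablate_head(layer=5, head=2)
# ... measure accuracy with the head removed and record the drop ...
handle.remove()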
Categorize errors made by the model (a heuristic sketch follows this list):
- Hallucination: generates plausible but wrong text
- Wrong operation: uses wrong math operation
- Off-by-one: answer is close but not exact
- Incomplete: answer is cut off
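A rough sketch of how a wrong answer might be bucketed into these categories; the heuristics (first number in the output, off-by-one meaning a difference of exactly one, etc.) are illustrative assumptions rather than the project's exact rules:

import re

def categorize_error(answer: str, expected: str) -> str:
    if not answer.strip():
        return "incomplete"        # nothing usable generated / cut off
    numbers = re.findall(r"-?\d+", answer)
    if numbers and expected.lstrip("-").isdigit():
        diff = abs(int(numbers[0]) - int(expected))
        if diff == 0:
            return "correct"
        if diff == 1:
            return "off-by-one"    # close but not exact
        # Far-off number: often a wrong operation (a precise check would need the operands).
        return "wrong operation"
    return "hallucination"         # fluent text with no usable numeric answer

print(categorize_error("The answer is 9.", "8"))  # -> off-by-one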
Create plots (see the matplotlib sketch after this list) showing:
- Which heads matter most (heatmap)
- Which layers are important (bar chart)
- What errors happen most (pie chart)
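A minimal matplotlib sketch of the three plots; the array shapes and error counts below are placeholders standing in for the outputs of the ablation and error-analysis steps:

import numpy as np
import matplotlib.pyplot as plt

n_layers, n_heads = 12, 12
head_importance = np.random.rand(n_layers, n_heads)   # placeholder: accuracy drop per head
error_counts = {"hallucination": 5, "wrong operation": 3, "off-by-one": 2, "incomplete": 1}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Heatmap: which heads matter most.
im = axes[0].imshow(head_importance, aspect="auto", cmap="viridis")
axes[0].set_xlabel("head"); axes[0].set_ylabel("layer"); axes[0].set_title("Head importance")
fig.colorbar(im, ax=axes[0])

# Bar chart: which layers are important (importance summed over heads).
axes[1].bar(range(n_layers), head_importance.sum(axis=1))
axes[1].set_xlabel("layer"); axes[1].set_title("Layer importance")

# Pie chart: which errors happen most.
axes[2].pie(list(error_counts.values()), labels=list(error_counts.keys()), autopct="%1.0f%%")
axes[2].set_title("Error types")

plt.tight_layout()
plt.savefig("plots.png")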
Arithmetic (8 questions)
- Simple math: 5+3, 10-4, 6*2, etc.
Multi-step (5 questions)
- Word problems requiring multiple operations
Logic (4 questions)
- Logical inference: "All X are Y, Z is X, is Z Y?"
GSM8K (5 questions)
- Real grade school math problems
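For reference, one plausible in-code representation of these datasets; the field names and items here are illustrative assumptions (the real question sets ship with the project):

DATASETS = {
    "arithmetic": [
        {"prompt": "What is 5 + 3?", "expected": "8"},
        {"prompt": "What is 10 - 4?", "expected": "6"},
        {"prompt": "What is 6 * 2?", "expected": "12"},
    ],
    "logic": [
        {"prompt": "All dogs are animals. Max is a dog. Is Max an animal?", "expected": "Yes"},
    ],
}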
Run lightweight version (fast, 2-3 minutes):
python -m src.experiment_lite --model gpt2 --dataset arithmetic --output results/test
Run full version (includes head ablation, slower):
python -m src.experiment --model gpt2 --dataset arithmetic --output results/full
Test different dataset:
python -m src.experiment_lite --model gpt2 --dataset logic --output results/logic
Test different model:
python -m src.experiment_lite --model distilgpt2 --dataset arithmetic --output results/distil
View results:
python inspect_results.py results/test/results.json