LLM NLP Interpretability Project

A personal project by juli that explores how large language models (LLMs) solve reasoning problems by analyzing which parts of the model actually do the work.

What This Does

Instead of treating models as black boxes, we systematically break them down into parts:

Test baseline accuracy on math and logic problems
Zero out attention heads one at a time and measure performance drops
Identify which heads matter most
Analyze what kinds of errors the model makes

This shows which attention heads the model depends on for these reasoning tasks and what kinds of failures it produces.

Start

Install dependencies:
```
pip install -r requirements.txt
```

Run a fast test (2-3 minutes):

python -m src.experiment_lite --model gpt2 --dataset arithmetic --output results/test

View results:

python inspect_results.py results/test/results.json

How It Works

Step 1: Load Model

Load any Hugging Face causal language model by name (gpt2, mistral, llama, etc.).
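
A minimal sketch of this step, assuming the project relies on the Hugging Face `transformers` library. The helper name `load_model` is hypothetical and is not taken from the repository; the actual loader in `src` may differ.

```python
def load_model(name: str = "gpt2"):
    """Load a causal language model and its tokenizer by Hugging Face model name.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    model.eval()  # inference only: disables dropout

    # GPT-2 style tokenizers ship without a padding token; reuse end-of-sequence.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer
```

Calling `load_model("gpt2")` downloads the weights on first use, so it needs network access and some disk space.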

Step 2: Test Baseline

Generate responses to questions and measure accuracy.

Example questions:

"What is 5 + 3?" (expected: 8)
"Alice has 10 dollars, spends 4. How much left?" (expected: 6)
"All dogs are animals. Max is a dog. Is Max an animal?" (expected: Yes)

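Baseline scoring for the example questions above could look roughly like the sketch below. It assumes a hypothetical `generate` callable that returns only the newly generated text for a prompt, and it uses a simple answer-extraction heuristic (first integer, or first yes/no) that is an assumption of this sketch rather than the project's actual scoring rule.

```python
import re
from typing import Callable

# The three example questions listed above, with their expected answers.
EXAMPLES = [
    ("What is 5 + 3?", "8"),
    ("Alice has 10 dollars, spends 4. How much left?", "6"),
    ("All dogs are animals. Max is a dog. Is Max an animal?", "Yes"),
]


def extract_answer(text: str, expected: str) -> str:
    """Pull a comparable answer out of free-form model output (assumed heuristic)."""
    if expected.lstrip("-").isdigit():
        # Numeric question: take the first integer in the output.
        match = re.search(r"-?\d+", text)
        return match.group(0) if match else ""
    # Yes/no question: take the first "yes" or "no" that appears as a whole word.
    match = re.search(r"\b(yes|no)\b", text, flags=re.IGNORECASE)
    return match.group(1).capitalize() if match else ""


def baseline_accuracy(generate: Callable[[str], str], examples=EXAMPLES) -> float:
    """Fraction of examples whose extracted answer matches the expected one."""
    correct = sum(
        extract_answer(generate(question), expected) == expected
        for question, expected in examples
    )
    return correct / len(examples)
```

Taking the first integer breaks if the model echoes the question before answering, so a real run would need to strip the prompt from the output first.
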
Step 3: Run Ablations

For each attention head in the model:

Zero it out
Test accuracy again
Measure how much performance dropped
Rank heads by importance
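
The loop above can be sketched independently of any particular model. In this sketch `evaluate` is a hypothetical callable: given `None` it returns baseline accuracy, and given a `(layer, head)` pair it returns accuracy with that one head zeroed out. How the head is actually zeroed inside the model is left to that callable and is not shown here.

```python
from typing import Callable, Dict, List, Optional, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def rank_heads(
    evaluate: Callable[[Optional[Head]], float],
    n_layers: int,
    n_heads: int,
) -> List[Tuple[Head, float]]:
    """Ablate each head in turn and rank heads by the accuracy drop they cause."""
    baseline = evaluate(None)
    drops: Dict[Head, float] = {}
    for layer in range(n_layers):
        for head in range(n_heads):
            # Positive drop: the model got worse without this head.
            drops[(layer, head)] = baseline - evaluate((layer, head))
    # Largest drop first: these are the heads the model depends on most.
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


def toy_evaluate(ablated: Optional[Head]) -> float:
    """Stand-in for a real model: pretend only head (1, 0) matters."""
    return 0.25 if ablated == (1, 0) else 0.75


ranking = rank_heads(toy_evaluate, n_layers=2, n_heads=2)
```

With only a handful of questions per dataset, each accuracy value moves in coarse steps, so small differences between heads should be read with caution.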

Step 4: Analyze Failures

Categorize errors made by the model:

Hallucination: generates plausible but wrong text
Wrong operation: uses wrong math operation
Off-by-one: answer is close but not exact
Incomplete: answer is cut off
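
One way these categories might be assigned for a two-operand arithmetic question is sketched below. The function name `categorize_error`, the rules, and the order in which they are checked are all assumptions made for illustration; in particular, "close" is taken to mean a difference of exactly 1, and "hallucination" is used as the fallback label.

```python
import re
from typing import Optional

# Candidate operations the model might have applied by mistake.
OPERATIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def categorize_error(a: int, op: str, b: int, output: str) -> Optional[str]:
    """Label a wrong answer to `a op b`; return None when the answer is correct."""
    expected = OPERATIONS[op](a, b)
    match = re.search(r"-?\d+", output)
    if match is None:
        return "incomplete"  # output ended before any number appeared
    predicted = int(match.group(0))
    if predicted == expected:
        return None
    # Does the prediction equal the result of a different operation?
    if any(fn(a, b) == predicted for name, fn in OPERATIONS.items() if name != op):
        return "wrong_operation"
    if abs(predicted - expected) == 1:
        return "off_by_one"
    return "hallucination"  # a number that fits none of the patterns above
```

Because the wrong-operation check runs first, an answer that is both off by one and equal to another operation's result is labeled `wrong_operation`.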

Step 5: Generate Visualizations

Create plots showing:

Which heads matter most (heatmap)
Which layers are important (bar chart)
What errors happen most (pie chart)
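
A rough sketch of how the three plots might be produced with `matplotlib`. The function names, the input shapes (a dict of per-head accuracy drops and a list of error labels), and the output file name are assumptions for illustration, not the repository's actual plotting code.

```python
from collections import Counter
from typing import Dict, List, Tuple


def head_matrix(
    drops: Dict[Tuple[int, int], float], n_layers: int, n_heads: int
) -> List[List[float]]:
    """Arrange per-head accuracy drops as a layers x heads grid for the heatmap."""
    return [
        [drops.get((layer, head), 0.0) for head in range(n_heads)]
        for layer in range(n_layers)
    ]


def layer_importance(matrix: List[List[float]]) -> List[float]:
    """Sum the drops within each layer (one bar per layer)."""
    return [sum(row) for row in matrix]


def plot_all(matrix: List[List[float]], error_labels: List[str], path: str = "summary.png") -> None:
    """Draw the heatmap, bar chart, and pie chart side by side and save to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    counts = Counter(error_labels)
    fig, (ax_heat, ax_bar, ax_pie) = plt.subplots(1, 3, figsize=(15, 4))
    image = ax_heat.imshow(matrix, aspect="auto", cmap="viridis")
    ax_heat.set(title="Accuracy drop per head", xlabel="Head", ylabel="Layer")
    fig.colorbar(image, ax=ax_heat)
    ax_bar.bar(range(len(matrix)), layer_importance(matrix))
    ax_bar.set(title="Summed drop per layer", xlabel="Layer")
    ax_pie.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.0f%%")
    ax_pie.set_title("Error categories")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```

Summing drops per layer is one possible definition of layer importance; taking the maximum drop per layer would be an equally reasonable choice.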

Datasets Made

Arithmetic (8 questions)

Simple math: 5+3, 10-4, 6*2, etc.

Multi-step (5 questions)

Word problems requiring multiple operations

Logic (4 questions)

Logical inference: "All X are Y, Z is X, is Z Y?"

GSM8K (5 questions)

Real grade-school math word problems from the GSM8K benchmark

Commands

Run the lightweight version (fast, 2-3 minutes):

python -m src.experiment_lite --model gpt2 --dataset arithmetic --output results/test

Run the full version (includes head ablation, slower):

python -m src.experiment --model gpt2 --dataset arithmetic --output results/full

Test a different dataset:

python -m src.experiment_lite --model gpt2 --dataset logic --output results/logic

Test a different model:

python -m src.experiment_lite --model distilgpt2 --dataset arithmetic --output results/distil

View results:

python inspect_results.py results/test/results.json
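
All of the experiment commands above share the same three flags. A sketch of how such a command line could be parsed with `argparse` follows; the default values and help strings are assumptions, and the real entry points in `src` may define more options.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the --model, --dataset, and --output flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Run an interpretability experiment.")
    # Defaults below are assumed for illustration.
    parser.add_argument("--model", default="gpt2", help="Hugging Face model name")
    parser.add_argument("--dataset", default="arithmetic", help="question set to run")
    parser.add_argument("--output", default="results/test", help="directory for results")
    return parser.parse_args(argv)
```

Passing an explicit list, for example `parse_args(["--dataset", "logic"])`, makes the parser easy to exercise without touching `sys.argv`.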

Repository: TheClassicTechno/InterpretabilityNLPProject