
TheClassicTechno/InterpretabilityNLPProject


LLM NLP Interpretability Project

A personal project by juli that explores how large language models (LLMs) solve reasoning problems by analyzing which parts of the model actually do the work.

What This Does

Instead of treating models as black boxes, we systematically break them down into parts:

  1. Test baseline accuracy on math and logic problems
  2. Zero out attention heads one at a time and measure performance drops
  3. Identify which heads matter most
  4. Analyze what kinds of errors the model makes

This tells us which parts of the model are responsible for reasoning and where failures occur.

Quick Start

  1. Install dependencies:

    pip install -r requirements.txt

  2. Run the fast test (2-3 minutes):

    python -m src.experiment_lite --model gpt2 --dataset arithmetic --output results/test

  3. View the results:

    python inspect_results.py results/test/results.json

How It Works

Step 1: Load Model

Load any Hugging Face text-generation model (gpt2, mistral, llama, etc.).
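As a rough illustration of this step, the sketch below loads a model and generates a short answer with the `transformers` Auto classes. This is an assumption about how loading might be done, not the project's actual code. The names `load_model` and `generate_answer` are hypothetical.

```python
def load_model(model_name: str = "gpt2"):
    """Load a tokenizer and causal LM from the Hugging Face Hub.

    Assumption: models are loaded through the `transformers` Auto classes.
    The README does not show the project's real loading code.
    """
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()  # inference only: disables dropout
    return tokenizer, model


def generate_answer(tokenizer, model, question: str, max_new_tokens: int = 10) -> str:
    """Greedy-decode a short continuation and return only the new text."""
    inputs = tokenizer(question, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding, so repeated runs give the same answer
        pad_token_id=tokenizer.eos_token_id,  # gpt2 has no dedicated pad token
    )
    # Drop the prompt tokens so only the model's continuation is returned.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

Calling `load_model("gpt2")` downloads the weights on first use. Larger models such as mistral or llama need their full Hub identifier and considerably more memory.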

Step 2: Test Baseline

Generate responses to questions and measure accuracy.

Example questions:

  • "What is 5 + 3?" (expected: 8)
  • "Alice has 10 dollars, spends 4. How much left?" (expected: 6)
  • "All dogs are animals. Max is a dog. Is Max an animal?" (expected: Yes)
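A minimal sketch of the baseline measurement, using the three example questions above. The answer-extraction rule (first integer or yes/no in the output) and the names `extract_answer` and `baseline_accuracy` are assumptions for illustration. The canned generator stands in for a real model so the example runs without downloading anything.

```python
import re
from typing import Callable, List, Optional, Tuple


def extract_answer(text: str) -> Optional[str]:
    """Return the first integer or yes/no found in the text, lowercased."""
    match = re.search(r"-?\d+|\b(?:yes|no)\b", text, flags=re.IGNORECASE)
    return match.group(0).lower() if match else None


def baseline_accuracy(
    generate: Callable[[str], str],
    dataset: List[Tuple[str, str]],
) -> float:
    """Fraction of questions whose extracted answer matches the expected one."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = 0
    for question, expected in dataset:
        if extract_answer(generate(question)) == expected.lower():
            correct += 1
    return correct / len(dataset)


# Example questions taken from the README.
EXAMPLES = [
    ("What is 5 + 3?", "8"),
    ("Alice has 10 dollars, spends 4. How much left?", "6"),
    ("All dogs are animals. Max is a dog. Is Max an animal?", "Yes"),
]

# Canned outputs standing in for a real model (made up for this sketch).
canned = {
    EXAMPLES[0][0]: " 8",
    EXAMPLES[1][0]: " She has 7 dollars.",
    EXAMPLES[2][0]: " Yes, Max is an animal.",
}

accuracy = baseline_accuracy(lambda q: canned[q], EXAMPLES)
```

With these canned outputs, two of the three answers match, so `accuracy` is 2/3.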

Step 3: Run Ablations

For each attention head in the model:

  • Zero it out
  • Test accuracy again
  • Measure how much performance dropped
  • Rank heads by importance
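The loop above could be sketched as follows. How a head is actually zeroed is left abstract here: `evaluate` is a hypothetical callable that returns accuracy with one `(layer, head)` pair ablated, or the baseline when given `None`. The toy evaluator is made up so the example runs on its own.

```python
from typing import Callable, Dict, List, Optional, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def rank_heads(
    evaluate: Callable[[Optional[Head]], float],
    num_layers: int,
    num_heads: int,
) -> List[Tuple[Head, float]]:
    """Ablate each head in turn and rank heads by accuracy drop, largest first."""
    baseline = evaluate(None)  # accuracy with no head ablated
    drops: Dict[Head, float] = {}
    for layer in range(num_layers):
        for head in range(num_heads):
            drops[(layer, head)] = baseline - evaluate((layer, head))
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


def toy_evaluate(ablated: Optional[Head]) -> float:
    """Stand-in for a real evaluation run: two heads matter, the rest do not."""
    penalties = {(0, 1): 0.5, (1, 0): 0.25}
    return 0.75 - penalties.get(ablated, 0.0)


ranking = rank_heads(toy_evaluate, num_layers=2, num_heads=2)
```

In this toy setup head `(0, 1)` ranks first with a drop of 0.5, followed by `(1, 0)` with 0.25. A real run calls `evaluate` once per head, so the cost grows with layers times heads, which is consistent with the full version being slower than the lite one.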

Step 4: Analyze Failures

Categorize errors made by the model:

  • Hallucination: generates plausible but wrong text
  • Wrong operation: uses wrong math operation
  • Off-by-one: answer is close but not exact
  • Incomplete: answer is cut off
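One way these categories might be assigned for a two-operand arithmetic question is sketched below. The rules are assumptions, not the project's actual heuristics: no number in the output counts as incomplete, a result matching a different operation on the same operands counts as wrong operation, a difference of exactly one counts as off-by-one, and anything else falls under hallucination. The function name `categorize_error` is hypothetical.

```python
import re


def categorize_error(output: str, a: int, b: int, expected: int) -> str:
    """Label a model output for a question over operands a and b.

    Assumes `output` holds only the model's continuation, not the prompt,
    and that the first integer in it is the model's answer.
    """
    numbers = re.findall(r"-?\d+", output)
    if not numbers:
        return "incomplete"  # no numeric answer was produced
    predicted = int(numbers[0])
    if predicted == expected:
        return "correct"

    # Results the model would get by applying a different operation.
    other_results = {a + b, a - b, a * b}
    if b != 0 and a % b == 0:
        other_results.add(a // b)
    other_results.discard(expected)

    if predicted in other_results:
        return "wrong_operation"
    if abs(predicted - expected) == 1:
        return "off_by_one"
    return "hallucination"
```

For "What is 5 + 3?" an output of "2" is labeled `wrong_operation` (5 - 3), "9" is `off_by_one`, and "42" is `hallucination`. The order of the checks is a design choice: an answer that is both one away and equal to another operation's result is attributed to the wrong operation.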

Step 5: Generate Visualizations

Create plots showing:

  • Which heads matter most (heatmap)
  • Which layers are important (bar chart)
  • What errors happen most (pie chart)
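The three plots could be produced roughly like this with matplotlib. The function names are hypothetical, and treating a layer's importance as the sum of its heads' accuracy drops is an assumption (a mean would work equally well).

```python
from collections import Counter
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def drops_to_grid(drops: Dict[Head, float], num_layers: int, num_heads: int) -> List[List[float]]:
    """Arrange per-head accuracy drops as a layers x heads grid."""
    return [
        [drops.get((layer, head), 0.0) for head in range(num_heads)]
        for layer in range(num_layers)
    ]


def layer_importance(grid: List[List[float]]) -> List[float]:
    """Sum the drops of all heads in each layer (assumed aggregation)."""
    return [sum(row) for row in grid]


def plot_results(grid: List[List[float]], error_labels: List[str], path: str) -> None:
    """Save a heatmap, a bar chart, and a pie chart side by side."""
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt

    error_counts = Counter(error_labels)
    fig, (ax_heat, ax_bar, ax_pie) = plt.subplots(1, 3, figsize=(15, 4))

    image = ax_heat.imshow(grid, aspect="auto", cmap="viridis")
    ax_heat.set_xlabel("head")
    ax_heat.set_ylabel("layer")
    ax_heat.set_title("Head importance")
    fig.colorbar(image, ax=ax_heat, label="accuracy drop")

    ax_bar.bar(range(len(grid)), layer_importance(grid))
    ax_bar.set_xlabel("layer")
    ax_bar.set_ylabel("summed accuracy drop")
    ax_bar.set_title("Layer importance")

    ax_pie.pie(list(error_counts.values()), labels=list(error_counts.keys()))
    ax_pie.set_title("Error types")

    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```

A call such as `plot_results(grid, ["off_by_one", "hallucination"], "results/test/plots.png")` would write all three panels to one image. The output path is only an example.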

Datasets

Arithmetic (8 questions)

  • Simple math: 5+3, 10-4, 6*2, etc.

Multi-step (5 questions)

  • Word problems requiring multiple operations

Logic (4 questions)

  • Logical inference: "All X are Y, Z is X, is Z Y?"

GSM8K (5 questions)

  • Real grade school math problems
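For illustration, an arithmetic dataset could be represented as a list of (question, expected answer) pairs built from operand triples. The representation and the helper `make_arithmetic_item` are assumptions. Only the three operations named above are used, and the question wording follows the README's "What is 5 + 3?" example.

```python
import operator
from typing import List, Tuple

# Map each operator symbol to the function that computes the expected answer.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_arithmetic_item(a: int, symbol: str, b: int) -> Tuple[str, str]:
    """Build one (question, expected answer) pair."""
    return f"What is {a} {symbol} {b}?", str(OPS[symbol](a, b))


# The three problems mentioned in the README: 5+3, 10-4, 6*2.
ARITHMETIC: List[Tuple[str, str]] = [
    make_arithmetic_item(5, "+", 3),
    make_arithmetic_item(10, "-", 4),
    make_arithmetic_item(6, "*", 2),
]
```

Computing the expected answer from the operands rather than typing it by hand avoids label errors in the dataset itself.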

Commands

Run the lightweight version (fast, 2-3 minutes):

python -m src.experiment_lite --model gpt2 --dataset arithmetic --output results/test

Run the full version (includes head ablation, slower):

python -m src.experiment --model gpt2 --dataset arithmetic --output results/full

Test a different dataset:

python -m src.experiment_lite --model gpt2 --dataset logic --output results/logic

Test a different model:

python -m src.experiment_lite --model distilgpt2 --dataset arithmetic --output results/distil

View results:

python inspect_results.py results/test/results.json
