puranikyashaswin/gpt2-hallucination-probe

Hallucination Probe

Linear probes on GPT-2 residual stream activations, tested against distribution-matched controls.

tl;dr

A logistic regression probe achieves AUROC 0.71 on easy prompt distributions. When we control for surface features using matched-structure prompts (same syntax, different truth value), performance drops to AUROC 0.35, below chance. The probe is detecting lexical artifacts, not a hallucination mechanism.

why this matters

Most probing studies use unconstrained distributions where factual and impossible prompts look obviously different. That makes any classifier look good. We introduced a three-tier dataset to test whether the signal survives distributional controls:

tier          n    what it tests
1 - easy      10   "What is 2+2?" vs "Who is the president of Mars?"
2 - matched   20   "Who discovered gravity?" vs "Who discovered gravity on Mars?"
3 - hard      16   "Is Pluto a planet?", "Who translated the Voynich Manuscript?"

Tier 2 is the critical control. If the probe works on tier 1 but not tier 2, the signal is surface-level.
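The tier structure can be sketched as a small labeled corpus. This is illustrative only: the names (`TIERS`, `flatten`) and the per-tier layout are assumptions, not the actual `dataset.py` API.

```python
# Illustrative sketch of the three-tier corpus (not the actual dataset.py).
# Label 1 = unanswerable / hallucination-bait, label 0 = answerable.
TIERS = {
    "easy": [
        ("What is 2+2?", 0),
        ("Who is the president of Mars?", 1),
    ],
    "matched": [
        ("Who discovered gravity?", 0),
        ("Who discovered gravity on Mars?", 1),  # same syntax, truth value flipped
    ],
    "hard": [
        ("Is Pluto a planet?", 0),                      # ambiguous but answerable
        ("Who translated the Voynich Manuscript?", 1),  # unanswerable
    ],
}

def flatten(tiers):
    """Yield (prompt, label, tier) triples for probe training."""
    for tier, items in tiers.items():
        for prompt, label in items:
            yield prompt, label, tier

samples = list(flatten(TIERS))
```

The point of the matched tier is visible in the pairs themselves: a bag-of-words classifier sees nearly identical features for both members, so only a genuine hallucination signal could separate them.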

results

layer sensitivity (5-fold CV)

Peak signal at L0 (AUROC 0.708), a drop to chance at L8-L9, and partial recovery at L11. The trend is non-monotonic: early layers encode the surface semantics the probe exploits, while mid-layers restructure the representation into something less linearly separable.
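The layer sweep amounts to fitting one probe per layer and scoring it with 5-fold CV. A minimal sklearn sketch, with synthetic random activations standing in for the real extracted residuals (so the numbers here are meaningless; only the shape of the procedure matters):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_layers, n_samples, d = 12, 46, 768
acts = rng.normal(size=(n_layers, n_samples, d))  # stand-in residual activations
y = np.tile([0, 1], n_samples // 2)               # balanced binary labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aurocs = []
for layer in range(n_layers):
    # Same recipe as the method section: StandardScaler + L2 logistic regression.
    probe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
    scores = cross_val_score(probe, acts[layer], y, cv=cv, scoring="roc_auc")
    aurocs.append(scores.mean())

best_layer = int(np.argmax(aurocs))
```

On real activations, `aurocs` is the curve described above: a peak at layer 0, chance around layers 8-9, partial recovery at 11.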

per-tier breakdown (LOO-CV, layer 11)

tier      accuracy   AUROC
easy      80%        0.76
matched   50%        ~0.35
hard      75%        0.70

Tier 2 accuracy is at chance and its AUROC is below it. That's the result.
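The per-tier breakdown can be outlined with `LeaveOneOut` and out-of-fold probabilities. Synthetic features stand in for the layer-11 activations; the tier sizes follow the dataset table above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(46, 768))   # stand-in layer-11 activations
y = np.tile([0, 1], 23)          # balanced labels
tiers = np.array(["easy"] * 10 + ["matched"] * 20 + ["hard"] * 16)

probe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
# Out-of-fold probabilities: each sample is scored by a probe that never saw it.
probs = cross_val_predict(probe, X, y, cv=LeaveOneOut(), method="predict_proba")[:, 1]

results = {}
for tier in ("easy", "matched", "hard"):
    m = tiers == tier
    acc = ((probs[m] > 0.5) == y[m]).mean()
    results[tier] = (acc, roc_auc_score(y[m], probs[m]))
```

LOO-CV also gives each sample an individual held-out probability, which is what makes the per-prompt failure analysis below possible.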

failure cases

The probe confidently misclassifies "What year did humans first land on the Moon?" as a hallucination (p=1.0) and does the same for "Is Pluto a planet?" (p=1.0), while calling "Who invented time travel?" factual (p=0.016). It tracks surface ambiguity, not factual correctness.

method

  • GPT-2 (124M), last-token residual stream, 768-dim per layer
  • logistic regression, L2 regularized (C=1.0), StandardScaler
  • 5-fold stratified CV for aggregate metrics, LOO-CV for per-sample failure analysis
  • output validation: separate pipeline generates model responses and checks correctness
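The extraction step can be sketched with the standard `transformers` hidden-states API rather than the repo's custom hooks; this is an assumption about mechanism, not `activation_extractor.py` itself. A randomly initialized GPT-2-small-shaped model is used here so nothing downloads.

```python
import torch
from transformers import GPT2Config, GPT2Model

# Random init with GPT-2-small dimensions; swap in
# GPT2Model.from_pretrained("gpt2") for real activations.
model = GPT2Model(GPT2Config(n_layer=12, n_embd=768)).eval()

input_ids = torch.tensor([[464, 2159, 318, 257]])  # dummy token ids
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# hidden_states = embedding output + one entry per block -> 13 tensors.
# The probe uses the last token's 768-dim vector from each block.
feats = [h[0, -1, :].numpy() for h in out.hidden_states[1:]]
```

Stacking `feats` across prompts yields the `(n_layers, n_samples, 768)` arrays the probes train on.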

usage

pip install torch transformers scikit-learn matplotlib seaborn numpy tqdm

python main.py --build                                 # extract + train
python main.py --prompt "Who invented time travel?"    # probe a prompt
python main.py --heatmap                               # full dataset heatmap
python main.py --analyze                               # pca + tiers + failures
python main.py --validate                              # check actual outputs

files

activation_extractor.py    hooks into transformer blocks, pulls residual stream
dataset.py                 3-tier prompt corpus (n=46)
build_activations.py       prompts -> numpy arrays
probe.py                   logistic regression, single-layer + per-layer
evaluator.py               score a prompt across all layers
analysis.py                pca, sensitivity, tier breakdown, failure report
output_validator.py        generates model outputs, checks correctness
visualize.py               heatmap, trajectory plots, auroc bars
main.py                    cli

limitations

Small dataset (46 samples). Linear probe only. No causal evidence: correlation between probe accuracy and prompt type doesn't establish a mechanism. Single model (GPT-2-small). Output validation uses substring matching, which is fragile.
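A toy illustration of the substring fragility. This is a hypothetical checker, not the repo's `output_validator.py`, but it shows both failure modes such matching admits.

```python
def substring_correct(response: str, expected: str) -> bool:
    """Naive substring check, the kind the limitation above warns about."""
    return expected.lower() in response.lower()

# False positive: the expected answer appears inside a wrong claim.
print(substring_correct("Neil Armstrong was not the first person on the Moon.",
                        "Neil Armstrong"))   # True, scored as correct
# False negative: a correct paraphrase without the exact substring.
print(substring_correct("The Apollo 11 commander walked first.",
                        "Neil Armstrong"))   # False, scored as wrong
```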

what would make this stronger

Probes trained on model outputs instead of inputs. Causal interventions (activation patching). Nonlinear probes to test if additional signal exists beyond linear separability. Larger dataset with automated prompt generation.
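Of these, the nonlinear probe is a drop-in change. A hedged sketch with synthetic features, using an MLP in place of the logistic regression:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(46, 768))   # stand-in activations
y = np.tile([0, 1], 23)

# One hidden layer; if this beats the linear probe on tier 2, some signal
# exists beyond linear separability.
probe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(probe, X, y, cv=cv, scoring="roc_auc")
```

With 46 samples an MLP will overfit easily, so any gain over the linear probe would need the larger dataset mentioned above before it means much.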
