puranikyashaswin/gpt2-hallucination-probe

Hallucination Probe

Linear probes on GPT-2 residual stream activations, tested against distribution-matched controls.

tl;dr

A logistic regression probe achieves AUROC 0.71 on easy prompt distributions. When we control for surface features using matched-structure prompts (same syntax, different truth value), performance drops to AUROC 0.35, below chance. The probe is detecting lexical artifacts, not a hallucination mechanism.

why this matters

Most probing studies use unconstrained distributions where factual and impossible prompts look obviously different. That makes any classifier look good. We introduced a three-tier dataset to test whether the signal survives distributional controls:

tier          n    what it tests
1 - easy      10   "What is 2+2?" vs "Who is the president of Mars?"
2 - matched   20   "Who discovered gravity?" vs "Who discovered gravity on Mars?"
3 - hard      16   "Is Pluto a planet?", "Who translated the Voynich Manuscript?"

Tier 2 is the critical control. If the probe works on tier 1 but not tier 2, the signal is surface-level.
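The tier structure can be sketched as a small labeled corpus. This is illustrative only: the names (`TIERS`, `flatten`) and the per-tier layout are assumptions, not the actual `dataset.py` API.

```python
# Illustrative sketch of the three-tier corpus (not the actual dataset.py).
# Label 1 = unanswerable / hallucination-bait, label 0 = answerable.
TIERS = {
    "easy": [
        ("What is 2+2?", 0),
        ("Who is the president of Mars?", 1),
    ],
    "matched": [
        ("Who discovered gravity?", 0),
        ("Who discovered gravity on Mars?", 1),  # same syntax, truth value flipped
    ],
    "hard": [
        ("Is Pluto a planet?", 0),                      # ambiguous but answerable
        ("Who translated the Voynich Manuscript?", 1),  # unanswerable
    ],
}

def flatten(tiers):
    """Yield (prompt, label, tier) triples for probe training."""
    for tier, items in tiers.items():
        for prompt, label in items:
            yield prompt, label, tier

samples = list(flatten(TIERS))
```

The point of the matched tier is visible in the pairs themselves: a bag-of-words classifier sees nearly identical features for both members, so only a genuine hallucination signal could separate them.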

results

layer sensitivity (5-fold CV)

Peak signal at L0 (AUROC 0.708), a drop to chance at L8-L9, and partial recovery at L11. The trend is non-monotonic: early layers encode the surface semantics the probe exploits, while mid-layers restructure the representation into something less linearly separable.
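The layer sweep amounts to fitting one probe per layer and scoring it with 5-fold CV. A minimal sklearn sketch, with synthetic random activations standing in for the real extracted residuals (so the numbers here are meaningless; only the shape of the procedure matters):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_layers, n_samples, d = 12, 46, 768
acts = rng.normal(size=(n_layers, n_samples, d))  # stand-in residual activations
y = np.tile([0, 1], n_samples // 2)               # balanced binary labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aurocs = []
for layer in range(n_layers):
    # Same recipe as the method section: StandardScaler + L2 logistic regression.
    probe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
    scores = cross_val_score(probe, acts[layer], y, cv=cv, scoring="roc_auc")
    aurocs.append(scores.mean())

best_layer = int(np.argmax(aurocs))
```

On real activations, `aurocs` is the curve described above: a peak at layer 0, chance around layers 8-9, partial recovery at 11.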

per-tier breakdown (LOO-CV, layer 11)

tier      accuracy   AUROC
easy      80%        0.76
matched   50%        ~0.35
hard      75%        0.70

Tier 2 accuracy is at chance and its AUROC is below it. That's the result.
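The per-tier breakdown can be outlined with `LeaveOneOut` and out-of-fold probabilities. Synthetic features stand in for the layer-11 activations; the tier sizes follow the dataset table above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(46, 768))   # stand-in layer-11 activations
y = np.tile([0, 1], 23)          # balanced labels
tiers = np.array(["easy"] * 10 + ["matched"] * 20 + ["hard"] * 16)

probe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
# Out-of-fold probabilities: each sample is scored by a probe that never saw it.
probs = cross_val_predict(probe, X, y, cv=LeaveOneOut(), method="predict_proba")[:, 1]

results = {}
for tier in ("easy", "matched", "hard"):
    m = tiers == tier
    acc = ((probs[m] > 0.5) == y[m]).mean()
    results[tier] = (acc, roc_auc_score(y[m], probs[m]))
```

LOO-CV also gives each sample an individual held-out probability, which is what makes the per-prompt failure analysis below possible.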

failure cases

The probe confidently misclassifies "What year did humans first land on the Moon?" as a hallucination (p=1.0) and does the same for "Is Pluto a planet?" (p=1.0), while calling "Who invented time travel?" factual (p=0.016). It tracks surface ambiguity, not factual correctness.

method

  • GPT-2 (124M), last-token residual stream, 768-dim per layer
  • logistic regression, L2 regularized (C=1.0), StandardScaler
  • 5-fold stratified CV for aggregate metrics, LOO-CV for per-sample failure analysis
  • output validation: separate pipeline generates model responses and checks correctness
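The extraction step can be sketched with the standard `transformers` hidden-states API rather than the repo's custom hooks; this is an assumption about mechanism, not `activation_extractor.py` itself. A randomly initialized GPT-2-small-shaped model is used here so nothing downloads.

```python
import torch
from transformers import GPT2Config, GPT2Model

# Random init with GPT-2-small dimensions; swap in
# GPT2Model.from_pretrained("gpt2") for real activations.
model = GPT2Model(GPT2Config(n_layer=12, n_embd=768)).eval()

input_ids = torch.tensor([[464, 2159, 318, 257]])  # dummy token ids
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# hidden_states = embedding output + one entry per block -> 13 tensors.
# The probe uses the last token's 768-dim vector from each block.
feats = [h[0, -1, :].numpy() for h in out.hidden_states[1:]]
```

Stacking `feats` across prompts yields the `(n_layers, n_samples, 768)` arrays the probes train on.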

usage

pip install torch transformers scikit-learn matplotlib seaborn numpy tqdm

python main.py --build                                 # extract + train
python main.py --prompt "Who invented time travel?"    # probe a prompt
python main.py --heatmap                               # full dataset heatmap
python main.py --analyze                               # pca + tiers + failures
python main.py --validate                              # check actual outputs

files

activation_extractor.py    hooks into transformer blocks, pulls residual stream
dataset.py                 3-tier prompt corpus (n=46)
build_activations.py       prompts -> numpy arrays
probe.py                   logistic regression, single-layer + per-layer
evaluator.py               score a prompt across all layers
analysis.py                pca, sensitivity, tier breakdown, failure report
output_validator.py        generates model outputs, checks correctness
visualize.py               heatmap, trajectory plots, auroc bars
main.py                    cli

limitations

Small dataset (46 samples). Linear probe only. No causal evidence: correlation between probe accuracy and prompt type doesn't establish a mechanism. Single model (GPT-2-small). Output validation uses substring matching, which is fragile.
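A toy illustration of the substring fragility. This is a hypothetical checker, not the repo's `output_validator.py`, but it shows both failure modes such matching admits.

```python
def substring_correct(response: str, expected: str) -> bool:
    """Naive substring check, the kind the limitation above warns about."""
    return expected.lower() in response.lower()

# False positive: the expected answer appears inside a wrong claim.
print(substring_correct("Neil Armstrong was not the first person on the Moon.",
                        "Neil Armstrong"))   # True, scored as correct
# False negative: a correct paraphrase without the exact substring.
print(substring_correct("The Apollo 11 commander walked first.",
                        "Neil Armstrong"))   # False, scored as wrong
```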

what would make this stronger

Probes trained on model outputs instead of inputs. Causal interventions (activation patching). Nonlinear probes to test if additional signal exists beyond linear separability. Larger dataset with automated prompt generation.
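Of these, the nonlinear probe is a drop-in change. A hedged sketch with synthetic features, using an MLP in place of the logistic regression:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(46, 768))   # stand-in activations
y = np.tile([0, 1], 23)

# One hidden layer; if this beats the linear probe on tier 2, some signal
# exists beyond linear separability.
probe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(probe, X, y, cv=cv, scoring="roc_auc")
```

With 46 samples an MLP will overfit easily, so any gain over the linear probe would need the larger dataset mentioned above before it means much.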
