Linear probes on GPT-2 residual stream activations, tested against distribution-matched controls.
A logistic regression probe achieves AUROC 0.71 on easy prompt distributions. When we control for surface features using matched-structure prompts (same syntax, different truth value), performance drops to 0.35, below chance. The probe detects lexical artifacts, not a hallucination mechanism.
Most probing studies use unconstrained distributions where factual and impossible prompts look obviously different. That makes almost any classifier look good. We introduce a three-tier dataset to test whether the signal survives distributional controls:
| tier | n | what it tests |
|---|---|---|
| 1 - easy | 10 | "What is 2+2?" vs "Who is the president of Mars?" |
| 2 - matched | 20 | "Who discovered gravity?" vs "Who discovered gravity on Mars?" |
| 3 - hard | 16 | "Is Pluto a planet?", "Who translated the Voynich Manuscript?" |
Tier 2 is the critical control. If the probe works on tier 1 but not tier 2, the signal is surface-level.
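The control logic can be sketched on synthetic data: a probe that keys on a surface feature scores well when that feature tracks the label (tier 1) and collapses when the feature is decorrelated from the label (tier 2). The features below are random stand-ins for activations, not real GPT-2 features; "dim 0 is the surface cue" is the illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_tier(n, surface_tracks_label):
    """Synthetic stand-in for activations: dimension 0 is a 'surface' feature."""
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, 16))
    if surface_tracks_label:
        X[:, 0] += 2.0 * y                      # tier 1: surface cue == label
    else:
        X[:, 0] += 2.0 * rng.integers(0, 2, n)  # tier 2: cue decorrelated
    return X, y

X_tr, y_tr = make_tier(200, True)
X_easy, y_easy = make_tier(200, True)
X_matched, y_matched = make_tier(200, False)

probe = LogisticRegression(C=1.0).fit(X_tr, y_tr)
auroc_easy = roc_auc_score(y_easy, probe.predict_proba(X_easy)[:, 1])
auroc_matched = roc_auc_score(y_matched, probe.predict_proba(X_matched)[:, 1])
print(f"easy AUROC {auroc_easy:.2f}, matched AUROC {auroc_matched:.2f}")
```

The probe learns only the cue, so the easy tier looks strong and the matched tier sits at chance, which is exactly the failure pattern the tier-2 control is built to expose.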
Peak signal at L0 (AUROC 0.708), drops to chance at L8-L9, partial recovery at L11. The trend is non-monotonic: early layers encode surface semantics the probe exploits, while mid-layers restructure the representation into something less linearly separable.
| tier | accuracy | AUROC |
|---|---|---|
| easy | 80% | 0.76 |
| matched | 50% | ~0.35 |
| hard | 75% | 0.70 |
Tier 2 at chance. That's the result.
The probe confidently misclassifies "What year did humans first land on the Moon?" (p=1.0 hallucination) and "Is Pluto a planet?" (p=1.0). It also calls "Who invented time travel?" factual (p=0.016). It tracks surface ambiguity, not factual correctness.
- GPT-2 (124M), last-token residual stream, 768-dim per layer
- logistic regression, L2 regularized (C=1.0), StandardScaler
- 5-fold stratified CV for aggregate metrics, LOO-CV for per-sample failure analysis
- output validation: separate pipeline generates model responses and checks correctness
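The extraction step presumably relies on forward hooks; here is a minimal sketch of that pattern on a toy stack of linear blocks standing in for GPT-2's transformer blocks. The tuple-vs-tensor handling reflects the assumption that a block may return either a tensor or a tuple whose first element is the hidden state; the module itself is a placeholder, not the repo's code.

```python
import torch
import torch.nn as nn

# Toy stand-in for a stack of transformer blocks; the hook pattern is the
# same when registering on a real model's block modules.
blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])

captured = {}

def make_hook(idx):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured[idx] = hidden.detach()[:, -1, :]   # keep last-token vector
    return hook

handles = [blk.register_forward_hook(make_hook(i)) for i, blk in enumerate(blocks)]

x = torch.randn(1, 5, 16)        # (batch, seq, hidden)
with torch.no_grad():
    for blk in blocks:
        x = blk(x)

for h in handles:
    h.remove()                    # always detach hooks after extraction

print({i: tuple(v.shape) for i, v in captured.items()})
```

One forward pass fills `captured` with one last-token vector per block, which is then the per-layer feature matrix the probes train on.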
pip install torch transformers scikit-learn matplotlib seaborn numpy tqdm
python main.py --build # extract + train
python main.py --prompt "Who invented time travel?" # probe a prompt
python main.py --heatmap # full dataset heatmap
python main.py --analyze # pca + tiers + failures
python main.py --validate # check actual outputs
| file | role |
|---|---|
| `activation_extractor.py` | hooks into transformer blocks, pulls residual stream |
| `dataset.py` | 3-tier prompt corpus (n=46) |
| `build_activations.py` | prompts -> numpy arrays |
| `probe.py` | logistic regression, single-layer + per-layer |
| `evaluator.py` | score a prompt across all layers |
| `analysis.py` | pca, sensitivity, tier breakdown, failure report |
| `output_validator.py` | generates model outputs, checks correctness |
| `visualize.py` | heatmap, trajectory plots, auroc bars |
| `main.py` | cli |
Small dataset (46 samples). Linear probe only. No causal evidence: correlation between probe accuracy and prompt type doesn't establish a mechanism. Single model (GPT-2-small). Output validation uses substring matching, which is fragile.
Probes trained on model outputs instead of inputs. Causal interventions (activation patching). Nonlinear probes to test if additional signal exists beyond linear separability. Larger dataset with automated prompt generation.