Releases: tmcarmichael/nn-observability
v2.4.0: corrected numbers, 57-point verification pipeline, bfloat16 fix, 35 references
v2.4.0
Fixed
- Llama 3B catch rate at 10% flag rate: 8.4% (was 7.8% from v2 data)
- Catch rate ceiling: 11-15% at 20% (was 12-15%)
- Cross-family gap: 2.9x (was 3.0x)
- Permutation F-statistic: 15.77 computed from data (was 15.87, hand-edited)
- Shuffle test: +0.014 +/- 0.019 rerun and committed as `results/shuffle_test_gpt2.json` (was +0.008, uncommitted)
- Llama 3B per-seed range: +0.084 to +0.102 (was +0.085 to +0.093)
- ANCOVA degrees of freedom: F(5,68) for 6 families (was F(3,56) for 4 families)
- bfloat16 crash on Mistral 7B and Llama 8B: `.float()` cast on all `cross_entropy` and `softmax` calls in `transformer_observe.py`
Added
- 57-point numerical verification (`just verify`): every data-dependent number in the paper checked against source JSONs
- Content-diff checks on generated tables (`--check` mode) and macros (`--check` mode)
- Schema validation for paper-scope results JSONs (`just validate-results`)
- Provenance block in `run_model.py` output: model_revision, script, timestamp, device, torch_version
- `scripts/shuffle_test.py`: standalone shuffle test, 10 permutations on GPT-2 124M
- `analysis/lint_hardcoded.py`: flags literal numbers in tex that should use macros
- `tests/test_probe_sync.py`: verifies `src/probe.py` and `scripts/run_model.py` produce identical output (253 tests total)
- `analysis/README.md`: script table, new-model checklist, JSON schema
- 8 new bibliography entries: model family papers (Qwen, Llama, Gemma, Mistral, Phi-3), Alain & Bengio 2017, Bricken et al. 2023, Groeneveld 2024 (OLMo), Honovich 2022, Min 2023 (35 total)
Changed
- All statistical tests (permutation, variance decomposition, mixed-effects) computed from results JSONs by the generator, not hardcoded
- `generate_tables.py` updated for all 13 models (was 9)
- `generate_data_macros.py` sources bootstrap CI and token budget from JSONs (was hardcoded)
- README rewritten
- `pip install -e .` documented as alternative to `uv sync`
v2.3.0: RAG and MedQA zero-shot transfer, Llama cliff, r_OC confirmed, six families
What changed since v2.2.1
Zero-shot downstream transfer
- SQuAD 2.0 RAG: WikiText-trained probe catches 11.8% of wrong answers at 20% flag rate that confidence misses. Zero-shot, no QA data in probe training.
- MedQA-USMLE: same probe catches 11.6% of wrong medical licensing answers at 20% flag rate. The model confidently produces wrong answers; standard output monitoring marks them as correct.
- TruthfulQA: aggregate catch rate matches the ceiling (13.5% at 20%), but the observer cannot discriminate within the confident-wrong subset (AUC 0.475). Boundary condition: fluent reproduction of memorized falsehoods is specifically resistant.
- The 12-15% saturation ceiling holds across language modeling, RAG, medical QA, and factual QA. Four tasks, same ceiling, same WikiText-trained probe.
Llama cliff
- Llama 3.2 1B full protocol: pcorr +0.286, matching GPT-2 and upper Qwen range.
- Llama 3.2 3B: +0.089 under identical methodology. Signal falls from the high-observability group to near the detection floor in one step.
- Architectural configuration changes between 1B (16 layers, 2048 dim) and 3B (28 layers, 3072 dim). Within-family evidence that architecture, not family identity, predicts observability.
r_OC width sweep
- 512-unit output predictor absorbs no more signal than 64-unit bottleneck (+0.130 vs +0.129 on Qwen 7B). Bottleneck limitation is dead.
Six families
- Mistral 7B (+0.313): highest clean signal in dataset. Seed agreement +0.995.
- Phi-3 Mini (+0.300): sixth family. Instruct-only variant.
- Permutation test: F=15.87, p=0.006, eta-squared 0.92. Leave-one-family-out: all p < 0.025.
Statistical hardening
- Shuffle test: trained probe on randomized labels achieves +0.008 (real: +0.334, ratio 10.7x)
- TOST equivalence: nonlinear MLP equivalent to linear within +/- 0.03 (p=0.025)
- Jonckheere-Terpstra: within-Qwen declining trend (p=0.002), small relative to between-family effect
- Cross-family control sensitivity: 49-64% confidence absorption across 9/10 models
- Qwen 14B bimodality disclosed (two probe solutions at +0.186 and +0.250)
Paper rewrite
- Abstract: 160 words, four-move structure. Closes on "Standard output monitoring marks these answers as correct. The observer does not."
- Introduction: compressed, Llama cliff, four-task contribution
- Architecture section: restructured to 120 lines, secondary stats moved to appendix
- Discussion: rewritten around categorical observability, MedQA steelman, frontier scale implication
- Limitations: three items, TruthfulQA boundary condition, WikiText memorization defense with three numbered arguments
Code
- All results committed as JSON
- Figure 1 regenerated with six families, Llama 1B solid marker
- Llama 1B excluded from family-level permutation test (different architecture inflates within-family variance)
v2.2.1: Mistral 7B fifth family, unified experiment harness, methodology hardening
What changed since v2.2.0
New data
- Mistral 7B v0.3: fifth architecture family. pcorr +0.313, OC +0.156, seed agreement +0.995. Highest clean signal in the dataset (random head +0.014, no geometry inflation). Peak at L22 (69% depth), consistent with the two-thirds-depth pattern across all high-observability families.
- Five-family exclusive catch table: catch rate ranges from 7.8% (Llama 3B, pcorr +0.089) to 11.4% (Mistral 7B, pcorr +0.313) at 10% flag rate. All five families converge to 12-15% at 20% flag rate. The 3.5x pcorr gap compresses to 1.2x in catch rate at 20%, indicating a ceiling set by error structure rather than observability.
- Cross-domain transfer on Mistral: WikiText to C4 +0.155, C4 within-domain -0.010. Same asymmetry as Qwen and Llama: the signal is in the representations, the target construction requires clean text.
Paper improvements
- Exclusive catch reframing: "stable at 9-10%" replaced with sublinear saturation story across all five sections that referenced it. Abstract, introduction, architecture, related work, discussion all updated to "7-11% at 10%, converging to 12-15% at 20%."
- Flagging table rewritten: 5 models x 4 flag rates, replacing the previous 4-model single-rate table. Includes pcorr column showing the catch-rate-to-observability relationship.
- ANCOVA pseudoreplication caveat: labeled supplementary, added note that per-seed observations are correlated and the mixed-effects model is the primary test.
- Mann-Whitney replaced: U=49 p=0.0003 (pseudoreplication vulnerability) replaced with qualitative no-overlap statement (every Qwen 3B seed +0.225 to +0.288 exceeds every Llama 3B seed +0.085 to +0.093) plus forward reference to the permutation test for family-level inference.
- Split consistency: cross-family table footnoted as validation split (held-out seeds, n=6-7) with test-split confirmation (within 5%, rankings preserved). New appendix subsection "Layer selection and test-split confirmation" with actual numbers.
- Method section: "balanced by construction" corrected to "approximately balanced" (mean-zero by OLS, median near zero for large N). Mixed-effects equation now defines j (indexes seeds within model i). Seed agreement formula added (Eq. sagree).
- Random probe attribution fixed: +0.046 was from MNIST MLP, cited as if GPT-2. Replaced with actual transformer random_head values (+0.023 Qwen 3B, -0.002 Llama 3B, +0.014 Mistral 7B) in both method and signal sections. MLP binary-vs-regression comparison now explicitly attributed to MLP validation experiments.
- Table upgrades: added +/- std column to cross-family table and GPT-2 scaling table. Added OC/pcorr fraction column to GPT-2 table showing the 34% to 60% output-discard growth pattern.
- Cross-domain transfer promoted: from appendix-only to its own bold-header subsection in the architecture section, with four-family data (Qwen 7B, Qwen 14B, Mistral 7B, Llama 3B) and Gemma exception noted.
- Mistral added to cross-family table: 9 rows, 5 families.
Code
- Unified experiment harness: `scripts/run_model.py` replaces 13 per-model scripts. Single file, no local imports, works on bare GPU pods. Handles both `model.model.layers` (Llama, Qwen, Mistral, Gemma, Phi) and `model.transformer.h` (GPT-2).
- Shared probe module: `src/probe.py` with architecture-agnostic `_get_layer_list()`. Used by analysis scripts and smoke tests.
- r_OC width sweep script: `scripts/roc_width_sweep.py` tests 64/128/256/512-unit output predictor on Qwen 7B. Ready to run.
- Python 3.12 pinned: `.python-version` added. 54/54 core tests pass.
- Smoke tests: `tests/test_smoke_run_model.py` validates the full output JSON schema from `run_model.py`.
- Pre-commit hooks: ruff lint + format on commit, version tag check on push.
- Gemma flagging marked invalid: suspected inverted observer polarity. Needs GPU recomputation.
- Legacy scripts archived: per-model scripts moved to `scripts/legacy/`.
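The harness's architecture dispatch can be sketched like this. `get_layer_list` is a hypothetical stand-in for the idea behind `_get_layer_list()`; the two attribute paths are the ones named in the bullet above.

```python
def get_layer_list(model):
    """Return the transformer block list for either family layout.

    Llama/Qwen/Mistral/Gemma/Phi expose blocks at model.model.layers;
    GPT-2 exposes them at model.transformer.h. (Sketch only; the repo's
    _get_layer_list() may differ in detail.)
    """
    inner = getattr(model, "model", None)
    if inner is not None and hasattr(inner, "layers"):
        return inner.layers
    transformer = getattr(model, "transformer", None)
    if transformer is not None and hasattr(transformer, "h"):
        return transformer.h
    raise AttributeError("unrecognized architecture: no model.layers or transformer.h")
```

Centralizing this dispatch is what lets one script replace 13 per-model scripts: everything downstream (layer sweeps, probes, smoke tests) only ever sees a list of blocks.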
In progress (GPU)
- Phi-3 Mini full protocol (sixth family, running now)
- Llama 3.2 1B full protocol (confirms preliminary +0.250)
- Llama 3.1 8B full protocol (confirms preliminary +0.088)
- r_OC width sweep on Qwen 7B (resolves bottleneck limitation)
v2.2.0: multi-rate exclusive catches, instruct stability, abstract rewrite, safety framing, OpenAI monitorability positioning
What changed since v2.1.0
New findings
- Multi-rate exclusive catch analysis: observer catches 6-7% of errors at 5% flag rate, 9-10% at 10%, saturating near 13-15% at 20%, across all models tested
- Instruct operational stability: Qwen 7B instruct holds at 13.9% exclusive catches at 30% flag rate while base drops to 12.0%
- Multi-rate pattern confirmed at 3 of 5 Qwen scales (0.5B, 1.5B, 7B), mixed at 3B and 14B
- Llama gap corrected to 3.0x (from 2.8x) using v3 values
- Output-layer discard fraction corrected to 60% at 1.5B (from 68%)
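The exclusive catch analysis above (errors the observer flags that confidence misses, at a matched flag budget) can be sketched as follows. This is a sketch of the metric's shape, not `analysis/exclusive_catch_rates.py`; the exact definitions follow the paper.

```python
import numpy as np

def exclusive_catch_rate(observer_score, confidence, is_error, flag_rate):
    """Fraction of errors flagged by the observer but not by confidence.

    Both monitors flag the same budget: the observer flags its top
    `flag_rate` fraction by score, confidence flags its least-confident
    `flag_rate` fraction. (Illustrative implementation.)
    """
    n = len(is_error)
    k = max(1, int(round(flag_rate * n)))
    obs_flagged = np.zeros(n, dtype=bool)
    obs_flagged[np.argsort(-np.asarray(observer_score))[:k]] = True
    conf_flagged = np.zeros(n, dtype=bool)
    conf_flagged[np.argsort(np.asarray(confidence))[:k]] = True  # least confident
    exclusive = obs_flagged & ~conf_flagged & np.asarray(is_error)
    return float(exclusive.sum() / max(1, int(np.sum(is_error))))
```

Sweeping `flag_rate` over 0.05/0.10/0.20 is what produces the multi-rate table (6-7%, 9-10%, 13-15%).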
Paper improvements
- Abstract rewritten: finding-first opener, multi-rate data, specific safety closer
- Introduction restructured: "the probe is standard; the measurement is not", split robustness and cross-family paragraphs, escalated safety framing
- Architecture section: mechanistic table moved to appendix, multi-rate exclusive catch data added, instruct stability finding added
- Discussion: closing paragraph now includes saturation data and output-layer discard trend, future work prioritizes controlled training experiment
- Limitations: tightened from 47 to 30 lines, cut redundant items
- Related work: added Guan et al. "Monitoring Monitorability" (OpenAI 2025) and Korbak et al. "Chain of Thought Monitorability" (2025), positioned as upstream constraint
- All banned vocabulary removed, section openers rewritten, rhetorical questions eliminated
- Decision quality explicitly scoped to confidence-residual loss signal in Method
- Contributions list now has forward references and specific numbers
Code
- Added `analysis/exclusive_catch_rates.py`: multi-rate exclusive catch analysis across all models and flag rates
- Added Mistral 7B and Phi-3 Mini data collection scripts (results pending)
- Nonlinear probe delta corrected (-0.041 for GPT-2, -0.019 for Qwen 14B)
- Orphaned LaTeX section files moved to sections/legacy/
- README fully rewritten to match paper v2.2.0
Citations
- 20 cited, 20 in bib, zero orphans
- New: Guan et al. 2025, Korbak et al. 2025
v2.1.0: Llama multi-layer sweep, split bootstrap, number audit, monitorability citations
Half the signal in standard activation probes is output confidence in disguise. After controlling for it with partial Spearman correlation, a stable linear signal remains across four transformer families and 11 model scales. Its strength varies by architecture family, not model size.
Key results
- Partial correlation +0.282 +/- 0.001 after confidence controls (GPT-2 124M, 20 seeds, seed agreement +0.993)
- Signal stable at rho_partial ~ +0.25 across 28x scale within Qwen 2.5 (0.5B-14B, 7-seed, matched ex/dim)
- Cross-family divergence: Llama 3.2 3B produces +0.089 vs Qwen 3B +0.263, a 3.0x gap (permutation test p = 0.014)
- 88% of variance in observability is between architecture families; 6% between scales
- Nonlinear MLP does not exceed the linear probe at matched hyperparameters on any of 8 models tested
- 9-10% of model errors are invisible to output confidence at every scale tested (GPT-2 124M through Qwen 14B)
- Instruction tuning preserves the signal at all five Qwen scales (0.5B through 14B)
- Mechanistic substrate (layer 0 attention, MLP suppression at layers 3-4) qualitatively identical between base and instruct on Qwen 7B
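The confidence-control step behind these numbers, partial Spearman correlation, can be sketched by rank-transforming everything, residualizing both variables on the ranked controls, and correlating the residuals. A minimal sketch of the technique, not the repo's implementation:

```python
import numpy as np

def _ranks(x):
    """Ranks 0..n-1 (no tie handling; fine for continuous scores)."""
    return np.argsort(np.argsort(x)).astype(float)

def partial_spearman(x, y, controls):
    """Partial Spearman correlation of x and y given control variables.

    Regress the ranks of x and of y on the ranked controls (OLS with
    intercept), then take the Pearson correlation of the two residuals.
    """
    rx, ry = _ranks(x), _ranks(y)
    C = np.column_stack([_ranks(c) for c in controls] + [np.ones(len(x))])
    rx_res = rx - C @ np.linalg.lstsq(C, rx, rcond=None)[0]
    ry_res = ry - C @ np.linalg.lstsq(C, ry, rcond=None)[0]
    return float(rx_res @ ry_res / np.sqrt((rx_res @ rx_res) * (ry_res @ ry_res)))
```

This is what makes "output confidence in disguise" testable: a probe that is only re-reading confidence scores near zero once confidence is a control, while a genuine internal signal survives.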
What changed since v2.0.0
- Llama multi-layer sweep (L0, L7, L14, L21, L27): signal absent at every depth under both linear and nonlinear probing, strongest reading (+0.148) below Qwen's noise floor
- Split-level bootstrap (Qwen 7B, 30 document resamples): rho_partial = +0.238, 95% CI [+0.215, +0.270], confirming signal is stable under data resampling
- Gap corrected to 3.0x (from 2.8x) using v3 values
- Number audit: all values verified against source JSONs, dual-protocol differences annotated
- New citations: Guan et al. "Monitoring Monitorability" (OpenAI 2025), Korbak et al. "Chain of Thought Monitorability" (2025)
- Paper prose tightened: abstract sharpened, discussion closing paragraph added, banned vocabulary removed, section openers rewritten
- Orphaned LaTeX section files moved to sections/legacy/
- Appendix stub reference removed
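The split-level bootstrap above resamples documents, not tokens, so the CI reflects document-level variation. A percentile-bootstrap sketch under that assumption (the repo recomputes rho_partial per resample; here a per-document statistic is simply averaged):

```python
import numpy as np

def split_bootstrap_ci(per_doc_stats, n_resamples=30, alpha=0.05, seed=0):
    """Percentile bootstrap CI over document resamples (sketch)."""
    rng = np.random.default_rng(seed)
    stats = np.asarray(per_doc_stats, dtype=float)
    means = [rng.choice(stats, size=len(stats), replace=True).mean()
             for _ in range(n_resamples)]
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```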
Reproducibility
All results are committed as JSON in results/. Model checkpoint hashes are pinned in results/model_revisions.json. Statistical analysis reproduces from committed data without GPU:
```
cd analysis && python run_all.py
```
Full experiment reproduction:
```
just test       # pytest suite
just check      # lint + format
just reproduce  # MLP + GPT-2 results (CPU/MPS/CUDA)
```
Architecture Predicts Linear Readability of Decision Quality in Transformers
Summary
Half the signal in standard activation probes is output confidence in disguise. After controlling for it with partial Spearman correlation, a stable linear signal remains across four transformer families and 11 model scales. Its strength varies by architecture family, not model size.
Key results
- Partial correlation +0.282 +/- 0.001 after confidence controls (GPT-2 124M, 20 seeds, seed agreement +0.993)
- Signal stable at rho_partial ~ +0.25 across 28x scale within Qwen 2.5 (0.5B-14B, 7-seed, matched ex/dim)
- Cross-family divergence: Llama 3.2 3B produces +0.089 vs Qwen 3B +0.263, a 2.8x gap (permutation test p = 0.014)
- 88% of variance in observability is between architecture families; 6% between scales
- Nonlinear MLP does not exceed the linear probe at matched hyperparameters on any of 8 models tested
- 9-10% of model errors are invisible to output confidence at every scale tested (GPT-2 124M through Qwen 14B)
- Instruction tuning preserves the signal at all five Qwen scales (0.5B through 14B)
- Mechanistic substrate (layer 0 attention, MLP suppression at layers 3-4) qualitatively identical between base and instruct on Qwen 7B
What changed since v1.0.0
- Qwen 2.5 scaling extended to five scales (0.5B, 1.5B, 3B, 7B, 14B) with full control batteries
- All scales at matched token budgets (350+ ex/dim, 600 for 0.5B)
- Gemma 3 1B added as fourth architecture family
- 8-model nonlinear probe comparison confirms signal is genuinely linear
- Statistical framework: mixed-effects model, exact permutation test, ANCOVA
- Token budget sensitivity analysis (7-point ex/dim sweep on Qwen 0.5B)
- Repo reorganized with analysis/, archive/, and figure generation scripts
Reproducibility
All results are committed as JSON in results/. Model checkpoint hashes are pinned in results/model_revisions.json. Statistical analysis reproduces from committed data without GPU:
```
cd analysis && python run_all.py
```
Full experiment reproduction:
```
just test       # pytest suite
just check      # lint + format
just reproduce  # MLP + GPT-2 results (CPU/MPS/CUDA)
```
v1.0.0: Internal Quality Signals in Transformer Activations
Summary
Frozen transformer activations contain a linearly readable decision-quality signal that survives strong controls for output confidence. The signal replicates in GPT-2, Qwen 2.5 1.5B, and Llama 3.2 1B, but diverges by family at larger scale: Qwen preserves it to 7B, while Llama largely loses it above 1B under the same evaluation protocol.
Key results
- Partial correlation +0.282 +/- 0.001 after confidence controls (GPT-2 124M, 20 seeds)
- Signal stable across 12x scale within GPT-2 (+0.279 to +0.290)
- Output-independent component increases with scale (+0.099 at 124M to +0.174 at 1.5B)
- 4,368 exclusive high-loss catches at 10% flag rate that confidence misses (GPT-2 124M)
- ~67% of the raw signal explained by named controls; ~33% remains unexplained
- Partial mechanistic support localizes the signal to distributed mid-layer attention (layers 5-7, GPT-2 124M only)
- Architecture-dependent scaling: Qwen preserves the signal to 7B, while Llama weakens sharply above 1B
Experimental arc
- Phases 1-3: Hand-designed observers collapse under confidence controls
- Phase 4: Learned binary heads recover signal on frozen MLP activations
- Phase 5: Transfer to GPT-2 124M with +0.99 seed agreement
- Phase 6: Catch errors confidence misses (4,368 exclusive at 10% flag rate)
- Phase 7: Outperform a 24,576-feature SAE probe (+0.290 vs +0.255)
- Phase 8: Hold stable across GPT-2 124M to 1.5B
- Phase 9: Replicate in Qwen and Llama at 1-1.5B, then diverge by family at larger scale
Reproducibility
All results are committed as JSON in results/. Model checkpoint hashes are pinned in results/model_revisions.json. The environment is locked via uv.lock.
```
just test       # 38 tests, ~2s
just reproduce  # full reproduction, ~60 min
```
Phase 9: Cross-family replication (GPT-2, Qwen, Llama)
The learned observer signal replicates across three independent architecture families with positive
output-controlled residuals in every case.
Summary
Phase 9 tests whether the signal from Phases 5-8 is a GPT-2-specific artifact or a broader property of pretrained
decoder-only transformers. Under the same evaluation protocol (layer sweep, three-seed battery, output-controlled
residual, negative baselines), Qwen 2.5 and Llama 3.2 both replicate the core finding. Hand-designed baselines
collapse in every family.
Also in this release: Phase 5f directional ablation (partial causal evidence), deep-merge fix for results JSON,
float16 inference, examples-per-dimension token scaling, lazy imports for test isolation, and Colab notebook for GPU
reproduction.
Cross-family results
| Model | Family | Params | Partial corr | Output-controlled | Seed agreement |
|---|---|---|---|---|---|
| GPT-2 XL | GPT-2 | 1558M | +0.290 | +0.174 | +0.952 |
| Qwen 2.5 1.5B | Qwen | 1544M | +0.284 | +0.207 | +0.982 |
| Llama 3.2 1B | Llama | 1236M | +0.250 | +0.126 | +0.999 |
What's next
Frontier scale (8B+), cross-domain transfer, and actionability (adaptive inference, abstention).
Phase 8: Stable decision-quality signal across GPT-2 scaling, with increasing output-independence
Decision-quality signal persists across GPT-2 124M to 1.5B with stable partial correlation (+0.279 to +0.290) and an output-independent component that increases from +0.099 to +0.174.
Summary
This release completes the eight-phase experimental arc from structural comparison to scale characterization.
Phases 1-3 show that hand-designed activation observers collapse under partial-correlation controls. Phases 4-5
recover the signal with learned linear projections trained under binary supervision. Phase 6 shows complementary
error coverage beyond confidence. Phase 7 compares raw residual-stream observers against SAE-based probes. Phase 8
tests the result across GPT-2 124M, 355M, 774M, and 1.5B, finding stable signal strength, high seed agreement (0.88-0.95), and an output-independent component that increases across this scaling curve.
Results
| Model | Params | Peak | Partial corr | Output-controlled | Seed agreement |
|---|---|---|---|---|---|
| GPT-2 | 124M | L8 | +0.290 | +0.099 | +0.918 |
| GPT-2 Medium | 355M | L16 | +0.279 | +0.103 | +0.877 |
| GPT-2 Large | 774M | L24 | +0.286 | +0.164 | +0.901 |
| GPT-2 XL | 1558M | L34 | +0.290 | +0.174 | +0.952 |
What's next
Causal validation and cross-architecture scaling (Llama) are the active focus.