Releases: tmcarmichael/nn-observability
v2.4.0: corrected numbers, 57-point verification pipeline, bfloat16 fix, 35 references
v2.4.0
Fixed
- Llama 3B catch rate at 10% flag rate: 8.4% (was 7.8% from v2 data)
- Catch rate ceiling: 11-15% at 20% (was 12-15%)
- Cross-family gap: 2.9x (was 3.0x)
- Permutation F-statistic: 15.77 computed from data (was 15.87, hand-edited)
- Shuffle test: +0.014 +/- 0.019 rerun and committed as `results/shuffle_test_gpt2.json` (was +0.008, uncommitted)
- Llama 3B per-seed range: +0.084 to +0.102 (was +0.085 to +0.093)
- ANCOVA degrees of freedom: F(5,68) for 6 families (was F(3,56) for 4 families)
- bfloat16 crash on Mistral 7B and Llama 8B: `.float()` cast on all `cross_entropy` and `softmax` calls in `transformer_observe.py`
Added
- 57-point numerical verification (`just verify`): every data-dependent number in the paper checked against source JSONs
- Content-diff checks on generated tables (`--check` mode) and macros (`--check` mode)
- Schema validation for paper-scope results JSONs (`just validate-results`)
- Provenance block in `run_model.py` output: model_revision, script, timestamp, device, torch_version
- `scripts/shuffle_test.py`: standalone shuffle test, 10 permutations on GPT-2 124M
- `analysis/lint_hardcoded.py`: flags literal numbers in tex that should use macros
- `tests/test_probe_sync.py`: verifies `src/probe.py` and `scripts/run_model.py` produce identical output (253 tests total)
- `analysis/README.md`: script table, new-model checklist, JSON schema
- 8 new bibliography entries: model family papers (Qwen, Llama, Gemma, Mistral, Phi-3), Alain & Bengio 2017, Bricken et al. 2023, Groeneveld 2024 (OLMo), Honovich 2022, Min 2023 (35 total)
Changed
- All statistical tests (permutation, variance decomposition, mixed-effects) computed from results JSONs by the generator, not hardcoded
- `generate_tables.py` updated for all 13 models (was 9)
- `generate_data_macros.py` sources bootstrap CI and token budget from JSONs (was hardcoded)
- README rewritten
- `pip install -e .` documented as alternative to `uv sync`
v2.3.0: RAG and MedQA zero-shot transfer, Llama cliff, r_OC confirmed, six families
What changed since v2.2.1
Zero-shot downstream transfer
- SQuAD 2.0 RAG: WikiText-trained probe catches 11.8% of wrong answers at 20% flag rate that confidence misses. Zero-shot, no QA data in probe training.
- MedQA-USMLE: same probe catches 11.6% of wrong medical licensing answers at 20% flag rate. The model confidently produces wrong answers; standard output monitoring marks them as correct.
- TruthfulQA: aggregate catch rate matches the ceiling (13.5% at 20%), but the observer cannot discriminate within the confident-wrong subset (AUC 0.475). Boundary condition: fluent reproduction of memorized falsehoods is specifically resistant.
- The 12-15% saturation ceiling holds across language modeling, RAG, medical QA, and factual QA. Four tasks, same ceiling, same WikiText-trained probe.
Llama cliff
- Llama 3.2 1B full protocol: pcorr +0.286, matching GPT-2 and upper Qwen range.
- Llama 3.2 3B: +0.089 under identical methodology. Signal falls from the high-observability group to near the detection floor in one step.
- Architectural configuration changes between 1B (16 layers, 2048 dim) and 3B (28 layers, 3072 dim). Within-family evidence that architecture, not family identity, predicts observability.
r_OC width sweep
- 512-unit output predictor absorbs no more signal than 64-unit bottleneck (+0.130 vs +0.129 on Qwen 7B). Bottleneck limitation is dead.
Six families
- Mistral 7B (+0.313): highest clean signal in dataset. Seed agreement +0.995.
- Phi-3 Mini (+0.300): sixth family. Instruct-only variant.
- Permutation test: F=15.87, p=0.006, eta-squared 0.92. Leave-one-family-out: all p < 0.025.
Statistical hardening
- Shuffle test: trained probe on randomized labels achieves +0.008 (real: +0.334, ratio 10.7x)
- TOST equivalence: nonlinear MLP equivalent to linear within +/- 0.03 (p=0.025)
- Jonckheere-Terpstra: within-Qwen declining trend (p=0.002), small relative to between-family effect
- Cross-family control sensitivity: 49-64% confidence absorption across 9/10 models
- Qwen 14B bimodality disclosed (two probe solutions at +0.186 and +0.250)
Paper rewrite
- Abstract: 160 words, four-move structure. Closes on "Standard output monitoring marks these answers as correct. The observer does not."
- Introduction: compressed, Llama cliff, four-task contribution
- Architecture section: restructured to 120 lines, secondary stats moved to appendix
- Discussion: rewritten around categorical observability, MedQA steelman, frontier scale implication
- Limitations: three items, TruthfulQA boundary condition, WikiText memorization defense with three numbered arguments
Code
- All results committed as JSON
- Figure 1 regenerated with six families, Llama 1B solid marker
- Llama 1B excluded from family-level permutation test (different architecture inflates within-family variance)
v2.2.1: Mistral 7B fifth family, unified experiment harness, methodology hardening
What changed since v2.2.0
New data
- Mistral 7B v0.3: fifth architecture family. pcorr +0.313, OC +0.156, seed agreement +0.995. Highest clean signal in the dataset (random head +0.014, no geometry inflation). Peak at L22 (69% depth), consistent with the two-thirds-depth pattern across all high-observability families.
- Five-family exclusive catch table: catch rate ranges from 7.8% (Llama 3B, pcorr +0.089) to 11.4% (Mistral 7B, pcorr +0.313) at 10% flag rate. All five families converge to 12-15% at 20% flag rate. The 3.5x pcorr gap compresses to 1.2x in catch rate at 20%, indicating a ceiling set by error structure rather than observability.
- Cross-domain transfer on Mistral: WikiText to C4 +0.155, C4 within-domain -0.010. Same asymmetry as Qwen and Llama: the signal is in the representations, the target construction requires clean text.
Paper improvements
- Exclusive catch reframing: "stable at 9-10%" replaced with sublinear saturation story across all five sections that referenced it. Abstract, introduction, architecture, related work, discussion all updated to "7-11% at 10%, converging to 12-15% at 20%."
- Flagging table rewritten: 5 models x 4 flag rates, replacing the previous 4-model single-rate table. Includes pcorr column showing the catch-rate-to-observability relationship.
- ANCOVA pseudoreplication caveat: labeled supplementary, added note that per-seed observations are correlated and the mixed-effects model is the primary test.
- Mann-Whitney replaced: U=49 p=0.0003 (pseudoreplication vulnerability) replaced with qualitative no-overlap statement (every Qwen 3B seed +0.225 to +0.288 exceeds every Llama 3B seed +0.085 to +0.093) plus forward reference to the permutation test for family-level inference.
- Split consistency: cross-family table footnoted as validation split (held-out seeds, n=6-7) with test-split confirmation (within 5%, rankings preserved). New appendix subsection "Layer selection and test-split confirmation" with actual numbers.
- Method section: "balanced by construction" corrected to "approximately balanced" (mean-zero by OLS, median near zero for large N). Mixed-effects equation now defines j (indexes seeds within model i). Seed agreement formula added (Eq. sagree).
- Random probe attribution fixed: +0.046 was from MNIST MLP, cited as if GPT-2. Replaced with actual transformer random_head values (+0.023 Qwen 3B, -0.002 Llama 3B, +0.014 Mistral 7B) in both method and signal sections. MLP binary-vs-regression comparison now explicitly attributed to MLP validation experiments.
- Table upgrades: added +/- std column to cross-family table and GPT-2 scaling table. Added OC/pcorr fraction column to GPT-2 table showing the 34% to 60% output-discard growth pattern.
- Cross-domain transfer promoted: from appendix-only to its own bold-header subsection in the architecture section, with four-family data (Qwen 7B, Qwen 14B, Mistral 7B, Llama 3B) and Gemma exception noted.
- Mistral added to cross-family table: 9 rows, 5 families.
Code
- Unified experiment harness: `scripts/run_model.py` replaces 13 per-model scripts. Single file, no local imports, works on bare GPU pods. Handles both `model.model.layers` (Llama, Qwen, Mistral, Gemma, Phi) and `model.transformer.h` (GPT-2).
- Shared probe module: `src/probe.py` with architecture-agnostic `_get_layer_list()`. Used by analysis scripts and smoke tests.
- r_OC width sweep script: `scripts/roc_width_sweep.py` tests 64/128/256/512-unit output predictor on Qwen 7B. Ready to run.
- Python 3.12 pinned: `.python-version` added. 54/54 core tests pass.
- Smoke tests: `tests/test_smoke_run_model.py` validates the full output JSON schema from `run_model.py`.
- Pre-commit hooks: ruff lint + format on commit, version tag check on push.
- Gemma flagging marked invalid: suspected inverted observer polarity. Needs GPU recomputation.
- Legacy scripts archived: per-model scripts moved to `scripts/legacy/`.
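The harness's architecture dispatch can be sketched like this. `get_layer_list` is a hypothetical stand-in for the idea behind `_get_layer_list()`; the two attribute paths are the ones named in the bullet above.

```python
def get_layer_list(model):
    """Return the transformer block list for either family layout.

    Llama/Qwen/Mistral/Gemma/Phi expose blocks at model.model.layers;
    GPT-2 exposes them at model.transformer.h. (Sketch only; the repo's
    _get_layer_list() may differ in detail.)
    """
    inner = getattr(model, "model", None)
    if inner is not None and hasattr(inner, "layers"):
        return inner.layers
    transformer = getattr(model, "transformer", None)
    if transformer is not None and hasattr(transformer, "h"):
        return transformer.h
    raise AttributeError("unrecognized architecture: no model.layers or transformer.h")
```

Centralizing this dispatch is what lets one script replace 13 per-model scripts: everything downstream (layer sweeps, probes, smoke tests) only ever sees a list of blocks.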
In progress (GPU)
- Phi-3 Mini full protocol (sixth family, running now)
- Llama 3.2 1B full protocol (confirms preliminary +0.250)
- Llama 3.1 8B full protocol (confirms preliminary +0.088)
- r_OC width sweep on Qwen 7B (resolves bottleneck limitation)
v2.2.0: multi-rate exclusive catches, instruct stability, abstract rewrite, safety framing, OpenAI monitorability positioning
What changed since v2.1.0
New findings
- Multi-rate exclusive catch analysis: observer catches 6-7% of errors at 5% flag rate, 9-10% at 10%, saturating near 13-15% at 20%, across all models tested
- Instruct operational stability: Qwen 7B instruct holds at 13.9% exclusive catches at 30% flag rate while base drops to 12.0%
- Multi-rate pattern confirmed at 3 of 5 Qwen scales (0.5B, 1.5B, 7B), mixed at 3B and 14B
- Llama gap corrected to 3.0x (from 2.8x) using v3 values
- Output-layer discard fraction corrected to 60% at 1.5B (from 68%)
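The exclusive catch analysis above (errors the observer flags that confidence misses, at a matched flag budget) can be sketched as follows. This is a sketch of the metric's shape, not `analysis/exclusive_catch_rates.py`; the exact definitions follow the paper.

```python
import numpy as np

def exclusive_catch_rate(observer_score, confidence, is_error, flag_rate):
    """Fraction of errors flagged by the observer but not by confidence.

    Both monitors flag the same budget: the observer flags its top
    `flag_rate` fraction by score, confidence flags its least-confident
    `flag_rate` fraction. (Illustrative implementation.)
    """
    n = len(is_error)
    k = max(1, int(round(flag_rate * n)))
    obs_flagged = np.zeros(n, dtype=bool)
    obs_flagged[np.argsort(-np.asarray(observer_score))[:k]] = True
    conf_flagged = np.zeros(n, dtype=bool)
    conf_flagged[np.argsort(np.asarray(confidence))[:k]] = True  # least confident
    exclusive = obs_flagged & ~conf_flagged & np.asarray(is_error)
    return float(exclusive.sum() / max(1, int(np.sum(is_error))))
```

Sweeping `flag_rate` over 0.05/0.10/0.20 is what produces the multi-rate table (6-7%, 9-10%, 13-15%).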
Paper improvements
- Abstract rewritten: finding-first opener, multi-rate data, specific safety closer
- Introduction restructured: "the probe is standard; the measurement is not", split robustness and cross-family paragraphs, escalated safety framing
- Architecture section: mechanistic table moved to appendix, multi-rate exclusive catch data added, instruct stability finding added
- Discussion: closing paragraph now includes saturation data and output-layer discard trend, future work prioritizes controlled training experiment
- Limitations: tightened from 47 to 30 lines, cut redundant items
- Related work: added Guan et al. "Monitoring Monitorability" (OpenAI 2025) and Korbak et al. "Chain of Thought Monitorability" (2025), positioned as upstream constraint
- All banned vocabulary removed, section openers rewritten, rhetorical questions eliminated
- Decision quality explicitly scoped to confidence-residual loss signal in Method
- Contributions list now has forward references and specific numbers
Code
- Added `analysis/exclusive_catch_rates.py`: multi-rate exclusive catch analysis across all models and flag rates
- Added Mistral 7B and Phi-3 Mini data collection scripts (results pending)
- Nonlinear probe delta corrected (-0.041 for GPT-2, -0.019 for Qwen 14B)
- Orphaned LaTeX section files moved to sections/legacy/
- README fully rewritten to match paper v2.2.0
Citations
- 20 cited, 20 in bib, zero orphans
- New: Guan et al. 2025, Korbak et al. 2025
v2.1.0: Llama multi-layer sweep, split bootstrap, number audit, monitorability citations
Half the signal in standard activation probes is output confidence in disguise. After controlling for it with partial Spearman correlation, a stable linear signal remains across four transformer families and 11 model scales. Its strength varies by architecture family, not model size.
Key results
- Partial correlation +0.282 +/- 0.001 after confidence controls (GPT-2 124M, 20 seeds, seed agreement +0.993)
- Signal stable at rho_partial ~ +0.25 across 28x scale within Qwen 2.5 (0.5B-14B, 7-seed, matched ex/dim)
- Cross-family divergence: Llama 3.2 3B produces +0.089 vs Qwen 3B +0.263, a 3.0x gap (permutation test p = 0.014)
- 88% of variance in observability is between architecture families; 6% between scales
- Nonlinear MLP does not exceed the linear probe at matched hyperparameters on any of 8 models tested
- 9-10% of model errors are invisible to output confidence at every scale tested (GPT-2 124M through Qwen 14B)
- Instruction tuning preserves the signal at all five Qwen scales (0.5B through 14B)
- Mechanistic substrate (layer 0 attention, MLP suppression at layers 3-4) qualitatively identical between base and instruct on Qwen 7B
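The confidence-control step behind these numbers, partial Spearman correlation, can be sketched by rank-transforming everything, residualizing both variables on the ranked controls, and correlating the residuals. A minimal sketch of the technique, not the repo's implementation:

```python
import numpy as np

def _ranks(x):
    """Ranks 0..n-1 (no tie handling; fine for continuous scores)."""
    return np.argsort(np.argsort(x)).astype(float)

def partial_spearman(x, y, controls):
    """Partial Spearman correlation of x and y given control variables.

    Regress the ranks of x and of y on the ranked controls (OLS with
    intercept), then take the Pearson correlation of the two residuals.
    """
    rx, ry = _ranks(x), _ranks(y)
    C = np.column_stack([_ranks(c) for c in controls] + [np.ones(len(x))])
    rx_res = rx - C @ np.linalg.lstsq(C, rx, rcond=None)[0]
    ry_res = ry - C @ np.linalg.lstsq(C, ry, rcond=None)[0]
    return float(rx_res @ ry_res / np.sqrt((rx_res @ rx_res) * (ry_res @ ry_res)))
```

This is what makes "output confidence in disguise" testable: a probe that is only re-reading confidence scores near zero once confidence is a control, while a genuine internal signal survives.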
What changed since v2.0.0
- Llama multi-layer sweep (L0, L7, L14, L21, L27): signal absent at every depth under both linear and nonlinear probing, strongest reading (+0.148) below Qwen's noise floor
- Split-level bootstrap (Qwen 7B, 30 document resamples): rho_partial = +0.238, 95% CI [+0.215, +0.270], confirming signal is stable under data resampling
- Gap corrected to 3.0x (from 2.8x) using v3 values
- Number audit: all values verified against source JSONs, dual-protocol differences annotated
- New citations: Guan et al. "Monitoring Monitorability" (OpenAI 2025), Korbak et al. "Chain of Thought Monitorability" (2025)
- Paper prose tightened: abstract sharpened, discussion closing paragraph added, banned vocabulary removed, section openers rewritten
- Orphaned LaTeX section files moved to sections/legacy/
- Appendix stub reference removed
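The split-level bootstrap above resamples documents, not tokens, so the CI reflects document-level variation. A percentile-bootstrap sketch under that assumption (the repo recomputes rho_partial per resample; here a per-document statistic is simply averaged):

```python
import numpy as np

def split_bootstrap_ci(per_doc_stats, n_resamples=30, alpha=0.05, seed=0):
    """Percentile bootstrap CI over document resamples (sketch)."""
    rng = np.random.default_rng(seed)
    stats = np.asarray(per_doc_stats, dtype=float)
    means = [rng.choice(stats, size=len(stats), replace=True).mean()
             for _ in range(n_resamples)]
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```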
Reproducibility
All results are committed as JSON in results/. Model checkpoint hashes are pinned in results/model_revisions.json. Statistical analysis reproduces from committed data without GPU:
```
cd analysis && python run_all.py
```
Full experiment reproduction:
```
just test       # pytest suite
just check      # lint + format
just reproduce  # MLP + GPT-2 results (CPU/MPS/CUDA)
```
Architecture Predicts Linear Readability of Decision Quality in Transformers
Summary
Half the signal in standard activation probes is output confidence in disguise. After controlling for it with partial Spearman correlation, a stable linear signal remains across four transformer families and 11 model scales. Its strength varies by architecture family, not model size.
Key results
- Partial correlation +0.282 +/- 0.001 after confidence controls (GPT-2 124M, 20 seeds, seed agreement +0.993)
- Signal stable at rho_partial ~ +0.25 across 28x scale within Qwen 2.5 (0.5B-14B, 7-seed, matched ex/dim)
- Cross-family divergence: Llama 3.2 3B produces +0.089 vs Qwen 3B +0.263, a 2.8x gap (permutation test p = 0.014)
- 88% of variance in observability is between architecture families; 6% between scales
- Nonlinear MLP does not exceed the linear probe at matched hyperparameters on any of 8 models tested
- 9-10% of model errors are invisible to output confidence at every scale tested (GPT-2 124M through Qwen 14B)
- Instruction tuning preserves the signal at all five Qwen scales (0.5B through 14B)
- Mechanistic substrate (layer 0 attention, MLP suppression at layers 3-4) qualitatively identical between base and instruct on Qwen 7B
What changed since v1.0.0
- Qwen 2.5 scaling extended to five scales (0.5B, 1.5B, 3B, 7B, 14B) with full control batteries
- All scales at matched token budgets (350+ ex/dim, 600 for 0.5B)
- Gemma 3 1B added as fourth architecture family
- 8-model nonlinear probe comparison confirms signal is genuinely linear
- Statistical framework: mixed-effects model, exact permutation test, ANCOVA
- Token budget sensitivity analysis (7-point ex/dim sweep on Qwen 0.5B)
- Repo reorganized with analysis/, archive/, and figure generation scripts
Reproducibility
All results are committed as JSON in results/. Model checkpoint hashes are pinned in results/model_revisions.json. Statistical analysis reproduces from committed data without GPU:
```
cd analysis && python run_all.py
```
Full experiment reproduction:
```
just test       # pytest suite
just check      # lint + format
just reproduce  # MLP + GPT-2 results (CPU/MPS/CUDA)
```
v1.0.0: Internal Quality Signals in Transformer Activations
Summary
Frozen transformer activations contain a linearly readable decision-quality signal that survives strong controls for output confidence. The signal replicates in GPT-2, Qwen 2.5 1.5B, and Llama 3.2 1B, but diverges by family at larger scale: Qwen preserves it to 7B, while Llama largely loses it above 1B under the same evaluation protocol.
Key results
- Partial correlation +0.282 +/- 0.001 after confidence controls (GPT-2 124M, 20 seeds)
- Signal stable across 12x scale within GPT-2 (+0.279 to +0.290)
- Output-independent component increases with scale (+0.099 at 124M to +0.174 at 1.5B)
- 4,368 exclusive high-loss catches at 10% flag rate that confidence misses (GPT-2 124M)
- ~67% of the raw signal explained by named controls; ~33% remains unexplained
- Partial mechanistic support localizes the signal to distributed mid-layer attention (layers 5-7, GPT-2 124M only)
- Architecture-dependent scaling: Qwen preserves the signal to 7B, while Llama weakens sharply above 1B
Experimental arc
- Phases 1-3: Hand-designed observers collapse under confidence controls
- Phase 4: Learned binary heads recover signal on frozen MLP activations
- Phase 5: Transfer to GPT-2 124M with +0.99 seed agreement
- Phase 6: Catch errors confidence misses (4,368 exclusive at 10% flag rate)
- Phase 7: Outperform a 24,576-feature SAE probe (+0.290 vs +0.255)
- Phase 8: Hold stable across GPT-2 124M to 1.5B
- Phase 9: Replicate in Qwen and Llama at 1-1.5B, then diverge by family at larger scale
Reproducibility
All results are committed as JSON in results/. Model checkpoint hashes are pinned in results/model_revisions.json. The environment is locked via uv.lock.
```
just test       # 38 tests, ~2s
just reproduce  # full reproduction, ~60 min
```
Phase 9: Cross-family replication (GPT-2, Qwen, Llama)
The learned observer signal replicates across three independent architecture families with positive
output-controlled residuals in every case.
Summary
Phase 9 tests whether the signal from Phases 5-8 is a GPT-2-specific artifact or a broader property of pretrained
decoder-only transformers. Under the same evaluation protocol (layer sweep, three-seed battery, output-controlled
residual, negative baselines), Qwen 2.5 and Llama 3.2 both replicate the core finding. Hand-designed baselines
collapse in every family.
Also in this release: Phase 5f directional ablation (partial causal evidence), deep-merge fix for results JSON,
float16 inference, examples-per-dimension token scaling, lazy imports for test isolation, and Colab notebook for GPU
reproduction.
Cross-family results
| Model | Family | Params | Partial corr | Output-controlled | Seed agreement |
|---|---|---|---|---|---|
| GPT-2 XL | GPT-2 | 1558M | +0.290 | +0.174 | +0.952 |
| Qwen 2.5 1.5B | Qwen | 1544M | +0.284 | +0.207 | +0.982 |
| Llama 3.2 1B | Llama | 1236M | +0.250 | +0.126 | +0.999 |
What's next
Frontier scale (8B+), cross-domain transfer, and actionability (adaptive inference, abstention).
Phase 8: Stable decision-quality signal across GPT-2 scaling, with increasing output-independence
Decision-quality signal persists across GPT-2 124M to 1.5B with stable partial correlation (+0.279 to +0.290) and an output-independent component that increases from +0.099 to +0.174.
Summary
This release completes the eight-phase experimental arc from structural comparison to scale characterization.
Phases 1-3 show that hand-designed activation observers collapse under partial-correlation controls. Phases 4-5
recover the signal with learned linear projections trained under binary supervision. Phase 6 shows complementary
error coverage beyond confidence. Phase 7 compares raw residual-stream observers against SAE-based probes. Phase 8
tests the result across GPT-2 124M, 355M, 774M, and 1.5B, finding stable signal strength, high seed agreement (0.88-0.95), and an output-independent component that increases across this scaling curve.
Results
| Model | Params | Peak | Partial corr | Output-controlled | Seed agreement |
|---|---|---|---|---|---|
| GPT-2 | 124M | L8 | +0.290 | +0.099 | +0.918 |
| GPT-2 Medium | 355M | L16 | +0.279 | +0.103 | +0.877 |
| GPT-2 Large | 774M | L24 | +0.286 | +0.164 | +0.901 |
| GPT-2 XL | 1558M | L34 | +0.290 | +0.174 | +0.952 |
What's next
Causal validation and cross-architecture scaling (Llama) are the active focus.