
Releases: tmcarmichael/nn-observability

v2.4.0: corrected numbers, 57-point verification pipeline, bfloat16 fix, 35 references

16 Apr 01:44


v2.4.0

Fixed

  • Llama 3B catch rate at 10% flag rate: 8.4% (was 7.8% from v2 data)
  • Catch rate ceiling: 11-15% at 20% flag rate (was 12-15%)
  • Cross-family gap: 2.9x (was 3.0x)
  • Permutation F-statistic: 15.77 computed from data (was 15.87, hand-edited)
  • Shuffle test: +0.014 +/- 0.019 rerun and committed as results/shuffle_test_gpt2.json (was +0.008, uncommitted)
  • Llama 3B per-seed range: +0.084 to +0.102 (was +0.085 to +0.093)
  • ANCOVA degrees of freedom: F(5,68) for 6 families (was F(3,56) for 4 families)
  • bfloat16 crash on Mistral 7B and Llama 8B: resolved with a .float() cast on all cross_entropy and softmax calls in transformer_observe.py

Added

  • 57-point numerical verification (just verify): every data-dependent number in the paper checked against source JSONs
  • Content-diff checks on generated tables and macros (--check mode)
  • Schema validation for paper-scope results JSONs (just validate-results)
  • Provenance block in run_model.py output: model_revision, script, timestamp, device, torch_version
  • scripts/shuffle_test.py: standalone shuffle test, 10 permutations on GPT-2 124M
  • analysis/lint_hardcoded.py: flags literal numbers in tex that should use macros
  • tests/test_probe_sync.py: verifies src/probe.py and scripts/run_model.py produce identical output (253 tests total)
  • analysis/README.md: script table, new-model checklist, JSON schema
  • 8 new bibliography entries: model family papers (Qwen, Llama, Gemma, Mistral, Phi-3), Alain & Bengio 2017, Bricken et al. 2023, Groeneveld 2024 (OLMo), Honovich 2022, Min 2023 (35 total)
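The verification pass in the first bullet reduces to one primitive: a data-dependent number quoted in the paper must match the committed results JSON to within rounding tolerance. A minimal sketch of a single such check — check_paper_number, the key name, and the tolerance are hypothetical; the repo's just verify recipe presumably runs a battery of 57 of these:

```python
import json


def check_paper_number(json_path, key, paper_value, tol=5e-4):
    """Verify one paper number against its source results JSON.

    Hypothetical sketch of a single verification point: the value
    quoted in the paper must agree with the committed result to
    within rounding tolerance. Returns (ok, actual) so a driver can
    report the discrepancy.
    """
    with open(json_path) as f:
        actual = json.load(f)[key]
    return abs(actual - paper_value) <= tol, actual
```

A driver would loop this over a manifest of (json_path, key, paper_value) triples, one per number in the paper, and fail the build on any mismatch.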

Changed

  • All statistical tests (permutation, variance decomposition, mixed-effects) computed from results JSONs by the generator, not hardcoded
  • generate_tables.py updated for all 13 models (was 9)
  • generate_data_macros.py sources bootstrap CI and token budget from JSONs (was hardcoded)
  • README rewritten
  • pip install -e . documented as alternative to uv sync

v2.3.0: RAG and MedQA zero-shot transfer, Llama cliff, r_OC confirmed, six families

14 Apr 04:36


What changed since v2.2.1

Zero-shot downstream transfer

  • SQuAD 2.0 RAG: WikiText-trained probe catches 11.8% of wrong answers at 20% flag rate that confidence misses. Zero-shot, no QA data in probe training.
  • MedQA-USMLE: same probe catches 11.6% of wrong medical licensing answers at 20% flag rate. The model confidently produces wrong answers; standard output monitoring marks them as correct.
  • TruthfulQA: aggregate catch rate matches the ceiling (13.5% at 20%), but the observer cannot discriminate within the confident-wrong subset (AUC 0.475). Boundary condition: fluent reproduction of memorized falsehoods is specifically resistant.
  • The 12-15% saturation ceiling holds across language modeling, RAG, medical QA, and factual QA. Four tasks, same ceiling, same WikiText-trained probe.

Llama cliff

  • Llama 3.2 1B full protocol: pcorr +0.286, matching GPT-2 and the upper Qwen range.
  • Llama 3.2 3B: +0.089 under identical methodology. Signal falls from the high-observability group to near the detection floor in one step.
  • Architectural configuration changes between 1B (16 layers, 2048 dim) and 3B (28 layers, 3072 dim). Within-family evidence that architecture, not family identity, predicts observability.

r_OC width sweep

  • 512-unit output predictor absorbs no more signal than the 64-unit bottleneck (+0.130 vs +0.129 on Qwen 7B). The bottleneck limitation is ruled out.

Six families

  • Mistral 7B (+0.313): highest clean signal in dataset. Seed agreement +0.995.
  • Phi-3 Mini (+0.300): sixth family. Instruct-only variant.
  • Permutation test: F=15.87, p=0.006, eta-squared 0.92. Leave-one-family-out: all p < 0.025.

Statistical hardening

  • Shuffle test: a probe trained on randomized labels achieves +0.008 (real: +0.334, ratio 10.7x)
  • TOST equivalence: nonlinear MLP equivalent to linear within +/- 0.03 (p=0.025)
  • Jonckheere-Terpstra: within-Qwen declining trend (p=0.002), small relative to between-family effect
  • Cross-family control sensitivity: 49-64% confidence absorption across 9/10 models
  • Qwen 14B bimodality disclosed (two probe solutions at +0.186 and +0.250)
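The shuffle control in the first bullet can be illustrated in miniature: compare the real score-target correlation against a null distribution built from label permutations. pearson and shuffle_null are hypothetical stand-ins; the committed scripts/shuffle_test.py retrains the probe on each permuted label set, which this sketch does not attempt:

```python
import random


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)


def shuffle_null(scores, targets, n_perm=10, seed=0):
    """Real correlation vs. a label-shuffle null.

    Hypothetical sketch: if the probe carries real signal, the true
    correlation should sit far outside the permutation null.
    """
    rng = random.Random(seed)
    real = pearson(scores, targets)
    shuffled = list(targets)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(pearson(scores, shuffled))
    return real, null
```

A full shuffle test moves the permutation upstream of training, so the probe itself is fit to noise; scoring a fixed probe against shuffled labels, as here, only nulls the evaluation.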

Paper rewrite

  • Abstract: 160 words, four-move structure. Closes on "Standard output monitoring marks these answers as correct. The observer does not."
  • Introduction: compressed, Llama cliff, four-task contribution
  • Architecture section: restructured to 120 lines, secondary stats moved to appendix
  • Discussion: rewritten around categorical observability, MedQA steelman, frontier scale implication
  • Limitations: three items, TruthfulQA boundary condition, WikiText memorization defense with three numbered arguments

Code

  • All results committed as JSON
  • Figure 1 regenerated with six families, Llama 1B solid marker
  • Llama 1B excluded from family-level permutation test (different architecture inflates within-family variance)

v2.2.1: unified run_model.py, Mistral 7B fifth family

13 Apr 17:18


v2.2.1: Mistral 7B fifth family, unified experiment harness, methodology hardening

What changed since v2.2.0

New data

  • Mistral 7B v0.3: fifth architecture family. pcorr +0.313, OC +0.156, seed agreement +0.995. Highest clean signal in the dataset (random head +0.014, no geometry inflation). Peak at L22 (69% depth), consistent with the two-thirds-depth pattern across all high-observability families.
  • Five-family exclusive catch table: catch rate ranges from 7.8% (Llama 3B, pcorr +0.089) to 11.4% (Mistral 7B, pcorr +0.313) at 10% flag rate. All five families converge to 12-15% at 20% flag rate. The 3.5x pcorr gap compresses to 1.2x in catch rate at 20%, indicating a ceiling set by error structure rather than observability.
  • Cross-domain transfer on Mistral: WikiText to C4 +0.155, C4 within-domain -0.010. Same asymmetry as Qwen and Llama: the signal is in the representations; target construction requires clean text.

Paper improvements

  • Exclusive catch reframing: "stable at 9-10%" replaced with the sublinear-saturation story in all five sections that referenced it. Abstract, introduction, architecture, related work, and discussion all updated to "7-11% at 10%, converging to 12-15% at 20%."
  • Flagging table rewritten: 5 models x 4 flag rates, replacing the previous 4-model single-rate table. Includes pcorr column showing the catch-rate-to-observability relationship.
  • ANCOVA pseudoreplication caveat: labeled supplementary, added note that per-seed observations are correlated and the mixed-effects model is the primary test.
  • Mann-Whitney test replaced: U=49, p=0.0003 was vulnerable to pseudoreplication; now a qualitative no-overlap statement (every Qwen 3B seed, +0.225 to +0.288, exceeds every Llama 3B seed, +0.085 to +0.093) plus a forward reference to the permutation test for family-level inference.
  • Split consistency: cross-family table footnoted as validation split (held-out seeds, n=6-7) with test-split confirmation (within 5%, rankings preserved). New appendix subsection "Layer selection and test-split confirmation" with actual numbers.
  • Method section: "balanced by construction" corrected to "approximately balanced" (mean-zero by OLS, median near zero for large N). Mixed-effects equation now defines j (indexes seeds within model i). Seed agreement formula added (Eq. sagree).
  • Random probe attribution fixed: +0.046 was from MNIST MLP, cited as if GPT-2. Replaced with actual transformer random_head values (+0.023 Qwen 3B, -0.002 Llama 3B, +0.014 Mistral 7B) in both method and signal sections. MLP binary-vs-regression comparison now explicitly attributed to MLP validation experiments.
  • Table upgrades: added +/- std column to cross-family table and GPT-2 scaling table. Added OC/pcorr fraction column to GPT-2 table showing the 34% to 60% output-discard growth pattern.
  • Cross-domain transfer promoted: from appendix-only to its own bold-header subsection in the architecture section, with four-family data (Qwen 7B, Qwen 14B, Mistral 7B, Llama 3B) and Gemma exception noted.
  • Mistral added to cross-family table: 9 rows, 5 families.

Code

  • Unified experiment harness: scripts/run_model.py replaces 13 per-model scripts. Single file, no local imports, works on bare GPU pods. Handles both model.model.layers (Llama, Qwen, Mistral, Gemma, Phi) and model.transformer.h (GPT-2).
  • Shared probe module: src/probe.py with architecture-agnostic _get_layer_list(). Used by analysis scripts and smoke tests.
  • r_OC width sweep script: scripts/roc_width_sweep.py tests 64/128/256/512-unit output predictor on Qwen 7B. Ready to run.
  • Python 3.12 pinned: .python-version added. 54/54 core tests pass.
  • Smoke tests: tests/test_smoke_run_model.py validates the full output JSON schema from run_model.py.
  • Pre-commit hooks: ruff lint + format on commit, version tag check on push.
  • Gemma flagging marked invalid: suspected inverted observer polarity. Needs GPU recomputation.
  • Legacy scripts archived: per-model scripts moved to scripts/legacy/.
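The unified harness bullet above hinges on one dispatch: Llama, Qwen, Mistral, Gemma, and Phi expose decoder blocks at model.model.layers, while GPT-2 exposes them at model.transformer.h. A minimal sketch of what an architecture-agnostic accessor like _get_layer_list() might look like — the duck-typing shown is an assumption, not the repo's exact code:

```python
def get_layer_list(model):
    """Return the decoder block list for either HF layout.

    Hypothetical mirror of src/probe.py's _get_layer_list():
    Llama/Qwen/Mistral/Gemma/Phi models nest blocks under
    model.model.layers; GPT-2 nests them under model.transformer.h.
    """
    inner = getattr(model, "model", None)
    if inner is not None and hasattr(inner, "layers"):
        return inner.layers
    transformer = getattr(model, "transformer", None)
    if transformer is not None and hasattr(transformer, "h"):
        return transformer.h
    raise ValueError(f"Unsupported architecture: {type(model).__name__}")
```

With the block list normalized, the rest of the harness (layer sweep, hook registration, peak-layer selection) can be written once for all 13 models.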

In progress (GPU)

  • Phi-3 Mini full protocol (sixth family, running now)
  • Llama 3.2 1B full protocol (confirms preliminary +0.250)
  • Llama 3.1 8B full protocol (confirms preliminary +0.088)
  • r_OC width sweep on Qwen 7B (resolves bottleneck limitation)

v2.2.0: multi-rate exclusive catches, instruct stability, abstract rewrite, safety framing, OpenAI monitorability positioning

13 Apr 04:14


What changed since v2.1.0

New findings

  • Multi-rate exclusive catch analysis: observer catches 6-7% of errors at 5% flag rate, 9-10% at 10%, saturating near 13-15% at 20%, across all models tested
  • Instruct operational stability: Qwen 7B instruct holds at 13.9% exclusive catches at 30% flag rate while base drops to 12.0%
  • Multi-rate pattern confirmed at 3 of 5 Qwen scales (0.5B, 1.5B, 7B), mixed at 3B and 14B
  • Llama gap corrected to 3.0x (from 2.8x) using v3 values
  • Output-layer discard fraction corrected to 60% at 1.5B (from 68%)

Paper improvements

  • Abstract rewritten: finding-first opener, multi-rate data, specific safety closer
  • Introduction restructured: "the probe is standard; the measurement is not", split robustness and cross-family paragraphs, escalated safety framing
  • Architecture section: mechanistic table moved to appendix, multi-rate exclusive catch data added, instruct stability finding added
  • Discussion: closing paragraph now includes saturation data and output-layer discard trend, future work prioritizes controlled training experiment
  • Limitations: tightened from 47 to 30 lines, cut redundant items
  • Related work: added Guan et al. "Monitoring Monitorability" (OpenAI 2025) and Korbak et al. "Chain of Thought Monitorability" (2025), positioned as upstream constraint
  • All banned vocabulary removed, section openers rewritten, rhetorical questions eliminated
  • Decision quality explicitly scoped to confidence-residual loss signal in Method
  • Contributions list now has forward references and specific numbers

Code

  • Added analysis/exclusive_catch_rates.py: multi-rate exclusive catch analysis across all models and flag rates
  • Added Mistral 7B and Phi-3 Mini data collection scripts (results pending)
  • Nonlinear probe delta corrected (-0.041 for GPT-2, -0.019 for Qwen 14B)
  • Orphaned LaTeX section files moved to sections/legacy/
  • README fully rewritten to match paper v2.2.0

Citations

  • 20 cited, 20 in bib, zero orphans
  • New: Guan et al. 2025, Korbak et al. 2025

v2.1.0: Llama multi-layer sweep, split bootstrap, number audit, monitorability citations

13 Apr 02:42


Half the signal in standard activation probes is output confidence in disguise. After controlling for it with partial Spearman correlation, a stable linear signal remains across four transformer families and 11 model scales. Its strength varies by architecture family, not model size.

Key results

  • Partial correlation +0.282 +/- 0.001 after confidence controls (GPT-2 124M, 20 seeds, seed agreement +0.993)
  • Signal stable at rho_partial ~ +0.25 across 28x scale within Qwen 2.5 (0.5B-14B, 7-seed, matched ex/dim)
  • Cross-family divergence: Llama 3.2 3B produces +0.089 vs Qwen 3B +0.263, a 3.0x gap (permutation test p = 0.014)
  • 88% of variance in observability is between architecture families; 6% between scales
  • Nonlinear MLP does not exceed the linear probe at matched hyperparameters on any of 8 models tested
  • 9-10% of model errors are invisible to output confidence at every scale tested (GPT-2 124M through Qwen 14B)
  • Instruction tuning preserves the signal at all five Qwen scales (0.5B through 14B)
  • Mechanistic substrate (layer 0 attention, MLP suppression at layers 3-4) qualitatively identical between base and instruct on Qwen 7B

What changed since v2.0.0

  • Llama multi-layer sweep (L0, L7, L14, L21, L27): signal absent at every depth under both linear and nonlinear probing, strongest reading (+0.148) below Qwen's noise floor
  • Split-level bootstrap (Qwen 7B, 30 document resamples): rho_partial = +0.238, 95% CI [+0.215, +0.270], confirming signal is stable under data resampling
  • Gap corrected to 3.0x (from 2.8x) using v3 values
  • Number audit: all values verified against source JSONs, dual-protocol differences annotated
  • New citations: Guan et al. "Monitoring Monitorability" (OpenAI 2025), Korbak et al. "Chain of Thought Monitorability" (2025)
  • Paper prose tightened: abstract sharpened, discussion closing paragraph added, banned vocabulary removed, section openers rewritten
  • Orphaned LaTeX section files moved to sections/legacy/
  • Appendix stub reference removed

Reproducibility

All results are committed as JSON in results/. Model checkpoint hashes are pinned in results/model_revisions.json. Statistical analysis reproduces from committed data without GPU:

cd analysis && python run_all.py

Full experiment reproduction:

just test       # pytest suite
just check      # lint + format
just reproduce  # MLP + GPT-2 results (CPU/MPS/CUDA)

Architecture Predicts Linear Readability of Decision Quality in Transformers

12 Apr 18:10


Summary

Half the signal in standard activation probes is output confidence in disguise. After controlling for it with partial Spearman correlation, a stable linear signal remains across four transformer families and 11 model scales. Its strength varies by architecture family, not model size.

Key results

  • Partial correlation +0.282 +/- 0.001 after confidence controls (GPT-2 124M, 20 seeds, seed agreement +0.993)
  • Signal stable at rho_partial ~ +0.25 across 28x scale within Qwen 2.5 (0.5B-14B, 7-seed, matched ex/dim)
  • Cross-family divergence: Llama 3.2 3B produces +0.089 vs Qwen 3B +0.263, a 2.8x gap (permutation test p = 0.014)
  • 88% of variance in observability is between architecture families; 6% between scales
  • Nonlinear MLP does not exceed the linear probe at matched hyperparameters on any of 8 models tested
  • 9-10% of model errors are invisible to output confidence at every scale tested (GPT-2 124M through Qwen 14B)
  • Instruction tuning preserves the signal at all five Qwen scales (0.5B through 14B)
  • Mechanistic substrate (layer 0 attention, MLP suppression at layers 3-4) qualitatively identical between base and instruct on Qwen 7B

What changed since v1.0.0

  • Qwen 2.5 scaling extended to five scales (0.5B, 1.5B, 3B, 7B, 14B) with full control batteries
  • All scales at matched token budgets (350+ ex/dim, 600 for 0.5B)
  • Gemma 3 1B added as fourth architecture family
  • 8-model nonlinear probe comparison confirms signal is genuinely linear
  • Statistical framework: mixed-effects model, exact permutation test, ANCOVA
  • Token budget sensitivity analysis (7-point ex/dim sweep on Qwen 0.5B)
  • Repo reorganized with analysis/, archive/, and figure generation scripts

Reproducibility

All results are committed as JSON in results/. Model checkpoint hashes are pinned in results/model_revisions.json. Statistical analysis reproduces from committed data without GPU:

cd analysis && python run_all.py

Full experiment reproduction:

just test       # pytest suite
just check      # lint + format
just reproduce  # MLP + GPT-2 results (CPU/MPS/CUDA)

v1.0.0: Internal Quality Signals in Transformer Activations

07 Apr 06:33


Summary

Frozen transformer activations contain a linearly readable decision-quality signal that survives strong controls for output confidence. The signal replicates in GPT-2, Qwen 2.5 1.5B, and Llama 3.2 1B, but diverges by family at larger scale: Qwen preserves it to 7B, while Llama largely loses it above 1B under the same evaluation protocol.

Key results

  • Partial correlation +0.282 +/- 0.001 after confidence controls (GPT-2 124M, 20 seeds)
  • Signal stable across 12x scale within GPT-2 (+0.279 to +0.290)
  • Output-independent component increases with scale (+0.099 at 124M to +0.174 at 1.5B)
  • 4,368 exclusive high-loss catches at 10% flag rate that confidence misses (GPT-2 124M)
  • ~67% of the raw signal explained by named controls; ~33% remains unexplained
  • Partial mechanistic support localizes the signal to distributed mid-layer attention (layers 5-7, GPT-2 124M only)
  • Architecture-dependent scaling: Qwen preserves the signal to 7B, while Llama weakens sharply above 1B

Experimental arc

  • Phases 1-3: Hand-designed observers collapse under confidence controls
  • Phase 4: Learned binary heads recover signal on frozen MLP activations
  • Phase 5: Transfer to GPT-2 124M with +0.99 seed agreement
  • Phase 6: Catch errors confidence misses (4,368 exclusive at 10% flag rate)
  • Phase 7: Outperform a 24,576-feature SAE probe (+0.290 vs +0.255)
  • Phase 8: Hold stable across GPT-2 124M to 1.5B
  • Phase 9: Replicate in Qwen and Llama at 1-1.5B, then diverge by family at larger scale

Reproducibility

All results are committed as JSON in results/. Model checkpoint hashes are pinned in results/model_revisions.json. The environment is locked via uv.lock.

just test       # 38 tests, ~2s
just reproduce  # full reproduction, ~60 min

Phase 9: Cross-family replication (GPT-2, Qwen, Llama)

06 Apr 19:22


The learned observer signal replicates across three independent architecture families with positive
output-controlled residuals in every case.

Summary

Phase 9 tests whether the signal from Phases 5-8 is a GPT-2-specific artifact or a broader property of pretrained
decoder-only transformers. Under the same evaluation protocol (layer sweep, three-seed battery, output-controlled
residual, negative baselines), Qwen 2.5 and Llama 3.2 both replicate the core finding. Hand-designed baselines
collapse in every family.

Also in this release: Phase 5f directional ablation (partial causal evidence), deep-merge fix for results JSON,
float16 inference, examples-per-dimension token scaling, lazy imports for test isolation, and Colab notebook for GPU
reproduction.

Cross-family results

Model          Family  Params  Partial corr  Output-controlled  Seed agreement
GPT-2 XL       GPT-2   1558M   +0.290        +0.174             +0.952
Qwen 2.5 1.5B  Qwen    1544M   +0.284        +0.207             +0.982
Llama 3.2 1B   Meta    1236M   +0.250        +0.126             +0.999

What's next

Frontier scale (8B+), cross-domain transfer, and actionability (adaptive inference, abstention).

Phase 8: Stable decision-quality signal across GPT-2 scaling, with increasing output-independence

06 Apr 05:49


Decision-quality signal persists across GPT-2 124M to 1.5B with stable partial correlation (+0.279 to +0.290) and an output-independent component that increases from +0.099 to +0.174.

Summary

This release completes the eight-phase experimental arc from structural comparison to scale characterization.

Phases 1-3 show that hand-designed activation observers collapse under partial-correlation controls. Phases 4-5
recover the signal with learned linear projections trained under binary supervision. Phase 6 shows complementary
error coverage beyond confidence. Phase 7 compares raw residual-stream observers against SAE-based probes. Phase 8
tests the result across GPT-2 124M, 355M, 774M, and 1.5B, finding stable signal strength, high seed agreement (0.88-0.95), and an output-independent component that increases across this scaling curve.

Results

Model         Params  Peak layer  Partial corr  Output-controlled  Seed agreement
GPT-2         124M    L8          +0.290        +0.099             +0.918
GPT-2 Medium  355M    L16         +0.279        +0.103             +0.877
GPT-2 Large   774M    L24         +0.286        +0.164             +0.901
GPT-2 XL      1558M   L34         +0.290        +0.174             +0.952

What's next

Causal validation and cross-architecture scaling (Llama) are the active focus.