From bc6dadb580ec54a40363236f5f682a131912f2e7 Mon Sep 17 00:00:00 2001 From: Abay Bektursun Date: Thu, 26 Mar 2026 14:03:03 -0500 Subject: [PATCH 01/11] Non-record: Your 16 MB model is 272 MB at eval time MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Study of eval-time n-gram caching — a technique that reduces BPB from 1.11 to 0.38 while preserving strict causality, costing zero artifact bytes, but growing the effective model to 17x the artifact limit. Includes single-GPU ablations, 8-GPU all-reduce results (0.49 BPB in 401s, under 600s budget), alpha sweep, and a comparison of competition eval setup vs real-world inference constraints. Proposes four rule clarifications to align the competition with deployment realities. Co-Authored-By: Claude Opus 4.6 (1M context) --- .../README.md | 163 ++++++++++++++++++ .../submission.json | 31 ++++ 2 files changed, 194 insertions(+) create mode 100644 records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md create mode 100644 records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/submission.json diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md new file mode 100644 index 000000000..32ea44491 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md @@ -0,0 +1,163 @@ +# Non-Record: Eval-Time N-gram Mixing and the Unbounded Model Growth Problem + +**Author:** abaybektursun | **Date:** 2026-03-26 | **Track:** Non-record study + +This submission is not a leaderboard entry. It is a study of eval-time n-gram caching — a technique that reduces BPB from 1.11 to 0.38 while preserving strict causality, costing zero artifact bytes, but growing the effective model to 17x the artifact limit at eval time. 
We present results, explain why this creates a dilemma for the competition, and propose rule clarifications. + +--- + +## Results + +All runs use the PR #549 base model (~1.1194 BPB, 11L/512d, ~16MB artifact). Single GPU, stride=64, FineWeb val (62M tokens). + +| Config | BPB | Eval-time state | Effective model | Time | +|--------|----:|----------------:|----------------:|-----:| +| Neural only (int6 quantized, leaderboard) | 1.1142 | 0 MB | 16 MB | 606s | +| Neural only (float, pre-quant) | 1.1109 | 0 MB | 16 MB | 606s | +| Pure n-gram, no neural model | 1.0615 | 192 MB | 192 MB | 535s | +| Fixed 7-gram, alpha=0.40 | 0.5234 | 192 MB | 208 MB | 824s | +| Backoff 2-7, alpha=0.40 | 0.4923 | 192 MB | 208 MB | 1079s | +| Backoff 2-7, entropy-adaptive alpha | 0.6535 | 192 MB | 208 MB | 1114s | +| **Backoff 2-9, order-adaptive entropy** | **0.3779** | **256 MB** | **272 MB** | **1234s** | + +The n-gram cache alone — with no neural model — beats the 27M-parameter transformer (1.06 vs 1.11 BPB). Combined, it cuts BPB by 66%. + +### 8-GPU results with all-reduce sync (EXP-11) + +These results fit within the 600s competition eval budget. All-reduce sync cost: 1.6–2.0s total. + +| Config | BPB | Time | Cache | Sync cost | +|--------|----:|-----:|-------|-----------| +| Neural only (8-GPU) | 1.1130 | 110s | None | — | +| Backoff 2-7, α=0.40 | 0.4941 | 401s | Global (all-reduce) | 1.6s | +| Backoff 2-9, α=0.40 | 0.4548 | 500s | Global (all-reduce) | 1.9s | +| Backoff 2-7, **α=0.80** | **0.3942** | 939s | Global (all-reduce) | ~2.0s | + +Alpha sweep (8-GPU, backoff 2-7): α=0.20 → 0.6180, α=0.40 → 0.4941, α=0.60 → 0.4263, α=0.80 → 0.3942. Higher alpha is monotonically better — the opposite of PR #727's finding. With a global cache, the n-gram is reliable enough that the model should defer to it more, not less. + +### What the n-gram cache is + +After each token is scored by the neural model, the token and its preceding context are inserted into hash tables. 
When a future token's context matches a previously seen n-gram, the cached frequency estimate is mixed with the neural prediction: + +``` +p_mix = (1 - alpha) * p_neural + alpha * p_ngram +``` + +The tables are built exclusively from already-scored tokens. No future tokens are accessed. Strict causality is preserved. + +### What the n-gram cache costs + +| Config | Hash table memory | Formula | +|--------|------------------:|---------| +| Orders 2-7 (6 orders) | 192 MB | 6 orders x 2 tables x 4M buckets x 4 bytes | +| Orders 2-9 (8 orders) | 256 MB | 8 orders x 2 tables x 4M buckets x 4 bytes | +| Orders 2-9, 64M buckets | 4,096 MB | 8 orders x 2 tables x 64M buckets x 4 bytes | + +None of this counts toward the 16MB artifact limit. The tables are empty at the start of evaluation and grow as tokens are scored. By the end of evaluation, the model that is doing the actual prediction is 16MB of neural weights plus 256MB of hash tables — **272 MB total**. + +--- + +## The Dilemma + +The competition constrains the artifact to 16MB. The intent is clear: force creative compression of model knowledge into a small footprint. But eval-time techniques like n-gram caching, TTT, and LoRA adaptation grow the effective model far beyond 16MB during evaluation — legally, because the rules only constrain the artifact, not the eval-time state. + +This creates a gap between what the competition measures and what matters in practice. + +### Four dimensions of the gap + +| | Competition | Real-world inference | +|--|-------------|---------------------| +| **Corpus** | Fixed 62M tokens, scored in one pass | Streaming queries, each independent | +| **Time budget** | 600 seconds for the entire corpus | < 100ms per token, real-time | +| **Hardware** | 8x H100 80GB (640 GB VRAM) | Often 1 GPU, sometimes CPU | +| **Model size** | 16 MB artifact; eval-time state unconstrained | Total model must fit deployment target | + +Each dimension matters: + +**1. 
Inference time.** The competition allows 600 seconds to score 62M tokens. The n-gram cache exploits this by doing O(K) hash lookups per token across K orders, plus table updates after scoring. On a single GPU, our best config takes 1234s — already over budget. On 8 GPUs with all-reduce sync (EXP-11, implemented but not yet deployed), we estimate ~130s. In real-world inference, you serve one token at a time with a latency budget measured in milliseconds. There is no batch of 62M tokens to amortize over. + +**2. Inference hardware.** The competition provides 8x H100 with 640GB of combined VRAM. The hash tables (256 MB per GPU, synced via all-reduce) are negligible relative to this. In deployment, models run on single GPUs, edge devices, or CPUs. The 256MB of hash tables alone exceeds the 16MB artifact by 16x. + +**3. Competition setup.** The artifact limit constrains what you ship. But the n-gram cache ships nothing — it materializes at eval time from the scored tokens themselves. The 16MB limit was designed to constrain model capacity. The n-gram cache circumvents this by building an unbounded statistical model during evaluation, limited only by the number of hash buckets allocated. + +**4. Real-world evaluation.** In production, a language model scores individual prompts. Each query arrives independently. There is no corpus-level repetition to exploit. The n-gram cache's power comes entirely from within-corpus repetition — repeated documents, boilerplate, subword completion patterns, common phrases. This is **compression**, not **language modeling**. It works because FineWeb val has structure that repeats across its 62M tokens. On a stream of independent queries, the cache starts empty for each request and provides no benefit. 
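To make the within-corpus dependence concrete, here is a toy sketch (hypothetical character streams, not the competition harness) comparing hit rates when one causal cache persists across a repetitive corpus versus a fresh cache per independent request:

```python
def context_hit_rate(requests, persist_cache):
    """Score-first cache: check the bigram context, then insert it (strictly causal)."""
    cache, hits, total = set(), 0, 0
    for tokens in requests:
        if not persist_cache:
            cache = set()  # independent queries: state resets per request
        for i in range(2, len(tokens)):
            ctx = tuple(tokens[i - 2:i])
            total += 1
            if ctx in cache:
                hits += 1
            cache.add(ctx)  # insert only after this position is scored
    return hits / total

corpus = [list("the cat sat on the mat. the cat sat on the mat.")]  # internal repetition
queries = [list("the cat sat"), list("on the mat.")]                # independent requests

print(context_hit_rate(corpus, persist_cache=True))    # high: repeats are cached
print(context_hit_rate(queries, persist_cache=False))  # 0.0: nothing to reuse
```

The real cache mixes count-based probabilities across multiple orders; a membership set is enough to show where the benefit comes from — and that it vanishes when state resets per request.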
+ +### The core tension + +The competition implicitly asks: **given N bytes of model, how well can you compress natural language?** + +Eval-time caching answers a different question: **given N bytes of model plus unbounded eval-time memory, how well can you compress a specific fixed corpus?** + +These are different problems. The second has a much lower floor — any corpus with internal repetition can be compressed toward its empirical entropy by memorizing seen patterns. Our results show the gap is enormous: 1.11 BPB (neural only) vs 0.38 BPB (neural + cache). The cache contributes 2/3 of the total compression, yet costs zero artifact bytes. + +--- + +## What's already legal and where the line blurs + +The competition already permits eval-time model growth through several mechanisms: + +| Technique | Eval-time state growth | Legality status | +|-----------|----------------------:|----| +| Sliding window eval (stride < seq_len) | KV cache, ~20 MB | Uncontroversial | +| Test-time training (score-first TTT) | LoRA deltas, ~2 MB | Approved (PRs #549, #548) | +| Per-document LoRA TTT (8 epochs) | LoRA deltas, ~2 MB | Approved (PR #596, 0.62 BPB) | +| N-gram cache (backoff 2-7) | Hash tables, 192 MB | Under review | +| N-gram cache (backoff 2-9, 64M buckets) | Hash tables, 4 GB | Under review | + +TTT and LoRA adaptation are already approved. They also grow the model at eval time (LoRA weights are not in the artifact), though the growth is modest (~2 MB). The n-gram cache follows the same principle — build state from scored tokens — but at 100x the scale. + +The question is not whether causality is preserved (it is), but whether unbounded eval-time model growth is in the spirit of the 16MB constraint. + +--- + +## Proposal + +We suggest the competition consider one or more of the following clarifications: + +**Option A: Cap eval-time state.** Define a total memory budget for eval-time state (e.g., artifact + eval-time state <= 32 MB or 64 MB). 
This directly constrains the effective model size and aligns the competition with deployment realities. + +**Option B: Per-token compute budget.** Instead of a wall-clock limit for the entire corpus, define a per-token compute budget (e.g., max FLOPs per token). This prevents techniques that amortize expensive corpus-level operations. + +**Option C: Evaluate on independent documents.** Score each document independently with a fresh model state (no carry-over between documents). This eliminates cross-document repetition exploitation while still allowing within-document TTT and caching. + +**Option D: Accept eval-time growth, but measure it.** Keep current rules but require submissions to report their peak eval-time state size alongside val_bpb. This makes the tradeoff transparent: "0.38 BPB at 272 MB effective model" tells a different story than "0.38 BPB at 16 MB." + +We believe **Option A** or **Option D** would be the simplest to implement and the least disruptive to existing submissions. + +--- + +## Surprising findings + +1. **Global cache vs partitioned cache:** On 8 GPUs with independent caches (as in PRs #727, #788), each GPU sees 1/8 of the tokens. This degrades BPB from ~0.49 (global) to ~0.97 (partitioned) — a 0.48 BPB gap from cache fragmentation alone. Our EXP-11 implementation solves this with all-reduce sync of hash table deltas across GPUs, giving every GPU the global cache state. + +2. **Entropy-adaptive alpha hurts with strong caches:** The sigmoid-gated alpha from PR #727 (which reduces n-gram weight when the neural model is confident) gives 0.65 BPB — 0.16 BPB *worse* than fixed alpha=0.40 (0.49 BPB). With a global cache, the n-gram is often more reliable than the neural model, and the entropy gate is too conservative. + +3. **N-gram alone beats the neural model:** Pure n-gram (no neural model at all) achieves 1.06 BPB vs 1.11 BPB for the neural model. 
A zero-parameter frequency table built from scored tokens predicts FineWeb better than a 27M-parameter transformer. + +4. **Three compression phenomena:** The n-gram cache captures (a) deterministic BPE subword completion (orders 2-4), (b) common English collocations (orders 4-6), and (c) verbatim document repetition (orders 6+). Only (c) is corpus-specific. + +--- + +## Reproduction + +All scripts are in `experiments/eval_time_mixing/scripts/`: + +```bash +# Single-GPU experiments (EXP-0, requires 1xH100 + trained model) +python3 experiments/eval_time_mixing/scripts/eval_ngram.py \ + --model final_model.pt --exp backoff_7 + +# 8-GPU distributed with global cache (EXP-11) +NGRAM_ENABLED=1 NGRAM_ORDER=9 NGRAM_ALPHA=0.40 \ + torchrun --standalone --nproc_per_node=8 \ + experiments/eval_time_mixing/scripts/eval_ngram_distributed.py + +# N-gram match analysis (qualitative) +python3 experiments/eval_time_mixing/scripts/analyze_ngram_matches.py +``` + +Base model: `train_609_val_calib.py` from `records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/`. + +## Credits + +N-gram cache concept and initial implementations: [PR #727](https://github.com/openai/parameter-golf/pull/727), [PR #779](https://github.com/openai/parameter-golf/pull/779), [PR #788](https://github.com/openai/parameter-golf/pull/788). Competition design and infrastructure: OpenAI. 
diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/submission.json b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/submission.json new file mode 100644 index 000000000..5d41c7880 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/submission.json @@ -0,0 +1,31 @@ +{ + "author": "abaybektursun", + "github_id": "abaybektursun", + "name": "Eval-Time N-gram Mixing and the Unbounded Model Growth Problem", + "blurb": "Study of eval-time n-gram caching: a strictly causal technique that reduces BPB from 1.11 to 0.38 while growing the effective model from 16MB to 272MB at eval time. Presents results, analyzes the gap between competition setup and real-world inference constraints, and proposes rule clarifications.", + "date": "2026-03-26", + "track": "non_record_study", + "val_bpb_neural_only": 1.1109, + "val_bpb_best_ngram": 0.3779, + "val_bpb_ngram_only_no_neural": 1.0615, + "artifact_bytes": 15866156, + "eval_time_state_bytes_best": 268435456, + "effective_model_bytes_best": 284301612, + "base_model_pr": 549, + "base_model_record": "records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072", + "hardware": "1xH100 80GB SXM (single-GPU experiments)", + "experiments_run": [ + "baseline", + "ngram_only_7", + "fixed_7gram", + "backoff_7", + "backoff_7_ent", + "backoff_9_ent_oadapt" + ], + "scripts": [ + "experiments/eval_time_mixing/scripts/eval_ngram.py", + "experiments/eval_time_mixing/scripts/eval_ngram_distributed.py", + "experiments/eval_time_mixing/scripts/analyze_ngram_matches.py" + ], + "technique_summary": "Eval-time n-gram hash tables with multi-order backoff, linear probability mixing, strict score-first causality" +} From 4555280179944f63c90a7c73572a6a26d8651ee0 Mon Sep 17 00:00:00 2001 From: Abay Bektursun Date: Thu, 26 Mar 2026 14:09:26 -0500 Subject: [PATCH 02/11] Simplify proposal to single option: cap eval-time state Co-Authored-By: Claude Opus 
4.6 (1M context) --- .../README.md | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md index 32ea44491..63f0e27ae 100644 --- a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md @@ -110,19 +110,13 @@ The question is not whether causality is preserved (it is), but whether unbounde --- -## Proposal +## Proposal: Cap eval-time state -We suggest the competition consider one or more of the following clarifications: +Define a total memory budget for eval-time state — for example, artifact + eval state <= 64 MB. This directly constrains the effective model size and aligns the competition with deployment realities. Simple to enforce: measure peak GPU memory allocation during eval. -**Option A: Cap eval-time state.** Define a total memory budget for eval-time state (e.g., artifact + eval-time state <= 32 MB or 64 MB). This directly constrains the effective model size and aligns the competition with deployment realities. +This extends the 16 MB artifact philosophy to cover the full model at inference time. A model that fits in 16 MB but needs 272 MB to run doesn't fit in 16 MB. -**Option B: Per-token compute budget.** Instead of a wall-clock limit for the entire corpus, define a per-token compute budget (e.g., max FLOPs per token). This prevents techniques that amortize expensive corpus-level operations. - -**Option C: Evaluate on independent documents.** Score each document independently with a fresh model state (no carry-over between documents). This eliminates cross-document repetition exploitation while still allowing within-document TTT and caching. 
- -**Option D: Accept eval-time growth, but measure it.** Keep current rules but require submissions to report their peak eval-time state size alongside val_bpb. This makes the tradeoff transparent: "0.38 BPB at 272 MB effective model" tells a different story than "0.38 BPB at 16 MB." - -We believe **Option A** or **Option D** would be the simplest to implement and the least disruptive to existing submissions. +This would not disqualify any currently approved techniques. KV caches (~20 MB), TTT LoRA deltas (~2 MB), and sliding window eval all fit comfortably within a 64 MB cap. It only constrains the techniques that grow the model by 10–250x during evaluation. --- From c2ca0a6cce943026ca910c3fe4f594a391babcee Mon Sep 17 00:00:00 2001 From: Abay Bektursun Date: Thu, 26 Mar 2026 17:07:41 -0500 Subject: [PATCH 03/11] Fix factual errors in study README MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Base model is ValCalib GPTQ (1.1142 BPB), not PR #549 (1.1194) - Remove stale "not yet deployed" / "we estimate" for EXP-11 - Note α=0.80 (939s) exceeds 600s budget - Fix PR #727 score to 0.9674, PR #788 to 0.9059 - Fix PR #596 BPB to 0.6430 - "Approved" → "Technique deemed legal" for closed PRs - Add bucket sweep and per-token overhead proposal - Replace "neural" with "base LM" throughout Co-Authored-By: Claude Opus 4.6 (1M context) --- .../README.md | 73 ++++++++++++------- 1 file changed, 47 insertions(+), 26 deletions(-) diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md index 63f0e27ae..282a8692e 100644 --- a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md @@ -8,39 +8,52 @@ This submission is not a leaderboard entry. 
It is a study of eval-time n-gram ca ## Results -All runs use the PR #549 base model (~1.1194 BPB, 11L/512d, ~16MB artifact). Single GPU, stride=64, FineWeb val (62M tokens). +All runs use the ValCalib GPTQ base model (1.1142 BPB leaderboard score, 11L/512d, ~16MB artifact, from `records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/`). Single GPU, stride=64, FineWeb val (62M tokens). | Config | BPB | Eval-time state | Effective model | Time | |--------|----:|----------------:|----------------:|-----:| -| Neural only (int6 quantized, leaderboard) | 1.1142 | 0 MB | 16 MB | 606s | -| Neural only (float, pre-quant) | 1.1109 | 0 MB | 16 MB | 606s | -| Pure n-gram, no neural model | 1.0615 | 192 MB | 192 MB | 535s | +| Base LM (int6 quantized, leaderboard) | 1.1142 | 0 MB | 16 MB | 606s | +| Base LM (float, pre-quant) | 1.1109 | 0 MB | 16 MB | 606s | +| Pure n-gram, no base LM | 1.0615 | 192 MB | 192 MB | 535s | | Fixed 7-gram, alpha=0.40 | 0.5234 | 192 MB | 208 MB | 824s | | Backoff 2-7, alpha=0.40 | 0.4923 | 192 MB | 208 MB | 1079s | | Backoff 2-7, entropy-adaptive alpha | 0.6535 | 192 MB | 208 MB | 1114s | | **Backoff 2-9, order-adaptive entropy** | **0.3779** | **256 MB** | **272 MB** | **1234s** | -The n-gram cache alone — with no neural model — beats the 27M-parameter transformer (1.06 vs 1.11 BPB). Combined, it cuts BPB by 66%. +The n-gram cache alone — with no base LM — beats the trained model (1.06 vs 1.11 BPB). Combined, it cuts BPB by 66%. ### 8-GPU results with all-reduce sync (EXP-11) -These results fit within the 600s competition eval budget. All-reduce sync cost: 1.6–2.0s total. +All-reduce sync cost: 1.6–2.0s total. The first three configs fit within the 600s competition eval budget; α=0.80 exceeds it (939s). 
| Config | BPB | Time | Cache | Sync cost | |--------|----:|-----:|-------|-----------| -| Neural only (8-GPU) | 1.1130 | 110s | None | — | +| Base LM (8-GPU) | 1.1130 | 110s | None | — | | Backoff 2-7, α=0.40 | 0.4941 | 401s | Global (all-reduce) | 1.6s | | Backoff 2-9, α=0.40 | 0.4548 | 500s | Global (all-reduce) | 1.9s | -| Backoff 2-7, **α=0.80** | **0.3942** | 939s | Global (all-reduce) | ~2.0s | +| Backoff 2-7, α=0.80 | 0.3942 | 939s | Global (all-reduce) | ~2.0s | -Alpha sweep (8-GPU, backoff 2-7): α=0.20 → 0.6180, α=0.40 → 0.4941, α=0.60 → 0.4263, α=0.80 → 0.3942. Higher alpha is monotonically better — the opposite of PR #727's finding. With a global cache, the n-gram is reliable enough that the model should defer to it more, not less. +Alpha sweep (8-GPU, backoff 2-7): α=0.20 → 0.6180, α=0.40 → 0.4941, α=0.60 → 0.4263, α=0.80 → 0.3942. Higher alpha is monotonically better — the opposite of PR #727's finding (0.9674 BPB). With a global cache, the n-gram is reliable enough that the model should defer to it more, not less. The best alpha (0.80) exceeds the time budget, so in practice α=0.40–0.60 is the operating range. + +### Hash collision analysis + +We swept bucket counts from 1M to 256M expecting more buckets = fewer collisions = better accuracy. The opposite happened: + +| Buckets | BPB | Table memory | +|--------:|----:|---------:| +| 1M | 0.5793 | 48 MB | +| 4M | 0.6535 | 192 MB | +| 64M | 1.0629 | 3 GB | +| 256M | 1.1123 | 12 GB | + +With 256M buckets, the table is so sparse that most n-grams have count=1 and fail the `min_count ≥ 2` threshold. Collisions at smaller table sizes merge similar n-grams together, artificially boosting counts above threshold. The hash table is functioning as a lossy count-min sketch, not an exact lookup. [Standard literature](https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch) treats collisions as error to minimize. The BPB improvement depends on this interaction between hash density and the count threshold. 
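The effect can be reproduced with a toy hashed counter — sizes here are illustrative, not the study's 4M/64M tables. Distinct singleton n-grams only clear the `min_count ≥ 2` threshold when collisions merge their counts:

```python
def usable_fraction(ngrams, n_buckets, min_count=2):
    """Fraction of distinct n-grams whose (collision-merged) bucket count clears min_count."""
    buckets = [0] * n_buckets
    for g in ngrams:
        buckets[hash(g) % n_buckets] += 1
    distinct = set(ngrams)
    return sum(buckets[hash(g) % n_buckets] >= min_count for g in distinct) / len(distinct)

# 10,000 distinct n-grams, each appearing once (count=1 without collisions)
ngrams = [(i, i + 1, i + 2) for i in range(10_000)]

dense = usable_fraction(ngrams, n_buckets=1_000)        # small table: collisions lift counts
sparse = usable_fraction(ngrams, n_buckets=10_000_000)  # huge table: most stay below threshold
print(dense, sparse)  # dense >> sparse
```

With ~10 items per bucket, nearly every singleton passes the threshold; with 10M buckets almost none do — the same inversion the sweep shows, without any of the study's model machinery.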
### What the n-gram cache is -After each token is scored by the neural model, the token and its preceding context are inserted into hash tables. When a future token's context matches a previously seen n-gram, the cached frequency estimate is mixed with the neural prediction: +After each token is scored by the base LM, the token and its preceding context are inserted into hash tables. When a future token's context matches a previously seen n-gram, the cached frequency estimate is mixed with the prediction: ``` -p_mix = (1 - alpha) * p_neural + alpha * p_ngram +p_mix = (1 - alpha) * p_model + alpha * p_ngram ``` The tables are built exclusively from already-scored tokens. No future tokens are accessed. Strict causality is preserved. @@ -53,7 +66,7 @@ The tables are built exclusively from already-scored tokens. No future tokens ar | Orders 2-9 (8 orders) | 256 MB | 8 orders x 2 tables x 4M buckets x 4 bytes | | Orders 2-9, 64M buckets | 4,096 MB | 8 orders x 2 tables x 64M buckets x 4 bytes | -None of this counts toward the 16MB artifact limit. The tables are empty at the start of evaluation and grow as tokens are scored. By the end of evaluation, the model that is doing the actual prediction is 16MB of neural weights plus 256MB of hash tables — **272 MB total**. +None of this counts toward the 16MB artifact limit. The tables are empty at the start of evaluation and grow as tokens are scored. By the end of evaluation, the model that is doing the actual prediction is 16MB of weights plus 256MB of hash tables — **272 MB total**. 
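The memory formulas in the cost table reduce to one line; a sketch (MiB arithmetic, matching the table's rounded MB figures):

```python
MIB = 2**20

def cache_bytes(n_orders, buckets):
    # orders x 2 tables x buckets x 4-byte counters (formula from the cost table)
    return n_orders * 2 * buckets * 4

print(cache_bytes(6, 4 * MIB) // MIB)   # 192  (orders 2-7)
print(cache_bytes(8, 4 * MIB) // MIB)   # 256  (orders 2-9)
print(cache_bytes(8, 64 * MIB) // MIB)  # 4096 (orders 2-9, 64M buckets)
```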
--- @@ -68,19 +81,21 @@ This creates a gap between what the competition measures and what matters in pra | | Competition | Real-world inference | |--|-------------|---------------------| | **Corpus** | Fixed 62M tokens, scored in one pass | Streaming queries, each independent | -| **Time budget** | 600 seconds for the entire corpus | < 100ms per token, real-time | +| **Time budget** | 600 seconds for 62M tokens | < 200ms per request | | **Hardware** | 8x H100 80GB (640 GB VRAM) | Often 1 GPU, sometimes CPU | | **Model size** | 16 MB artifact; eval-time state unconstrained | Total model must fit deployment target | Each dimension matters: -**1. Inference time.** The competition allows 600 seconds to score 62M tokens. The n-gram cache exploits this by doing O(K) hash lookups per token across K orders, plus table updates after scoring. On a single GPU, our best config takes 1234s — already over budget. On 8 GPUs with all-reduce sync (EXP-11, implemented but not yet deployed), we estimate ~130s. In real-world inference, you serve one token at a time with a latency budget measured in milliseconds. There is no batch of 62M tokens to amortize over. +**1. Inference time.** The competition allows 600 seconds to score 62M tokens. The n-gram cache exploits this by doing O(K) hash lookups per token across K orders, plus table updates after scoring. On a single GPU, our best config takes 1234s. On 8 GPUs with all-reduce sync (EXP-11), backoff 2-7 takes 401s. In real-world inference, you serve one request at a time with a latency budget measured in milliseconds. **2. Inference hardware.** The competition provides 8x H100 with 640GB of combined VRAM. The hash tables (256 MB per GPU, synced via all-reduce) are negligible relative to this. In deployment, models run on single GPUs, edge devices, or CPUs. The 256MB of hash tables alone exceeds the 16MB artifact by 16x. **3. Competition setup.** The artifact limit constrains what you ship. 
But the n-gram cache ships nothing — it materializes at eval time from the scored tokens themselves. The 16MB limit was designed to constrain model capacity. The n-gram cache circumvents this by building an unbounded statistical model during evaluation, limited only by the number of hash buckets allocated. -**4. Real-world evaluation.** In production, a language model scores individual prompts. Each query arrives independently. There is no corpus-level repetition to exploit. The n-gram cache's power comes entirely from within-corpus repetition — repeated documents, boilerplate, subword completion patterns, common phrases. This is **compression**, not **language modeling**. It works because FineWeb val has structure that repeats across its 62M tokens. On a stream of independent queries, the cache starts empty for each request and provides no benefit. +**4. Real-world evaluation.** In production, a language model scores individual prompts. Each query arrives independently. There is no corpus-level repetition to exploit. The n-gram cache's power comes entirely from within-corpus repetition. On a stream of independent queries, the cache starts empty for each request and provides no benefit. + +**5. Inference speed.** The n-gram cache roughly doubles eval time (606s → 1,079s for backoff 2-7). The overhead is constant per token — it doesn't get worse as the cache fills — but a flat 2x slowdown matters when your latency budget is 50–200ms. You pay the per-token cost on every request, but you only get the BPB benefit after millions of tokens of contiguous corpus. On a 500-token prompt, you get the slowdown without the payoff. ### The core tension @@ -88,7 +103,7 @@ The competition implicitly asks: **given N bytes of model, how well can you comp Eval-time caching answers a different question: **given N bytes of model plus unbounded eval-time memory, how well can you compress a specific fixed corpus?** -These are different problems. 
The second has a much lower floor — any corpus with internal repetition can be compressed toward its empirical entropy by memorizing seen patterns. Our results show the gap is enormous: 1.11 BPB (neural only) vs 0.38 BPB (neural + cache). The cache contributes 2/3 of the total compression, yet costs zero artifact bytes. +These are different problems. The second has a much lower floor — any corpus with internal repetition can be compressed toward its empirical entropy by memorizing seen patterns. Our results show the gap is enormous: 1.11 BPB (base LM only) vs 0.38 BPB (base LM + cache). The cache contributes 2/3 of the total compression, yet costs zero artifact bytes. --- @@ -99,34 +114,40 @@ The competition already permits eval-time model growth through several mechanism | Technique | Eval-time state growth | Legality status | |-----------|----------------------:|----| | Sliding window eval (stride < seq_len) | KV cache, ~20 MB | Uncontroversial | -| Test-time training (score-first TTT) | LoRA deltas, ~2 MB | Approved (PRs #549, #548) | -| Per-document LoRA TTT (8 epochs) | LoRA deltas, ~2 MB | Approved (PR #596, 0.62 BPB) | +| Test-time training (score-first TTT) | LoRA deltas, ~2 MB | Technique deemed legal (PRs #549, #548) | +| Per-document LoRA TTT (8 epochs) | LoRA deltas, ~2 MB | Technique deemed legal (PR #596, 0.6430 BPB) | | N-gram cache (backoff 2-7) | Hash tables, 192 MB | Under review | | N-gram cache (backoff 2-9, 64M buckets) | Hash tables, 4 GB | Under review | -TTT and LoRA adaptation are already approved. They also grow the model at eval time (LoRA weights are not in the artifact), though the growth is modest (~2 MB). The n-gram cache follows the same principle — build state from scored tokens — but at 100x the scale. - -The question is not whether causality is preserved (it is), but whether unbounded eval-time model growth is in the spirit of the 16MB constraint. 
+TTT and LoRA adaptation follow the same principle as the n-gram cache — build state from scored tokens — though the growth is modest (~2 MB vs 192 MB). The question is not whether causality is preserved (it is), but whether unbounded eval-time model growth is in the spirit of the 16MB constraint. --- -## Proposal: Cap eval-time state +## Proposal + +### 1. Cap eval-time memory Define a total memory budget for eval-time state — for example, artifact + eval state <= 64 MB. This directly constrains the effective model size and aligns the competition with deployment realities. Simple to enforce: measure peak GPU memory allocation during eval. This extends the 16 MB artifact philosophy to cover the full model at inference time. A model that fits in 16 MB but needs 272 MB to run doesn't fit in 16 MB. -This would not disqualify any currently approved techniques. KV caches (~20 MB), TTT LoRA deltas (~2 MB), and sliding window eval all fit comfortably within a 64 MB cap. It only constrains the techniques that grow the model by 10–250x during evaluation. +### 2. Cap per-token overhead + +Require that eval-time techniques do not increase per-token latency by more than 50% over the base model forward pass on the same hardware. Not an absolute number — a ratio. Hardware-agnostic and easy to measure: run eval with and without the technique. + +Base LM on 8×H100 takes 110s. A 1.5× cap means 165s max. The n-gram cache takes 401s (3.6×). KV cache, TTT, LoRA are all within 1.5×. This also catches two-pass rescoring mechanically. + +Both proposals preserve everything currently approved. KV caches (~20 MB), TTT LoRA deltas (~2 MB), and sliding window eval all fit comfortably. They only constrain techniques that grow the model by 10–250x during evaluation. --- ## Surprising findings -1. **Global cache vs partitioned cache:** On 8 GPUs with independent caches (as in PRs #727, #788), each GPU sees 1/8 of the tokens. 
This degrades BPB from ~0.49 (global) to ~0.97 (partitioned) — a 0.48 BPB gap from cache fragmentation alone. Our EXP-11 implementation solves this with all-reduce sync of hash table deltas across GPUs, giving every GPU the global cache state. +1. **Global cache vs partitioned cache:** On 8 GPUs with independent caches (as in PRs #727 at 0.9674 BPB, #788 at 0.9059 BPB), each GPU sees 1/8 of the tokens. This degrades BPB from ~0.49 (global) to ~0.91–0.97 (partitioned). Our EXP-11 implementation solves this with all-reduce sync of hash table deltas across GPUs. -2. **Entropy-adaptive alpha hurts with strong caches:** The sigmoid-gated alpha from PR #727 (which reduces n-gram weight when the neural model is confident) gives 0.65 BPB — 0.16 BPB *worse* than fixed alpha=0.40 (0.49 BPB). With a global cache, the n-gram is often more reliable than the neural model, and the entropy gate is too conservative. +2. **Entropy-adaptive alpha hurts with strong caches:** The sigmoid-gated alpha from PR #727 gives 0.65 BPB — 0.16 BPB *worse* than fixed alpha=0.40 (0.49 BPB). With a global cache, the n-gram is often more reliable than the base LM, and the entropy gate is too conservative. -3. **N-gram alone beats the neural model:** Pure n-gram (no neural model at all) achieves 1.06 BPB vs 1.11 BPB for the neural model. A zero-parameter frequency table built from scored tokens predicts FineWeb better than a 27M-parameter transformer. +3. **N-gram alone beats the base LM:** Pure n-gram (no base LM at all) achieves 1.06 BPB vs 1.11 BPB for the trained model. A zero-parameter frequency table built from scored tokens predicts FineWeb better than the trained model. 4. **Three compression phenomena:** The n-gram cache captures (a) deterministic BPE subword completion (orders 2-4), (b) common English collocations (orders 4-6), and (c) verbatim document repetition (orders 6+). Only (c) is corpus-specific. 
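The delta-sync idea behind finding 1 can be checked with a serial stand-in for the all-reduce (a sketch — the distributed implementation lives in `eval_ngram_distributed.py`): summing per-GPU count deltas reconstructs exactly the counts a single global pass would produce.

```python
import numpy as np

N_BUCKETS, N_WORKERS = 1024, 8
rng = np.random.default_rng(0)

# bucket indices of n-grams inserted while scoring the corpus
global_stream = rng.integers(0, N_BUCKETS, size=8_000)
shards = np.array_split(global_stream, N_WORKERS)  # each GPU scores 1/8 of the tokens

# partitioned caches: each worker only ever sees its own shard's counts
local_counts = [np.bincount(s, minlength=N_BUCKETS) for s in shards]

# all-reduce(sum) of the local deltas gives every worker the global counts
synced = sum(local_counts)
assert (synced == np.bincount(global_stream, minlength=N_BUCKETS)).all()
print("per-worker mean count:", local_counts[0].mean(), "| global:", synced.mean())
```

Each worker's table holds roughly 1/8 of the mass before the sync and the full global mass after it, which is why the partitioned-vs-global BPB gap closes for a few seconds of communication.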
From 1f91dfe7c3701910c7ce43d66a6eb7fa8d97edc9 Mon Sep 17 00:00:00 2001 From: Abay Bektursun Date: Thu, 26 Mar 2026 17:11:06 -0500 Subject: [PATCH 04/11] Fix base model reference to PR #728 Co-Authored-By: Claude Opus 4.6 (1M context) --- .../2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md index 282a8692e..c1f4e217f 100644 --- a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md @@ -8,7 +8,7 @@ This submission is not a leaderboard entry. It is a study of eval-time n-gram ca ## Results -All runs use the ValCalib GPTQ base model (1.1142 BPB leaderboard score, 11L/512d, ~16MB artifact, from `records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/`). Single GPU, stride=64, FineWeb val (62M tokens). +All runs use our ValCalib GPTQ base model ([PR #728](https://github.com/openai/parameter-golf/pull/728), 1.1142 BPB, 11L/512d, ~16MB artifact). Single GPU, stride=64, FineWeb val (62M tokens). | Config | BPB | Eval-time state | Effective model | Time | |--------|----:|----------------:|----------------:|-----:| @@ -171,7 +171,7 @@ NGRAM_ENABLED=1 NGRAM_ORDER=9 NGRAM_ALPHA=0.40 \ python3 experiments/eval_time_mixing/scripts/analyze_ngram_matches.py ``` -Base model: `train_609_val_calib.py` from `records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/`. +Base model: `train_609_val_calib.py` from [PR #728](https://github.com/openai/parameter-golf/pull/728) (`records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/`). 
## Credits From b564a7f61ae3e6a670183ab8d94a4e10793a38bc Mon Sep 17 00:00:00 2001 From: Abay Bektursun Date: Thu, 26 Mar 2026 17:19:04 -0500 Subject: [PATCH 05/11] Refine proposal: cap auxiliary state, not total GPU memory Decompressed model weights alone exceed any naive GPU memory cap. The right constraint is auxiliary state: tensors that accumulate during eval and are not derivable from the artifact (hash tables, TTT deltas). Not model weights, KV cache, or activations. Co-Authored-By: Claude Opus 4.6 (1M context) --- .../README.md | 20 +++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md index c1f4e217f..356432799 100644 --- a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md @@ -125,11 +125,23 @@ TTT and LoRA adaptation follow the same principle as the n-gram cache — build ## Proposal -### 1. Cap eval-time memory +### 1. Cap auxiliary eval-time state -Define a total memory budget for eval-time state — for example, artifact + eval state <= 64 MB. This directly constrains the effective model size and aligns the competition with deployment realities. Simple to enforce: measure peak GPU memory allocation during eval. +An important distinction: when a 16 MB int6+compressed artifact loads into VRAM, it decompresses into ~50–100 MB of bf16 weights. Add activations, KV cache, and CUDA overhead, and the base model alone uses several hundred MB of GPU memory. So "cap total GPU memory" doesn't work — the decompressed model already exceeds any reasonable cap. -This extends the 16 MB artifact philosophy to cover the full model at inference time. A model that fits in 16 MB but needs 272 MB to run doesn't fit in 16 MB. 
+The right thing to constrain is **auxiliary state**: tensors that accumulate across the evaluation and are not derivable from the artifact alone. This includes: + +- N-gram hash tables (192–256 MB) — built from scored tokens +- TTT LoRA deltas (~2 MB) — built from scored tokens +- Any other state that persists across batches and grows with the corpus + +This does NOT include: + +- Model weights (deterministic decompression of the artifact) +- KV cache (recomputed each sliding window, does not accumulate) +- Activations (transient, discarded after each forward pass) + +A cap of, say, auxiliary state ≤ 32 MB would preserve everything currently approved (TTT LoRA deltas at ~2 MB, KV cache is excluded) while constraining the techniques that grow the effective model by 10–250x. Enforcement: sum the sizes of all non-model tensors that persist across batches. ### 2. Cap per-token overhead @@ -137,7 +149,7 @@ Require that eval-time techniques do not increase per-token latency by more than Base LM on 8×H100 takes 110s. A 1.5× cap means 165s max. The n-gram cache takes 401s (3.6×). KV cache, TTT, LoRA are all within 1.5×. This also catches two-pass rescoring mechanically. -Both proposals preserve everything currently approved. KV caches (~20 MB), TTT LoRA deltas (~2 MB), and sliding window eval all fit comfortably. They only constrain techniques that grow the model by 10–250x during evaluation. +Both proposals preserve everything currently approved and only constrain techniques that grow the model by 10–250x during evaluation. --- From 638c49b75be824b691d6a8679d97b1edfdfdcd51 Mon Sep 17 00:00:00 2001 From: Abay Bektursun Date: Fri, 27 Mar 2026 00:36:32 -0500 Subject: [PATCH 06/11] Fix collision explanation: P(cache_bin) inflates, not useful blurring MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Credit to @Eppie and Mirco (Discord) for the correct formulation. 
The hash ratio is not a conditional probability — it approaches 1.0 as collision-aggregated counts fill both tables proportionally. The BPB improvement is a measurement artifact from point-evaluating an invalid distribution, not from useful statistical estimation. Co-Authored-By: Claude Opus 4.6 (1M context) --- .../README.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md index 356432799..530cad9e1 100644 --- a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md @@ -35,9 +35,11 @@ All-reduce sync cost: 1.6–2.0s total. The first three configs fit within the 6 Alpha sweep (8-GPU, backoff 2-7): α=0.20 → 0.6180, α=0.40 → 0.4941, α=0.60 → 0.4263, α=0.80 → 0.3942. Higher alpha is monotonically better — the opposite of PR #727's finding (0.9674 BPB). With a global cache, the n-gram is reliable enough that the model should defer to it more, not less. The best alpha (0.80) exceeds the time budget, so in practice α=0.40–0.60 is the operating range. -### Hash collision analysis +### Hash collision analysis — the reported BPB scores are inflated -We swept bucket counts from 1M to 256M expecting more buckets = fewer collisions = better accuracy. The opposite happened: +**Update:** our original explanation of the collision mechanism was incomplete. Credit to @Eppie ([comment](https://github.com/openai/parameter-golf/issues/677#issuecomment-4139902162)) for identifying the probability validity issue, and to Mirco on Discord for the `P(cache_bin)` formulation. 
+ +We swept bucket counts from 1M to 256M: | Buckets | BPB | Table memory | |--------:|----:|---------:| @@ -46,7 +48,13 @@ We swept bucket counts from 1M to 256M expecting more buckets = fewer collisions | 64M | 1.0629 | 3 GB | | 256M | 1.1123 | 12 GB | -With 256M buckets, the table is so sparse that most n-grams have count=1 and fail the `min_count ≥ 2` threshold. Collisions at smaller table sizes merge similar n-grams together, artificially boosting counts above threshold. The hash table is functioning as a lossy count-min sketch, not an exact lookup. [Standard literature](https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch) treats collisions as error to minimize. The BPB improvement depends on this interaction between hash density and the count threshold. +The hash ratio `full_table[hash(ctx, tok)] / ctx_table[hash(ctx)]` is not a conditional probability. The two tables use different hash functions mapping to the same number of buckets. With 1M buckets and 62M tokens, each bucket averages ~62 entries in both tables. The ratio of two similarly-populated buckets approaches 1.0. This is `P(cache_bin)` — a collision-aggregated hash ratio, not `P(tok | ctx)`. + +The blend `(1-α) * p_model + α * P(cache_bin)` with `P(cache_bin) ≈ 1.0` pushes the correct token's probability up. But the blend is only computed for the correct token. If you computed it for all 1024 tokens, each would also get `P(cache_bin) ≈ 1.0`. The distribution would sum to far more than 1. After renormalization, the n-gram contribution washes out. + +The 1-bucket extreme makes this obvious: `P(cache_bin) = T/T = 1.0` for every lookup. Perfect (fake) score. + +The reported BPB numbers are not achievable by a valid compressor. With collision-free tables and proper normalization, n-grams would provide at most a modest improvement from genuine corpus repetition. 
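The argument above can be checked at toy scale. A hedged numeric sketch (random indices stand in for the two independent hash functions; bucket and token counts are scaled down but keep the ~62 entries-per-bucket density of the 1M-bucket, 62M-token setting):

```python
import numpy as np

rng = np.random.default_rng(0)
V, n_buckets, n_tokens, alpha = 1024, 1_000, 62_000, 0.40

# Fill two toy tables through independent hash functions.
ctx_ids = rng.integers(0, 1 << 30, size=n_tokens)
toks = rng.integers(0, V, size=n_tokens)
full_table = np.zeros(n_buckets, dtype=np.int64)
ctx_table = np.zeros(n_buckets, dtype=np.int64)
np.add.at(full_table, (ctx_ids * 31 + toks) % n_buckets, 1)  # stand-in for hash(ctx, tok)
np.add.at(ctx_table, (ctx_ids * 17) % n_buckets, 1)          # stand-in for hash(ctx)

# The ratio of two similarly populated buckets hovers near 1.0 for ANY query:
ratios = full_table[rng.integers(0, n_buckets, 10_000)] / ctx_table[rng.integers(0, n_buckets, 10_000)]
print(float(np.median(ratios)))  # close to 1.0

# Blending P(cache_bin) ~= 1.0 into all V tokens gives an invalid "distribution":
p_model = np.full(V, 1.0 / V)  # a valid (uniform) model distribution
p_cache = np.ones(V)           # every token's bucket ratio ~= 1.0
blend = (1 - alpha) * p_model + alpha * p_cache
print(round(blend.sum(), 1))   # 410.2 -- the "sum ~= 410" figure for alpha = 0.40

# After renormalization, the cache contribution washes out entirely:
assert np.allclose(blend / blend.sum(), p_model)
```

With uniform `p_model` the renormalized blend is exactly uniform again, which is the wash-out the text describes: point-evaluating the unnormalized blend at the correct token is what manufactures the score.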
### What the n-gram cache is From 36ef8283faf8f199ae029cdf3aa29542b1402a77 Mon Sep 17 00:00:00 2001 From: Abay Bektursun Date: Fri, 27 Mar 2026 08:30:18 -0500 Subject: [PATCH 07/11] Add distribution verification as primary fix, three-fix proposal MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fix 1: verify sum(probs) ≈ 1.0 at every scored position Fix 2: cap auxiliary eval-time state ≤ 32 MB Fix 3: cap per-token overhead ≤ 1.5× base model Co-Authored-By: Claude Opus 4.6 (1M context) --- .../README.md | 30 +++++++++---------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md index 530cad9e1..2664e3755 100644 --- a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md @@ -133,31 +133,31 @@ TTT and LoRA adaptation follow the same principle as the n-gram cache — build ## Proposal -### 1. Cap auxiliary eval-time state +### 1. Verify the distribution sums to 1 -An important distinction: when a 16 MB int6+compressed artifact loads into VRAM, it decompresses into ~50–100 MB of bf16 weights. Add activations, KV cache, and CUDA overhead, and the base model alone uses several hundred MB of GPU memory. So "cap total GPU memory" doesn't work — the decompressed model already exceeds any reasonable cap. +The most fundamental fix. Require the model to produce a full probability vector over all K tokens at every scored position. The eval script verifies `sum(probs) ≈ 1.0` before scoring: -The right thing to constrain is **auxiliary state**: tensors that accumulate across the evaluation and are not derivable from the artifact alone. 
This includes: +```python +probs = model.predict(context) # shape: [vocab_size] +assert abs(probs.sum() - 1.0) < 1e-4 # verify +nll = -torch.log(probs[correct_token]) +``` -- N-gram hash tables (192–256 MB) — built from scored tokens -- TTT LoRA deltas (~2 MB) — built from scored tokens -- Any other state that persists across batches and grows with the corpus +One `torch.sum` per position. Cost: 1–2 seconds for 62M tokens. Negligible. -This does NOT include: +This catches every invalid distribution: hash-ratio inflation (sum ≈ 410), single-token hacks (sum = K), any post-softmax modification that doesn't renormalize. It passes everything valid: softmax outputs, linear interpolation of valid distributions, Dirichlet-Multinomial, TTT, LoRA, GPTQ. Not n-gram specific. A general invariant the eval should enforce. -- Model weights (deterministic decompression of the artifact) -- KV cache (recomputed each sliding window, does not accumulate) -- Activations (transient, discarded after each forward pass) +### 2. Cap auxiliary eval-time state -A cap of, say, auxiliary state ≤ 32 MB would preserve everything currently approved (TTT LoRA deltas at ~2 MB, KV cache is excluded) while constraining the techniques that grow the effective model by 10–250x. Enforcement: sum the sizes of all non-model tensors that persist across batches. +Even with valid distributions, the model can grow unboundedly at eval time. Constrain **auxiliary state**: tensors that accumulate during eval and are not derivable from the artifact alone (hash tables, TTT LoRA deltas, anything that persists across batches). Not model weights (deterministic decompression of artifact), not KV cache (recomputed each window), not activations (transient). -### 2. Cap per-token overhead +A cap of auxiliary state ≤ 32 MB preserves everything currently approved (TTT LoRA at ~2 MB) while constraining techniques that grow the effective model by 10–250x. 
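The proposed enforcement ("sum the sizes of all non-model tensors that persist across batches") can be sketched as follows. The registry names are hypothetical; sizes match the study's figures (~2 MB TTT LoRA deltas, 2 tables x 4M buckets x 4 bytes per n-gram order):

```python
import numpy as np

AUX_CAP_BYTES = 32 * 2**20  # proposed cap: 32 MB of auxiliary eval-time state

def auxiliary_state_bytes(registry):
    """Sum the sizes of non-model tensors that persist across batches
    (hash tables, TTT deltas). Model weights, KV cache, and activations
    are deliberately not registered."""
    return sum(t.nbytes for t in registry.values())

# Hypothetical registries, sized as in the study:
ttt_state = {"lora_delta": np.zeros(2 * 2**20 // 4, dtype=np.float32)}  # ~2 MB
ngram_state = {f"order_{k}": np.zeros(2 * 4 * 2**20, dtype=np.float32)  # 32 MB per order
               for k in range(2, 8)}                                    # orders 2-7: 192 MB

assert auxiliary_state_bytes(ttt_state) <= AUX_CAP_BYTES   # TTT LoRA passes the cap
assert auxiliary_state_bytes(ngram_state) > AUX_CAP_BYTES  # the n-gram cache does not
```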
-Require that eval-time techniques do not increase per-token latency by more than 50% over the base model forward pass on the same hardware. Not an absolute number — a ratio. Hardware-agnostic and easy to measure: run eval with and without the technique. +### 3. Cap per-token overhead -Base LM on 8×H100 takes 110s. A 1.5× cap means 165s max. The n-gram cache takes 401s (3.6×). KV cache, TTT, LoRA are all within 1.5×. This also catches two-pass rescoring mechanically. +Eval-time techniques must not increase per-token latency by more than 50% over the base model forward pass. Base LM on 8×H100 takes 110s. A 1.5× cap means 165s max. The n-gram cache takes 401s (3.6×). Catches two-pass rescoring mechanically. -Both proposals preserve everything currently approved and only constrain techniques that grow the model by 10–250x during evaluation. +All three fixes preserve everything currently approved. --- From 4051b85e120623672fc600a3be469c58707bcf94 Mon Sep 17 00:00:00 2001 From: Abay Bektursun Date: Fri, 27 Mar 2026 08:32:16 -0500 Subject: [PATCH 08/11] Add causality enforcement as fix #0, promote bucket sweep to case section Causality is assumed but not enforced by the eval harness. Two-pass rescoring violates it. Should be explicit. Bucket sweep moved from experimental details to the main argument since it proves the BPB scores are inflated. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- .../2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md index 2664e3755..6846648a0 100644 --- a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md @@ -133,6 +133,10 @@ TTT and LoRA adaptation follow the same principle as the n-gram cache — build ## Proposal +### 0. Enforce causality explicitly + +The competition assumes causality but does not enforce it. The FAQ says you can only train on tokens "you've already evaluated your model on," but the eval harness does not verify this. Two-pass rescoring (PRs #846, #853, #868, #870, #881, #888) violates causality: pass 2 rescores token #100 using a cache built from tokens #101 through #62M. This should be an explicit, enforced constraint, not an honor-system rule. + ### 1. Verify the distribution sums to 1 The most fundamental fix. Require the model to produce a full probability vector over all K tokens at every scored position. 
The eval script verifies `sum(probs) ≈ 1.0` before scoring: From ecd339f91571c72ea3c4b57603ea130a597b3377 Mon Sep 17 00:00:00 2001 From: Abay Bektursun Date: Fri, 27 Mar 2026 08:34:33 -0500 Subject: [PATCH 09/11] Fix causality language: rules should state it, not eval harness enforce it Co-Authored-By: Claude Opus 4.6 (1M context) --- .../2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md index 6846648a0..89ff42b28 100644 --- a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md @@ -135,7 +135,7 @@ TTT and LoRA adaptation follow the same principle as the n-gram cache — build ### 0. Enforce causality explicitly -The competition assumes causality but does not enforce it. The FAQ says you can only train on tokens "you've already evaluated your model on," but the eval harness does not verify this. Two-pass rescoring (PRs #846, #853, #868, #870, #881, #888) violates causality: pass 2 rescores token #100 using a cache built from tokens #101 through #62M. This should be an explicit, enforced constraint, not an honor-system rule. +The competition assumes causality but the rules don't state it as an explicit requirement. The FAQ says you can only train on tokens "you've already evaluated your model on," but this is guidance, not a formal rule. Two-pass rescoring (PRs #846, #853, #868, #870, #881, #888) violates causality: pass 2 rescores token #100 using a cache built from tokens #101 through #62M. Causality should be a stated rule, not an implicit assumption. ### 1. 
Verify the distribution sums to 1 From 9671d3d9d8e7f1bb98e4ac52cd84ebd612a5b3aa Mon Sep 17 00:00:00 2001 From: Abay Bektursun Date: Fri, 27 Mar 2026 08:48:30 -0500 Subject: [PATCH 10/11] Restructure proposals: essential (1,2) vs deployment-aligned (3,4) Add causality and distribution validity to real-world comparison. Explain how unbounded eval-time state can be exploited even with valid distributions and causality (self-distillation, ensembling, neural cache). Co-Authored-By: Claude Opus 4.6 (1M context) --- .../README.md | 30 +++++++++++++------ 1 file changed, 21 insertions(+), 9 deletions(-) diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md index 89ff42b28..03e752415 100644 --- a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md @@ -92,6 +92,8 @@ This creates a gap between what the competition measures and what matters in pra | **Time budget** | 600 seconds for 62M tokens | < 200ms per request | | **Hardware** | 8x H100 80GB (640 GB VRAM) | Often 1 GPU, sometimes CPU | | **Model size** | 16 MB artifact; eval-time state unconstrained | Total model must fit deployment target | +| **Causality** | Assumed, not enforced | Physical fact | +| **Distribution validity** | Not checked | Required for generation | Each dimension matters: @@ -105,6 +107,10 @@ Each dimension matters: **5. Inference speed.** The n-gram cache roughly doubles eval time (606s → 1,079s for backoff 2-7). The overhead is constant per token — it doesn't get worse as the cache fills — but a flat 2x slowdown matters when your latency budget is 50–200ms. You pay the per-token cost on every request, but you only get the BPB benefit after millions of tokens of contiguous corpus. On a 500-token prompt, you get the slowdown without the payoff. +**6. 
Causality is not optional.** In real-world inference, causality is a physical fact — you can't use tokens you haven't generated yet. In the competition, it's an assumption that isn't enforced. Two-pass rescoring scores every token twice: once to build a cache, then again using that cache. Pass 2 rescores token #100 with a cache containing tokens #101 through #62M. No real system works this way. + +**7. Probabilities must sum to 1.** A language model assigns a probability to every possible next token. Those probabilities must sum to 1 — that's what makes them probabilities. Current n-gram implementations blend a hash ratio into the correct token's probability without adjusting the other 1,023 tokens. The distribution sums to far more than 1. The BPB metric trusts that it's receiving a valid probability, but it isn't. In deployment, a model that outputs invalid distributions can't be used for generation, sampling, or compression. + ### The core tension The competition implicitly asks: **given N bytes of model, how well can you compress natural language?** @@ -133,13 +139,15 @@ TTT and LoRA adaptation follow the same principle as the n-gram cache — build ## Proposal -### 0. Enforce causality explicitly +#### The two essential fixes: + +### 1. Make causality an explicit rule -The competition assumes causality but the rules don't state it as an explicit requirement. The FAQ says you can only train on tokens "you've already evaluated your model on," but this is guidance, not a formal rule. Two-pass rescoring (PRs #846, #853, #868, #870, #881, #888) violates causality: pass 2 rescores token #100 using a cache built from tokens #101 through #62M. Causality should be a stated rule, not an implicit assumption. +The competition assumes causality but the rules don't state it as a formal requirement. The FAQ says you can only train on tokens "you've already evaluated your model on," but this is guidance, not a formal rule. 
Two-pass rescoring (PRs #846, #853, #868, #870, #881, #888) violates causality: pass 2 rescores token #100 using a cache built from tokens #101 through #62M. Causality should be a stated rule, not an implicit assumption. -### 1. Verify the distribution sums to 1 +### 2. Verify the distribution sums to 1 -The most fundamental fix. Require the model to produce a full probability vector over all K tokens at every scored position. The eval script verifies `sum(probs) ≈ 1.0` before scoring: +Require the model to produce a full probability vector over all K tokens at every scored position. The eval script verifies `sum(probs) ≈ 1.0` before scoring: ```python probs = model.predict(context) # shape: [vocab_size] @@ -151,17 +159,21 @@ One `torch.sum` per position. Cost: 1–2 seconds for 62M tokens. Negligible. This catches every invalid distribution: hash-ratio inflation (sum ≈ 410), single-token hacks (sum = K), any post-softmax modification that doesn't renormalize. It passes everything valid: softmax outputs, linear interpolation of valid distributions, Dirichlet-Multinomial, TTT, LoRA, GPTQ. Not n-gram specific. A general invariant the eval should enforce. -### 2. Cap auxiliary eval-time state +These two fixes solve the immediate problem. But they leave a structural gap: nothing prevents the model from growing unboundedly during eval with valid distributions and preserved causality. Someone could train a second, larger model during eval via self-distillation (outputs go through softmax, score-first, valid and causal). Or load 8 copies of the model and ensemble them via divergent TTT. Or store 63 GB of hidden states and use cross-attention as a neural cache. All valid. All causal. All far beyond 16 MB. + +#### Worth considering if the competition wants to reflect deployment: + +### 3. Cap auxiliary eval-time state -Even with valid distributions, the model can grow unboundedly at eval time. 
Constrain **auxiliary state**: tensors that accumulate during eval and are not derivable from the artifact alone (hash tables, TTT LoRA deltas, anything that persists across batches). Not model weights (deterministic decompression of artifact), not KV cache (recomputed each window), not activations (transient). +Constrain **auxiliary state**: tensors that accumulate during eval and are not derivable from the artifact alone (hash tables, TTT LoRA deltas, anything that persists across batches). Not model weights (deterministic decompression of artifact), not KV cache (recomputed each window), not activations (transient). A cap of auxiliary state ≤ 32 MB preserves everything currently approved (TTT LoRA at ~2 MB) while constraining techniques that grow the effective model by 10–250x. -### 3. Cap per-token overhead +### 4. Cap per-token overhead -Eval-time techniques must not increase per-token latency by more than 50% over the base model forward pass. Base LM on 8×H100 takes 110s. A 1.5× cap means 165s max. The n-gram cache takes 401s (3.6×). Catches two-pass rescoring mechanically. +Eval-time techniques must not increase per-token latency by more than 50% over the base model forward pass. Base LM on 8×H100 takes 110s. A 1.5× cap means 165s max. The n-gram cache takes 401s (3.6×). -All three fixes preserve everything currently approved. +All fixes preserve everything currently approved. Fixes 1 and 2 are the most urgent. Fixes 3 and 4 address the structural problem that persists even with honest scores. --- From 37aa1892cd4607a4a20a2027067b8f86b0244da4 Mon Sep 17 00:00:00 2001 From: Abay Bektursun Date: Fri, 27 Mar 2026 09:30:45 -0500 Subject: [PATCH 11/11] Simplify README to pointer, fix submission.json MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit README now points to blog + PR instead of maintaining a third copy. submission.json: fix base_model_pr 549→728, update name and blurb. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- .../README.md | 215 +----------------- .../submission.json | 6 +- 2 files changed, 12 insertions(+), 209 deletions(-) diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md index 03e752415..b5869db16 100644 --- a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md +++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/README.md @@ -1,214 +1,17 @@ -# Non-Record: Eval-Time N-gram Mixing and the Unbounded Model Growth Problem +# Non-Record: The N-gram BPB Scores Are Not Real **Author:** abaybektursun | **Date:** 2026-03-26 | **Track:** Non-record study -This submission is not a leaderboard entry. It is a study of eval-time n-gram caching — a technique that reduces BPB from 1.11 to 0.38 while preserving strict causality, costing zero artifact bytes, but growing the effective model to 17x the artifact limit at eval time. We present results, explain why this creates a dilemma for the competition, and propose rule clarifications. +N-gram caching in Parameter Golf claims sub-0.5 BPB. The scores come from an invalid probability distribution that sums to ~410, not 1. This study presents the proof, experimental evidence, and proposed fixes. ---- +Full analysis: [abay.tech/posts/eval-time-model-growth](https://abay.tech/posts/eval-time-model-growth) -## Results - -All runs use our ValCalib GPTQ base model ([PR #728](https://github.com/openai/parameter-golf/pull/728), 1.1142 BPB, 11L/512d, ~16MB artifact). Single GPU, stride=64, FineWeb val (62M tokens). 
- -| Config | BPB | Eval-time state | Effective model | Time | -|--------|----:|----------------:|----------------:|-----:| -| Base LM (int6 quantized, leaderboard) | 1.1142 | 0 MB | 16 MB | 606s | -| Base LM (float, pre-quant) | 1.1109 | 0 MB | 16 MB | 606s | -| Pure n-gram, no base LM | 1.0615 | 192 MB | 192 MB | 535s | -| Fixed 7-gram, alpha=0.40 | 0.5234 | 192 MB | 208 MB | 824s | -| Backoff 2-7, alpha=0.40 | 0.4923 | 192 MB | 208 MB | 1079s | -| Backoff 2-7, entropy-adaptive alpha | 0.6535 | 192 MB | 208 MB | 1114s | -| **Backoff 2-9, order-adaptive entropy** | **0.3779** | **256 MB** | **272 MB** | **1234s** | - -The n-gram cache alone — with no base LM — beats the trained model (1.06 vs 1.11 BPB). Combined, it cuts BPB by 66%. - -### 8-GPU results with all-reduce sync (EXP-11) - -All-reduce sync cost: 1.6–2.0s total. The first three configs fit within the 600s competition eval budget; α=0.80 exceeds it (939s). - -| Config | BPB | Time | Cache | Sync cost | -|--------|----:|-----:|-------|-----------| -| Base LM (8-GPU) | 1.1130 | 110s | None | — | -| Backoff 2-7, α=0.40 | 0.4941 | 401s | Global (all-reduce) | 1.6s | -| Backoff 2-9, α=0.40 | 0.4548 | 500s | Global (all-reduce) | 1.9s | -| Backoff 2-7, α=0.80 | 0.3942 | 939s | Global (all-reduce) | ~2.0s | - -Alpha sweep (8-GPU, backoff 2-7): α=0.20 → 0.6180, α=0.40 → 0.4941, α=0.60 → 0.4263, α=0.80 → 0.3942. Higher alpha is monotonically better — the opposite of PR #727's finding (0.9674 BPB). With a global cache, the n-gram is reliable enough that the model should defer to it more, not less. The best alpha (0.80) exceeds the time budget, so in practice α=0.40–0.60 is the operating range. - -### Hash collision analysis — the reported BPB scores are inflated - -**Update:** our original explanation of the collision mechanism was incomplete. 
Credit to @Eppie ([comment](https://github.com/openai/parameter-golf/issues/677#issuecomment-4139902162)) for identifying the probability validity issue, and to Mirco on Discord for the `P(cache_bin)` formulation. - -We swept bucket counts from 1M to 256M: - -| Buckets | BPB | Table memory | -|--------:|----:|---------:| -| 1M | 0.5793 | 48 MB | -| 4M | 0.6535 | 192 MB | -| 64M | 1.0629 | 3 GB | -| 256M | 1.1123 | 12 GB | - -The hash ratio `full_table[hash(ctx, tok)] / ctx_table[hash(ctx)]` is not a conditional probability. The two tables use different hash functions mapping to the same number of buckets. With 1M buckets and 62M tokens, each bucket averages ~62 entries in both tables. The ratio of two similarly-populated buckets approaches 1.0. This is `P(cache_bin)` — a collision-aggregated hash ratio, not `P(tok | ctx)`. - -The blend `(1-α) * p_model + α * P(cache_bin)` with `P(cache_bin) ≈ 1.0` pushes the correct token's probability up. But the blend is only computed for the correct token. If you computed it for all 1024 tokens, each would also get `P(cache_bin) ≈ 1.0`. The distribution would sum to far more than 1. After renormalization, the n-gram contribution washes out. - -The 1-bucket extreme makes this obvious: `P(cache_bin) = T/T = 1.0` for every lookup. Perfect (fake) score. - -The reported BPB numbers are not achievable by a valid compressor. With collision-free tables and proper normalization, n-grams would provide at most a modest improvement from genuine corpus repetition. - -### What the n-gram cache is - -After each token is scored by the base LM, the token and its preceding context are inserted into hash tables. When a future token's context matches a previously seen n-gram, the cached frequency estimate is mixed with the prediction: - -``` -p_mix = (1 - alpha) * p_model + alpha * p_ngram -``` - -The tables are built exclusively from already-scored tokens. No future tokens are accessed. Strict causality is preserved. 
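The score-then-insert loop can be sketched end to end. This is a toy, exact-count version with a uniform stand-in for the neural model (the real implementation uses hashed tables and backoff across orders); note that unlike the hash-ratio blend, mixing two valid distributions here yields a valid distribution:

```python
import numpy as np

def eval_with_ngram_cache(tokens, p_model_fn, order=3, alpha=0.4, vocab=1024):
    """Score-then-insert: every token is scored BEFORE it (and its context)
    enters the cache, so the cache never contains future tokens."""
    cache = {}  # context tuple -> exact per-token counts (no hashing)
    nll = 0.0
    for t in range(len(tokens)):
        ctx = tuple(tokens[max(0, t - order):t])
        p_model = p_model_fn(ctx)                # valid distribution over vocab
        counts = cache.get(ctx)
        if counts is not None:
            p_ngram = counts / counts.sum()      # also a valid distribution
            p_mix = (1 - alpha) * p_model + alpha * p_ngram  # still sums to 1
        else:
            p_mix = p_model                      # no cached estimate yet
        nll += -np.log(p_mix[tokens[t]])
        # Insert only AFTER scoring: strict causality.
        cache.setdefault(ctx, np.zeros(vocab))[tokens[t]] += 1
    return nll / len(tokens)

uniform = lambda ctx: np.full(1024, 1.0 / 1024)
stream = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]  # repetition the cache can exploit
# Once the stream starts repeating, cached contexts cut the average NLL
# below the uniform baseline of log(1024) nats.
assert eval_with_ngram_cache(stream, uniform) < -np.log(1.0 / 1024)
```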
- -### What the n-gram cache costs - -| Config | Hash table memory | Formula | -|--------|------------------:|---------| -| Orders 2-7 (6 orders) | 192 MB | 6 orders x 2 tables x 4M buckets x 4 bytes | -| Orders 2-9 (8 orders) | 256 MB | 8 orders x 2 tables x 4M buckets x 4 bytes | -| Orders 2-9, 64M buckets | 4,096 MB | 8 orders x 2 tables x 64M buckets x 4 bytes | - -None of this counts toward the 16MB artifact limit. The tables are empty at the start of evaluation and grow as tokens are scored. By the end of evaluation, the model that is doing the actual prediction is 16MB of weights plus 256MB of hash tables — **272 MB total**. - ---- - -## The Dilemma - -The competition constrains the artifact to 16MB. The intent is clear: force creative compression of model knowledge into a small footprint. But eval-time techniques like n-gram caching, TTT, and LoRA adaptation grow the effective model far beyond 16MB during evaluation — legally, because the rules only constrain the artifact, not the eval-time state. - -This creates a gap between what the competition measures and what matters in practice. - -### Four dimensions of the gap - -| | Competition | Real-world inference | -|--|-------------|---------------------| -| **Corpus** | Fixed 62M tokens, scored in one pass | Streaming queries, each independent | -| **Time budget** | 600 seconds for 62M tokens | < 200ms per request | -| **Hardware** | 8x H100 80GB (640 GB VRAM) | Often 1 GPU, sometimes CPU | -| **Model size** | 16 MB artifact; eval-time state unconstrained | Total model must fit deployment target | -| **Causality** | Assumed, not enforced | Physical fact | -| **Distribution validity** | Not checked | Required for generation | - -Each dimension matters: - -**1. Inference time.** The competition allows 600 seconds to score 62M tokens. The n-gram cache exploits this by doing O(K) hash lookups per token across K orders, plus table updates after scoring. On a single GPU, our best config takes 1234s. 
On 8 GPUs with all-reduce sync (EXP-11), backoff 2-7 takes 401s. In real-world inference, you serve one request at a time with a latency budget measured in milliseconds.

**2. Inference hardware.** The competition provides 8x H100 with 640GB of combined VRAM. The hash tables (256 MB per GPU, synced via all-reduce) are negligible relative to this. In deployment, models run on single GPUs, edge devices, or CPUs. The 256MB of hash tables alone exceeds the 16MB artifact by 16x.

**3. Competition setup.** The artifact limit constrains what you ship. But the n-gram cache ships nothing — it materializes at eval time from the scored tokens themselves. The 16MB limit was designed to constrain model capacity. The n-gram cache circumvents it by building an unbounded statistical model during evaluation, limited only by the number of hash buckets allocated.

**4. Real-world evaluation.** In production, a language model scores individual prompts. Each query arrives independently. There is no corpus-level repetition to exploit. The n-gram cache's power comes entirely from within-corpus repetition. On a stream of independent queries, the cache starts empty for each request and provides no benefit.

**5. Inference speed.** The n-gram cache roughly doubles eval time (606s → 1,079s for backoff 2-7). The overhead is constant per token — it doesn't get worse as the cache fills — but a flat 2x slowdown matters when the latency budget is 50–200ms. You pay the per-token cost on every request, but you only get the BPB benefit after millions of tokens of contiguous corpus. On a 500-token prompt, you get the slowdown without the payoff.

**6. Causality is not optional.** In real-world inference, causality is a physical fact — you cannot use tokens you haven't generated yet. In the competition, it is an assumption that isn't enforced. Two-pass rescoring scores every token twice: once to build a cache, then again using that cache.
Pass 2 rescores token #100 with a cache containing tokens #101 through #62M. No real system works this way.

**7. Probabilities must sum to 1.** A language model assigns a probability to every possible next token, and those probabilities must sum to 1 — that is what makes them probabilities. Current n-gram implementations blend a hash ratio into the correct token's probability without adjusting the other 1,023 tokens. The distribution sums to far more than 1. The BPB metric trusts that it is receiving a valid probability, but it isn't. In deployment, a model that outputs invalid distributions cannot be used for generation, sampling, or compression.

### The core tension

The competition implicitly asks: **given N bytes of model, how well can you compress natural language?**

Eval-time caching answers a different question: **given N bytes of model plus unbounded eval-time memory, how well can you compress a specific fixed corpus?**

These are different problems. The second has a much lower floor — any corpus with internal repetition can be compressed toward its empirical entropy by memorizing seen patterns. Our results show the gap is enormous: 1.11 BPB (base LM only) vs 0.38 BPB (base LM + cache). The cache contributes 2/3 of the total compression, yet costs zero artifact bytes.
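The invalid-blend arithmetic behind point 7 is easy to reproduce. A toy numpy demonstration, under two assumptions of ours chosen to mirror the collision analysis: a uniform 1,024-token model distribution, and the collision-saturated regime where `P(cache_bin) ≈ 1.0` for every lookup.

```python
import numpy as np

K = 1024        # vocab size
alpha = 0.40
correct = 7     # index of the correct next token

p_model = np.full(K, 1.0 / K)   # uniform model distribution (toy assumption)
p_cache_bin = 1.0               # hash ratio in the collision-saturated regime

# The invalid blend: boost only the correct token, leave the rest untouched.
p_invalid = p_model.copy()
p_invalid[correct] = (1 - alpha) * p_model[correct] + alpha * p_cache_bin
print(round(p_invalid.sum(), 3))   # ~1.4 -- not a probability distribution

# The honest blend: apply it to every token, then renormalize.
p_all = (1 - alpha) * p_model + alpha * p_cache_bin
print(round(p_all.sum(), 1))       # 410.2 -- the "sum of ~410" failure mode
p_valid = p_all / p_all.sum()
print(bool(np.isclose(p_valid[correct], 1.0 / K)))  # True: the boost washes out
```

The honest, renormalized version returns the correct token to its original 1/K probability, which is why the inflated BPB cannot survive a validity check.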
---

## What's already legal and where the line blurs

The competition already permits eval-time model growth through several mechanisms:

| Technique | Eval-time state growth | Legality status |
|-----------|----------------------:|-----------------|
| Sliding window eval (stride < seq_len) | KV cache, ~20 MB | Uncontroversial |
| Test-time training (score-first TTT) | LoRA deltas, ~2 MB | Deemed legal (PRs #549, #548) |
| Per-document LoRA TTT (8 epochs) | LoRA deltas, ~2 MB | Deemed legal (PR #596, 0.6430 BPB) |
| N-gram cache (backoff 2-7) | Hash tables, 192 MB | Under review |
| N-gram cache (backoff 2-9, 64M buckets) | Hash tables, 4 GB | Under review |

TTT and LoRA adaptation follow the same principle as the n-gram cache — build state from scored tokens — though the growth is modest (~2 MB vs 192 MB). The question is not whether causality is preserved (it is), but whether unbounded eval-time model growth is in the spirit of the 16MB constraint.

---

## Proposal

#### The two essential fixes:

### 1. Make causality an explicit rule

The competition assumes causality, but the rules don't state it as a formal requirement. The FAQ says you can only train on tokens "you've already evaluated your model on," but this is guidance, not a formal rule. Two-pass rescoring (PRs #846, #853, #868, #870, #881, #888) violates causality: pass 2 rescores token #100 using a cache built from tokens #101 through #62M. Causality should be a stated rule, not an implicit assumption.

### 2. Verify the distribution sums to 1

Require the model to produce a full probability vector over all K tokens at every scored position. The eval script verifies `sum(probs) ≈ 1.0` before scoring:

```python
probs = model.predict(context)           # shape: [vocab_size]
assert abs(probs.sum() - 1.0) < 1e-4     # reject invalid distributions
nll = -torch.log(probs[correct_token])   # score as usual
```

One `torch.sum` per position. Cost: 1–2 seconds for 62M tokens. Negligible.
The check catches every invalid distribution: hash-ratio inflation (sum ≈ 410), single-token hacks (sum = K), any post-softmax modification that doesn't renormalize. It passes everything valid: softmax outputs, linear interpolation of valid distributions, Dirichlet-Multinomial, TTT, LoRA, GPTQ. It is not n-gram specific — it is a general invariant the eval should enforce.

These two fixes solve the immediate problem. But they leave a structural gap: nothing prevents the model from growing unboundedly during eval while emitting valid distributions and preserving causality. Someone could train a second, larger model during eval via self-distillation (outputs go through softmax, score-first, valid and causal). Or load 8 copies of the model and ensemble them via divergent TTT. Or store 63 GB of hidden states and use cross-attention as a neural cache. All valid. All causal. All far beyond 16 MB.

#### Worth considering if the competition wants to reflect deployment:

### 3. Cap auxiliary eval-time state

Constrain **auxiliary state**: tensors that accumulate during eval and are not derivable from the artifact alone (hash tables, TTT LoRA deltas, anything that persists across batches). Not model weights (deterministic decompression of the artifact), not KV cache (recomputed each window), not activations (transient).

A cap of auxiliary state ≤ 32 MB preserves everything currently approved (TTT LoRA at ~2 MB) while constraining techniques that grow the effective model by 10–250x.

### 4. Cap per-token overhead

Eval-time techniques must not increase per-token latency by more than 50% over the base model forward pass. The base LM on 8x H100 takes 110s, so a 1.5x cap means 165s max. The n-gram cache takes 401s (3.6x).

All four fixes preserve everything currently approved. Fixes 1 and 2 are the most urgent. Fixes 3 and 4 address the structural problem that persists even with honest scores.

---

## Surprising findings
1. **Global cache vs partitioned cache:** On 8 GPUs with independent caches (as in PRs #727 at 0.9674 BPB, #788 at 0.9059 BPB), each GPU sees 1/8 of the tokens. This degrades BPB from ~0.49 (global) to ~0.91–0.97 (partitioned). Our EXP-11 implementation solves this with all-reduce sync of hash-table deltas across GPUs.

2. **Entropy-adaptive alpha hurts with strong caches:** The sigmoid-gated alpha from PR #727 gives 0.65 BPB — 0.16 BPB *worse* than fixed alpha=0.40 (0.49 BPB). With a global cache, the n-gram is often more reliable than the base LM, and the entropy gate is too conservative.

3. **N-gram alone beats the base LM:** Pure n-gram (no base LM at all) achieves 1.06 BPB vs 1.11 BPB for the trained model. A zero-parameter frequency table built from scored tokens predicts FineWeb better than the trained model.

4. **Three compression phenomena:** The n-gram cache captures (a) deterministic BPE subword completion (orders 2-4), (b) common English collocations (orders 4-6), and (c) verbatim document repetition (orders 6+). Only (c) is corpus-specific.

---

## Reproduction

All scripts are in `experiments/eval_time_mixing/scripts/`:

```bash
# Single-GPU experiments (EXP-0, requires 1xH100 + trained model)
python3 experiments/eval_time_mixing/scripts/eval_ngram.py \
    --model final_model.pt --exp backoff_7

# 8-GPU distributed with global cache (EXP-11)
NGRAM_ENABLED=1 NGRAM_ORDER=9 NGRAM_ALPHA=0.40 \
    torchrun --standalone --nproc_per_node=8 \
    experiments/eval_time_mixing/scripts/eval_ngram_distributed.py

# N-gram match analysis (qualitative)
python3 experiments/eval_time_mixing/scripts/analyze_ngram_matches.py
```

Base model: `train_609_val_calib.py` from [PR #728](https://github.com/openai/parameter-golf/pull/728) (`records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/`).
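The all-reduce sync from finding 1 (EXP-11) can be sketched with `torch.distributed`. This is our sketch, not the EXP-11 script: we assume each order's tables are flat count tensors and that each rank keeps the globally synced counts from the previous sync, so only local deltas are exchanged — consistent with the reported 1.6–2.0s total sync cost.

```python
import torch
import torch.distributed as dist

def sync_cache(table: torch.Tensor, last_synced: torch.Tensor) -> None:
    """All-reduce hash-table deltas so every rank sees global n-gram counts.

    table:       this rank's current counts (global counts from the previous
                 sync plus whatever this rank inserted since then)
    last_synced: the global counts as of the previous sync
    """
    delta = table - last_synced                    # counts added locally
    dist.all_reduce(delta, op=dist.ReduceOp.SUM)   # sum deltas over all ranks
    table.copy_(last_synced + delta)               # every rank: global counts
    last_synced.copy_(table)                       # baseline for the next sync
```

Called periodically between scoring windows, a handful of such syncs over ~256 MB of tables per rank plausibly fits the ~2s budget the results report.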
+PR discussion: [#886](https://github.com/openai/parameter-golf/pull/886)
 
 ## Credits
 
-N-gram cache concept and initial implementations: [PR #727](https://github.com/openai/parameter-golf/pull/727), [PR #779](https://github.com/openai/parameter-golf/pull/779), [PR #788](https://github.com/openai/parameter-golf/pull/788). Competition design and infrastructure: OpenAI.
+- [@Eppie](https://github.com/openai/parameter-golf/issues/677#issuecomment-4139902162) for identifying the probability validity issue
+- Mirco (Discord) for the `P(cache_bin)` formulation
+- N-gram cache concept: [PR #727](https://github.com/openai/parameter-golf/pull/727), [PR #779](https://github.com/openai/parameter-golf/pull/779), [PR #788](https://github.com/openai/parameter-golf/pull/788)
+- Base model: [PR #728](https://github.com/openai/parameter-golf/pull/728)
+- Code: `experiments/eval_time_mixing/`
diff --git a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/submission.json b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/submission.json
index 5d41c7880..eebcd8d50 100644
--- a/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/submission.json
+++ b/records/track_non_record_16mb/2026-03-26_EvalTime_NGram_ModelGrowth_Study/submission.json
@@ -1,8 +1,8 @@
 {
   "author": "abaybektursun",
   "github_id": "abaybektursun",
-  "name": "Eval-Time N-gram Mixing and the Unbounded Model Growth Problem",
-  "blurb": "Study of eval-time n-gram caching: a strictly causal technique that reduces BPB from 1.11 to 0.38 while growing the effective model from 16MB to 272MB at eval time. Presents results, analyzes the gap between competition setup and real-world inference constraints, and proposes rule clarifications.",
+  "name": "The N-gram BPB Scores Are Not Real",
+  "blurb": "N-gram caching claims sub-0.5 BPB, but the scores come from an invalid probability distribution (sums to ~410, not 1). The hash ratio P(cache_bin) is not a conditional probability. Bucket sweep confirms: collision-free tables give baseline-level BPB. Proposes distribution verification and causality enforcement.",
   "date": "2026-03-26",
   "track": "non_record_study",
   "val_bpb_neural_only": 1.1109,
@@ -11,7 +11,7 @@
   "artifact_bytes": 15866156,
   "eval_time_state_bytes_best": 268435456,
   "effective_model_bytes_best": 284301612,
-  "base_model_pr": 549,
+  "base_model_pr": 728,
   "base_model_record": "records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072",
   "hardware": "1xH100 80GB SXM (single-GPU experiments)",
   "experiments_run": [