
RFC: How to Clean Up All the Parameter Golf Submissions#886

Open
abaybektursun wants to merge 11 commits into openai:main from abaybektursun:nonrecord/eval-time-model-growth-study

Conversation


@abaybektursun abaybektursun commented Mar 26, 2026

📄 Full article with charts: abay.tech/posts/eval-time-model-growth


The distribution doesn't sum to 1

N-gram caching recently pushed reported scores below 0.5 BPB. We dug into the numbers and found the catch: the blended probability distribution sums to ~410, not 1.

The blend (1-α) * p_model + α * P(cache_bin) is only computed for the correct token. The other 1,023 tokens are never checked. If they were, the distribution would sum to ~410, not 1.0.

Why: the cache stores two hash tables per n-gram order: one counts how often each context appears, one counts how often each (context, token) pair appears. Their ratio — full_table[hash(ctx, tok)] / ctx_table[hash(ctx)] — is meant to approximate P(tok | ctx). But because context-only and context+token hash to independent bucket indices, the ratio doesn't track token frequency. With 1M buckets and 62M tokens, each bucket averages ~62 entries. The ratio of two similarly-populated buckets approaches 1.0 for ALL tokens. This is P(cache_bin), not P(tok | ctx).
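A scaled-down toy run illustrates this (assumed parameters: 10k buckets, 500k random tokens, and a vocab of 1024, standing in for the real 1M buckets and 62M tokens; not the submission's actual code):

```python
import random

# Two independent hash tables, as described above: one counts contexts,
# one counts (context, token) pairs. Because the two hash to unrelated
# buckets, the ratio full/ctx approaches 1.0 for EVERY token once both
# tables are similarly populated.
random.seed(0)
N_BUCKETS = 10_000
VOCAB = 1024
ctx_table = [0] * N_BUCKETS
full_table = [0] * N_BUCKETS

stream = [random.randrange(VOCAB) for _ in range(500_000)]
for i in range(2, len(stream)):
    ctx = (stream[i - 2], stream[i - 1])
    ctx_table[hash(ctx) % N_BUCKETS] += 1
    full_table[hash((ctx, stream[i])) % N_BUCKETS] += 1

# Query one arbitrary context and sum the "probabilities" of all 1024 tokens.
ctx = (1, 2)
denom = ctx_table[hash(ctx) % N_BUCKETS]
total = sum(full_table[hash((ctx, t)) % N_BUCKETS] / denom for t in range(VOCAB))
print(total)  # on the order of VOCAB, far above the 1.0 a valid distribution sums to
```

Blending these ratios into the model with α = 0.4 gives total mass (1−α)·1 + α·1024 ≈ 410, the figure quoted above.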

The 1-bucket proof: P(cache_bin) = T/T = 1.0 for every lookup. With α = 1, BPB = 0. Perfect score — which tells us the metric isn't measuring what we think.

For any token the model predicts better than uniform (p > 1/K), renormalization strictly decreases its probability. The n-gram contribution doesn't just wash out — it actively hurts.

Credit to @Eppie (#677 comment) for identifying the probability validity issue, and to Mirco on Discord for the P(cache_bin) formulation.

Bucket sweep (empirical confirmation)

| Buckets | Reported BPB | Table memory |
|---------|--------------|--------------|
| 1M      | 0.5793       | 48 MB        |
| 4M      | 0.6535       | 192 MB       |
| 64M     | 1.0629       | 3 GB         |
| 256M    | 1.1123       | 12 GB        |

All configs use backoff 2-7 with entropy-adaptive α. 256M buckets (near collision-free) scores 1.1123, near the float baseline (1.1109). The "improvement" tracks collision density, not prediction quality.

The n-gram-only configuration — hash tables with no neural model — reports 1.0615, below the neural baseline at 1.1109. A frequency table with no learned parameters appears to outcompress a trained language model. This is only possible because the number being reported is not measuring compression.

Two-pass rescoring compounds the problem

PRs #846, #853, #868, #870, #881, #888 use two passes: pass 1 scores tokens and builds the cache, pass 2 rescores ALL tokens using the complete cache. This violates causality and compounds the distribution issue.

A separate question: what should the competition measure?

The distribution issue above is a measurement bug — it applies regardless of what anyone thinks the competition should optimize for. What follows is a design conversation. Reasonable people can disagree.

Even with valid distributions and preserved causality, the model can grow unboundedly at eval time. Someone could train a second, larger model via self-distillation, ensemble 8 copies via divergent TTT, or store 63 GB of hidden states as a neural cache. All valid. All causal. All far beyond 16 MB. The competition gives evaluation 8×H100 and 600s for a 16 MB model. In deployment, inference is constrained by hardware cost, latency, and concurrent users. Whether the competition should reflect those realities is an interesting design choice.

Proposed fixes

@0hq @cocohearts @valerio-oai

1. Verify the distribution sums to 1 (fixing the measurement)

```python
import torch

probs = model.predict(context)               # shape: [vocab_size]
assert abs(probs.sum().item() - 1.0) < 1e-4  # verify it is a valid distribution
nll = -torch.log(probs[correct_token])
```

One torch.sum per position. 1–2 seconds for 62M tokens. Catches every invalid distribution. Passes everything valid. Not n-gram specific.

2. Make causality an explicit rule (aligning with reality)

The FAQ says you can only train on tokens "you've already evaluated your model on." Two-pass rescoring violates this. Making it a stated rule would clarify the intent.

3. Cap auxiliary eval-time state (aligning with reality)

Constrain auxiliary state: tensors that accumulate during eval and are not derivable from the artifact alone. Not model weights, not KV cache, not activations. A cap of ≤ 32 MB preserves everything currently approved (TTT LoRA at ~2 MB).
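A sketch of how a harness might account for this cap. The helper name and the way state gets registered are assumptions, not the actual eval API; the 12-table breakdown is inferred from the 48 MB at 1M buckets in the sweep above (orders 2-7 × 2 tables × 4-byte counts):

```python
import numpy as np

AUX_CAP_BYTES = 32 * 1024 * 1024  # proposed cap

def aux_state_bytes(aux_arrays):
    # Auxiliary state: arrays that accumulate during eval and are not
    # derivable from the 16 MB artifact alone (hash tables, TTT deltas).
    # Model weights, KV cache, and activations are exempt.
    return sum(a.size * a.itemsize for a in aux_arrays)

# A backoff 2-7 cache at 1M buckets: 12 tables of 1M uint32 counts = 48 MB.
ngram_cache = [np.zeros(1_000_000, dtype=np.uint32) for _ in range(12)]
# A ~2 MB TTT LoRA delta, as currently approved.
ttt_delta = [np.zeros(500_000, dtype=np.float32)]

print(aux_state_bytes(ngram_cache) > AUX_CAP_BYTES)  # True: fails the cap
print(aux_state_bytes(ttt_delta) <= AUX_CAP_BYTES)   # True: passes
```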

4. Cap per-token overhead (aligning with reality)

Eval-time techniques must not increase per-token latency by more than 50% over the base model forward pass. Base LM on 8×H100 takes 110s. A 1.5× cap means 165s max. The n-gram cache takes 401s (3.6×).
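The cap arithmetic, using the timings reported above:

```python
base_s = 110.0         # base LM eval wall time on 8xH100
cap_s = 1.5 * base_s   # 165 s allowed under a 1.5x per-token overhead cap
ngram_s = 401.0        # measured with the backoff 2-7 n-gram cache

print(cap_s)                       # 165.0
print(round(ngram_s / base_s, 1))  # 3.6: well over the cap
```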

What do you think?

Experimental appendix

All BPB numbers below are from an invalid distribution. They measure how much P(cache_bin) inflates the correct token's probability, not compression quality.

Single GPU (stride=64, FineWeb val, 62M tokens):

| Config                       | Reported BPB | Eval-time state | Effective model |
|------------------------------|--------------|-----------------|-----------------|
| Base LM (int6 quantized)     | 1.1142       | 0 MB            | 16 MB           |
| Base LM (float, pre-quant)   | 1.1109       | 0 MB            | 16 MB           |
| N-gram only (no base LM)     | 1.0615       | 192 MB          | 192 MB          |
| Backoff 2-7, α=0.40          | 0.4923       | 192 MB          | 208 MB          |
| Backoff 2-9, order-adaptive  | 0.3779       | 256 MB          | 272 MB          |

8×H100 with all-reduce sync (first three under 600s budget):

| Config               | Reported BPB | Time | Sync overhead |
|----------------------|--------------|------|---------------|
| Base LM              | 1.1130       | 110s | n/a           |
| Backoff 2-7, α=0.40  | 0.4941       | 401s | 1.6s          |
| Backoff 2-9, α=0.40  | 0.4548       | 500s | 1.9s          |
| Backoff 2-7, α=0.80  | 0.3942       | 939s | ~2.0s         |

Alpha sweep — higher α = more weight on inflated ratio = lower reported BPB. Tracks α, not prediction quality.

Order scaling — saturates around order 9–12. Each additional order changes which hash ratio is used for each token.

Stride decomposition — the artifact magnitude (~0.62 BPB) is independent of sliding window stride.

Base model: PR #728. Reproduction scripts: experiments/eval_time_mixing/scripts/.

Test plan

  • Bucket sweep (EXP-5): proves scores are from invalid distribution
  • Single-GPU experiments (EXP-0): 7 configs
  • 8-GPU all-reduce experiments (EXP-11): 4 configs + alpha sweep
  • Order scaling (EXP-1), stride decomposition (EXP-6)
  • Maintainer input on distribution check + causality rule

🤖 Generated with Claude Code

abaybektursun and others added 2 commits March 26, 2026 14:03
Study of eval-time n-gram caching — a technique that reduces BPB from
1.11 to 0.38 while preserving strict causality, costing zero artifact
bytes, but growing the effective model to 17x the artifact limit.

Includes single-GPU ablations, 8-GPU all-reduce results (0.49 BPB in
401s, under 600s budget), alpha sweep, and a comparison of competition
eval setup vs real-world inference constraints. Proposes four rule
clarifications to align the competition with deployment realities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun marked this pull request as ready for review March 26, 2026 19:10
@abaybektursun abaybektursun changed the title Non-record: Your 16 MB model is 272 MB at eval time RFC: The leaderboard is optimizing for compression, not language modeling Mar 26, 2026
@abaybektursun abaybektursun changed the title RFC: The leaderboard is optimizing for compression, not language modeling RFC: A framework for deciding the n-gram question Mar 26, 2026
@robinojw

This puts a clear name on something I've been navigating by feel: the line between approved eval-time learning and unbounded model growth is quantitative, not qualitative, and right now nobody knows where it is.
A concrete ruling would help. The 64 MB cap proposed here seems right: it preserves everything currently approved and gives competitors a budget to engineer against instead of guessing at intent.

abaybektursun and others added 3 commits March 26, 2026 17:07
- Base model is ValCalib GPTQ (1.1142 BPB), not PR openai#549 (1.1194)
- Remove stale "not yet deployed" / "we estimate" for EXP-11
- Note α=0.80 (939s) exceeds 600s budget
- Fix PR openai#727 score to 0.9674, PR openai#788 to 0.9059
- Fix PR openai#596 BPB to 0.6430
- "Approved" → "Technique deemed legal" for closed PRs
- Add bucket sweep and per-token overhead proposal
- Replace "neural" with "base LM" throughout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Decompressed model weights alone exceed any naive GPU memory cap.
The right constraint is auxiliary state: tensors that accumulate
during eval and are not derivable from the artifact (hash tables,
TTT deltas). Not model weights, KV cache, or activations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EthanYangTW

Strong support for this RFC. The fact that a frequency table with zero training beats every trained model proves this thing is measuring dataset memorization, not language modeling quality. We've been pushing neural improvements (GPTQ, QAT, even novel architectures), and it's demoralizing to see lookup tables dominate.

@abaybektursun

Correction: our original explanation of why hash collisions help was wrong. Credit to @Eppie (#677 comment) for identifying the probability validity issue, and to Mirco on Discord for the P(cache_bin) formulation.

Our bucket sweep data is correct, but the mechanism is different from what we originally described. The hash ratio full_table[hash(ctx, tok)] / ctx_table[hash(ctx)] is not a conditional probability — it's a collision-aggregated ratio that approaches 1.0 as tables fill. The blend inflates the correct token's probability without renormalizing the other 1023 tokens. The BPB improvement is primarily a measurement artifact from point-evaluating an invalid distribution, not from useful statistical estimation.

PR body and README updated to reflect this.

Credit to @Eppie and Mirco (Discord) for the correct formulation.
The hash ratio is not a conditional probability — it approaches 1.0
as collision-aggregated counts fill both tables proportionally. The
BPB improvement is a measurement artifact from point-evaluating an
invalid distribution, not from useful statistical estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun and others added 4 commits March 27, 2026 08:30
Fix 1: verify sum(probs) ≈ 1.0 at every scored position
Fix 2: cap auxiliary eval-time state ≤ 32 MB
Fix 3: cap per-token overhead ≤ 1.5× base model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

Causality is assumed but not enforced by the eval harness.
Two-pass rescoring violates it. Should be explicit.

Bucket sweep moved from experimental details to the main argument
since it proves the BPB scores are inflated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ce it

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add causality and distribution validity to real-world comparison.
Explain how unbounded eval-time state can be exploited even with
valid distributions and causality (self-distillation, ensembling,
neural cache).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title RFC: A framework for deciding the n-gram question RFC: The n-gram BPB scores are not real Mar 27, 2026
README now points to blog + PR instead of maintaining a third copy.
submission.json: fix base_model_pr 549→728, update name and blurb.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun changed the title RFC: The n-gram BPB scores are not real RFC: N-gram scores need a distribution validity check Mar 27, 2026
@abaybektursun abaybektursun changed the title RFC: N-gram scores need a distribution validity check RFC: N-gram scores are invalid, and the eval setup needs constraints Mar 27, 2026
@abaybektursun abaybektursun changed the title RFC: N-gram scores are invalid, and the eval setup needs constraints RFC: How to clean up all the Parameter Golf submissions Mar 27, 2026
@abaybektursun abaybektursun changed the title RFC: How to clean up all the Parameter Golf submissions RFC: N-gram distributions don't sum to 1 — proposed fixes Mar 27, 2026
@abaybektursun abaybektursun changed the title RFC: N-gram distributions don't sum to 1 — proposed fixes RFC: How to Clean Up All the Parameter Golf Submissions Mar 27, 2026