
Record: WaterLOO — Full-Rescore N-gram Cache with Self-Exclusion (val_bpb 0.0990)#881

Open
simon-marcus wants to merge 1 commit into openai:main from simon-marcus:submission/waterloo-loo-0.0990

Conversation


@simon-marcus simon-marcus commented Mar 26, 2026

Summary — WaterLOO

val_bpb: 0.0990 (3-seed mean, std 0.00002) | ~15.87 MB artifact (15,861,664 to 15,886,796 bytes across the 3 seeds) | 144-145s eval time

This is the first of two more conservative follow-ups to our BROADSIDE record submission. If BROADSIDE was the version that sailed straight into the interpretive weather and dared anyone to blink, this submission -- WaterLOO -- is the slightly more cautious friend that can still party with the best of 'em.

BROADSIDE illuminated an important (if vulgar) reality: if you decouple the neural forward pass from the n-gram scoring and then rescore the validation stream against a complete cache, the current two-pass frontier stops looking like a frontier and starts looking like a rest stop. The awkward question, of course, is how much of that gain came from the architecture and how much came from the fact that every token, in the aggressive version, was enjoying the company of its own (context,target) count.

This submission answers that question in the cleanest way we know how: keep the fast full-cache machinery, keep the full-rescore architecture, but subtract each token's own contribution before matching and before computing its n-gram probability. In other words, the cache stays global and the scoring stays cheap, but the most obvious self-inclusion path is excised.

The result is 0.0990 BPB over three seeds. That's a touch worse than BROADSIDE's 0.0935, because of course it is; if you take away free candy, candy consumption declines. But it is still dramatically ahead of the currently visible two-pass frontier: better than PR #853's 0.1315 by 0.0325 BPB and better than PR #846's 0.1434 by 0.0444 BPB. Which is to say, this meaningfully advances the frontier.

What's new

| | PR #846 | PR #853 | BROADSIDE (#870) | This PR |
|---|---|---|---|---|
| N-gram orders | 2-9 | 2-12 | 2-12 | 2-12 |
| Chunks rescored | 15/63 | 50/237 | All | All |
| Self-inclusion | yes | yes | yes | no (leave-one-out) |
| Eval time | 339s | 508s | 158s | 144s |
| val_bpb | 0.1434 | 0.1315 | 0.0935 | 0.0990 |

Per-seed results

| Seed | val_bpb |
|---|---|
| 1337 | 0.09897 |
| 42 | 0.09897 |
| 2025 | 0.09902 |

Mean: 0.09898524
Std: 0.00002

Why WaterLOO

BROADSIDE built the full cache and then scored each token directly against it. That means each token's own (context,target) occurrence was present in the relevant hash bucket when the token was rescored. WaterLOO is the same ship with one big gun removed: it keeps the global cache and the full-rescore architecture, but each token is forced to check its own contribution at the door.

Concretely, in Pass 2 it performs leave-one-out scoring:

  • subtract 1 from the token's context count
  • subtract 1 from the token's (context,target) count
  • then apply the same backoff, min_count, entropy-adaptive alpha, and order multipliers as before
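A minimal sketch of that adjustment, assuming counts live in plain dicts; the `loo_prob` helper, the additive smoothing, and the default `alpha`/`min_count` values are illustrative stand-ins, not the submission's actual code (which also applies the entropy-adaptive alpha and order multipliers):

```python
def loo_prob(context, target, ctx_counts, pair_counts,
             vocab_size, alpha=0.1, min_count=1):
    # Subtract this token's own occurrence from both counts
    # before scoring it (leave-one-out).
    ctx = ctx_counts.get(context, 0) - 1
    pair = pair_counts.get((context, target), 0) - 1
    # Back off to a shorter order if, after exclusion,
    # the context is too rare to trust.
    if ctx < min_count:
        return None
    # Additive smoothing over the vocabulary.
    return (pair + alpha) / (ctx + alpha * vocab_size)

# Example: context "the" seen 3 times, pair ("the", "cat") seen twice.
# After excluding the current occurrence: ctx = 2, pair = 1.
ctx_counts = {("the",): 3}
pair_counts = {(("the",), "cat"): 2}
p = loo_prob(("the",), "cat", ctx_counts, pair_counts, vocab_size=10)
# p = (1 + 0.1) / (2 + 1.0) ≈ 0.367
```

Note the asymmetry this creates: a token whose (context, target) pair occurs only once in the stream sees a post-exclusion pair count of zero, so singletons no longer score themselves.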

The practical effect is simple: the method still benefits from a globally warm cache, but each token no longer gets to vote for itself. That's a stricter and, we think, more defensible interpretation of the two-pass idea.

Why this matters

The important fact here is not that BROADSIDE got 0.0935. The important fact is that after removing the most obvious self-inclusion mechanism, the score is still 0.0990.

That suggests the main gain is not some tiny accounting trick, but a structural improvement:

  • fast vectorized cache construction via np.bincount
  • full-stream coverage instead of only rescoring a selected prefix
  • enough eval headroom that the cache can be built once and used everywhere rather than only where the clock can tolerate it
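The first of those points can be sketched as follows, assuming contexts are reduced to hash-bucket ids; the polynomial multiplier, bucket count, and `build_cache` helper are hypothetical, and hash collisions are ignored for brevity:

```python
import numpy as np

def build_cache(tokens, order, n_buckets=1 << 20):
    """Count every length-(order-1) context and (context, target)
    pair in one vectorized pass (no per-token Python loop)."""
    tokens = np.asarray(tokens, dtype=np.int64)
    n_windows = len(tokens) - order + 1
    # Hash each context window with a simple polynomial rolling hash.
    ctx = np.zeros(n_windows, dtype=np.int64)
    for i in range(order - 1):
        ctx = (ctx * 1000003 + tokens[i:i + n_windows]) % n_buckets
    tgt = tokens[order - 1:]
    # One np.bincount per table replaces an explicit counting loop.
    ctx_counts = np.bincount(ctx, minlength=n_buckets)
    pair_counts = np.bincount((ctx * 1000003 + tgt) % n_buckets,
                              minlength=n_buckets)
    return ctx_counts, pair_counts

# Toy stream: four bigram windows, each context seen twice.
tokens = [1, 2, 1, 2, 1]
ctx_counts, pair_counts = build_cache(tokens, order=2, n_buckets=97)
```

Because `np.bincount` does the aggregation in C, building tables for every order 2-12 stays cheap even over the full stream, which is what buys the eval headroom above.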

In other words, the broadside still lands even after you take some of the powder out of the cannon.

Test plan

  • 3-seed validation (1337, 42, 2025) with mean 0.0990, std 0.00002
  • Artifact size under cap: ~15.87 MB (15,861,664 to 15,886,796 bytes across seeds)
  • Training time: 600s on 8xH100 SXM
  • Eval time: ~144-145s, well under the 600s eval budget
  • Pass 1 scores all tokens before Pass 2 rescoring
  • Leave-one-out scoring removes each token's own direct cache contribution
  • No validation data accessed during training

Positioning

BROADSIDE (#870) is the "push the interpretation until the interpretation pushes back" submission. This PR is the first of two more conservative submissions behind it.

The basic thesis is that the frontier has moved from whether two-pass n-gram rescoring works to how much caution you want to bake into it. BROADSIDE says: give every token the complete cache and let God (openai??) sort it out. This one says: fine, but each token has to leave its own shoes at the door.

If the organizers are comfortable with the full BROADSIDE reading, that submission should stand on its own merits. If they want a stricter posture around self-inclusion while still accepting two-pass rescoring in principle, this submission exists to show that the underlying architecture remains emphatically SOTA.
