
Record: WaterLOO — Full-Rescore N-gram Cache with Self-Exclusion (val_bpb 0.0990)#881

Open
simon-marcus wants to merge 1 commit into openai:main from simon-marcus:submission/waterloo-loo-0.0990

Conversation


@simon-marcus simon-marcus commented Mar 26, 2026

Summary — WaterLOO

val_bpb: 0.0990 (3-seed mean, std 0.00002) | ~15.87 MB artifact (15,861,664 to 15,886,796 bytes across the 3 seeds) | 144-145s eval time

This is the first of two more conservative follow-ups to our BROADSIDE record submission. If BROADSIDE was the version that sailed straight into the interpretive weather and dared anyone to blink, this submission -- WaterLOO -- is the slightly more cautious friend that can still party with the best of 'em.

BROADSIDE illuminated an important (if vulgar) reality: if you decouple the neural forward pass from the n-gram scoring and then rescore the validation stream against a complete cache, the current two-pass frontier stops looking like a frontier and starts looking like a rest stop. The awkward question, of course, is how much of that gain came from the architecture and how much came from the fact that every token, in the aggressive version, was enjoying the company of its own (context,target) count.

This submission answers that question in the cleanest way we know how: keep the fast full-cache machinery, keep the full-rescore architecture, but subtract each token's own contribution before matching and before computing its n-gram probability. In other words, the cache stays global and the scoring stays cheap, but the most obvious self-inclusion path is excised.

The result is 0.0990 BPB over three seeds. That's a touch worse than BROADSIDE's 0.0935, because of course it is; if you take away free candy, candy consumption declines. But it is still dramatically ahead of the currently visible two-pass frontier: better than PR #853's 0.1315 by 0.0325 BPB and better than PR #846's 0.1434 by 0.0444 BPB. Which is to say, this meaningfully advances the frontier.

What's new

| | PR #846 | PR #853 | BROADSIDE (#870) | This PR |
|---|---|---|---|---|
| N-gram orders | 2-9 | 2-12 | 2-12 | 2-12 |
| Chunks rescored | 15/63 | 50/237 | All | All |
| Self-inclusion | yes | yes | yes | no (leave-one-out) |
| Eval time | 339s | 508s | 158s | 144s |
| val_bpb | 0.1434 | 0.1315 | 0.0935 | 0.0990 |

Per-seed results

| Seed | val_bpb |
|---|---|
| 1337 | 0.09897 |
| 42 | 0.09897 |
| 2025 | 0.09902 |

Mean: 0.09898524
Std: 0.00002

Why WaterLOO

BROADSIDE built the full cache and then scored each token directly against it. That means each token's own (context,target) occurrence was present in the relevant hash bucket when the token was rescored. WaterLOO is the same ship with one big gun removed: it keeps the global cache and the full-rescore architecture, but each token is forced to check its own contribution at the door.

Concretely, in Pass 2 it performs leave-one-out scoring:

  • subtract 1 from the token's context count
  • subtract 1 from the token's (context,target) count
  • then apply the same backoff, min_count, entropy-adaptive alpha, and order multipliers as before
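A minimal sketch of that adjustment, assuming counts live in plain dicts; the `loo_prob` helper, the additive smoothing, and the default `alpha`/`min_count` values are illustrative stand-ins, not the submission's actual code (which also applies the entropy-adaptive alpha and order multipliers):

```python
def loo_prob(context, target, ctx_counts, pair_counts,
             vocab_size, alpha=0.1, min_count=1):
    # Subtract this token's own occurrence from both counts
    # before scoring it (leave-one-out).
    ctx = ctx_counts.get(context, 0) - 1
    pair = pair_counts.get((context, target), 0) - 1
    # Back off to a shorter order if, after exclusion,
    # the context is too rare to trust.
    if ctx < min_count:
        return None
    # Additive smoothing over the vocabulary.
    return (pair + alpha) / (ctx + alpha * vocab_size)

# Example: context "the" seen 3 times, pair ("the", "cat") seen twice.
# After excluding the current occurrence: ctx = 2, pair = 1.
ctx_counts = {("the",): 3}
pair_counts = {(("the",), "cat"): 2}
p = loo_prob(("the",), "cat", ctx_counts, pair_counts, vocab_size=10)
# p = (1 + 0.1) / (2 + 1.0) ≈ 0.367
```

Note the asymmetry this creates: a token whose (context, target) pair occurs only once in the stream sees a post-exclusion pair count of zero, so singletons no longer score themselves.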

The practical effect is simple: the method still benefits from a globally warm cache, but each token no longer gets to vote for itself. That's a stricter and, we think, more defensible interpretation of the two-pass idea.

Why this matters

The important fact here is not that BROADSIDE got 0.0935. The important fact is that after removing the most obvious self-inclusion mechanism, the score is still 0.0990.

That suggests the main gain is not some tiny accounting trick, but a structural improvement:

  • fast vectorized cache construction via np.bincount
  • full-stream coverage instead of only rescoring a selected prefix
  • enough eval headroom that the cache can be built once and used everywhere rather than only where the clock can tolerate it
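The first of those points can be sketched as follows, assuming contexts are reduced to hash-bucket ids; the polynomial multiplier, bucket count, and `build_cache` helper are hypothetical, and hash collisions are ignored for brevity:

```python
import numpy as np

def build_cache(tokens, order, n_buckets=1 << 20):
    """Count every length-(order-1) context and (context, target)
    pair in one vectorized pass (no per-token Python loop)."""
    tokens = np.asarray(tokens, dtype=np.int64)
    n_windows = len(tokens) - order + 1
    # Hash each context window with a simple polynomial rolling hash.
    ctx = np.zeros(n_windows, dtype=np.int64)
    for i in range(order - 1):
        ctx = (ctx * 1000003 + tokens[i:i + n_windows]) % n_buckets
    tgt = tokens[order - 1:]
    # One np.bincount per table replaces an explicit counting loop.
    ctx_counts = np.bincount(ctx, minlength=n_buckets)
    pair_counts = np.bincount((ctx * 1000003 + tgt) % n_buckets,
                              minlength=n_buckets)
    return ctx_counts, pair_counts

# Toy stream: four bigram windows, each context seen twice.
tokens = [1, 2, 1, 2, 1]
ctx_counts, pair_counts = build_cache(tokens, order=2, n_buckets=97)
```

Because `np.bincount` does the aggregation in C, building tables for every order 2-12 stays cheap even over the full stream, which is what buys the eval headroom above.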

In other words, the broadside still lands even after you take some of the powder out of the cannon.

Test plan

  • 3-seed validation (1337, 42, 2025) with mean 0.0990, std 0.00002
  • Artifact size under cap: ~15.87 MB (15,861,664 to 15,886,796 bytes across seeds)
  • Training time: 600s on 8xH100 SXM
  • Eval time: ~144-145s, well under the 600s eval budget
  • Pass 1 scores all tokens before Pass 2 rescoring
  • Leave-one-out scoring removes each token's own direct cache contribution
  • No validation data accessed during training

Positioning

BROADSIDE (#870) is the "push the interpretation until the interpretation pushes back" submission. This PR is the first of two more conservative submissions behind it.

The basic thesis is that the frontier has moved from whether two-pass n-gram rescoring works to how much caution you want to bake into it. BROADSIDE says: give every token the complete cache and let God (openai??) sort it out. This one says: fine, but each token has to leave its own shoes at the door.

If the organizers are comfortable with the full BROADSIDE reading, that submission should stand on its own merits. If they want a stricter posture around self-inclusion while still accepting two-pass rescoring in principle, this submission exists to show that the underlying architecture remains emphatically SOTA.
