Record: WaterLOO — Full-Rescore N-gram Cache with Self-Exclusion (val_bpb 0.0990)#881
Open
simon-marcus wants to merge 1 commit intoopenai:mainfrom
Open
Record: WaterLOO — Full-Rescore N-gram Cache with Self-Exclusion (val_bpb 0.0990)#881simon-marcus wants to merge 1 commit intoopenai:mainfrom
simon-marcus wants to merge 1 commit intoopenai:mainfrom
Conversation
7 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary — WaterLOO
val_bpb: 0.0990 (3-seed mean, std 0.00002) | ~15.87 MB artifact (15,861,664 to 15,886,796 bytes across the 3 seeds) | 144-145s eval time
This is the first of two more conservative follow-ups to our BROADSIDE record submission. If BROADSIDE was the version that sailed straight into the interpretive weather and dared anyone to blink, this submission -- WaterLOO -- is the slightly more cautious friend that can still party with the best of 'em.
BROADSIDE illuminated an important (if vulgar) reality: if you decouple the neural forward pass from the n-gram scoring and then rescore the validation stream against a complete cache, the current two-pass frontier stops looking like a frontier and starts looking like a rest stop. The awkward question, of course, is how much of that gain came from the architecture and how much came from the fact that every token, in the aggressive version, was enjoying the company of its own
(context,target)count.This submission answers that question in the cleanest way we know how: keep the fast full-cache machinery, keep the full-rescore architecture, but subtract each token's own contribution before matching and before computing its n-gram probability. In other words, the cache stays global and the scoring stays cheap, but the most obvious self-inclusion path is excised.
The result is 0.0990 BPB over three seeds. That's a touch worse than BROADSIDE's
0.0935, because of course it is; if you take away free candy, candy consumption declines. But it is still dramatically ahead of the currently visible two-pass frontier: better than PR #853's0.1315by0.0325BPB and better than PR #846's0.1435by0.0444BPB. Which is to say, this meaningfully advances the frontier.What's new
Per-seed results
Mean: 0.09898524
Std: 0.00002
Why WaterLOO
BROADSIDE built the full cache and then scored each token directly against it. That means each token's own
(context,target)occurrence was present in the relevant hash bucket when the token was rescored. WaterLOO is the same ship with one big gun removed: it keeps the global cache and the full-rescore architecture, but each token is forced to check its own contribution at the door.Concretely, in Pass 2 it performs leave-one-out scoring:
1from the token's context count1from the token's(context,target)countmin_count, entropy-adaptive alpha, and order multipliers as beforeThe practical effect is simple: the method still benefits from a globally warm cache, but each token no longer gets to vote for itself. That's a stricter and, we think, more defensible interpretation of the two-pass idea.
Why this matters
The important fact here is not that BROADSIDE got
0.0935. The important fact is that after removing the most obvious self-inclusion mechanism, the score is still0.0990.That suggests the main gain is not some tiny accounting trick, but a structural improvement:
np.bincountIn other words, the broadside still lands even after you take some of the powder out of the cannon.
Test plan
0.0990, std0.0000215.83 MBPositioning
BROADSIDE (#870) is the "push the interpretation until the interpretation pushes back" submission. This PR is the first of two more conservative submissions behind it.
The basic thesis is that the frontier has moved from whether two-pass n-gram rescoring works to how much caution you want to bake into it. BROADSIDE says: give every token the complete cache and let God (openai??) sort it out. This one says: fine, but each token has to leave its own shoes at the door.
If the organizers are comfortable with the full BROADSIDE reading, that submission should stand on its own merits. If they want a stricter posture around self-inclusion while still accepting two-pass rescoring in principle, this submission exists to show that the underlying architecture remains emphatically SOTA.