Normalized N-gram + Bayesian First-Match (val_bpb 0.3922) #972
Closed
Idan3011 wants to merge 1 commit into openai:main from
Conversation
AnirudhRahul pushed a commit to AnirudhRahul/parameter-golf that referenced this pull request on Mar 27, 2026:
Correct the eval-time n-gram posterior to normalize by the summed hashed-vocab mass and update the recorded metrics. The honest rerun lands at 1.5134 BPB, showing the earlier 0.3922 result came from the flawed normalization path. Made-with: Cursor
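The normalization fix described in that commit can be sketched as follows. This is an illustrative reconstruction, not the repository's code: the function name and the flat count list are hypothetical, and the point is only that each token's hashed count is divided by the summed count mass over the full vocabulary rather than an unnormalized per-token ratio.

```python
# Hypothetical sketch of the fix: normalize the hashed n-gram posterior by the
# summed hashed-vocab mass so the result is a proper probability distribution.
def ngram_posterior(counts_for_ctx, vocab_size=1024):
    """counts_for_ctx: hashed counts for one context, one entry per vocab token."""
    total = sum(counts_for_ctx)
    if total == 0:
        # no evidence for this context: fall back to a uniform distribution
        return [1.0 / vocab_size] * vocab_size
    return [c / total for c in counts_for_ctx]

probs = ngram_posterior([3, 1, 0, 0] + [0] * 1020)
assert abs(sum(probs) - 1.0) < 1e-9
```

Without the division by `total`, a context with inflated hashed counts can claim more than unit probability mass, which is exactly the failure mode the rerun exposed.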
Normalized N-gram + Bayesian First-Match + Pre-Enrichment + XSA
val_bpb: 0.3922 (full-vocab 1024-token normalized n-gram, Bayesian first-match, fixed 0.5 blend) | 1.1478 (sliding window) | 14.94 MB | 8xH100 SXM, 600s
Progress
v11 is intentionally higher than v10. I replaced standard single-token scoring with full-vocab 1024-token normalized distributions. The 0.12 BPB increase measures the collision premium: the portion of the n-gram gain that came from inflated pseudo-probabilities rather than genuine statistical signal.
Key Contributions
Full-Vocab 1024-Token Normalized Scoring
For each scored position and each n-gram order, look up counts for all 1024 vocabulary tokens and normalize them to sum to 1.0, instead of computing a single pair_count / ctx_count ratio for only the target token.
Bayesian First-Match with Neural Prior
p_local = (raw_correct + beta * p_neural) / (ctx_count + beta) with beta=2.0. The neural prior contributes 2 pseudo-counts, so low-evidence contexts are smoothed toward the neural prediction rather than overfit to sparse counts.
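The two steps above can be sketched together. This is an illustrative reconstruction, assuming the first-match rule from the formula is applied per vocabulary token (`bayesian_first_match` is a hypothetical name); with beta pseudo-counts of a normalized neural prior, the smoothed result still sums to 1.

```python
import numpy as np

# Sketch of Bayesian first-match with a neural prior:
#   p_local = (counts + beta * p_neural) / (ctx_count + beta)
# counts: (1024,) n-gram counts for this context; p_neural: (1024,) prior.
def bayesian_first_match(counts, p_neural, beta=2.0):
    ctx_count = counts.sum()
    # beta pseudo-counts of the neural prior dominate when counts are sparse
    return (counts + beta * p_neural) / (ctx_count + beta)

counts = np.zeros(1024)
counts[7] = 3.0                          # three observations of token 7
p_neural = np.full(1024, 1.0 / 1024)     # uniform stand-in for the neural prior
p_local = bayesian_first_match(counts, p_neural)
assert abs(p_local.sum() - 1.0) < 1e-9
```

With only 3 counts against beta=2.0 pseudo-counts, the observed token gets roughly 0.6 of the mass here; as ctx_count grows, the counts dominate the prior.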
Collision Premium Analysis
A/B Mixing Experiments
Once distributions are normalized, simple mixing outperforms sophisticated approaches.
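A minimal sketch of why the fixed blend is safe once both inputs are proper distributions: a convex combination of two distributions is itself a distribution, so no renormalization pass is needed. Names here are illustrative, not the PR's code.

```python
# Fixed-weight blend of a local n-gram distribution and a neural distribution.
# If both inputs sum to 1, the output sums to 1 for any w in [0, 1].
def blend(p_local, p_neural, w=0.5):
    return [w * a + (1 - w) * b for a, b in zip(p_local, p_neural)]

p = blend([0.7, 0.3], [0.5, 0.5])
assert abs(sum(p) - 1.0) < 1e-9
```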
Two-Phase Shared N-gram Cache
Phase 1 (parallel): each GPU scores its share of sliding windows.
Phase 2 (global): all scored data gathered, sorted by position, single global cache built sequentially.
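The two phases can be sketched schematically. This is a single-process stand-in: phase 1 is shown as a plain loop over per-GPU shards, where the real run would score shards in parallel and gather results across ranks (e.g. with torch.distributed); the scoring function and data are hypothetical.

```python
# Phase 1 (parallel in the real run): each "GPU" scores its share of windows
# and returns (position, scored_window) pairs.
def phase1_score(shard):
    return [(pos, win * 2) for pos, win in shard]   # dummy scoring

shards = [[(4, 1), (0, 2)], [(2, 3), (6, 4)]]       # two simulated GPU shards
gathered = [item for shard in shards for item in phase1_score(shard)]

# Phase 2 (global): sort all scored data by position, then build one cache
# sequentially so the n-gram statistics are order-consistent everywhere.
cache = {}
for pos, scored in sorted(gathered):
    cache[pos] = scored
```

The sequential second phase is what guarantees every position's cache entry was built from the same global ordering, which a purely per-rank build could not.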
GELU Pre-Enrichment (512→768→512)
Wider nonlinear transformation before transformer blocks.
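A hedged sketch of the 512→768→512 shape, with random weights standing in for trained ones and a tanh-approximate GELU of the kind common in GPT-style code; the PR's actual module and initialization may differ.

```python
import numpy as np

# tanh approximation of GELU
def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.02, (512, 768))   # widen: 512 -> 768
W2 = rng.normal(0.0, 0.02, (768, 512))   # project back: 768 -> 512

def pre_enrich(x):
    # wider nonlinear transform applied before the transformer blocks
    return gelu(x @ W1) @ W2

x = rng.normal(size=(4, 512))            # (batch, d_model)
assert pre_enrich(x).shape == (4, 512)
```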
XSA (Exclusive Self Attention) on Last 4 Layers
Removes self-value bias via orthogonal projection (arXiv:2603.09078, GQA-aware PR #265 @unnir).
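As a simplified illustration only: the cited XSA removes the self-value component via an orthogonal projection, while the sketch below shows the cruder cousin of that idea, masking the diagonal of the score matrix so each query ignores its own value entirely. The function name and single-head shape are assumptions for readability.

```python
import numpy as np

def exclusive_softmax_attention(q, k, v):
    # standard scaled dot-product scores
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # exclude self: each token gets -inf score against its own position
    np.fill_diagonal(scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)                             # exp(-inf) -> 0 on the diagonal
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
out = exclusive_softmax_attention(q, k, v)
```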
Additional Techniques
What Didn't Work (on valid distributions)
CTW recursive mixing: compounding across 10 orders dilutes neural signal (2.5326).
Dirichlet mixing (concentrations from #944, "Compliance-First Packed Causal Memory + Dirichlet Mixing"): tuned for the packed cache, wrong for incremental (0.3171).
Count-confidence gating (ctx/(ctx+12)): attenuates weak normalized signal (0.4942).
Count-confidence gating (ctx/(ctx+50)): too conservative, near-neural (0.7041).
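The gate that did not help can be sketched as follows. This is illustrative, assuming the ctx/(ctx+k) ratio is used as a blend weight between the normalized n-gram distribution and the neural one; with k=12 or k=50 it shrinks the already-normalized local signal too aggressively.

```python
# Count-confidence gate: weight the local distribution by g = ctx/(ctx + k),
# shrinking toward the neural distribution when counts are sparse.
def gated_blend(p_local, p_neural, ctx_count, k=12.0):
    g = ctx_count / (ctx_count + k)
    return [g * a + (1 - g) * b for a, b in zip(p_local, p_neural)]

# With only 4 counts and k=12, g = 0.25: the n-gram evidence gets a quarter
# of the weight, which attenuates a signal that is already properly normalized.
p = gated_blend([1.0, 0.0], [0.5, 0.5], ctx_count=4)
```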
Compliance
Reproduction
All defaults baked in. No env vars needed.
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
torchrun --standalone --nproc_per_node=8 train_gpt.py
8xH100 SXM, 600s training + ~193s eval.
Tunable env vars:
CTW_BETA=2.0, CTW_BLEND=0.5, NG_MIN=1
Key Metrics
Credits
Record: First Legal Sub-1.0 BPB — Multi-order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.9674, 3-seed) #727 (@Asukabot0)
Update Log