
Normalized N-gram + Bayesian First-Match (val_bpb 0.3922)#972

Closed
Idan3011 wants to merge 1 commit into openai:main from Idan3011:normalized-ngram

Conversation


@Idan3011 Idan3011 commented Mar 27, 2026

Normalized N-gram + Bayesian First-Match + Pre-Enrichment + XSA

val_bpb: 0.3922 (full-vocab 1024-token normalized n-gram, Bayesian first-match, fixed 0.5 blend) | 1.1478 (sliding window) | 14.94 MB | 8xH100 SXM, 600s

Progress

| Version | val_bpb | Eval method |
|---|---|---|
| v1 | 1.1855 | sliding |
| v2 | 1.1709 | sliding |
| v3 | 1.1668 | sliding |
| v4 | 1.1629 | sliding |
| v5 | 1.0689 | 5-gram |
| v6 | 0.9784 | 2-7 backoff |
| v7 | 0.9408 | 2-11 backoff |
| v8 | 0.9393 | +PE conf |
| v9 | 0.2995 | shared cache |
| v10 | 0.2722 | +phrase cache |
| v11 (this) | 0.3922 | normalized |

v11 is intentionally higher than v10. I replaced standard single-token scoring with
full-vocab 1024-token normalized distributions. The 0.12 bpb increase measures the
collision premium: the portion of the n-gram gain that came from inflated
pseudo-probabilities rather than genuine statistical signal.

Key Contributions

Full-Vocab 1024-Token Normalized Scoring
For each scored position and each n-gram order, look up counts for all 1024
vocabulary tokens and normalize to sum to 1.0, instead of computing a single
pair_count / ctx_count ratio for only the target token.

  • Vectorized [chunk, 1024] gather per order — GPU stays saturated
  • First-match-wins backoff: orders 11→10→...→2, highest match wins
  • Score-first: tables updated AFTER scoring each chunk
  • Phase 2 eval time: ~193s
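
A minimal sketch of what the bullets above describe, in plain numpy (the table layout, function names, and the uniform fallback are illustrative assumptions; the PR's actual implementation is a vectorized GPU gather):

```python
import numpy as np

VOCAB = 1024

def normalized_ngram_scores(count_table, contexts):
    """Full-vocab normalized scoring for one chunk of positions.

    count_table: dict mapping a context tuple -> np.array of shape (VOCAB,)
                 holding raw next-token counts.
    contexts:    list of context tuples for the chunk.

    Returns a [chunk, VOCAB] array of distributions (each matched row sums
    to 1.0) plus a match mask, instead of a single pair_count / ctx_count
    ratio for only the target token.
    """
    chunk = len(contexts)
    probs = np.zeros((chunk, VOCAB))
    matched = np.zeros(chunk, dtype=bool)
    for i, ctx in enumerate(contexts):
        counts = count_table.get(ctx)
        if counts is not None and counts.sum() > 0:
            probs[i] = counts / counts.sum()  # normalize over the full vocab
            matched[i] = True
    return probs, matched

def backoff_score(tables_by_order, token_history):
    """First-match-wins backoff: try orders 11 -> 2; highest match wins."""
    for order in range(11, 1, -1):
        ctx = tuple(token_history[-(order - 1):])
        counts = tables_by_order.get(order, {}).get(ctx)
        if counts is not None and counts.sum() > 0:
            return counts / counts.sum(), order
    return np.full(VOCAB, 1.0 / VOCAB), 0  # no match at any order: uniform
```

Score-first ordering then just means calling these before the chunk's own tokens are added to `count_table`.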

Bayesian First-Match with Neural Prior
p_local = (raw_correct + beta * p_neural) / (ctx_count + beta) with beta=2.0.
Neural prior contributes 2 pseudo-counts. Low-evidence contexts smoothed toward
neural prediction rather than overfit to sparse counts.
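
The formula above, written out (function name is mine):

```python
def bayesian_first_match(raw_correct, ctx_count, p_neural, beta=2.0):
    """Smooth a sparse n-gram estimate toward the neural prediction.

    The neural prior contributes `beta` pseudo-counts, so a context seen
    only once is pulled strongly toward p_neural, while a context with
    hundreds of observations is dominated by its empirical counts.
    """
    return (raw_correct + beta * p_neural) / (ctx_count + beta)
```

For example, with one observation (`raw_correct=1`, `ctx_count=1`) and `p_neural=0.1`, the estimate is `(1 + 0.2) / 3 = 0.4` rather than the overconfident raw ratio of 1.0.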

Collision Premium Analysis

A/B Mixing Experiments

| Config | val_bpb | Finding |
|---|---|---|
| Fixed 0.5 blend | 0.3922 | Best; less gating = better |
| Count-confidence (gain=12) | 0.4942 | Confidence gating attenuates real signal |
| Count-confidence (gain=50) | 0.7041 | Too conservative |
| Dirichlet mixing (#944 style) | 0.3171 | Wrong for incremental cache |
| CTW recursive (10 orders) | 2.5326 | Compounding across orders kills neural signal |

Once distributions are normalized, simple mixing outperforms sophisticated approaches.
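
The winning fixed blend is just a constant-weight mixture; a sketch (assuming both inputs are already normalized, as in the table above):

```python
import numpy as np

def blend(p_ngram, p_neural, w=0.5):
    """Fixed-weight mixture of two normalized distributions.

    Because both inputs already sum to 1.0, the result is a valid
    distribution for any w in [0, 1]; no confidence gating is applied.
    """
    return w * p_ngram + (1.0 - w) * p_neural
```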

Two-Phase Shared N-gram Cache
Phase 1 (parallel): each GPU scores its share of sliding windows.
Phase 2 (global): all scored data gathered, sorted by position, single global cache built sequentially.
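
Phase 2 can be sketched as a gather-sort-replay (record layout and names are illustrative; the real implementation operates on GPU tensors):

```python
def build_global_cache(per_gpu_results):
    """Gather per-GPU phase-1 records and build one sequential cache.

    per_gpu_results: list (one entry per GPU) of lists of
                     (position, context, next_token) records.

    Sorting by position before replay makes the cache identical to one
    built in a single sequential pass over the whole validation stream.
    """
    gathered = [rec for shard in per_gpu_results for rec in shard]
    gathered.sort(key=lambda rec: rec[0])        # restore global order
    cache = {}
    for _pos, ctx, tok in gathered:
        cache.setdefault(ctx, {})
        cache[ctx][tok] = cache[ctx].get(tok, 0) + 1
    return cache
```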

GELU Pre-Enrichment (512→768→512)
Wider nonlinear transformation before transformer blocks.
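
A minimal numpy sketch of the expand-contract shape (weight initialization and names are illustrative; the trained module would also carry biases and learned weights):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def pre_enrich(x, w_up, w_down):
    """512 -> 768 -> 512 nonlinear expansion applied before the blocks."""
    return gelu(x @ w_up) @ w_down

rng = np.random.default_rng(0)
w_up = rng.standard_normal((512, 768)) * 0.02
w_down = rng.standard_normal((768, 512)) * 0.02
tokens = rng.standard_normal((4, 512))   # a tiny batch of embeddings
out = pre_enrich(tokens, w_up, w_down)   # shape (4, 512), same as input
```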

XSA (Exclusive Self Attention) on Last 4 Layers
Removes self-value bias via orthogonal projection (arXiv:2603.09078, GQA-aware PR #265 @unnir).
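
One plausible reading of "removes self-value bias via orthogonal projection" (the cited paper and PR define the exact mechanism; this sketch only illustrates the projection step, with names of my own choosing):

```python
import numpy as np

def xsa_project(out, v):
    """Project each attention output orthogonal to its own value vector.

    out, v: [seq, d] arrays. Row i of the result has (approximately) zero
    component along v_i, removing the token's self-value contribution.
    """
    v_hat = v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    coeff = np.sum(out * v_hat, axis=-1, keepdims=True)
    return out - coeff * v_hat
```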

Additional Techniques

  • SmearGate: Per-dim gate blending each token with previous token.
  • BigramHash (2048x128): Hash-table embedding for token bigrams.
  • EMA (decay=0.997) on GPU: 37% faster training (64.7ms vs 101ms/step).
  • Int6 QAT + lzma: 14.94 MB artifact.
  • Architecture: 10L, 512d, 8H/4KV GQA, MLP 3x, tied embeddings, U-Net skip connections.
  • Training: Muon+AdamW, WD=0.04, matrix_lr=0.025, warmdown=3500, batch=524K, seq=2048.
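
The SmearGate bullet above can be sketched as a one-line mixing rule (parameter name is illustrative):

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Per-dim gate blending each token with the previous token.

    x:           [seq, d] token activations.
    gate_logits: [d] learned per-dimension parameters.
    Position 0 has no predecessor, so it passes through unchanged.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))      # sigmoid gate in (0, 1)
    prev = np.vstack([x[:1], x[:-1]])           # shift right by one token
    return g * x + (1.0 - g) * prev
```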

What Didn't Work (on valid distributions)

@Idan3011 Idan3011 force-pushed the normalized-ngram branch 2 times, most recently from 68dfd02 to a999142 Compare March 27, 2026 18:44
AnirudhRahul pushed a commit to AnirudhRahul/parameter-golf that referenced this pull request Mar 27, 2026
Correct the eval-time n-gram posterior to normalize by the summed hashed-vocab mass and update the recorded metrics. The honest rerun lands at 1.5134 BPB, showing the earlier 0.3922 result came from the flawed normalization path.

Made-with: Cursor

AnirudhRahul commented Mar 27, 2026

#978
^ Reran this and I think the bpb results are off because your target distribution wasn't normalized correctly

@Idan3011 Idan3011 closed this Mar 27, 2026
@Idan3011 Idan3011 deleted the normalized-ngram branch March 27, 2026 19:58