
Normalized N-gram + Bayesian First-Match (val_bpb 0.3922)#972

Closed
Idan3011 wants to merge 1 commit into openai:main from Idan3011:normalized-ngram

Conversation


@Idan3011 Idan3011 commented Mar 27, 2026

Normalized N-gram + Bayesian First-Match + Pre-Enrichment + XSA

val_bpb: 0.3922 (full-vocab 1024-token normalized n-gram, Bayesian first-match, fixed 0.5 blend) | 1.1478 (sliding window) | 14.94 MB | 8xH100 SXM, 600s

Progress

| Version | val_bpb | Eval method |
|---|---|---|
| v1 | 1.1855 | sliding |
| v2 | 1.1709 | sliding |
| v3 | 1.1668 | sliding |
| v4 | 1.1629 | sliding |
| v5 | 1.0689 | 5-gram |
| v6 | 0.9784 | 2-7 backoff |
| v7 | 0.9408 | 2-11 backoff |
| v8 | 0.9393 | +PE conf |
| v9 | 0.2995 | shared cache |
| v10 | 0.2722 | +phrase cache |
| v11 (this) | 0.3922 | normalized |

v11 is intentionally higher than v10. I replaced standard single-token scoring with
full-vocab 1024-token normalized distributions. The 0.12 bpb increase measures the
collision premium: the portion of the n-gram gain that came from inflated
pseudo-probabilities rather than genuine statistical signal.

Key Contributions

Full-Vocab 1024-Token Normalized Scoring
For each scored position and each n-gram order, look up counts for all 1024
vocabulary tokens and normalize to sum to 1.0, instead of computing a single
pair_count / ctx_count ratio for only the target token.

  • Vectorized [chunk, 1024] gather per order — GPU stays saturated
  • First-match-wins backoff: orders 11→10→...→2, highest match wins
  • Score-first: tables updated AFTER scoring each chunk
  • Phase 2 eval time: ~193s
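
A minimal sketch of what the bullets above describe, in plain numpy (the table layout, function names, and the uniform fallback are illustrative assumptions; the PR's actual implementation is a vectorized GPU gather):

```python
import numpy as np

VOCAB = 1024

def normalized_ngram_scores(count_table, contexts):
    """Full-vocab normalized scoring for one chunk of positions.

    count_table: dict mapping a context tuple -> np.array of shape (VOCAB,)
                 holding raw next-token counts.
    contexts:    list of context tuples for the chunk.

    Returns a [chunk, VOCAB] array of distributions (each matched row sums
    to 1.0) plus a match mask, instead of a single pair_count / ctx_count
    ratio for only the target token.
    """
    chunk = len(contexts)
    probs = np.zeros((chunk, VOCAB))
    matched = np.zeros(chunk, dtype=bool)
    for i, ctx in enumerate(contexts):
        counts = count_table.get(ctx)
        if counts is not None and counts.sum() > 0:
            probs[i] = counts / counts.sum()  # normalize over the full vocab
            matched[i] = True
    return probs, matched

def backoff_score(tables_by_order, token_history):
    """First-match-wins backoff: try orders 11 -> 2; highest match wins."""
    for order in range(11, 1, -1):
        ctx = tuple(token_history[-(order - 1):])
        counts = tables_by_order.get(order, {}).get(ctx)
        if counts is not None and counts.sum() > 0:
            return counts / counts.sum(), order
    return np.full(VOCAB, 1.0 / VOCAB), 0  # no match at any order: uniform
```

Score-first ordering then just means calling these before the chunk's own tokens are added to `count_table`.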

Bayesian First-Match with Neural Prior
p_local = (raw_correct + beta * p_neural) / (ctx_count + beta) with beta=2.0.
Neural prior contributes 2 pseudo-counts. Low-evidence contexts smoothed toward
neural prediction rather than overfit to sparse counts.
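
The formula above, written out (function name is mine):

```python
def bayesian_first_match(raw_correct, ctx_count, p_neural, beta=2.0):
    """Smooth a sparse n-gram estimate toward the neural prediction.

    The neural prior contributes `beta` pseudo-counts, so a context seen
    only once is pulled strongly toward p_neural, while a context with
    hundreds of observations is dominated by its empirical counts.
    """
    return (raw_correct + beta * p_neural) / (ctx_count + beta)
```

For example, with one observation (`raw_correct=1`, `ctx_count=1`) and `p_neural=0.1`, the estimate is `(1 + 0.2) / 3 = 0.4` rather than the overconfident raw ratio of 1.0.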

Collision Premium Analysis

A/B Mixing Experiments

| Config | val_bpb | Finding |
|---|---|---|
| Fixed 0.5 blend | 0.3922 | Best; less gating = better |
| Count-confidence (gain=12) | 0.4942 | Confidence gating attenuates real signal |
| Count-confidence (gain=50) | 0.7041 | Too conservative |
| Dirichlet mixing (#944 style) | 0.3171 | Wrong for incremental cache |
| CTW recursive (10 orders) | 2.5326 | Compounding across orders kills neural signal |

Once distributions are normalized, simple mixing outperforms sophisticated approaches.
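
The winning fixed blend is just a constant-weight mixture; a sketch (assuming both inputs are already normalized, as in the table above):

```python
import numpy as np

def blend(p_ngram, p_neural, w=0.5):
    """Fixed-weight mixture of two normalized distributions.

    Because both inputs already sum to 1.0, the result is a valid
    distribution for any w in [0, 1]; no confidence gating is applied.
    """
    return w * p_ngram + (1.0 - w) * p_neural
```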

Two-Phase Shared N-gram Cache
Phase 1 (parallel): each GPU scores its share of sliding windows.
Phase 2 (global): all scored data gathered, sorted by position, single global cache built sequentially.
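
Phase 2 can be sketched as a gather-sort-replay (record layout and names are illustrative; the real implementation operates on GPU tensors):

```python
def build_global_cache(per_gpu_results):
    """Gather per-GPU phase-1 records and build one sequential cache.

    per_gpu_results: list (one entry per GPU) of lists of
                     (position, context, next_token) records.

    Sorting by position before replay makes the cache identical to one
    built in a single sequential pass over the whole validation stream.
    """
    gathered = [rec for shard in per_gpu_results for rec in shard]
    gathered.sort(key=lambda rec: rec[0])        # restore global order
    cache = {}
    for _pos, ctx, tok in gathered:
        cache.setdefault(ctx, {})
        cache[ctx][tok] = cache[ctx].get(tok, 0) + 1
    return cache
```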

GELU Pre-Enrichment (512→768→512)
Wider nonlinear transformation before transformer blocks.
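
A minimal numpy sketch of the expand-contract shape (weight initialization and names are illustrative; the trained module would also carry biases and learned weights):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def pre_enrich(x, w_up, w_down):
    """512 -> 768 -> 512 nonlinear expansion applied before the blocks."""
    return gelu(x @ w_up) @ w_down

rng = np.random.default_rng(0)
w_up = rng.standard_normal((512, 768)) * 0.02
w_down = rng.standard_normal((768, 512)) * 0.02
tokens = rng.standard_normal((4, 512))   # a tiny batch of embeddings
out = pre_enrich(tokens, w_up, w_down)   # shape (4, 512), same as input
```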

XSA (Exclusive Self Attention) on Last 4 Layers
Removes self-value bias via orthogonal projection (arXiv:2603.09078, GQA-aware PR #265 @unnir).
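
One plausible reading of "removes self-value bias via orthogonal projection" (the cited paper and PR define the exact mechanism; this sketch only illustrates the projection step, with names of my own choosing):

```python
import numpy as np

def xsa_project(out, v):
    """Project each attention output orthogonal to its own value vector.

    out, v: [seq, d] arrays. Row i of the result has (approximately) zero
    component along v_i, removing the token's self-value contribution.
    """
    v_hat = v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
    coeff = np.sum(out * v_hat, axis=-1, keepdims=True)
    return out - coeff * v_hat
```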

Additional Techniques

  • SmearGate: Per-dim gate blending each token with previous token.
  • BigramHash (2048x128): Hash-table embedding for token bigrams.
  • EMA (decay=0.997) on GPU: 37% faster training (64.7ms vs 101ms/step).
  • Int6 QAT + lzma: 14.94 MB artifact.
  • Architecture: 10L, 512d, 8H/4KV GQA, MLP 3x, tied embeddings, U-Net skip connections.
  • Training: Muon+AdamW, WD=0.04, matrix_lr=0.025, warmdown=3500, batch=524K, seq=2048.
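
The SmearGate bullet above can be sketched as a one-line mixing rule (parameter name is illustrative):

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Per-dim gate blending each token with the previous token.

    x:           [seq, d] token activations.
    gate_logits: [d] learned per-dimension parameters.
    Position 0 has no predecessor, so it passes through unchanged.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))      # sigmoid gate in (0, 1)
    prev = np.vstack([x[:1], x[:-1]])           # shift right by one token
    return g * x + (1.0 - g) * prev
```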

What Didn't Work (on valid distributions)

@Idan3011 Idan3011 force-pushed the normalized-ngram branch 2 times, most recently from 68dfd02 to a999142 Compare March 27, 2026 18:44
AnirudhRahul pushed a commit to AnirudhRahul/parameter-golf that referenced this pull request Mar 27, 2026
Correct the eval-time n-gram posterior to normalize by the summed hashed-vocab mass and update the recorded metrics. The honest rerun lands at 1.5134 BPB, showing the earlier 0.3922 result came from the flawed normalization path.

Made-with: Cursor

AnirudhRahul commented Mar 27, 2026

#978
^ Reran this and I think the bpb results are off because your target distribution wasn't normalized correctly

@Idan3011 Idan3011 closed this Mar 27, 2026
@Idan3011 Idan3011 deleted the normalized-ngram branch March 27, 2026 19:58