
Record: Two-Phase Shared N-gram Cache + EMA-GPU (val_bpb=0.2995, 3-seed) #810

Open
Idan3011 wants to merge 3 commits into openai:main from Idan3011:submission

Conversation


@Idan3011 Idan3011 commented Mar 26, 2026

Two-Phase Shared N-gram Cache + EMA-GPU + Pre-Enrichment + XSA

val_bpb: 0.2995 (3-seed mean, std 0.0016) | 14.94 MB | 8xH100 SXM, 600s


3-Seed Results

| Seed | Steps | Sliding BPB | val_bpb | Artifact (bytes) |
|------|-------|-------------|---------|------------------|
| 1337 | 9,268 | 1.1478 | 0.3001 | 14,942,971 |
| 42 | 9,318 | 1.1468 | 0.2977 | 14,922,769 |
| 3011 | 9,322 | 1.1463 | 0.3008 | 14,939,305 |
| **Mean** | | 1.1470 | 0.2995 | |
| **Std** | | | 0.0016 | |

Progress

| Version | val_bpb | Eval method | Steps (600s) | Step time |
|---------|---------|-------------|--------------|-----------|
| v1 | 1.1855 | sliding | 8,004 | 75ms |
| v2 | 1.1709 | sliding | 6,423 | 93ms |
| v3 | 1.1668 | sliding | 5,373 | 112ms |
| v4 | 1.1629 | sliding | 5,636 | 106ms |
| v5 | 1.0689 | 5-gram | 9,312 | 64ms |
| v6 | 0.9784 | 2-7 backoff | 9,268 | 65ms |
| v7 | 0.9408 | 2-11 backoff | 9,268 | 65ms |
| v8 | 0.9393 | +PE conf | 9,268 | 65ms |
| v9 (this) | 0.2995 | shared cache | 9,268 | 65ms |

Key Contributions

Two-Phase N-gram Eval with Global Cache

Phase 1 (parallel, all GPUs): each GPU processes its share of sliding windows, computing model target
probabilities, entropy, and the pre-enrichment delta.

Phase 2 (global, sequential): all scored data is gathered via all_gather and sorted by position. A single
global n-gram cache is then built sequentially in 16K-token chunks, so every scored token sees the full
62M-token cache.

Verified: 1xGPU produces identical result (0.3001) — no multi-GPU artifacts.
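The score-then-update ordering of Phase 2 can be sketched in miniature. This is a pure-Python illustration, not the PR's implementation: a flat list stands in for `all_gather`, and simple unigram counts stand in for the real n-gram cache.

```python
# Miniature sketch of the two-phase eval. A flat list stands in for
# all_gather, and unigram counts stand in for the real n-gram cache.
def two_phase_eval(per_gpu_shards, chunk_size):
    # Phase 1 (parallel in the real run): each "GPU" scores its own
    # sliding windows; here each item is simply a (position, token) pair.
    scored = [item for shard in per_gpu_shards for item in shard]

    # Phase 2 (global, sequential): sort by position, then build one
    # global cache chunk by chunk, scoring BEFORE updating (score-first).
    scored.sort(key=lambda item: item[0])
    cache, results = {}, []
    for start in range(0, len(scored), chunk_size):
        chunk = scored[start:start + chunk_size]
        # score first: the cache holds only tokens from earlier chunks,
        # so a token never sees itself or anything at a later position
        results += [(pos, tok, cache.get(tok, 0)) for pos, tok in chunk]
        # ...then fold the chunk into the cache
        for _, tok in chunk:
            cache[tok] = cache.get(tok, 0) + 1
    return results
```

Note that the cache is backward-looking at chunk granularity: tokens inside one chunk do not see each other, matching the "cache updated AFTER scoring each 16K-token chunk" rule below.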

Per-Order Adaptive Alpha (primary source of BPB gain)

Each n-gram order gets its own entropy center and weight multiplier:

Per-order entropy centers (high orders trusted at lower entropy):

  {11: 2.5, 10: 2.7, 9: 3.0, 8: 3.0, 7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}

Per-order weights (high orders boosted, low orders suppressed):

  {11: 2.0, 10: 2.0, 9: 2.0, 8: 2.0, 7: 2.0, 6: 1.88, 5: 1.88, 4: 1.0, 3: 0.45, 2: 0.30}

Base alpha: 0.20 + 0.55 * sigmoid(2 * (H - center)) per order.
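Putting the formula and the two tables together, the per-order alpha can be sketched as follows (the function name is illustrative; the PR does not say whether the weighted result is clipped before mixing):

```python
import math

# Centers and weights are quoted from the PR text above.
CENTERS = {11: 2.5, 10: 2.7, 9: 3.0, 8: 3.0, 7: 3.0,
           6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}
WEIGHTS = {11: 2.0, 10: 2.0, 9: 2.0, 8: 2.0, 7: 2.0,
           6: 1.88, 5: 1.88, 4: 1.0, 3: 0.45, 2: 0.30}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def order_alpha(order, entropy):
    """Mixing weight for one n-gram order at a given model entropy H."""
    base = 0.20 + 0.55 * sigmoid(2.0 * (entropy - CENTERS[order]))
    return WEIGHTS[order] * base
```

At its entropy center each order's base alpha is 0.20 + 0.275 = 0.475; the per-order weight then boosts high orders and suppresses low ones.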

Pre-Enrichment Confidence Modulation

Uses the pre-enrichment layer's transformation magnitude as a confidence signal: a high delta indicates the
model is uncertain, so the n-gram is trusted more. Alpha is modulated by (0.5 + 1.0 * pe_conf).
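As a one-line sketch (assuming `pe_conf` is the pre-enrichment delta normalized to [0, 1]; the normalization is not specified in the PR):

```python
def modulated_alpha(alpha, pe_conf):
    # high delta -> model uncertain -> lean harder on the n-gram;
    # the multiplier ranges from 0.5 (confident) to 1.5 (uncertain)
    return alpha * (0.5 + 1.0 * pe_conf)
```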

EMA on GPU (37% faster training)

EMA state is kept on GPU during training, cutting step time to 64.7ms (vs 101ms before) and fitting 9,268
steps into 600s, i.e. 57% more gradient updates.
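The update itself is a standard exponential moving average; the speedup comes purely from keeping the EMA tensors on the same device as the parameters (in PyTorch this is typically an in-place `ema.lerp_(param, 1 - decay)` on CUDA tensors, with no host round-trip per step). A minimal stand-in with plain lists:

```python
DECAY = 0.997  # the PR's EMA decay

def ema_update(ema, params, decay=DECAY):
    # in-place: ema = decay * ema + (1 - decay) * params
    for i, p in enumerate(params):
        ema[i] = decay * ema[i] + (1.0 - decay) * p
    return ema
```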

GELU Pre-Enrichment (512→768→512)

Wider nonlinear transformation before the residual stream: embedding → BigramHash add → SmearGate →
Linear(512→768) → GELU → Linear(768→512) → RMS Norm → transformer blocks
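The widen-GELU-project-norm tail of that pipeline might look like the sketch below (dims are parameters rather than the PR's 512/768; the BigramHash add and SmearGate stages that precede it are elided, and the parameter-free RMS norm is an assumption since the PR does not say whether its norm carries a learned scale):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # parameter-free RMS norm over the last dimension
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

class PreEnrich(nn.Module):
    """Sketch of the wider GELU pre-enrichment block (512->768->512 in the PR)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.up = nn.Linear(dim, hidden)    # 512 -> 768
        self.down = nn.Linear(hidden, dim)  # 768 -> 512

    def forward(self, x):
        # widen -> GELU -> project back -> RMS norm, then into the blocks
        return rms_norm(self.down(F.gelu(self.up(x))))
```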

XSA (Exclusive Self Attention) on Last 4 Layers

Removes self-value bias via orthogonal projection (arXiv:2603.09078, GQA-aware PR #265 @unnir). Zero
parameters.
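The PR gives only a one-line description of XSA. One plausible reading of a zero-parameter, projection-based removal of self-value bias is sketched below, purely as illustration; the cited paper and PR #265 define the actual GQA-aware operation.

```python
# For one position: subtract from the attention output its component
# along the token's own value vector (an orthogonal projection).
def remove_self_value(out, v):
    """out, v: equal-length lists of floats for a single position."""
    dot = sum(o * vi for o, vi in zip(out, v))
    norm_sq = sum(vi * vi for vi in v) or 1.0  # guard zero vector
    coef = dot / norm_sq
    return [o - coef * vi for o, vi in zip(out, v)]
```

The result is orthogonal to `v` by construction, and no learned parameters are involved.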


Additional Techniques

  • SmearGate: Per-dim gate blending each token with previous token.
  • BigramHash (2048x128): Hash-table embedding for token bigrams.
  • EMA (decay=0.997): Quant gap 0.004.
  • Int6 QAT + lzma: 14.94 MB artifact.
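Of these, SmearGate is simple enough to sketch. Only the "per-dim gate blending each token with the previous token" is from the PR; the function shape and the convention that the first token passes through unchanged are assumptions.

```python
def smear_gate(tokens, gate):
    """tokens: list of per-token vectors; gate: per-dim weights in [0, 1].

    Each output token is a per-dimension blend of the current and the
    previous token: g * cur + (1 - g) * prev.
    """
    out = [tokens[0][:]]  # first token has no predecessor
    for prev, cur in zip(tokens, tokens[1:]):
        out.append([g * c + (1.0 - g) * p
                    for g, c, p in zip(gate, cur, prev)])
    return out
```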

Architecture: 10L, 512d, 8H/4KV GQA, MLP 3x, tied embeddings, U-Net skip connections. Training: Muon+AdamW,
WD=0.04, matrix_lr=0.025, warmdown=3500, batch=524K, seq=2048.
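Collected as an illustrative config dict (key names are assumptions; values are taken verbatim from the line above):

```python
CONFIG = {
    "n_layer": 10,
    "d_model": 512,
    "n_head": 8,
    "n_kv_head": 4,            # GQA: 8 query heads, 4 KV heads
    "mlp_mult": 3,
    "tied_embeddings": True,
    "unet_skips": True,        # U-Net style skip connections
    "optimizers": ("Muon", "AdamW"),
    "weight_decay": 0.04,
    "matrix_lr": 0.025,
    "warmdown_steps": 3500,
    "batch": "524K",           # as stated in the PR
    "seq_len": 2048,
}
```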


What Didn't Work

  • Log-odds mixing: n-gram probabilities near zero create catastrophic logits.
  • SSE post-correction: Online bias learning always pushes predictions toward 1.0.
  • BigramHash confidence signal: Embedding norm didn't correlate with prediction accuracy.
  • Orders 12-13: No improvement over 2-11.
  • Single entropy-adaptive alpha for all orders: per-order centers and weights are critical.
  • Frontier stack (LeakyReLU², Partial RoPE, LN Scale): stacked together, these caused a regression.
  • Encoder recurrence: 900x quant error amplification.

Compliance

  • Score-first: n-gram cache updated AFTER scoring each 16K-token chunk
  • Backward-looking: cache at position p contains only tokens 0..p-1
  • No oracle selection: alpha depends on model entropy and n-gram order, never on ground truth
  • No training data access during eval
  • Verified: 1xGPU result identical to 8xGPU (0.3001)
  • All artifacts under 16,000,000 bytes
  • GPTQ calibration within training budget

Reproduction

All defaults baked in. No env vars needed (SEED=1337 default).

  python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
  torchrun --standalone --nproc_per_node=8 train_gpt.py

8xH100 SXM, 600s training + ~204s eval. For other seeds: SEED=42, SEED=3011.


Key Metrics

| Metric | Value |
|--------|-------|
| val_bpb (3-seed mean) | 0.2995 |
| val_bpb std | 0.0016 |
| Sliding-window val_bpb | 1.1470 |
| Post-quant val_bpb (standard) | 1.1690 |
| Pre-quant val_bpb | 1.1646 |
| Quant gap | 0.004 |
| Training time | 600s (9,268 steps at 64.7ms) |
| Eval time | ~204s |
| Peak memory | 13,058 MiB |
| Artifact size | ~14.94 MB |
| Model parameters | 25,254,992 |

Credits


Update Log

  • v1 (1.1855): int8+zlib, MLP 2x, seq 1024
  • v2 (1.1709): int6 QAT + lzma, MLP 3x, SWA, seq 2048
  • v3 (1.1668): + SmearGate + BigramHash + EMA + wider pre-enrichment
  • v4 (1.1629): + XSA on last 4 layers
  • v5 (1.0689): + EMA on GPU (64ms/step) + 5-gram eval cache
  • v6 (0.9784): + multi-order backoff 2-7 + entropy-adaptive alpha
  • v7 (0.9408): + extended to orders 2-11 + steeper alpha
  • v8 (0.9393): + pre-enrichment confidence modulation
  • v9 (0.2995): + two-phase shared cache + per-order adaptive alpha (3-seed validated)
