
Record: Two-Phase Shared N-gram Cache + EMA-GPU (val_bpb=0.2995, 3-seed) #810

Open
Idan3011 wants to merge 3 commits into openai:main from Idan3011:submission

Conversation


@Idan3011 Idan3011 commented Mar 26, 2026

Two-Phase Shared N-gram Cache + EMA-GPU + Pre-Enrichment + XSA

val_bpb: 0.2995 (3-seed mean, std 0.0016) | 14.94 MB | 8xH100 SXM, 600s


3-Seed Results

| Seed | Steps | Sliding BPB | val_bpb | Artifact (bytes) |
|------|-------|-------------|---------|------------------|
| 1337 | 9,268 | 1.1478 | 0.3001 | 14,942,971 |
| 42 | 9,318 | 1.1468 | 0.2977 | 14,922,769 |
| 3011 | 9,322 | 1.1463 | 0.3008 | 14,939,305 |
| **Mean** | | 1.1470 | 0.2995 | |
| **Std** | | | 0.0016 | |

Progress

| Version | val_bpb | Eval method | Steps (600s) | Step time |
|---------|---------|-------------|--------------|-----------|
| v1 | 1.1855 | sliding | 8,004 | 75ms |
| v2 | 1.1709 | sliding | 6,423 | 93ms |
| v3 | 1.1668 | sliding | 5,373 | 112ms |
| v4 | 1.1629 | sliding | 5,636 | 106ms |
| v5 | 1.0689 | 5-gram | 9,312 | 64ms |
| v6 | 0.9784 | 2-7 backoff | 9,268 | 65ms |
| v7 | 0.9408 | 2-11 backoff | 9,268 | 65ms |
| v8 | 0.9393 | +PE conf | 9,268 | 65ms |
| v9 (this) | 0.2995 | shared cache | 9,268 | 65ms |

Key Contributions

Two-Phase N-gram Eval with Global Cache

Phase 1 (parallel, all GPUs): each GPU processes its share of sliding windows, computing model target
probabilities, entropy, and the pre-enrichment delta.

Phase 2 (global, sequential): all scored data is gathered via all_gather and sorted by position. A single
global n-gram cache is then built sequentially in 16K-token chunks, so every scored token sees the full
62M-token cache.

Verified: 1xGPU produces identical result (0.3001) — no multi-GPU artifacts.
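The score-then-update ordering of Phase 2 can be sketched in miniature. This is a pure-Python illustration, not the PR's implementation: a flat list stands in for `all_gather`, and simple unigram counts stand in for the real n-gram cache.

```python
# Miniature sketch of the two-phase eval. A flat list stands in for
# all_gather, and unigram counts stand in for the real n-gram cache.
def two_phase_eval(per_gpu_shards, chunk_size):
    # Phase 1 (parallel in the real run): each "GPU" scores its own
    # sliding windows; here each item is simply a (position, token) pair.
    scored = [item for shard in per_gpu_shards for item in shard]

    # Phase 2 (global, sequential): sort by position, then build one
    # global cache chunk by chunk, scoring BEFORE updating (score-first).
    scored.sort(key=lambda item: item[0])
    cache, results = {}, []
    for start in range(0, len(scored), chunk_size):
        chunk = scored[start:start + chunk_size]
        # score first: the cache holds only tokens from earlier chunks,
        # so a token never sees itself or anything at a later position
        results += [(pos, tok, cache.get(tok, 0)) for pos, tok in chunk]
        # ...then fold the chunk into the cache
        for _, tok in chunk:
            cache[tok] = cache.get(tok, 0) + 1
    return results
```

Note that the cache is backward-looking at chunk granularity: tokens inside one chunk do not see each other, matching the "cache updated AFTER scoring each 16K-token chunk" rule below.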

Per-Order Adaptive Alpha (primary source of BPB gain)

Each n-gram order gets its own entropy center and weight multiplier:

Per-order entropy centers (high orders trusted at lower entropy):

  {11: 2.5, 10: 2.7, 9: 3.0, 8: 3.0, 7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}

Per-order weights (high orders boosted, low orders suppressed):

  {11: 2.0, 10: 2.0, 9: 2.0, 8: 2.0, 7: 2.0, 6: 1.88, 5: 1.88, 4: 1.0, 3: 0.45, 2: 0.30}

Base alpha: 0.20 + 0.55 * sigmoid(2 * (H - center)) per order.
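Putting the formula and the two tables together, the per-order alpha can be sketched as follows (the function name is illustrative; the PR does not say whether the weighted result is clipped before mixing):

```python
import math

# Centers and weights are quoted from the PR text above.
CENTERS = {11: 2.5, 10: 2.7, 9: 3.0, 8: 3.0, 7: 3.0,
           6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}
WEIGHTS = {11: 2.0, 10: 2.0, 9: 2.0, 8: 2.0, 7: 2.0,
           6: 1.88, 5: 1.88, 4: 1.0, 3: 0.45, 2: 0.30}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def order_alpha(order, entropy):
    """Mixing weight for one n-gram order at a given model entropy H."""
    base = 0.20 + 0.55 * sigmoid(2.0 * (entropy - CENTERS[order]))
    return WEIGHTS[order] * base
```

At its entropy center each order's base alpha is 0.20 + 0.275 = 0.475; the per-order weight then boosts high orders and suppresses low ones.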

Pre-Enrichment Confidence Modulation

Uses the pre-enrichment layer's transformation magnitude as a confidence signal: a high delta indicates the
model is uncertain, so the n-gram is trusted more. Alpha is modulated by (0.5 + 1.0 * pe_conf).
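As a one-line sketch (assuming `pe_conf` is the pre-enrichment delta normalized to [0, 1]; the normalization is not specified in the PR):

```python
def modulated_alpha(alpha, pe_conf):
    # high delta -> model uncertain -> lean harder on the n-gram;
    # the multiplier ranges from 0.5 (confident) to 1.5 (uncertain)
    return alpha * (0.5 + 1.0 * pe_conf)
```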

EMA on GPU (37% faster training)

EMA state is kept on GPU during training, cutting step time to 64.7ms (vs 101ms before) and fitting 9,268
steps into 600s, i.e. 57% more gradient updates.
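The update itself is a standard exponential moving average; the speedup comes purely from keeping the EMA tensors on the same device as the parameters (in PyTorch this is typically an in-place `ema.lerp_(param, 1 - decay)` on CUDA tensors, with no host round-trip per step). A minimal stand-in with plain lists:

```python
DECAY = 0.997  # the PR's EMA decay

def ema_update(ema, params, decay=DECAY):
    # in-place: ema = decay * ema + (1 - decay) * params
    for i, p in enumerate(params):
        ema[i] = decay * ema[i] + (1.0 - decay) * p
    return ema
```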

GELU Pre-Enrichment (512→768→512)

Wider nonlinear transformation before the residual stream: embedding → BigramHash add → SmearGate →
Linear(512→768) → GELU → Linear(768→512) → RMS Norm → transformer blocks
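The widen-GELU-project-norm tail of that pipeline might look like the sketch below (dims are parameters rather than the PR's 512/768; the BigramHash add and SmearGate stages that precede it are elided, and the parameter-free RMS norm is an assumption since the PR does not say whether its norm carries a learned scale):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # parameter-free RMS norm over the last dimension
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

class PreEnrich(nn.Module):
    """Sketch of the wider GELU pre-enrichment block (512->768->512 in the PR)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.up = nn.Linear(dim, hidden)    # 512 -> 768
        self.down = nn.Linear(hidden, dim)  # 768 -> 512

    def forward(self, x):
        # widen -> GELU -> project back -> RMS norm, then into the blocks
        return rms_norm(self.down(F.gelu(self.up(x))))
```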

XSA (Exclusive Self Attention) on Last 4 Layers

Removes self-value bias via orthogonal projection (arXiv:2603.09078, GQA-aware PR #265 @unnir). Zero
parameters.
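The PR gives only a one-line description of XSA. One plausible reading of a zero-parameter, projection-based removal of self-value bias is sketched below, purely as illustration; the cited paper and PR #265 define the actual GQA-aware operation.

```python
# For one position: subtract from the attention output its component
# along the token's own value vector (an orthogonal projection).
def remove_self_value(out, v):
    """out, v: equal-length lists of floats for a single position."""
    dot = sum(o * vi for o, vi in zip(out, v))
    norm_sq = sum(vi * vi for vi in v) or 1.0  # guard zero vector
    coef = dot / norm_sq
    return [o - coef * vi for o, vi in zip(out, v)]
```

The result is orthogonal to `v` by construction, and no learned parameters are involved.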


Additional Techniques

  • SmearGate: Per-dim gate blending each token with previous token.
  • BigramHash (2048x128): Hash-table embedding for token bigrams.
  • EMA (decay=0.997): Quant gap 0.004.
  • Int6 QAT + lzma: 14.94 MB artifact.
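Of these, SmearGate is simple enough to sketch. Only the "per-dim gate blending each token with the previous token" is from the PR; the function shape and the convention that the first token passes through unchanged are assumptions.

```python
def smear_gate(tokens, gate):
    """tokens: list of per-token vectors; gate: per-dim weights in [0, 1].

    Each output token is a per-dimension blend of the current and the
    previous token: g * cur + (1 - g) * prev.
    """
    out = [tokens[0][:]]  # first token has no predecessor
    for prev, cur in zip(tokens, tokens[1:]):
        out.append([g * c + (1.0 - g) * p
                    for g, c, p in zip(gate, cur, prev)])
    return out
```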

Architecture: 10L, 512d, 8H/4KV GQA, MLP 3x, tied embeddings, U-Net skip connections. Training: Muon+AdamW,
WD=0.04, matrix_lr=0.025, warmdown=3500, batch=524K, seq=2048.
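Collected as an illustrative config dict (key names are assumptions; values are taken verbatim from the line above):

```python
CONFIG = {
    "n_layer": 10,
    "d_model": 512,
    "n_head": 8,
    "n_kv_head": 4,            # GQA: 8 query heads, 4 KV heads
    "mlp_mult": 3,
    "tied_embeddings": True,
    "unet_skips": True,        # U-Net style skip connections
    "optimizers": ("Muon", "AdamW"),
    "weight_decay": 0.04,
    "matrix_lr": 0.025,
    "warmdown_steps": 3500,
    "batch": "524K",           # as stated in the PR
    "seq_len": 2048,
}
```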


What Didn't Work

  • Log-odds mixing: n-gram probabilities near zero create catastrophic logits.
  • SSE post-correction: Online bias learning always pushes predictions toward 1.0.
  • BigramHash confidence signal: Embedding norm didn't correlate with prediction accuracy.
  • Orders 12-13: No improvement over 2-11.
  • Single entropy-adaptive alpha for all orders: per-order centers and weights are critical.
  • Frontier stack (LeakyReLU², Partial RoPE, LN Scale): stacked together, these caused a regression.
  • Encoder recurrence: 900x quant error amplification.

Compliance

  • Score-first: n-gram cache updated AFTER scoring each 16K-token chunk
  • Backward-looking: cache at position p contains only tokens 0..p-1
  • No oracle selection: alpha depends on model entropy and n-gram order, never on ground truth
  • No training data access during eval
  • Verified: 1xGPU result identical to 8xGPU (0.3001)
  • All artifacts under 16,000,000 bytes
  • GPTQ calibration within training budget

Reproduction

All defaults baked in. No env vars needed (SEED=1337 default).

  python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
  torchrun --standalone --nproc_per_node=8 train_gpt.py

8xH100 SXM, 600s training + ~204s eval. For other seeds: SEED=42, SEED=3011.


Key Metrics

| Metric | Value |
|--------|-------|
| val_bpb (3-seed mean) | 0.2995 |
| val_bpb std | 0.0016 |
| Sliding-window val_bpb | 1.1470 |
| Post-quant val_bpb (standard) | 1.1690 |
| Pre-quant val_bpb | 1.1646 |
| Quant gap | 0.004 |
| Training time | 600s (9,268 steps at 64.7ms) |
| Eval time | ~204s |
| Peak memory | 13,058 MiB |
| Artifact size | ~14.94 MB |
| Model parameters | 25,254,992 |

Credits


Update Log

  • v1 (1.1855): int8+zlib, MLP 2x, seq 1024
  • v2 (1.1709): int6 QAT + lzma, MLP 3x, SWA, seq 2048
  • v3 (1.1668): + SmearGate + BigramHash + EMA + wider pre-enrichment
  • v4 (1.1629): + XSA on last 4 layers
  • v5 (1.0689): + EMA on GPU (64ms/step) + 5-gram eval cache
  • v6 (0.9784): + multi-order backoff 2-7 + entropy-adaptive alpha
  • v7 (0.9408): + extended to orders 2-11 + steeper alpha
  • v8 (0.9393): + pre-enrichment confidence modulation
  • v9 (0.2995): + two-phase shared cache + per-order adaptive alpha (3-seed validated)
