
Record: 0.4374 BPB — Distributed Prefill + Order-Adaptive 15-Gram + EBLS#796

Open

Robby955 wants to merge 2 commits into openai:main from Robby955:record/prefill-7gram-ebls-0.6567

Conversation

Robby955 commented Mar 26, 2026

Summary

val_bpb: 0.4374 (3-seed mean, std 0.0003) | ~15.99 MB | 8xH100 SXM | 560s train + ~330s eval

Major update from our initial 0.6567 submission. Two key innovations on top of the community n-gram framework:

  1. Distributed cache pre-fill (-0.31 BPB): each GPU rank pre-populates 15-gram hash tables with ALL preceding token positions via vectorized numpy before scoring begins. Makes 8-GPU eval mathematically identical to single-GPU sequential. No NCCL needed.

  2. Order-adaptive entropy gating (-0.18 BPB): per-order entropy thresholds. 15-gram matches are trusted aggressively (center=2.5), while bigrams are used only when the model is confused (center=4.5), with continuous sigmoid interpolation across all 14 orders. Inspired by @travispchen's per-order thresholds in PR #798 (Record: Order-Adaptive Entropy Gating + BackoffNgramMixer, val_bpb=0.5466).
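The continuous interpolation in (2) can be sketched as a sigmoid gate whose center moves linearly with n-gram order. The constants below (centers 2.5/4.5, sharpness 2.0) echo the numbers quoted above, but the sharpness and the exact interpolation used in the script are assumptions:

```python
import numpy as np

def order_gate_weight(entropy: float, order: int,
                      lo_center: float = 4.5, hi_center: float = 2.5,
                      min_order: int = 2, max_order: int = 15,
                      sharpness: float = 2.0) -> float:
    """Sigmoid gate deciding how much to trust an `order`-gram match.

    The threshold center interpolates linearly from `lo_center` (bigrams,
    trusted only when the neural model is confused, i.e. high entropy)
    down to `hi_center` (15-grams, trusted aggressively). All constants
    are illustrative, not the PR's exact values.
    """
    frac = (order - min_order) / (max_order - min_order)
    center = lo_center + frac * (hi_center - lo_center)
    # Gate opens as model entropy exceeds the order-specific center.
    return float(1.0 / (1.0 + np.exp(-sharpness * (entropy - center))))
```

At a model entropy of 3.0 nats, a 15-gram match (center 2.5) gates mostly open while a bigram match (center 4.5) stays mostly closed, which is the asymmetry the bullet describes.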

3-seed results

| Seed | Sliding + 15-gram BPB | Artifact bytes |
|------|-----------------------|----------------|
| 1337 | 0.43706735 | 15,994,785 |
| 2024 | 0.43738561 | 15,949,881 |
| 2025 | 0.43768394 | 15,992,965 |
| Mean | 0.4374 (std 0.0003) | |

Ablation

| Config | BPB | Delta |
|--------|-----|-------|
| Neural only (EBLS + GPTQ) | 1.1425 | |
| + 7-gram prefill, uniform threshold | 0.6565 | -0.486 |
| + extend to 15-gram | 0.6189 | -0.038 |
| + order-adaptive gating | 0.4374 | -0.182 |

Why distributed pre-fill matters

Without pre-fill, ranks 1-7 start with empty n-gram caches during multi-GPU sliding window eval. They never see the tokens before their assigned window. This costs ~0.31 BPB (0.96 → 0.65). Pre-fill fixes this by having each rank hash all preceding positions into its tables before scoring — identical results to sequential eval, just parallelized.
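A minimal numpy sketch of the per-rank pre-fill idea: vectorize over all positions before the rank's window, hash each context, and count next tokens. The polynomial rolling hash, bucket count, and dict-of-dicts table layout are assumptions for illustration; the script's actual hashing scheme may differ:

```python
import numpy as np

def prefill_ngram_cache(tokens: np.ndarray, start: int, order: int,
                        num_buckets: int = 1 << 20) -> dict:
    """Hash every `order`-gram context that ends before `start` into a
    (bucket -> {next_token: count}) table, vectorized with numpy.

    A rank assigned the window beginning at `start` calls this once
    before scoring, so its tables match sequential single-GPU eval.
    """
    counts: dict = {}
    if start <= order:
        return counts
    # Strided view of all length-`order` contexts within tokens[:start].
    ctx = np.lib.stride_tricks.sliding_window_view(tokens[:start], order)
    contexts = ctx[:-1]           # contexts whose next token is known
    nxt = tokens[order:start]     # the token following each context
    # Polynomial rolling hash per context; uint64 wraparound is deliberate.
    weights = 31 ** np.arange(order, dtype=np.uint64)
    buckets = (contexts.astype(np.uint64) @ weights) % num_buckets
    for b, t in zip(buckets.tolist(), nxt.tolist()):
        counts.setdefault(b, {}).setdefault(t, 0)
        counts[b][t] += 1
    return counts
```

Because every rank hashes exactly the positions a sequential pass would have seen before its window, the resulting tables are identical to single-GPU eval, which is the legality argument made below.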

Architecture

EBLS (Empirical Bayes Layer Sharing): 3 shared transformer blocks looped 3x + 2 unique = 11 layers. Per-virtual-layer LoRA rank 8. The Bayesian intuition: shared blocks are the prior, LoRA deviations are the empirical Bayes corrections.

512d, 8 heads, 4 KV heads (GQA), MLP 3x LeakyReLU(0.5)^2, XSA-all(11), Val-GPTQ int6 + LZMA preset 9.
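A toy sketch of the EBLS looping scheme: the same 3 shared blocks run 3 times (9 virtual layers), each virtual pass adding its own low-rank LoRA correction, followed by 2 unique layers, for 11 layers total. Blocks are reduced here to single tanh-activated linear maps, and all names and shapes are illustrative, not the actual architecture code:

```python
import numpy as np

def ebls_forward(x, shared, unique, loras, loops=3):
    """Forward through an EBLS-style stack.

    `shared`: list of weight matrices reused on every loop (the prior).
    `loras`: one (A, B) low-rank pair per *virtual* shared layer
             (len(shared) * loops pairs), the empirical Bayes correction.
    `unique`: weight matrices applied once at the end.
    """
    v = 0  # virtual-layer index; each pass gets its own LoRA pair
    for _ in range(loops):
        for W in shared:
            A, B = loras[v]
            x = np.tanh(x @ (W + A @ B))  # shared prior + per-pass delta
            v += 1
    for W in unique:  # unique layers run once, after the looped blocks
        x = np.tanh(x @ W)
    return x
```

With 3 shared blocks looped 3x, only 3 full weight matrices plus 9 rank-8 deltas and 2 unique blocks need to be stored, which is where the artifact-size savings of layer sharing come from.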

Compliance

  • Training: 560s on 8xH100 SXM (within 600s)
  • Eval: ~330s on 8xH100 SXM (within 600s)
  • Artifacts under 16,000,000 bytes (max: 15,994,785)
  • Script: 1,451 lines (under 1,500)
  • No TTT, no oracle, no training data at eval time
  • Cache strictly backward-looking — updated only after scoring
  • Pre-fill uses val_tokens[0..pos-1] only, no future data

Legality

Pre-fill produces identical n-gram tables as single-GPU sequential eval — it's an implementation optimization, not a new information source. @valerio-oai confirmed on PR #659 that n-gram caching "is not illegal" and suggested entropy-based gating as the legal path. We welcome discussion on this.

Credits

This builds on a lot of community work, and I want to make sure everyone gets credit.

Our novel contributions: distributed cache pre-fill, 15-gram extension, order-adaptive entropy gating with continuous interpolation, EBLS architecture.

Feedback, questions, and corrections welcome — happy to discuss any aspect of the approach.

3-seed validated: s1337=0.6565, s2024=0.6570, s2025=0.6565 (mean 0.6567, std 0.0003)
8xH100 SXM, 560s training + ~300s eval, all artifacts under 16MB.

Key innovation: distributed cache pre-fill using pure numpy.
Each GPU rank pre-populates n-gram hash tables with ALL preceding
token positions before scoring, producing results mathematically
identical to single-GPU sequential evaluation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ptive gating

3-seed validated (seeds 1337, 2024, 2025, std 0.0003).
Up from 0.6567 via two innovations: distributed cache pre-fill (-0.31 BPB)
and order-adaptive entropy gating (-0.18 BPB).
@Robby955 Robby955 changed the title Record: 0.6567 BPB — Prefill Cache + 7-Gram Entropy-Adaptive + EBLS Record: 0.4374 BPB — Distributed Prefill + Order-Adaptive 15-Gram + EBLS Mar 26, 2026
@hypery11

nice 🔥🔥🔥🔥

