Record: 0.4374 BPB — Distributed Prefill + Order-Adaptive 15-Gram + EBLS#796
Open
Robby955 wants to merge 2 commits into openai:main from
Conversation
3-seed validated: s1337=0.6565, s2024=0.6570, s2025=0.6565 (mean 0.6567, std 0.0003). 8xH100 SXM, 560s training + ~300s eval, all artifacts under 16 MB. Key innovation: distributed cache pre-fill using pure numpy. Each GPU rank pre-populates its n-gram hash tables with all preceding token positions before scoring, producing results mathematically identical to single-GPU sequential evaluation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ptive gating 3-seed validated (seeds 1337, 2024, 2025, std 0.0003). Up from 0.6567 via two innovations: distributed cache pre-fill (-0.31 BPB) and order-adaptive entropy gating (-0.18 BPB).
nice 🔥🔥🔥🔥
Summary
val_bpb: 0.4374 (3-seed mean, std 0.0003) | ~15.99 MB | 8xH100 SXM | 560s train + ~330s eval
Major update from our initial 0.6567 submission. Two key innovations on top of the community n-gram framework:
Distributed cache pre-fill (-0.31 BPB): each GPU rank pre-populates 15-gram hash tables with ALL preceding token positions via vectorized numpy before scoring begins. Makes 8-GPU eval mathematically identical to single-GPU sequential. No NCCL needed.
Order-adaptive entropy gating (-0.18 BPB): per-order entropy thresholds — 15-gram matches trusted aggressively (center=2.5), bigrams only when model is confused (center=4.5). Continuous sigmoid interpolation across all 14 orders. Inspired by @travispchen's per-order thresholds in PR Record: Order-Adaptive Entropy Gating + BackoffNgramMixer (val_bpb=0.5466) #798.
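To make the gating concrete, here is a minimal sketch of what per-order sigmoid gates with linearly interpolated centers could look like. This is an illustration, not the PR's actual code: the function names, the `sharpness` parameter, and the linear interpolation between the two stated centers (4.5 for bigrams, 2.5 for 15-grams) are assumptions on my part; only the center values and the "gate opens as model entropy rises" behavior come from the description above.

```python
import numpy as np

def gate_centers(min_order=2, max_order=15, lo_center=4.5, hi_center=2.5):
    """Interpolate the sigmoid center linearly across n-gram orders:
    bigrams are conservative (center=4.5, only fire when the model is
    confused), 15-grams are aggressive (center=2.5). Assumed scheme."""
    orders = np.arange(min_order, max_order + 1)
    frac = (orders - min_order) / (max_order - min_order)
    return orders, lo_center + frac * (hi_center - lo_center)

def gate_weight(model_entropy, center, sharpness=2.0):
    """Continuous sigmoid gate: trust the n-gram prediction more as the
    base model's next-token entropy rises past this order's center.
    `sharpness` is a hypothetical tuning knob, not from the PR."""
    return 1.0 / (1.0 + np.exp(-sharpness * (model_entropy - center)))
```

At a model entropy of 3.0 nats, the 15-gram gate would already be mostly open while the bigram gate stays nearly closed, which matches the "trust long matches aggressively, short matches only under confusion" intuition.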
3-seed results
Ablation
Why distributed pre-fill matters
Without pre-fill, ranks 1-7 start with empty n-gram caches during multi-GPU sliding window eval. They never see the tokens before their assigned window. This costs ~0.31 BPB (0.96 → 0.65). Pre-fill fixes this by having each rank hash all preceding positions into its tables before scoring — identical results to sequential eval, just parallelized.
Architecture
EBLS (Empirical Bayes Layer Sharing): 3 shared transformer blocks looped 3x + 2 unique = 11 layers. Per-virtual-layer LoRA rank 8. The Bayesian intuition: shared blocks are the prior, LoRA deviations are the empirical Bayes corrections.
512d, 8 heads, 4 KV heads (GQA), MLP 3x LeakyReLU(0.5)^2, XSA-all(11), Val-GPTQ int6 + LZMA preset 9.
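For readers unfamiliar with layer sharing, here is a minimal numpy sketch of the EBLS layer schedule: 3 shared blocks looped 3x plus 2 unique blocks gives 11 virtual layers, each with its own rank-8 LoRA correction on top of the shared weights. Everything here (the residual form, the `tanh` stand-in for a real transformer block, the init scales) is an assumed simplification to show the weight-sharing structure only.

```python
import numpy as np

D, RANK = 512, 8  # model width and LoRA rank from the PR description
rng = np.random.default_rng(0)

# 3 shared blocks (the "prior"), reused across 3 loops = 9 virtual layers,
# plus 2 unique blocks = 11 layers total.
shared = [rng.normal(0, 0.02, (D, D)) for _ in range(3)]
unique = [rng.normal(0, 0.02, (D, D)) for _ in range(2)]
schedule = [("shared", i % 3) for i in range(9)] + [("unique", i) for i in range(2)]

# Per-virtual-layer LoRA: a rank-8 "empirical Bayes correction" per layer,
# so reused blocks can still specialize at each depth.
lora = [(rng.normal(0, 0.02, (D, RANK)), rng.normal(0, 0.02, (RANK, D)))
        for _ in range(len(schedule))]

def forward(x):
    for v, (kind, idx) in enumerate(schedule):
        W = shared[idx] if kind == "shared" else unique[idx]
        A, B = lora[v]
        # Residual update; tanh stands in for a full attention+MLP block.
        x = x + np.tanh(x @ (W + A @ B))
    return x
```

Only the schedule and LoRA bookkeeping are the point here: the same base matrix appears at several depths, but each depth applies a different low-rank deviation.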
Compliance
Legality
Pre-fill produces identical n-gram tables as single-GPU sequential eval — it's an implementation optimization, not a new information source. @valerio-oai confirmed on PR #659 that n-gram caching "is not illegal" and suggested entropy-based gating as the legal path. We welcome discussion on this.
Credits
This builds on a lot of community work and I want to make sure everyone gets credit:
Our novel contributions: distributed cache pre-fill, 15-gram extension, order-adaptive entropy gating with continuous interpolation, EBLS architecture.
Feedback, questions, and corrections welcome — happy to discuss any aspect of the approach.