
Record: 0.4374 BPB — Distributed Prefill + Order-Adaptive 15-Gram + EBLS#796

Open

Robby955 wants to merge 2 commits into openai:main from Robby955:record/prefill-7gram-ebls-0.6567

Conversation

Robby955 commented Mar 26, 2026

Summary

val_bpb: 0.4374 (3-seed mean, std 0.0003) | ~15.99 MB | 8xH100 SXM | 560s train + ~330s eval

Major update from our initial 0.6567 submission. Two key innovations on top of the community n-gram framework:

  1. Distributed cache pre-fill (-0.31 BPB): each GPU rank pre-populates 15-gram hash tables with ALL preceding token positions via vectorized numpy before scoring begins. Makes 8-GPU eval mathematically identical to single-GPU sequential. No NCCL needed.

  2. Order-adaptive entropy gating (-0.18 BPB): per-order entropy thresholds. 15-gram matches are trusted aggressively (center=2.5), while bigrams are used only when the model is confused (center=4.5), with continuous sigmoid interpolation across all 14 orders. Inspired by @travispchen's per-order thresholds in PR #798 (Record: Order-Adaptive Entropy Gating + BackoffNgramMixer, val_bpb=0.5466).
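The continuous interpolation in (2) can be sketched as a sigmoid gate whose center moves linearly with n-gram order. The constants below (centers 2.5/4.5, sharpness 2.0) echo the numbers quoted above, but the sharpness and the exact interpolation used in the script are assumptions:

```python
import numpy as np

def order_gate_weight(entropy: float, order: int,
                      lo_center: float = 4.5, hi_center: float = 2.5,
                      min_order: int = 2, max_order: int = 15,
                      sharpness: float = 2.0) -> float:
    """Sigmoid gate deciding how much to trust an `order`-gram match.

    The threshold center interpolates linearly from `lo_center` (bigrams,
    trusted only when the neural model is confused, i.e. high entropy)
    down to `hi_center` (15-grams, trusted aggressively). All constants
    are illustrative, not the PR's exact values.
    """
    frac = (order - min_order) / (max_order - min_order)
    center = lo_center + frac * (hi_center - lo_center)
    # Gate opens as model entropy exceeds the order-specific center.
    return float(1.0 / (1.0 + np.exp(-sharpness * (entropy - center))))
```

At a model entropy of 3.0 nats, a 15-gram match (center 2.5) gates mostly open while a bigram match (center 4.5) stays mostly closed, which is the asymmetry the bullet describes.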

3-seed results

| Seed | Sliding + 15-gram BPB | Artifact bytes |
|------|-----------------------|----------------|
| 1337 | 0.43706735 | 15,994,785 |
| 2024 | 0.43738561 | 15,949,881 |
| 2025 | 0.43768394 | 15,992,965 |
| Mean | 0.4374 (std 0.0003) | |

Ablation

| Config | BPB | Delta |
|--------|-----|-------|
| Neural only (EBLS + GPTQ) | 1.1425 | |
| + 7-gram prefill, uniform threshold | 0.6565 | -0.486 |
| + extend to 15-gram | 0.6189 | -0.038 |
| + order-adaptive gating | 0.4374 | -0.182 |

Why distributed pre-fill matters

Without pre-fill, ranks 1-7 start with empty n-gram caches during multi-GPU sliding window eval. They never see the tokens before their assigned window. This costs ~0.31 BPB (0.96 → 0.65). Pre-fill fixes this by having each rank hash all preceding positions into its tables before scoring — identical results to sequential eval, just parallelized.
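A minimal numpy sketch of the per-rank pre-fill idea: vectorize over all positions before the rank's window, hash each context, and count next tokens. The polynomial rolling hash, bucket count, and dict-of-dicts table layout are assumptions for illustration; the script's actual hashing scheme may differ:

```python
import numpy as np

def prefill_ngram_cache(tokens: np.ndarray, start: int, order: int,
                        num_buckets: int = 1 << 20) -> dict:
    """Hash every `order`-gram context that ends before `start` into a
    (bucket -> {next_token: count}) table, vectorized with numpy.

    A rank assigned the window beginning at `start` calls this once
    before scoring, so its tables match sequential single-GPU eval.
    """
    counts: dict = {}
    if start <= order:
        return counts
    # Strided view of all length-`order` contexts within tokens[:start].
    ctx = np.lib.stride_tricks.sliding_window_view(tokens[:start], order)
    contexts = ctx[:-1]           # contexts whose next token is known
    nxt = tokens[order:start]     # the token following each context
    # Polynomial rolling hash per context; uint64 wraparound is deliberate.
    weights = 31 ** np.arange(order, dtype=np.uint64)
    buckets = (contexts.astype(np.uint64) @ weights) % num_buckets
    for b, t in zip(buckets.tolist(), nxt.tolist()):
        counts.setdefault(b, {}).setdefault(t, 0)
        counts[b][t] += 1
    return counts
```

Because every rank hashes exactly the positions a sequential pass would have seen before its window, the resulting tables are identical to single-GPU eval, which is the legality argument made below.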

Architecture

EBLS (Empirical Bayes Layer Sharing): 3 shared transformer blocks looped 3x + 2 unique = 11 layers. Per-virtual-layer LoRA rank 8. The Bayesian intuition: shared blocks are the prior, LoRA deviations are the empirical Bayes corrections.

512d, 8 heads, 4 KV heads (GQA), MLP 3x LeakyReLU(0.5)^2, XSA-all(11), Val-GPTQ int6 + LZMA preset 9.
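A toy sketch of the EBLS looping scheme: the same 3 shared blocks run 3 times (9 virtual layers), each virtual pass adding its own low-rank LoRA correction, followed by 2 unique layers, for 11 layers total. Blocks are reduced here to single tanh-activated linear maps, and all names and shapes are illustrative, not the actual architecture code:

```python
import numpy as np

def ebls_forward(x, shared, unique, loras, loops=3):
    """Forward through an EBLS-style stack.

    `shared`: list of weight matrices reused on every loop (the prior).
    `loras`: one (A, B) low-rank pair per *virtual* shared layer
             (len(shared) * loops pairs), the empirical Bayes correction.
    `unique`: weight matrices applied once at the end.
    """
    v = 0  # virtual-layer index; each pass gets its own LoRA pair
    for _ in range(loops):
        for W in shared:
            A, B = loras[v]
            x = np.tanh(x @ (W + A @ B))  # shared prior + per-pass delta
            v += 1
    for W in unique:  # unique layers run once, after the looped blocks
        x = np.tanh(x @ W)
    return x
```

With 3 shared blocks looped 3x, only 3 full weight matrices plus 9 rank-8 deltas and 2 unique blocks need to be stored, which is where the artifact-size savings of layer sharing come from.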

Compliance

  • Training: 560s on 8xH100 SXM (within 600s)
  • Eval: ~330s on 8xH100 SXM (within 600s)
  • Artifacts under 16,000,000 bytes (max: 15,994,785)
  • Script: 1,451 lines (under 1,500)
  • No TTT, no oracle, no training data at eval time
  • Cache strictly backward-looking — updated only after scoring
  • Pre-fill uses val_tokens[0..pos-1] only, no future data

Legality

Pre-fill produces identical n-gram tables as single-GPU sequential eval — it's an implementation optimization, not a new information source. @valerio-oai confirmed on PR #659 that n-gram caching "is not illegal" and suggested entropy-based gating as the legal path. We welcome discussion on this.

Credits

This builds on a lot of community work, and I want to make sure everyone gets credit.

Our novel contributions: distributed cache pre-fill, 15-gram extension, order-adaptive entropy gating with continuous interpolation, EBLS architecture.

Feedback, questions, and corrections welcome — happy to discuss any aspect of the approach.

3-seed validated: s1337=0.6565, s2024=0.6570, s2025=0.6565 (mean 0.6567, std 0.0003)
8xH100 SXM, 560s training + ~300s eval, all artifacts under 16MB.

Key innovation: distributed cache pre-fill using pure numpy.
Each GPU rank pre-populates n-gram hash tables with ALL preceding
token positions before scoring, producing results mathematically
identical to single-GPU sequential evaluation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ptive gating

3-seed validated (seeds 1337, 2024, 2025, std 0.0003).
Up from 0.6567 via two innovations: distributed cache pre-fill (-0.31 BPB)
and order-adaptive entropy gating (-0.18 BPB).
@Robby955 Robby955 changed the title Record: 0.6567 BPB — Prefill Cache + 7-Gram Entropy-Adaptive + EBLS Record: 0.4374 BPB — Distributed Prefill + Order-Adaptive 15-Gram + EBLS Mar 26, 2026
@hypery11

nice 🔥🔥🔥🔥

