Record: Packed N-gram + Two-Pass Dirichlet CTW — val_bpb 0.0830 (3-seed mean) by sofiabod · Pull Request #986 · openai/parameter-golf

sofiabod · 2026-03-27T21:32:05Z

Packed N-gram Artifact + Two-Pass Full Rescore + Hierarchical Dirichlet CTW

Headline

val_bpb = 0.0830 (3-seed mean, std = 0.00000001)

3-Seed Results

Seed	val_bpb	artifact_bytes	train_time	eval_time
42	0.08302574	5,758,349	300s + 106s build	437s
1337	0.08302574	5,759,863	300s + 106s build	441s
2024	0.08302575	5,758,130	300s + 106s build	438s
Mean	0.08302574
Std	0.00000001

Architecture

Neural model: 2-layer 128d GPT (vestigial — provides base probabilities only)
Packed N-gram artifact: Order 2-13 hash tables built from 80 training shards (10B tokens), stored as int32 counts in 128K buckets, zstd-compressed in artifact
Two-pass full rescore: Pass 1 scores all tokens with sliding window + builds full val cache. Pass 2 rescores ALL positions using the complete cache.
Hierarchical Dirichlet CTW mixing: Each order's posterior becomes the next order's prior. Concentration c=5.0. Based on Context Tree Weighting (Willems et al. 1995) / Dirichlet-Multinomial posterior predictive (Teh 2006).
Phrase cache: Variable-length suffix matching at probe lengths [48, 36, 28, 20, 16]

Key Innovations

Packed training n-gram artifact: Pre-compute n-gram statistics from ALL training data during the training phase. Store compressed in the 16MB artifact. At eval start, cache is instantly warm with billions of observations.
Two-pass full rescore: Eliminates cold-start degradation. Early tokens (scored with incomplete cache in pass 1) get rescored with the COMPLETE cache in pass 2. No second neural forward pass needed.
Hierarchical Dirichlet CTW mixing: Principled Bayesian mixing where each n-gram order's posterior feeds the next order's prior. Replaces heuristic alpha with theoretically optimal mixing (8.9x better than linear interpolation per PR Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB #900's ablation).
Ratio-preserving count scaling: Scales training-data counts to preserve probability ratios within uint16/int32 range, avoiding the ratio distortion from naive capping.

Legality

Score-first: pass 1 scores each window THEN updates cache
Two-pass: pass 2 uses cache built ONLY from pass-1 scored tokens (backward-looking)
Phrase cache uses only backward-looking already-scored tokens
Dirichlet concentration depends on model entropy only, not target token
No multi-epoch TTT over full val data
Artifact < 16,000,000 bytes (5.76 MB)
Train time < 600s (300s model + 106s cache build = 406s)
Eval time < 600s (437-441s)
Deterministic (same seed = same result, std = 0.00000001)

Credits

PR Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB #900: Dirichlet posterior mixing theory and ablation proving 8.9x superiority
PR Record: Compliance-First Packed Causal Memory + Dirichlet Mixing — val_bpb 0.01654407 (3-seed mean) #943: Packed causal n-gram memory concept and two-pass full rescore approach
PR Record: BROADSIDE — Full-Rescore N-gram Cache (val_bpb 0.0935) #870: Two-pass BROADSIDE rescoring architecture
PR Record: PhraseCache + OrderAdaptive N-gram + RegimeTracker — val_bpb 0.1003 (3-seed mean) #880: Variable-length phrase cache with probe lengths
PR Record: First Legal Sub-1.0 BPB — Multi-order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.9674, 3-seed) #727/Podracing II: Electric Bugaloo — 0.9625 BPB (3-seed mean, all sub-0.964) #753: Multi-order n-gram backoff with entropy-adaptive alpha (foundation)
PR Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233) #414: Base model architecture stack
Willems et al. (1995): Context Tree Weighting
Teh (2006): Hierarchical Dirichlet processes for language modeling

…ed mean)

sofiabod added 5 commits March 18, 2026 14:34

initial

45422a6

add modal launcher for 8xh100 training

f13c234

fix md + tests

7df4c4b

update read from upstream

ba158e7

Record: Packed N-gram + Two-Pass Dirichlet CTW — val_bpb 0.0830 (3-se…

1c70fa1

…ed mean)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: Packed N-gram + Two-Pass Dirichlet CTW — val_bpb 0.0830 (3-seed mean)#986

Record: Packed N-gram + Two-Pass Dirichlet CTW — val_bpb 0.0830 (3-seed mean)#986
sofiabod wants to merge 5 commits intoopenai:mainfrom
sofiabod:autoresearch/twopass

sofiabod commented Mar 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sofiabod commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Packed N-gram Artifact + Two-Pass Full Rescore + Hierarchical Dirichlet CTW

Headline

3-Seed Results

Architecture

Key Innovations

Legality

Credits

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sofiabod commented Mar 27, 2026 •

edited

Loading