Skip to content

Record: Packed N-gram + Two-Pass Dirichlet CTW — val_bpb 0.0830 (3-seed mean)#986

Open
sofiabod wants to merge 5 commits intoopenai:mainfrom
sofiabod:autoresearch/twopass
Open

Record: Packed N-gram + Two-Pass Dirichlet CTW — val_bpb 0.0830 (3-seed mean)#986
sofiabod wants to merge 5 commits intoopenai:mainfrom
sofiabod:autoresearch/twopass

Conversation

@sofiabod
Copy link
Copy Markdown

@sofiabod sofiabod commented Mar 27, 2026

Packed N-gram Artifact + Two-Pass Full Rescore + Hierarchical Dirichlet CTW

Headline

val_bpb = 0.0830 (3-seed mean, std = 0.00000001)

3-Seed Results

Seed val_bpb artifact_bytes train_time eval_time
42 0.08302574 5,758,349 300s + 106s build 437s
1337 0.08302574 5,759,863 300s + 106s build 441s
2024 0.08302575 5,758,130 300s + 106s build 438s
Mean 0.08302574
Std 0.00000001

Architecture

  • Neural model: 2-layer 128d GPT (vestigial — provides base probabilities only)
  • Packed N-gram artifact: Order 2-13 hash tables built from 80 training shards (10B tokens), stored as int32 counts in 128K buckets, zstd-compressed in artifact
  • Two-pass full rescore: Pass 1 scores all tokens with sliding window + builds full val cache. Pass 2 rescores ALL positions using the complete cache.
  • Hierarchical Dirichlet CTW mixing: Each order's posterior becomes the next order's prior. Concentration c=5.0. Based on Context Tree Weighting (Willems et al. 1995) / Dirichlet-Multinomial posterior predictive (Teh 2006).
  • Phrase cache: Variable-length suffix matching at probe lengths [48, 36, 28, 20, 16]

Key Innovations

  1. Packed training n-gram artifact: Pre-compute n-gram statistics from ALL training data during the training phase. Store compressed in the 16MB artifact. At eval start, cache is instantly warm with billions of observations.

  2. Two-pass full rescore: Eliminates cold-start degradation. Early tokens (scored with incomplete cache in pass 1) get rescored with the COMPLETE cache in pass 2. No second neural forward pass needed.

  3. Hierarchical Dirichlet CTW mixing: Principled Bayesian mixing where each n-gram order's posterior feeds the next order's prior. Replaces heuristic alpha with theoretically optimal mixing (8.9x better than linear interpolation per PR Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB #900's ablation).

  4. Ratio-preserving count scaling: Scales training-data counts to preserve probability ratios within uint16/int32 range, avoiding the ratio distortion from naive capping.

Legality

  • Score-first: pass 1 scores each window THEN updates cache
  • Two-pass: pass 2 uses cache built ONLY from pass-1 scored tokens (backward-looking)
  • Phrase cache uses only backward-looking already-scored tokens
  • Dirichlet concentration depends on model entropy only, not target token
  • No multi-epoch TTT over full val data
  • Artifact < 16,000,000 bytes (5.76 MB)
  • Train time < 600s (300s model + 106s cache build = 406s)
  • Eval time < 600s (437-441s)
  • Deterministic (same seed = same result, std = 0.00000001)

Credits

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant