Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB #900
Conversation
3-seed validated (std 0.000003): s1337: 0.11967683 (435s eval, 14.91MB), s2024: 0.11968156 (455s eval, 14.84MB), s2025: 0.11967545 (441s eval, 14.80MB).

Dirichlet-Multinomial posterior predictive applied at two levels:
- N-gram backoff (orders 2-15, c=5.0)
- Phrase suffix matching (probes=[20,16], c=2.0)

Ablation: removing Dirichlet from phrase mixing degrades BPB 8.9x.
….5e-6)
- Optimized phrase-level concentration from 2.0 to 1.0 via sweep
- Added phrase concentration landscape table (convex, minimum at 1.0)
- Expanded compression theory section (CTW connection, match-length scaling, OBCL decomposition)
- Updated 3-seed results: s1337=0.11807, s2024=0.11807, s2025=0.11806
- Longer matches need less smoothing: c* decreases from ~50 (bigrams) to 1.0 (phrases)
Log files now match claimed BPB (0.11807). All numbers are exact from verified pod runs, not approximations.
Per-order concentrations learned via Online Bayesian Concentration Learning (OBCL) range from 50.0 (bigrams) to 1.86 (14-grams). This improves BPB from 0.11807 to 0.11559 (-0.00248). 3-seed mean 0.11559, std 3.8e-6.
Moving the discussion here so as not to clutter the other thread. It's late where I live and I'm a bit tired, so I had Claude write my response for me: Your theoretical argument is correct — the Dirichlet-Multinomial posterior predictive produces a valid distribution over exact counts. But I implemented your exact formula (orders 2-15, per-order concentrations [50.0, 50.0, 6.95, 2.98, 2.05, 2.05, 2.05, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86, 1.86], phrase probes at 20 and 16, 4M n-gram buckets, 1M phrase buckets, phrase concentration 1.0) and computed the per-position sums of the predictive over the full vocabulary.
These should all be exactly 1.0, and they are not. The choice of prior (uniform vs neural softmax) doesn't affect whether the sum equals 1 — that depends entirely on whether the counts being summed are exact per-token counts. With 4M buckets, each of the 1024 per-token lookups can land in a bucket shared with colliding n-grams, so the retrieved counts do not sum to the context total. You can verify this yourself: at any position after warmup, run your full hierarchical Dirichlet update (all orders + phrase) for all 1024 vocab tokens instead of just the correct one, and print the sum. Me again: I just saw your arithmetic encoder. It uses exact counts via defaultdict(), which is different from what is implemented in this PR.
Another way to put it: in the discussion in #677 you mention that everything is properly normalized. The problem is that the requirement is not that the distribution is normalized over the hash buckets, but over all possible output tokens. Example: in the extreme case of only 1 cache bin (maximal collisions), every token maps to the same bucket, so the probability assigned to the i-th token is the same inflated bucket mass for every i, and summing over the vocabulary gives roughly the vocab size rather than 1. The issue becomes very clear if you then think about the forward pass and how you would now sample the next token: all tokens would receive the same score. TLDR: it's a bug in the evaluation.
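The single-bucket extreme can be reproduced in a few lines. This is an illustrative sketch, not the PR's code; bucket count, vocab size, context, and concentration are made-up values:

```python
from collections import defaultdict

V = 1024          # vocab size
B = 1             # extreme case: one hash bucket, i.e. maximal collisions
c = 1.0           # Dirichlet concentration (illustrative)

buckets = defaultdict(float)     # hashed counts
ctx_total = defaultdict(float)   # exact per-context totals

def h(ctx, tok):
    return hash((ctx, tok)) % B

# observe 100 tokens after one context
ctx = ("the", "quick")
for t in range(100):
    buckets[h(ctx, t % V)] += 1.0
    ctx_total[ctx] += 1.0

def predictive(ctx, tok):
    # intended Dirichlet-multinomial posterior predictive with a uniform
    # prior, but the numerator count comes from a shared hash bucket
    n = ctx_total[ctx]
    return (buckets[h(ctx, tok)] + c / V) / (n + c)

# partition sum over the full vocabulary: should be 1.0, is ~1014,
# because every token reads the same collided bucket count
Z = sum(predictive(ctx, y) for y in range(V))
print(Z)
```

With exact per-token counts in the numerator, `Z` would be exactly 1 by construction; the gap measures how much collided mass is being double-counted.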
Closing this PR. The unnormalized point-estimate scoring exploits hash collision artifacts; we confirmed this ourselves when implementing full-distribution normalization (partition functions averaged ~950 instead of 1.0 across the vocab). The Dirichlet-Multinomial framework is sound with exact counts, but hash table collisions invalidate the score in practice. Some findings from this work may be useful to others:
Pivoting to the neural track. Will write up the collision-concentration analysis separately.
Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.11559 BPB
val_bpb: 0.11559 (3-seed mean, std 3.8e-6) | ~14.9 MB | 8xH100 SXM
What changed from previous version (0.11807)
Per-order concentration learning via Online Bayesian Concentration Learning (OBCL). Instead of a single c=5.0 for all n-gram orders, each order gets its own concentration, learned from a posterior over a 50-point log-spaced grid on [0.5, 50.0]:
This 27x spread in optimal concentration across orders is explained by the exponential decrease in hash collision rate with increasing match length.
3-seed validation
Approach
Same Dirichlet-Multinomial formula at every level:
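The formula in question is presumably the standard Dirichlet-multinomial posterior predictive, p(y) = (n_y + c·π_y) / (N + c) with prior π and concentration c; a sketch to show that with exact counts it is normalized by construction (values below are illustrative):

```python
def dirichlet_multinomial_predictive(counts, prior, c):
    """Posterior predictive of a Dirichlet(c * prior) over token counts:
    p(y) = (counts[y] + c * prior[y]) / (sum(counts) + c).
    With exact per-token counts this sums to 1 over the vocabulary."""
    n = sum(counts)
    return [(counts[y] + c * prior[y]) / (n + c) for y in range(len(counts))]

V = 4  # toy vocab
p = dirichlet_multinomial_predictive([3, 1, 0, 0], [1 / V] * V, 2.0)
print(sum(p))   # 1.0 up to float rounding
```

The same function applies at both levels of the hierarchy; only the counts (n-gram vs phrase buckets) and the concentration c differ.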
Key ablations
All ablation deltas exceed 200 sigma (3-seed std 3.8e-6).
Compliance
Legality
N-gram caching ruled "directionally legal" by @valerio-oai (Issue #677). Single-pass, score-first, causal. We also maintain a separate neural-only submission (PR #734, 1.1198 BPB).
See README.md for full details, concentration landscapes, compression theory connection, and credits.