Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# Two-Level Dirichlet Posterior + Per-Order OBCL + Phrase Cache

**val_bpb: 0.11556** (3-seed mean, std 0.0000057) | **~15.1 MB** | 8xH100 SXM

## Results (8xH100 80GB SXM, Rancho Cordova CA)

| Seed | Val BPB | Eval Time |
|------|---------|-----------|
| 1337 | 0.11555061 | 419s |
| 42 | 0.11556435 | 370s |
| 2025 | 0.11555875 | 359s |
| **Mean** | **0.11556 (std 0.0000057)** | |

## Architecture

EBLS: 3 shared transformer blocks looped 3x + 2 unique = 11 effective layers.
512d, 8 heads, 4 KV heads (GQA), MLP 3x with LeakyReLU(0.5)², per-virtual-layer LoRA rank 8.

## Key Techniques

- **Two-level Dirichlet smoothing** with per-order OBCL concentrations (50.0 for bigrams → 1.86 for 14-grams)
- **Phrase suffix matching** at probe lengths [20, 16] with Dirichlet concentration 1.0
- **15-gram backoff** (orders 2-15, 4M hash buckets)
- **Complementary training** (alpha=0.50, orders 2-5)
- **GPTQ int6 + LZMA** compression
- **EMA 0.997 + SWA** weight averaging
- **XSA** on all 11 layers

## Credits

Built on the shoulders of the community:
- @signalrush (PR #414) — GPTQ + EMA + warmdown foundation
- @Robby955 (PR #900) — Dirichlet smoothing, OBCL, phrase cache
- @himanshudongre (PR #846) — two-pass rescoring concept
- @deanbrr (PR #659) — original N-gram cache concept
- @newjordan (PR #674) — first legal implementation
- @pentxayc (PR #803) — complementary training

## Run Command

```bash
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 MAX_WALLCLOCK_SECONDS=560 XSA_LAST_N=11 \
WARMDOWN_ITERS=4000 CLIP_RANGE=31 COMPRESSOR=lzma \
NUM_KV_HEADS=4 EVAL_STRIDE=64 \
GPTQ_ENABLED=1 GPTQ_CALIB_BATCHES=64 GPTQ_CALIB_SOURCE=val \
GPTQ_BLOCK_SIZE=128 SWA_ENABLED=1 LATE_QAT_THRESHOLD=0.15 \
COMP_ENABLED=1 COMP_ALPHA=0.50 COMP_ORDER=5 COMP_WARMUP=200 COMP_MIN_COUNT=3 \
NGRAM_CACHE=1 NGRAM_ORDER=15 NGRAM_MIN_ORDER=2 \
NGRAM_BUCKETS=4194304 NGRAM_DIRICHLET=1 NGRAM_CONCENTRATION=5.0 \
NGRAM_TEMPERATURE=1.0 \
NGRAM_PER_ORDER_CONC="50.0,50.0,6.95,2.98,2.05,2.05,2.05,1.86,1.86,1.86,1.86,1.86,1.86,1.86" \
PHRASE_CACHE=1 PHRASE_BUCKETS=1048576 PHRASE_PROBE_LENGTHS=20,16 \
PHRASE_DIRICHLET=1 PHRASE_CONCENTRATION=1.0 PHRASE_MIN_COUNT=1 \
NCCL_TIMEOUT=3600 SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
{
"name": "Two-Level Dirichlet Posterior + Per-Order OBCL + Phrase Cache (3-seed)",
"val_bpb": 0.1156,
"bytes_total": 15077877,
"blurb": "Two-level Dirichlet-Multinomial posterior mixing with per-order OBCL concentrations (50.0 for bigrams to 1.86 for 14-grams) and phrase suffix matching (probes at 20,16 tokens). 15-gram backoff with neural base measure. Complementary training (alpha=0.50). EBLS architecture (3 shared x 3 loops + 2 unique = 11L). GPTQ int6 + LZMA. 3-seed mean: 0.11556 (std 0.0000057). Based on techniques from PRs #414, #900, #846.",
"author": "Nathan Maine",
"github_id": "NathanMaine",
"date": "2026-03-27"
}
1,568 changes: 1,568 additions & 0 deletions records/track_10min_16mb/2026-03-27_Dirichlet_Ngram_Phrase_Cache/train_gpt.py

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
=== Parameter Golf v2 — N-gram Cache Submission ===
Seed: 1337
N-gram: orders 2-15, Dirichlet=1
Phrase: probes=20,16, Dirichlet=1
Complementary: alpha=0.50, order=5

W0327 04:23:51.748000 2469 torch/distributed/run.py:803]
W0327 04:23:51.748000 2469 torch/distributed/run.py:803] *****************************************
W0327 04:23:51.748000 2469 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 04:23:51.748000 2469 torch/distributed/run.py:803] *****************************************
logs/1375010a-c533-46df-9444-a96d2f458e76.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
mixed_precision: clip_range=31 (int6) compressor=lzma
model_params:27124848
XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
VRL:True active_layers:[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size:8 grad_accum_steps:1 sdp:flash=True
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:560.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
comp_train:enabled orders=2-5 alpha=0.5 warmup=200
step:0/20000 val_loss:6.9301 val_bpb:4.1044 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9313 train_time:172ms step_avg:172.18ms
step:2/20000 train_loss:8.7049 train_time:289ms step_avg:144.47ms
step:3/20000 train_loss:7.8658 train_time:402ms step_avg:134.10ms
step:4/20000 train_loss:7.1343 train_time:515ms step_avg:128.67ms
step:5/20000 train_loss:6.9742 train_time:628ms step_avg:125.64ms
step:6/20000 train_loss:7.0034 train_time:751ms step_avg:125.09ms
step:7/20000 train_loss:6.9378 train_time:862ms step_avg:123.15ms
step:8/20000 train_loss:6.7761 train_time:974ms step_avg:121.71ms
step:9/20000 train_loss:6.4129 train_time:1085ms step_avg:120.57ms
step:10/20000 train_loss:6.0878 train_time:1196ms step_avg:119.56ms
step:500/20000 train_loss:1.7359 train_time:67826ms step_avg:135.65ms
step:1000/20000 train_loss:1.6029 train_time:141960ms step_avg:141.96ms
step:1500/20000 train_loss:1.5659 train_time:215456ms step_avg:143.64ms
step:2000/20000 train_loss:1.4602 train_time:289461ms step_avg:144.73ms
step:2500/20000 train_loss:1.5130 train_time:363344ms step_avg:145.34ms
step:3000/20000 train_loss:1.5011 train_time:436909ms step_avg:145.64ms
swa:start step:3100
late_qat:enabled step:3248 scale:0.1499
step:3500/20000 train_loss:1.4806 train_time:510217ms step_avg:145.78ms
step:3827/20000 val_loss:1.9690 val_bpb:1.1661 train_time:560026ms step_avg:146.34ms
stopping_early: wallclock_cap train_time:560026ms step:3827/20000
peak memory allocated: 22534 MiB reserved: 22604 MiB
swa:applying 15 snapshots, blending with EMA (0.50/0.50)
DIAGNOSTIC post_ema val_loss:1.9689 val_bpb:1.1661 eval_time:2097ms
Serialized model: 106449565 bytes Code: 87629 bytes
gptq:collecting hessians batches=64 source=val
gptq:hessians collected layers=68 time=11.1s
gptq:pre_prune artifact=14990248 target=15907371
Saved quantized model to final_int6_model.pt
Serialized model int63+lzma: 14990248 bytes
Total submission size: 15077877 bytes
final_int6_roundtrip val_loss:1.9773 val_bpb:1.1711 exact:1.17105855 eval_time:34030ms
ngram_cache:enabled orders=2-15 dirichlet=True concentration=5.0 temperature=1.0 entropy=True min_count=2 buckets=4194304 order_mults=none alpha_max=0.95
phrase_cache:enabled probes=[20, 16] dirichlet=True conc=1.0 alpha=0.9 buckets=1048576
ngram_prefill:rank1 pre-filled 7754688 positions in 19.0s
phrase_prefill:rank1 pre-filled 7754688 positions in 3.1s
ngram_prefill:rank2 pre-filled 15507392 positions in 48.7s
ngram_prefill:rank3 pre-filled 23260096 positions in 56.0s
phrase_prefill:rank2 pre-filled 15507392 positions in 7.5s
phrase_prefill:rank3 pre-filled 23260096 positions in 10.4s
ngram_prefill:rank4 pre-filled 31012800 positions in 99.6s
phrase_prefill:rank4 pre-filled 31012800 positions in 14.5s
ngram_prefill:rank5 pre-filled 38765504 positions in 126.1s
phrase_prefill:rank5 pre-filled 38765504 positions in 17.8s
ngram_prefill:rank6 pre-filled 46518208 positions in 147.0s
phrase_prefill:rank6 pre-filled 46518208 positions in 21.3s
ngram_prefill:rank7 pre-filled 54270912 positions in 179.8s
phrase_prefill:rank7 pre-filled 54270912 positions in 25.7s
final_int6_sliding_window val_loss:0.1951 val_bpb:0.1156 exact:0.11555061 stride:64 eval_time:418945ms
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
=== Parameter Golf v2 — N-gram Cache Submission ===
Seed: 2025
N-gram: orders 2-15, Dirichlet=1
Phrase: probes=20,16, Dirichlet=1
Complementary: alpha=0.50, order=5

W0327 05:08:36.831000 75454 torch/distributed/run.py:803]
W0327 05:08:36.831000 75454 torch/distributed/run.py:803] *****************************************
W0327 05:08:36.831000 75454 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 05:08:36.831000 75454 torch/distributed/run.py:803] *****************************************
logs/a5507699-4bee-4690-a084-705102c8d096.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
mixed_precision: clip_range=31 (int6) compressor=lzma
model_params:27124848
XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
VRL:True active_layers:[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size:8 grad_accum_steps:1 sdp:flash=True
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:560.000
seed:2025
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
comp_train:enabled orders=2-5 alpha=0.5 warmup=200
step:0/20000 val_loss:6.9306 val_bpb:4.1047 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9326 train_time:171ms step_avg:170.93ms
step:2/20000 train_loss:8.8346 train_time:287ms step_avg:143.36ms
step:3/20000 train_loss:7.9183 train_time:406ms step_avg:135.18ms
step:4/20000 train_loss:7.0638 train_time:522ms step_avg:130.55ms
step:5/20000 train_loss:7.0104 train_time:635ms step_avg:126.93ms
step:6/20000 train_loss:6.9854 train_time:748ms step_avg:124.73ms
step:7/20000 train_loss:6.7751 train_time:861ms step_avg:123.02ms
step:8/20000 train_loss:6.6528 train_time:975ms step_avg:121.82ms
step:9/20000 train_loss:6.3819 train_time:1092ms step_avg:121.29ms
step:10/20000 train_loss:6.0564 train_time:1206ms step_avg:120.61ms
step:500/20000 train_loss:1.7383 train_time:68843ms step_avg:137.69ms
step:1000/20000 train_loss:1.6012 train_time:143398ms step_avg:143.40ms
step:1500/20000 train_loss:1.5667 train_time:217428ms step_avg:144.95ms
step:2000/20000 train_loss:1.4609 train_time:292012ms step_avg:146.01ms
step:2500/20000 train_loss:1.5132 train_time:365560ms step_avg:146.22ms
step:3000/20000 train_loss:1.5019 train_time:439885ms step_avg:146.63ms
swa:start step:3050
late_qat:enabled step:3204 scale:0.1499
step:3500/20000 train_loss:1.4818 train_time:515064ms step_avg:147.16ms
step:3809/20000 val_loss:1.9709 val_bpb:1.1673 train_time:560029ms step_avg:147.03ms
stopping_early: wallclock_cap train_time:560029ms step:3809/20000
peak memory allocated: 22528 MiB reserved: 22568 MiB
swa:applying 16 snapshots, blending with EMA (0.50/0.50)
DIAGNOSTIC post_ema val_loss:1.9709 val_bpb:1.1673 eval_time:2094ms
Serialized model: 106449565 bytes Code: 87629 bytes
gptq:collecting hessians batches=64 source=val
gptq:hessians collected layers=68 time=10.8s
gptq:pre_prune artifact=15245452 target=15907371
Saved quantized model to final_int6_model.pt
Serialized model int63+lzma: 15245452 bytes
Total submission size: 15333081 bytes
final_int6_roundtrip val_loss:1.9795 val_bpb:1.1723 exact:1.17234936 eval_time:6449ms
ngram_cache:enabled orders=2-15 dirichlet=True concentration=5.0 temperature=1.0 entropy=True min_count=2 buckets=4194304 order_mults=none alpha_max=0.95
phrase_cache:enabled probes=[20, 16] dirichlet=True conc=1.0 alpha=0.9 buckets=1048576
ngram_prefill:rank1 pre-filled 7754688 positions in 20.3s
phrase_prefill:rank1 pre-filled 7754688 positions in 3.2s
ngram_prefill:rank2 pre-filled 15507392 positions in 46.0s
phrase_prefill:rank2 pre-filled 15507392 positions in 6.9s
ngram_prefill:rank3 pre-filled 23260096 positions in 73.0s
phrase_prefill:rank3 pre-filled 23260096 positions in 10.8s
ngram_prefill:rank4 pre-filled 31012800 positions in 94.8s
ngram_prefill:rank5 pre-filled 38765504 positions in 99.7s
phrase_prefill:rank4 pre-filled 31012800 positions in 14.4s
phrase_prefill:rank5 pre-filled 38765504 positions in 17.8s
ngram_prefill:rank6 pre-filled 46518208 positions in 132.6s
ngram_prefill:rank7 pre-filled 54270912 positions in 139.0s
phrase_prefill:rank6 pre-filled 46518208 positions in 21.3s
phrase_prefill:rank7 pre-filled 54270912 positions in 24.6s
final_int6_sliding_window val_loss:0.1951 val_bpb:0.1156 exact:0.11555875 stride:64 eval_time:359137ms
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
=== Parameter Golf v2 — N-gram Cache Submission ===
Seed: 42
N-gram: orders 2-15, Dirichlet=1
Phrase: probes=20,16, Dirichlet=1
Complementary: alpha=0.50, order=5

W0327 04:47:15.636000 74537 torch/distributed/run.py:803]
W0327 04:47:15.636000 74537 torch/distributed/run.py:803] *****************************************
W0327 04:47:15.636000 74537 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 04:47:15.636000 74537 torch/distributed/run.py:803] *****************************************
logs/2546d581-ae68-43b6-8dc3-3ae0e4280c14.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
mixed_precision: clip_range=31 (int6) compressor=lzma
model_params:27124848
XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
VRL:True active_layers:[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size:8 grad_accum_steps:1 sdp:flash=True
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:560.000
seed:42
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
comp_train:enabled orders=2-5 alpha=0.5 warmup=200
step:0/20000 val_loss:6.9301 val_bpb:4.1044 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9323 train_time:167ms step_avg:166.95ms
step:2/20000 train_loss:8.6971 train_time:282ms step_avg:141.07ms
step:3/20000 train_loss:7.8409 train_time:398ms step_avg:132.79ms
step:4/20000 train_loss:7.0864 train_time:511ms step_avg:127.84ms
step:5/20000 train_loss:6.9765 train_time:625ms step_avg:124.92ms
step:6/20000 train_loss:6.9884 train_time:738ms step_avg:122.94ms
step:7/20000 train_loss:6.9120 train_time:854ms step_avg:121.97ms
step:8/20000 train_loss:6.7496 train_time:966ms step_avg:120.73ms
step:9/20000 train_loss:6.4399 train_time:1079ms step_avg:119.86ms
step:10/20000 train_loss:6.1220 train_time:1190ms step_avg:118.95ms
step:500/20000 train_loss:1.7425 train_time:68801ms step_avg:137.60ms
step:1000/20000 train_loss:1.6038 train_time:143170ms step_avg:143.17ms
step:1500/20000 train_loss:1.5671 train_time:217569ms step_avg:145.05ms
step:2000/20000 train_loss:1.4598 train_time:292781ms step_avg:146.39ms
step:2500/20000 train_loss:1.5127 train_time:367300ms step_avg:146.92ms
swa:start step:3000
step:3000/20000 train_loss:1.5014 train_time:442248ms step_avg:147.42ms
late_qat:enabled step:3179 scale:0.1496
step:3500/20000 train_loss:1.4821 train_time:518015ms step_avg:148.00ms
step:3784/20000 val_loss:1.9722 val_bpb:1.1681 train_time:560072ms step_avg:148.01ms
stopping_early: wallclock_cap train_time:560072ms step:3784/20000
peak memory allocated: 22528 MiB reserved: 22568 MiB
swa:applying 16 snapshots, blending with EMA (0.50/0.50)
DIAGNOSTIC post_ema val_loss:1.9723 val_bpb:1.1681 eval_time:2096ms
Serialized model: 106449565 bytes Code: 87629 bytes
gptq:collecting hessians batches=64 source=val
gptq:hessians collected layers=68 time=11.0s
gptq:pre_prune artifact=15262616 target=15907371
Saved quantized model to final_int6_model.pt
Serialized model int63+lzma: 15262616 bytes
Total submission size: 15350245 bytes
final_int6_roundtrip val_loss:1.9809 val_bpb:1.1732 exact:1.17317850 eval_time:6489ms
ngram_cache:enabled orders=2-15 dirichlet=True concentration=5.0 temperature=1.0 entropy=True min_count=2 buckets=4194304 order_mults=none alpha_max=0.95
phrase_cache:enabled probes=[20, 16] dirichlet=True conc=1.0 alpha=0.9 buckets=1048576
ngram_prefill:rank1 pre-filled 7754688 positions in 22.2s
phrase_prefill:rank1 pre-filled 7754688 positions in 3.2s
ngram_prefill:rank2 pre-filled 15507392 positions in 42.1s
phrase_prefill:rank2 pre-filled 15507392 positions in 6.7s
ngram_prefill:rank3 pre-filled 23260096 positions in 64.3s
phrase_prefill:rank3 pre-filled 23260096 positions in 10.7s
ngram_prefill:rank4 pre-filled 31012800 positions in 90.3s
phrase_prefill:rank4 pre-filled 31012800 positions in 14.8s
ngram_prefill:rank5 pre-filled 38765504 positions in 112.6s
ngram_prefill:rank6 pre-filled 46518208 positions in 122.0s
phrase_prefill:rank5 pre-filled 38765504 positions in 18.1s
phrase_prefill:rank6 pre-filled 46518208 positions in 21.7s
ngram_prefill:rank7 pre-filled 54270912 positions in 152.4s
phrase_prefill:rank7 pre-filled 54270912 positions in 24.7s
final_int6_sliding_window val_loss:0.1951 val_bpb:0.1156 exact:0.11556435 stride:64 eval_time:370090ms
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
# Order-20 Dirichlet Posterior + Phrase Cache

**val_bpb: 0.11545** (3-seed mean, std 0.0000010) | **~15.1 MB** | 8xH100 SXM

Extends n-gram backoff from order 15 to order 20, improving over our PR #948 (0.11556 BPB).

## Results (8xH100 80GB SXM, Montréal CA, 747 TFLOPS)

| Seed | Val BPB | Eval Time |
|------|---------|-----------|
| 1337 | 0.11544435 | 459s |
| 42 | 0.11546433 | 435s |
| 2025 | 0.11544736 | 438s |
| **Mean** | **0.11545 (std 0.0000010)** | |

## What changed from PR #948

- `NGRAM_ORDER=20` (was 15)
- Added 5 more per-order concentrations: all 1.86 (matching the pattern for high-order matches)
- Everything else identical

## Ablation (1xH100, 200 steps, Kansas City MO)

| Config | BPB | Delta |
|--------|-----|-------|
| Order 15 (baseline) | 0.11906 | — |
| **Order 20** | **0.11873** | **-0.00033** |
| Two-pass rescore | 0.11906 | 0 |
| Int5 quantization | 0.11906 | 0 |
| Comp alpha=0.30 | 0.11906 | 0 |

Order 20 was the only ablation that showed improvement.

## Credits

Same as PR #948. Built on @Robby955 (PR #900), @signalrush (PR #414), @himanshudongre (PR #846), @deanbrr (PR #659), @newjordan (PR #674), @pentxayc (PR #803).

## Run Command

```bash
NGRAM_ORDER=20 \
NGRAM_PER_ORDER_CONC="50.0,50.0,6.95,2.98,2.05,2.05,2.05,1.86,1.86,1.86,1.86,1.86,1.86,1.86,1.86,1.86,1.86,1.86,1.86" \
# ... all other params same as PR #948
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
Loading