# Order-16 Frozen N-gram Oracle + Learned Gate + TTT

**val_bpb: 0.02742 (3-seed mean, std 0.00003)**

## Results

| Seed | val_bpb |
|------|---------|
| 1337 | 0.02744 |
| 42 | 0.02739 |
| 2025 | 0.02744 |
| **Mean** | **0.02742** |

## Key Techniques

1. **Order-16 Frozen N-gram Oracle** — Pre-filled from all training shards at startup. 4M buckets, orders 2-16.
2. **Learned Multi-Expert Gate** — `nn.Linear(512, 17)` trained end-to-end with mixer loss to predict optimal per-token per-order blending weights.
3. **Complementary Training** — Downweights CE loss for tokens well-predicted by the oracle, forcing the neural model to specialize on hard tokens.
4. **Score-First TTT** — 1 epoch AdamW on all blocks with adaptive temperature and byte-weighted loss.
5. **11L 512d model** — MLP 3.5x, LeakyReLU(0.5)², XSA-all, EMA(0.997), SWA every 50 steps.
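The frozen oracle in (1) can be sketched as a hashed count table: each (order, context) pair maps to one of the 4M buckets, counts are filled once from training bytes, then frozen. The hash function (FNV-1a here), the smoothing, and the per-order table layout are illustrative assumptions, not the submission's actual code.

```python
import numpy as np

NUM_BUCKETS = 4_000_000      # bucket count from the write-up
MIN_ORDER, MAX_ORDER = 2, 16  # orders from the write-up
VOCAB = 256                   # byte-level vocabulary (assumption)

def bucket(context: bytes, order: int) -> int:
    """Hash the last `order` bytes of the context into a bucket index (FNV-1a)."""
    h = 1469598103934665603
    for b in context[-order:]:
        h = ((h ^ b) * 1099511628211) & 0xFFFFFFFFFFFFFFFF
    return (h ^ order) % NUM_BUCKETS

class NGramOracle:
    """Frozen count-based oracle: per-bucket next-byte counts, filled once, never updated."""
    def __init__(self):
        # one dict per order: bucket index -> next-byte count vector
        self.counts = [dict() for _ in range(MAX_ORDER - MIN_ORDER + 1)]

    def fill(self, data: bytes):
        """Accumulate next-byte counts for every order at every position."""
        for i in range(1, len(data)):
            for order in range(MIN_ORDER, min(MAX_ORDER, i) + 1):
                b = bucket(data[i - order:i], order)
                tbl = self.counts[order - MIN_ORDER]
                if b not in tbl:
                    tbl[b] = np.zeros(VOCAB, dtype=np.float64)
                tbl[b][data[i]] += 1

    def predict(self, context: bytes) -> np.ndarray:
        """Smoothed per-order next-byte distributions, shape (num_orders, VOCAB)."""
        out = np.full((MAX_ORDER - MIN_ORDER + 1, VOCAB), 1.0 / VOCAB)
        for order in range(MIN_ORDER, min(MAX_ORDER, len(context)) + 1):
            tbl = self.counts[order - MIN_ORDER]
            b = bucket(context, order)
            if b in tbl:
                c = tbl[b]
                out[order - MIN_ORDER] = (c + 1e-3) / (c.sum() + 1e-3 * VOCAB)
        return out
```

Unseen or too-short contexts fall back to a uniform row, so the gate downstream always receives one valid distribution per order.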
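Items (2) and (3) can be combined into one minimal sketch: a `nn.Linear(512, 17)` gate softmaxed over 17 expert distributions (how the 17 slots split between the neural model and the oracle orders is a guess), and a CE weight that shrinks for tokens the mixture already predicts well. The `alpha` exponent and the exact weighting scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, N_EXPERTS, VOCAB = 512, 17, 256  # 17 experts per the write-up; the split is assumed

class MixerGate(nn.Module):
    """Per-token softmax gate over expert next-byte distributions."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(D_MODEL, N_EXPERTS)

    def forward(self, h, expert_probs):
        # h: (B, T, D_MODEL) hidden states; expert_probs: (B, T, N_EXPERTS, VOCAB)
        w = F.softmax(self.gate(h), dim=-1)               # (B, T, N_EXPERTS)
        return (w.unsqueeze(-1) * expert_probs).sum(-2)   # (B, T, VOCAB)

def complementary_ce(mixed_probs, neural_logits, targets, alpha=2.0):
    """Downweight CE on tokens the mixture already predicts well (easy tokens)."""
    with torch.no_grad():
        p_true = mixed_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)
        weight = (1.0 - p_true).pow(alpha)  # near-certain tokens get near-zero weight
    ce = F.cross_entropy(neural_logits.flatten(0, 1), targets.flatten(), reduction="none")
    return (weight.flatten() * ce).mean()
```

Because the weight is computed under `no_grad`, the gate is trained only through the mixer loss, while the neural CE term is steered toward tokens the oracle cannot handle.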
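The TTT step in (4) might reduce to a loop like the one below: one AdamW epoch over the evaluation bytes, with a learnable temperature on the logits standing in for "adaptive temperature". The learning rate, block size, and temperature placement are guesses; the byte-weighted loss and the "score-first" scheduling of the actual submission are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TempScaled(nn.Module):
    """Wrap a model with a learnable temperature on its logits (one reading of 'adaptive temperature')."""
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.log_temp = nn.Parameter(torch.zeros(()))

    def forward(self, x):
        return self.model(x) / self.log_temp.exp()

def ttt_one_epoch(model, data, block=64, lr=1e-3):
    """One AdamW epoch of next-byte prediction over held-out bytes; hyperparameters are guesses."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    total, n = 0.0, 0
    for i in range(0, len(data) - block, block):
        x = data[i:i + block].unsqueeze(0)           # (1, T) input bytes
        y = data[i + 1:i + 1 + block].unsqueeze(0)   # (1, T) next-byte targets
        logits = model(x)                            # (1, T, 256)
        loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()
        total, n = total + loss.item(), n + 1
    return total / max(n, 1)
```

The write-up applies this to all blocks of the 11L model; the sketch works with any `(1, T) -> (1, T, 256)` byte model.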
## Submission Metadata

```json
{
  "author": "Tim Pietrusky",
  "github_id": "TimPietrusky",
  "name": "Order-16 Frozen N-gram Oracle + Learned Gate + Complementary Training + TTT",
  "blurb": "Order-16 n-gram oracle pre-filled from training data with learned per-token per-order mixing gate, complementary training (downweight easy tokens), and score-first TTT with adaptive temperature. Based on PR #925 architecture with NGRAM_MAX_ORDER=16.",
  "date": "2026-03-27T00:00:00Z",
  "val_bpb": 0.02742,
  "val_bpb_std": 0.00003,
  "hardware": "8xH100 SXM",
  "seeds": [1337, 42, 2025],
  "seed_results": {
    "1337": {"val_bpb": 0.02744},
    "42": {"val_bpb": 0.02739},
    "2025": {"val_bpb": 0.02744}
  }
}
```