# Record: Fine-Grained N-gram Cache (val_bpb=0.2873)

## Summary

- **val_bpb: 0.2873** (3-seed mean, std 0.0001)
- Artifact: ~13.4MB (code 181KB + model 13.2MB)
- Training: 600s on 8xH100 SXM (~7,050 steps at 85ms/step)
- Eval: ~405s (GPTQ ~22s + N-gram ~390s)

## Key Innovation: Fine-Grained N-gram Chunk Updates

The single most impactful change: reducing `NGRAM_EVAL_CHUNK_TOKENS` from 1,000,000 to 65,536.

The N-gram backoff cache only updates **after** each chunk is fully scored. With 1M-token chunks, the first million validation tokens see an empty cache — losing enormous predictive power. With 65K-token chunks, the cache refreshes ~15x more frequently, giving each subsequent chunk a much richer set of n-gram statistics to draw from.

| Chunk Size | BPB | Delta |
|------------|-----|-------|
| 1,000,000 | 0.4572 | baseline |
| 65,536 | **0.2872** | **-0.170** |

This is purely an eval-time optimization — no training changes, no TTT, no additional compute.
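The score-first discipline can be sketched with toy stand-ins — a bigram-counting cache in place of the real order 2-9 backoff, and a tiny `CHUNK_TOKENS` for illustration. Names and classes here are illustrative, not the record's actual API:

```python
CHUNK_TOKENS = 4  # the record uses 65_536; tiny here for illustration


class ToyNgramCache:
    """Stand-in for the order 2-9 backoff cache: just counts bigrams
    (and, for simplicity, ignores bigrams that span chunk boundaries)."""

    def __init__(self):
        self.counts = {}
        self.tokens_seen = 0

    def update(self, chunk):
        for a, b in zip(chunk, chunk[1:]):
            self.counts[(a, b)] = self.counts.get((a, b), 0) + 1
        self.tokens_seen += len(chunk)


def score_chunks(tokens, cache):
    """Return, per chunk, how many tokens the cache had absorbed at the
    moment that chunk was scored — i.e. the cache is strictly
    backward-looking: score first, fold the chunk in afterwards."""
    seen_at_scoring = []
    for start in range(0, len(tokens), CHUNK_TOKENS):
        chunk = tokens[start:start + CHUNK_TOKENS]
        seen_at_scoring.append(cache.tokens_seen)  # score first ...
        cache.update(chunk)                        # ... update after
    return seen_at_scoring


# With 16 tokens and 4-token chunks, only the first chunk sees an empty
# cache; each later chunk sees progressively more statistics.
print(score_chunks(list(range(16)), ToyNgramCache()))  # -> [0, 4, 8, 12]
```

Shrinking the chunk size only changes how often step 2 runs; the invariant that no token is ever scored against statistics that include it is preserved.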

## 3-Seed Results

| Seed | BPB | Artifact bytes |
|------|-----|----------------|
| 1337 | **0.28725** | ~13.4MB |
| 42 | **0.28720** | ~13.4MB |
| 2024 | **0.28744** | ~13.4MB |
| **Mean** | **0.2873 (std 0.0001)** | |

## Architecture

11L 512d GQA 8/4, MLP 3.0x, XSA-4, LeakyReLU(0.9)², BigramHash(4096), GPTQ int5 + LZMA.

EMA(0.997) + SWA. Parallel Muon optimizer. Perplexity-sorted shard ordering.
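A minimal, framework-agnostic sketch of the EMA(0.997) shadow-weight update — only the 0.997 decay comes from this record; the dict-of-floats parameterization is illustrative, and SWA (plain checkpoint averaging) is omitted:

```python
EMA_DECAY = 0.997  # decay from the record; the rest of this sketch is illustrative


def ema_update(shadow, live):
    """One EMA step: shadow weights drift toward the live training weights."""
    for name, value in live.items():
        shadow[name] = EMA_DECAY * shadow[name] + (1.0 - EMA_DECAY) * value
    return shadow


# After n steps at a constant live weight w, the shadow reaches (1 - 0.997**n) * w,
# so with ~7,050 training steps the shadow has long since converged to the
# running average of recent weights.
shadow = {"w": 0.0}
for _ in range(100):
    ema_update(shadow, {"w": 1.0})
print(round(shadow["w"], 4))  # -> 0.2595
```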

## N-gram Cache Details

- Order 2-9 backoff with 4M hash buckets
- Entropy-adaptive alpha: Ξ± varies by model confidence and n-gram order
- Per-order multipliers: low orders (2-3) suppressed at 0.3x, high orders (5-9) boosted at 2.0x
- Score-first: cache updated ONLY after scoring each 65K-token chunk
- All GPU ranks share identical cache state
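The entropy-adaptive alpha with per-order multipliers can be sketched as follows. The 0.3x/2.0x multipliers and ~4M buckets come from the list above; the sigmoid gate shape, the order-4 multiplier, and the hyperparameters `k` and `tau` are assumptions for illustration:

```python
import math

NUM_BUCKETS = 4 * 1024 * 1024  # "4M hash buckets"; exact constant assumed
# Orders 2-3 suppressed at 0.3x, orders 5-9 boosted at 2.0x (per the record);
# order 4's multiplier of 1.0 is an assumption.
ORDER_MULT = {o: 0.3 if o <= 3 else (2.0 if o >= 5 else 1.0) for o in range(2, 10)}


def bucket(context_tokens):
    """Hash an n-gram context into one of the ~4M buckets (illustrative)."""
    return hash(tuple(context_tokens)) % NUM_BUCKETS


def adaptive_alpha(model_entropy, order, k=4.0, tau=1.0):
    """Entropy-adaptive mixing weight: a sigmoid gate opens as the model's
    predictive entropy (in nats) rises, then the per-order multiplier
    suppresses low orders and boosts high ones. Capped at 1.0."""
    gate = 1.0 / (1.0 + math.exp(-k * (model_entropy - tau)))
    return min(1.0, gate * ORDER_MULT[order])


# A confident model (low entropy) barely consults the cache at low orders;
# an uncertain model leans heavily on high-order matches.
print(adaptive_alpha(0.2, 2) < adaptive_alpha(3.0, 9))  # -> True
```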

## Compliance

- [x] Training: 600s on 8xH100 SXM (within 600s)
- [x] Eval: ~405s on 8xH100 SXM (within 600s)
- [x] Artifacts under 16,000,000 bytes
- [x] **No TTT** — purely N-gram cache at eval time
- [x] Cache strictly backward-looking — updated only after scoring
- [x] No oracle, no training data at eval time

## Future Ideas

1. **Even smaller chunks** (32K, 16K) — diminishing returns, but may squeeze out another 0.01-0.02 BPB
2. **Legal score-first TTT** — per-chunk TTT where each chunk is scored first, then the model adapts on the scored tokens
3. **Distributed cache pre-fill** (PR #796's approach) — each rank fills the cache with all preceding positions
4. **Higher-order n-grams with more buckets** — order 9 is optimal at 4M buckets; more buckets may enable higher orders
5. **Complementary training** — bigram-weighted loss; didn't help with 1M chunks but may interact differently with 65K chunks

## Credits

- @deanbrr (PR #659) — original n-gram cache concept
- @newjordan (PR #674) — first legal implementation
- @lukacf (PR #702) — multi-order backoff + entropy-adaptive sigmoid
- @Asukabot0 (PR #727) — 7-gram, first sub-1.0 BPB
- @raahilshah (PR #634) — XSA-all
- @parinzee (PR #493) — LeakyReLU(0.5)²
- @signalrush (PR #414) — base GPTQ + EMA + warmdown stack
- @travispchen (PR #798) — per-order entropy thresholds

**Our novel contribution**: Fine-grained chunk updates for N-gram cache (65K vs 1M), demonstrating that cache update frequency is the dominant factor in N-gram BPB.
```json
{
  "author": "quietsmile",
  "github_id": "quietsmile",
  "name": "Fine-Grained N-gram Cache + LeakyReLU(0.9)\u00b2 + GPTQ-Int5",
  "blurb": "Order-9 N-gram backoff cache with 65K-token chunk updates (vs 1M in prior work). Entropy-adaptive alpha with per-order multipliers. 11L 512d GQA 8/4, MLP 3.0x, XSA-4, LeakyReLU(0.9)\u00b2, BigramHash(4096), GPTQ int5, LZMA. No TTT. Mean of 3 seeds: val_bpb 0.2873 (std 0.0001). Fully legal: no pre-eval TTT, score-first N-gram only. Training 600s on 8xH100 SXM, eval ~400s.",
  "date": "2026-03-26",
  "val_loss": 0.4852,
  "val_bpb": 0.2873,
  "val_loss_std": 0.0002,
  "val_bpb_std": 0.0001,
  "seeds": [1337, 42, 2024],
  "seed_results": {
    "1337": {"val_loss": 0.48500408, "val_bpb": 0.28724673},
    "42": {"val_loss": 0.48493125, "val_bpb": 0.28720360},
    "2024": {"val_loss": 0.48532938, "val_bpb": 0.28743939}
  },
  "pre_quant_val_bpb": 1.1404,
  "step_stop": 7050,
  "wallclock_seconds": 600.0,
  "eval_time_seconds": 405.0,
  "bytes_total": 13400000,
  "bytes_code": 180967
}
```