
Record: 0.2873 BPB — Fine-Grained N-gram Cache (65K chunks)#840

Open
quietsmile wants to merge 1 commit into openai:main from quietsmile:submission/ngram-chunk65k-0.287

Conversation

@quietsmile

Summary

val_bpb: 0.2873 (3-seed mean, std 0.0001) | ~13.4 MB | 8xH100 SXM | 600s train + ~405s eval
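For readers unfamiliar with the metric: bits-per-byte is the summed cross-entropy over the eval set, converted from nats to bits and normalized by the raw byte count. A minimal sketch (the function name is illustrative, not the harness's actual API):

```python
import math

def bits_per_byte(total_loss_nats, total_bytes):
    """Total cross-entropy in nats, converted to bits (divide by ln 2)
    and normalized by the uncompressed byte count of the eval data."""
    return total_loss_nats / (total_bytes * math.log(2))

# e.g. an average loss of ln(2) nats per byte is exactly 1.0 BPB:
assert abs(bits_per_byte(math.log(2) * 800, 800) - 1.0) < 1e-9
```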

Key Innovation: Fine-Grained N-gram Chunk Updates

The single most impactful change: reducing NGRAM_EVAL_CHUNK_TOKENS from 1,000,000 to 65,536.

The N-gram backoff cache only updates after each chunk is fully scored. With 1M-token chunks, the first million validation tokens see an empty cache — losing enormous predictive power. With 65K-token chunks, the cache refreshes 15x more frequently, giving each subsequent chunk a much richer set of n-gram statistics to draw from.

  Chunk Size    BPB      Delta
  1,000,000     0.4572   baseline
  65,536        0.2872   -0.170

This is purely an eval-time optimization — no training changes, no TTT, no additional compute.
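The score-first loop can be sketched as follows. Everything here is an illustrative stand-in (a `Counter` for the real backoff cache, a hit count for the real scoring), not the submission's actual code; it only demonstrates why smaller chunks leave the cache cold for fewer tokens:

```python
from collections import Counter

def chunked_eval(tokens, score_fn, chunk_tokens):
    """Score tokens chunk by chunk. The cache sees a chunk's tokens
    only AFTER that chunk has been scored (strictly backward-looking)."""
    cache = Counter()  # toy stand-in for the n-gram backoff cache
    total = 0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        total += score_fn(chunk, cache)  # cache is read-only while scoring
        cache.update(chunk)              # refreshed only afterwards
    return total

# Score = number of tokens already present in the cache when scored:
hits = lambda chunk, cache: sum(1 for t in chunk if cache[t] > 0)
stream = list("abcabcabc" * 100)  # 900 highly repetitive tokens

# Smaller chunks -> only the first 3 tokens see an empty cache,
# instead of the first 300:
assert chunked_eval(stream, hits, 3) > chunked_eval(stream, hits, 300)
```

The same logic explains the 1M-to-65K win: with 1M-token chunks the entire first million validation tokens are scored against an empty cache.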

3-Seed Results

  Seed   BPB       Artifact bytes
  1337   0.28725   ~13.4 MB
  42     0.28720   ~13.4 MB
  2024   0.28744   ~13.4 MB
  Mean   0.2873    (std 0.0001)

Architecture

11L 512d GQA 8/4, MLP 3.0x, XSA-4, LeakyReLU(0.9)², BigramHash(4096), GPTQ int5 + LZMA.
EMA(0.997) + SWA. Parallel Muon optimizer. Perplexity-sorted shard ordering.
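Read as a config, the stated architecture amounts to roughly the following. Field names are hypothetical; only the values come from the description above:

```python
# Hypothetical config mirroring the architecture summary above;
# key names are illustrative, not the submission's actual code.
config = dict(
    n_layers=11,
    d_model=512,
    n_heads=8, n_kv_heads=4,          # GQA 8/4
    mlp_ratio=3.0,                    # MLP hidden = 3.0 x d_model
    activation="LeakyReLU(0.9)^2",    # squared LeakyReLU, slope 0.9
    bigram_hash_buckets=4096,         # BigramHash(4096)
    quantization="GPTQ int5 + LZMA",
    ema_decay=0.997,                  # EMA + SWA weight averaging
    optimizer="Parallel Muon",
)
```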

N-gram Cache Details

  • Order 2-9 backoff with 4M hash buckets
  • Entropy-adaptive alpha: α varies by model confidence and n-gram order
  • Per-order multipliers: low orders (2-3) suppressed at 0.3x, high orders (5-9) boosted at 2.0x
  • Score-first: cache updated ONLY after scoring each 65K-token chunk
  • All GPU ranks share identical cache state
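A toy sketch of how the entropy-adaptive, per-order blending described above could fit together. The function names, the base alpha, and the order-4 multiplier are assumptions (the source only specifies the 0.3x and 2.0x multipliers):

```python
import math

# Per-order multipliers from the list above; order 4 is an assumed 1.0.
ORDER_MULT = {o: 0.3 if o <= 3 else (2.0 if o >= 5 else 1.0)
              for o in range(2, 10)}

def adaptive_alpha(model_probs, order, base_alpha=0.1):
    """Mixing weight for the n-gram cache: grows with model entropy
    (i.e. low model confidence) and with the per-order multiplier."""
    h = -sum(p * math.log(p) for p in model_probs if p > 0)
    h_max = math.log(len(model_probs))          # entropy of uniform dist
    return min(1.0, base_alpha * (h / h_max) * ORDER_MULT[order])

def blend(p_model, p_ngram, alpha):
    # Final next-token probability: linear interpolation of the two.
    return (1 - alpha) * p_model + alpha * p_ngram
```

Under this sketch, a confident (peaked) model distribution pulls alpha toward zero, while an uncertain (near-uniform) one hands more weight to the high-order n-gram statistics.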

Compliance

  • Training: 600s on 8xH100 SXM (within 600s)
  • Eval: ~405s on 8xH100 SXM (within 600s)
  • Artifacts under 16,000,000 bytes
  • No TTT — purely N-gram cache at eval time
  • Cache strictly backward-looking — updated only after scoring
  • No oracle, no training data at eval time

Credits

This builds on prior community work. Our novel contribution is the fine-grained chunk-update schedule for the N-gram cache (65K vs. 1M tokens), demonstrating that cache update frequency is the dominant factor in N-gram BPB.

🤖 Generated with Claude Code

Key innovation: reduce NGRAM_EVAL_CHUNK_TOKENS from 1M to 65K.
The N-gram cache updates after each chunk, so smaller chunks mean
more frequent cache refreshes and richer n-gram statistics.

Results (3-seed mean): 0.2873 BPB (std 0.0001)
Fully legal: no pre-eval TTT, score-first N-gram only.
11L 512d GQA 8/4, MLP 3.0x, XSA-4, LeakyReLU(0.9)²,
BigramHash(4096), GPTQ int5, LZMA. 600s train + 405s eval.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
