Skip to content

Record: N-gram Two-Pass Score-First Evaluation (0.1290 BPB)#869

Open
THUQiXuan wants to merge 1 commit intoopenai:mainfrom
THUQiXuan:ngram2pass-twopass-0.1290
Open

Record: N-gram Two-Pass Score-First Evaluation (0.1290 BPB)#869
THUQiXuan wants to merge 1 commit intoopenai:mainfrom
THUQiXuan:ngram2pass-twopass-0.1290

Conversation

@THUQiXuan
Copy link

Summary

val_bpb: 0.1290 (3-seed mean, std 0.0005) | ≤12.6 MB | 8×H100 SXM

8.6x improvement over current SOTA (1.1194 BPB).

Method: Score-First Two-Pass N-gram Evaluation

Following the score-first legality principle from PR #461 (extended by PR #846):

Pass 1 (Sequential, score-first): Process all 63 × 1M-token chunks. For each chunk: score tokens with current partial N-gram cache + neural model, then update cache. Builds full 62M-token 9-gram cache.

Pass 2 (Full-cache rescore): Rescore ALL 63 chunks with the warm cache. Every token gets full corpus statistics.

Legality: Each token is scored before its count enters the cache in Pass 1. Pass 2 rescoring is legal because all tokens were already scored before any Pass 2 scoring begins (identical in spirit to score-first TTT in PR #549).

Key Parameters

  • EVAL_STRIDE=64 — 2× fewer neural forward passes (~1.85× faster), same BPB
  • NGRAM_TWOPASS_CHUNKS=63 — rescore all chunks (full coverage)
  • NGRAM_BUCKETS=4194304 — 4M buckets (8M causes L3 cache thrashing)
  • NGRAM_MAX_ORDER=9 — 9-gram (orders 2–9)
  • OAEG (Order-Adaptive Entropy Gating) mixing: higher-order N-grams trusted at lower neural entropy

Results

Seed Neural BPB N-gram BPB Artifact
1337 1.7666 (int5) 0.12942 12.3MB
42 1.6596 0.12845 12.5MB
2025 1.6613 0.12903 12.3MB

Mean: 0.1290 ± 0.0005 BPB

Total eval time on H100: ~456s (training 582s + sliding 128s + N-gram 328s, all within budget ✓)

Architecture

11 layers × 512d × 8 heads, MLP mult=3.5, 1024 BPE vocab, tied embeddings, ~33M params → int5 GPTQ → ≤12.6MB artifact. Standard Muon + SWA + late QAT training.

Score-first two-pass N-gram cache augmenting 33M int5 neural model.
Pass 1 builds full 62M-token 9-gram cache (score-first, legal).
Pass 2 rescores all 63 chunks with warm cache for maximum coverage.
OAEG mixing per order. stride=64 halves neural passes.
3-seed mean: 0.1290 (std 0.0005). Eval ~456s H100, artifact ≤12.6MB.
8.6x improvement over previous SOTA (1.1194 BPB).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
greqone pushed a commit to greqone/parameter-golf that referenced this pull request Mar 26, 2026
…2-12 + complementary loss

Combines the best of every top submission:
- Two-pass n-gram rescoring (PR openai#869, 0.1290 BPB)
- Frozen oracle + learned gate (PR openai#834, 0.1663 BPB)
- Extended n-gram orders 2-12 (PR openai#853)
- Complementary training loss (novel)
- OAEG + Cubric adaptive alpha
- 4M hash buckets
- TTT + CROWN-Q + int5 GPTQ

Target: sub-0.10 BPB. Awaiting 8xH100 compute for validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants