Non-record (WIP): Multi-Order N-gram Backoff — val_bpb=0.8004 (1×H100 proxy) #871
greqone wants to merge 2 commits into openai:main from
Conversation
… proxy)

10L + Multi-Order N-gram Backoff with entropy-adaptive alpha. Validated on 1×H100 SXM (876 steps, 59% eval coverage). Pending 8×H100 SXM verification for official record submission. Based on PR openai#828 approach with MATRIX_LR=0.03.

Architecture: 10L, 512d, MLP 3x LeakyReLU(0.5)², XSA-4, VRL, BigramHash, SmearGate.
Artifact: 15.18 MB (under 16 MB limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new WIP record bundle under records/track_10min_16mb/ for a multi-order (2–7) n-gram backoff evaluation pipeline with entropy-adaptive mixing, validated via a 1×H100 proxy run.
Changes:
- Introduces a full training/eval script that includes multi-order n-gram backoff (score-first) sliding-window evaluation and mixed int5/int6 quantization export.
- Adds a proxy training log capturing the 1×H100 run and partial-coverage eval results.
- Adds a README describing the approach, results, and reproduction commands.
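The mixed int5/int6 quantization export mentioned above can be sketched as symmetric fixed-point quantization. This is a hypothetical minimal version (`quantize_symmetric` and its per-tensor scale are illustrative stand-ins, not the record script's actual export scheme):

```python
def quantize_symmetric(values, bits):
    """Hypothetical sketch: map floats to signed ints in [-qmax, qmax],
    where qmax = 2**(bits-1) - 1 (15 for int5, 31 for int6)."""
    qmax = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0  # guard against an all-zero tensor
    # Round to the nearest step, then clip into the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized ints."""
    return [qi * scale for qi in q]
```

A mixed-precision export would then pick `bits=5` or `bits=6` per tensor, trading reconstruction error against artifact size.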
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `records/track_10min_16mb/2026-03-26_NgramBackoff_LeakyReLU_WIP/train_gpt.py` | New WIP training + quantization + sliding n-gram backoff eval implementation. |
| `records/track_10min_16mb/2026-03-26_NgramBackoff_LeakyReLU_WIP/train_1xh100_proxy.log` | Proxy run log demonstrating observed BPB and eval-time/coverage behavior. |
| `records/track_10min_16mb/2026-03-26_NgramBackoff_LeakyReLU_WIP/README.md` | Documentation of the WIP method, proxy results, and reproduction steps. |
```python
t = tokens.to(torch.int32)
mod = self.bigram_vocab_size - 1
out = torch.empty_like(t)
out[..., 0] = mod
out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
return out.long()
```
bigram_hash() sets mod = bigram_vocab_size - 1 and then computes % mod. If BIGRAM_VOCAB_SIZE is ever set to 1, this will raise a division-by-zero error. Consider validating bigram_vocab_size >= 2 in __init__ (or before constructing BigramHashEmbedding).
```python
eval_start = time.perf_counter()
eval_budget_s = 570.0
# Pre-allocate eval buffers (avoid per-batch allocation)
x_buf = torch.zeros(batch_seqs, seq_len, dtype=torch.int64, device=device)
y_buf = torch.zeros(batch_seqs, seq_len, dtype=torch.int64, device=device)
base_model.eval()
# Compile eval path for faster inference
compiled_logits = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True)
with torch.inference_mode():
    for bi in range(0, len(my_windows), batch_seqs):
        eval_elapsed = time.perf_counter() - eval_start
        if eval_elapsed > eval_budget_s:
            if rank == 0:
                print(f"  FAILSAFE: ngram eval time {eval_elapsed:.0f}s exceeds budget", flush=True)
            break
```
eval_val_sliding_ngram() can return partial-coverage metrics: it starts the timer before torch.compile(...) and breaks out once eval_elapsed > eval_budget_s, returning BPB over only the tokens processed. This makes reported val_bpb dependent on compilation time / machine speed and can be invalid for any run that requires full validation coverage. Consider (1) compiling + running a small warmup before starting the eval timer, and (2) failing hard (or at least returning coverage + refusing to report a single-number BPB) if the full validation set isn’t scored.
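The suggested restructuring can be sketched in plain Python, abstracting away the PyTorch specifics (`compile_fn` stands in for `torch.compile` plus a warmup forward pass; `batches` for the rank's eval windows — both hypothetical):

```python
import time

def budgeted_eval(compile_fn, batches, budget_s, clock=time.perf_counter):
    """Sketch of the suggested fix: pay the one-time setup cost (compile +
    warmup) *before* starting the eval timer, and report coverage alongside
    the metric instead of a bare number."""
    run = compile_fn()      # compilation / warmup happens outside the budget
    start = clock()         # the (e.g. 570 s) budget starts only here
    scored = 0
    for batch in batches:
        if clock() - start > budget_s:
            break           # out of budget: remaining batches are unscored
        run(batch)
        scored += 1
    coverage = scored / max(len(batches), 1)
    # Caller can refuse to report a single-number BPB when coverage < 1.0.
    return scored, coverage
```

With the timer started after warmup, reported coverage reflects only the scoring loop, not machine-dependent compile time.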
```
The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.

Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
```
This module docstring states a hard stop of 1500 lines for train_gpt.py, but this record script is 1542 lines long. Please update/remove the hard-stop note here (or point it to the top-level newcomer script) so the documentation matches the file’s intent and length.
Suggested change:

```diff
-The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.
-Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
+The top-level `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify those newcomer scripts without significantly increasing complexity, but competitive or archival submissions should stay in the `/records` folder.
+For readability, we aim to keep the top-level newcomer versions of `train_gpt.py` and `train_gpt_mlx.py` under ~1500 lines. Record scripts under `/records/` (like this file) are allowed to exceed that length to capture full experimental configurations.
```
```python
self.head_dim = dim // num_heads
if self.head_dim % 2 != 0:
    raise ValueError("head_dim must be even for RoPE")
self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
```
rope_dims is not validated against head_dim. If ROPE_DIMS is set to an odd value or larger than head_dim, Rotary(self.rope_dims, ...) and apply_rotary_emb will produce shape/rotation errors at runtime. Add explicit validation (e.g., 0 < rope_dims <= head_dim and rope_dims % 2 == 0) and raise a clear ValueError when invalid.
Suggested change:

```diff
-self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
+# Determine and validate rotary embedding dimensionality.
+# If rope_dims <= 0, fall back to using the full head_dim (existing default behavior).
+if rope_dims <= 0:
+    effective_rope_dims = self.head_dim
+else:
+    effective_rope_dims = rope_dims
+if effective_rope_dims <= 0 or effective_rope_dims > self.head_dim or effective_rope_dims % 2 != 0:
+    raise ValueError(
+        f"Invalid rope_dims={rope_dims!r}. Expected an even value with "
+        f"0 < rope_dims <= head_dim (head_dim={self.head_dim})."
+    )
+self.rope_dims = effective_rope_dims
```
```python
buckets = args.ngram_eval_buckets
min_count = args.ngram_eval_min_count
use_entropy = args.ngram_eval_entropy
ent_base = args.ngram_eval_ent_base
ent_range = args.ngram_eval_ent_range
ent_scale = args.ngram_eval_ent_scale
ent_thresh = args.ngram_eval_ent_thresh
base_alpha = args.ngram_eval_alpha
n_orders = max_order - min_order + 1

window_starts = [ws for ws in range(0, total_tokens, stride)
                 if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0]
total_windows = len(window_starts)
my_s = (total_windows * rank) // world_size
my_e = (total_windows * (rank + 1)) // world_size
my_windows = window_starts[my_s:my_e]

val_np = val_tokens.numpy()
ctx_tables = [np.zeros((buckets,), dtype=np.uint32) for _ in range(n_orders)]
full_tables = [np.zeros((buckets,), dtype=np.uint32) for _ in range(n_orders)]
mask = np.uint64(buckets - 1)
primes = np.array(
```
The n-gram hash indexing uses mask = buckets - 1 and ctx_hash & mask, which only works correctly when buckets is a power of two. Add a validation that buckets is a power-of-two (and > 0), or switch to % buckets to support arbitrary sizes.
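The power-of-two check is cheap with a bit trick: a positive integer `n` is a power of two iff `n & (n - 1) == 0`. A minimal sketch of the suggested validation (function name illustrative):

```python
def check_buckets(buckets: int) -> int:
    """Validate that `buckets` is a positive power of two, so that
    `hash & (buckets - 1)` is equivalent to `hash % buckets`.
    Returns the bit mask the hash indexing can use."""
    if buckets <= 0 or (buckets & (buckets - 1)) != 0:
        raise ValueError(f"ngram_eval_buckets must be a positive power of two, got {buckets}")
    return buckets - 1  # e.g. 0x3FFFFF for 4M buckets
```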
… 2-12 + complementary loss

Combines the best of every top submission:
- Two-pass n-gram rescoring (PR openai#869, 0.1290 BPB)
- Frozen oracle + learned gate (PR openai#834, 0.1663 BPB)
- Extended n-gram orders 2-12 (PR openai#853)
- Complementary training loss (novel)
- OAEG + Cubric adaptive alpha
- 4M hash buckets
- TTT + CROWN-Q + int5 GPTQ

Target: sub-0.10 BPB. Awaiting 8xH100 compute for validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Multi-Order N-gram Backoff with entropy-adaptive alpha, validated on 1×H100 SXM proxy run. Based on PR #828 approach with MATRIX_LR=0.03.

Proxy val_bpb = 0.8004 (1×H100, 876 steps, 59% eval coverage) | 15.18 MB artifact
Why WIP
This is a proxy validation run on 1×H100 SXM (876 training steps vs ~7000 on 8×H100). The base model quality is 1.38 BPB (vs expected ~1.15 on 8×H100). We're currently compute-constrained — awaiting 8×H100 SXM availability for official 3-seed verification.
On 8×H100 we expect ~0.90-0.92 BPB, consistent with PR #828's reported 0.9076.
Architecture
N-gram Eval (Legal, Score-First)
α = 0.05 + 0.55 × σ(2(H − 4))

1×H100 Proxy Results
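The entropy-adaptive mixing weight above can be sketched directly (assuming H is the model's predictive entropy and σ the logistic sigmoid; the function name is illustrative):

```python
import math

def adaptive_alpha(entropy: float) -> float:
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4)), per the formula above.
    Low-entropy (confident) positions keep alpha near 0.05; high-entropy
    positions lean harder on the n-gram backoff, capped near 0.60."""
    sigmoid = 1.0 / (1.0 + math.exp(-2.0 * (entropy - 4.0)))
    return 0.05 + 0.55 * sigmoid
```

At H = 4 the weight sits at the midpoint 0.325; the sharpness factor 2 controls how quickly it transitions around that entropy threshold.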
Roadmap
Test plan
🤖 Generated with Claude Code