Non-record (WIP): Multi-Order N-gram Backoff — val_bpb=0.8004 (1xH100 proxy)#871

Open
greqone wants to merge 2 commits into openai:main from greqone:submission/ngram-backoff-wip

Conversation

greqone commented Mar 26, 2026

Summary

Multi-Order N-gram Backoff with entropy-adaptive alpha, validated on 1×H100 SXM proxy run. Based on PR #828 approach with MATRIX_LR=0.03.

Proxy val_bpb = 0.8004 (1×H100, 876 steps, 59% eval coverage) | 15.18 MB artifact
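
As a minimal sketch of the entropy-adaptive alpha idea: the functional form below (linearly shrinking the n-gram weight once the n-gram distribution's entropy passes a threshold) and the names `base_alpha`, `ent_thresh`, `ent_scale` are illustrative assumptions modeled on the script's `ngram_eval_*` flags, not the exact submitted implementation.

```python
import numpy as np

def mix_with_ngram(model_probs, ngram_probs, base_alpha=0.5,
                   ent_thresh=2.0, ent_scale=0.25):
    """Blend model and n-gram next-token distributions.

    The n-gram weight (alpha) shrinks when the n-gram distribution is
    high-entropy, i.e. when the matched context is uninformative.
    """
    eps = 1e-9
    # Entropy of the n-gram distribution, in bits (keepdims for broadcasting).
    ent = -np.sum(ngram_probs * np.log2(ngram_probs + eps),
                  axis=-1, keepdims=True)
    # Linearly reduce alpha past the entropy threshold, clamped to [0, 1].
    alpha = np.clip(base_alpha - ent_scale * np.maximum(ent - ent_thresh, 0.0),
                    0.0, 1.0)
    return (1.0 - alpha) * model_probs + alpha * ngram_probs
```

A confident n-gram match (low entropy) keeps the full `base_alpha` weight, while a near-uniform match fades back to the base model, which is what makes backoff safe on contexts the count tables have barely seen.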

Why WIP

This is a proxy validation run on 1×H100 SXM (876 training steps vs ~7000 on 8×H100). The base model quality is 1.38 BPB (vs expected ~1.15 on 8×H100). We're currently compute-constrained — awaiting 8×H100 SXM availability for official 3-seed verification.

On 8×H100 we expect ~0.90-0.92 BPB, consistent with PR #828's reported 0.9076.

Architecture

  • 10L, 512d, GQA 8H/4KV, MLP 3× LeakyReLU(0.5)²
  • BigramHash(4096, dim=128), SmearGate, Value Residual, Gated Attention
  • XSA last 4 layers, Partial RoPE 16/64, LN Scale
  • U-Net skip connections, tied embeddings, logit softcap=30

N-gram Eval (Legal, Score-First)

1×H100 Proxy Results

| Metric | Value |
| --- | --- |
| Training steps | 876 |
| Pre-quant val_bpb | 1.3796 |
| N-gram BPB | 0.8004 |
| Artifact | 15.18 MB |
| Eval coverage | 59.4% |

Roadmap

Test plan

  • 1×H100 proxy run validates code, artifact size, and n-gram eval pipeline
  • 3-seed 8×H100 run for statistical significance (pending compute)

🤖 Generated with Claude Code

… proxy)

10L + Multi-Order N-gram Backoff with entropy-adaptive alpha.
Validated on 1xH100 SXM (876 steps, 59% eval coverage).
Pending 8xH100 SXM verification for official record submission.

Based on PR openai#828 approach with MATRIX_LR=0.03.
Architecture: 10L, 512d, MLP 3x LeakyReLU(0.5)², XSA-4, VRL, BigramHash, SmearGate.
Artifact: 15.18 MB (under 16 MB limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 26, 2026 17:13

Copilot AI left a comment


Pull request overview

Adds a new WIP record bundle under records/track_10min_16mb/ for a multi-order (2–7) n-gram backoff evaluation pipeline with entropy-adaptive mixing, validated via a 1×H100 proxy run.

Changes:

  • Introduces a full training/eval script that includes multi-order n-gram backoff (score-first) sliding-window evaluation and mixed int5/int6 quantization export.
  • Adds a proxy training log capturing the 1×H100 run and partial-coverage eval results.
  • Adds a README describing the approach, results, and reproduction commands.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `records/track_10min_16mb/2026-03-26_NgramBackoff_LeakyReLU_WIP/train_gpt.py` | New WIP training + quantization + sliding n-gram backoff eval implementation. |
| `records/track_10min_16mb/2026-03-26_NgramBackoff_LeakyReLU_WIP/train_1xh100_proxy.log` | Proxy run log demonstrating observed BPB and eval-time/coverage behavior. |
| `records/track_10min_16mb/2026-03-26_NgramBackoff_LeakyReLU_WIP/README.md` | Documentation of the WIP method, proxy results, and reproduction steps. |


Comment on lines +655 to +660
```python
t = tokens.to(torch.int32)
mod = self.bigram_vocab_size - 1
out = torch.empty_like(t)
out[..., 0] = mod
out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
return out.long()
```

Copilot AI Mar 26, 2026


bigram_hash() sets mod = bigram_vocab_size - 1 and then computes % mod. If BIGRAM_VOCAB_SIZE is ever set to 1, this will raise a division-by-zero error. Consider validating bigram_vocab_size >= 2 in __init__ (or before constructing BigramHashEmbedding).

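A minimal sketch of the suggested guard; the class name matches the snippet, but the constructor body shown here is hypothetical (the real class also builds the embedding table and hashing constants).

```python
class BigramHashEmbedding:
    def __init__(self, bigram_vocab_size: int, dim: int):
        # The hash computes indices modulo (bigram_vocab_size - 1),
        # so a size below 2 would divide by zero at runtime.
        if bigram_vocab_size < 2:
            raise ValueError(
                f"bigram_vocab_size must be >= 2, got {bigram_vocab_size}"
            )
        self.bigram_vocab_size = bigram_vocab_size
        self.dim = dim
```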
Comment on lines +938 to +952
```python
eval_start = time.perf_counter()
eval_budget_s = 570.0
# Pre-allocate eval buffers (avoid per-batch allocation)
x_buf = torch.zeros(batch_seqs, seq_len, dtype=torch.int64, device=device)
y_buf = torch.zeros(batch_seqs, seq_len, dtype=torch.int64, device=device)
base_model.eval()
# Compile eval path for faster inference
compiled_logits = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True)
with torch.inference_mode():
    for bi in range(0, len(my_windows), batch_seqs):
        eval_elapsed = time.perf_counter() - eval_start
        if eval_elapsed > eval_budget_s:
            if rank == 0:
                print(f" FAILSAFE: ngram eval time {eval_elapsed:.0f}s exceeds budget", flush=True)
            break
```

Copilot AI Mar 26, 2026


eval_val_sliding_ngram() can return partial-coverage metrics: it starts the timer before torch.compile(...) and breaks out once eval_elapsed > eval_budget_s, returning BPB over only the tokens processed. This makes reported val_bpb dependent on compilation time / machine speed and can be invalid for any run that requires full validation coverage. Consider (1) compiling + running a small warmup before starting the eval timer, and (2) failing hard (or at least returning coverage + refusing to report a single-number BPB) if the full validation set isn’t scored.

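The warmup-before-timer fix could look like this framework-agnostic sketch; `timed_eval`, `step_fn`, and `batches` are hypothetical stand-ins for the compiled forward pass and the window batches, and the coverage return mirrors the reviewer's suggestion to report coverage rather than a bare BPB.

```python
import time

def timed_eval(step_fn, batches, budget_s, warmup_batches=1):
    """Run warmup (e.g. the batch that triggers torch.compile) before
    starting the eval timer, so reported coverage does not absorb
    compilation time that varies across machines."""
    # Warmup outside the timed region; its results still count.
    for b in batches[:warmup_batches]:
        step_fn(b)
    start = time.perf_counter()
    processed = warmup_batches
    for b in batches[warmup_batches:]:
        if time.perf_counter() - start > budget_s:
            break  # out of budget: stop and report partial coverage
        step_fn(b)
        processed += 1
    # Caller can fail hard (or flag the run) when coverage < 1.0.
    return processed / len(batches)
```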
Comment on lines +2 to +4
```
The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.

Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
```

Copilot AI Mar 26, 2026


This module docstring states a hard stop of 1500 lines for train_gpt.py, but this record script is 1542 lines long. Please update/remove the hard-stop note here (or point it to the top-level newcomer script) so the documentation matches the file’s intent and length.

Suggested change

```diff
-The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.
-Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
+The top-level `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify those newcomer scripts without significantly increasing complexity, but competitive or archival submissions should stay in the `/records` folder.
+For readability, we aim to keep the top-level newcomer versions of `train_gpt.py` and `train_gpt_mlx.py` under ~1500 lines. Record scripts under `/records/` (like this file) are allowed to exceed that length to capture full experimental configurations.
```

```python
self.head_dim = dim // num_heads
if self.head_dim % 2 != 0:
    raise ValueError("head_dim must be even for RoPE")
self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
```

Copilot AI Mar 26, 2026


rope_dims is not validated against head_dim. If ROPE_DIMS is set to an odd value or larger than head_dim, Rotary(self.rope_dims, ...) and apply_rotary_emb will produce shape/rotation errors at runtime. Add explicit validation (e.g., 0 < rope_dims <= head_dim and rope_dims % 2 == 0) and raise a clear ValueError when invalid.

Suggested change

```diff
-self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
+# Determine and validate rotary embedding dimensionality.
+# If rope_dims <= 0, fall back to using the full head_dim (existing default behavior).
+if rope_dims <= 0:
+    effective_rope_dims = self.head_dim
+else:
+    effective_rope_dims = rope_dims
+if effective_rope_dims <= 0 or effective_rope_dims > self.head_dim or effective_rope_dims % 2 != 0:
+    raise ValueError(
+        f"Invalid rope_dims={rope_dims!r}. Expected an even value with "
+        f"0 < rope_dims <= head_dim (head_dim={self.head_dim})."
+    )
+self.rope_dims = effective_rope_dims
```

Comment on lines +901 to +922
```python
buckets = args.ngram_eval_buckets
min_count = args.ngram_eval_min_count
use_entropy = args.ngram_eval_entropy
ent_base = args.ngram_eval_ent_base
ent_range = args.ngram_eval_ent_range
ent_scale = args.ngram_eval_ent_scale
ent_thresh = args.ngram_eval_ent_thresh
base_alpha = args.ngram_eval_alpha
n_orders = max_order - min_order + 1

window_starts = [ws for ws in range(0, total_tokens, stride)
                 if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0]
total_windows = len(window_starts)
my_s = (total_windows * rank) // world_size
my_e = (total_windows * (rank + 1)) // world_size
my_windows = window_starts[my_s:my_e]

val_np = val_tokens.numpy()
ctx_tables = [np.zeros((buckets,), dtype=np.uint32) for _ in range(n_orders)]
full_tables = [np.zeros((buckets,), dtype=np.uint32) for _ in range(n_orders)]
mask = np.uint64(buckets - 1)
primes = np.array(
```

Copilot AI Mar 26, 2026


The n-gram hash indexing uses mask = buckets - 1 and ctx_hash & mask, which only works correctly when buckets is a power of two. Add a validation that buckets is a power-of-two (and > 0), or switch to % buckets to support arbitrary sizes.

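The power-of-two check is cheap with a standard bit trick; this sketch (the name `validate_buckets` is hypothetical) could run once before the count tables are allocated.

```python
def validate_buckets(buckets: int) -> None:
    """Masking with (buckets - 1) only maps hashes uniformly into
    [0, buckets) when buckets is a power of two."""
    # A power of two has exactly one bit set, so n & (n - 1) == 0.
    if buckets <= 0 or (buckets & (buckets - 1)) != 0:
        raise ValueError(
            f"ngram_eval_buckets must be a positive power of two, got {buckets}"
        )
```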
…2-12 + complementary loss

Combines the best of every top submission:
- Two-pass n-gram rescoring (PR openai#869, 0.1290 BPB)
- Frozen oracle + learned gate (PR openai#834, 0.1663 BPB)
- Extended n-gram orders 2-12 (PR openai#853)
- Complementary training loss (novel)
- OAEG + Cubric adaptive alpha
- 4M hash buckets
- TTT + CROWN-Q + int5 GPTQ

Target: sub-0.10 BPB. Awaiting 8xH100 compute for validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>