Non-record (WIP): Multi-Order N-gram Backoff — val_bpb=0.8004 (1×H100 proxy) #871
greqone wants to merge 2 commits into openai:main from
Conversation
… proxy)

10L + Multi-Order N-gram Backoff with entropy-adaptive alpha. Validated on 1×H100 SXM (876 steps, 59% eval coverage). Pending 8×H100 SXM verification for official record submission. Based on PR openai#828 approach with MATRIX_LR=0.03.

Architecture: 10L, 512d, MLP 3x LeakyReLU(0.5)², XSA-4, VRL, BigramHash, SmearGate.
Artifact: 15.18 MB (under 16 MB limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new WIP record bundle under records/track_10min_16mb/ for a multi-order (2–7) n-gram backoff evaluation pipeline with entropy-adaptive mixing, validated via a 1×H100 proxy run.
Changes:
- Introduces a full training/eval script that includes multi-order n-gram backoff (score-first) sliding-window evaluation and mixed int5/int6 quantization export.
- Adds a proxy training log capturing the 1×H100 run and partial-coverage eval results.
- Adds a README describing the approach, results, and reproduction commands.
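The mixed int5/int6 quantization export mentioned above can be sketched as symmetric fixed-point quantization. This is a hypothetical minimal version (`quantize_symmetric` and its per-tensor scale are illustrative stand-ins, not the record script's actual export scheme):

```python
def quantize_symmetric(values, bits):
    """Hypothetical sketch: map floats to signed ints in [-qmax, qmax],
    where qmax = 2**(bits-1) - 1 (15 for int5, 31 for int6)."""
    qmax = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0  # guard against an all-zero tensor
    # Round to the nearest step, then clip into the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized ints."""
    return [qi * scale for qi in q]
```

A mixed-precision export would then pick `bits=5` or `bits=6` per tensor, trading reconstruction error against artifact size.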
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `records/track_10min_16mb/2026-03-26_NgramBackoff_LeakyReLU_WIP/train_gpt.py` | New WIP training + quantization + sliding n-gram backoff eval implementation. |
| `records/track_10min_16mb/2026-03-26_NgramBackoff_LeakyReLU_WIP/train_1xh100_proxy.log` | Proxy run log demonstrating observed BPB and eval-time/coverage behavior. |
| `records/track_10min_16mb/2026-03-26_NgramBackoff_LeakyReLU_WIP/README.md` | Documentation of the WIP method, proxy results, and reproduction steps. |
```python
t = tokens.to(torch.int32)
mod = self.bigram_vocab_size - 1
out = torch.empty_like(t)
out[..., 0] = mod
out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
return out.long()
```
bigram_hash() sets mod = bigram_vocab_size - 1 and then computes % mod. If BIGRAM_VOCAB_SIZE is ever set to 1, this will raise a division-by-zero error. Consider validating bigram_vocab_size >= 2 in __init__ (or before constructing BigramHashEmbedding).
```python
eval_start = time.perf_counter()
eval_budget_s = 570.0
# Pre-allocate eval buffers (avoid per-batch allocation)
x_buf = torch.zeros(batch_seqs, seq_len, dtype=torch.int64, device=device)
y_buf = torch.zeros(batch_seqs, seq_len, dtype=torch.int64, device=device)
base_model.eval()
# Compile eval path for faster inference
compiled_logits = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True)
with torch.inference_mode():
    for bi in range(0, len(my_windows), batch_seqs):
        eval_elapsed = time.perf_counter() - eval_start
        if eval_elapsed > eval_budget_s:
            if rank == 0:
                print(f"  FAILSAFE: ngram eval time {eval_elapsed:.0f}s exceeds budget", flush=True)
            break
```
eval_val_sliding_ngram() can return partial-coverage metrics: it starts the timer before torch.compile(...) and breaks out once eval_elapsed > eval_budget_s, returning BPB over only the tokens processed. This makes reported val_bpb dependent on compilation time / machine speed and can be invalid for any run that requires full validation coverage. Consider (1) compiling + running a small warmup before starting the eval timer, and (2) failing hard (or at least returning coverage + refusing to report a single-number BPB) if the full validation set isn’t scored.
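The suggested restructuring can be sketched in plain Python, abstracting away the PyTorch specifics (`compile_fn` stands in for `torch.compile` plus a warmup forward pass; `batches` for the rank's eval windows — both hypothetical):

```python
import time

def budgeted_eval(compile_fn, batches, budget_s, clock=time.perf_counter):
    """Sketch of the suggested fix: pay the one-time setup cost (compile +
    warmup) *before* starting the eval timer, and report coverage alongside
    the metric instead of a bare number."""
    run = compile_fn()      # compilation / warmup happens outside the budget
    start = clock()         # the (e.g. 570 s) budget starts only here
    scored = 0
    for batch in batches:
        if clock() - start > budget_s:
            break           # out of budget: remaining batches are unscored
        run(batch)
        scored += 1
    coverage = scored / max(len(batches), 1)
    # Caller can refuse to report a single-number BPB when coverage < 1.0.
    return scored, coverage
```

With the timer started after warmup, reported coverage reflects only the scoring loop, not machine-dependent compile time.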
```
The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.

Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
```
This module docstring states a hard stop of 1500 lines for train_gpt.py, but this record script is 1542 lines long. Please update/remove the hard-stop note here (or point it to the top-level newcomer script) so the documentation matches the file’s intent and length.
Suggested change:

```diff
-The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder.
-Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
+The top-level `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify those newcomer scripts without significantly increasing complexity, but competitive or archival submissions should stay in the `/records` folder.
+For readability, we aim to keep the top-level newcomer versions of `train_gpt.py` and `train_gpt_mlx.py` under ~1500 lines. Record scripts under `/records/` (like this file) are allowed to exceed that length to capture full experimental configurations.
```
```python
self.head_dim = dim // num_heads
if self.head_dim % 2 != 0:
    raise ValueError("head_dim must be even for RoPE")
self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
```
rope_dims is not validated against head_dim. If ROPE_DIMS is set to an odd value or larger than head_dim, Rotary(self.rope_dims, ...) and apply_rotary_emb will produce shape/rotation errors at runtime. Add explicit validation (e.g., 0 < rope_dims <= head_dim and rope_dims % 2 == 0) and raise a clear ValueError when invalid.
Suggested change:

```diff
-self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
+# Determine and validate rotary embedding dimensionality.
+# If rope_dims <= 0, fall back to using the full head_dim (existing default behavior).
+if rope_dims <= 0:
+    effective_rope_dims = self.head_dim
+else:
+    effective_rope_dims = rope_dims
+if effective_rope_dims <= 0 or effective_rope_dims > self.head_dim or effective_rope_dims % 2 != 0:
+    raise ValueError(
+        f"Invalid rope_dims={rope_dims!r}. Expected an even value with "
+        f"0 < rope_dims <= head_dim (head_dim={self.head_dim})."
+    )
+self.rope_dims = effective_rope_dims
```
```python
buckets = args.ngram_eval_buckets
min_count = args.ngram_eval_min_count
use_entropy = args.ngram_eval_entropy
ent_base = args.ngram_eval_ent_base
ent_range = args.ngram_eval_ent_range
ent_scale = args.ngram_eval_ent_scale
ent_thresh = args.ngram_eval_ent_thresh
base_alpha = args.ngram_eval_alpha
n_orders = max_order - min_order + 1

window_starts = [ws for ws in range(0, total_tokens, stride)
                 if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0]
total_windows = len(window_starts)
my_s = (total_windows * rank) // world_size
my_e = (total_windows * (rank + 1)) // world_size
my_windows = window_starts[my_s:my_e]

val_np = val_tokens.numpy()
ctx_tables = [np.zeros((buckets,), dtype=np.uint32) for _ in range(n_orders)]
full_tables = [np.zeros((buckets,), dtype=np.uint32) for _ in range(n_orders)]
mask = np.uint64(buckets - 1)
primes = np.array(
```
The n-gram hash indexing uses mask = buckets - 1 and ctx_hash & mask, which only works correctly when buckets is a power of two. Add a validation that buckets is a power-of-two (and > 0), or switch to % buckets to support arbitrary sizes.
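The power-of-two check is cheap with a bit trick: a positive integer `n` is a power of two iff `n & (n - 1) == 0`. A minimal sketch of the suggested validation (function name illustrative):

```python
def check_buckets(buckets: int) -> int:
    """Validate that `buckets` is a positive power of two, so that
    `hash & (buckets - 1)` is equivalent to `hash % buckets`.
    Returns the bit mask the hash indexing can use."""
    if buckets <= 0 or (buckets & (buckets - 1)) != 0:
        raise ValueError(f"ngram_eval_buckets must be a positive power of two, got {buckets}")
    return buckets - 1  # e.g. 0x3FFFFF for 4M buckets
```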
… 2-12 + complementary loss

Combines the best of every top submission:
- Two-pass n-gram rescoring (PR openai#869, 0.1290 BPB)
- Frozen oracle + learned gate (PR openai#834, 0.1663 BPB)
- Extended n-gram orders 2-12 (PR openai#853)
- Complementary training loss (novel)
- OAEG + Cubric adaptive alpha
- 4M hash buckets
- TTT + CROWN-Q + int5 GPTQ

Target: sub-0.10 BPB. Awaiting 8xH100 compute for validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Multi-Order N-gram Backoff with entropy-adaptive alpha, validated on 1×H100 SXM proxy run. Based on PR #828 approach with MATRIX_LR=0.03.

Proxy val_bpb = 0.8004 (1×H100, 876 steps, 59% eval coverage) | 15.18 MB artifact
Why WIP
This is a proxy validation run on 1×H100 SXM (876 training steps vs ~7000 on 8×H100). The base model quality is 1.38 BPB (vs expected ~1.15 on 8×H100). We're currently compute-constrained — awaiting 8×H100 SXM availability for official 3-seed verification.
On 8×H100 we expect ~0.90-0.92 BPB, consistent with PR #828's reported 0.9076.
Architecture
N-gram Eval (Legal, Score-First)
α = 0.05 + 0.55 × σ(2(H − 4))

1×H100 Proxy Results
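The entropy-adaptive mixing weight above can be sketched directly (assuming H is the model's predictive entropy and σ the logistic sigmoid; the function name is illustrative):

```python
import math

def adaptive_alpha(entropy: float) -> float:
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4)), per the formula above.
    Low-entropy (confident) positions keep alpha near 0.05; high-entropy
    positions lean harder on the n-gram backoff, capped near 0.60."""
    sigmoid = 1.0 / (1.0 + math.exp(-2.0 * (entropy - 4.0)))
    return 0.05 + 0.55 * sigmoid
```

At H = 4 the weight sits at the midpoint 0.325; the sharpness factor 2 controls how quickly it transitions around that entropy threshold.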
Roadmap
Test plan
🤖 Generated with Claude Code