openai · quietsmile · Mar 26, 2026
diff --git a/...ds/track_10min_16mb/2026-03-26_ComplementaryBackoff_NgramMixer_0.4377/README.md b/...ds/track_10min_16mb/2026-03-26_ComplementaryBackoff_NgramMixer_0.4377/README.md
@@ -0,0 +1,47 @@
+# Complementary Training + Backoff N-gram Mixer (Reproduction)
+
+**val_bpb: 0.4377** (best seed; 2-seed mean 0.4379, std 0.0002) | 8x NVIDIA L20Z (H100-equivalent) | eval 450s
+
+## Results
+
+| Seed | Steps | val_bpb | eval_time |
+|------|-------|---------|-----------|
+| 1337 | 7,003 | **0.4377** | 450s |
+| 42 | 7,011 | **0.4380** | 450s |
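+
+The summary statistics can be re-derived from the full-precision per-seed values recorded in `submission.json`. The sketch below assumes the reported std is a population standard deviation (dividing by n, not n - 1), which is the reading that matches the `val_bpb_std` of 0.00016 in that file.
+
+```python
+import statistics
+
+# Per-seed val_bpb values as recorded in submission.json.
+seed_bpb = {1337: 0.43773153, 42: 0.43804165}
+
+values = list(seed_bpb.values())
+mean_bpb = statistics.mean(values)
+pop_std = statistics.pstdev(values)    # population std: divides by n
+sample_std = statistics.stdev(values)  # sample std: divides by n - 1
+
+print(f"mean={mean_bpb:.4f} pstdev={pop_std:.5f} stdev={sample_std:.5f}")
+```
+
+With only two seeds the two conventions differ by a factor of sqrt(2), so it is worth stating which one a reported std uses.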
+
+## Approach
+
+Reproduction of PR #803 (by pentxayc) on 8x NVIDIA L20Z GPUs, with the evaluation stride set to 128 (`EVAL_STRIDE=128`) to reduce the number of eval windows.
+
+### Key Techniques
+
+1. **Complementary Training** (COMPLEMENT_ALPHA=0.5): Downweights training loss on tokens that bigram statistics can predict, forcing the neural model to specialize on hard tokens (long-range dependencies, semantic surprises).
+
+2. **BackoffNgramMixer**: Orders 2-10, 4M (4,194,304) flat hash buckets. At eval time the n-gram weight is entropy-adaptive: `alpha = 0.20 + 0.55 * sigmoid(2 * (entropy - 3.0))`, so alpha ranges from 0.20 to 0.75 and high-entropy tokens get more n-gram weight.
+
+3. **Legal Score-First TTT**: AdamW (lr=5e-4), 4 epochs per chunk, first 2 blocks frozen, Polyak EMA 0.998. Every token is scored before any update trains on it.
+
+4. **Stride=128**: Reduces eval windows from ~30K to ~950, with negligible BPB impact vs stride=32.
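+
+As a rough illustration of the complementary-training idea in item 1, the sketch below downweights each target token by its bigram probability. The weighting rule `1 - alpha * p_bigram`, the normalization by the weight sum, and all function names are assumptions made for illustration; the README does not state the exact formula used in `train_gpt.py`.
+
+```python
+from collections import Counter
+
+COMPLEMENT_ALPHA = 0.5  # value taken from this README
+
+def bigram_probs(tokens):
+    """Estimate p(next | prev) from raw bigram counts (no smoothing)."""
+    pair_counts = Counter(zip(tokens, tokens[1:]))
+    prev_counts = Counter(tokens[:-1])
+    return {pair: c / prev_counts[pair[0]] for pair, c in pair_counts.items()}
+
+def complementary_weights(tokens, probs, alpha=COMPLEMENT_ALPHA):
+    # ASSUMPTION: weight = 1 - alpha * p_bigram(target | previous).
+    # Tokens the bigram model predicts well get a weight close to 1 - alpha.
+    return [1.0 - alpha * probs.get((prev, tgt), 0.0)
+            for prev, tgt in zip(tokens, tokens[1:])]
+
+def weighted_loss(per_token_nll, weights):
+    # ASSUMPTION: normalize by the weight sum so the loss scale stays comparable.
+    return sum(w * l for w, l in zip(weights, per_token_nll)) / sum(weights)
+
+tokens = [1, 2, 1, 2, 1, 3]
+weights = complementary_weights(tokens, bigram_probs(tokens))
+```
+
+In this toy sequence the transition 2 -> 1 is fully predictable, so it receives the minimum weight of 0.5, while the rarer 1 -> 3 transition keeps most of its weight.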
+
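+The entropy-adaptive mixing rule from item 2 can be sketched as follows. The alpha formula and its constants come from this README; measuring entropy on the neural model's distribution in nats, and mixing by linear interpolation in probability space, are assumptions, and the function names are hypothetical.
+
+```python
+import math
+
+ALPHA_BASE, ALPHA_RANGE, ALPHA_CENTER = 0.20, 0.55, 3.0  # from this README
+ALPHA_SLOPE = 2.0  # the factor 2 inside the sigmoid
+
+def sigmoid(x):
+    return 1.0 / (1.0 + math.exp(-x))
+
+def entropy(probs):
+    # ASSUMPTION: Shannon entropy in nats of the neural distribution.
+    return -sum(p * math.log(p) for p in probs if p > 0.0)
+
+def ngram_alpha(h):
+    """N-gram mixing weight as a function of entropy h."""
+    return ALPHA_BASE + ALPHA_RANGE * sigmoid(ALPHA_SLOPE * (h - ALPHA_CENTER))
+
+def mix(p_neural, p_ngram):
+    # ASSUMPTION: linear interpolation between the two distributions.
+    a = ngram_alpha(entropy(p_neural))
+    return [(1.0 - a) * pn + a * pg for pn, pg in zip(p_neural, p_ngram)]
+```
+
+At `h = 3.0` the sigmoid is 0.5, so alpha is 0.475; for very confident predictions alpha approaches 0.20, and for very uncertain ones it approaches 0.75.
+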
+### Architecture
+
+- 11 layers, 512 dim, 8 heads, 4 KV heads, 3x MLP with LeakyReLU(0.5)^2
+- XSA on last 4 layers, VRL enabled
+- Int6 mixed quantization + lzma compression
+- Artifact: ~15.9MB (under 16MB limit)
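+
+The shapes implied by the list above can be worked out directly. This sketch assumes "3x MLP" means a hidden width of 3 x dim and that "LeakyReLU(0.5)^2" means a leaky ReLU with negative slope 0.5 followed by squaring; both readings are assumptions, not confirmed against `train_gpt.py`.
+
+```python
+DIM, N_HEADS, N_KV_HEADS, MLP_MULT = 512, 8, 4, 3  # from this README
+
+head_dim = DIM // N_HEADS               # width of each attention head
+queries_per_kv = N_HEADS // N_KV_HEADS  # query heads sharing one KV head
+mlp_hidden = MLP_MULT * DIM             # ASSUMPTION: "3x MLP" = 3 * dim
+
+def leaky_relu_squared(x, negative_slope=0.5):
+    # ASSUMPTION: apply leaky ReLU (slope 0.5 for x < 0), then square.
+    y = x if x >= 0.0 else negative_slope * x
+    return y * y
+
+print(head_dim, queries_per_kv, mlp_hidden)
+```
+
+Under this reading the activation is non-negative everywhere, and negative inputs contribute a quarter of what a positive input of the same magnitude would.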
+
+## Reproduction
+
+```bash
+VRL_ENABLED=1 LEAKY_RELU=1 GATED_ATTENTION=0 \
+TTT_ENABLED=1 TTT_OPTIMIZER=adamw TTT_LR=0.0005 TTT_EPOCHS=4 \
+TTT_FREEZE_BLOCKS=2 TTT_TEMPERATURE=0.98 \
+USE_HEDGE_MIXER=1 NGRAM_ORDER=10 NGRAM_BUCKETS=4194304 \
+ALPHA_BASE=0.20 ALPHA_RANGE=0.55 ALPHA_CENTER=3.0 \
+COMPLEMENT_ALPHA=0.5 EVAL_STRIDE=128 \
+SEED=1337 torchrun --nproc_per_node=8 train_gpt.py
+```
+
+## Acknowledgment
+
+Based on PR #803 by pentxayc. The core innovation, complementary training (reweighting the training loss using bigram statistics), is their contribution.
diff --git a/records/track_10min_16mb/2026-03-26_ComplementaryBackoff_NgramMixer_0.4377/submission.json b/records/track_10min_16mb/2026-03-26_ComplementaryBackoff_NgramMixer_0.4377/submission.json
@@ -0,0 +1,23 @@
+{
+  "author": "quietsmile",
+  "github_id": "quietsmile",
+  "name": "Complementary Training + Backoff N-gram Mixer (Reproduction)",
+  "blurb": "Reproduction of PR #803's complementary training approach on 8xL20Z (H100). Bigram-weighted loss reweighting (COMPLEMENT_ALPHA=0.5) trains the neural model to specialize on tokens n-gram caches can't predict. BackoffNgramMixer (orders 2-10, 4M hash buckets) with entropy-adaptive alpha (0.20+0.55*sigmoid(2*(H-3.0))). Legal score-first AdamW TTT (4 epochs, lr=5e-4, freeze first 2 blocks). Int6 mixed quantization + lzma. Evaluated with stride=128.",
+  "date": "2026-03-26T12:00:00Z",
+  "val_loss": 0.73909041,
+  "val_bpb": 0.4377,
+  "val_loss_std": 0.00026,
+  "val_bpb_std": 0.00016,
+  "seeds": [1337, 42],
+  "seed_results": {
+    "1337": {"val_loss": 0.73909041, "val_bpb": 0.43773153},
+    "42": {"val_loss": 0.73961404, "val_bpb": 0.43804165}
+  },
+  "step_stop": 7003,
+  "wallclock_seconds": 600.074,
+  "eval_time_seconds": 449.857,
+  "bytes_total": 15875857,
+  "bytes_code": 87336,
+  "base_pr": 803,
+  "hardware": "8x NVIDIA L20Z (H100 equivalent, 81GB)"
+}