openai · Bortlesboat · Mar 21, 2026 · Mar 25, 2026 · Mar 26, 2026 · Mar 26, 2026
diff --git a/...ds/track_10min_16mb/2026-03-20_Int5MLP_BigramHash_SWA_NeuralCache_TTT/README.md b/...ds/track_10min_16mb/2026-03-20_Int5MLP_BigramHash_SWA_NeuralCache_TTT/README.md
@@ -0,0 +1,47 @@
+# QAT + Neural Cache + LoRA TTT (Non-Record Submission)
+
+**val_bpb: 1.4245** (sliding window, post int5/int6+zstd quantization roundtrip, 1 seed)
+
+This is a non-record submission exploring three eval-time techniques stacked on the current #1 training recipe. The QAT implementation has a bug (quantization penalty is ~0.25 BPB instead of expected ~0.02), making this run non-competitive. Submitting for transparency and to document the approach for iteration.
+
+## Approach
+
+Built on PR by @thwu1 (Int5-MLP + BigramHash + SWA), adding:
+
+### 1. Quantization-Aware Training (QAT)
+STE fake-quantization during training: int5 (clip=15) for MLP layers, int6 (clip=31) for attention. The model learns to be robust to quantization noise. **Bug found:** The STE uses symmetric clipping while the export uses percentile-based per-row scaling — this mismatch caused the model to optimize for the wrong quantization target, resulting in a 0.25 BPB penalty instead of the expected ~0.02.
+
+### 2. Neural Cache
+During sliding window eval, maintain a ring buffer of pre-lm_head hidden states (dim=512, bf16). For each token, compute cosine similarity against cached states, build a cache distribution via softmax-weighted scatter, and interpolate with model predictions using logaddexp. Causal token-by-token scoring with document boundary resets prevents information leakage.
+
+### 3. LoRA Test-Time Training
+Per-document rank-8 LoRA adaptation on lm_head, Q, and V projections during evaluation. Documents batched (batch_size=64), chunks scored before training (no leakage), with entropy-gated updates.
+
+## Architecture
+- 10 layers, 512 dim, 8 heads, 4 KV heads (GQA), 3x MLP (1536 hidden)
+- BigramHash(10240, dim=128), SmearGate, orthogonal init
+- Muon optimizer: matrix_lr=0.02, WD=0.04, momentum=0.99
+- SWA: last 40% of warmdown, every 50 steps, 24 checkpoints averaged
+- seq_len=2048, batch=786K tokens
+
+## Results
+| Seed | Pre-quant val_bpb | Post-quant sliding val_bpb | Steps | Artifact |
+|------|-------------------|---------------------------|-------|----------|
+| 1337 | 1.1739 | 1.4245 | 5109 | 15.77 MB |
+
+## Known Issues
+1. **QAT mismatch:** STE clip ranges don't match export quantization format — needs per-row percentile clipping in the STE to match `quantize_intN_per_row`
+2. **Pre-quant BPB already worse than SOTA:** 1.1739 vs 1.1428 — QAT may be hurting convergence with current hyperparameters
+3. Only 1 seed (need 3+ for statistical significance)
+
+## Next Steps
+- Run without QAT to verify base recipe reproduces 1.1428
+- Fix QAT to match exact export quantization format
+- Run neural cache + TTT eval on a working checkpoint
+- Sweep cache hyperparameters (theta, lambda)
+
+## Command
+```bash
+RUN_ID=run1_seed1337 SEED=1337 QAT_ENABLED=1 EVAL_STRIDE=64 EVAL_STRATEGY=combined \
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
diff --git a/records/track_10min_16mb/2026-03-20_Int5MLP_BigramHash_SWA_NeuralCache_TTT/submission.json b/records/track_10min_16mb/2026-03-20_Int5MLP_BigramHash_SWA_NeuralCache_TTT/submission.json
@@ -0,0 +1,9 @@
+{
+  "name": "QAT + Neural Cache + LoRA TTT (non-record)",
+  "val_bpb": 1.4245,
+  "bytes_total": 15766801,
+  "blurb": "Non-record submission exploring QAT + neural cache + LoRA TTT on top of #1 recipe. QAT implementation has export mismatch bug causing 0.25 BPB quantization penalty. Submitting to document approach and iterate.",
+  "author": "Bortlesboat",
+  "github_id": "Bortlesboat",
+  "date": "2026-03-20"
+}