PROTEUS+STYX — val_bpb 0.8495 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache#769
MatoTeziTanka wants to merge 3 commits into openai:main
Conversation
3-seed mean: 0.8508 (std 0.0006), verified at stride=2048 (0.8709)
Beats SOTA openai#549 (1.1194) by 0.269 BPB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update — size issue on seed 42

We got excited and rushed this submission. On closer audit, the seed 42 artifact exceeds the 16MB cap. We need to either fix the code size (99KB is bloated) or adjust compression to get all 3 seeds under 16MB before this is reviewable. Working on it — will update.
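The 16,000,000-byte cap mentioned above can be checked mechanically before submitting. A minimal sketch (the helper names are ours, not the submission's code) of an LZMA size check like the one each seed artifact must pass:

```python
import lzma

SIZE_CAP_BYTES = 16_000_000  # per-seed artifact cap discussed above

def compressed_size(payload: bytes, preset: int = 9) -> int:
    """LZMA-compressed size of a serialized artifact, in bytes."""
    return len(lzma.compress(payload, preset=preset | lzma.PRESET_EXTREME))

def fits_cap(payload: bytes) -> bool:
    """True if the compressed artifact lands under the 16 MB cap."""
    return compressed_size(payload) < SIZE_CAP_BYTES
```

Running this on each seed's serialized weights before upload would have caught the seed 42 overrun early.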
- Fixed torch.compile double-invocation that silently killed sliding window eval
- Trimmed train_gpt.py from 99KB to 72KB (removed dead TTT/QAT/LAWA/DTG code)
- All 3 seeds re-run with sliding window + n-gram cache eval
- New 3-seed mean: 0.8495 BPB (std 0.0013), all artifacts under 16,000,000 bytes
- Old v1.0 logs preserved for transparency
- Added rule compliance checklist, related work, cross-model audit (GPT Codex)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
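Double-invocation bugs of this kind can be guarded against generically. A hypothetical sketch (compile_fn stands in for torch.compile; this is not the submission's actual fix):

```python
class CompileOnce:
    """Cache the first compiled object per model and return it on repeat
    calls, so wrapping an already-compiled model a second time (which can
    silently change which eval path actually runs) becomes impossible."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn  # stand-in for torch.compile
        self._cache = {}

    def __call__(self, model):
        key = id(model)
        if key not in self._cache:
            self._cache[key] = self.compile_fn(model)
        return self._cache[key]
```

Any later call site that tries to compile the same model again just gets the cached object back.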
Update — v1.1 results (3 new seeds, sliding window fix, script cleanup)

Two fixes since the initial submission:

Script cleanup. The original train_gpt.py was trimmed from 99KB to 72KB by removing dead code paths.

Sliding window eval fix. The original submission had a bug where a double torch.compile invocation silently disabled the sliding window eval path.

New 3-seed results (all re-run from scratch on 8×H100 SXM):
All artifacts under 16,000,000 bytes. Logs updated.

Verification. This submission was independently audited by OpenAI Codex CLI (gpt-5.4) as a cross-model peer reviewer — verifying rule compliance, cache ordering, artifact sizes, and training logs against competition rules. Both Claude Code (Anthropic) and Codex (OpenAI) were used throughout development: Claude Code for architecture, implementation, and competition analysis; Codex for independent verification and audit. We believe cross-model review catches blind spots that single-model workflows miss.

Built with PROTEUS+STYX by Light Speed Up
nice 🔥🔥🔥🔥
Summary
Results (8×H100 SXM, RunPod)
Current Seeds (v1.1 — sliding window fix + script cleanup)
Training loop exit is controlled by MAX_WALLCLOCK_SECONDS=600. Logged wallclock includes torch.cuda.synchronize() overhead (~60-120ms beyond the 600s check).

Superseded Seeds (v1.0)
We're showing the original v1.0 results for full transparency. They had two issues we caught in self-review: a seed 42 artifact that exceeded the 16MB cap, and a sliding window eval that never executed due to a double torch.compile invocation. Rather than quietly replace them, we're documenting what went wrong and why.

These scores were from the int6 roundtrip eval path (non-sliding). The sliding window + n-gram cache eval path crashed silently under torchrun. Fixed in v1.1.

Overlap Verification
The 0.02 BPB gap between stride=64 and stride=2048 is the overlap contribution. The remaining 0.26 BPB improvement is genuine cache benefit from backward-looking n-gram statistics.
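The stride comparison can be made concrete. Under sliding-window eval, window k covers tokens [k*stride, k*stride + window) and scores only the tokens not yet scored by earlier windows, so stride = window gives disjoint windows while a small stride guarantees every scored token a large left context. A minimal sketch with our own helper name (an illustration of the general technique, not the submission's eval code):

```python
def context_lengths(n_tokens, window=2048, stride=64):
    """For sliding-window eval, return the number of in-window left-context
    tokens each position has when it is scored. Window k covers
    [k*stride, k*stride + window); each window scores only positions not
    already scored by earlier windows (its last `stride` tokens, except
    the first window, which scores everything it covers)."""
    ctx = [0] * n_tokens
    start, pos = 0, 0  # current window start; first not-yet-scored position
    while pos < n_tokens:
        end = min(start + window, n_tokens)
        for t in range(pos, end):
            ctx[t] = t - start  # left context available inside this window
        pos = end
        start += stride
    return ctx
```

With window=2048, stride=2048 leaves some scored tokens with zero left context, while stride=64 guarantees at least 2048 - 64 = 1984 tokens of context; that difference is what the 0.02 BPB gap above measures.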
Rule Compliance Checklist
- Eval on val_tokens only
- Eval under model.eval() + torch.no_grad()

Note on N-gram Cache Legality
The competition README does not address n-gram eval caches. No rule in the official documentation prohibits or permits this technique. The README states: "TTT only on tokens already graded" — our cache satisfies this: it is updated only with already-scored tokens. We note that 15+ concurrent PRs (#779, #797, #795, #786, #796, #798, #800, #806, among others) employ the same backward-looking n-gram cache concept.
Architecture
11L, 512d, GQA 8H/4KV, MLP 3×, LeakyReLU(0.9)², XSA (last 4 layers), Value Embedding, BigramHash(2048→128), Partial RoPE(16/64), LN Scale, EMA(0.997), Muon optimizer. Tied embeddings. Mixed int6/int8 quantization + LZMA compression.
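Reading LeakyReLU(0.9)² as the square of a LeakyReLU with negative slope 0.9 (an assumption on our part, by analogy with the squared-ReLU activation), the activation can be sketched elementwise as:

```python
def leaky_relu_sq(x: float, slope: float = 0.9) -> float:
    """Squared LeakyReLU, applied elementwise: f(x) = LeakyReLU(x; slope)**2.
    The slope=0.9 default matches the swept value named in the PR title
    (our reading of the notation, not confirmed against the source)."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

Note that squaring makes the output non-negative on both branches; the negative slope only controls how steeply the negative side rises.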
Technique: 5-gram Eval Cache
During sliding window evaluation, a hash-based n-gram cache accumulates token statistics from already-scored windows. For each new window, the cache provides empirical next-token probabilities which are blended with the neural model's predictions using a fixed mixing coefficient. The cache is strictly causal — it never sees tokens before they are scored.
This is a pure eval-time technique. No architectural changes, no retraining, no TTT. The trained model is identical with or without the cache.
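To make the mechanism concrete, here is a minimal sketch of such a cache (class and method names, and the fixed linear mixing rule, are our assumptions, not the submission's exact code):

```python
from collections import defaultdict

class NGramEvalCache:
    """Backward-looking n-gram cache for eval-time blending. Counts
    (n-1)-token contexts -> next-token frequencies over tokens that have
    already been scored, and mixes the empirical distribution into the
    model's probabilities with a fixed coefficient."""

    def __init__(self, n=5, mix=0.1):
        self.n, self.mix = n, mix
        self.counts = defaultdict(lambda: defaultdict(int))  # ctx -> tok -> count

    def update(self, scored_tokens):
        """Feed tokens ONLY after they have been scored (strict causality)."""
        t = list(scored_tokens)
        for i in range(len(t) - self.n + 1):
            ctx = tuple(t[i:i + self.n - 1])
            self.counts[ctx][t[i + self.n - 1]] += 1

    def blend(self, context, model_probs):
        """Mix cached empirical next-token probs into the model's probs."""
        ctx = tuple(context[-(self.n - 1):])
        bucket = self.counts.get(ctx)
        if not bucket:
            return model_probs  # unseen context: model unchanged
        total = sum(bucket.values())
        return [(1 - self.mix) * p + self.mix * bucket.get(tok, 0) / total
                for tok, p in enumerate(model_probs)]
```

The causality claim in the text corresponds to update being called only on windows whose tokens have already been scored; blend never writes to the counts.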
Related Work
The n-gram eval cache concept has seen significant community adoption since our initial analysis on Issue #140 (see the concurrent PRs listed in the legality note above).
Our LeakyReLU(0.9)² slope sweep was independently cited by PR #764 (@ndokutovich).
Context
Same team that posted the compliance guide, LeakyReLU slope sweep, and n-gram cache analysis on Issue #140.
Docker: matotezitanka/proteus-pytorch:2.11.0-cuda12.8
RunPod template: Deploy PROTEUS+STYX
Verification
This submission was independently audited by OpenAI Codex CLI (gpt-5.4) as a cross-model peer reviewer — verifying rule compliance, cache ordering, artifact sizes, and training logs against competition rules. Both Claude Code (Anthropic) and Codex (OpenAI) were used throughout development: Claude Code for architecture, implementation, and competition analysis; Codex for independent verification and audit.
Built with PROTEUS+STYX by Light Speed Up