openai · ShihChunHao · Mar 26, 2026
diff --git a/records/track_10min_16mb/2026-03-25_WSD_CosineDecay_Schedule/README.md b/records/track_10min_16mb/2026-03-25_WSD_CosineDecay_Schedule/README.md
@@ -0,0 +1,46 @@
+# WSD Cosine Decay Schedule
+
+**val_bpb: TBD (8xH100)** — Preliminary 1-GPU result: 1.2824 BPB
+
+## Key Change
+
+Replace the default linear warmdown LR schedule with a **Warmup-Stable-Decay (WSD)** cosine schedule:
+
+| Phase | Fraction | LR behavior |
+|-------|----------|-------------|
+| Warmup | 0-5% of steps | Linear 0 → peak |
+| Stable | 5-80% of steps | Constant at peak LR |
+| Decay | 80-100% of steps | Cosine decay → 0 |
+
+The original schedule computes warmdown based on `warmdown_iters` and remaining wallclock time, which can cause LR to start decaying from very early in training (especially with fewer steps). WSD ensures the model trains at peak LR for the majority of the run.
+
+## Base Techniques (inherited from SOTA)
+
+- 10 layers, 512-dim, MLP 3x expansion
+- SmearGate + BigramHash(10240)
+- Mixed int5 (MLP) / int6 (attention) quantization
+- SWA (start_frac=0.4, every=50 steps)
+- Orthogonal init + Muon optimizer (WD=0.04)
+- zstd-22 compression
+- Sliding window eval (stride=64)
+
+## Preliminary Results (1 GPU, seed=42)
+
+| Config | val_bpb | artifact_bytes |
+|--------|---------|---------------|
+| 1 GPU, 600s, ~877 steps | 1.2824 | 15,767,236 |
+
+8xH100 3-seed results pending.
+
+## Run Command
+
+```bash
+# Single GPU
+python train_gpt.py
+
+# 8xH100 (competition setting)
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+
+# With specific seed
+SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
diff --git a/records/track_10min_16mb/2026-03-25_WSD_CosineDecay_Schedule/submission.json b/records/track_10min_16mb/2026-03-25_WSD_CosineDecay_Schedule/submission.json
@@ -0,0 +1,9 @@
+{
+  "name": "WSD Cosine Decay Schedule + 10L Int5-MLP BigramHash SmearGate SWA",
+  "val_loss": 1.28242,
+  "bytes_total": 15767236,
+  "blurb": "Replace linear warmdown with Warmup-Stable-Decay (WSD) cosine schedule: 5% warmup, 75% stable at peak LR, 20% cosine decay. Built on SOTA base (10L, MLP3x, SmearGate, BigramHash 10240, SWA 0.4, int5/int6 mixed quant, zstd-22). Preliminary 1-GPU result; 8xH100 3-seed results pending.",
+  "author": "ShihChunHao",
+  "github_id": "ShihChunHao",
+  "date": "2026-03-25"
+}