@@ -0,0 +1,46 @@
# Record: 10L + Multi-Order N-gram Backoff + Matrix LR 0.03

**val_bpb = 0.9076** (seed 42, additional seeds pending) | **15.32 MB** | 8xH100 SXM, 600s

## Results

| Seed | Steps | ms/step | Pre-quant BPB | **N-gram BPB** | Artifact (bytes) |
|------|-------|---------|---------------|----------------|----------|
| 42 | 6,693 | 89.6 | 1.1528 | **0.9076** | 15,320,749 |

## Key Change from PR #802

A single change relative to PR #802: **MATRIX_LR=0.03** (up from 0.02). Systematic screening on an RTX4500 (20 experiments) identified this as the single largest training-hyperparameter improvement for 10L architectures (-0.064 BPB in the RTX4500 screening, -0.005 BPB on the 8xH100 full run).
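In `train_gpt.py` terms this raises only the learning rate of the matrix (2D weight) parameter group; embeddings and scalars keep their own rates (the run log below shows `embed_lr:0.03 matrix_lr:0.03 scalar_lr:0.02`). A minimal sketch of that grouping, with illustrative predicates rather than the repo's actual code:

```python
import os
import torch

# MATRIX_LR comes from the environment, as in the reproduction command below.
MATRIX_LR = float(os.environ.get("MATRIX_LR", "0.02"))

def param_groups(model: torch.nn.Module) -> list[dict]:
    """Split parameters into embed / matrix / scalar groups with separate LRs."""
    embed, matrix, scalar = [], [], []
    for name, p in model.named_parameters():
        if "embed" in name:          # tied embedding / unembedding weights
            embed.append(p)
        elif p.ndim >= 2:            # attention and MLP weight matrices
            matrix.append(p)
        else:                        # gains, gates, scalar scales
            scalar.append(p)
    return [
        {"params": embed,  "lr": 0.03},
        {"params": matrix, "lr": MATRIX_LR},  # 0.02 -> 0.03 is this record's change
        {"params": scalar, "lr": 0.02},
    ]
```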

## Architecture (same as PR #802)

- 10L, 512d, GQA 8H/4KV, MLP 3x LeakyReLU(0.5)²
- BigramHash(4096, dim=128), SmearGate, Value Residual, Gated Attention
- XSA last 4 layers, Partial RoPE 16/64, LN Scale
- U-Net skip connections, tied embeddings, logit softcap=30 (MLP activation and softcap sketched below)
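A minimal sketch of the LeakyReLU(0.5)² MLP and the logit softcap listed above, assuming the usual readings (3x hidden expansion, activation squared, tanh-based capping); layer names are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """3x-expansion MLP with a squared LeakyReLU(0.5) activation."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fc_in = nn.Linear(dim, 3 * dim, bias=False)
        self.fc_out = nn.Linear(3 * dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.leaky_relu(self.fc_in(x), negative_slope=0.5)
        return self.fc_out(h * h)  # LeakyReLU(0.5) squared

def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    """Smoothly bound logits to (-cap, cap): cap * tanh(logits / cap)."""
    return cap * torch.tanh(logits / cap)
```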

## Training

- Muon optimizer: **lr=0.03** (was 0.02), momentum 0.92→0.99, WD=0.04
- EMA(0.997), warmdown=3500 steps (momentum ramp and EMA sketched below)
- Mixed int5-MLP/int6-attn quantization + zstd-22
- 3% magnitude pruning
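A rough sketch of the momentum ramp and weight EMA above, assuming a linear 0.92→0.99 schedule and a shadow copy applied after training (cf. `ema:applying shadow model` in the log); the schedule shape and class names are assumptions:

```python
import copy
import torch

def muon_momentum(step: int, total_steps: int,
                  start: float = 0.92, end: float = 0.99) -> float:
    """Assumed linear momentum ramp from 0.92 to 0.99 over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

class EMA:
    """Exponential moving average of weights, decay 0.997; applied at the end."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.997):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)  # s <- 0.997*s + 0.003*p
```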

## Eval: Multi-Order N-gram Backoff (from PR #802)

- Score-first backward-looking n-gram cache (orders 2-7)
- Highest matching order wins (backoff from 7-gram to bigram)
- Entropy-adaptive alpha: `alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))`
- 4M XOR-hash buckets, min_count=2
- **Legal:** each token is scored BEFORE the cache is updated (see the sketch below)
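A simplified sketch of the scoring loop described above. The hash function, the exact blending of n-gram and model distributions, and the cache layout are assumptions for illustration; only the backoff order, the alpha formula, and the score-before-update rule are taken from the description:

```python
import math
from collections import defaultdict

ORDERS = range(7, 1, -1)       # try the 7-gram first, back off toward the bigram
MIN_COUNT = 2                  # ignore contexts seen fewer than 2 times
NUM_BUCKETS = 4 * 1024 * 1024  # 4M hash buckets

# counts[n][bucket] -> {next_token: count}; a toy stand-in for the real cache
counts = {n: defaultdict(lambda: defaultdict(int)) for n in ORDERS}

def bucket(context: tuple) -> int:
    """Illustrative XOR/multiply hash of the context tokens."""
    h = 0
    for t in context:
        h = ((h * 0x9E3779B1) ^ t) & 0xFFFFFFFF
    return h % NUM_BUCKETS

def adaptive_alpha(model_probs) -> float:
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)), H = model entropy in bits."""
    H = -sum(p * math.log2(p) for p in model_probs if p > 0)
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))

def score_then_update(history: list, token: int, model_probs: list) -> float:
    """Score one token in bits, then update the cache (score-first rule)."""
    probs = list(model_probs)
    for n in ORDERS:                        # highest matching order wins
        ctx = tuple(history[-(n - 1):])
        if len(ctx) < n - 1:
            continue
        dist = counts[n][bucket(ctx)]
        total = sum(dist.values())
        if total >= MIN_COUNT:
            a = adaptive_alpha(model_probs)  # n-gram weight grows with entropy
            probs = [(1 - a) * p + a * dist.get(t, 0) / total
                     for t, p in enumerate(probs)]
            break
    bits = -math.log2(max(probs[token], 1e-12))
    # Legal: the cache is updated only AFTER the token has been scored.
    for n in ORDERS:
        ctx = tuple(history[-(n - 1):])
        if len(ctx) == n - 1:
            counts[n][bucket(ctx)][token] += 1
    return bits
```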

## Reproduction

```bash
MATRIX_LR=0.03 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Based On

- PR #802 (@[author]): 10L + Multi-Order N-gram Backoff (0.9123 BPB)
- Our systematic hyperparameter screening (steps 10-12, 74 experiments)
@@ -0,0 +1,13 @@
{
"track": "10min_16mb",
"date": "2026-03-26",
"name": "Record: 10L + Multi-Order N-gram Backoff + Matrix LR 0.03 (val_bpb=0.9076)",
"author": "bigbag",
"github": "bigbag",
"seed_results": {
"42": {"val_loss": 1.53248692, "val_bpb": 0.90762747, "artifact_bytes": 15320749}
},
"mean_val_loss": 1.53248692,
"mean_val_bpb": 0.90762747,
"code_bytes": 68444
}
@@ -0,0 +1,119 @@
W0326 07:28:29.013000 143 torch/distributed/run.py:803]
W0326 07:28:29.013000 143 torch/distributed/run.py:803] *****************************************
W0326 07:28:29.013000 143 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0326 07:28:29.013000 143 torch/distributed/run.py:803] *****************************************
logs/400c6056-3073-4716-ab99-5e1a14a95d10.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:24730705
world_size:8 grad_accum_steps:1
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.03 matrix_lr:0.03 scalar_lr:0.02
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:5 max_wallclock_seconds:600.000
seed:42
warmup_step:1/5
warmup_step:2/5
warmup_step:3/5
warmup_step:4/5
warmup_step:5/5
step:1/20000 train_loss:6.9311 train_time:147ms step_avg:147.24ms
step:2/20000 train_loss:7.9786 train_time:232ms step_avg:116.01ms
step:3/20000 train_loss:7.3922 train_time:319ms step_avg:106.37ms
step:4/20000 train_loss:7.0074 train_time:407ms step_avg:101.75ms
step:5/20000 train_loss:6.8978 train_time:495ms step_avg:98.91ms
step:6/20000 train_loss:6.8527 train_time:583ms step_avg:97.09ms
step:7/20000 train_loss:6.7423 train_time:670ms step_avg:95.76ms
step:8/20000 train_loss:6.7231 train_time:758ms step_avg:94.79ms
step:9/20000 train_loss:6.3524 train_time:847ms step_avg:94.09ms
step:10/20000 train_loss:6.0305 train_time:945ms step_avg:94.52ms
step:100/20000 train_loss:3.1302 train_time:8864ms step_avg:88.64ms
step:200/20000 train_loss:2.3178 train_time:17762ms step_avg:88.81ms
step:300/20000 train_loss:2.5089 train_time:26701ms step_avg:89.00ms
step:400/20000 train_loss:2.3906 train_time:35653ms step_avg:89.13ms
step:500/20000 train_loss:2.3830 train_time:44548ms step_avg:89.10ms
step:600/20000 train_loss:2.3275 train_time:53505ms step_avg:89.18ms
step:700/20000 train_loss:2.3416 train_time:62473ms step_avg:89.25ms
step:800/20000 train_loss:2.2354 train_time:71439ms step_avg:89.30ms
step:900/20000 train_loss:2.1295 train_time:80397ms step_avg:89.33ms
step:1000/20000 train_loss:2.2768 train_time:89292ms step_avg:89.29ms
step:1100/20000 train_loss:2.3216 train_time:98263ms step_avg:89.33ms
step:1200/20000 train_loss:2.3596 train_time:107250ms step_avg:89.38ms
step:1300/20000 train_loss:2.1052 train_time:116228ms step_avg:89.41ms
step:1400/20000 train_loss:2.1903 train_time:125214ms step_avg:89.44ms
step:1500/20000 train_loss:2.2287 train_time:134126ms step_avg:89.42ms
step:1600/20000 train_loss:2.0827 train_time:143110ms step_avg:89.44ms
step:1700/20000 train_loss:2.1530 train_time:152101ms step_avg:89.47ms
step:1800/20000 train_loss:2.1613 train_time:161091ms step_avg:89.49ms
step:1900/20000 train_loss:2.1324 train_time:170015ms step_avg:89.48ms
step:2000/20000 train_loss:2.0775 train_time:179010ms step_avg:89.51ms
step:2100/20000 train_loss:2.0585 train_time:188003ms step_avg:89.53ms
step:2200/20000 train_loss:2.1790 train_time:196990ms step_avg:89.54ms
step:2300/20000 train_loss:2.1175 train_time:205986ms step_avg:89.56ms
step:2400/20000 train_loss:2.0752 train_time:214900ms step_avg:89.54ms
step:2500/20000 train_loss:2.1815 train_time:223884ms step_avg:89.55ms
step:2600/20000 train_loss:2.1219 train_time:232872ms step_avg:89.57ms
step:2700/20000 train_loss:2.1168 train_time:241849ms step_avg:89.57ms
step:2800/20000 train_loss:2.1687 train_time:250835ms step_avg:89.58ms
step:2900/20000 train_loss:2.0388 train_time:259753ms step_avg:89.57ms
step:3000/20000 train_loss:2.1761 train_time:268744ms step_avg:89.58ms
step:3100/20000 train_loss:2.0509 train_time:277729ms step_avg:89.59ms
step:3200/20000 train_loss:2.1889 train_time:286723ms step_avg:89.60ms
step:3300/20000 train_loss:2.0859 train_time:295643ms step_avg:89.59ms
step:3400/20000 train_loss:2.0372 train_time:304639ms step_avg:89.60ms
step:3500/20000 train_loss:2.1929 train_time:313626ms step_avg:89.61ms
step:3600/20000 train_loss:2.1083 train_time:322610ms step_avg:89.61ms
step:3700/20000 train_loss:2.1014 train_time:331604ms step_avg:89.62ms
step:3800/20000 train_loss:2.0793 train_time:340517ms step_avg:89.61ms
step:3900/20000 train_loss:2.0826 train_time:349506ms step_avg:89.62ms
step:4000/20000 train_loss:1.9847 train_time:358487ms step_avg:89.62ms
step:4100/20000 train_loss:2.0224 train_time:367477ms step_avg:89.63ms
step:4200/20000 train_loss:2.1579 train_time:376471ms step_avg:89.64ms
step:4300/20000 train_loss:2.0687 train_time:385386ms step_avg:89.62ms
step:4400/20000 train_loss:2.0379 train_time:394370ms step_avg:89.63ms
step:4500/20000 train_loss:2.1302 train_time:403353ms step_avg:89.63ms
step:4600/20000 train_loss:1.8487 train_time:412338ms step_avg:89.64ms
step:4700/20000 train_loss:2.2395 train_time:421258ms step_avg:89.63ms
step:4800/20000 train_loss:2.4318 train_time:430246ms step_avg:89.63ms
step:4900/20000 train_loss:2.0563 train_time:439232ms step_avg:89.64ms
step:5000/20000 train_loss:2.1108 train_time:448222ms step_avg:89.64ms
step:5100/20000 train_loss:2.1305 train_time:457207ms step_avg:89.65ms
step:5200/20000 train_loss:2.0503 train_time:466122ms step_avg:89.64ms
step:5300/20000 train_loss:2.0182 train_time:475110ms step_avg:89.64ms
step:5400/20000 train_loss:2.0540 train_time:484085ms step_avg:89.65ms
step:5500/20000 train_loss:2.0202 train_time:493054ms step_avg:89.65ms
step:5600/20000 train_loss:1.9588 train_time:502041ms step_avg:89.65ms
step:5700/20000 train_loss:2.0162 train_time:510954ms step_avg:89.64ms
step:5800/20000 train_loss:2.0034 train_time:519939ms step_avg:89.64ms
step:5900/20000 train_loss:1.9065 train_time:528922ms step_avg:89.65ms
step:6000/20000 train_loss:1.9426 train_time:537914ms step_avg:89.65ms
step:6100/20000 train_loss:1.9187 train_time:546829ms step_avg:89.64ms
step:6200/20000 train_loss:1.9524 train_time:555811ms step_avg:89.65ms
step:6300/20000 train_loss:1.9471 train_time:564800ms step_avg:89.65ms
step:6400/20000 train_loss:1.9972 train_time:573786ms step_avg:89.65ms
step:6500/20000 train_loss:2.0796 train_time:582758ms step_avg:89.66ms
step:6600/20000 train_loss:1.8403 train_time:591663ms step_avg:89.65ms
step:6693/20000 val_loss:1.9464 val_bpb:1.1528 train_time:600024ms step_avg:89.65ms
stopping_early: wallclock_cap train_time:600024ms step:6693/20000
peak memory allocated: 20944 MiB reserved: 21086 MiB
ema:applying shadow model
Serialized model: 96864555 bytes
Code size: 68444 bytes
Total submission size: 96932999 bytes
Serialized model int6+zstd: 15252305 bytes
Total submission size: 15320749 bytes (15.32 MB)
SIZE CHECK PASSED: 15.32 MB < 16.00 MB
final_eval_mode:sliding_ngram orders=2-7 alpha=0.4 entropy=True stride:64
ngram_cache:enabled orders=2-7 backoff entropy=True alpha=0.4 ent_base=0.05 ent_range=0.55 min_count=2 buckets=4194304
ngram_eval [ 10.6%] bpb=1.110706 t=23s
ngram_eval [ 21.2%] bpb=1.092910 t=35s
ngram_eval [ 31.8%] bpb=1.068654 t=47s
ngram_eval [ 42.3%] bpb=1.039567 t=59s
ngram_eval [ 52.9%] bpb=1.011695 t=71s
ngram_eval [ 63.5%] bpb=0.985400 t=84s
ngram_eval [ 74.0%] bpb=0.963914 t=96s
ngram_eval [ 84.6%] bpb=0.942984 t=108s
ngram_eval [ 95.2%] bpb=0.922574 t=120s
ngram_eval DONE: bpb=0.907627 tokens=62023616 t=138s
final_int8_zlib_roundtrip val_loss:1.5325 val_bpb:0.9076 eval_time:138129ms
final_int8_zlib_roundtrip_exact val_loss:1.53248692 val_bpb:0.90762747