85 changes: 51 additions & 34 deletions .claude/skills/benchmark/SKILL.md
@@ -12,11 +12,29 @@ description: Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract result
3. **Respect Settings line** for each env (max_frame, num_envs, etc.)
4. **Use `${max_frame}` variable** in specs — never hardcode
5. **Runs must complete in <6h** (dstack max_duration)
6. **Max 10 concurrent dstack runs** — launch in batches of 10 and wait for capacity or completion before launching more. Never submit all runs at once; dstack capacity is limited, and mass submissions cause "no offers" failures.
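The 10-run cap can be sketched as a small shell helper. This is illustrative only: the `MAX_CONCURRENT` constant and helper name are made up, and in real use the running-job count would come from counting `dstack ps` output.

```shell
#!/usr/bin/env bash
# Sketch: how many new runs may be launched under the 10-run cap.
# The running-job count would come from `dstack ps` in real use.
MAX_CONCURRENT=10

remaining_capacity() {
  local running=$1
  local free=$(( MAX_CONCURRENT - running ))
  if (( free > 0 )); then echo "$free"; else echo 0; fi
}

remaining_capacity 7    # 3 slots free: launch up to 3 more
remaining_capacity 12   # 0: wait for completions before launching
```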

-## Run → Score → Record
+## Per-Run Intake Checklist

**Every completed run MUST go through ALL of these steps. No exceptions. Do not skip any step.**

When a run completes (`dstack ps` shows `exited (0)`):

1. **Extract score**: `dstack logs NAME | grep "trial_metrics"` → get `total_reward_ma`
2. **Find HF folder name**: `dstack logs NAME 2>&1 | grep "Uploading data/"` → extract folder name from the upload log line
3. **Update table score** in BENCHMARKS.md
4. **Update table HF link**: `[FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)`
5. **Pull HF data locally**: `source .env && hf download SLM-Lab/benchmark-dev --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"`
6. **Generate plot**: `uv run slm-lab plot -t "EnvName" -f data/benchmark-dev/data/FOLDER1,data/benchmark-dev/data/FOLDER2`
7. **Verify plot exists** in `docs/plots/`
8. **Commit** score + link + plot together

A row in BENCHMARKS.md is NOT complete until it has: score, HF link, and plot.
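Step 1 of the checklist can be sketched as a tiny parser. The sample log line below is hypothetical — adjust the pattern to the actual `trial_metrics` output format.

```shell
#!/usr/bin/env bash
# Sketch: extract total_reward_ma from a trial_metrics log line.
# The sample line is made up — match the pattern to the real log format.
extract_score() {
  grep -o '"total_reward_ma": *[0-9.eE+-]*' | head -1 | awk -F': *' '{print $2}'
}

sample='trial_metrics {"total_reward_ma": 19.8, "frames": 10000000}'
printf '%s\n' "$sample" | extract_score   # prints 19.8
```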

## Launch

```bash
-# Launch
+# Launch a run
source .env && uv run slm-lab run-remote --gpu \
-s env=ALE/Pong-v5 SPEC_FILE SPEC_NAME train -n NAME

@@ -30,67 +48,53 @@ dstack logs NAME | grep "trial_metrics"  # extract score at completion

## Data Lifecycle

Data flows through three stages: **pull → plot → graduate**. Keep local data until graduation is complete.

```
Remote GPU run → auto-uploads to benchmark-dev (HF)
-Pull to local data/
-Generate plots (docs/plots/)
-Update BENCHMARKS.md (scores, links)
-Graduate to public benchmark (HF)
-Update links: benchmark-dev → benchmark
-Upload docs/ to public benchmark (HF)
+  ↓ Pull to local data/
+  ↓ Generate plots (docs/plots/)
+  ↓ Update BENCHMARKS.md (scores, links, plots)
+  ↓ Graduate to public benchmark (HF)
+  ↓ Update links: benchmark-dev → benchmark
+  ↓ Upload docs/ to public benchmark (HF)
```

### Pull Data

```bash
# Pull full dataset (fast, single request — avoids rate limits)
-source .env && uv run hf download SLM-Lab/benchmark-dev \
+source .env && hf download SLM-Lab/benchmark-dev \
--local-dir data/benchmark-dev --repo-type dataset

# Or pull specific folder
source .env && hf download SLM-Lab/benchmark-dev \
--local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"

# KEEP this data — needed for plots AND graduation upload later
# Never rm -rf data/benchmark-dev/ until graduation is complete
```

### Generate Plots

```bash
-# Find folders for a game (need a2c + ppo + sac for comparison)
+# Find folders for a game
ls data/benchmark-dev/data/ | grep -i pong

-# Generate comparison plot
+# Generate comparison plot (include all algorithms available)
uv run slm-lab plot -t "Pong" \
-  -f data/benchmark-dev/data/a2c_folder,data/benchmark-dev/data/ppo_folder,data/benchmark-dev/data/sac_folder
+  -f data/benchmark-dev/data/ppo_folder,data/benchmark-dev/data/sac_folder
```

### Update BENCHMARKS.md

- Add score in results table
- Add HF link: `[FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)`
- Status: ✅ Solved | ⚠️ Close (>80%) | ❌ Failed
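A completed row might look like the following — the column layout and values are illustrative, so follow the actual table structure in BENCHMARKS.md:

```markdown
| Env         | Algo | Score | Status    | Data |
| ----------- | ---- | ----- | --------- | ---- |
| ALE/Pong-v5 | PPO  | 19.8  | ✅ Solved | [FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER) |
```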

### Graduate to Public HF

When benchmarks are finalized, publish from `benchmark-dev` → `benchmark`:

```bash
# Upload data (from local copy — already pulled above)
-source .env && uv run hf upload SLM-Lab/benchmark \
+source .env && hf upload SLM-Lab/benchmark \
data/benchmark-dev/data data --repo-type dataset

# Update BENCHMARKS.md links: benchmark-dev → benchmark
# (find-replace in docs/BENCHMARKS.md)

-# Upload docs (with updated links) and README
-source .env && uv run hf upload SLM-Lab/benchmark docs docs --repo-type dataset
-source .env && uv run hf upload SLM-Lab/benchmark README.md README.md --repo-type dataset
+# Upload docs and README
+source .env && hf upload SLM-Lab/benchmark docs docs --repo-type dataset
+source .env && hf upload SLM-Lab/benchmark README.md README.md --repo-type dataset
```

| Repo | Purpose |
Expand All @@ -108,6 +112,19 @@ source .env && uv run slm-lab run-remote --gpu SPEC_FILE SPEC_NAME search -n NAM

Budget: ~3-4 trials per dimension. After search: update spec with best params, run `train`, use that result.

## Autonomous Execution

Work continuously when benchmarking. Use `sleep 300 && dstack ps` to actively wait (5 min intervals) — never delegate monitoring to background processes or scripts. Stay engaged in the conversation.

**Workflow loop** (repeat every 5-10 minutes):
1. **Check status**: `dstack ps` — identify completed/failed/running
2. **Intake completed runs**: For EACH completed run, do the full intake checklist above (score → HF link → pull → plot → table update)
3. **Launch next batch**: Up to 10 concurrent. Check capacity before launching more
4. **Iterate on failures**: Relaunch or adjust config immediately
5. **Commit progress**: Regular commits of score + link + plot updates

**Key principle**: Work continuously, check in regularly, and iterate immediately on failures. Never idle: check on tasks, update, plan, and pick up the next task until all tasks are completed.
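The status-check step of this loop can be sketched as follows. The `dstack ps` output format shown in the stub is hypothetical; `DRY_RUN=1` stubs the call so the control flow can be read and exercised without a live dstack server, and the real loop would wrap `check_once` in `while true; do …; sleep 300; done`.

```shell
#!/usr/bin/env bash
# Sketch of one iteration of the 5-minute monitoring loop.
DRY_RUN=${DRY_RUN:-1}

list_runs() {
  if [ "$DRY_RUN" = 1 ]; then
    # Hypothetical `dstack ps`-style output for illustration.
    printf '%s\n' 'pong4 exited (0)' 'breakout2 running' 'qbert1 failed'
  else
    dstack ps
  fi
}

check_once() {
  local runs; runs=$(list_runs)
  printf 'completed: %s\n' "$(printf '%s\n' "$runs" | grep -c 'exited (0)')"  # intake these
  printf 'failed: %s\n'    "$(printf '%s\n' "$runs" | grep -c 'failed')"      # relaunch/adjust
  printf 'running: %s\n'   "$(printf '%s\n' "$runs" | grep -c 'running')"     # keep waiting
}

check_once
```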

## Troubleshooting

- **Run interrupted**: Relaunch, increment name suffix (e.g., pong3 → pong4)
10 changes: 6 additions & 4 deletions .dstack/run-gpu-train.yml
@@ -20,12 +20,14 @@ commands:
- cd /workflow && uv run slm-lab run ${SPEC_VARS} ${SPEC_FILE} ${SPEC_NAME} ${LAB_MODE} --upload-hf

resources:
-gpu: 1..
-memory: 48GB..
+gpu:
+  name: [RTX3090]
+  count: 1
+memory: 32GB..

-spot_policy: on-demand
+spot_policy: auto
max_duration: 6h
-max_price: 0.25
+max_price: 0.50
retry:
on_events: [no-capacity]
duration: 1h
22 changes: 22 additions & 0 deletions .githooks/commit-msg
@@ -0,0 +1,22 @@
#!/usr/bin/env bash
# Validate conventional commit format: type: message | type(scope): message

commit_msg_file="$1"
commit_msg=$(head -1 "$commit_msg_file")

# Skip merge commits and fixup/squash
if echo "$commit_msg" | grep -qE '^(Merge |fixup! |squash! )'; then
exit 0
fi

# Validate format
if ! echo "$commit_msg" | grep -qE '^(feat|fix|docs|chore|refactor|test|perf|ci|style|build)(\(.+\))?: .+'; then
echo "ERROR: Invalid commit message format."
echo ""
echo " Expected: type: message"
echo " Example: feat: add user authentication"
echo " Example: fix(parser): handle empty input"
echo ""
echo " Allowed types: feat fix docs chore refactor test perf ci style build"
exit 1
fi
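One way to activate the hook is to point git at the repo-local hooks directory; the validation regex can also be smoke-tested directly (the commit messages below are illustrative):

```shell
#!/usr/bin/env bash
# One-time setup (run from the repo root) to activate repo-local hooks:
#   git config core.hooksPath .githooks
#   chmod +x .githooks/commit-msg

# Smoke-test the same regex the hook applies:
pattern='^(feat|fix|docs|chore|refactor|test|perf|ci|style|build)(\(.+\))?: .+'

echo 'feat: add per-run intake checklist' | grep -qE "$pattern" && echo accepted   # accepted
echo 'fix(parser): handle empty input'    | grep -qE "$pattern" && echo accepted   # accepted
echo 'updated stuff'                      | grep -qE "$pattern" || echo rejected   # rejected
```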
21 changes: 15 additions & 6 deletions CLAUDE.md
@@ -66,7 +66,9 @@ Apply these six principles to every decision.
1. **Give enough context in spawn prompts** - teammates don't inherit conversation history, only CLAUDE.md and project context
2. **Size tasks appropriately** - self-contained units with clear deliverables, ~5-6 per teammate
3. **Avoid file conflicts** - each teammate owns different files
-4. **Use sonnet for volume work** - reserve opus for strategic decisions

+> Work autonomously: run things in parallel, continue without pausing, pick up the next task immediately. For long-running tasks, use `sleep N` to actively wait and check in — do NOT delegate to background processes. Stay engaged in the conversation.


## Documentation

@@ -192,7 +194,8 @@ uv run ruff format
For a small box that only dispatches dstack runs and syncs results (no local ML training):

```bash
-uv sync --no-default-groups
+uv sync --no-default-groups  # skip ML deps (torch, gymnasium, etc.)
+uv tool install dstack
uv run --no-default-groups slm-lab run-remote spec.json spec_name train
uv run --no-default-groups slm-lab pull spec_name
uv run --no-default-groups slm-lab plot -f folder1,folder2
@@ -236,11 +239,17 @@ dstack stop <run-name> -y  # terminate

**`docs/BENCHMARKS.md`** is the single source of truth. See the `/benchmark` skill for operational details (commands, data lifecycle, graduation).

-**Pattern**: Launch → Monitor → Extract score → Pull data → Plot → Update table → Commit
+**Per-run intake (MANDATORY — every completed run must go through ALL steps):**
+1. Extract score (`dstack logs NAME | grep trial_metrics`)
+2. Find HF folder name (query HuggingFace API)
+3. Update table with score AND HF link
+4. Pull HF data locally (`hf download`)
+5. Generate plot (`uv run slm-lab plot`)
+6. Commit score + link + plot together

-**Autonomous execution**: Fill GPU capacity (~30 concurrent runs), check status regularly, extract immediately on completion, iterate on failures. Never idle.
+A table row is NOT complete until it has: **score, HF link, and plot**. See `/benchmark` skill for commands.

-**Data lifecycle**: Pull full HF dataset to `data/benchmark-dev/` once — keep it for plots AND graduation. Never delete until graduation to public repo is complete.
+**Autonomous execution**: Max 10 concurrent runs. Use `sleep 300 && dstack ps` to actively wait. Never delegate monitoring to background scripts. Never idle.

### Hyperparameter Search

@@ -261,6 +270,6 @@ Prefer continuous distributions (`__uniform`, `__loguniform`) over `__choice`. S

## SLM-Lab Documentation

-- **Changelog**: Document major changes in `CHANGELOG.md`
+- **Changelog**: Document major changes in `docs/CHANGELOG.md`
- **Benchmarks**: `docs/BENCHMARKS.md` — results tables, targets, reproducibility
- **Specs**: Document rationale in commit messages when updating specs
5 changes: 3 additions & 2 deletions README.md
@@ -8,7 +8,7 @@
<a href="https://slm-lab.gitbook.io/slm-lab/">Documentation</a> · <a href="https://github.com/kengz/SLM-Lab/blob/master/docs/BENCHMARKS.md">Benchmark Results</a>
</p>

->**NOTE:** v5.0 updates to Gymnasium, `uv` tooling, and modern dependencies with ARM support - see [CHANGELOG.md](CHANGELOG.md).
+>**NOTE:** v5.0 updates to Gymnasium, `uv` tooling, and modern dependencies with ARM support - see [CHANGELOG.md](docs/CHANGELOG.md).
>
>Book readers: `git checkout v4.1.1` for *Foundations of Deep Reinforcement Learning* code.

@@ -111,7 +111,8 @@ Config options in `.dstack/`: `run-gpu-train.yml`, `run-gpu-search.yml`, `run-cp
For a lightweight box that only dispatches dstack runs, syncs results, and generates plots (no local ML training):

```bash
-uv sync --no-default-groups
+uv sync --no-default-groups  # skip ML deps (torch, gymnasium, etc.)
+uv tool install dstack
uv run --no-default-groups slm-lab run-remote spec.json spec_name train
uv run --no-default-groups slm-lab pull spec_name
uv run --no-default-groups slm-lab plot -f folder1,folder2