85 changes: 51 additions & 34 deletions .claude/skills/benchmark/SKILL.md
@@ -12,11 +12,29 @@ description: Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract result
3. **Respect Settings line** for each env (max_frame, num_envs, etc.)
4. **Use `${max_frame}` variable** in specs — never hardcode
5. **Runs must complete in <6h** (dstack max_duration)
6. **Max 10 concurrent dstack runs** — launch in batches of 10 and wait for capacity or completion before launching more. Never submit all runs at once; dstack capacity is limited, and mass submissions cause "no offers" failures.
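The 10-run cap can be sketched as a small shell helper. This is illustrative only: the `MAX_CONCURRENT` constant and helper name are made up, and in real use the running-job count would come from counting `dstack ps` output.

```shell
#!/usr/bin/env bash
# Sketch: how many new runs may be launched under the 10-run cap.
# The running-job count would come from `dstack ps` in real use.
MAX_CONCURRENT=10

remaining_capacity() {
  local running=$1
  local free=$(( MAX_CONCURRENT - running ))
  if (( free > 0 )); then echo "$free"; else echo 0; fi
}

remaining_capacity 7    # 3 slots free: launch up to 3 more
remaining_capacity 12   # 0: wait for completions before launching
```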

-## Run → Score → Record
+## Per-Run Intake Checklist

**Every completed run MUST go through ALL of these steps. No exceptions. Do not skip any step.**

When a run completes (`dstack ps` shows `exited (0)`):

1. **Extract score**: `dstack logs NAME | grep "trial_metrics"` → get `total_reward_ma`
2. **Find HF folder name**: `dstack logs NAME 2>&1 | grep "Uploading data/"` → extract folder name from the upload log line
3. **Update table score** in BENCHMARKS.md
4. **Update table HF link**: `[FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)`
5. **Pull HF data locally**: `source .env && hf download SLM-Lab/benchmark-dev --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"`
6. **Generate plot**: `uv run slm-lab plot -t "EnvName" -f data/benchmark-dev/data/FOLDER1,data/benchmark-dev/data/FOLDER2`
7. **Verify plot exists** in `docs/plots/`
8. **Commit** score + link + plot together

A row in BENCHMARKS.md is NOT complete until it has: score, HF link, and plot.
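Step 1 of the checklist can be sketched as a tiny parser. The sample log line below is hypothetical — adjust the pattern to the actual `trial_metrics` output format.

```shell
#!/usr/bin/env bash
# Sketch: extract total_reward_ma from a trial_metrics log line.
# The sample line is made up — match the pattern to the real log format.
extract_score() {
  grep -o '"total_reward_ma": *[0-9.eE+-]*' | head -1 | awk -F': *' '{print $2}'
}

sample='trial_metrics {"total_reward_ma": 19.8, "frames": 10000000}'
printf '%s\n' "$sample" | extract_score   # prints 19.8
```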

## Launch

```bash
-# Launch
+# Launch a run
source .env && uv run slm-lab run-remote --gpu \
-s env=ALE/Pong-v5 SPEC_FILE SPEC_NAME train -n NAME

@@ -30,67 +48,53 @@ dstack logs NAME | grep "trial_metrics"  # extract score at completion

## Data Lifecycle

Data flows through three stages: **pull → plot → graduate**. Keep local data until graduation is complete.

```
Remote GPU run → auto-uploads to benchmark-dev (HF)
-Pull to local data/
-Generate plots (docs/plots/)
-Update BENCHMARKS.md (scores, links)
-Graduate to public benchmark (HF)
-Update links: benchmark-dev → benchmark
-Upload docs/ to public benchmark (HF)
+  ↓ Pull to local data/
+  ↓ Generate plots (docs/plots/)
+  ↓ Update BENCHMARKS.md (scores, links, plots)
+  ↓ Graduate to public benchmark (HF)
+  ↓ Update links: benchmark-dev → benchmark
+  ↓ Upload docs/ to public benchmark (HF)
```

### Pull Data

```bash
# Pull full dataset (fast, single request — avoids rate limits)
-source .env && uv run hf download SLM-Lab/benchmark-dev \
+source .env && hf download SLM-Lab/benchmark-dev \
--local-dir data/benchmark-dev --repo-type dataset

# Or pull specific folder
source .env && hf download SLM-Lab/benchmark-dev \
--local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"

# KEEP this data — needed for plots AND graduation upload later
# Never rm -rf data/benchmark-dev/ until graduation is complete
```

### Generate Plots

```bash
-# Find folders for a game (need a2c + ppo + sac for comparison)
+# Find folders for a game
ls data/benchmark-dev/data/ | grep -i pong

-# Generate comparison plot
+# Generate comparison plot (include all algorithms available)
uv run slm-lab plot -t "Pong" \
-  -f data/benchmark-dev/data/a2c_folder,data/benchmark-dev/data/ppo_folder,data/benchmark-dev/data/sac_folder
+  -f data/benchmark-dev/data/ppo_folder,data/benchmark-dev/data/sac_folder
```

### Update BENCHMARKS.md

- Add score in results table
- Add HF link: `[FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)`
- Status: ✅ Solved | ⚠️ Close (>80%) | ❌ Failed
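A completed row might look like the following — the column layout and values are illustrative, so follow the actual table structure in BENCHMARKS.md:

```markdown
| Env         | Algo | Score | Status    | Data |
| ----------- | ---- | ----- | --------- | ---- |
| ALE/Pong-v5 | PPO  | 19.8  | ✅ Solved | [FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER) |
```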

### Graduate to Public HF

When benchmarks are finalized, publish from `benchmark-dev` → `benchmark`:

```bash
# Upload data (from local copy — already pulled above)
-source .env && uv run hf upload SLM-Lab/benchmark \
+source .env && hf upload SLM-Lab/benchmark \
data/benchmark-dev/data data --repo-type dataset

# Update BENCHMARKS.md links: benchmark-dev → benchmark
# (find-replace in docs/BENCHMARKS.md)

-# Upload docs (with updated links) and README
-source .env && uv run hf upload SLM-Lab/benchmark docs docs --repo-type dataset
-source .env && uv run hf upload SLM-Lab/benchmark README.md README.md --repo-type dataset
+# Upload docs and README
+source .env && hf upload SLM-Lab/benchmark docs docs --repo-type dataset
+source .env && hf upload SLM-Lab/benchmark README.md README.md --repo-type dataset
```

| Repo | Purpose |
Expand All @@ -108,6 +112,19 @@ source .env && uv run slm-lab run-remote --gpu SPEC_FILE SPEC_NAME search -n NAM

Budget: ~3-4 trials per dimension. After search: update spec with best params, run `train`, use that result.

## Autonomous Execution

Work continuously when benchmarking. Use `sleep 300 && dstack ps` to actively wait (5 min intervals) — never delegate monitoring to background processes or scripts. Stay engaged in the conversation.

**Workflow loop** (repeat every 5-10 minutes):
1. **Check status**: `dstack ps` — identify completed/failed/running
2. **Intake completed runs**: For EACH completed run, do the full intake checklist above (score → HF link → pull → plot → table update)
3. **Launch next batch**: Up to 10 concurrent. Check capacity before launching more
4. **Iterate on failures**: Relaunch or adjust config immediately
5. **Commit progress**: Regular commits of score + link + plot updates

**Key principle**: Work continuously, check in regularly, and iterate immediately on failures. Never idle: check on tasks, update, plan, and pick up the next task until all tasks are completed.
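The status-check step of this loop can be sketched as follows. The `dstack ps` output format shown in the stub is hypothetical; `DRY_RUN=1` stubs the call so the control flow can be read and exercised without a live dstack server, and the real loop would wrap `check_once` in `while true; do …; sleep 300; done`.

```shell
#!/usr/bin/env bash
# Sketch of one iteration of the 5-minute monitoring loop.
DRY_RUN=${DRY_RUN:-1}

list_runs() {
  if [ "$DRY_RUN" = 1 ]; then
    # Hypothetical `dstack ps`-style output for illustration.
    printf '%s\n' 'pong4 exited (0)' 'breakout2 running' 'qbert1 failed'
  else
    dstack ps
  fi
}

check_once() {
  local runs; runs=$(list_runs)
  printf 'completed: %s\n' "$(printf '%s\n' "$runs" | grep -c 'exited (0)')"  # intake these
  printf 'failed: %s\n'    "$(printf '%s\n' "$runs" | grep -c 'failed')"      # relaunch/adjust
  printf 'running: %s\n'   "$(printf '%s\n' "$runs" | grep -c 'running')"     # keep waiting
}

check_once
```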

## Troubleshooting

- **Run interrupted**: Relaunch, increment name suffix (e.g., pong3 → pong4)
10 changes: 6 additions & 4 deletions .dstack/run-gpu-train.yml
@@ -20,12 +20,14 @@ commands:
- cd /workflow && uv run slm-lab run ${SPEC_VARS} ${SPEC_FILE} ${SPEC_NAME} ${LAB_MODE} --upload-hf

resources:
-gpu: 1..
-memory: 48GB..
+gpu:
+  name: [RTX3090]
+  count: 1
+memory: 32GB..

-spot_policy: on-demand
+spot_policy: auto
max_duration: 6h
-max_price: 0.25
+max_price: 0.50
retry:
on_events: [no-capacity]
duration: 1h
22 changes: 22 additions & 0 deletions .githooks/commit-msg
@@ -0,0 +1,22 @@
#!/usr/bin/env bash
# Validate conventional commit format: type: message | type(scope): message

commit_msg_file="$1"
commit_msg=$(head -1 "$commit_msg_file")

# Skip merge commits and fixup/squash
if echo "$commit_msg" | grep -qE '^(Merge |fixup! |squash! )'; then
exit 0
fi

# Validate format
if ! echo "$commit_msg" | grep -qE '^(feat|fix|docs|chore|refactor|test|perf|ci|style|build)(\(.+\))?: .+'; then
echo "ERROR: Invalid commit message format."
echo ""
echo " Expected: type: message"
echo " Example: feat: add user authentication"
echo " Example: fix(parser): handle empty input"
echo ""
echo " Allowed types: feat fix docs chore refactor test perf ci style build"
exit 1
fi
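One way to activate the hook is to point git at the repo-local hooks directory; the validation regex can also be smoke-tested directly (the commit messages below are illustrative):

```shell
#!/usr/bin/env bash
# One-time setup (run from the repo root) to activate repo-local hooks:
#   git config core.hooksPath .githooks
#   chmod +x .githooks/commit-msg

# Smoke-test the same regex the hook applies:
pattern='^(feat|fix|docs|chore|refactor|test|perf|ci|style|build)(\(.+\))?: .+'

echo 'feat: add per-run intake checklist' | grep -qE "$pattern" && echo accepted   # accepted
echo 'fix(parser): handle empty input'    | grep -qE "$pattern" && echo accepted   # accepted
echo 'updated stuff'                      | grep -qE "$pattern" || echo rejected   # rejected
```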
21 changes: 15 additions & 6 deletions CLAUDE.md
@@ -66,7 +66,9 @@ Apply these six principles to every decision.
1. **Give enough context in spawn prompts** - teammates don't inherit conversation history, only CLAUDE.md and project context
2. **Size tasks appropriately** - self-contained units with clear deliverables, ~5-6 per teammate
3. **Avoid file conflicts** - each teammate owns different files
-4. **Use sonnet for volume work** - reserve opus for strategic decisions

+> Work autonomously: run things in parallel, continue without pausing, pick up the next task immediately. For long-running tasks, use `sleep N` to actively wait and check in — do NOT delegate to background processes. Stay engaged in the conversation.


## Documentation

@@ -192,7 +194,8 @@ uv run ruff format
For a small box that only dispatches dstack runs and syncs results (no local ML training):

```bash
-uv sync --no-default-groups
+uv sync --no-default-groups  # skip ML deps (torch, gymnasium, etc.)
+uv tool install dstack
uv run --no-default-groups slm-lab run-remote spec.json spec_name train
uv run --no-default-groups slm-lab pull spec_name
uv run --no-default-groups slm-lab plot -f folder1,folder2
@@ -236,11 +239,17 @@ dstack stop <run-name> -y  # terminate

**`docs/BENCHMARKS.md`** is the single source of truth. See the `/benchmark` skill for operational details (commands, data lifecycle, graduation).

-**Pattern**: Launch → Monitor → Extract score → Pull data → Plot → Update table → Commit
+**Per-run intake (MANDATORY — every completed run must go through ALL steps):**
+1. Extract score (`dstack logs NAME | grep trial_metrics`)
+2. Find HF folder name (query HuggingFace API)
+3. Update table with score AND HF link
+4. Pull HF data locally (`hf download`)
+5. Generate plot (`uv run slm-lab plot`)
+6. Commit score + link + plot together

-**Autonomous execution**: Fill GPU capacity (~30 concurrent runs), check status regularly, extract immediately on completion, iterate on failures. Never idle.
+A table row is NOT complete until it has: **score, HF link, and plot**. See `/benchmark` skill for commands.

-**Data lifecycle**: Pull full HF dataset to `data/benchmark-dev/` once — keep it for plots AND graduation. Never delete until graduation to public repo is complete.
+**Autonomous execution**: Max 10 concurrent runs. Use `sleep 300 && dstack ps` to actively wait. Never delegate monitoring to background scripts. Never idle.

### Hyperparameter Search

@@ -261,6 +270,6 @@ Prefer continuous distributions (`__uniform`, `__loguniform`) over `__choice`. S

## SLM-Lab Documentation

-- **Changelog**: Document major changes in `CHANGELOG.md`
+- **Changelog**: Document major changes in `docs/CHANGELOG.md`
- **Benchmarks**: `docs/BENCHMARKS.md` — results tables, targets, reproducibility
- **Specs**: Document rationale in commit messages when updating specs
5 changes: 3 additions & 2 deletions README.md
@@ -8,7 +8,7 @@
<a href="https://slm-lab.gitbook.io/slm-lab/">Documentation</a> · <a href="https://github.com/kengz/SLM-Lab/blob/master/docs/BENCHMARKS.md">Benchmark Results</a>
</p>

->**NOTE:** v5.0 updates to Gymnasium, `uv` tooling, and modern dependencies with ARM support - see [CHANGELOG.md](CHANGELOG.md).
+>**NOTE:** v5.0 updates to Gymnasium, `uv` tooling, and modern dependencies with ARM support - see [CHANGELOG.md](docs/CHANGELOG.md).
>
>Book readers: `git checkout v4.1.1` for *Foundations of Deep Reinforcement Learning* code.

@@ -111,7 +111,8 @@ Config options in `.dstack/`: `run-gpu-train.yml`, `run-gpu-search.yml`, `run-cp
For a lightweight box that only dispatches dstack runs, syncs results, and generates plots (no local ML training):

```bash
-uv sync --no-default-groups
+uv sync --no-default-groups  # skip ML deps (torch, gymnasium, etc.)
+uv tool install dstack
uv run --no-default-groups slm-lab run-remote spec.json spec_name train
uv run --no-default-groups slm-lab pull spec_name
uv run --no-default-groups slm-lab plot -f folder1,folder2