From faf3d9097d9d45fd10f31e19c724842efaf741be Mon Sep 17 00:00:00 2001
From: kengz
Date: Sat, 14 Feb 2026 13:26:36 -0500
Subject: [PATCH] docs: SAC Atari benchmarks - all 58 games

Complete SAC Atari benchmark across all 58 games (2M frames, 4 seeds).
Single universal spec (sac_atari.json): training_iter=3, Categorical,
AdamW lr=3e-4. A2C+PPO+SAC comparison plots and HF data graduated to
public repo. Streamlined CLAUDE.md and benchmark skill. Removed stale
SAC PER specs.

Co-Authored-By: Claude Opus 4.6
---
 .claude/skills/benchmark/SKILL.md             | 178 ++++------
 CLAUDE.md                                     | 326 +++++-------------
 docs/BENCHMARKS.md                            |  85 ++++-
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 94455 -> 100356 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 95898 -> 106239 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 97497 -> 104017 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 95881 -> 101312 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 105370 -> 112664 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 87673 -> 95176 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 103248 -> 107799 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 107112 -> 111333 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 87992 -> 89873 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 105698 -> 109516 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 90356 -> 96749 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 80033 -> 85326 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 80698 -> 92012 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 105892 -> 110128 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 102366 -> 116278 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 93617 -> 104138 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 96603 -> 100571 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 103199 -> 115237 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 98110 -> 102597 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 90681 -> 97053 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 115307 -> 118471 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 96633 -> 103296 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 97517 -> 97532 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 98320 -> 103956 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 86942 -> 88408 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 91799 -> 92912 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 95341 -> 103282 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 90255 -> 107821 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 99321 -> 105335 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 91565 -> 96837 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 105525 -> 112192 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 90500 -> 100578 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 82244 -> 104112 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 105518 -> 110615 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 105831 -> 113179 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 101035 -> 106147 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 105403 -> 113417 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 88505 -> 93639 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 96125 -> 113892 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 102110 -> 110292 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 104686 -> 110985 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 94575 -> 120720 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 100459 -> 120631 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 86223 -> 96968 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 92320 -> 106806 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 84375 -> 82261 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 99642 -> 111354 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 97736 -> 106407 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 90105 -> 94544 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 74824 -> 71177 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 89825 -> 94749 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 87584 -> 95600 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 97226 -> 105728 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 92711 -> 95887 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 97477 -> 103820 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 100279 -> 104809 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 94527 -> 105332 bytes
 ..._trial_graph_mean_returns_ma_vs_frames.png | Bin 86470 -> 95184 bytes
 pyproject.toml                                |   2 +-
 slm_lab/spec/benchmark/sac/sac_atari.json     |   6 +-
 .../benchmark/sac/sac_per_halfcheetah.json    |  57 ---
 .../sac/sac_per_halfcheetah_pybullet.json     |  57 ---
 .../spec/benchmark/sac/sac_per_mujoco.json    |  57 ---
 slm_lab/spec/benchmark/sac/sac_pong.json      |  83 -----
 67 files changed, 223 insertions(+), 628 deletions(-)
 delete mode 100644 slm_lab/spec/benchmark/sac/sac_per_halfcheetah.json
 delete mode 100644 slm_lab/spec/benchmark/sac/sac_per_halfcheetah_pybullet.json
 delete mode 100644 slm_lab/spec/benchmark/sac/sac_per_mujoco.json
 delete mode 100644 slm_lab/spec/benchmark/sac/sac_pong.json

diff --git a/.claude/skills/benchmark/SKILL.md b/.claude/skills/benchmark/SKILL.md
index 8e6af339d..431225313 100644
--- a/.claude/skills/benchmark/SKILL.md
+++ b/.claude/skills/benchmark/SKILL.md
@@ -3,159 +3,115 @@ name: slm-lab-benchmark
 description: Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract results, and update BENCHMARKS.md.
  Use when asked to run benchmarks, check run status, extract scores, update benchmark tables, or generate plots.
 ---
 
-# SLM-Lab Benchmark Workflow
+# SLM-Lab Benchmark Skill
 
 ## Critical Rules
 
-1. **NEVER push to remote** without explicit user permission - commit locally only
-2. **ONLY train runs** in BENCHMARKS.md - NEVER use search results (search folders = UNACCEPTABLE)
-3. **Respect Settings line** for each env (max_frame, num_envs, etc.) - see [BENCHMARKS.md](docs/BENCHMARKS.md)
-4. **Use `${max_frame}` variable** in specs - never hardcode max_frame values
-5. **Verify HF links work** before updating table
-6. **Runs must complete in <6h**
+1. **NEVER push to remote** without explicit user permission
+2. **ONLY train runs** in BENCHMARKS.md — never search results
+3. **Respect Settings line** for each env (max_frame, num_envs, etc.)
+4. **Use `${max_frame}` variable** in specs — never hardcode
+5. **Runs must complete in <6h** (dstack max_duration)
 
-## Benchmark Contribution Workflow
-
-### 1. Audit Spec Settings
-
-**Before Running**: Ensure spec matches the **Settings** line in BENCHMARKS.md for each env.
-
-Example Settings line: `max_frame 3e5 | num_envs 4 | max_session 4 | log_frequency 500`
-
-**After Pulling**: Verify downloaded `spec.json` matches these rules before using data.
-
-### 2. Run Benchmark & Commit Specs
+## Run → Score → Record
 
 ```bash
-# Remote (GPU) - auto-syncs to HuggingFace
-source .env && slm-lab run-remote --gpu SPEC_FILE SPEC_NAME train -n NAME
-
-# With variable substitution (MuJoCo/Atari)
-source .env && slm-lab run-remote --gpu -s env=ENV -s max_frame=MAX_FRAME \
-  SPEC_FILE SPEC_NAME train -n NAME
-
-# Local (Classic Control only)
-slm-lab run SPEC_FILE SPEC_NAME train
-```
+# Launch
+source .env && uv run slm-lab run-remote --gpu \
+  -s env=ALE/Pong-v5 SPEC_FILE SPEC_NAME train -n NAME
 
-**Always commit the spec file** used for the run. Ensure BENCHMARKS.md has entry with correct SPEC_FILE and SPEC_NAME.
-
-### 3. Monitor Status
+# Monitor
+dstack ps                                  # running jobs
+dstack logs NAME | grep "trial_metrics"    # extract score at completion
 
-**Monitor autonomously** - use sleep to check in periodically until completion:
-```bash
-sleep 900 && dstack ps  # wait 15min then check
+
+# Score = total_reward_ma from trial_metrics line
+# trial_metrics: frame:1.00e+07 | total_reward_ma:816.18 | ...
 ```
 
-```bash
-dstack ps                 # running jobs
-dstack ps -a | head -20   # recent jobs (done/failed)
-dstack logs NAME          # view logs
-dstack logs NAME | grep "trial_metrics"  # extract final score
-dstack stop NAME -y       # terminate run
-```
 
+## Data Lifecycle
 
-### 4. Record Scores & Plots
+Data flows through three stages: **pull → plot → graduate**. Keep local data until graduation is complete.
 
-**Score**: At end of run, extract `total_reward_ma` from logs (`trial_metrics`):
 ```
-trial_metrics: frame:1.00e+07 | total_reward_ma:816.18 | strength:570.4 | ...
+Remote GPU run → auto-uploads to benchmark-dev (HF)
+      ↓
+ Pull to local data/
+      ↓
+ Generate plots (docs/plots/)
+      ↓
+ Update BENCHMARKS.md (scores, links)
+      ↓
+ Graduate to public benchmark (HF)
+      ↓
+ Update links: benchmark-dev → benchmark
+      ↓
+ Upload docs/ to public benchmark (HF)
 ```
 
-**Link**: Add HuggingFace folder link to table:
-- Format: `[FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)`
 
+### Pull Data
 
-**Pull Data Efficiently** (avoid rate limiting):
 ```bash
-# DON'T use slm-lab pull for bulk downloads - will get rate limited
-# Instead, pull specific folders directly from HF:
-
-# Get folder name from run logs
-dstack logs NAME | grep "SLM-Lab: Running"
-# Output shows: data/FOLDER_NAME/...
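The score-extraction step described in this diff — pulling `total_reward_ma` out of a `trial_metrics` log line — can be sketched in Python. This is an illustrative sketch based only on the example line shown above; SLM-Lab's actual log format may include fields or separators not handled here.

```python
import re

def parse_trial_metrics(line):
    """Parse a trial_metrics log line into a dict of float metrics.

    Matches key:value pairs such as 'total_reward_ma:816.18' or
    'frame:1.00e+07' (scientific notation included).
    """
    pairs = re.findall(r"(\w+):([-+0-9.eE]+)", line)
    return {key: float(val) for key, val in pairs}

# Example line copied from the skill doc above
line = "trial_metrics: frame:1.00e+07 | total_reward_ma:816.18 | strength:570.4"
metrics = parse_trial_metrics(line)
print(metrics["total_reward_ma"])  # 816.18
```

Piping `dstack logs NAME | grep "trial_metrics"` through a parser like this gives the benchmark score programmatically instead of by eyeballing the log.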
-
-# Pull only the folder you need
+# Pull full dataset (fast, single request — avoids rate limits)
 source .env && uv run hf download SLM-Lab/benchmark-dev \
-  --include "data/FOLDER_NAME/*" --local-dir hf_data --repo-type dataset
-```
-
-**Plot**:
-```bash
-# Verify scores in trial_metrics.json match logs
-# Ensure all runs share same max_frame
+  --local-dir data/benchmark-dev --repo-type dataset
 
-# Generate plot using ONLY folders from table
-slm-lab plot -t "ENV_NAME" -f folder1,folder2,folder3
+# KEEP this data — needed for plots AND graduation upload later
+# Never rm -rf data/benchmark-dev/ until graduation is complete
 ```
 
-**Status legend**: ✅ Solved | ⚠️ Close (>80%) | ❌ Failed
-
-### 5. Commit Changes
+### Generate Plots
 
 ```bash
-git add docs/BENCHMARKS.md slm_lab/spec/benchmark/...
-git commit -m "docs: update ENV benchmark (SCORE)"
-# NEVER push without explicit permission
-```
-
-## Publishing to Public Dataset
+# Find folders for a game (need a2c + ppo + sac for comparison)
+ls data/benchmark-dev/data/ | grep -i pong
 
-During development, runs upload to `SLM-Lab/benchmark-dev` (noisy, iterative). When benchmarks are finalized, publish clean results to the public `SLM-Lab/benchmark` dataset.
+# Generate comparison plot
+uv run slm-lab plot -t "Pong" \
+  -f data/benchmark-dev/data/a2c_folder,data/benchmark-dev/data/ppo_folder,data/benchmark-dev/data/sac_folder
 ```
 
-### Strategy: `hf_data/` IS the Manifest
+### Update BENCHMARKS.md
 
-The `hf_data/data/` directory defines what gets uploaded. Same process works for any subset (Phase 1-3, Atari, single env).
+- Add score in results table
+- Add HF link: `[FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)`
+- Status: ✅ Solved | ⚠️ Close (>80%) | ❌ Failed
 
-### Upload Workflow
+### Graduate to Public HF
 
-HF dataset mirrors repo structure: `README.md`, `docs/`, `data/`
+When benchmarks are finalized, publish from `benchmark-dev` → `benchmark`:
 
 ```bash
-# 1. Clear hf_data/ before upload cycle
-rm -rf hf_data/
-
-# 2. Pull only folders you want from benchmark-dev
-source .env && uv run hf download SLM-Lab/benchmark-dev \
-  --include "data/ppo_cartpole_2026*/*" "data/sac_lunar*/*" \
-  --local-dir hf_data --repo-type dataset
-
-# 3. Upload README to public repo
-source .env && uv run hf upload SLM-Lab/benchmark README.md README.md --repo-type dataset
-
-# 4. Upload data to public repo
-source .env && uv run hf upload SLM-Lab/benchmark hf_data/data data --repo-type dataset
+# Upload data (from local copy — already pulled above)
+source .env && uv run hf upload SLM-Lab/benchmark \
+  data/benchmark-dev/data data --repo-type dataset
 
-# 5. Update BENCHMARKS.md links: benchmark-dev -> benchmark
-# (data now exists on public repo, so links will work)
+# Update BENCHMARKS.md links: benchmark-dev → benchmark
+# (find-replace in docs/BENCHMARKS.md)
 
-# 6. Upload docs (with updated links)
+# Upload docs (with updated links) and README
 source .env && uv run hf upload SLM-Lab/benchmark docs docs --repo-type dataset
+source .env && uv run hf upload SLM-Lab/benchmark README.md README.md --repo-type dataset
 ```
 
-### Two-Repo Strategy
-
-| Repo | Purpose | Links in BENCHMARKS.md |
-|------|---------|------------------------|
-| `SLM-Lab/benchmark-dev` | Development iterations, noisy | During active work |
-| `SLM-Lab/benchmark` | Public, finalized results | After publishing |
+| Repo | Purpose |
+|------|---------|
+| `SLM-Lab/benchmark-dev` | Development — noisy, iterative |
+| `SLM-Lab/benchmark` | Public — finalized, validated |
 
 ## Hyperparameter Search
 
-Only when algorithm fails to reach target. Use search to find hyperparams, then run final `train` for benchmark.
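The status legend referenced in this skill (✅ Solved | ⚠️ Close (>80%) | ❌ Failed) — which also gates the fallback to hyperparameter search — boils down to comparing a run's score against the env's target. A minimal sketch, assuming "Close" means strictly above 80% of target and that targets are positive; the real criteria in BENCHMARKS.md may be more nuanced:

```python
def benchmark_status(score, target):
    """Map a run's score to the BENCHMARKS.md status legend.

    Assumption: 'Close' means score above 80% of a positive target;
    this is inferred from the legend text, not from SLM-Lab code.
    """
    if score >= target:
        return "✅ Solved"
    if score > 0.8 * target:
        return "⚠️ Close"
    return "❌ Failed"

print(benchmark_status(21.0, 21.0))  # ✅ Solved
print(benchmark_status(18.0, 21.0))  # ⚠️ Close  (18 > 0.8 * 21 = 16.8)
print(benchmark_status(5.0, 21.0))   # ❌ Failed
```

Anything that lands in the ⚠️/❌ buckets is a candidate for the search mode described below the legend.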
+Only when algorithm fails to reach target:
 
 ```bash
-source .env && slm-lab run-remote --gpu SPEC_FILE SPEC_NAME search -n NAME
+source .env && uv run slm-lab run-remote --gpu SPEC_FILE SPEC_NAME search -n NAME
 ```
 
-**Search budget**: ~3-4 trials per dimension (8 trials = 2-3 dims, 16 = 3-4 dims).
-
-**After search**: Update spec with best hyperparams, run `train` mode, use that result in BENCHMARKS.md.
+Budget: ~3-4 trials per dimension. After search: update spec with best params, run `train`, use that result.
 
 ## Troubleshooting
 
-- **Run interrupted**: Expected with spot instances - relaunch with same command, increment run name (e.g. rv2 → rv3)
-- **Low GPU usage** (<50%): CPU bottleneck or config issue, not training problem
-- **Score below target**: Check hyperparams match spec, try search mode
-- **HF link 404**: Run didn't complete or upload failed, rerun
-
-For full details, see [docs/BENCHMARKS.md](docs/BENCHMARKS.md).
+- **Run interrupted**: Relaunch, increment name suffix (e.g., pong3 → pong4)
+- **Low GPU usage** (<50%): CPU bottleneck or config issue
+- **HF rate limit**: Download full dataset, not selective `--include` patterns
+- **HF link 404**: Run didn't complete or upload failed — rerun
+- **.env inline comments**: Break dstack env vars — put comments on separate lines

diff --git a/CLAUDE.md b/CLAUDE.md
index 3cecf8cd9..ef0291ee2 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,117 +4,104 @@
 
 You are a seasoned software engineer with the following traits:
 
-- **Perfectionist**: Code quality is non-negotiable - clean, idiomatic, maintainable code every time
+- **Quality-driven**: Code quality is non-negotiable - clean, idiomatic, maintainable code every time
 - **Autonomous**: Make informed technical decisions independently - only ask when requirements are genuinely unclear
 - **Pragmatic**: Balance perfect with practical - ship working solutions, iterate when needed
 - **Detail-oriented**: Catch edge cases, handle errors properly, think through implications
 - **Proactive**: Refactor immediately, delete dead code aggressively, improve as you go
-- **Efficient**: Minimal token usage - no fluff, explanations only when asked
 
 **Working principles:**
 
-1. Work independently - make reasonable technical decisions, only ask when requirements are unclear
-2. Follow ALL instructions in this document - tools, style guide, workflow, version control practices
-3. Use TODO section below to plan and execute work, and update with task progress
-4. Stage changes frequently - commit related work as logical units
-5. Never hard reset or delete work - preserve changes even during corruption/errors
-6. Keep responses SHORT - no explanations unless asked, no restating what was done, just confirm completion
+1. Stage changes frequently - commit related work as logical units
+2. Never hard reset or delete work - preserve changes even during corruption/errors
+3. Work autonomously - run things in parallel when possible, continue without pausing, pick up the next task immediately
+4. Keep responses SHORT - no explanations unless asked, just confirm completion. State rationale briefly for non-obvious decisions.
 
-## Project Setup
+## Principles of Good Code Design
 
-### Python Projects
+Apply these six principles to every decision.
 
-1. **Package Management**: Use [`uv`](https://docs.astral.sh/uv/getting-started/installation/) and `pyproject.toml`
-   1. Install dependencies: `uv sync`
-   2. Add packages: `uv add `
-   3. Run scripts: `uv run