Merged
80 changes: 59 additions & 21 deletions README.md
@@ -42,6 +42,31 @@ View the results later.
upskill runs --skill git-commit-messages
```

## Model Handling Overview

upskill splits its workflow into distinct phases, each with an explicit model role:

- **Skill generation**: create/refine `SKILL.md`
- **Test generation**: create synthetic evaluation cases
- **Evaluation**: run tests against evaluator model(s)
- **Benchmark**: repeated evaluation across multiple runs/models

Model flags by command:

| Command | Flag | Meaning |
|---|---|---|
| `generate` | `--model` | Skill generation/refinement model |
| `generate` | `--test-gen-model` | Test generation model override |
| `generate` | `--eval-model` | Optional extra cross-model eval pass |
| `eval` | `-m/--model` | Evaluation model(s) (repeatable) |
| `eval` | `--test-gen-model` | Test generation model override (when tests are generated) |
| `benchmark` | `-m/--model` | Evaluation model(s) to benchmark |
| `benchmark` | `--test-gen-model` | Test generation model override (when tests are generated) |
| `runs` / `plot` | `-m/--model` | Historical results filter only |

`upskill eval` enters **benchmark mode** whenever you pass multiple `-m` values or `--runs > 1`.
In benchmark mode, the baseline comparison is always disabled, so passing `--no-baseline` is redundant.
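The trigger rule above can be sketched as a simple predicate. This is a hypothetical helper for illustration, not upskill's actual implementation:

```python
def is_benchmark_mode(models: list[str], runs: int) -> bool:
    # Benchmark mode: more than one -m value, or more than one run per model.
    return len(models) > 1 or runs > 1

# How the CLI flags map onto the rule:
assert is_benchmark_mode(["haiku", "sonnet"], 1)   # multiple -m values
assert is_benchmark_mode(["haiku"], 5)             # --runs 5
assert not is_benchmark_mode(["haiku"], 1)         # simple eval mode
```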

## Commands

### `upskill generate`
@@ -59,7 +84,8 @@ upskill generate TASK [OPTIONS]
- `-e, --example` - Input -> output example (can be repeated)
- `--tool` - Generate from MCP tool schema (path#tool_name)
- `-f, --from PATH` - Improve from existing skill dir or agent trace file (auto-detected)
- `-m, --model MODEL` - Model for generation (e.g., 'sonnet', 'haiku', 'anthropic.claude-sonnet-4-20250514')
- `-m, --model MODEL` - Skill generation model (e.g., 'sonnet', 'haiku', 'anthropic.claude-sonnet-4-20250514')
- `--test-gen-model MODEL` - Override test generation model for this run
- `-o, --output PATH` - Output directory for skill
- `--no-eval` - Skip evaluation and refinement
- `--eval-model MODEL` - Different model to evaluate skill on
@@ -120,10 +146,9 @@ upskill eval SKILL_PATH [OPTIONS]
**Options:**
- `-t, --tests PATH` - Test cases JSON file
- `-m, --model MODEL` - Model(s) to evaluate against (repeatable for multi-model benchmarking)
- `--test-gen-model MODEL` - Override test generation model when tests must be generated
- `--runs N` - Number of runs per model (default: 1)
- `--provider [anthropic|openai|generic]` - API provider (auto-detected as 'generic' when --base-url is provided)
- `--base-url URL` - Custom API endpoint for local models
- `--no-baseline` - Skip baseline comparison
- `--no-baseline` - Skip baseline comparison (simple eval mode only; ignored in benchmark mode)
- `-v, --verbose` - Show per-test results
- `--log-runs / --no-log-runs` - Log run data (default: enabled)
- `--runs-dir PATH` - Directory for run logs
@@ -149,14 +174,15 @@ upskill eval ./skills/my-skill/ -m haiku -m sonnet
# Multiple runs per model for statistical significance
upskill eval ./skills/my-skill/ -m haiku -m sonnet --runs 5

# Evaluate on local model (llama.cpp server)
upskill eval ./skills/my-skill/ \
-m "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
--base-url http://localhost:8080/v1
# Evaluate a local model configured in fast-agent
upskill eval ./skills/my-skill/ -m generic.my-model

# Skip baseline (just test with skill)
upskill eval ./skills/my-skill/ --no-baseline

# Benchmark mode is triggered by multiple models OR --runs > 1
upskill eval ./skills/my-skill/ -m haiku --runs 5

# Disable run logging
upskill eval ./skills/my-skill/ --no-log-runs
```
@@ -246,7 +272,7 @@ upskill runs [OPTIONS]
**Options:**
- `-d, --dir PATH` - Runs directory
- `-s, --skill TEXT` - Filter by skill name(s) (repeatable)
- `-m, --model TEXT` - Filter by model(s) (repeatable)
- `-m, --model TEXT` - Filter historical run data by model(s) (repeatable)
- `--metric [success|tokens]` - Metric to display (default: success)
- `--csv PATH` - Export to CSV instead of plot

@@ -369,16 +395,34 @@ Disable with `--no-log-runs`.

## Configuration

### upskill config (`~/.config/upskill/config.yaml`)
### upskill config (`./upskill.config.yaml`)

```yaml
model: sonnet # Default generation model
skill_generation_model: sonnet # Default skill generation model
eval_model: haiku # Default evaluation model (optional)
test_gen_model: null # Optional test generation model
skills_dir: ./skills # Where to save skills
runs_dir: ./runs # Where to save run logs
max_refine_attempts: 3 # Refinement iterations
```

`test_gen_model` fallback behavior:

- CLI `--test-gen-model` overrides config for a single run.
- If set, test generation uses `test_gen_model`.
- If unset, test generation falls back to `skill_generation_model`.
- For `eval`/`benchmark`, this intentionally uses `skill_generation_model` (not `eval_model`) so generated tests stay
stable when sweeping multiple evaluation models.

Backward compatibility: `model` is still accepted in config files as a legacy alias for
`skill_generation_model`.

Config lookup order:

1. `UPSKILL_CONFIG` environment variable (path)
2. `./upskill.config.yaml` (project local)
3. `~/.config/upskill/config.yaml` (legacy fallback)

### FastAgent config (`fastagent.config.yaml`)

Place in your project directory to customize FastAgent settings:
Expand Down Expand Up @@ -488,10 +532,8 @@ upskill supports local models through any OpenAI-compatible endpoint (Ollama, ll
# Start Ollama (default port 11434)
ollama serve

# Evaluate with a local model
upskill eval ./skills/my-skill/ \
--model llama3.2:latest \
--base-url http://localhost:11434/v1
# Configure endpoint via fast-agent config/env, then evaluate
upskill eval ./skills/my-skill/ --model generic.llama3.2:latest
```

**With llama.cpp server:**
@@ -500,10 +542,6 @@ upskill eval ./skills/my-skill/ \
# Start llama.cpp server
./llama-server -m model.gguf --port 8080

# Evaluate with the local model
upskill eval ./skills/my-skill/ \
--model my-model \
--base-url http://localhost:8080/v1
# Configure endpoint via fast-agent config/env, then evaluate
upskill eval ./skills/my-skill/ --model generic.my-model
```

When `--base-url` is provided, the provider is automatically set to `generic` unless you specify `--provider` explicitly.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -6,7 +6,7 @@ readme = "README.md"
requires-python = ">=3.13.5,<3.14"
dependencies = [
"click>=8.1",
"fast-agent-mcp>=0.4.41",
"fast-agent-mcp>=0.4.53",
"pydantic>=2.0",
"python-dotenv>=1.0",
"pyyaml>=6.0",
3 changes: 0 additions & 3 deletions src/upskill/agent_cards/test_gen.md
@@ -1,8 +1,5 @@
---
type: agent
# note that this takes precedence over cli switch. you can set model string directly.
#model: opus?structured=tool_use
model: opus?reasoning=1024
description: Generate test cases for evaluating skills.
---
You generate test cases for evaluating AI agent skills. Output only valid JSON.