diff --git a/README.md b/README.md index b93af5e..24e9088 100644 --- a/README.md +++ b/README.md @@ -42,6 +42,31 @@ View the results later. upskill runs --skill git-commit-messages ``` +## Model Handling Overview + +upskill uses distinct phases with explicit model roles: + +- **Skill generation**: create/refine `SKILL.md` +- **Test generation**: create synthetic evaluation cases +- **Evaluation**: run tests against evaluator model(s) +- **Benchmark**: repeated evaluation across multiple runs/models + +Model flags by command: + +| Command | Flag | Meaning | +|---|---|---| +| `generate` | `--model` | Skill generation/refinement model | +| `generate` | `--test-gen-model` | Test generation model override | +| `generate` | `--eval-model` | Optional extra cross-model eval pass | +| `eval` | `-m/--model` | Evaluation model(s) (repeatable) | +| `eval` | `--test-gen-model` | Test generation model override (when tests are generated) | +| `benchmark` | `-m/--model` | Evaluation model(s) to benchmark | +| `benchmark` | `--test-gen-model` | Test generation model override (when tests are generated) | +| `runs` / `plot` | `-m/--model` | Historical results filter only | + +`upskill eval` enters **benchmark mode** whenever you pass multiple `-m` values or `--runs > 1`. +In benchmark mode, baseline comparison is always off; `--no-baseline` is redundant. + ## Commands ### `upskill generate` @@ -59,7 +84,8 @@ upskill generate TASK [OPTIONS] - `-e, --example` - Input -> output example (can be repeated) - `--tool` - Generate from MCP tool schema (path#tool_name) - `-f, --from PATH` - Improve from existing skill dir or agent trace file (auto-detected) -- `-m, --model MODEL` - Model for generation (e.g., 'sonnet', 'haiku', 'anthropic.claude-sonnet-4-20250514') +- `-m, --model MODEL` - Skill generation model (e.g., 'sonnet', 'haiku', 'anthropic.claude-sonnet-4-20250514') +- `--test-gen-model MODEL` - Override test generation model for this run - `-o, --output PATH` - Output directory for skill - `--no-eval` - Skip evaluation and refinement - `--eval-model MODEL` - Different model to evaluate skill on @@ -120,10 +146,9 @@ upskill eval SKILL_PATH [OPTIONS] **Options:** - `-t, --tests PATH` - Test cases JSON file - `-m, --model MODEL` - Model(s) to evaluate against (repeatable for multi-model benchmarking) +- `--test-gen-model MODEL` - Override test generation model when tests must be generated - `--runs N` - Number of runs per model (default: 1) -- `--provider [anthropic|openai|generic]` - API provider (auto-detected as 'generic' when --base-url is provided) -- `--base-url URL` - Custom API endpoint for local models -- `--no-baseline` - Skip baseline comparison +- `--no-baseline` - Skip baseline comparison (simple eval mode only; ignored in benchmark mode) - `-v, --verbose` - Show per-test results - `--log-runs / --no-log-runs` - Log run data (default: enabled) - `--runs-dir PATH` - Directory for run logs @@ -149,14 +174,15 @@ upskill eval ./skills/my-skill/ -m haiku -m sonnet # Multiple runs per model for statistical significance upskill eval ./skills/my-skill/ -m haiku -m sonnet --runs 5 -# Evaluate on local model (llama.cpp server) -upskill eval ./skills/my-skill/ \ - -m "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \ - --base-url http://localhost:8080/v1 +# Evaluate a local model configured in fast-agent +upskill eval ./skills/my-skill/ -m generic.my-model # Skip baseline (just test with skill) upskill eval ./skills/my-skill/ --no-baseline +# Benchmark mode is triggered by multiple models OR --runs > 1 +upskill eval 
./skills/my-skill/ -m haiku --runs 5 + # Disable run logging upskill eval ./skills/my-skill/ --no-log-runs ``` @@ -246,7 +272,7 @@ upskill runs [OPTIONS] **Options:** - `-d, --dir PATH` - Runs directory - `-s, --skill TEXT` - Filter by skill name(s) (repeatable) -- `-m, --model TEXT` - Filter by model(s) (repeatable) +- `-m, --model TEXT` - Filter historical run data by model(s) (repeatable) - `--metric [success|tokens]` - Metric to display (default: success) - `--csv PATH` - Export to CSV instead of plot @@ -369,16 +395,34 @@ Disable with `--no-log-runs`. ## Configuration -### upskill config (`~/.config/upskill/config.yaml`) +### upskill config (`./upskill.config.yaml`) ```yaml -model: sonnet # Default generation model +skill_generation_model: sonnet # Default skill generation model eval_model: haiku # Default evaluation model (optional) +test_gen_model: null # Optional test generation model skills_dir: ./skills # Where to save skills runs_dir: ./runs # Where to save run logs max_refine_attempts: 3 # Refinement iterations ``` +`test_gen_model` fallback behavior: + +- CLI `--test-gen-model` overrides config for a single run. +- If set, test generation uses `test_gen_model`. +- If unset, test generation falls back to `skill_generation_model`. +- For `eval`/`benchmark`, this intentionally uses `skill_generation_model` (not `eval_model`) so generated tests stay + stable when sweeping multiple evaluation models. + +Backward compatibility: `model` is still accepted in config files as a legacy alias for +`skill_generation_model`. + +Config lookup order: + +1. `UPSKILL_CONFIG` environment variable (path) +2. `./upskill.config.yaml` (project local) +3. `~/.config/upskill/config.yaml` (legacy fallback) + ### FastAgent config (`fastagent.config.yaml`) Place in your project directory to customize FastAgent settings: @@ -488,10 +532,8 @@ upskill supports local models through any OpenAI-compatible endpoint (Ollama, ll # Start Ollama (default port 11434) ollama serve -# Evaluate with a local model -upskill eval ./skills/my-skill/ \ - --model llama3.2:latest \ - --base-url http://localhost:11434/v1 +# Configure endpoint via fast-agent config/env, then evaluate +upskill eval ./skills/my-skill/ --model generic.llama3.2:latest ``` **With llama.cpp server:** @@ -500,10 +542,6 @@ upskill eval ./skills/my-skill/ \ # Start llama.cpp server ./llama-server -m model.gguf --port 8080 -# Evaluate with the local model -upskill eval ./skills/my-skill/ \ - --model my-model \ - --base-url http://localhost:8080/v1 +# Configure endpoint via fast-agent config/env, then evaluate +upskill eval ./skills/my-skill/ --model generic.my-model ``` - -When `--base-url` is provided, the provider is automatically set to `generic` unless you specify `--provider` explicitly. diff --git a/pyproject.toml b/pyproject.toml index ffc71c0..160bcec 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -6,7 +6,7 @@ readme = "README.md" requires-python = ">=3.13.5,<3.14" dependencies = [ "click>=8.1", - "fast-agent-mcp>=0.4.41", + "fast-agent-mcp>=0.4.53", "pydantic>=2.0", "python-dotenv>=1.0", "pyyaml>=6.0", diff --git a/src/upskill/agent_cards/test_gen.md b/src/upskill/agent_cards/test_gen.md index 0aba7f1..3de8ebe 100644 --- a/src/upskill/agent_cards/test_gen.md +++ b/src/upskill/agent_cards/test_gen.md @@ -1,8 +1,5 @@ --- type: agent -# note that this takes precedence over cli switch. you can set model string directly. -#model: opus?structured=tool_use -model: opus?reasoning=1024 description: Generate test cases for evaluating skills. 
--- You generate test cases for evaluating AI agent skills. Output only valid JSON. diff --git a/src/upskill/cli.py b/src/upskill/cli.py index 783248c..1a9d63c 100644 --- a/src/upskill/cli.py +++ b/src/upskill/cli.py @@ -2,6 +2,7 @@ from __future__ import annotations import asyncio +import inspect import json import sys from collections.abc import AsyncIterator @@ -18,7 +19,7 @@ from rich.table import Table from rich.tree import Tree -from upskill.config import Config +from upskill.config import Config, resolve_upskill_config_path from upskill.evaluate import evaluate_skill, get_failure_descriptions from upskill.generate import generate_skill, generate_tests, improve_skill, refine_skill from upskill.logging import ( @@ -32,6 +33,7 @@ write_run_metadata, write_run_result, ) +from upskill.model_resolution import ResolvedModels, resolve_models from upskill.models import ( BatchSummary, RunMetadata, @@ -47,10 +49,13 @@ @asynccontextmanager -async def _fast_agent_context() -> AsyncIterator[object]: +async def _fast_agent_context(config: Config | None = None) -> AsyncIterator[object]: + config = config or Config.load() fast = FastAgent( "upskill", + config_path=str(config.effective_fastagent_config), ignore_unknown_args=True, + parse_cli_args=False, ) @fast.agent() @@ -65,6 +70,62 @@ async def empty(): yield agent +async def _set_agent_model(agent: object, model: str | None) -> None: + """Best-effort model assignment for a fast-agent instance.""" + if not model: + return + set_model = getattr(agent, "set_model", None) + if not callable(set_model): + return + result = set_model(model) + if inspect.isawaitable(result): + await result + + +def _require_resolved_model(value: str | None, *, field: str, command: str) -> str: + """Require a non-null resolved model value for a command.""" + if value is None: + raise RuntimeError( + f"Model resolution bug: `{command}` requires resolved `{field}` to be set." + ) + return value + + +def _require_resolved_models(values: list[str], *, field: str, command: str) -> list[str]: + """Require a non-empty resolved model list for a command.""" + if not values: + raise RuntimeError( + f"Model resolution bug: `{command}` requires resolved `{field}` to be non-empty." 
+ ) + return values + + +def _print_model_plan(command: str, resolved: ResolvedModels, runs: int | None = None) -> None: + """Print resolved model plan for command execution.""" + console.print("[dim]Resolved model plan:[/dim]") + + if command == "generate": + console.print(f" Skill Generation Model: {resolved.skill_generation_model}") + console.print(f" Test Generation Model: {resolved.test_generation_model}") + console.print(f" Evaluation Model (main loop): {resolved.skill_generation_model}") + if resolved.extra_eval_model: + console.print(f" Evaluation Model (extra pass): {resolved.extra_eval_model}") + return + + if command in {"eval", "benchmark"}: + models = ", ".join(resolved.evaluation_models) + console.print(f" Evaluation Model(s): {models}") + if runs is not None: + console.print(f" Runs per model: {runs}") + baseline_state = "off (benchmark mode)" if resolved.is_benchmark_mode else ( + "on" if resolved.run_baseline else "off" + ) + console.print( + f" Baseline: {baseline_state}" + ) + console.print(f" Test Generation Model: {resolved.test_generation_model}") + + def _render_bar(value: float, width: int = 20) -> str: """Render a simple text bar for a 0-1 value.""" if width <= 0: @@ -163,6 +224,15 @@ def _load_eval_results(runs_path: Path) -> list[EvalPlotResult]: @click.version_option() def main(): """upskill - Generate and evaluate agent skills.""" + resolution = resolve_upskill_config_path() + if resolution.path is None: + console.print("[dim]Config source: defaults (no config file found)[/dim]") + return + + message = f"[dim]Config source: {resolution.source} ({resolution.path})[/dim]" + if not resolution.exists: + message += " [yellow](file missing; using defaults until saved)[/yellow]" + console.print(message) @main.command() @@ -179,11 +249,15 @@ def main(): @click.option( "-m", "--model", - help="Model for generation (e.g., 'sonnet', 'anthropic.claude-sonnet-4-20250514')", + help="Skill generation model for skill creation/refinement", +) +@click.option( + "--test-gen-model", + help="Override test generation model for this run", ) @click.option("-o", "--output", type=click.Path(), help="Output directory for skill") @click.option("--no-eval", is_flag=True, help="Skip eval and refinement") -@click.option("--eval-model", help="Model to evaluate skill on (different from generation model)") +@click.option("--eval-model", help="Optional extra cross-model eval pass after generation") @click.option("--runs-dir", type=click.Path(), help="Directory for run logs (default: ./runs)") @click.option("--log-runs/--no-log-runs", default=True, help="Log run data (default: enabled)") def generate( @@ -192,6 +266,7 @@ def generate( tool: str | None, # noqa: ARG001 from_source: str | None, model: str | None, + test_gen_model: str | None, output: str | None, no_eval: bool, eval_model: str | None, @@ -206,6 +281,8 @@ def generate( upskill generate "write git commits" --model sonnet + upskill generate "write git commits" --model sonnet --test-gen-model opus + upskill generate "handle API errors" --eval-model haiku upskill generate "validate forms" -o ./my-skills/validation @@ -243,6 +320,7 @@ def generate( from_skill, from_trace, model, + test_gen_model, output, no_eval, eval_model, @@ -258,6 +336,7 @@ async def _generate_async( from_skill: str | None, from_trace: str | None, model: str | None, + test_gen_model: str | None, output: str | None, no_eval: bool, eval_model: str | None, @@ -266,7 +345,26 @@ async def _generate_async( ): """Async implementation of generate command.""" config = 
Config.load() - gen_model = model or config.model + resolved = resolve_models( + "generate", + config=config, + cli_model=model, + cli_eval_model=eval_model, + cli_test_gen_model=test_gen_model, + ) + skill_gen_model = _require_resolved_model( + resolved.skill_generation_model, + field="skill_generation_model", + command="generate", + ) + test_gen_model = _require_resolved_model( + resolved.test_generation_model, + field="test_generation_model", + command="generate", + ) + extra_eval_model = resolved.extra_eval_model + + _print_model_plan("generate", resolved) # Setup run logging if enabled batch_id = None @@ -278,9 +376,7 @@ async def _generate_async( batch_id, batch_folder = create_batch_folder(runs_path) console.print(f"Logging runs to: {batch_folder}", style="dim") - - async with _fast_agent_context() as agent: - + async with _fast_agent_context(config) as agent: # Generate from trace file if from_trace: console.print(f"Generating skill from trace: {from_trace}", style="dim") @@ -300,47 +396,54 @@ async def _generate_async( trace_context = trace_content[:4000] task = f"{task}\n\nBased on this agent trace:\n\n{trace_context}" - console.print(f"Generating skill with {gen_model}...", style="dim") + console.print(f"Generating skill with {skill_gen_model}...", style="dim") + await _set_agent_model(agent.skill_gen, skill_gen_model) skill = await generate_skill( task=task, examples=examples, generator=agent.skill_gen, - model=model, + model=skill_gen_model, ) # Improve existing skill elif from_skill: existing_skill = Skill.load(Path(from_skill)) console.print( - f"Improving [bold]{existing_skill.name}[/bold] with {gen_model}...", + f"Improving [bold]{existing_skill.name}[/bold] with {skill_gen_model}...", style="dim", ) + await _set_agent_model(agent.skill_gen, skill_gen_model) skill = await improve_skill( existing_skill, instructions=task, generator=agent.skill_gen, - model=model, + model=skill_gen_model, ) else: - console.print(f"Generating skill with {gen_model}...", style="dim") + console.print(f"Generating skill with {skill_gen_model}...", style="dim") + await _set_agent_model(agent.skill_gen, skill_gen_model) skill = await generate_skill( task=task, examples=examples, generator=agent.skill_gen, - model=model, + model=skill_gen_model, ) if no_eval: _save_and_display(skill, output, config) return console.print("Generating test cases...", style="dim") - test_cases = await generate_tests(task, generator=agent.test_gen, model=model) + await _set_agent_model(agent.test_gen, test_gen_model) + test_cases = await generate_tests(task, generator=agent.test_gen, model=test_gen_model) - # Eval loop with refinement (on generation model) + # Eval loop with refinement (on skill generation model) prev_success_rate = 0.0 results = None attempts = max(1, config.max_refine_attempts) for attempt in range(attempts): - console.print(f"Evaluating on {gen_model}... (attempt {attempt + 1})", style="dim") + console.print( + f"Evaluating on {skill_gen_model}... 
(attempt {attempt + 1})", + style="dim", + ) # Create run folder for logging (2 folders per attempt: baseline + with_skill) run_folder = None @@ -350,7 +453,7 @@ async def _generate_async( write_run_metadata( run_folder, RunMetadata( - model=gen_model, + model=skill_gen_model, task=task, batch_id=batch_id or "", run_number=baseline_run_num, @@ -363,7 +466,8 @@ async def _generate_async( skill, test_cases=test_cases, evaluator=agent.evaluator, - model=gen_model, + model=skill_gen_model, + show_baseline_progress=False, ) # Log run results (both baseline and with-skill for plot command) @@ -371,7 +475,7 @@ async def _generate_async( # Log baseline result baseline_result = RunResult( metadata=RunMetadata( - model=gen_model, + model=skill_gen_model, task=task, batch_id=batch_id or "", run_number=baseline_run_num, @@ -390,7 +494,7 @@ async def _generate_async( with_skill_folder = create_run_folder(batch_folder, attempt * 2 + 2) with_skill_result = RunResult( metadata=RunMetadata( - model=gen_model, + model=skill_gen_model, task=task, batch_id=batch_id or "", run_number=attempt * 2 + 2, @@ -430,27 +534,28 @@ async def _generate_async( if attempt < attempts - 1: console.print("Refining...", style="dim") failures = get_failure_descriptions(results) + await _set_agent_model(agent.skill_gen, skill_gen_model) skill = await refine_skill( skill, failures, generator=agent.skill_gen, - model=model, + model=skill_gen_model, ) # If eval_model specified, also eval on that model eval_results = None - if eval_model: - console.print(f"Evaluating on {eval_model}...", style="dim") + if extra_eval_model: + console.print(f"Evaluating on {extra_eval_model}...", style="dim") # Create run folder for eval model run_folder = None if log_runs and batch_folder: - run_number = attempts + 1 + run_number = len(run_results) + 1 run_folder = create_run_folder(batch_folder, run_number) write_run_metadata( run_folder, RunMetadata( - model=eval_model, + model=extra_eval_model, task=task, batch_id=batch_id or "", run_number=run_number, @@ -461,7 +566,8 @@ async def _generate_async( skill, test_cases, evaluator=agent.evaluator, - model=eval_model, + model=extra_eval_model, + show_baseline_progress=False, ) # Log eval run results (both baseline and with-skill) @@ -469,7 +575,7 @@ async def _generate_async( # Log baseline result baseline_result = RunResult( metadata=RunMetadata( - model=eval_model, + model=extra_eval_model, task=task, batch_id=batch_id or "", run_number=run_number, @@ -488,7 +594,7 @@ async def _generate_async( with_skill_folder = create_run_folder(batch_folder, run_number + 1) with_skill_result = RunResult( metadata=RunMetadata( - model=eval_model, + model=extra_eval_model, task=task, batch_id=batch_id or "", run_number=run_number + 1, @@ -515,7 +621,7 @@ async def _generate_async( if log_runs and batch_folder and batch_id: summary = BatchSummary( batch_id=batch_id, - model=gen_model, + model=skill_gen_model, task=task, total_runs=len(run_results), passed_runs=sum(1 for r in run_results if r.passed), @@ -531,7 +637,15 @@ async def _generate_async( "[yellow]No evaluation results available; skipping report output.[/yellow]" ) - _save_and_display(skill, output, config, results, eval_results, gen_model, eval_model) + _save_and_display( + skill, + output, + config, + results, + eval_results, + skill_gen_model, + extra_eval_model, + ) @@ -543,7 +657,7 @@ def _save_and_display( config: Config, results=None, eval_results=None, - gen_model: str | None = None, + skill_gen_model: str | None = None, eval_model: str | None = 
None, ): """Save skill and display summary.""" @@ -573,7 +687,11 @@ def _save_and_display( if results and eval_results: # Multiple models - show each with bars console.print() - for model_name, r in [(gen_model or "gen", results), (eval_model or "eval", eval_results)]: + model_rows = [ + (skill_gen_model or "skill-gen", results), + (eval_model or "eval", eval_results), + ] + for model_name, r in model_rows: console.print(f" [bold]{model_name}[/bold]") baseline_bar = _render_bar(r.baseline_success_rate) with_skill_bar = _render_bar(r.with_skill_success_rate) @@ -621,18 +739,22 @@ def _save_and_display( @click.argument("skill_path", type=click.Path(exists=True)) @click.option("-t", "--tests", type=click.Path(exists=True), help="Test cases JSON file") @click.option( - "-m", "--model", "models", multiple=True, help="Model(s) to evaluate (repeatable)" + "-m", + "--model", + "models", + multiple=True, + help="Evaluation model(s) to run tests on (repeatable)", ) -@click.option("--runs", "num_runs", type=int, default=1, help="Number of runs per model") @click.option( - "--provider", - type=click.Choice(["anthropic", "openai", "generic"]), - help="API provider (auto-detected as 'generic' when --base-url is provided)", + "--test-gen-model", + help="Override test generation model when tests must be generated", ) +@click.option("--runs", "num_runs", type=int, default=1, help="Number of runs per model") @click.option( - "--base-url", help="Custom API endpoint for local models (e.g., http://localhost:8080/v1)" + "--no-baseline", + is_flag=True, + help="Skip baseline comparison in simple eval mode (ignored in benchmark mode)", ) -@click.option("--no-baseline", is_flag=True, help="Skip baseline comparison") @click.option("-v", "--verbose", is_flag=True, help="Show per-test results") @click.option("--log-runs/--no-log-runs", default=True, help="Log run data (default: enabled)") @click.option("--runs-dir", type=click.Path(), help="Directory for run logs") @@ -640,15 +762,17 @@ def eval_cmd( skill_path: str, tests: str | None, models: tuple[str, ...], + test_gen_model: str | None, num_runs: int, - provider: str | None, - base_url: str | None, no_baseline: bool, verbose: bool, log_runs: bool, runs_dir: str | None, ): - """Evaluate a skill (compares with vs without). + """Evaluate a skill. + + Uses simple eval mode for one model with ``--runs 1``. + Enters benchmark mode when using multiple ``-m`` values or ``--runs > 1``. 
Examples: @@ -658,19 +782,15 @@ def eval_cmd( upskill eval ./skills/my-skill/ -m haiku + upskill eval ./skills/my-skill/ -m haiku --test-gen-model opus + upskill eval ./skills/my-skill/ -m haiku -m sonnet upskill eval ./skills/my-skill/ -m haiku --runs 5 - # Local model with llama.cpp server: - - upskill eval ./skills/my-skill/ -m my-model \\ - --base-url http://localhost:8080/v1 - - # Local model with Ollama: + # Evaluate local models configured in fast-agent - upskill eval ./skills/my-skill/ -m llama3.2:latest \\ - --base-url http://localhost:11434/v1 + upskill eval ./skills/my-skill/ -m generic.llama3.2:latest upskill eval ./skills/my-skill/ --no-log-runs """ @@ -679,9 +799,8 @@ def eval_cmd( skill_path, tests, list(models) if models else None, + test_gen_model, num_runs, - provider, - base_url, no_baseline, verbose, log_runs, @@ -694,9 +813,8 @@ async def _eval_async( skill_path: str, tests: str | None, models: list[str] | None, + test_gen_model: str | None, num_runs: int, - provider: str | None, - base_url: str | None, no_baseline: bool, verbose: bool, log_runs: bool, @@ -706,6 +824,31 @@ async def _eval_async( from upskill.evaluate import run_test config = Config.load() + resolved = resolve_models( + "eval", + config=config, + cli_models=models, + cli_test_gen_model=test_gen_model, + num_runs=num_runs, + no_baseline=no_baseline, + ) + evaluation_models = _require_resolved_models( + resolved.evaluation_models, + field="evaluation_models", + command="eval", + ) + test_gen_model = _require_resolved_model( + resolved.test_generation_model, + field="test_generation_model", + command="eval", + ) + + _print_model_plan("eval", resolved, runs=num_runs) + if resolved.is_benchmark_mode and no_baseline: + console.print( + "[dim]Note: --no-baseline is redundant in benchmark mode and is ignored.[/dim]" + ) + skill_dir = Path(skill_path) try: @@ -714,13 +857,7 @@ async def _eval_async( console.print(f"[red]No SKILL.md found in {skill_dir}[/red]") sys.exit(1) - # Use default model if none specified - if not models: - models = [config.effective_eval_model] - - is_benchmark_mode = len(models) > 1 or num_runs > 1 - - async with _fast_agent_context() as agent: + async with _fast_agent_context(config) as agent: # Load test cases test_cases: list[TestCase] = [] if tests: @@ -736,10 +873,11 @@ async def _eval_async( test_source = "skill_meta.json" else: console.print("Generating test cases from skill...", style="dim") + await _set_agent_model(agent.test_gen, test_gen_model) test_cases = await generate_tests( skill.description, generator=agent.test_gen, - model=models[0], + model=test_gen_model, ) test_source = "generated" @@ -764,26 +902,20 @@ async def _eval_async( batch_id, batch_folder = create_batch_folder(runs_path) console.print(f"Logging to: {batch_folder}", style="dim") - provider_info = "" - if provider: - provider_info += f" via {provider}" - if base_url: - provider_info += f" @ {base_url}" - - if is_benchmark_mode: + if resolved.is_benchmark_mode: # Benchmark mode: multiple models and/or runs console.print( - f"\nEvaluating [bold]{skill.name}[/bold] across {len(models)} model(s)" + f"\nEvaluating [bold]{skill.name}[/bold] across {len(evaluation_models)} model(s)" ) console.print( f" {len(test_cases)} test case(s), " - f"{num_runs} run(s) per model{provider_info}\n" + f"{num_runs} run(s) per model\n" ) - model_results: dict[str, list[RunResult]] = {m: [] for m in models} + model_results: dict[str, list[RunResult]] = {m: [] for m in evaluation_models} all_run_results: list[RunResult] = [] - for 
model in models: + for model in evaluation_models: console.print(f"[bold]{model}[/bold]") for run_num in range(1, num_runs + 1): @@ -811,6 +943,10 @@ async def _eval_async( tc, evaluator=agent.evaluator, skill=skill, + model=model, + instance_name=( + f"eval ({model} run {run_num} test {tc_idx})" + ), ) except Exception as e: console.print(f" [red]Test error: {e}[/red]") @@ -905,7 +1041,7 @@ async def _eval_async( if log_runs and batch_folder and batch_id: summary = BatchSummary( batch_id=batch_id, - model=", ".join(models), + model=", ".join(evaluation_models), task=skill.description, total_runs=len(all_run_results), passed_runs=sum(1 for r in all_run_results if r.passed), @@ -915,22 +1051,23 @@ async def _eval_async( else: # Simple eval mode: single model, single run - model = models[0] - console.print(f"Running {len(test_cases)} test cases{provider_info}...", style="dim") + model = evaluation_models[0] + console.print(f"Running {len(test_cases)} test cases...", style="dim") results = await evaluate_skill( skill, test_cases, evaluator=agent.evaluator, model=model, - run_baseline=not no_baseline, + run_baseline=resolved.run_baseline, + show_baseline_progress=verbose, ) # Log results (both baseline and with-skill) run_results: list[RunResult] = [] if log_runs and batch_folder: # Log baseline result - if not no_baseline: + if resolved.run_baseline: baseline_folder = create_run_folder(batch_folder, 1) baseline_result = RunResult( metadata=RunMetadata( @@ -951,17 +1088,20 @@ async def _eval_async( run_results.append(baseline_result) # Log with-skill result - with_skill_folder = create_run_folder(batch_folder, 2 if not no_baseline else 1) + with_skill_folder = create_run_folder( + batch_folder, + 2 if resolved.run_baseline else 1, + ) with_skill_result = RunResult( metadata=RunMetadata( model=model, task=skill.description, batch_id=batch_id or "", - run_number=2 if not no_baseline else 1, + run_number=2 if resolved.run_baseline else 1, ), stats=aggregate_conversation_stats(results.with_skill_results), passed=results.is_beneficial - if not no_baseline + if resolved.run_baseline else results.with_skill_success_rate > 0.5, assertions_passed=int(results.with_skill_success_rate * len(test_cases)), assertions_total=len(test_cases), @@ -983,7 +1123,7 @@ async def _eval_async( ) write_batch_summary(batch_folder, summary) - if verbose and not no_baseline: + if verbose and resolved.run_baseline: console.print() for i, (with_r, base_r) in enumerate( zip(results.with_skill_results, results.baseline_results), 1 @@ -996,7 +1136,7 @@ async def _eval_async( # Display results with horizontal bars console.print() - if not no_baseline: + if resolved.run_baseline: baseline_rate = results.baseline_success_rate with_skill_rate = results.with_skill_success_rate lift = results.skill_lift @@ -1031,7 +1171,7 @@ async def _eval_async( console.print(f" with skill {with_skill_bar} {with_skill_rate:>5.0%}") console.print(f" tokens: {results.with_skill_total_tokens}") - if not no_baseline: + if resolved.run_baseline: if results.is_beneficial: console.print("\n[green]Recommendation: keep skill[/green]") else: @@ -1118,9 +1258,20 @@ def list_cmd(skills_dir: str | None, verbose: bool): @main.command("benchmark") @click.argument("skill_path", type=click.Path(exists=True)) -@click.option("-m", "--model", "models", multiple=True, required=True, help="Model to benchmark") +@click.option( + "-m", + "--model", + "models", + multiple=True, + required=True, + help="Evaluation model(s) to benchmark (repeatable)", +) 
@click.option("--runs", "num_runs", type=int, default=3, help="Runs per model (default: 3)") @click.option("-t", "--tests", type=click.Path(exists=True), help="Test cases JSON file") +@click.option( + "--test-gen-model", + help="Override test generation model when tests must be generated", +) @click.option("-o", "--output", type=click.Path(), help="Output directory for results") @click.option("-v", "--verbose", is_flag=True, help="Show per-run details") def benchmark_cmd( @@ -1128,6 +1279,7 @@ def benchmark_cmd( models: tuple[str, ...], num_runs: int, tests: str | None, + test_gen_model: str | None, output: str | None, verbose: bool, ): @@ -1141,13 +1293,21 @@ def benchmark_cmd( upskill benchmark ./skills/hf-eval-extraction/ -m haiku -m sonnet + upskill benchmark ./skills/hf-eval-extraction/ -m haiku -m sonnet --test-gen-model opus + upskill benchmark ./skills/my-skill/ -m gpt-4o -m claude-sonnet --runs 5 upskill benchmark ./skills/my-skill/ -m haiku -t ./custom_tests.json -v """ asyncio.run( _benchmark_async( - skill_path, list(models), num_runs, tests, output, verbose + skill_path, + list(models), + test_gen_model, + num_runs, + tests, + output, + verbose, ) ) @@ -1155,6 +1315,7 @@ def benchmark_cmd( async def _benchmark_async( skill_path: str, models: list[str], + test_gen_model: str | None, num_runs: int, tests_path: str | None, output_dir: str | None, @@ -1164,9 +1325,29 @@ async def _benchmark_async( from upskill.evaluate import run_test config = Config.load() + resolved = resolve_models( + "benchmark", + config=config, + cli_models=models, + cli_test_gen_model=test_gen_model, + num_runs=num_runs, + ) + evaluation_models = _require_resolved_models( + resolved.evaluation_models, + field="evaluation_models", + command="benchmark", + ) + test_gen_model = _require_resolved_model( + resolved.test_generation_model, + field="test_generation_model", + command="benchmark", + ) + + _print_model_plan("benchmark", resolved, runs=num_runs) + skill = Skill.load(Path(skill_path)) - async with _fast_agent_context() as agent: + async with _fast_agent_context(config) as agent: # Load test cases if tests_path: with open(tests_path, encoding="utf-8") as f: @@ -1179,9 +1360,11 @@ async def _benchmark_async( test_cases = skill.tests else: console.print("Generating test cases from skill...", style="dim") + await _set_agent_model(agent.test_gen, test_gen_model) test_cases = await generate_tests( skill.description, generator=agent.test_gen, + model=test_gen_model, ) # Setup output directory @@ -1194,13 +1377,15 @@ async def _benchmark_async( console.print(f"Results will be saved to: {batch_folder}", style="dim") # Track results per model - model_results: dict[str, list[RunResult]] = {m: [] for m in models} + model_results: dict[str, list[RunResult]] = {m: [] for m in evaluation_models} all_run_results: list[RunResult] = [] - console.print(f"\nBenchmarking [bold]{skill.name}[/bold] across {len(models)} model(s)") + console.print( + f"\nBenchmarking [bold]{skill.name}[/bold] across {len(evaluation_models)} model(s)" + ) console.print(f" {len(test_cases)} test case(s), {num_runs} run(s) per model\n") - for model in models: + for model in evaluation_models: console.print(f"[bold]{model}[/bold]") for run_num in range(1, num_runs + 1): @@ -1221,6 +1406,10 @@ async def _benchmark_async( tc, evaluator=agent.evaluator, skill=skill, + model=model, + instance_name=( + f"benchmark ({model} run {run_num} test {tc_idx})" + ), ) except Exception as e: console.print(f" [red]Test error: {e}[/red]") @@ -1309,7 +1498,7 @@ 
async def _benchmark_async( summary = BatchSummary( batch_id=batch_id, - model=", ".join(models), + model=", ".join(evaluation_models), task=skill.description, total_runs=len(all_run_results), passed_runs=sum(1 for r in all_run_results if r.passed), @@ -1320,7 +1509,13 @@ async def _benchmark_async( @main.command("runs") @click.option("-d", "--dir", "runs_dir", type=click.Path(exists=True), help="Runs directory") @click.option("-s", "--skill", "skills", multiple=True, help="Filter by skill name(s)") -@click.option("-m", "--model", "models", multiple=True, help="Filter by model(s)") +@click.option( + "-m", + "--model", + "models", + multiple=True, + help="Filter historical run data by model(s)", +) @click.option("--csv", "csv_output", type=click.Path(), help="Export to CSV file") @click.option( "--metric", @@ -1436,7 +1631,13 @@ def runs_cmd( @main.command("plot", hidden=True) @click.option("-d", "--dir", "runs_dir", type=click.Path(exists=True), help="Runs directory") @click.option("-s", "--skill", "skills", multiple=True, help="Filter by skill name(s)") -@click.option("-m", "--model", "models", multiple=True, help="Filter by model(s)") +@click.option( + "-m", + "--model", + "models", + multiple=True, + help="Filter historical run data by model(s)", +) @click.option( "--metric", type=click.Choice(["success", "tokens"]), diff --git a/src/upskill/config.py b/src/upskill/config.py index c253884..01f1f5c 100644 --- a/src/upskill/config.py +++ b/src/upskill/config.py @@ -2,10 +2,25 @@ from __future__ import annotations +import os +from dataclasses import dataclass from pathlib import Path import yaml -from pydantic import BaseModel, Field +from pydantic import AliasChoices, BaseModel, ConfigDict, Field + +UPSKILL_CONFIG_FILE = "upskill.config.yaml" +LEGACY_CONFIG_FILE = "config.yaml" +UPSKILL_CONFIG_ENV = "UPSKILL_CONFIG" + + +@dataclass(frozen=True) +class UpskillConfigPathResolution: + """Resolved upskill config path and where it was found.""" + + path: Path | None + source: str + exists: bool def get_config_dir() -> Path: @@ -13,6 +28,57 @@ def get_config_dir() -> Path: return Path.home() / ".config" / "upskill" +def get_local_config_path() -> Path: + """Get the project-local upskill config path.""" + return Path.cwd() / UPSKILL_CONFIG_FILE + + +def get_legacy_config_path() -> Path: + """Get the legacy user-level upskill config path.""" + return get_config_dir() / LEGACY_CONFIG_FILE + + +def find_upskill_config_path() -> Path | None: + """Find upskill config path in priority order.""" + return resolve_upskill_config_path().path + + +def resolve_upskill_config_path() -> UpskillConfigPathResolution: + """Find upskill config path in priority order. + + Priority: + 1. UPSKILL_CONFIG env var + 2. ./upskill.config.yaml (project local) + 3. 
~/.config/upskill/config.yaml (legacy) + """ + config_override = os.getenv(UPSKILL_CONFIG_ENV) + if config_override: + override_path = Path(config_override).expanduser() + return UpskillConfigPathResolution( + path=override_path, + source=f"{UPSKILL_CONFIG_ENV} env var", + exists=override_path.exists(), + ) + + local_config = get_local_config_path() + if local_config.exists(): + return UpskillConfigPathResolution( + path=local_config, + source="project-local config", + exists=True, + ) + + legacy_config = get_legacy_config_path() + if legacy_config.exists(): + return UpskillConfigPathResolution( + path=legacy_config, + source="legacy user config", + exists=True, + ) + + return UpskillConfigPathResolution(path=None, source="defaults", exists=False) + + def get_default_skills_dir() -> Path: """Get the default skills directory (current working directory).""" return Path.cwd() / "skills" @@ -38,9 +104,22 @@ def find_config_path() -> Path: class Config(BaseModel): """upskill configuration.""" + model_config = ConfigDict(populate_by_name=True) + # Model settings - model: str = Field(default="sonnet", description="Model for generation (FastAgent format)") - eval_model: str | None = Field(default=None, description="Model for eval (defaults to model)") + skill_generation_model: str = Field( + default="sonnet", + validation_alias=AliasChoices("skill_generation_model", "model"), + description="Model for skill generation (FastAgent format)", + ) + eval_model: str | None = Field( + default=None, + description="Model for evaluation (defaults to skill_generation_model)", + ) + test_gen_model: str | None = Field( + default=None, + description="Model for test generation (defaults to skill generation model)", + ) # Directory settings skills_dir: Path = Field( @@ -60,7 +139,10 @@ class Config(BaseModel): @classmethod def load(cls) -> Config: """Load config from file, or return defaults.""" - config_path = get_config_dir() / "config.yaml" + config_path = find_upskill_config_path() + if config_path is None: + return cls() + if config_path.exists(): with open(config_path) as f: data = yaml.safe_load(f) or {} @@ -72,13 +154,14 @@ def load(cls) -> Config: if "fastagent_config" in data and isinstance(data["fastagent_config"], str): data["fastagent_config"] = Path(data["fastagent_config"]) return cls(**data) + return cls() def save(self) -> None: """Save config to file.""" - config_dir = get_config_dir() - config_dir.mkdir(parents=True, exist_ok=True) - config_path = config_dir / "config.yaml" + config_path = find_upskill_config_path() or get_local_config_path() + config_path.parent.mkdir(parents=True, exist_ok=True) + data = self.model_dump(mode="json") # Convert Path objects to strings for YAML data["skills_dir"] = str(self.skills_dir) @@ -91,7 +174,12 @@ def save(self) -> None: @property def effective_eval_model(self) -> str: """Get the model to use for evaluation.""" - return self.eval_model or self.model + return self.eval_model or self.skill_generation_model + + @property + def model(self) -> str: + """Backward-compatible alias for ``skill_generation_model``.""" + return self.skill_generation_model @property def effective_fastagent_config(self) -> Path: diff --git a/src/upskill/evaluate.py b/src/upskill/evaluate.py index 08b4300..fdc07c4 100644 --- a/src/upskill/evaluate.py +++ b/src/upskill/evaluate.py @@ -7,12 +7,17 @@ import shutil import tempfile from collections.abc import Generator -from contextlib import contextmanager +from contextlib import contextmanager, nullcontext from pathlib import Path from 
fast_agent import ConversationSummary from fast_agent.agents.llm_agent import LlmAgent +try: + from fast_agent.ui.rich_progress import progress_display +except Exception: # pragma: no cover - defensive import for older fast-agent versions + progress_display = None + from upskill.fastagent_integration import ( compose_instruction, ) @@ -28,6 +33,20 @@ ) from upskill.validators import get_validator + +def _hide_progress_task(task_name: str | None) -> None: + """Best-effort hide of a completed task from the shared progress display.""" + if not task_name or progress_display is None: + return + hide_task = getattr(progress_display, "hide_task", None) + if not callable(hide_task): + return + try: + hide_task(task_name) + except Exception: + # Progress cleanup is best-effort and should never fail evaluations. + return + logger = logging.getLogger(__name__) PROMPT = ( @@ -171,6 +190,7 @@ async def _run_in_workspace(workspace: Path | None) -> TestResult: await clone.shutdown() except Exception as exc: logger.exception("Failed to shutdown evaluator clone", exc_info=exc) + _hide_progress_task(instance_name) if needs_workspace: with isolated_workspace() as workspace: @@ -184,6 +204,7 @@ async def run_test( skill: Skill | None, use_workspace: bool | None = None, model: str | None = None, + instance_name: str | None = None, ) -> TestResult: """Run a single test case using an evaluator agent. @@ -192,6 +213,8 @@ async def run_test( evaluator: Evaluator agent to run the test case skill: Optional skill to inject (None for baseline) use_workspace: Force workspace isolation (auto-detected from test_case.validator) + model: Model to evaluate with for this test case + instance_name: Optional evaluator instance display name """ try: @@ -203,6 +226,7 @@ async def run_test( evaluator, instruction, use_workspace=use_workspace, + instance_name=instance_name, ) except Exception as exc: return TestResult(test_case=test_case, success=False, error=str(exc)) @@ -214,6 +238,7 @@ async def evaluate_skill( evaluator: LlmAgent, model: str | None = None, run_baseline: bool = True, + show_baseline_progress: bool = False, ) -> EvalResults: """Evaluate a skill against test cases using FastAgent. 
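# Editorial usage sketch, not part of this patch: how the CLI paths above drive the
# updated `evaluate_skill` signature. `agent`, `skill`, and `test_cases` are assumed to
# be provided exactly as in `_eval_async`; `show_baseline_progress` only controls whether
# the baseline pass renders rows in the shared progress display.
async def _example_eval_call(agent, skill, test_cases):
    return await evaluate_skill(
        skill,
        test_cases,
        evaluator=agent.evaluator,
        model="haiku",                  # evaluation model for this pass
        run_baseline=True,              # simple eval mode keeps the baseline comparison
        show_baseline_progress=False,   # keep baseline rows out of the progress display
    )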
@@ -223,6 +248,7 @@ async def evaluate_skill( evaluator: Evaluator agent to run the test cases model: Model to evaluate on (defaults to config.eval_model) run_baseline: Whether to also run without the skill + show_baseline_progress: Whether to render baseline progress output Returns: EvalResults comparing skill vs baseline @@ -238,7 +264,7 @@ async def _run_batch( ) -> list[TestResult]: tasks = [] for index, tc in enumerate(test_cases, start=1): - instance_name = f"{evaluator.name}[{label}-{index}]" + instance_name = f"eval ({label} test {index})" tasks.append( _run_test_with_evaluator( tc, @@ -254,7 +280,7 @@ async def _run_batch( # Run with skill skill_instruction = compose_instruction(base_instruction, skill) - results.with_skill_results = await _run_batch(skill_instruction, "skill") + results.with_skill_results = await _run_batch(skill_instruction, "with-skill") # Calculate with-skill metrics successes = sum(1 for r in results.with_skill_results if r.success) @@ -270,7 +296,14 @@ async def _run_batch( # Run baseline if requested if run_baseline: - results.baseline_results = await _run_batch(None, "baseline") + pause_cm = nullcontext() + if not show_baseline_progress and progress_display is not None: + paused = getattr(progress_display, "paused", None) + if callable(paused): + pause_cm = paused() + + with pause_cm: + results.baseline_results = await _run_batch(None, "baseline") successes = sum(1 for r in results.baseline_results if r.success) results.baseline_success_rate = successes / len(test_cases) if test_cases else 0 diff --git a/src/upskill/generate.py b/src/upskill/generate.py index 4b6e054..1cca065 100644 --- a/src/upskill/generate.py +++ b/src/upskill/generate.py @@ -235,14 +235,14 @@ async def improve_skill( Args: skill: The existing skill to improve instructions: What improvements to make - model: Model to use for generation + model: Model to use for skill generation config: Configuration Returns: Improved Skill object """ # config = config or Config.load() - # model = model or config.model + # model = model or config.skill_generation_model prompt = IMPROVE_PROMPT.format( name=skill.name, diff --git a/src/upskill/model_resolution.py b/src/upskill/model_resolution.py new file mode 100644 index 0000000..e7ae7f3 --- /dev/null +++ b/src/upskill/model_resolution.py @@ -0,0 +1,95 @@ +"""Model resolution helpers for CLI commands. + +This module centralizes command-specific model fallback and mode selection logic. +""" + +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Literal + +from upskill.config import Config + +CommandName = Literal["generate", "eval", "benchmark"] + + +@dataclass(frozen=True) +class ResolvedModels: + """Resolved model plan for a command invocation.""" + + skill_generation_model: str | None = None + test_generation_model: str | None = None + evaluation_models: list[str] = field(default_factory=list) + extra_eval_model: str | None = None + is_benchmark_mode: bool = False + run_baseline: bool = True + + +def resolve_models( + command: CommandName, + *, + config: Config, + cli_model: str | None = None, + cli_models: list[str] | tuple[str, ...] | None = None, + cli_eval_model: str | None = None, + cli_test_gen_model: str | None = None, + num_runs: int = 1, + no_baseline: bool = False, +) -> ResolvedModels: + """Resolve all models and mode flags for a command. + + Args: + command: CLI command name. + config: Loaded upskill configuration. + cli_model: Single ``--model`` value for ``generate``. 
+ cli_models: Repeatable ``-m/--model`` values for ``eval``/``benchmark``. + cli_eval_model: Optional ``--eval-model`` value for ``generate``. + cli_test_gen_model: Optional ``--test-gen-model`` override. + num_runs: ``--runs`` value. + no_baseline: Whether ``--no-baseline`` was passed. + + Returns: + A ``ResolvedModels`` instance containing command-specific resolved model fields. + """ + + if command == "generate": + skill_generation_model = cli_model or config.skill_generation_model + test_generation_model = ( + cli_test_gen_model or config.test_gen_model or skill_generation_model + ) + return ResolvedModels( + skill_generation_model=skill_generation_model, + test_generation_model=test_generation_model, + evaluation_models=[skill_generation_model], + extra_eval_model=cli_eval_model, + is_benchmark_mode=False, + run_baseline=True, + ) + + if command == "eval": + evaluation_models = list(cli_models) if cli_models else [config.effective_eval_model] + is_benchmark_mode = len(evaluation_models) > 1 or num_runs > 1 + run_baseline = (not no_baseline) if not is_benchmark_mode else False + return ResolvedModels( + test_generation_model=( + cli_test_gen_model or config.test_gen_model or config.skill_generation_model + ), + evaluation_models=evaluation_models, + is_benchmark_mode=is_benchmark_mode, + run_baseline=run_baseline, + ) + + if command == "benchmark": + evaluation_models = list(cli_models) if cli_models else [] + if not evaluation_models: + raise ValueError("benchmark requires at least one model") + return ResolvedModels( + test_generation_model=( + cli_test_gen_model or config.test_gen_model or config.skill_generation_model + ), + evaluation_models=evaluation_models, + is_benchmark_mode=True, + run_baseline=False, + ) + + raise ValueError(f"Unsupported command: {command}") diff --git a/tests/conftest.py b/tests/conftest.py new file mode 100644 index 0000000..fdcbc1f --- /dev/null +++ b/tests/conftest.py @@ -0,0 +1,9 @@ +from __future__ import annotations + +import sys +from pathlib import Path + +ROOT = Path(__file__).resolve().parents[1] +SRC = ROOT / "src" +if str(SRC) not in sys.path: + sys.path.insert(0, str(SRC)) diff --git a/tests/test_agent_card_guardrails.py b/tests/test_agent_card_guardrails.py new file mode 100644 index 0000000..fe8dfe5 --- /dev/null +++ b/tests/test_agent_card_guardrails.py @@ -0,0 +1,64 @@ +from __future__ import annotations + +from pathlib import Path + +import pytest + +AGENT_CARDS_DIR = Path("src/upskill/agent_cards") +GUARDED_CARDS = ("skill_gen.md", "test_gen.md") +# Intentional exceptions require both allowlist entry and frontmatter annotation. 
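# For illustration only (hypothetical values, not part of this patch): an intentional,
# temporary pin would pair an allowlist entry such as
#     ALLOWED_MODEL_PIN_OVERRIDES = {"test_gen.md": "temporary pin while tuning reasoning"}
# with `allow_model_pin: true` in that card's frontmatter; both conditions are checked below.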
+ALLOWED_MODEL_PIN_OVERRIDES: dict[str, str] = {} + + +def _parse_frontmatter(path: Path) -> dict[str, str]: + text = path.read_text(encoding="utf-8") + if not text.startswith("---"): + return {} + + lines = text.splitlines() + if not lines or lines[0] != "---": + return {} + + frontmatter_lines: list[str] = [] + for line in lines[1:]: + if line == "---": + break + frontmatter_lines.append(line) + + data: dict[str, str] = {} + for line in frontmatter_lines: + stripped = line.strip() + if not stripped or stripped.startswith("#") or ":" not in stripped: + continue + key, value = stripped.split(":", 1) + data[key.strip()] = value.strip() + + return data + + +@pytest.mark.parametrize("card_name", GUARDED_CARDS) +def test_guarded_agent_cards_do_not_pin_model_unless_explicitly_allowed(card_name: str) -> None: + card_path = AGENT_CARDS_DIR / card_name + assert card_path.exists(), f"Missing guarded agent card: {card_path}" + + frontmatter = _parse_frontmatter(card_path) + if "model" not in frontmatter: + return + + assert card_name in ALLOWED_MODEL_PIN_OVERRIDES, ( + f"Unexpected model pin in {card_name}. Remove `model:` from frontmatter or add an " + "explicit temporary override in ALLOWED_MODEL_PIN_OVERRIDES with a justification." + ) + assert frontmatter.get("allow_model_pin", "").lower() == "true", ( + f"{card_name} is allowlisted but missing `allow_model_pin: true` annotation in frontmatter." + ) + + +def test_default_guarded_cards_have_no_model_pin() -> None: + """Regression guard: current default cards should not define a model pin.""" + for card_name in GUARDED_CARDS: + frontmatter = _parse_frontmatter(AGENT_CARDS_DIR / card_name) + assert "model" not in frontmatter, ( + f"Unexpected model pin in guarded card {card_name}. " + "Model selection should come from runtime resolution." 
+ ) diff --git a/tests/test_config.py b/tests/test_config.py new file mode 100644 index 0000000..2ec6e1f --- /dev/null +++ b/tests/test_config.py @@ -0,0 +1,48 @@ +from __future__ import annotations + +import yaml + +from upskill.config import ( + UPSKILL_CONFIG_ENV, + Config, + find_upskill_config_path, + resolve_upskill_config_path, +) + + +def test_find_upskill_config_path_uses_env_override_when_file_is_missing( + tmp_path, monkeypatch +) -> None: + override_path = tmp_path / "custom" / "upskill.yaml" + monkeypatch.setenv(UPSKILL_CONFIG_ENV, str(override_path)) + monkeypatch.chdir(tmp_path) + + assert find_upskill_config_path() == override_path + + +def test_resolve_upskill_config_path_reports_missing_env_override(tmp_path, monkeypatch) -> None: + override_path = tmp_path / "custom" / "upskill.yaml" + monkeypatch.setenv(UPSKILL_CONFIG_ENV, str(override_path)) + + resolution = resolve_upskill_config_path() + + assert resolution.path == override_path + assert resolution.source == f"{UPSKILL_CONFIG_ENV} env var" + assert resolution.exists is False + + +def test_config_save_uses_env_override_path_when_file_is_missing(tmp_path, monkeypatch) -> None: + override_path = tmp_path / "custom" / "upskill.yaml" + monkeypatch.setenv(UPSKILL_CONFIG_ENV, str(override_path)) + monkeypatch.chdir(tmp_path) + + config = Config(skill_generation_model="haiku") + config.save() + + assert override_path.exists() + assert (tmp_path / "upskill.config.yaml").exists() is False + + with open(override_path, encoding="utf-8") as f: + saved = yaml.safe_load(f) or {} + + assert saved["skill_generation_model"] == "haiku" diff --git a/tests/test_model_resolution.py b/tests/test_model_resolution.py new file mode 100644 index 0000000..c0b330c --- /dev/null +++ b/tests/test_model_resolution.py @@ -0,0 +1,163 @@ +from __future__ import annotations + +import pytest + +from upskill.config import Config +from upskill.model_resolution import resolve_models + + +def test_resolve_generate_uses_generation_model_for_test_gen_by_default() -> None: + config = Config(skill_generation_model="sonnet", eval_model="haiku", test_gen_model=None) + + resolved = resolve_models("generate", config=config) + + assert resolved.skill_generation_model == "sonnet" + assert resolved.test_generation_model == "sonnet" + assert resolved.extra_eval_model is None + assert resolved.is_benchmark_mode is False + assert resolved.skill_generation_model is not None + assert resolved.test_generation_model is not None + + +def test_resolve_generate_honors_test_gen_model_config() -> None: + config = Config(skill_generation_model="sonnet", test_gen_model="haiku") + + resolved = resolve_models("generate", config=config, cli_model="opus", cli_eval_model="haiku") + + assert resolved.skill_generation_model == "opus" + assert resolved.test_generation_model == "haiku" + assert resolved.extra_eval_model == "haiku" + + +def test_resolve_generate_cli_test_gen_model_overrides_config() -> None: + config = Config(skill_generation_model="sonnet", test_gen_model="haiku") + + resolved = resolve_models( + "generate", + config=config, + cli_model="sonnet", + cli_test_gen_model="opus", + ) + + assert resolved.test_generation_model == "opus" + + +def test_resolve_eval_defaults_and_simple_mode() -> None: + config = Config(skill_generation_model="sonnet", eval_model="haiku", test_gen_model=None) + + resolved = resolve_models("eval", config=config, cli_models=None, num_runs=1, no_baseline=False) + + assert resolved.evaluation_models == ["haiku"] + assert resolved.test_generation_model == 
"sonnet" + assert resolved.is_benchmark_mode is False + assert resolved.run_baseline is True + assert resolved.evaluation_models + + +def test_resolve_eval_cli_test_gen_model_overrides_config() -> None: + config = Config(skill_generation_model="sonnet", test_gen_model="haiku") + + resolved = resolve_models( + "eval", + config=config, + cli_models=["kimi"], + cli_test_gen_model="opus", + ) + + assert resolved.test_generation_model == "opus" + + +def test_resolve_eval_benchmark_mode_disables_baseline() -> None: + config = Config(skill_generation_model="sonnet", eval_model="haiku") + + resolved = resolve_models( + "eval", + config=config, + cli_models=["haiku", "sonnet"], + num_runs=3, + no_baseline=False, + ) + + assert resolved.evaluation_models == ["haiku", "sonnet"] + assert resolved.is_benchmark_mode is True + assert resolved.run_baseline is False + + +def test_resolve_eval_simple_mode_respects_no_baseline() -> None: + config = Config(skill_generation_model="sonnet", eval_model="haiku") + + resolved = resolve_models( + "eval", + config=config, + cli_models=["haiku"], + num_runs=1, + no_baseline=True, + ) + + assert resolved.is_benchmark_mode is False + assert resolved.run_baseline is False + + +def test_resolve_benchmark_requires_models() -> None: + config = Config(skill_generation_model="sonnet") + + with pytest.raises(ValueError): + resolve_models("benchmark", config=config, cli_models=[]) + + +def test_resolve_benchmark_uses_config_test_generation_fallback() -> None: + config = Config(skill_generation_model="sonnet", test_gen_model="opus") + + resolved = resolve_models("benchmark", config=config, cli_models=["haiku"], num_runs=2) + + assert resolved.evaluation_models == ["haiku"] + assert resolved.test_generation_model == "opus" + assert resolved.is_benchmark_mode is True + assert resolved.run_baseline is False + + +def test_resolve_benchmark_cli_test_gen_model_overrides_config() -> None: + config = Config(skill_generation_model="sonnet", test_gen_model="haiku") + + resolved = resolve_models( + "benchmark", + config=config, + cli_models=["kimi"], + cli_test_gen_model="opus", + ) + + assert resolved.test_generation_model == "opus" + + +def test_resolve_eval_prefers_cli_models_over_config_default() -> None: + config = Config(skill_generation_model="sonnet", eval_model="haiku") + + resolved = resolve_models( + "eval", + config=config, + cli_models=["opus"], + ) + + assert resolved.evaluation_models == ["opus"] + + +def test_resolve_unsupported_command_raises() -> None: + config = Config(skill_generation_model="sonnet") + + with pytest.raises(ValueError, match="Unsupported command"): + resolve_models("not-a-command", config=config) # type: ignore[arg-type] + + +def test_config_legacy_model_key_maps_to_skill_generation_model() -> None: + config = Config(model="haiku") + + assert config.skill_generation_model == "haiku" + assert config.model == "haiku" + + +def test_config_dump_uses_skill_generation_model_key() -> None: + config = Config(skill_generation_model="sonnet") + + dumped = config.model_dump(mode="json") + assert dumped["skill_generation_model"] == "sonnet" + assert "model" not in dumped diff --git a/upskill.config.yaml b/upskill.config.yaml new file mode 100644 index 0000000..3ddf1e2 --- /dev/null +++ b/upskill.config.yaml @@ -0,0 +1,17 @@ +# upskill project configuration + +# Default model for skill generation. +model: sonnet + +# Optional separate model for evaluation. If omitted, uses `model`. +# eval_model: haiku + +# Output directories. 
+skills_dir: ./skills
+runs_dir: ./runs
+
+# Number of refinement passes during `upskill generate`.
+max_refine_attempts: 2
+
+# Optional override for fast-agent config file.
+# fastagent_config: ./fastagent.config.yaml
diff --git a/uv.lock b/uv.lock
index dd46d91..9caa95c 100644
--- a/uv.lock
+++ b/uv.lock
@@ -20,14 +20,14 @@ wheels = [
 
 [[package]]
 name = "agent-client-protocol"
-version = "0.7.1"
+version = "0.8.1"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "pydantic" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/db/7c/12da39be4f73026fd9b02144df5f64d803488cf1439aa221b0edb7c305e3/agent_client_protocol-0.7.1.tar.gz", hash = "sha256:8d7031209e14c3f2f987e3b95e7d9c3286158e7b2af1bf43d6aae5b8a429249f", size = 66226, upload-time = "2025-12-28T13:58:57.012Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/1b/7b/7cdac86db388809d9e3bc58cac88cc7dfa49b7615b98fab304a828cd7f8a/agent_client_protocol-0.8.1.tar.gz", hash = "sha256:1bbf15663bf51f64942597f638e32a6284c5da918055d9672d3510e965143dbd", size = 68866, upload-time = "2026-02-13T15:34:54.567Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/03/48/48d2fb454f911147432cd779f548e188274e1700f1cbe0a258e78158331a/agent_client_protocol-0.7.1-py3-none-any.whl", hash = "sha256:4ffe999488f2b23db26f09becdfaa2aaae6529f0847a52bca61bc2c628001c0f", size = 53771, upload-time = "2025-12-28T13:58:55.967Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/f3/219eeca0ad4a20843d4b9eaac5532f87018b9d25730a62a16f54f6c52d1a/agent_client_protocol-0.8.1-py3-none-any.whl", hash = "sha256:9421a11fd435b4831660272d169c3812d553bb7247049c138c3ca127e4b8af8e", size = 54529, upload-time = "2026-02-13T15:34:53.344Z" },
 ]
 
 [[package]]
@@ -105,7 +105,7 @@ wheels = [
 
 [[package]]
 name = "anthropic"
-version = "0.76.0"
+version = "0.79.0"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "anyio" },
@@ -117,9 +117,9 @@ dependencies = [
     { name = "sniffio" },
     { name = "typing-extensions" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/6e/be/d11abafaa15d6304826438170f7574d750218f49a106c54424a40cef4494/anthropic-0.76.0.tar.gz", hash = "sha256:e0cae6a368986d5cf6df743dfbb1b9519e6a9eee9c6c942ad8121c0b34416ffe", size = 495483, upload-time = "2026-01-13T18:41:14.908Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/15/b1/91aea3f8fd180d01d133d931a167a78a3737b3fd39ccef2ae8d6619c24fd/anthropic-0.79.0.tar.gz", hash = "sha256:8707aafb3b1176ed6c13e2b1c9fb3efddce90d17aee5d8b83a86c70dcdcca871", size = 509825, upload-time = "2026-02-07T18:06:18.388Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/e5/70/7b0fd9c1a738f59d3babe2b4212031c34ab7d0fda4ffef15b58a55c5bcea/anthropic-0.76.0-py3-none-any.whl", hash = "sha256:81efa3113901192af2f0fe977d3ec73fdadb1e691586306c4256cd6d5ccc331c", size = 390309, upload-time = "2026-01-13T18:41:13.483Z" },
+    { url = "https://files.pythonhosted.org/packages/95/b2/cc0b8e874a18d7da50b0fda8c99e4ac123f23bf47b471827c5f6f3e4a767/anthropic-0.79.0-py3-none-any.whl", hash = "sha256:04cbd473b6bbda4ca2e41dd670fe2f829a911530f01697d0a1e37321eb75f3cf", size = 405918, upload-time = "2026-02-07T18:06:20.246Z" },
 ]
 
 [[package]]
@@ -342,7 +342,7 @@ wheels = [
 
 [[package]]
 name = "fast-agent-mcp"
-version = "0.4.41"
+version = "0.4.53"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "a2a-sdk" },
@@ -370,14 +370,15 @@ dependencies = [
     { name = "python-frontmatter" },
     { name = "pyyaml" },
     { name = "rich" },
+    { name = "ruamel-yaml" },
     { name = "tiktoken" },
     { name = "typer" },
     { name = "uvloop", marker = "sys_platform != 'win32'" },
     { name = "watchfiles" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/e6/08/df96d7c50ed32c1a62e6c0967affd35018111eca3257a555b4bc1c02534a/fast_agent_mcp-0.4.41.tar.gz", hash = "sha256:0d68dadec046ffa88a7a0d07ce84ade89c55aa8b458e87ada5a9c71006867023", size = 1608737, upload-time = "2026-01-25T21:42:46.568Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/89/48/027760d3e271299ad71b4baef77f0edb509fcf1ad1e0b6e38367fabf622f/fast_agent_mcp-0.4.53.tar.gz", hash = "sha256:bada3c4ec8be873e2b0fa844524df9da0c0492ca67270ec2b826e7e319f95dda", size = 1688537, upload-time = "2026-02-15T23:09:31.809Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/e1/33/5567bfe3f3e5621077bc20126d72b2e3a4b02db4c1e65c2d570049be64de/fast_agent_mcp-0.4.41-py3-none-any.whl", hash = "sha256:2f61bbfe4b1a88b47b93f55610fc259bb4a885302a6b0e4fb2ec5913cd8530ca", size = 1043530, upload-time = "2026-01-25T21:42:44.961Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/3b/c385a276521033ce1dec729feb9b7760a7d6f7ff15641e51c51b6d27301d/fast_agent_mcp-0.4.53-py3-none-any.whl", hash = "sha256:9dac6fe59e552b3ba56d19e225bbc59a9e3ec20ac7b8cfe1760c62cb54384a23", size = 1130674, upload-time = "2026-02-15T23:09:26.314Z" },
 ]
 
 [[package]]
@@ -759,7 +760,7 @@ wheels = [
 
 [[package]]
 name = "mcp"
-version = "1.25.0"
+version = "1.26.0"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "anyio" },
@@ -777,9 +778,9 @@ dependencies = [
     { name = "typing-inspection" },
     { name = "uvicorn", marker = "sys_platform != 'emscripten'" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/d5/2d/649d80a0ecf6a1f82632ca44bec21c0461a9d9fc8934d38cb5b319f2db5e/mcp-1.25.0.tar.gz", hash = "sha256:56310361ebf0364e2d438e5b45f7668cbb124e158bb358333cd06e49e83a6802", size = 605387, upload-time = "2025-12-19T10:19:56.985Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/fc/6d/62e76bbb8144d6ed86e202b5edd8a4cb631e7c8130f3f4893c3f90262b10/mcp-1.26.0.tar.gz", hash = "sha256:db6e2ef491eecc1a0d93711a76f28dec2e05999f93afd48795da1c1137142c66", size = 608005, upload-time = "2026-01-24T19:40:32.468Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/e2/fc/6dc7659c2ae5ddf280477011f4213a74f806862856b796ef08f028e664bf/mcp-1.25.0-py3-none-any.whl", hash = "sha256:b37c38144a666add0862614cc79ec276e97d72aa8ca26d622818d4e278b9721a", size = 233076, upload-time = "2025-12-19T10:19:55.416Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/d9/eaa1f80170d2b7c5ba23f3b59f766f3a0bb41155fbc32a69adfa1adaaef9/mcp-1.26.0-py3-none-any.whl", hash = "sha256:904a21c33c25aa98ddbeb47273033c435e595bbacfdb177f4bd87f6dceebe1ca", size = 233615, upload-time = "2026-01-24T19:40:30.652Z" },
 ]
 
 [[package]]
@@ -862,7 +863,7 @@ wheels = [
 
 [[package]]
 name = "openai"
-version = "2.15.0"
+version = "2.21.0"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "anyio" },
@@ -874,9 +875,9 @@ dependencies = [
     { name = "tqdm" },
     { name = "typing-extensions" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/94/f4/4690ecb5d70023ce6bfcfeabfe717020f654bde59a775058ec6ac4692463/openai-2.15.0.tar.gz", hash = "sha256:42eb8cbb407d84770633f31bf727d4ffb4138711c670565a41663d9439174fba", size = 627383, upload-time = "2026-01-09T22:10:08.603Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/92/e5/3d197a0947a166649f566706d7a4c8f7fe38f1fa7b24c9bcffe4c7591d44/openai-2.21.0.tar.gz", hash = "sha256:81b48ce4b8bbb2cc3af02047ceb19561f7b1dc0d4e52d1de7f02abfd15aa59b7", size = 644374, upload-time = "2026-02-14T00:12:01.577Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/b5/df/c306f7375d42bafb379934c2df4c2fa3964656c8c782bac75ee10c102818/openai-2.15.0-py3-none-any.whl", hash = "sha256:6ae23b932cd7230f7244e52954daa6602716d6b9bf235401a107af731baea6c3", size = 1067879, upload-time = "2026-01-09T22:10:06.446Z" },
+    { url = "https://files.pythonhosted.org/packages/cc/56/0a89092a453bb2c676d66abee44f863e742b2110d4dbb1dbcca3f7e5fc33/openai-2.21.0-py3-none-any.whl", hash = "sha256:0bc1c775e5b1536c294eded39ee08f8407656537ccc71b1004104fe1602e267c", size = 1103065, upload-time = "2026-02-14T00:11:59.603Z" },
 ]
 
 [package.optional-dependencies]
@@ -959,7 +960,7 @@ wheels = [
 
 [[package]]
 name = "opentelemetry-instrumentation-anthropic"
-version = "0.51.0"
+version = "0.52.1"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "opentelemetry-api" },
@@ -967,14 +968,14 @@ dependencies = [
     { name = "opentelemetry-semantic-conventions" },
     { name = "opentelemetry-semantic-conventions-ai" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/25/a8/5fb21cfca2a5ffdf5a94eb6032dd1cf90df04646d01702c96b73301af6d0/opentelemetry_instrumentation_anthropic-0.51.0.tar.gz", hash = "sha256:2b555866380535c79b4d6b1586bd82b933df3cf9c51fe89c8732309ec13bd2c0", size = 682760, upload-time = "2026-01-20T11:41:51.545Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/99/0d/cd59fb2475dfb245af493873cb4c5afa568cf66ad4e0de832c6513541267/opentelemetry_instrumentation_anthropic-0.52.1.tar.gz", hash = "sha256:e3462adc0956c95575ff845be78c3ab51113cf9372d3f64ef7119896ab304fbb", size = 682764, upload-time = "2026-02-02T09:23:02.275Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/8b/94/429645fd491f59beaf7d60a075207cc078b97bf0a19c89e1401a09f62b68/opentelemetry_instrumentation_anthropic-0.51.0-py3-none-any.whl", hash = "sha256:5ac2171b5fc9f8387b3e4f4b4c639fece8e9eed3d4ae868069bae20955fb89b1", size = 18427, upload-time = "2026-01-20T11:41:15.371Z" },
+    { url = "https://files.pythonhosted.org/packages/79/b8/87380b52b436d4d4683425038d8ab28987f070ff806dc0561ed0bcae69b6/opentelemetry_instrumentation_anthropic-0.52.1-py3-none-any.whl", hash = "sha256:9e902e4ae14b5ca2a5a60c22a1a2d6fe245ff45f4d7037e5705468c21ea431fa", size = 18428, upload-time = "2026-02-02T09:22:23.45Z" },
 ]
 
 [[package]]
 name = "opentelemetry-instrumentation-google-genai"
-version = "0.5b0"
+version = "0.6b0"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "opentelemetry-api" },
@@ -982,14 +983,14 @@ dependencies = [
     { name = "opentelemetry-semantic-conventions" },
     { name = "opentelemetry-util-genai" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/8f/62/b2506d74f50d4d0f150293d171dc1196c9334ba02c51d6ed19d64d0f76c4/opentelemetry_instrumentation_google_genai-0.5b0.tar.gz", hash = "sha256:1986cd1a69dafdcccee15ae9f114e45ff04954951af0fef8b5482e2930fc0b17", size = 47840, upload-time = "2025-12-11T14:50:48.641Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/60/59/a0d3da1679c056db45fa6338b332f9add8544c5afe9f7643d37062617c9a/opentelemetry_instrumentation_google_genai-0.6b0.tar.gz", hash = "sha256:76229c51a166d53e58e0376487f420562f1ab155511fe932110b4ea9c5718aad", size = 48433, upload-time = "2026-01-27T22:15:13.906Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/90/9f/a55591e2b41f6c29c4cf4b459617b03318e6d1e9c06f3b3ce7f22b7da8fc/opentelemetry_instrumentation_google_genai-0.5b0-py3-none-any.whl", hash = "sha256:20467a96d7407affc975e63d1175c21a4d33dd83f5ec162dddde6cea9e8f3995", size = 29531, upload-time = "2025-12-11T14:50:47.323Z" },
+    { url = "https://files.pythonhosted.org/packages/71/fd/5da48a5efef82034b6c0e20d5d293f4ade223bb2ec54101c1f7c81206577/opentelemetry_instrumentation_google_genai-0.6b0-py3-none-any.whl", hash = "sha256:bc5cf5957b697f05ffb765f59bb9e08aae457f1d08967753966e9d8a49b1b79f", size = 29861, upload-time = "2026-01-27T22:15:12.912Z" },
 ]
 
 [[package]]
 name = "opentelemetry-instrumentation-mcp"
-version = "0.51.0"
+version = "0.52.1"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "opentelemetry-api" },
@@ -997,14 +998,14 @@ dependencies = [
     { name = "opentelemetry-semantic-conventions" },
     { name = "opentelemetry-semantic-conventions-ai" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/dc/e4/d02ffcad44e751c6f63faa52ad22a0b067855abc87eb75ab93e42d5136e3/opentelemetry_instrumentation_mcp-0.51.0.tar.gz", hash = "sha256:7400f1b119fc23eefa827b247ebc54adc3f7afb93b437159e69112c05c2fb0e7", size = 112676, upload-time = "2026-01-20T11:42:04.1Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/a3/69/9b9413172a630e61426772699d828a94f3354f647d5b0266b579df3a18ac/opentelemetry_instrumentation_mcp-0.52.1.tar.gz", hash = "sha256:b159190a9a93ccf8c39259f250d62f187b720b4a3844d7e4f655ccfaa25bc1f8", size = 120368, upload-time = "2026-02-02T09:23:14.785Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/bb/f7/9d58646f48ba681cc3b929ad329df679c8827c0072134c22f4393b4065e4/opentelemetry_instrumentation_mcp-0.51.0-py3-none-any.whl", hash = "sha256:62dbeaff6fea82647a451f180d187fd60010d7d37bc51aa06300cd0c7fdb4c4a", size = 10463, upload-time = "2026-01-20T11:41:29.845Z" },
+    { url = "https://files.pythonhosted.org/packages/57/70/dc3a34e55da4c015aa2c879fb317d23fedf20175ba30a6f68e21d73da778/opentelemetry_instrumentation_mcp-0.52.1-py3-none-any.whl", hash = "sha256:39b8c3c0f841b694ae2301075633fbd8ae30d6ee06671d2f878d2742ef88c7d3", size = 10463, upload-time = "2026-02-02T09:22:39.167Z" },
 ]
 
 [[package]]
 name = "opentelemetry-instrumentation-openai"
-version = "0.51.0"
+version = "0.52.1"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "opentelemetry-api" },
@@ -1012,9 +1013,9 @@ dependencies = [
     { name = "opentelemetry-semantic-conventions" },
     { name = "opentelemetry-semantic-conventions-ai" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/82/68/32fdb0926993514f1bb0811883dd5eb2b2a284cff7e3d8153fbbd710dfb7/opentelemetry_instrumentation_openai-0.51.0.tar.gz", hash = "sha256:3b47dbf6b7aa690d5e6b966896002c2aa7eb374ea63598b6e3410c6276976692", size = 6978221, upload-time = "2026-01-20T11:42:07.788Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/c0/71/b178325571b504cd5709a61f3c205ddcdb0471eab945c564354f64ac2de9/opentelemetry_instrumentation_openai-0.52.1.tar.gz", hash = "sha256:444a60163856c52a1a620197ae6e3bb6d4492da94969737b0abe9a4388a108fc", size = 6978373, upload-time = "2026-02-02T09:23:18.838Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/0d/7f/1547d062a994e48e2114c48750b12b0a52debcf6e0a47c208677dc007c8f/opentelemetry_instrumentation_openai-0.51.0-py3-none-any.whl", hash = "sha256:39b02f905f50ff3f73aeef011449786b6e2fdad2cb8520fa996bf81f65f0d6f1", size = 42968, upload-time = "2026-01-20T11:41:37.499Z" },
+    { url = "https://files.pythonhosted.org/packages/4c/c3/f4e62f185d9f55286cfaf1dbf89cf87f48f17d39d7b1602c55cf2dcb2dd8/opentelemetry_instrumentation_openai-0.52.1-py3-none-any.whl", hash = "sha256:1f3818fbdd6ac4b038a099b99cd8afd5a959a396ad26baf422ad2002d8209ddd", size = 43085, upload-time = "2026-02-02T09:22:43.274Z" },
 ]
 
 [[package]]
@@ -1560,6 +1561,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/64/8d/0133e4eb4beed9e425d9a98ed6e081a55d195481b7632472be1af08d2f6b/rsa-4.9.1-py3-none-any.whl", hash = "sha256:68635866661c6836b8d39430f97a996acbd61bfa49406748ea243539fe239762", size = 34696, upload-time = "2025-04-16T09:51:17.142Z" },
 ]
 
+[[package]]
+name = "ruamel-yaml"
+version = "0.19.1"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/c7/3b/ebda527b56beb90cb7652cb1c7e4f91f48649fbcd8d2eb2fb6e77cd3329b/ruamel_yaml-0.19.1.tar.gz", hash = "sha256:53eb66cd27849eff968ebf8f0bf61f46cdac2da1d1f3576dd4ccee9b25c31993", size = 142709, upload-time = "2026-01-02T16:50:31.84Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/b8/0c/51f6841f1d84f404f92463fc2b1ba0da357ca1e3db6b7fbda26956c3b82a/ruamel_yaml-0.19.1-py3-none-any.whl", hash = "sha256:27592957fedf6e0b62f281e96effd28043345e0e66001f97683aa9a40c667c93", size = 118102, upload-time = "2026-01-02T16:50:29.201Z" },
+]
+
 [[package]]
 name = "ruff"
 version = "0.14.13"
@@ -1727,7 +1737,7 @@ wheels = [
 
 [[package]]
 name = "upskill"
-version = "0.2.0"
+version = "0.2.1"
 source = { editable = "." }
 dependencies = [
     { name = "click" },
@@ -1748,7 +1758,7 @@ dev = [
 [package.metadata]
 requires-dist = [
     { name = "click", specifier = ">=8.1" },
-    { name = "fast-agent-mcp", specifier = ">=0.4.41" },
+    { name = "fast-agent-mcp", specifier = ">=0.4.53" },
     { name = "pydantic", specifier = ">=2.0" },
     { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0" },
     { name = "pytest-asyncio", marker = "extra == 'dev'", specifier = ">=0.23" },