10 changes: 10 additions & 0 deletions .github/workflows/ci.yml
@@ -20,6 +20,16 @@ jobs:
       - run: bunx @biomejs/biome check .
       - run: bun run lint-architecture.ts

+  build-dashboard:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
+      - uses: oven-sh/setup-bun@ecf28ddc73e819eb6fa29df6b34ef8921c743461 # v2
+      - run: bun install
Comment on lines +28 to +30
🧹 Nitpick | 🔵 Trivial

Consider caching bun dependencies.

While the existing jobs also lack caching, adding bun cache would speed up all jobs. This is optional but beneficial for larger dependency sets.

♻️ Optional: Add bun caching
     steps:
       - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
       - uses: oven-sh/setup-bun@ecf28ddc73e819eb6fa29df6b34ef8921c743461 # v2
+        with:
+          bun-version: latest
+      - uses: actions/cache@v4
+        with:
+          path: ~/.bun/install/cache
+          key: ${{ runner.os }}-bun-${{ hashFiles('**/bun.lockb') }}
+          restore-keys: |
+            ${{ runner.os }}-bun-
       - run: bun install
       - run: bun run build:dashboard

Note: oven-sh/setup-bun may have built-in caching options — verify current action docs.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
-      - uses: oven-sh/setup-bun@ecf28ddc73e819eb6fa29df6b34ef8921c743461 # v2
-      - run: bun install
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
+      - uses: oven-sh/setup-bun@ecf28ddc73e819eb6fa29df6b34ef8921c743461 # v2
+        with:
+          bun-version: latest
+      - uses: actions/cache@v4
+        with:
+          path: ~/.bun/install/cache
+          key: ${{ runner.os }}-bun-${{ hashFiles('**/bun.lockb') }}
+          restore-keys: |
+            ${{ runner.os }}-bun-
+      - run: bun install
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 28 - 30, Add bun dependency caching
around the existing "uses: oven-sh/setup-bun" and "run: bun install" steps: add
an actions/cache@v4 restore/save step keyed on bun.lockb (e.g. key: ${{
runner.os }}-bun-${{ hashFiles('bun.lockb') }}) and cache the bun cache
directory (e.g. ~/.bun or $HOME/.bun) so subsequent runs reuse installed
packages; verify the exact cache path and any built-in caching options in
oven-sh/setup-bun docs and place the restore step before "run: bun install" and
the save step after it (or use a single actions/cache entry that handles both
restore and save).
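
To make the `key` / `restore-keys` pair in the suggestion above easier to reason about, here is a minimal TypeScript model of the lookup order: an exact key match first, then each restore key as a prefix, preferring the newest matching entry. This is an illustrative sketch only. The real lookup happens inside the actions/cache service, and `CacheEntry`, `CacheLookup`, and `lookupCache` are hypothetical names invented for this example.

```typescript
// Illustrative model only; not the actions/cache implementation.
// All names here are hypothetical.
interface CacheEntry {
  key: string;
  createdAt: number; // epoch milliseconds
}

type CacheLookup =
  | { hit: "exact"; entry: CacheEntry }
  | { hit: "partial"; entry: CacheEntry }
  | { hit: "miss" };

function lookupCache(
  entries: CacheEntry[],
  key: string,
  restoreKeys: string[],
): CacheLookup {
  // 1. An exact match on the primary key wins outright.
  const exact = entries.find((e) => e.key === key);
  if (exact !== undefined) return { hit: "exact", entry: exact };

  // 2. Otherwise each restore key is tried in order as a prefix,
  //    and the most recently created matching entry is used.
  for (const prefix of restoreKeys) {
    const newest = entries
      .filter((e) => e.key.startsWith(prefix))
      .sort((a, b) => b.createdAt - a.createdAt)[0];
    if (newest !== undefined) return { hit: "partial", entry: newest };
  }
  return { hit: "miss" };
}
```

In this model a changed lockfile produces a new primary key, so the first run after a dependency change gets a partial hit from the `${{ runner.os }}-bun-` prefix and the install step only has to fetch what changed. That only helps if the cache step runs before `bun install`.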

+      - run: bun run build:dashboard
+
   test:
     runs-on: ubuntu-latest
     permissions:
3 changes: 3 additions & 0 deletions .github/workflows/publish.yml
@@ -60,6 +60,9 @@ jobs:
       - name: Install dependencies
         run: bun install

+      - name: Build dashboard SPA
+        run: bun run build:dashboard
+
       - name: Verify npm version for trusted publishing
         run: npm --version

7 changes: 2 additions & 5 deletions ARCHITECTURE.md
@@ -44,8 +44,8 @@ cli/selftune/
├── observability.ts Health checks (doctor command)
├── status.ts Skill health summary (status command)
├── last.ts Last session insight (last command)
-├── dashboard.ts HTML dashboard builder (dashboard command)
-├── dashboard-server.ts Live Bun.serve server with SSE (dashboard --serve)
+├── dashboard.ts Dashboard command entry point (SPA server launcher)
+├── dashboard-server.ts Bun.serve SPA + v2 API server
├── types.ts Shared interfaces (incl. SelftuneConfig)
├── constants.ts Log paths, config paths, known tools
├── utils/ Shared utilities (jsonl, transcript, logging, llm-call, schema-validator, trigger-check)
@@ -100,9 +100,6 @@ apps/local-dashboard/ React SPA dashboard (Vite + TypeScript + shadcn/ui)
├── vite.config.ts Dev proxy → dashboard-server, build to dist/
└── package.json React 19, Tailwind v4, shadcn/ui, recharts

-dashboard/ Legacy HTML dashboard (served at /legacy/)
-└── index.html Original embedded-JSON dashboard (v1 endpoints)

templates/ Settings and config templates
├── single-skill-settings.json
├── multi-skill-settings.json
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -25,7 +25,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
- Onboarding flow: full empty-state guide for first-time users (3-step setup), dismissible welcome banner for returning users (localStorage-persisted)
- **SQLite v2 API endpoints** — `GET /api/v2/overview` and `GET /api/v2/skills/:name` backed by materialized SQLite queries (`getOverviewPayload()`, `getSkillReportPayload()`, `getSkillsList()`)
- **SQL query optimizations** — Replaced `NOT IN` subqueries with `LEFT JOIN + IS NULL`, moved JS-side dedup to SQL `GROUP BY`, added `LIMIT 200` to unbounded evidence queries
-- **SPA serving from dashboard server** — Built SPA served at `/`, legacy HTML dashboard moved to `/legacy/`
+- **SPA serving from dashboard server** — Built SPA served at `/` as the supported local dashboard experience
- **Source-truth-driven pipeline** — Transcripts and rollouts are now the authoritative source; `sync` rebuilds repaired overlays from source data rather than relying solely on hook-time capture
- **Telemetry contract package** — `@selftune/telemetry-contract` workspace package with canonical schema types, validators, versioning, metadata, and golden fixture tests
- **Test split** — `make test-fast` / `make test-slow` and `bun run test:fast` / `bun run test:slow` for faster development feedback loop
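
The changelog entries above describe one server handling both the v2 API (`/api/v2/overview`, `/api/v2/skills/:name`) and the built SPA at `/`. A minimal sketch of how such path dispatch could look is shown below. It is not taken from `dashboard-server.ts`: the `Route` type, `matchRoute`, and the choice to return `not_found` for unknown `/api/` paths are assumptions made for illustration.

```typescript
// Hypothetical routing sketch; names and fallback behavior are assumptions,
// not the actual dashboard-server.ts logic.
type Route =
  | { kind: "overview" }
  | { kind: "skill"; name: string }
  | { kind: "not_found" }
  | { kind: "spa" };

const SKILLS_PREFIX = "/api/v2/skills/";

function matchRoute(pathname: string): Route {
  if (pathname === "/api/v2/overview") return { kind: "overview" };

  if (pathname.startsWith(SKILLS_PREFIX)) {
    const raw = pathname.slice(SKILLS_PREFIX.length);
    // Accept exactly one non-empty path segment as the skill name.
    if (raw.length > 0 && !raw.includes("/")) {
      try {
        return { kind: "skill", name: decodeURIComponent(raw) };
      } catch {
        // Malformed percent-encoding: treat as an unknown API path.
        return { kind: "not_found" };
      }
    }
    return { kind: "not_found" };
  }

  // Unknown API paths should not silently return the SPA shell.
  if (pathname.startsWith("/api/")) return { kind: "not_found" };

  // Everything else falls through to the SPA, which routes client-side.
  return { kind: "spa" };
}
```

Falling through to the SPA for non-API paths is what lets deep links into the dashboard survive a page reload, since the client-side router resolves them after the shell loads.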
79 changes: 49 additions & 30 deletions README.md
@@ -15,15 +15,15 @@
[![Zero Dependencies](https://img.shields.io/badge/dependencies-0-brightgreen)](https://www.npmjs.com/package/selftune?activeTab=dependencies)
[![Bun](https://img.shields.io/badge/runtime-bun%20%7C%20node-black)](https://bun.sh)

-Your agent skills learn how you work. Detect what's broken. Fix it automatically.
+Your agent skills learn how you work. Detect what's broken. Improve low-risk skill behavior automatically.

**[Install](#install)** · **[Use Cases](#built-for-how-you-actually-work)** · **[How It Works](#how-it-works)** · **[Commands](#commands)** · **[Platforms](#platforms)** · **[Docs](docs/integration-guide.md)**

</div>

---

-Your skills don't understand how you talk. You say "make me a slide deck" and nothing happens: no error, no log, no signal. selftune watches your real sessions, learns how you actually speak, and rewrites skill descriptions to match. Automatically.
+Your skills do not understand how you talk. You say "make me a slide deck" and nothing happens: no error, no signal, no clue why the right skill never fired. selftune reads the transcripts and telemetry your agent already saves, learns how you actually speak, and improves skill descriptions to match. It validates changes before deployment, watches for regressions after, and rolls back when needed.

Built for **Claude Code**. Also works with Codex, OpenCode, and OpenClaw. Zero runtime dependencies.

@@ -35,9 +35,28 @@ npx skills add selftune-dev/selftune

Then tell your agent: **"initialize selftune"**

-Two minutes. No API keys. No external services. No configuration ceremony. Uses your existing agent subscription. Within minutes you'll see which skills are undertriggering.
+Two minutes. No API keys. No external services. No configuration ceremony. Uses your existing agent subscription.

-**CLI only** (no skill, just the CLI):
+Quick proof path:
+
+```bash
+npx selftune@latest doctor
+npx selftune@latest sync
+npx selftune@latest status
+npx selftune@latest dashboard
+```
+
+Use `--force` only when you explicitly need to rebuild local state from scratch.
+
+Autonomy quick start:
+
+```bash
+npx selftune@latest init --enable-autonomy
+npx selftune@latest orchestrate --dry-run
+npx selftune@latest schedule --install --dry-run
+```
Comment on lines +51 to +57
🧹 Nitpick | 🔵 Trivial

Add context for autonomy mode.

The autonomy quick start jumps directly to commands without explaining what autonomous mode does. Add a brief one-liner (e.g., "Autonomous mode enables auto-deployment of low-risk skill improvements with validation and rollback safeguards") before the code block so users understand what they're opting into.

📝 Suggested clarification
-Autonomy quick start:
+Autonomy quick start (auto-deploy validated low-risk improvements):
 
 ```bash
 npx selftune@latest init --enable-autonomy
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 51 - 57, Add a one-line explanation of what
autonomous mode does immediately before the "Autonomy quick start:" code block
(the block that begins with the command "npx selftune@latest init
--enable-autonomy"), e.g. "Autonomous mode enables automatic deployment of
low-risk skill improvements with validation and rollback safeguards." Keep it
concise and inline with the existing README formatting so the new sentence
appears directly above the three example commands.


+**CLI only** (no installed skill):

```bash
npx selftune@latest doctor
@@ -68,51 +87,51 @@ combinations repeat, which ones help, and where the friction is.
<img src="./assets/FeedbackLoop.gif" alt="Observe → Detect → Evolve → Watch" width="800">
</p>

-A continuous feedback loop that makes your skills learn and adapt. Automatically.
+A continuous feedback loop that makes your skills learn and adapt from real work.

-**Observe** — Hooks capture every user query and which skills fired. On Claude Code, hooks install automatically. Use `selftune replay` to backfill existing transcripts. This is how your skills start learning.
+**Observe** — selftune reads the transcripts and telemetry your agents already save. On Claude Code, hooks can add low-latency hints, but transcripts and logs are the source of truth. Use `selftune sync` to ingest current activity and `selftune replay` to backfill older Claude Code sessions.

-**Detect** — selftune finds the gap between how you talk and how your skills are described. You say "make me a slide deck" and your pptx skill stays silent — selftune catches that mismatch.
+**Detect** — selftune finds the gap between how you talk and how your skills are described. It spots missed triggers, underperforming descriptions, noisy environments, and regressions in real usage.

-**Evolve** — Rewrites skill descriptions — and full skill bodies — to match how you actually work. Batched validation with per-stage model control (`--cheap-loop` uses haiku for the loop, sonnet for the gate). Teacher-student body evolution with 3-gate validation. Baseline comparison gates on measurable lift. Automatic backup.
+**Evolve** — For low-risk changes, selftune can autonomously rewrite skill descriptions to match how you actually work. Every proposal is validated before deploy. Full skill-body or routing changes stay available for higher-touch workflows.

-**Watch** — After deploying changes, selftune monitors skill trigger rates. If anything regresses, it rolls back automatically. Your skills keep improving without you touching them.
+**Watch** — After deploying changes, selftune monitors trigger quality and post-deploy evidence. If something regresses, it can roll back automatically. The goal is autonomous improvement with safeguards, not blind self-editing.

-## What's New in v0.2.0
+## What's New in v0.2.x

-- **Full skill body evolution** — Beyond descriptions: evolve routing tables and entire skill bodies using teacher-student model with structural, trigger, and quality gates
-- **Synthetic eval generation** — `selftune evals --synthetic` generates eval sets from SKILL.md via LLM, no session logs needed. Solves cold-start: new skills get evals immediately.
-- **Cheap-loop evolution** — `selftune evolve --cheap-loop` uses haiku for proposal generation and validation, sonnet only for the final deployment gate. ~80% cost reduction.
-- **Batch trigger validation** — Validation now batches 10 queries per LLM call instead of one-per-query. ~10x faster evolution loops.
-- **Per-stage model control** — `--validation-model`, `--proposal-model`, and `--gate-model` flags give fine-grained control over which model runs each evolution stage.
-- **Auto-activation system** — Hooks detect when selftune should run and suggest actions
-- **Enforcement guardrails** — Blocks SKILL.md edits on monitored skills unless `selftune watch` has been run
-- **React SPA dashboard** — `selftune dashboard` serves a React SPA with skill health grid, per-skill drilldown, evidence viewer, evolution timeline, dark/light theming, and SQLite-backed v2 API (legacy dashboard at `/legacy/`)
-- **Evolution memory** — Persists context, plans, and decisions across context resets
-- **4 specialized agents** — Diagnosis analyst, pattern analyst, evolution reviewer, integration guide
-- **Sandbox test harness** — Comprehensive automated test coverage, including devcontainer-based LLM testing
-- **Workflow discovery + codification** — `selftune workflows` finds repeated
-multi-skill sequences from telemetry, and `selftune workflows save
-<workflow-id|index>` appends them to `## Workflows` in SKILL.md
+- **Source-truth sync** — `selftune sync` now leads the product loop, using transcripts/logs as truth and hooks as hints
+- **SQLite-backed local app** — `selftune dashboard` now serves the React SPA by default with faster overview/report routes on top of materialized local data
+- **Autonomous low-risk evolution** — description evolution is autonomous by default, with explicit review-required mode for stricter policies
+- **Autonomous scheduling** — `selftune init --enable-autonomy` and `selftune schedule --install` make the orchestrated loop the default recurring runtime
+- **Full skill body evolution** — evolve routing tables and entire skill bodies using teacher-student model with structural, trigger, and quality gates
+- **Synthetic eval generation** — `selftune evals --synthetic` generates eval sets from `SKILL.md` for cold-start skills
+- **Cheap-loop evolution** — `selftune evolve --cheap-loop` uses haiku for proposal generation and validation, sonnet only for the final deployment gate
+- **Per-stage model control** — `--validation-model`, `--proposal-model`, and `--gate-model` give fine-grained control over each evolution stage
+- **Sandbox test harness** — automated coverage, including devcontainer-based LLM testing
+- **Workflow discovery + codification** — `selftune workflows` finds repeated multi-skill sequences from telemetry and can append them to `## Workflows` in `SKILL.md`

## Commands

| Command | What it does |
|---|---|
+| `selftune doctor` | Health check: logs, config, permissions, dashboard build/runtime expectations |
+| `selftune sync` | Ingest source-truth activity from supported agents and rebuild local state |
 | `selftune status` | See which skills are undertriggering and why |
+| `selftune dashboard` | Open the React SPA dashboard (SQLite-backed) |
+| `selftune orchestrate` | Run the core loop: sync, inspect candidates, evolve, and watch |
+| `selftune schedule --install` | Install platform-native scheduling for the autonomous loop |
 | `selftune evals --skill <name>` | Generate eval sets from real session data (`--synthetic` for cold-start) |
 | `selftune evolve --skill <name>` | Propose, validate, and deploy improved descriptions (`--cheap-loop`, `--with-baseline`) |
 | `selftune evolve-body --skill <name>` | Evolve full skill body or routing table (teacher-student, 3-gate validation) |
+| `selftune watch --skill <name>` | Monitor after deploy. Auto-rollback on regression. |
+| `selftune replay` | Backfill data from existing Claude Code transcripts |
 | `selftune baseline --skill <name>` | Measure skill value vs no-skill baseline |
 | `selftune unit-test --skill <name>` | Run or generate skill-level unit tests |
 | `selftune composability --skill <name>` | Measure synergy and conflicts between co-occurring skills, with workflow-candidate hints |
 | `selftune workflows` | Discover repeated multi-skill workflows and save a discovered workflow into `SKILL.md` |
 | `selftune import-skillsbench` | Import external eval corpus from [SkillsBench](https://github.com/benchflow-ai/skillsbench) |
 | `selftune badge --skill <name>` | Generate skill health badge SVG |
-| `selftune watch --skill <name>` | Monitor after deploy. Auto-rollback on regression. |
-| `selftune dashboard` | Open the React SPA dashboard (SQLite-backed) |
-| `selftune replay` | Backfill data from existing Claude Code transcripts |
-| `selftune doctor` | Health check: logs, hooks, config, permissions |
+| `selftune cron setup` | Optional scheduler helper for OpenClaw-oriented automation |

Full command reference: `selftune --help`
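
The Watch step described in this README hunk (monitor after deploy, roll back on regression) comes down to comparing post-deploy evidence against a baseline. The sketch below shows one way such a decision could be expressed. It is not selftune's implementation: `WatchSnapshot`, `RollbackPolicy`, `shouldRollBack`, and the default thresholds are hypothetical and chosen only to illustrate the idea.

```typescript
// Hypothetical post-deploy regression check.
// None of these names or thresholds come from selftune itself.
interface WatchSnapshot {
  triggered: number; // sessions where the skill fired when it should have
  expected: number; // sessions where the skill should have fired
}

interface RollbackPolicy {
  minSessions: number; // assumed: do not judge on too little evidence
  maxDrop: number; // assumed: tolerated absolute drop in trigger rate
}

function triggerRate(s: WatchSnapshot): number {
  return s.expected === 0 ? 0 : s.triggered / s.expected;
}

function shouldRollBack(
  baseline: WatchSnapshot,
  afterDeploy: WatchSnapshot,
  policy: RollbackPolicy = { minSessions: 20, maxDrop: 0.1 },
): boolean {
  // Too little post-deploy evidence: keep watching instead of reverting.
  if (afterDeploy.expected < policy.minSessions) return false;

  // Roll back only when the trigger rate fell by more than the tolerance.
  const drop = triggerRate(baseline) - triggerRate(afterDeploy);
  return drop > policy.maxDrop;
}
```

Requiring a minimum number of sessions before judging keeps a single bad session from reverting a change, which fits the README's framing of autonomous improvement with safeguards.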

@@ -141,13 +160,13 @@ Observability tools trace LLM calls. Skill authoring tools help you write skills

## Platforms

-**Claude Code** (primary) — Hooks install automatically. `selftune replay` backfills existing transcripts. Full feature support.
+**Claude Code** (primary) — Reads saved transcripts and telemetry directly. Hooks install automatically and add low-latency hints. `selftune replay` backfills older Claude Code sessions. Full feature support.

**Codex** — `selftune wrap-codex -- <args>` or `selftune ingest-codex`

**OpenCode** — `selftune ingest-opencode`

-**OpenClaw** — `selftune ingest-openclaw` + `selftune cron setup` for autonomous evolution
+**OpenClaw** — `selftune ingest-openclaw`. `selftune cron setup` remains available as an optional OpenClaw-oriented scheduler helper, but the main product loop is still `selftune orchestrate` plus generic scheduling.

Requires [Bun](https://bun.sh) or Node.js 18+. No extra API keys.

2 changes: 1 addition & 1 deletion ROADMAP.md
@@ -16,7 +16,7 @@
- Per-skill drilldown with evidence viewer, evolution timeline
- SQLite v2 API endpoints (`/api/v2/overview`, `/api/v2/skills/:name`)
- Dark/light theme toggle with selftune branding
-- SPA served at `/`, legacy HTML dashboard at `/legacy/`
+- SPA served at `/` as the supported local dashboard

## In Progress
- Multi-agent sandbox expansion
15 changes: 10 additions & 5 deletions apps/local-dashboard/HANDOFF.md
@@ -28,8 +28,14 @@ JSONL logs → materializeIncremental() → SQLite → getOverviewPayload() / ge
## How to run

```bash
+# From repo root
+bun run dev
+# → if 7888 is free, starts dashboard server on 7888 and SPA dev server on http://localhost:5199
+# → if 7888 is already in use, reuses that dashboard server and starts only the SPA dev server
+
+# Or run manually:
 # Terminal 1: Start the dashboard server
-selftune dashboard --port 7888
+selftune dashboard --port 7888 --no-open

# Terminal 2: Start the SPA dev server (proxies /api to port 7888)
cd apps/local-dashboard
@@ -41,7 +47,7 @@ bunx vite
## What was rebased / changed

- **SPA types**: Rewritten to match `queries.ts` payload shapes (`OverviewResponse`, `SkillReportResponse`, `SkillSummary`, `EvidenceEntry`)
-- **API layer**: Now calls `/api/v2/overview` and `/api/v2/skills/:name` instead of `/api/data` + `/api/evaluations/:name`
+- **API layer**: Calls `/api/v2/overview` and `/api/v2/skills/:name`
- **SSE removed**: Replaced with 15s polling (SQLite reads are cheap, SSE was complex)
- **Overview page**: Uses `SkillSummary[]` from `getSkillsList()` for skill cards (pre-aggregated pass rate, check count, sessions)
- **Skill report page**: Single fetch to v2 endpoint instead of parallel overview + evaluations fetch. Shows evidence entries, evolution audit history per skill
@@ -61,13 +67,12 @@ bunx vite

## What still depends on old dashboard code

-- The old v1 endpoints (`/api/data`, `/api/events`, `/api/evaluations/:name`) still work and are used by the legacy `dashboard/index.html`
-- Badge endpoints (`/badge/:name`) and report HTML endpoints (`/report/:name`) use the old `computeStatus` + JSONL reader path
+- Badge endpoints (`/badge/:name`) and report HTML endpoints (`/report/:name`) still use the status/evidence JSONL path rather than SQLite-backed view models
- Action endpoints (`/api/actions/*`) are unchanged

## What remains before this can become default

-1. ~~**Serve built SPA from dashboard-server**~~: Done — `/` serves SPA, old dashboard at `/legacy/`
+1. ~~**Serve built SPA from dashboard-server**~~: Done — `/` serves the SPA
2. ~~**Production build**~~: Done — `bun run build:dashboard` in root package.json
3. **Regression detection**: The SQLite layer doesn't compute regression detection yet — `deriveStatus()` currently only uses pass rate + check count. Add a `regression_detected` column to skill summaries when the monitoring snapshot computation moves to SQLite.
4. **Monitoring snapshot migration**: Move `computeMonitoringSnapshot()` logic into the SQLite materializer or a query helper (window sessions, false negative rate, baseline comparison)
2 changes: 1 addition & 1 deletion apps/local-dashboard/package.json
@@ -4,7 +4,7 @@
   "version": "0.1.0",
   "type": "module",
   "scripts": {
-    "dev": "concurrently \"cd ../.. && bun run cli/selftune/index.ts dashboard --serve --port 7888\" \"vite\"",
+    "dev": "concurrently \"cd ../.. && bun run cli/selftune/index.ts dashboard --port 7888 --no-open\" \"vite\"",
"build": "vite build",
"preview": "vite preview",
"typecheck": "tsc --noEmit"