A CLI tool for testing AI tool documentation against multiple models before shipping.
Focus Group is a developer tool that sends skill/tool documentation to multiple AI models and collects their feedback on clarity, usability, and completeness. It answers the question: "Will models other than the one I developed with understand my documentation?"
Following Goedecke's principles, success means:
- Boring implementation: CLI that calls APIs and saves results to SQLite. No novel components.
- Minimal state: One database file, one config file. Everything else is ephemeral.
- Obvious operation: Run command, see results. Debug by reading logs. Rollback by deleting the database.
- Self-effacing: After setup, you forget how it works. You just use it.
Non-goals:
- Not a hosted service (no auth, no billing, no multi-tenancy)
- Not a CI/CD integration (that's a separate tool that calls this one)
- Not an "orchestrator agent" (you decide what to test, not an LLM)
- Not a synthesis engine (you read results and identify patterns)
| Primitive | Implementation | Why |
|---|---|---|
| CLI framework | Commander.js (via Bun) | Well-tested, zero config |
| Database | SQLite (via better-sqlite3) | Single file, no server, SQL |
| HTTP client | Built-in fetch | No dependencies |
| Config | TOML file | Human-readable, standard |
| Output | Markdown + JSON | Readable by humans and machines |
No caches. No queues. No event buses. No background jobs. This is a synchronous CLI tool.
| Data | Location | Owner | Changes | Staleness OK? | If Wrong? |
|---|---|---|---|---|---|
| Test run history | `~/.local/share/focus-group/focus-group.db` | CLI (write) | On each run | N/A | Delete and re-run |
| Model responses | Same DB, `responses` table | CLI (write) | Once per run | N/A | Re-run the test |
| User config | `~/.config/focus-group/config.toml` | User (write) | Rarely | Yes | CLI uses defaults |
| Skill doc content | Stored in DB per-run | CLI (write, snapshot) | Never after run | N/A | Historical record |
| API keys | Environment variables | User | Never via CLI | N/A | Auth fails, visible error |
Ownership rule: The CLI is the only writer to the database. Users edit config. Environment provides secrets.
Ordered by criticality:
1. **Run a test** (`focus-group test <skill> <task>`)
   - Parse skill doc
   - Send to N models via OpenRouter
   - Collect responses
   - Save to database
   - Display summary

   This is 90% of the tool's value. Design it first. Test it most.

2. **View results** (`focus-group show <run-id>`)
   - Query database
   - Format output
   - Display

   Simple read path. Hard to break.

3. **Compare runs** (`focus-group diff <run1> <run2>`)
   - Query two runs
   - Compute diff
   - Display

   Useful but secondary.

4. **List history** (`focus-group history`)
   - Query database
   - Format table
   - Display
| Dependency | How It Fails | Blast Radius | Strategy | Fail Open/Closed |
|---|---|---|---|---|
| OpenRouter API | 401, 429, 500, timeout, network error | One model's test | Log error, continue with other models, report partial results | Open (partial results are useful) |
| SQLite | Disk full, permissions, corruption | Whole tool | Show clear error, suggest fix | Closed (can't run without DB) |
| Config file | Missing, malformed | Config only | Use defaults, warn user | Open (defaults work) |
| Skill file | Missing, unreadable | One run | Fail immediately with clear error | Closed (can't test nothing) |
| Model returns garbage | Bad JSON, empty response, refusal | One model's test | Log raw response, mark as failed, continue | Open |
Retry policy:
- 5xx errors: Retry once with 2s delay
- 429 (rate limit): Retry once with 5s delay
- 4xx errors: Do not retry (our request is wrong)
- Timeout (30s): Do not retry (model is too slow)
No circuit breaker needed: This is a CLI tool run occasionally by a developer, not a high-volume service.
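The retry policy above can be sketched as a small decision helper. This is illustrative only (the function and type names are not part of the actual client, which appears in Part 6):

```typescript
// Sketch of the retry policy table; names here are illustrative.
type RetryDecision = { retry: boolean; delayMs: number };

function classifyFailure(httpStatus?: number, timedOut = false): RetryDecision {
  if (timedOut) return { retry: false, delayMs: 0 };             // model is too slow: give up
  if (httpStatus === 429) return { retry: true, delayMs: 5000 }; // rate limit: one retry, 5s
  if (httpStatus !== undefined && httpStatus >= 500) {
    return { retry: true, delayMs: 2000 };                       // transient server error: one retry, 2s
  }
  return { retry: false, delayMs: 0 };                           // 4xx: our request is wrong
}
```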
Principals:
- Local user running the CLI (only principal)
Resources:
- Local database (read/write)
- OpenRouter API (via user's API key)
Permissions:
- User can do everything
- No multi-user, no roles, no ACLs
Secrets:
- `OPENROUTER_API_KEY` from environment
- Never stored in database or config
- Never logged
Deployment model: Local CLI tool
- Installed via npm/bun
- Runs on user's machine
- No server component
- No persistent connections
Latency requirements:
- User expects test to complete in 30-120 seconds (depending on models)
- Each model call: 10-60 seconds typical
- Parallel execution to reduce wall-clock time
Throughput:
- One user, occasional use
- Maybe 5-20 runs per day during active development
- No scaling concerns
Data volume:
- ~10KB per model response
- ~50KB per run (5 models)
- 1000 runs = 50MB
- No volume concerns for years of use
-- Runs: One row per test execution
CREATE TABLE runs (
id TEXT PRIMARY KEY, -- ULID for sortability
created_at TEXT NOT NULL, -- ISO 8601 timestamp
skill_path TEXT NOT NULL, -- Path to skill file at test time
skill_content TEXT NOT NULL, -- Full content snapshot
skill_hash TEXT NOT NULL, -- SHA256 of content (for change detection)
task TEXT NOT NULL, -- Task description given to models
models TEXT NOT NULL, -- JSON array of model IDs requested
status TEXT NOT NULL DEFAULT 'running', -- running, completed, partial, failed
completed_at TEXT, -- When run finished
duration_ms INTEGER, -- Total wall-clock time
total_cost_usd REAL -- Sum of model costs (if available)
);
-- Responses: One row per model per run
CREATE TABLE responses (
id TEXT PRIMARY KEY, -- ULID
run_id TEXT NOT NULL REFERENCES runs(id) ON DELETE CASCADE,
model TEXT NOT NULL, -- Model ID (e.g., "openai/gpt-4o-mini")
status TEXT NOT NULL, -- pending, success, error, timeout
started_at TEXT, -- When request was sent
completed_at TEXT, -- When response received
duration_ms INTEGER, -- Request duration
-- Request details
prompt_tokens INTEGER, -- From API response
completion_tokens INTEGER, -- From API response
cost_usd REAL, -- Calculated cost
-- Response content
raw_response TEXT, -- Full API response JSON
content TEXT, -- Extracted message content
-- Parsed analysis (if using structured output)
task_understood INTEGER, -- 0/1 boolean
confusions TEXT, -- JSON array
failures TEXT, -- JSON array
suggestions TEXT, -- JSON array
-- Error details (if status = error)
error_code TEXT,
error_message TEXT
);
-- Indexes for common queries
CREATE INDEX idx_runs_created_at ON runs(created_at DESC);
CREATE INDEX idx_runs_skill_hash ON runs(skill_hash);
CREATE INDEX idx_responses_run_id ON responses(run_id);
CREATE INDEX idx_responses_model ON responses(model);

Readable: Someone can query the database directly and understand what happened.
Flexible:
- `raw_response` preserves everything even if parsing fails
- `confusions`, `failures`, `suggestions` are nullable (works with unstructured output too)
- `models` is a JSON array (no junction-table complexity for this simple use case)
Queryable:
- Find all runs for a skill: `WHERE skill_hash = ?`
- Find runs with failures: `WHERE EXISTS (SELECT 1 FROM responses WHERE status = 'error' AND run_id = runs.id)`
- Compare across models: `SELECT model, AVG(duration_ms) FROM responses GROUP BY model`
No migrations needed for v1: Schema is simple enough to get right the first time. If we need changes later, we version the schema and migrate on startup.
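If that later migration path is needed, one boring option is a version counter (e.g. SQLite's `PRAGMA user_version`) checked on startup. A minimal sketch with an abstract handle — the `Db` interface and the example migration SQL are hypothetical placeholders, not the real schema module:

```typescript
// Hypothetical sketch: versioned migrations applied on startup.
// `Db` stands in for the better-sqlite3 handle used elsewhere in this doc.
interface Db {
  getVersion(): number;          // e.g. backed by PRAGMA user_version
  setVersion(v: number): void;   // e.g. PRAGMA user_version = v
  exec(sql: string): void;
}

const MIGRATIONS: string[] = [
  // index 0 migrates version 0 -> 1, and so on
  "ALTER TABLE runs ADD COLUMN notes TEXT", // placeholder example, not a real migration
];

function migrate(db: Db): number {
  let v = db.getVersion();
  while (v < MIGRATIONS.length) {
    db.exec(MIGRATIONS[v]);
    v += 1;
    db.setVersion(v); // record progress after each step so a crash resumes cleanly
  }
  return v;
}
```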
| Query | Index |
|---|---|
| List recent runs | idx_runs_created_at |
| Find runs for same skill | idx_runs_skill_hash |
| Get responses for a run | idx_responses_run_id |
| Aggregate by model | idx_responses_model |
Following XDG Base Directory spec:
- Config: `~/.config/focus-group/config.toml`
- Data: `~/.local/share/focus-group/focus-group.db`
- Cache: None (no caching)
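A sketch of how the `utils/paths.ts` module might resolve these, honoring the `FOCUS_GROUP_CONFIG` and `FOCUS_GROUP_DATA` overrides documented in the environment-variable table (the function names are assumptions; the real module is not shown in this doc):

```typescript
import { homedir } from 'node:os';
import { join } from 'node:path';

// Sketch: XDG-style path resolution with the documented env overrides.
export function configPath(env: Record<string, string | undefined> = process.env): string {
  if (env.FOCUS_GROUP_CONFIG) return env.FOCUS_GROUP_CONFIG; // explicit override wins
  const base = env.XDG_CONFIG_HOME ?? join(homedir(), '.config');
  return join(base, 'focus-group', 'config.toml');
}

export function dataDir(env: Record<string, string | undefined> = process.env): string {
  if (env.FOCUS_GROUP_DATA) return env.FOCUS_GROUP_DATA; // explicit override wins
  const base = env.XDG_DATA_HOME ?? join(homedir(), '.local', 'share');
  return join(base, 'focus-group');
}
```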
# ~/.config/focus-group/config.toml
# Default models to test against (can override with --models)
[models]
default = [
"openai/gpt-4o-mini",
"anthropic/claude-3-haiku",
"google/gemini-1.5-flash"
]
# Preset model groups
[models.presets]
cheap = [
"openai/gpt-4o-mini",
"anthropic/claude-3-haiku",
"google/gemini-1.5-flash"
]
expensive = [
"openai/gpt-4o",
"anthropic/claude-3-5-sonnet",
"google/gemini-1.5-pro"
]
comprehensive = [
"openai/gpt-4o-mini",
"openai/gpt-4o",
"anthropic/claude-3-haiku",
"anthropic/claude-3-5-sonnet",
"google/gemini-1.5-flash",
"google/gemini-1.5-pro",
"meta-llama/llama-3.1-70b-instruct"
]
# API configuration
[api]
base_url = "https://openrouter.ai/api/v1"
timeout_seconds = 60
max_retries = 1
retry_delay_seconds = 2
# Output preferences
[output]
format = "markdown" # markdown, json, or both
color = true
verbose = false

Configuration precedence, highest to lowest:

1. CLI flags (highest priority)
2. Environment variables (`FOCUS_GROUP_*`)
3. Config file
4. Built-in defaults (lowest priority)
| Variable | Purpose | Required |
|---|---|---|
| `OPENROUTER_API_KEY` | API authentication | Yes |
| `FOCUS_GROUP_CONFIG` | Override config file path | No |
| `FOCUS_GROUP_DATA` | Override data directory | No |
| `NO_COLOR` | Disable colored output | No |
focus-group <command> [options]
Commands:
test <skill> <task> Run a test against models
show <run-id> Show detailed results for a run
history List recent runs
diff <run1> <run2> Compare two runs
cost Show API cost summary
config Show current configuration
init Create config file with defaults
Options:
-h, --help Show help
-v, --version Show version
--verbose Show detailed output
--quiet Suppress non-essential output
focus-group test <skill-file> <task> [options]
Arguments:
skill-file Path to skill/tool documentation (.md file)
task Task for models to attempt (quoted string)
Options:
-m, --models <list> Comma-separated model IDs
-p, --preset <name> Use a model preset (cheap, expensive, comprehensive)
-o, --output <dir> Also save results to directory
--json Use structured JSON output from models
--parallel <n> Max parallel requests (default: 3)
--timeout <sec> Per-model timeout (default: 60)
--no-save Don't save to database (dry run)
Examples:
focus-group test ./SKILL.md "Generate a 3-voice podcast"
focus-group test ./SKILL.md "Generate audio" --preset expensive
focus-group test ./SKILL.md "Use the tool" -m openai/gpt-4o,anthropic/claude-3-opus
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Focus Group Test
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Run ID: 01JGXYZ123456789
Skill: ./SKILL.md (sha256: abc123...)
Task: Generate a 3-voice podcast
Models: 3
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing models...
openai/gpt-4o-mini ✓ 12.3s $0.002
anthropic/claude-3-haiku ✓ 8.7s $0.001
google/gemini-1.5-flash ✓ 15.2s $0.001
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Completed: 3/3 models
Duration: 36.2s (parallel)
Total cost: $0.004
Common Confusions:
• "voice path format unclear" (2/3 models)
• "default output directory not specified" (1/3 models)
Suggested Improvements:
• Add explicit full path example for --voice flag
• Document default output location
View full results: focus-group show 01JGXYZ123456789
focus-group show <run-id> [options]
Arguments:
run-id Run ID to display (or "latest")
Options:
-m, --model <id> Show only results for specific model
--raw Show raw API response
--json Output as JSON
Examples:
focus-group show 01JGXYZ123456789
focus-group show latest
focus-group show latest --model openai/gpt-4o-mini
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Run 01JGXYZ123456789
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Date: 2026-01-09 22:15:00
Skill: ./SKILL.md
Task: Generate a 3-voice podcast
Status: completed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
## openai/gpt-4o-mini
**Status:** success (12.3s, $0.002)
### Understanding
I understand this is a TTS tool that can generate speech from text...
### Approach
1. First, I would check available voices with `ls ~/.chatter/voices/`
2. Create separate text files for each speaker...
### Confusions
- The documentation mentions `--voice` accepts a "preset or path" but only
shows the full path example. It's unclear if just the filename works.
### Potential Failures
- If I use `--voice morgan_freeman.wav` without the full path, it may fail
### Suggested Improvements
- Add explicit note that full paths are required for --voice
- Show example of listing available voices
---
## anthropic/claude-3-haiku
**Status:** success (8.7s, $0.001)
[... similar format ...]
---
## google/gemini-1.5-flash
**Status:** success (15.2s, $0.001)
[... similar format ...]
focus-group history [options]
Options:
-n, --limit <n> Number of runs to show (default: 20)
--skill <path> Filter by skill file path
--since <date> Show runs since date
--json Output as JSON
Examples:
focus-group history
focus-group history -n 50
focus-group history --skill ./SKILL.md
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Run ID Date Skill Models Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
01JGXYZ123456789 2026-01-09 22:15 speak-tts 3/3 ✓
01JGXYZ123456788 2026-01-09 21:30 speak-tts 3/3 ✓
01JGXYZ123456787 2026-01-09 20:45 audio-prod 5/5 ✓
01JGXYZ123456786 2026-01-08 15:20 speak-tts 3/3 ✓
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
focus-group diff <run1> <run2> [options]
Arguments:
run1 First run ID
run2 Second run ID
Options:
--model <id> Compare only specific model
--json Output as JSON
Examples:
focus-group diff 01JGXYZ123456788 01JGXYZ123456789
focus-group diff 01JGXYZ123456788 latest
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Diff: 01JGXYZ123456788 → 01JGXYZ123456789
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill changed: Yes (see below)
Task changed: No
## Skill Diff
@@ -45,6 +45,10 @@
### Using a Custom Voice
+**Always use the full path to voice files:**
+
```bash
-speak "Hello world" --voice morgan_freeman.wav --stream
+speak "Hello world" --voice ~/.chatter/voices/morgan_freeman.wav --stream
```

Fixed (no longer reported):
- "voice path format unclear" (was 2/3, now 0/3)
New: (none)
Persistent:
- "default output directory not specified" (still 1/3)
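The Fixed/New/Persistent buckets above could be computed by set comparison over the two runs' confusion lists. A minimal sketch (the function name is illustrative, not the actual diff code):

```typescript
// Sketch: bucket confusions from two runs into fixed / new / persistent.
function diffConfusions(before: string[], after: string[]) {
  const beforeSet = new Set(before);
  const afterSet = new Set(after);
  return {
    fixed: before.filter(c => !afterSet.has(c)),      // reported before, gone now
    added: after.filter(c => !beforeSet.has(c)),      // newly reported
    persistent: before.filter(c => afterSet.has(c)),  // still reported
  };
}
```

In practice, model phrasing varies run to run, so exact string matching undercounts; normalizing or fuzzy-matching confusion text is a likely refinement.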
### 5.6 Command: cost
focus-group cost [options]

Options:
  --since <date>       Show costs since date
  --by-model           Break down by model
  --json               Output as JSON

Examples:
  focus-group cost
  focus-group cost --since 2026-01-01
  focus-group cost --by-model

**Output**:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
API Costs (All Time)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Total runs: 47
Total cost: $0.42

By model:
  openai/gpt-4o-mini            $0.08   (47 calls)
  anthropic/claude-3-haiku      $0.06   (47 calls)
  google/gemini-1.5-flash       $0.04   (47 calls)
  openai/gpt-4o                 $0.18   (12 calls)
  anthropic/claude-3-5-sonnet   $0.06   (8 calls)
---
## Part 6: Core Implementation
### 6.1 Project Structure
focus-group/
├── src/
│   ├── cli.ts              # CLI entry point, command parsing
│   ├── commands/
│   │   ├── test.ts         # focus-group test
│   │   ├── show.ts         # focus-group show
│   │   ├── history.ts      # focus-group history
│   │   ├── diff.ts         # focus-group diff
│   │   ├── cost.ts         # focus-group cost
│   │   └── config.ts       # focus-group config
│   ├── core/
│   │   ├── runner.ts       # Executes tests against models
│   │   ├── openrouter.ts   # OpenRouter API client
│   │   └── prompts.ts      # System prompts for testing
│   ├── db/
│   │   ├── store.ts        # Database operations
│   │   ├── schema.ts       # Schema definition and migrations
│   │   └── types.ts        # Database types
│   ├── config/
│   │   ├── loader.ts       # Config file loading
│   │   ├── defaults.ts     # Default configuration
│   │   └── types.ts        # Config types
│   ├── output/
│   │   ├── formatter.ts    # Format results for display
│   │   ├── table.ts        # Table formatting
│   │   └── colors.ts       # Terminal colors
│   └── utils/
│       ├── hash.ts         # SHA256 hashing
│       ├── ulid.ts         # ULID generation
│       ├── paths.ts        # XDG path resolution
│       └── logger.ts       # Structured logging
├── test/
│   ├── runner.test.ts
│   ├── store.test.ts
│   └── fixtures/
│       └── sample-skill.md
├── package.json
├── tsconfig.json
├── README.md
└── CHANGELOG.md
### 6.2 Core Module: runner.ts
The hot path. Design carefully.
```typescript
// src/core/runner.ts
import { OpenRouterClient } from './openrouter';
import { Store } from '../db/store';
import { buildPrompt } from './prompts';
import { ulid } from '../utils/ulid';
import { hashContent } from '../utils/hash';
import type { Run, Response, TestOptions } from '../db/types';
export interface TestResult {
run: Run;
responses: Response[];
summary: {
succeeded: number;
failed: number;
totalDuration: number;
totalCost: number;
commonConfusions: string[];
commonSuggestions: string[];
};
}
export async function runTest(
skillPath: string,
skillContent: string,
task: string,
models: string[],
options: TestOptions,
store: Store,
client: OpenRouterClient,
onProgress?: (model: string, status: 'started' | 'completed' | 'failed', result?: Response) => void
): Promise<TestResult> {
// 1. Create run record
const runId = ulid();
const run: Run = {
id: runId,
created_at: new Date().toISOString(),
skill_path: skillPath,
skill_content: skillContent,
skill_hash: hashContent(skillContent),
task: task,
models: JSON.stringify(models),
status: 'running',
};
store.insertRun(run);
// 2. Execute tests in parallel (with concurrency limit)
const responses: Response[] = [];
const concurrency = options.parallel ?? 3;
const executeModel = async (model: string): Promise<Response> => {
const responseId = ulid();
const startedAt = new Date().toISOString();
onProgress?.(model, 'started');
try {
const prompt = buildPrompt(skillContent, task, options.json ?? false);
const startTime = Date.now();
const apiResponse = await client.chat({
model,
messages: prompt.messages,
response_format: options.json ? { type: 'json_object' } : undefined,
timeout: options.timeout ?? 60000,
});
const duration = Date.now() - startTime;
const response: Response = {
id: responseId,
run_id: runId,
model,
status: 'success',
started_at: startedAt,
completed_at: new Date().toISOString(),
duration_ms: duration,
prompt_tokens: apiResponse.usage?.prompt_tokens,
completion_tokens: apiResponse.usage?.completion_tokens,
cost_usd: calculateCost(model, apiResponse.usage),
raw_response: JSON.stringify(apiResponse),
content: apiResponse.choices[0]?.message?.content ?? '',
// Parsed fields populated below if JSON mode
};
// Parse structured output if JSON mode
if (options.json && response.content) {
try {
const parsed = JSON.parse(response.content);
response.task_understood = parsed.task_understood ? 1 : 0;
response.confusions = JSON.stringify(parsed.confusions ?? []);
response.failures = JSON.stringify(parsed.failures ?? []);
response.suggestions = JSON.stringify(parsed.suggestions ?? []);
} catch {
// JSON parsing failed, leave fields null
}
}
onProgress?.(model, 'completed', response);
return response;
} catch (error) {
const response: Response = {
id: responseId,
run_id: runId,
model,
status: 'error',
started_at: startedAt,
completed_at: new Date().toISOString(),
error_code: (error as any)?.code ?? 'UNKNOWN',
error_message: error instanceof Error ? error.message : String(error),
};
onProgress?.(model, 'failed', response);
return response;
}
};
// Execute with concurrency limit
const chunks = [];
for (let i = 0; i < models.length; i += concurrency) {
chunks.push(models.slice(i, i + concurrency));
}
for (const chunk of chunks) {
const chunkResults = await Promise.all(chunk.map(executeModel));
responses.push(...chunkResults);
// Save responses as they complete
for (const response of chunkResults) {
store.insertResponse(response);
}
}
// 3. Update run record
const completedAt = new Date().toISOString();
const succeeded = responses.filter(r => r.status === 'success').length;
const totalCost = responses.reduce((sum, r) => sum + (r.cost_usd ?? 0), 0);
store.updateRun(runId, {
status: succeeded === models.length ? 'completed' : 'partial',
completed_at: completedAt,
duration_ms: Date.now() - new Date(run.created_at).getTime(),
total_cost_usd: totalCost,
});
// 4. Build summary
const summary = buildSummary(responses);
return {
run: { ...run, status: succeeded === models.length ? 'completed' : 'partial', completed_at: completedAt },
responses,
summary,
};
}
function buildSummary(responses: Response[]) {
const succeeded = responses.filter(r => r.status === 'success');
const failed = responses.filter(r => r.status !== 'success');
// Extract and count confusions/suggestions
const confusionCounts = new Map<string, number>();
const suggestionCounts = new Map<string, number>();
for (const r of succeeded) {
// Try to extract from structured JSON first
if (r.confusions) {
try {
const confusions = JSON.parse(r.confusions);
for (const c of confusions) {
const key = typeof c === 'string' ? c : c.issue ?? c.description ?? JSON.stringify(c);
confusionCounts.set(key, (confusionCounts.get(key) ?? 0) + 1);
}
} catch {}
}
if (r.suggestions) {
try {
const suggestions = JSON.parse(r.suggestions);
for (const s of suggestions) {
const key = typeof s === 'string' ? s : s.suggestion ?? s.description ?? JSON.stringify(s);
suggestionCounts.set(key, (suggestionCounts.get(key) ?? 0) + 1);
}
} catch {}
}
}
// Sort by frequency
const commonConfusions = [...confusionCounts.entries()]
.sort((a, b) => b[1] - a[1])
.slice(0, 5)
.map(([issue, count]) => `${issue} (${count}/${succeeded.length} models)`);
const commonSuggestions = [...suggestionCounts.entries()]
.sort((a, b) => b[1] - a[1])
.slice(0, 5)
.map(([suggestion]) => suggestion);
return {
succeeded: succeeded.length,
failed: failed.length,
totalDuration: responses.reduce((sum, r) => sum + (r.duration_ms ?? 0), 0),
totalCost: responses.reduce((sum, r) => sum + (r.cost_usd ?? 0), 0),
commonConfusions,
commonSuggestions,
};
}
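`calculateCost` is called in the runner above but not defined in this document. A minimal sketch, assuming a static per-model price table — the figures below are placeholders, not real OpenRouter prices, and real code should source pricing from the API or a maintained table:

```typescript
// Sketch of calculateCost. Prices are placeholders, expressed per 1M tokens.
const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  'openai/gpt-4o-mini': { input: 0.15, output: 0.6 }, // placeholder figures
};

function calculateCost(
  model: string,
  usage?: { prompt_tokens: number; completion_tokens: number }
): number | undefined {
  const price = PRICE_PER_MTOK[model];
  if (!price || !usage) return undefined; // unknown model: omit cost rather than guess
  return (
    (usage.prompt_tokens * price.input + usage.completion_tokens * price.output) / 1_000_000
  );
}
```

Returning `undefined` for unknown models matches the schema, where `cost_usd` is nullable.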
// src/core/openrouter.ts
export interface ChatRequest {
model: string;
messages: { role: 'system' | 'user' | 'assistant'; content: string }[];
response_format?: { type: 'json_object' };
timeout?: number;
}
export interface ChatResponse {
id: string;
model: string;
choices: {
message: {
role: string;
content: string;
};
finish_reason: string;
}[];
usage?: {
prompt_tokens: number;
completion_tokens: number;
total_tokens: number;
};
}
export class OpenRouterClient {
private apiKey: string;
private baseUrl: string;
private maxRetries: number;
private retryDelay: number;
constructor(options: {
apiKey: string;
baseUrl?: string;
maxRetries?: number;
retryDelay?: number;
}) {
this.apiKey = options.apiKey;
this.baseUrl = options.baseUrl ?? 'https://openrouter.ai/api/v1';
this.maxRetries = options.maxRetries ?? 1;
this.retryDelay = options.retryDelay ?? 2000;
}
async chat(request: ChatRequest): Promise<ChatResponse> {
let lastError: Error | null = null;
for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
// Fresh controller and timer per attempt: an aborted AbortSignal cannot be reused
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), request.timeout ?? 60000);
try {
const response = await fetch(`${this.baseUrl}/chat/completions`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${this.apiKey}`,
'Content-Type': 'application/json',
'HTTP-Referer': 'https://github.com/EmZod/focus-group',
'X-Title': 'Focus Group',
},
body: JSON.stringify({
model: request.model,
messages: request.messages,
response_format: request.response_format,
temperature: 0.3, // Lower temperature for more consistent analysis
}),
signal: controller.signal,
});
clearTimeout(timeout);
if (!response.ok) {
const body = await response.text();
const error = new Error(`OpenRouter API error: ${response.status} ${body}`);
(error as any).code = `HTTP_${response.status}`;
(error as any).status = response.status;
// Don't retry 4xx errors
if (response.status >= 400 && response.status < 500 && response.status !== 429) {
throw error;
}
// Retry 429 and 5xx
lastError = error;
if (attempt < this.maxRetries) {
const delay = response.status === 429 ? 5000 : this.retryDelay;
await sleep(delay);
continue;
}
throw error;
}
return await response.json();
} catch (error) {
clearTimeout(timeout);
if ((error as any)?.name === 'AbortError') {
const timeoutError = new Error(`Request timed out after ${request.timeout ?? 60000}ms`);
(timeoutError as any).code = 'TIMEOUT';
throw timeoutError;
}
lastError = error as Error;
if (attempt < this.maxRetries) {
await sleep(this.retryDelay);
continue;
}
throw error;
}
}
throw lastError;
}
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}

// src/core/prompts.ts
export interface Prompt {
messages: { role: 'system' | 'user'; content: string }[];
}
const SYSTEM_PROMPT_MARKDOWN = `You are a QA tester evaluating tool documentation for AI agents.
You will receive documentation for a tool and a task to attempt using that tool.
Your job:
1. Read the documentation carefully
2. Describe step-by-step exactly how you would complete the task
3. Note anything confusing, ambiguous, or missing from the docs
4. Note what would cause you to fail
5. Suggest specific improvements to the documentation
Be brutally honest. Quote the documentation when something is unclear.
Do not be polite—be useful. Identify real problems.
Format your response as:
## Understanding
[What you understood the tool does and how to use it]
## Approach
[Step-by-step how you would complete the task]
## Confusions
[List anything unclear, with quotes from the doc]
## Potential Failures
[What would cause you to fail this task]
## Suggested Improvements
[Specific changes to the documentation]`;
const SYSTEM_PROMPT_JSON = `You are a QA tester evaluating tool documentation for AI agents.
You will receive documentation for a tool and a task to attempt using that tool.
Your job:
1. Read the documentation carefully
2. Describe step-by-step exactly how you would complete the task
3. Note anything confusing, ambiguous, or missing from the docs
4. Note what would cause you to fail
5. Suggest specific improvements to the documentation
Be brutally honest. Identify real problems.
Respond with JSON in this exact format:
{
"task_understood": boolean,
"understanding": "string - what you understood the tool does",
"steps": ["string - step 1", "string - step 2", ...],
"confusions": [
{"location": "quoted text from doc", "issue": "why it's confusing"}
],
"potential_failures": [
{"step": "which step", "reason": "why it would fail"}
],
"suggestions": ["string - specific improvement 1", ...]
}`;
export function buildPrompt(skillContent: string, task: string, jsonMode: boolean): Prompt {
const systemPrompt = jsonMode ? SYSTEM_PROMPT_JSON : SYSTEM_PROMPT_MARKDOWN;
return {
messages: [
{ role: 'system', content: systemPrompt },
{
role: 'user',
content: `# Tool Documentation
${skillContent}
---
# Task to Complete
${task}`
},
],
};
}

// src/db/store.ts
import Database from 'better-sqlite3';
import { SCHEMA } from './schema';
import type { Run, Response } from './types';
export class Store {
private db: Database.Database;
constructor(dbPath: string) {
this.db = new Database(dbPath);
this.db.pragma('journal_mode = WAL'); // Better concurrent access
this.db.pragma('foreign_keys = ON');
this.initialize();
}
private initialize() {
this.db.exec(SCHEMA);
}
// Runs
insertRun(run: Run): void {
const stmt = this.db.prepare(`
INSERT INTO runs (id, created_at, skill_path, skill_content, skill_hash, task, models, status)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
`);
stmt.run(run.id, run.created_at, run.skill_path, run.skill_content, run.skill_hash, run.task, run.models, run.status);
}
updateRun(id: string, updates: Partial<Run>): void {
const fields = Object.keys(updates);
const values = Object.values(updates);
const setClause = fields.map(f => `${f} = ?`).join(', ');
const stmt = this.db.prepare(`UPDATE runs SET ${setClause} WHERE id = ?`);
stmt.run(...values, id);
}
getRun(id: string): Run | null {
const stmt = this.db.prepare('SELECT * FROM runs WHERE id = ?');
return (stmt.get(id) as Run | undefined) ?? null;
}
getLatestRun(): Run | null {
const stmt = this.db.prepare('SELECT * FROM runs ORDER BY created_at DESC LIMIT 1');
return (stmt.get() as Run | undefined) ?? null;
}
listRuns(options: { limit?: number; skillPath?: string; since?: string }): Run[] {
let sql = 'SELECT * FROM runs WHERE 1=1';
const params: any[] = [];
if (options.skillPath) {
sql += ' AND skill_path = ?';
params.push(options.skillPath);
}
if (options.since) {
sql += ' AND created_at >= ?';
params.push(options.since);
}
sql += ' ORDER BY created_at DESC';
if (options.limit) {
sql += ' LIMIT ?';
params.push(options.limit);
}
const stmt = this.db.prepare(sql);
return stmt.all(...params) as Run[];
}
// Responses
insertResponse(response: Response): void {
const columns = (Object.keys(response) as (keyof Response)[]).filter(k => response[k] !== undefined);
const placeholders = columns.map(() => '?').join(', ');
const values = columns.map(k => response[k]);
const stmt = this.db.prepare(`INSERT INTO responses (${columns.join(', ')}) VALUES (${placeholders})`);
stmt.run(...values);
}
getResponsesForRun(runId: string): Response[] {
const stmt = this.db.prepare('SELECT * FROM responses WHERE run_id = ? ORDER BY model');
return stmt.all(runId) as Response[];
}
// Aggregates
getCostSummary(since?: string): { total: number; byModel: Map<string, { cost: number; count: number }> } {
let sql = 'SELECT model, SUM(cost_usd) as total_cost, COUNT(*) as count FROM responses';
const params: any[] = [];
if (since) {
sql += ' WHERE started_at >= ?';
params.push(since);
}
sql += ' GROUP BY model';
const stmt = this.db.prepare(sql);
const rows = stmt.all(...params) as { model: string; total_cost: number; count: number }[];
const byModel = new Map<string, { cost: number; count: number }>();
let total = 0;
for (const row of rows) {
byModel.set(row.model, { cost: row.total_cost ?? 0, count: row.count });
total += row.total_cost ?? 0;
}
return { total, byModel };
}
close(): void {
this.db.close();
}
}

Every decision is logged, using structured logging:
// src/utils/logger.ts
export type LogLevel = 'debug' | 'info' | 'warn' | 'error';
export interface LogEntry {
level: LogLevel;
message: string;
timestamp: string;
[key: string]: any;
}
class Logger {
private verbose: boolean = false;
setVerbose(verbose: boolean) {
this.verbose = verbose;
}
private log(level: LogLevel, message: string, data?: Record<string, any>) {
if (level === 'debug' && !this.verbose) return;
const entry: LogEntry = {
level,
message,
timestamp: new Date().toISOString(),
...data,
};
// In verbose mode, output structured JSON
if (this.verbose) {
console.error(JSON.stringify(entry));
} else if (level === 'error') {
console.error(`Error: ${message}`);
} else if (level === 'warn') {
console.error(`Warning: ${message}`);
}
}
debug(message: string, data?: Record<string, any>) { this.log('debug', message, data); }
info(message: string, data?: Record<string, any>) { this.log('info', message, data); }
warn(message: string, data?: Record<string, any>) { this.log('warn', message, data); }
error(message: string, data?: Record<string, any>) { this.log('error', message, data); }
}
export const logger = new Logger();

| Event | Level | Data |
|---|---|---|
| Run started | info | { run_id, skill_path, task, models } |
| Model request started | debug | { run_id, model } |
| Model request completed | debug | { run_id, model, duration_ms, tokens } |
| Model request failed | warn | { run_id, model, error_code, error_message } |
| Retry attempted | debug | { run_id, model, attempt, delay_ms } |
| Run completed | info | { run_id, succeeded, failed, duration_ms, cost } |
| Config loaded | debug | { path, models_count } |
| Config not found, using defaults | info | { path } |
| Database opened | debug | { path } |
| Database migration applied | info | { version } |
For a CLI tool, we don't need runtime metrics. The database IS our metrics store.
Queries to answer operational questions:
-- Average response time by model
SELECT model, AVG(duration_ms) as avg_ms, COUNT(*) as count
FROM responses WHERE status = 'success'
GROUP BY model;
-- Error rate by model
SELECT model,
SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) as success,
SUM(CASE WHEN status != 'success' THEN 1 ELSE 0 END) as failure
FROM responses GROUP BY model;
-- Runs per day
SELECT DATE(created_at) as date, COUNT(*) as runs
FROM runs GROUP BY DATE(created_at);
-- Cost per day
SELECT DATE(created_at) as date, SUM(total_cost_usd) as cost
FROM runs GROUP BY DATE(created_at);

# Via npm
npm install -g @emzod/focus-group
# Via bun
bun install -g @emzod/focus-group
# From source
git clone https://github.com/EmZod/focus-group
cd focus-group
bun install
bun run build
bun link

# Set API key
export OPENROUTER_API_KEY="sk-or-..."
# Create default config (optional)
focus-group init
# Run first test
focus-group test ./SKILL.md "Complete this task"

On first run:
- Creates `~/.local/share/focus-group/` directory
- Creates `focus-group.db` with schema
- Creates `~/.config/focus-group/config.toml` if not exists (with `init`)
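These locations follow the XDG base-directory convention. A sketch of the path resolution, assuming hypothetical helper names (the plan puts this logic in `src/utils/paths.ts`, and the `FOCUS_GROUP_DATA` override is what the integration tests rely on):

```typescript
import { join } from 'node:path';
import { homedir } from 'node:os';

// Hypothetical helpers; the real module is src/utils/paths.ts per the plan.
// FOCUS_GROUP_DATA lets tests point the CLI at a throwaway directory.
function dataDir(): string {
  if (process.env.FOCUS_GROUP_DATA) return process.env.FOCUS_GROUP_DATA;
  const base = process.env.XDG_DATA_HOME ?? join(homedir(), '.local', 'share');
  return join(base, 'focus-group');
}

function configDir(): string {
  const base = process.env.XDG_CONFIG_HOME ?? join(homedir(), '.config');
  return join(base, 'focus-group');
}
```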
Problem: "API key invalid"
# Check key is set
echo $OPENROUTER_API_KEY
# Test directly
curl https://openrouter.ai/api/v1/models \
-H "Authorization: Bearer $OPENROUTER_API_KEY"

Problem: "Model not found"
# List available models
curl https://openrouter.ai/api/v1/models \
-H "Authorization: Bearer $OPENROUTER_API_KEY" | jq '.data[].id'

Problem: "Database locked"
# Check for other processes
lsof ~/.local/share/focus-group/focus-group.db
# If corrupted, delete and re-run (loses history)
rm ~/.local/share/focus-group/focus-group.db

Problem: "Results look wrong"
# View raw API response
focus-group show <run-id> --raw
# Check what was sent
focus-group show <run-id> --verbose

There's no deployment. If something is broken:
- Downgrade the package: `npm install -g @emzod/focus-group@previous-version`
- Delete corrupt database: `rm ~/.local/share/focus-group/focus-group.db` (loses history)
- Reset config: `rm ~/.config/focus-group/config.toml && focus-group init`
# Backup database
cp ~/.local/share/focus-group/focus-group.db ~/backups/focus-group-$(date +%Y%m%d).db
# Export all runs as JSON
focus-group history --json > ~/backups/focus-group-history.json

// test/runner.test.ts
import { describe, it, expect, mock } from 'bun:test';
import { runTest } from '../src/core/runner';
describe('runTest', () => {
it('creates run record before making API calls', async () => {
const store = mockStore();
const client = mockClient({ response: mockSuccessResponse() });
await runTest('skill.md', 'content', 'task', ['model1'], {}, store, client);
expect(store.insertRun).toHaveBeenCalledBefore(client.chat);
});
it('continues with other models if one fails', async () => {
const store = mockStore();
const client = mockClient({
'model1': { error: new Error('failed') },
'model2': { response: mockSuccessResponse() },
});
const result = await runTest('skill.md', 'content', 'task', ['model1', 'model2'], {}, store, client);
expect(result.summary.succeeded).toBe(1);
expect(result.summary.failed).toBe(1);
});
it('respects concurrency limit', async () => {
const store = mockStore();
const client = mockClient({ response: mockSuccessResponse() });
await runTest('skill.md', 'content', 'task', ['m1', 'm2', 'm3', 'm4'], { parallel: 2 }, store, client);
// Should have made 2 batches of 2, not 4 concurrent
expect(client.maxConcurrent).toBe(2);
});
it('does not retry 4xx errors', async () => {
const client = mockClient({ error: httpError(400) });
await expect(client.chat({})).rejects.toThrow();
expect(client.attempts).toBe(1); // No retry
});
it('retries 5xx errors once', async () => {
const client = mockClient({
attempts: [httpError(500), mockSuccessResponse()]
});
const result = await client.chat({});
expect(result).toBeDefined();
expect(client.attempts).toBe(2);
});
});

// test/integration.test.ts
import { describe, it, expect, beforeEach, afterEach } from 'bun:test';
import { execSync } from 'child_process';
import { mkdtempSync, writeFileSync, rmSync } from 'fs';
import { join } from 'path';
describe('CLI integration', () => {
let tmpDir: string;
beforeEach(() => {
tmpDir = mkdtempSync('/tmp/focus-group-test-');
process.env.FOCUS_GROUP_DATA = tmpDir;
});
afterEach(() => {
rmSync(tmpDir, { recursive: true });
});
it('focus-group test creates run and responses', () => {
const skillPath = join(tmpDir, 'skill.md');
writeFileSync(skillPath, '# Test Skill\n\nDo the thing.');
// This requires a real API key - skip in CI
if (!process.env.OPENROUTER_API_KEY) {
console.log('Skipping: OPENROUTER_API_KEY not set');
return;
}
const output = execSync(
`focus-group test "${skillPath}" "Use this tool" -m openai/gpt-4o-mini --json`,
{ encoding: 'utf-8' }
);
const result = JSON.parse(output);
expect(result.run.status).toBe('completed');
expect(result.responses).toHaveLength(1);
});
it('focus-group history shows previous runs', () => {
// ... run a test first ...
const output = execSync('focus-group history --json', { encoding: 'utf-8' });
const runs = JSON.parse(output);
expect(runs).toHaveLength(1);
});
});

<!-- test/fixtures/sample-skill.md -->
# Sample Tool
A tool for testing.
## Usage
```bash
sample-tool do-thing --option value
```

| Option | Description |
|---|---|
| --option | Does something |
---
## Part 10: Implementation Steps
### Phase 1: Foundation (Day 1)
1. **Initialize project**

   ```bash
   mkdir focus-group && cd focus-group
   bun init
   ```

2. **Set up dependencies**

   ```json
   {
     "dependencies": {
       "commander": "^12.0.0",
       "better-sqlite3": "^9.0.0",
       "toml": "^3.0.0"
     },
     "devDependencies": {
       "@types/better-sqlite3": "^7.0.0",
       "typescript": "^5.0.0"
     }
   }
   ```

3. **Create directory structure** (as specified in 6.1)

4. **Implement core modules:**
   - `src/utils/ulid.ts`
   - `src/utils/hash.ts`
   - `src/utils/paths.ts`
   - `src/utils/logger.ts`

5. **Implement database:**
   - `src/db/schema.ts`
   - `src/db/types.ts`
   - `src/db/store.ts`

6. **Implement OpenRouter client:** `src/core/openrouter.ts`
   - Include retry logic
   - Include timeout handling

7. **Implement prompts:** `src/core/prompts.ts`
   - Both markdown and JSON formats

8. **Implement runner:** `src/core/runner.ts`
   - Parallel execution
   - Progress callbacks

9. **Write tests for core:**
   - `test/openrouter.test.ts`
   - `test/runner.test.ts`
   - `test/store.test.ts`

10. **Implement config loading:**
    - `src/config/defaults.ts`
    - `src/config/loader.ts`
    - `src/config/types.ts`

11. **Implement output formatting:**
    - `src/output/formatter.ts`
    - `src/output/table.ts`
    - `src/output/colors.ts`

12. **Implement commands:**
    - `src/commands/test.ts` (highest priority)
    - `src/commands/show.ts`
    - `src/commands/history.ts`
    - `src/commands/config.ts`

13. **Wire up CLI:** `src/cli.ts`

14. **Implement remaining commands:**
    - `src/commands/diff.ts`
    - `src/commands/cost.ts`

15. **Add integration tests**

16. **Documentation:**
    - `README.md`
    - `CHANGELOG.md`

17. **Package configuration:**
    - `package.json` `bin` field
    - Build script
    - Publish script

18. **Test on real skill docs:**
    - speak-tts SKILL.md
    - audio-production SKILL.md

19. **Fix issues found in testing**

20. **Publish:**
    - `npm publish`
    - Create GitHub release
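The OpenRouter client's retry and timeout handling, and the behavior the unit tests assert (retry 5xx once with backoff, never retry 4xx, 60s timeout), could look roughly like this. A sketch only: `fetchWithRetry` and its options are assumed names, not the shipped API.

```typescript
// Sketch of the retry wrapper for the OpenRouter client (assumed names).
// Retries 5xx with exponential backoff; never retries 4xx; enforces a
// per-call timeout via AbortSignal.timeout.
async function fetchWithRetry(
  url: string,
  init: RequestInit = {},
  { retries = 1, baseDelayMs = 1000, timeoutMs = 60_000 } = {}
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, { ...init, signal: AbortSignal.timeout(timeoutMs) });
    if (res.ok) return res;
    // 4xx is a caller error: retrying won't help, fail immediately
    if (res.status < 500 || attempt >= retries) {
      throw new Error(`OpenRouter request failed: HTTP ${res.status}`);
    }
    // 5xx: back off, then retry
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
  }
}
```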
| Feature | Why Not |
|---|---|
| Web UI | CLI is sufficient for primary user (developer). Web adds auth, hosting, complexity. |
| Orchestrator agent | User knows what to test better than an LLM. Adds complexity without clear value. |
| Auto-synthesis | Reading 3-5 model responses is fast. Synthesis is hard to get right. |
| CI integration | Can be built as separate tool that calls this CLI. Not core functionality. |
| Model comparison rankings | Not the point. Finding doc problems is the point. |
| Caching | Tests should be fresh. Historical data is in DB. No cache needed. |
| Background jobs | Everything is synchronous and fast enough. No queue needed. |
| Feature | Trigger |
|---|---|
| Skill doc rewriter | Core v2 feature — synthesize model feedback into improved skill doc with optimized progressive disclosure |
| Structured diff of confusions | After 50+ runs, if manual comparison is tedious |
| Model performance tracking | If you need to justify model choices |
| Export to markdown report | If you need to share results with non-CLI users |
| Custom prompt templates | If default prompts don't fit certain tool types |
- Clear owner for each piece of state (CLI owns DB, user owns config)
- State inventory documented (Part 2.1)
- Schema designed for actual queries (Part 3)
- Indexes planned for high-volume queries (Part 3.1)
- Slow operations handled appropriately (parallel API calls)
- No background jobs needed (synchronous CLI)
- Killswitch: Ctrl+C aborts
- No events needed (single process)
- Timeouts set for all external calls (60s default)
- Retry logic appropriate (backoff, no retry on 4xx)
- Fail-open on individual model failure (continue with others)
- Degradation: partial results are useful
- Unhappy paths logged with context (Part 7.2)
- Decisions logged (model started, completed, failed)
- Metrics via DB queries (Part 7.3)
- No distributed tracing needed (single process)
- Hot path identified: `focus-group test` (Part 2.2)
- Hot path designed first (Part 6.2)
- Hot path has tests (Part 9.1)
- How is it deployed? npm install (Part 8.1)
- How is it rolled back? npm install previous version (Part 8.4)
- How do you know it's broken? Command fails with error (Part 8.3)
- How do you debug? --verbose flag, --raw flag (Part 8.3)
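The fail-open and bounded-parallelism items above (continue when one model fails, respect the configured concurrency limit) reduce to a small batching helper; a sketch under assumed names:

```typescript
// Sketch: run at most `parallel` tasks at a time. Promise.allSettled
// keeps going when a task rejects, so one model's failure doesn't
// abort the rest (fail-open, partial results preserved).
async function runBatched<T>(
  tasks: Array<() => Promise<T>>,
  parallel: number
): Promise<PromiseSettledResult<T>[]> {
  const results: PromiseSettledResult<T>[] = [];
  for (let i = 0; i < tasks.length; i += parallel) {
    const batch = tasks.slice(i, i + parallel).map((start) => start());
    results.push(...(await Promise.allSettled(batch)));
  }
  return results;
}
```

Plain batching (wait for the whole batch before starting the next) matches the unit test's expectation of two batches of two; a slot-refilling limiter would be a later optimization, not a requirement.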
This plan follows Goedecke's principles:
- Boring primitives: SQLite, fetch, Commander.js. Nothing novel.
- Minimal state: One DB file. That's it.
- One owner: CLI writes to DB. User writes to config. Clear boundaries.
- Hot path first: `test` command designed in detail. Others are simple queries.
- Failure modes explicit: Table in Part 2.3 covers every dependency.
- Operational answers: Installation, debugging, rollback all documented.
- No premature complexity: No caching, no queues, no events, no web UI.
The goal is a tool you install once and forget how it works. You just run `focus-group test` when you need feedback on your skill docs.