A CLI tool for testing AI tool documentation against multiple models before shipping.
Focus Group is a developer tool that sends skill/tool documentation to multiple AI models and collects their feedback on clarity, usability, and completeness. It answers the question: "Will models other than the one I developed with understand my documentation?"
Following Goedecke's principles, success means:
- Boring implementation: CLI that calls APIs and saves results to SQLite. No novel components.
- Minimal state: One database file, one config file. Everything else is ephemeral.
- Obvious operation: Run command, see results. Debug by reading logs. Rollback by deleting the database.
- Self-effacing: After setup, you forget how it works. You just use it.
Non-goals:
- Not a hosted service (no auth, no billing, no multi-tenancy)
- Not a CI/CD integration (that's a separate tool that calls this one)
- Not an "orchestrator agent" (you decide what to test, not an LLM)
- Not a synthesis engine (you read results and identify patterns)
| Primitive | Implementation | Why |
|---|---|---|
| CLI framework | Commander.js (via Bun) | Well-tested, zero config |
| Database | SQLite (via better-sqlite3) | Single file, no server, SQL |
| HTTP client | Built-in fetch | No dependencies |
| Config | TOML file | Human-readable, standard |
| Output | Markdown + JSON | Readable by humans and machines |
No caches. No queues. No event buses. No background jobs. This is a synchronous CLI tool.
| Data | Location | Owner | Changes | Staleness OK? | If Wrong? |
|---|---|---|---|---|---|
| Test run history | `~/.local/share/focus-group/focus-group.db` | CLI (write) | On each run | N/A | Delete and re-run |
| Model responses | Same DB, `responses` table | CLI (write) | Once per run | N/A | Re-run the test |
| User config | `~/.config/focus-group/config.toml` | User (write) | Rarely | Yes | CLI uses defaults |
| Skill doc content | Stored in DB per-run | CLI (write, snapshot) | Never after run | N/A | Historical record |
| API keys | Environment variables | User | Never via CLI | N/A | Auth fails, visible error |
Ownership rule: The CLI is the only writer to the database. Users edit config. Environment provides secrets.
Ordered by criticality:
1. **Run a test** (`focus-group test <skill> <task>`)
   - Parse skill doc
   - Send to N models via OpenRouter
   - Collect responses
   - Save to database
   - Display summary

   This is 90% of the tool's value. Design it first. Test it most.

2. **View results** (`focus-group show <run-id>`)
   - Query database
   - Format output
   - Display

   Simple read path. Hard to break.

3. **Compare runs** (`focus-group diff <run1> <run2>`)
   - Query two runs
   - Compute diff
   - Display

   Useful but secondary.

4. **List history** (`focus-group history`)
   - Query database
   - Format table
   - Display
| Dependency | How It Fails | Blast Radius | Strategy | Fail Open/Closed |
|---|---|---|---|---|
| OpenRouter API | 401, 429, 500, timeout, network error | One model's test | Log error, continue with other models, report partial results | Open (partial results are useful) |
| SQLite | Disk full, permissions, corruption | Whole tool | Show clear error, suggest fix | Closed (can't run without DB) |
| Config file | Missing, malformed | Config only | Use defaults, warn user | Open (defaults work) |
| Skill file | Missing, unreadable | One run | Fail immediately with clear error | Closed (can't test nothing) |
| Model returns garbage | Bad JSON, empty response, refusal | One model's test | Log raw response, mark as failed, continue | Open |
Retry policy:
- 5xx errors: Retry once with 2s delay
- 429 (rate limit): Retry once with 5s delay
- 4xx errors: Do not retry (our request is wrong)
- Timeout (30s): Do not retry (model is too slow)
No circuit breaker needed: This is a CLI tool run occasionally by a developer, not a high-volume service.
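The retry policy above can be sketched as a small decision helper. This is illustrative only (the function and type names are not part of the actual client, which appears in Part 6):

```typescript
// Sketch of the retry policy table; names here are illustrative.
type RetryDecision = { retry: boolean; delayMs: number };

function classifyFailure(httpStatus?: number, timedOut = false): RetryDecision {
  if (timedOut) return { retry: false, delayMs: 0 };             // model is too slow: give up
  if (httpStatus === 429) return { retry: true, delayMs: 5000 }; // rate limit: one retry, 5s
  if (httpStatus !== undefined && httpStatus >= 500) {
    return { retry: true, delayMs: 2000 };                       // transient server error: one retry, 2s
  }
  return { retry: false, delayMs: 0 };                           // 4xx: our request is wrong
}
```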
Principals:
- Local user running the CLI (only principal)
Resources:
- Local database (read/write)
- OpenRouter API (via user's API key)
Permissions:
- User can do everything
- No multi-user, no roles, no ACLs
Secrets:
- `OPENROUTER_API_KEY` from environment
- Never stored in database or config
- Never logged
Deployment model: Local CLI tool
- Installed via npm/bun
- Runs on user's machine
- No server component
- No persistent connections
Latency requirements:
- User expects test to complete in 30-120 seconds (depending on models)
- Each model call: 10-60 seconds typical
- Parallel execution to reduce wall-clock time
Throughput:
- One user, occasional use
- Maybe 5-20 runs per day during active development
- No scaling concerns
Data volume:
- ~10KB per model response
- ~50KB per run (5 models)
- 1000 runs = 50MB
- No volume concerns for years of use
-- Runs: One row per test execution
CREATE TABLE runs (
id TEXT PRIMARY KEY, -- ULID for sortability
created_at TEXT NOT NULL, -- ISO 8601 timestamp
skill_path TEXT NOT NULL, -- Path to skill file at test time
skill_content TEXT NOT NULL, -- Full content snapshot
skill_hash TEXT NOT NULL, -- SHA256 of content (for change detection)
task TEXT NOT NULL, -- Task description given to models
models TEXT NOT NULL, -- JSON array of model IDs requested
status TEXT NOT NULL DEFAULT 'running', -- running, completed, partial, failed
completed_at TEXT, -- When run finished
duration_ms INTEGER, -- Total wall-clock time
total_cost_usd REAL -- Sum of model costs (if available)
);
-- Responses: One row per model per run
CREATE TABLE responses (
id TEXT PRIMARY KEY, -- ULID
run_id TEXT NOT NULL REFERENCES runs(id) ON DELETE CASCADE,
model TEXT NOT NULL, -- Model ID (e.g., "openai/gpt-4o-mini")
status TEXT NOT NULL, -- pending, success, error, timeout
started_at TEXT, -- When request was sent
completed_at TEXT, -- When response received
duration_ms INTEGER, -- Request duration
-- Request details
prompt_tokens INTEGER, -- From API response
completion_tokens INTEGER, -- From API response
cost_usd REAL, -- Calculated cost
-- Response content
raw_response TEXT, -- Full API response JSON
content TEXT, -- Extracted message content
-- Parsed analysis (if using structured output)
task_understood INTEGER, -- 0/1 boolean
confusions TEXT, -- JSON array
failures TEXT, -- JSON array
suggestions TEXT, -- JSON array
-- Error details (if status = error)
error_code TEXT,
error_message TEXT
);
-- Indexes for common queries
CREATE INDEX idx_runs_created_at ON runs(created_at DESC);
CREATE INDEX idx_runs_skill_hash ON runs(skill_hash);
CREATE INDEX idx_responses_run_id ON responses(run_id);
CREATE INDEX idx_responses_model ON responses(model);

Readable: Someone can query the database directly and understand what happened.
Flexible:
- `raw_response` preserves everything even if parsing fails
- `confusions`, `failures`, `suggestions` are nullable (works with unstructured output too)
- `models` is a JSON array (no junction-table complexity for this simple use case)
Queryable:
- Find all runs for a skill: `WHERE skill_hash = ?`
- Find runs with failures: `WHERE EXISTS (SELECT 1 FROM responses WHERE status = 'error' AND run_id = runs.id)`
- Compare across models: `SELECT model, AVG(duration_ms) FROM responses GROUP BY model`
No migrations needed for v1: Schema is simple enough to get right the first time. If we need changes later, we version the schema and migrate on startup.
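If that later migration path is needed, one boring option is a version counter (e.g. SQLite's `PRAGMA user_version`) checked on startup. A minimal sketch with an abstract handle — the `Db` interface and the example migration SQL are hypothetical placeholders, not the real schema module:

```typescript
// Hypothetical sketch: versioned migrations applied on startup.
// `Db` stands in for the better-sqlite3 handle used elsewhere in this doc.
interface Db {
  getVersion(): number;          // e.g. backed by PRAGMA user_version
  setVersion(v: number): void;   // e.g. PRAGMA user_version = v
  exec(sql: string): void;
}

const MIGRATIONS: string[] = [
  // index 0 migrates version 0 -> 1, and so on
  "ALTER TABLE runs ADD COLUMN notes TEXT", // placeholder example, not a real migration
];

function migrate(db: Db): number {
  let v = db.getVersion();
  while (v < MIGRATIONS.length) {
    db.exec(MIGRATIONS[v]);
    v += 1;
    db.setVersion(v); // record progress after each step so a crash resumes cleanly
  }
  return v;
}
```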
| Query | Index |
|---|---|
| List recent runs | idx_runs_created_at |
| Find runs for same skill | idx_runs_skill_hash |
| Get responses for a run | idx_responses_run_id |
| Aggregate by model | idx_responses_model |
Following XDG Base Directory spec:
- Config: `~/.config/focus-group/config.toml`
- Data: `~/.local/share/focus-group/focus-group.db`
- Cache: None (no caching)
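A sketch of how the `utils/paths.ts` module might resolve these, honoring the `FOCUS_GROUP_CONFIG` and `FOCUS_GROUP_DATA` overrides documented in the environment-variable table (the function names are assumptions; the real module is not shown in this doc):

```typescript
import { homedir } from 'node:os';
import { join } from 'node:path';

// Sketch: XDG-style path resolution with the documented env overrides.
export function configPath(env: Record<string, string | undefined> = process.env): string {
  if (env.FOCUS_GROUP_CONFIG) return env.FOCUS_GROUP_CONFIG; // explicit override wins
  const base = env.XDG_CONFIG_HOME ?? join(homedir(), '.config');
  return join(base, 'focus-group', 'config.toml');
}

export function dataDir(env: Record<string, string | undefined> = process.env): string {
  if (env.FOCUS_GROUP_DATA) return env.FOCUS_GROUP_DATA; // explicit override wins
  const base = env.XDG_DATA_HOME ?? join(homedir(), '.local', 'share');
  return join(base, 'focus-group');
}
```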
# ~/.config/focus-group/config.toml
# Default models to test against (can override with --models)
[models]
default = [
"openai/gpt-4o-mini",
"anthropic/claude-3-haiku",
"google/gemini-1.5-flash"
]
# Preset model groups
[models.presets]
cheap = [
"openai/gpt-4o-mini",
"anthropic/claude-3-haiku",
"google/gemini-1.5-flash"
]
expensive = [
"openai/gpt-4o",
"anthropic/claude-3-5-sonnet",
"google/gemini-1.5-pro"
]
comprehensive = [
"openai/gpt-4o-mini",
"openai/gpt-4o",
"anthropic/claude-3-haiku",
"anthropic/claude-3-5-sonnet",
"google/gemini-1.5-flash",
"google/gemini-1.5-pro",
"meta-llama/llama-3.1-70b-instruct"
]
# API configuration
[api]
base_url = "https://openrouter.ai/api/v1"
timeout_seconds = 60
max_retries = 1
retry_delay_seconds = 2
# Output preferences
[output]
format = "markdown" # markdown, json, or both
color = true
verbose = false

Configuration precedence, highest to lowest:

1. CLI flags (highest priority)
2. Environment variables (`FOCUS_GROUP_*`)
3. Config file
4. Built-in defaults (lowest priority)
| Variable | Purpose | Required |
|---|---|---|
| `OPENROUTER_API_KEY` | API authentication | Yes |
| `FOCUS_GROUP_CONFIG` | Override config file path | No |
| `FOCUS_GROUP_DATA` | Override data directory | No |
| `NO_COLOR` | Disable colored output | No |
focus-group <command> [options]
Commands:
test <skill> <task> Run a test against models
show <run-id> Show detailed results for a run
history List recent runs
diff <run1> <run2> Compare two runs
cost Show API cost summary
config Show current configuration
init Create config file with defaults
Options:
-h, --help Show help
-v, --version Show version
--verbose Show detailed output
--quiet Suppress non-essential output
focus-group test <skill-file> <task> [options]
Arguments:
skill-file Path to skill/tool documentation (.md file)
task Task for models to attempt (quoted string)
Options:
-m, --models <list> Comma-separated model IDs
-p, --preset <name> Use a model preset (cheap, expensive, comprehensive)
-o, --output <dir> Also save results to directory
--json Use structured JSON output from models
--parallel <n> Max parallel requests (default: 3)
--timeout <sec> Per-model timeout (default: 60)
--no-save Don't save to database (dry run)
Examples:
focus-group test ./SKILL.md "Generate a 3-voice podcast"
focus-group test ./SKILL.md "Generate audio" --preset expensive
focus-group test ./SKILL.md "Use the tool" -m openai/gpt-4o,anthropic/claude-3-opus
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Focus Group Test
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Run ID: 01JGXYZ123456789
Skill: ./SKILL.md (sha256: abc123...)
Task: Generate a 3-voice podcast
Models: 3
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing models...
openai/gpt-4o-mini ✓ 12.3s $0.002
anthropic/claude-3-haiku ✓ 8.7s $0.001
google/gemini-1.5-flash ✓ 15.2s $0.001
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Completed: 3/3 models
Duration: 36.2s (parallel)
Total cost: $0.004
Common Confusions:
• "voice path format unclear" (2/3 models)
• "default output directory not specified" (1/3 models)
Suggested Improvements:
• Add explicit full path example for --voice flag
• Document default output location
View full results: focus-group show 01JGXYZ123456789
focus-group show <run-id> [options]
Arguments:
run-id Run ID to display (or "latest")
Options:
-m, --model <id> Show only results for specific model
--raw Show raw API response
--json Output as JSON
Examples:
focus-group show 01JGXYZ123456789
focus-group show latest
focus-group show latest --model openai/gpt-4o-mini
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Run 01JGXYZ123456789
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Date: 2026-01-09 22:15:00
Skill: ./SKILL.md
Task: Generate a 3-voice podcast
Status: completed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
## openai/gpt-4o-mini
**Status:** success (12.3s, $0.002)
### Understanding
I understand this is a TTS tool that can generate speech from text...
### Approach
1. First, I would check available voices with `ls ~/.chatter/voices/`
2. Create separate text files for each speaker...
### Confusions
- The documentation mentions `--voice` accepts a "preset or path" but only
shows the full path example. It's unclear if just the filename works.
### Potential Failures
- If I use `--voice morgan_freeman.wav` without the full path, it may fail
### Suggested Improvements
- Add explicit note that full paths are required for --voice
- Show example of listing available voices
---
## anthropic/claude-3-haiku
**Status:** success (8.7s, $0.001)
[... similar format ...]
---
## google/gemini-1.5-flash
**Status:** success (15.2s, $0.001)
[... similar format ...]
focus-group history [options]
Options:
-n, --limit <n> Number of runs to show (default: 20)
--skill <path> Filter by skill file path
--since <date> Show runs since date
--json Output as JSON
Examples:
focus-group history
focus-group history -n 50
focus-group history --skill ./SKILL.md
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Run ID Date Skill Models Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
01JGXYZ123456789 2026-01-09 22:15 speak-tts 3/3 ✓
01JGXYZ123456788 2026-01-09 21:30 speak-tts 3/3 ✓
01JGXYZ123456787 2026-01-09 20:45 audio-prod 5/5 ✓
01JGXYZ123456786 2026-01-08 15:20 speak-tts 3/3 ✓
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
focus-group diff <run1> <run2> [options]
Arguments:
run1 First run ID
run2 Second run ID
Options:
--model <id> Compare only specific model
--json Output as JSON
Examples:
focus-group diff 01JGXYZ123456788 01JGXYZ123456789
focus-group diff 01JGXYZ123456788 latest
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Diff: 01JGXYZ123456788 → 01JGXYZ123456789
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill changed: Yes (see below)
Task changed: No
## Skill Diff
@@ -45,6 +45,10 @@
### Using a Custom Voice
+**Always use the full path to voice files:**
+
```bash
-speak "Hello world" --voice morgan_freeman.wav --stream
+speak "Hello world" --voice ~/.chatter/voices/morgan_freeman.wav --stream
```

Fixed (no longer reported):
- "voice path format unclear" (was 2/3, now 0/3)
New: (none)
Persistent:
- "default output directory not specified" (still 1/3)
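The Fixed/New/Persistent buckets above could be computed by set comparison over the two runs' confusion lists. A minimal sketch (the function name is illustrative, not the actual diff code):

```typescript
// Sketch: bucket confusions from two runs into fixed / new / persistent.
function diffConfusions(before: string[], after: string[]) {
  const beforeSet = new Set(before);
  const afterSet = new Set(after);
  return {
    fixed: before.filter(c => !afterSet.has(c)),      // reported before, gone now
    added: after.filter(c => !beforeSet.has(c)),      // newly reported
    persistent: before.filter(c => afterSet.has(c)),  // still reported
  };
}
```

In practice, model phrasing varies run to run, so exact string matching undercounts; normalizing or fuzzy-matching confusion text is a likely refinement.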
### 5.6 Command: cost
focus-group cost [options]

Options:
  --since <date>       Show costs since date
  --by-model           Break down by model
  --json               Output as JSON

Examples:
  focus-group cost
  focus-group cost --since 2026-01-01
  focus-group cost --by-model

**Output**:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
API Costs (All Time)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Total runs: 47
Total cost: $0.42

By model:
  openai/gpt-4o-mini            $0.08   (47 calls)
  anthropic/claude-3-haiku      $0.06   (47 calls)
  google/gemini-1.5-flash       $0.04   (47 calls)
  openai/gpt-4o                 $0.18   (12 calls)
  anthropic/claude-3-5-sonnet   $0.06   (8 calls)
---
## Part 6: Core Implementation
### 6.1 Project Structure
focus-group/
├── src/
│   ├── cli.ts              # CLI entry point, command parsing
│   ├── commands/
│   │   ├── test.ts         # focus-group test
│   │   ├── show.ts         # focus-group show
│   │   ├── history.ts      # focus-group history
│   │   ├── diff.ts         # focus-group diff
│   │   ├── cost.ts         # focus-group cost
│   │   └── config.ts       # focus-group config
│   ├── core/
│   │   ├── runner.ts       # Executes tests against models
│   │   ├── openrouter.ts   # OpenRouter API client
│   │   └── prompts.ts      # System prompts for testing
│   ├── db/
│   │   ├── store.ts        # Database operations
│   │   ├── schema.ts       # Schema definition and migrations
│   │   └── types.ts        # Database types
│   ├── config/
│   │   ├── loader.ts       # Config file loading
│   │   ├── defaults.ts     # Default configuration
│   │   └── types.ts        # Config types
│   ├── output/
│   │   ├── formatter.ts    # Format results for display
│   │   ├── table.ts        # Table formatting
│   │   └── colors.ts       # Terminal colors
│   └── utils/
│       ├── hash.ts         # SHA256 hashing
│       ├── ulid.ts         # ULID generation
│       ├── paths.ts        # XDG path resolution
│       └── logger.ts       # Structured logging
├── test/
│   ├── runner.test.ts
│   ├── store.test.ts
│   └── fixtures/
│       └── sample-skill.md
├── package.json
├── tsconfig.json
├── README.md
└── CHANGELOG.md
### 6.2 Core Module: runner.ts
The hot path. Design carefully.
```typescript
// src/core/runner.ts
import { OpenRouterClient } from './openrouter';
import { Store } from '../db/store';
import { buildPrompt } from './prompts';
import { ulid } from '../utils/ulid';
import { hashContent } from '../utils/hash';
import type { Run, Response, TestOptions } from '../db/types';
export interface TestResult {
run: Run;
responses: Response[];
summary: {
succeeded: number;
failed: number;
totalDuration: number;
totalCost: number;
commonConfusions: string[];
commonSuggestions: string[];
};
}
export async function runTest(
skillPath: string,
skillContent: string,
task: string,
models: string[],
options: TestOptions,
store: Store,
client: OpenRouterClient,
onProgress?: (model: string, status: 'started' | 'completed' | 'failed', result?: Response) => void
): Promise<TestResult> {
// 1. Create run record
const runId = ulid();
const run: Run = {
id: runId,
created_at: new Date().toISOString(),
skill_path: skillPath,
skill_content: skillContent,
skill_hash: hashContent(skillContent),
task: task,
models: JSON.stringify(models),
status: 'running',
};
store.insertRun(run);
// 2. Execute tests in parallel (with concurrency limit)
const responses: Response[] = [];
const concurrency = options.parallel ?? 3;
const executeModel = async (model: string): Promise<Response> => {
const responseId = ulid();
const startedAt = new Date().toISOString();
onProgress?.(model, 'started');
try {
const prompt = buildPrompt(skillContent, task, options.json ?? false);
const startTime = Date.now();
const apiResponse = await client.chat({
model,
messages: prompt.messages,
response_format: options.json ? { type: 'json_object' } : undefined,
timeout: options.timeout ?? 60000,
});
const duration = Date.now() - startTime;
const response: Response = {
id: responseId,
run_id: runId,
model,
status: 'success',
started_at: startedAt,
completed_at: new Date().toISOString(),
duration_ms: duration,
prompt_tokens: apiResponse.usage?.prompt_tokens,
completion_tokens: apiResponse.usage?.completion_tokens,
cost_usd: calculateCost(model, apiResponse.usage),
raw_response: JSON.stringify(apiResponse),
content: apiResponse.choices[0]?.message?.content ?? '',
// Parsed fields populated below if JSON mode
};
// Parse structured output if JSON mode
if (options.json && response.content) {
try {
const parsed = JSON.parse(response.content);
response.task_understood = parsed.task_understood ? 1 : 0;
response.confusions = JSON.stringify(parsed.confusions ?? []);
response.failures = JSON.stringify(parsed.failures ?? []);
response.suggestions = JSON.stringify(parsed.suggestions ?? []);
} catch {
// JSON parsing failed, leave fields null
}
}
onProgress?.(model, 'completed', response);
return response;
} catch (error) {
const response: Response = {
id: responseId,
run_id: runId,
model,
status: 'error',
started_at: startedAt,
completed_at: new Date().toISOString(),
error_code: (error as any)?.code ?? 'UNKNOWN',
error_message: error instanceof Error ? error.message : String(error),
};
onProgress?.(model, 'failed', response);
return response;
}
};
// Execute with concurrency limit
const chunks = [];
for (let i = 0; i < models.length; i += concurrency) {
chunks.push(models.slice(i, i + concurrency));
}
for (const chunk of chunks) {
const chunkResults = await Promise.all(chunk.map(executeModel));
responses.push(...chunkResults);
// Save responses as they complete
for (const response of chunkResults) {
store.insertResponse(response);
}
}
// 3. Update run record
const completedAt = new Date().toISOString();
const succeeded = responses.filter(r => r.status === 'success').length;
const totalCost = responses.reduce((sum, r) => sum + (r.cost_usd ?? 0), 0);
store.updateRun(runId, {
status: succeeded === models.length ? 'completed' : 'partial',
completed_at: completedAt,
duration_ms: Date.now() - new Date(run.created_at).getTime(),
total_cost_usd: totalCost,
});
// 4. Build summary
const summary = buildSummary(responses);
return {
run: { ...run, status: succeeded === models.length ? 'completed' : 'partial', completed_at: completedAt },
responses,
summary,
};
}
function buildSummary(responses: Response[]) {
const succeeded = responses.filter(r => r.status === 'success');
const failed = responses.filter(r => r.status !== 'success');
// Extract and count confusions/suggestions
const confusionCounts = new Map<string, number>();
const suggestionCounts = new Map<string, number>();
for (const r of succeeded) {
// Try to extract from structured JSON first
if (r.confusions) {
try {
const confusions = JSON.parse(r.confusions);
for (const c of confusions) {
const key = typeof c === 'string' ? c : c.issue ?? c.description ?? JSON.stringify(c);
confusionCounts.set(key, (confusionCounts.get(key) ?? 0) + 1);
}
} catch {}
}
if (r.suggestions) {
try {
const suggestions = JSON.parse(r.suggestions);
for (const s of suggestions) {
const key = typeof s === 'string' ? s : s.suggestion ?? s.description ?? JSON.stringify(s);
suggestionCounts.set(key, (suggestionCounts.get(key) ?? 0) + 1);
}
} catch {}
}
}
// Sort by frequency
const commonConfusions = [...confusionCounts.entries()]
.sort((a, b) => b[1] - a[1])
.slice(0, 5)
.map(([issue, count]) => `${issue} (${count}/${succeeded.length} models)`);
const commonSuggestions = [...suggestionCounts.entries()]
.sort((a, b) => b[1] - a[1])
.slice(0, 5)
.map(([suggestion]) => suggestion);
return {
succeeded: succeeded.length,
failed: failed.length,
totalDuration: responses.reduce((sum, r) => sum + (r.duration_ms ?? 0), 0),
totalCost: responses.reduce((sum, r) => sum + (r.cost_usd ?? 0), 0),
commonConfusions,
commonSuggestions,
};
}
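`calculateCost` is called in the runner above but not defined in this document. A minimal sketch, assuming a static per-model price table — the figures below are placeholders, not real OpenRouter prices, and real code should source pricing from the API or a maintained table:

```typescript
// Sketch of calculateCost. Prices are placeholders, expressed per 1M tokens.
const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  'openai/gpt-4o-mini': { input: 0.15, output: 0.6 }, // placeholder figures
};

function calculateCost(
  model: string,
  usage?: { prompt_tokens: number; completion_tokens: number }
): number | undefined {
  const price = PRICE_PER_MTOK[model];
  if (!price || !usage) return undefined; // unknown model: omit cost rather than guess
  return (
    (usage.prompt_tokens * price.input + usage.completion_tokens * price.output) / 1_000_000
  );
}
```

Returning `undefined` for unknown models matches the schema, where `cost_usd` is nullable.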
// src/core/openrouter.ts
export interface ChatRequest {
model: string;
messages: { role: 'system' | 'user' | 'assistant'; content: string }[];
response_format?: { type: 'json_object' };
timeout?: number;
}
export interface ChatResponse {
id: string;
model: string;
choices: {
message: {
role: string;
content: string;
};
finish_reason: string;
}[];
usage?: {
prompt_tokens: number;
completion_tokens: number;
total_tokens: number;
};
}
export class OpenRouterClient {
private apiKey: string;
private baseUrl: string;
private maxRetries: number;
private retryDelay: number;
constructor(options: {
apiKey: string;
baseUrl?: string;
maxRetries?: number;
retryDelay?: number;
}) {
this.apiKey = options.apiKey;
this.baseUrl = options.baseUrl ?? 'https://openrouter.ai/api/v1';
this.maxRetries = options.maxRetries ?? 1;
this.retryDelay = options.retryDelay ?? 2000;
}
async chat(request: ChatRequest): Promise<ChatResponse> {
let lastError: Error | null = null;
for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
// Fresh controller and timer per attempt: an aborted AbortSignal cannot be reused
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), request.timeout ?? 60000);
try {
const response = await fetch(`${this.baseUrl}/chat/completions`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${this.apiKey}`,
'Content-Type': 'application/json',
'HTTP-Referer': 'https://github.com/EmZod/focus-group',
'X-Title': 'Focus Group',
},
body: JSON.stringify({
model: request.model,
messages: request.messages,
response_format: request.response_format,
temperature: 0.3, // Lower temperature for more consistent analysis
}),
signal: controller.signal,
});
clearTimeout(timeout);
if (!response.ok) {
const body = await response.text();
const error = new Error(`OpenRouter API error: ${response.status} ${body}`);
(error as any).code = `HTTP_${response.status}`;
(error as any).status = response.status;
// Don't retry 4xx errors
if (response.status >= 400 && response.status < 500 && response.status !== 429) {
throw error;
}
// Retry 429 and 5xx
lastError = error;
if (attempt < this.maxRetries) {
const delay = response.status === 429 ? 5000 : this.retryDelay;
await sleep(delay);
continue;
}
throw error;
}
return await response.json();
} catch (error) {
clearTimeout(timeout);
if ((error as any)?.name === 'AbortError') {
const timeoutError = new Error(`Request timed out after ${request.timeout ?? 60000}ms`);
(timeoutError as any).code = 'TIMEOUT';
throw timeoutError;
}
lastError = error as Error;
if (attempt < this.maxRetries) {
await sleep(this.retryDelay);
continue;
}
throw error;
}
}
throw lastError;
}
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}

// src/core/prompts.ts
export interface Prompt {
messages: { role: 'system' | 'user'; content: string }[];
}
const SYSTEM_PROMPT_MARKDOWN = `You are a QA tester evaluating tool documentation for AI agents.
You will receive documentation for a tool and a task to attempt using that tool.
Your job:
1. Read the documentation carefully
2. Describe step-by-step exactly how you would complete the task
3. Note anything confusing, ambiguous, or missing from the docs
4. Note what would cause you to fail
5. Suggest specific improvements to the documentation
Be brutally honest. Quote the documentation when something is unclear.
Do not be polite—be useful. Identify real problems.
Format your response as:
## Understanding
[What you understood the tool does and how to use it]
## Approach
[Step-by-step how you would complete the task]
## Confusions
[List anything unclear, with quotes from the doc]
## Potential Failures
[What would cause you to fail this task]
## Suggested Improvements
[Specific changes to the documentation]`;
const SYSTEM_PROMPT_JSON = `You are a QA tester evaluating tool documentation for AI agents.
You will receive documentation for a tool and a task to attempt using that tool.
Your job:
1. Read the documentation carefully
2. Describe step-by-step exactly how you would complete the task
3. Note anything confusing, ambiguous, or missing from the docs
4. Note what would cause you to fail
5. Suggest specific improvements to the documentation
Be brutally honest. Identify real problems.
Respond with JSON in this exact format:
{
"task_understood": boolean,
"understanding": "string - what you understood the tool does",
"steps": ["string - step 1", "string - step 2", ...],
"confusions": [
{"location": "quoted text from doc", "issue": "why it's confusing"}
],
"potential_failures": [
{"step": "which step", "reason": "why it would fail"}
],
"suggestions": ["string - specific improvement 1", ...]
}`;
export function buildPrompt(skillContent: string, task: string, jsonMode: boolean): Prompt {
const systemPrompt = jsonMode ? SYSTEM_PROMPT_JSON : SYSTEM_PROMPT_MARKDOWN;
return {
messages: [
{ role: 'system', content: systemPrompt },
{
role: 'user',
content: `# Tool Documentation
${skillContent}
---
# Task to Complete
${task}`
},
],
};
}

// src/db/store.ts
import Database from 'better-sqlite3';
import { SCHEMA } from './schema';
import type { Run, Response } from './types';
export class Store {
private db: Database.Database;
constructor(dbPath: string) {
this.db = new Database(dbPath);
this.db.pragma('journal_mode = WAL'); // Better concurrent access
this.db.pragma('foreign_keys = ON');
this.initialize();
}
private initialize() {
this.db.exec(SCHEMA);
}
// Runs
insertRun(run: Run): void {
const stmt = this.db.prepare(`
INSERT INTO runs (id, created_at, skill_path, skill_content, skill_hash, task, models, status)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
`);
stmt.run(run.id, run.created_at, run.skill_path, run.skill_content, run.skill_hash, run.task, run.models, run.status);
}
updateRun(id: string, updates: Partial<Run>): void {
const fields = Object.keys(updates);
const values = Object.values(updates);
const setClause = fields.map(f => `${f} = ?`).join(', ');
const stmt = this.db.prepare(`UPDATE runs SET ${setClause} WHERE id = ?`);
stmt.run(...values, id);
}
getRun(id: string): Run | null {
const stmt = this.db.prepare('SELECT * FROM runs WHERE id = ?');
return (stmt.get(id) as Run | undefined) ?? null;
}
getLatestRun(): Run | null {
const stmt = this.db.prepare('SELECT * FROM runs ORDER BY created_at DESC LIMIT 1');
return (stmt.get() as Run | undefined) ?? null;
}
listRuns(options: { limit?: number; skillPath?: string; since?: string }): Run[] {
let sql = 'SELECT * FROM runs WHERE 1=1';
const params: any[] = [];
if (options.skillPath) {
sql += ' AND skill_path = ?';
params.push(options.skillPath);
}
if (options.since) {
sql += ' AND created_at >= ?';
params.push(options.since);
}
sql += ' ORDER BY created_at DESC';
if (options.limit) {
sql += ' LIMIT ?';
params.push(options.limit);
}
const stmt = this.db.prepare(sql);
return stmt.all(...params) as Run[];
}
// Responses
insertResponse(response: Response): void {
const columns = (Object.keys(response) as (keyof Response)[]).filter(k => response[k] !== undefined);
const placeholders = columns.map(() => '?').join(', ');
const values = columns.map(k => response[k]);
const stmt = this.db.prepare(`INSERT INTO responses (${columns.join(', ')}) VALUES (${placeholders})`);
stmt.run(...values);
}
getResponsesForRun(runId: string): Response[] {
const stmt = this.db.prepare('SELECT * FROM responses WHERE run_id = ? ORDER BY model');
return stmt.all(runId) as Response[];
}
// Aggregates
getCostSummary(since?: string): { total: number; byModel: Map<string, { cost: number; count: number }> } {
let sql = 'SELECT model, SUM(cost_usd) as total_cost, COUNT(*) as count FROM responses';
const params: any[] = [];
if (since) {
sql += ' WHERE started_at >= ?';
params.push(since);
}
sql += ' GROUP BY model';
const stmt = this.db.prepare(sql);
const rows = stmt.all(...params) as { model: string; total_cost: number; count: number }[];
const byModel = new Map<string, { cost: number; count: number }>();
let total = 0;
for (const row of rows) {
byModel.set(row.model, { cost: row.total_cost ?? 0, count: row.count });
total += row.total_cost ?? 0;
}
return { total, byModel };
}
close(): void {
this.db.close();
}
}

Every decision is logged, using structured logging:
// src/utils/logger.ts
export type LogLevel = 'debug' | 'info' | 'warn' | 'error';
export interface LogEntry {
level: LogLevel;
message: string;
timestamp: string;
[key: string]: any;
}
class Logger {
private verbose: boolean = false;
setVerbose(verbose: boolean) {
this.verbose = verbose;
}
private log(level: LogLevel, message: string, data?: Record<string, any>) {
if (level === 'debug' && !this.verbose) return;
const entry: LogEntry = {
level,
message,
timestamp: new Date().toISOString(),
...data,
};
// In verbose mode, output structured JSON
if (this.verbose) {
console.error(JSON.stringify(entry));
} else if (level === 'error') {
console.error(`Error: ${message}`);
} else if (level === 'warn') {
console.error(`Warning: ${message}`);
}
}
debug(message: string, data?: Record<string, any>) { this.log('debug', message, data); }
info(message: string, data?: Record<string, any>) { this.log('info', message, data); }
warn(message: string, data?: Record<string, any>) { this.log('warn', message, data); }
error(message: string, data?: Record<string, any>) { this.log('error', message, data); }
}
export const logger = new Logger();

| Event | Level | Data |
|---|---|---|
| Run started | info | { run_id, skill_path, task, models } |
| Model request started | debug | { run_id, model } |
| Model request completed | debug | { run_id, model, duration_ms, tokens } |
| Model request failed | warn | { run_id, model, error_code, error_message } |
| Retry attempted | debug | { run_id, model, attempt, delay_ms } |
| Run completed | info | { run_id, succeeded, failed, duration_ms, cost } |
| Config loaded | debug | { path, models_count } |
| Config not found, using defaults | info | { path } |
| Database opened | debug | { path } |
| Database migration applied | info | { version } |
For a CLI tool, we don't need runtime metrics. The database IS our metrics store.
Queries to answer operational questions:
-- Average response time by model
SELECT model, AVG(duration_ms) as avg_ms, COUNT(*) as count
FROM responses WHERE status = 'success'
GROUP BY model;
-- Error rate by model
SELECT model,
SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) as success,
SUM(CASE WHEN status != 'success' THEN 1 ELSE 0 END) as failure
FROM responses GROUP BY model;
-- Runs per day
SELECT DATE(created_at) as date, COUNT(*) as runs
FROM runs GROUP BY DATE(created_at);
-- Cost per day
SELECT DATE(created_at) as date, SUM(total_cost_usd) as cost
FROM runs GROUP BY DATE(created_at);

# Via npm
npm install -g @emzod/focus-group
# Via bun
bun install -g @emzod/focus-group
# From source
git clone https://github.com/EmZod/focus-group
cd focus-group
bun install
bun run build
bun link

# Set API key
export OPENROUTER_API_KEY="sk-or-..."
# Create default config (optional)
focus-group init
# Run first test
focus-group test ./SKILL.md "Complete this task"

On first run:
- Creates `~/.local/share/focus-group/` directory
- Creates `focus-group.db` with schema
- Creates `~/.config/focus-group/config.toml` if not exists (with `init`)
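These locations follow the XDG base-directory convention. A sketch of the path resolution, assuming hypothetical helper names (the plan puts this logic in `src/utils/paths.ts`, and the `FOCUS_GROUP_DATA` override is what the integration tests rely on):

```typescript
import { join } from 'node:path';
import { homedir } from 'node:os';

// Hypothetical helpers; the real module is src/utils/paths.ts per the plan.
// FOCUS_GROUP_DATA lets tests point the CLI at a throwaway directory.
function dataDir(): string {
  if (process.env.FOCUS_GROUP_DATA) return process.env.FOCUS_GROUP_DATA;
  const base = process.env.XDG_DATA_HOME ?? join(homedir(), '.local', 'share');
  return join(base, 'focus-group');
}

function configDir(): string {
  const base = process.env.XDG_CONFIG_HOME ?? join(homedir(), '.config');
  return join(base, 'focus-group');
}
```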
Problem: "API key invalid"
# Check key is set
echo $OPENROUTER_API_KEY
# Test directly
curl https://openrouter.ai/api/v1/models \
-H "Authorization: Bearer $OPENROUTER_API_KEY"

Problem: "Model not found"
# List available models
curl https://openrouter.ai/api/v1/models \
-H "Authorization: Bearer $OPENROUTER_API_KEY" | jq '.data[].id'

Problem: "Database locked"
# Check for other processes
lsof ~/.local/share/focus-group/focus-group.db
# If corrupted, delete and re-run (loses history)
rm ~/.local/share/focus-group/focus-group.db

Problem: "Results look wrong"
# View raw API response
focus-group show <run-id> --raw
# Check what was sent
focus-group show <run-id> --verbose

There's no deployment. If something is broken:
- Downgrade the package: `npm install -g @emzod/focus-group@previous-version`
- Delete corrupt database: `rm ~/.local/share/focus-group/focus-group.db` (loses history)
- Reset config: `rm ~/.config/focus-group/config.toml && focus-group init`
# Backup database
cp ~/.local/share/focus-group/focus-group.db ~/backups/focus-group-$(date +%Y%m%d).db
# Export all runs as JSON
focus-group history --json > ~/backups/focus-group-history.json

// test/runner.test.ts
import { describe, it, expect, mock } from 'bun:test';
import { runTest } from '../src/core/runner';
describe('runTest', () => {
it('creates run record before making API calls', async () => {
const store = mockStore();
const client = mockClient({ response: mockSuccessResponse() });
await runTest('skill.md', 'content', 'task', ['model1'], {}, store, client);
expect(store.insertRun).toHaveBeenCalledBefore(client.chat);
});
it('continues with other models if one fails', async () => {
const store = mockStore();
const client = mockClient({
'model1': { error: new Error('failed') },
'model2': { response: mockSuccessResponse() },
});
const result = await runTest('skill.md', 'content', 'task', ['model1', 'model2'], {}, store, client);
expect(result.summary.succeeded).toBe(1);
expect(result.summary.failed).toBe(1);
});
it('respects concurrency limit', async () => {
const store = mockStore();
const client = mockClient({ response: mockSuccessResponse() });
await runTest('skill.md', 'content', 'task', ['m1', 'm2', 'm3', 'm4'], { parallel: 2 }, store, client);
// Should have made 2 batches of 2, not 4 concurrent
expect(client.maxConcurrent).toBe(2);
});
it('does not retry 4xx errors', async () => {
const client = mockClient({ error: httpError(400) });
await expect(client.chat({})).rejects.toThrow();
expect(client.attempts).toBe(1); // No retry
});
it('retries 5xx errors once', async () => {
const client = mockClient({
attempts: [httpError(500), mockSuccessResponse()]
});
const result = await client.chat({});
expect(result).toBeDefined();
expect(client.attempts).toBe(2);
});
});

// test/integration.test.ts
import { describe, it, expect, beforeEach, afterEach } from 'bun:test';
import { execSync } from 'child_process';
import { mkdtempSync, writeFileSync, rmSync } from 'fs';
import { join } from 'path';
describe('CLI integration', () => {
let tmpDir: string;
beforeEach(() => {
tmpDir = mkdtempSync('/tmp/focus-group-test-');
process.env.FOCUS_GROUP_DATA = tmpDir;
});
afterEach(() => {
rmSync(tmpDir, { recursive: true });
});
it('focus-group test creates run and responses', () => {
const skillPath = join(tmpDir, 'skill.md');
writeFileSync(skillPath, '# Test Skill\n\nDo the thing.');
// This requires a real API key - skip in CI
if (!process.env.OPENROUTER_API_KEY) {
console.log('Skipping: OPENROUTER_API_KEY not set');
return;
}
const output = execSync(
`focus-group test "${skillPath}" "Use this tool" -m openai/gpt-4o-mini --json`,
{ encoding: 'utf-8' }
);
const result = JSON.parse(output);
expect(result.run.status).toBe('completed');
expect(result.responses).toHaveLength(1);
});
it('focus-group history shows previous runs', () => {
// ... run a test first ...
const output = execSync('focus-group history --json', { encoding: 'utf-8' });
const runs = JSON.parse(output);
expect(runs).toHaveLength(1);
});
});

<!-- test/fixtures/sample-skill.md -->
# Sample Tool
A tool for testing.
## Usage
```bash
sample-tool do-thing --option value
```

| Option | Description |
|---|---|
| --option | Does something |
---
## Part 10: Implementation Steps
### Phase 1: Foundation (Day 1)
1. **Initialize project**

   ```bash
   mkdir focus-group && cd focus-group
   bun init
   ```

2. **Set up dependencies**

   ```json
   {
     "dependencies": {
       "commander": "^12.0.0",
       "better-sqlite3": "^9.0.0",
       "toml": "^3.0.0"
     },
     "devDependencies": {
       "@types/better-sqlite3": "^7.0.0",
       "typescript": "^5.0.0"
     }
   }
   ```

3. **Create directory structure** (as specified in 6.1)

4. **Implement core modules:**
   - `src/utils/ulid.ts`
   - `src/utils/hash.ts`
   - `src/utils/paths.ts`
   - `src/utils/logger.ts`

5. **Implement database:**
   - `src/db/schema.ts`
   - `src/db/types.ts`
   - `src/db/store.ts`

6. **Implement OpenRouter client:** `src/core/openrouter.ts`
   - Include retry logic
   - Include timeout handling

7. **Implement prompts:** `src/core/prompts.ts`
   - Both markdown and JSON formats

8. **Implement runner:** `src/core/runner.ts`
   - Parallel execution
   - Progress callbacks

9. **Write tests for core:**
   - `test/openrouter.test.ts`
   - `test/runner.test.ts`
   - `test/store.test.ts`

10. **Implement config loading:**
    - `src/config/defaults.ts`
    - `src/config/loader.ts`
    - `src/config/types.ts`

11. **Implement output formatting:**
    - `src/output/formatter.ts`
    - `src/output/table.ts`
    - `src/output/colors.ts`

12. **Implement commands:**
    - `src/commands/test.ts` (highest priority)
    - `src/commands/show.ts`
    - `src/commands/history.ts`
    - `src/commands/config.ts`

13. **Wire up CLI:** `src/cli.ts`

14. **Implement remaining commands:**
    - `src/commands/diff.ts`
    - `src/commands/cost.ts`

15. **Add integration tests**

16. **Documentation:**
    - `README.md`
    - `CHANGELOG.md`

17. **Package configuration:**
    - `package.json` `bin` field
    - Build script
    - Publish script

18. **Test on real skill docs:**
    - speak-tts SKILL.md
    - audio-production SKILL.md

19. **Fix issues found in testing**

20. **Publish:**
    - `npm publish`
    - Create GitHub release
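The OpenRouter client's retry and timeout handling, and the behavior the unit tests assert (retry 5xx once with backoff, never retry 4xx, 60s timeout), could look roughly like this. A sketch only: `fetchWithRetry` and its options are assumed names, not the shipped API.

```typescript
// Sketch of the retry wrapper for the OpenRouter client (assumed names).
// Retries 5xx with exponential backoff; never retries 4xx; enforces a
// per-call timeout via AbortSignal.timeout.
async function fetchWithRetry(
  url: string,
  init: RequestInit = {},
  { retries = 1, baseDelayMs = 1000, timeoutMs = 60_000 } = {}
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, { ...init, signal: AbortSignal.timeout(timeoutMs) });
    if (res.ok) return res;
    // 4xx is a caller error: retrying won't help, fail immediately
    if (res.status < 500 || attempt >= retries) {
      throw new Error(`OpenRouter request failed: HTTP ${res.status}`);
    }
    // 5xx: back off, then retry
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
  }
}
```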
| Feature | Why Not |
|---|---|
| Web UI | CLI is sufficient for primary user (developer). Web adds auth, hosting, complexity. |
| Orchestrator agent | User knows what to test better than an LLM. Adds complexity without clear value. |
| Auto-synthesis | Reading 3-5 model responses is fast. Synthesis is hard to get right. |
| CI integration | Can be built as separate tool that calls this CLI. Not core functionality. |
| Model comparison rankings | Not the point. Finding doc problems is the point. |
| Caching | Tests should be fresh. Historical data is in DB. No cache needed. |
| Background jobs | Everything is synchronous and fast enough. No queue needed. |
| Feature | Trigger |
|---|---|
| Skill doc rewriter | Core v2 feature — synthesize model feedback into improved skill doc with optimized progressive disclosure |
| Structured diff of confusions | After 50+ runs, if manual comparison is tedious |
| Model performance tracking | If you need to justify model choices |
| Export to markdown report | If you need to share results with non-CLI users |
| Custom prompt templates | If default prompts don't fit certain tool types |
- Clear owner for each piece of state (CLI owns DB, user owns config)
- State inventory documented (Part 2.1)
- Schema designed for actual queries (Part 3)
- Indexes planned for high-volume queries (Part 3.1)
- Slow operations handled appropriately (parallel API calls)
- No background jobs needed (synchronous CLI)
- Killswitch: Ctrl+C aborts
- No events needed (single process)
- Timeouts set for all external calls (60s default)
- Retry logic appropriate (backoff, no retry on 4xx)
- Fail-open on individual model failure (continue with others)
- Degradation: partial results are useful
- Unhappy paths logged with context (Part 7.2)
- Decisions logged (model started, completed, failed)
- Metrics via DB queries (Part 7.3)
- No distributed tracing needed (single process)
- Hot path identified: `focus-group test` (Part 2.2)
- Hot path designed first (Part 6.2)
- Hot path has tests (Part 9.1)
- How is it deployed? npm install (Part 8.1)
- How is it rolled back? npm install previous version (Part 8.4)
- How do you know it's broken? Command fails with error (Part 8.3)
- How do you debug? --verbose flag, --raw flag (Part 8.3)
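The fail-open and bounded-parallelism items above (continue when one model fails, respect the configured concurrency limit) reduce to a small batching helper; a sketch under assumed names:

```typescript
// Sketch: run at most `parallel` tasks at a time. Promise.allSettled
// keeps going when a task rejects, so one model's failure doesn't
// abort the rest (fail-open, partial results preserved).
async function runBatched<T>(
  tasks: Array<() => Promise<T>>,
  parallel: number
): Promise<PromiseSettledResult<T>[]> {
  const results: PromiseSettledResult<T>[] = [];
  for (let i = 0; i < tasks.length; i += parallel) {
    const batch = tasks.slice(i, i + parallel).map((start) => start());
    results.push(...(await Promise.allSettled(batch)));
  }
  return results;
}
```

Plain batching (wait for the whole batch before starting the next) matches the unit test's expectation of two batches of two; a slot-refilling limiter would be a later optimization, not a requirement.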
This plan follows Goedecke's principles:
- Boring primitives: SQLite, fetch, Commander.js. Nothing novel.
- Minimal state: One DB file. That's it.
- One owner: CLI writes to DB. User writes to config. Clear boundaries.
- Hot path first: `test` command designed in detail. Others are simple queries.
- Failure modes explicit: Table in Part 2.3 covers every dependency.
- Operational answers: Installation, debugging, rollback all documented.
- No premature complexity: No caching, no queues, no events, no web UI.
The goal is a tool you install once and forget how it works. You just run `focus-group test` when you need feedback on your skill docs.