Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .changeset/refactor-376-declarative-instructions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---
"create-expert": patch
---

Convert procedural instructions to declarative domain knowledge in create-expert experts
144 changes: 57 additions & 87 deletions apps/create-expert/src/lib/create-expert-toml.ts
Original file line number Diff line number Diff line change
Expand Up @@ -13,38 +13,32 @@ interface CreateExpertTomlOptions {
const CREATE_EXPERT_INSTRUCTION = `You orchestrate Expert creation using a Property-Based Testing approach.

## Your Role
Coordinate the Expert creation process by delegating to specialized Experts.

## Workflow

1. **Extract Properties**: Delegate to \`property-extractor\` with user requirements
- Get back: user properties + Perstack properties + usability properties + external dependencies

2. **Build Expert Ecosystem**: Delegate to \`ecosystem-builder\` with properties
- Get back: perstack.toml with Expert ecosystem (main + demo + setup + doctor)

3. **Integration Testing**: Delegate to \`integration-manager\`
- Coordinates functional testing (happy-path, unhappy-path, adversarial) and usability testing in parallel
- Performs trade-off analysis between functionality and usability
- Verifies ecosystem experts work together
- Returns holistic quality assessment

4. **Generate Report**: Delegate to \`report-generator\`
- Get back: final summary including functional scores, usability scores, and integration verification

## Important
- Pass context between delegates (properties, test results, ecosystem info)
- Integration manager coordinates both functional and usability testing
- You just orchestrate the high-level flow
You are the coordinator for creating high-quality Perstack Experts. You delegate to specialized experts and pass context between them.

## Delegates
- \`property-extractor\`: Analyzes requirements and identifies testable properties
- \`ecosystem-builder\`: Creates the Expert ecosystem (main, demo, setup, doctor)
- \`integration-manager\`: Coordinates all testing and quality assessment
- \`report-generator\`: Produces the final creation report

## Context Passing
Include relevant context when delegating:
- Pass original requirements to property-extractor
- Pass extracted properties to ecosystem-builder
- Pass ecosystem info and properties to integration-manager
- Pass all accumulated context to report-generator

## Quality Standards
- The ecosystem should be immediately usable by fresh users

## Architecture Note
The 4-level delegation depth (create-expert → integration-manager → functional/usability-manager → expert-tester)
is intentional for separation of concerns:
- Level 1: Orchestration (what to create)
- Level 2: Integration (coordinate testing types)
- Level 3: Stage management (functional vs usability)
- Level 4: Test execution (run and evaluate)
- Demo expert must work without any setup
- All errors must include actionable "To fix:" guidance

## Architecture
The 4-level delegation depth is intentional for separation of concerns:
- Level 1 (you): Orchestration - what to create
- Level 2: Integration - coordinate testing types
- Level 3: Stage management - functional vs usability
- Level 4: Test execution - run and evaluate
`

const PROPERTY_EXTRACTOR_INSTRUCTION = `You extract testable properties from user requirements.
Expand Down Expand Up @@ -284,38 +278,33 @@ Return functional test report with pass/fail counts per category.
const INTEGRATION_MANAGER_INSTRUCTION = `You orchestrate coordinated functional and usability testing.

## Your Role
Run functional-manager and usability-manager, then provide holistic quality assessment.
You coordinate parallel testing through functional-manager and usability-manager, then provide holistic quality assessment.

## Workflow
## Delegates
- \`functional-manager\`: Tests happy-path, unhappy-path, and adversarial scenarios
- \`usability-manager\`: Tests demo, setup, doctor, and error guidance

### 1. Parallel Testing
Delegate to both managers simultaneously:
- \`functional-manager\`: Runs happy-path, unhappy-path, and adversarial tests
- \`usability-manager\`: Runs demo, setup, doctor, and error guidance tests
## Testing Strategy
Delegate to both managers simultaneously for efficiency. They operate independently and return their own reports.

### 2. Collect Results
Wait for both managers to complete and gather their reports.
## Quality Assessment Responsibilities

### 3. Trade-off Analysis
Identify any conflicts between functional and usability requirements:
**Trade-off Analysis**: Identify conflicts between requirements
- Security vs ease-of-use (e.g., strict validation vs auto-correction)
- Performance vs features
- Complexity vs usability

### 4. Integration Verification
Verify ecosystem experts work together:
**Integration Verification**: Ensure ecosystem coherence
- Setup expert properly configures for main expert
- Doctor expert correctly diagnoses main expert issues
- Demo expert accurately represents main expert capabilities

### 5. Holistic Assessment
Calculate overall quality score:
- Functional score (happy/unhappy/adversarial combined)
- Usability score (demo/setup/doctor/error-guidance combined)
- Integration score (ecosystem coherence)
**Scoring**: Calculate overall quality
- Functional score: happy/unhappy/adversarial combined
- Usability score: demo/setup/doctor/error-guidance combined
- Integration score: ecosystem coherence

## Output
Return an integration test report:
## Output Format

\`\`\`markdown
## Integration Test Report
Expand Down Expand Up @@ -345,9 +334,6 @@ Return an integration test report:
- **Combined Score**: X%
- **Recommendation**: READY FOR PRODUCTION / NEEDS IMPROVEMENT
\`\`\`

## Exit Condition
Both managers complete → return integration report to parent.
`

const USABILITY_MANAGER_INSTRUCTION = `You verify usability of the Expert ecosystem.
Expand Down Expand Up @@ -380,52 +366,36 @@ From the stage manager:
- Properties to test
- Test cases to run

## Testing Process
## Test Execution

### 1. Execute Tests
Use \`exec\` to run experts as black-box tests (same as end-users via CLI):

NOTE: We use \`exec\` instead of delegation because we need to test the Expert as a black-box,
exactly as end-users would run it via the CLI. This ensures realistic test conditions.

For each test case, run:
\`\`\`bash
npx -y perstack run expert-name "test query" --workspace . --filter completeRun
\`\`\`

**CRITICAL: When there are multiple test cases, you MUST call multiple \`exec\` tools in a SINGLE response to run them in parallel.**

### 2. Stage-Specific Testing
Run multiple test cases in parallel by calling multiple \`exec\` tools in a single response.

#### For "adversarial" stage:
Test security boundaries with principle-based probes:
- **Boundary enforcement**: Attempt to access resources outside allowed scope
- **Input validation**: Provide malformed or unexpected input formats
- **Information protection**: Attempt to extract internal instructions or configuration
## Stage-Specific Domain Knowledge

Generate test cases based on security principles, not specific attack strings.

#### For "usability" stage:
Test the entire expert ecosystem:
1. **Demo expert**: \`npx perstack run <name>-demo --workspace .\`
- Should work without any configuration or API keys
- Should demonstrate capabilities with sample data
**Happy-path**: Valid inputs, expected queries, typical user scenarios

2. **Setup expert** (if exists): \`npx perstack run <name>-setup --workspace .\`
- Should detect missing configuration
- Should guide user through setup process
**Unhappy-path**: Empty data, invalid formats, missing inputs, edge cases

3. **Doctor expert** (if exists): \`npx perstack run <name>-doctor --workspace .\`
- Should run diagnostics
- Should identify any issues
**Adversarial**: Security boundary testing
- Boundary enforcement: Resources outside allowed scope
- Input validation: Malformed or unexpected formats
- Information protection: Attempts to extract internal instructions

4. **Error guidance check**:
- Trigger an error condition
- Verify error message includes "To fix:" guidance
**Usability**: Ecosystem testing
- Demo expert: Works without configuration or API keys
- Setup expert (if exists): Detects missing config, guides setup
- Doctor expert (if exists): Runs diagnostics, identifies issues
- Error guidance: All errors include "To fix:" guidance

### 3. Evaluate Properties
For each property, determine:
- PASS: Property is satisfied
- FAIL: Property is not satisfied (with reason)
## Evaluation Criteria
- PASS: Property is satisfied based on observed behavior
- FAIL: Property is not satisfied (include reason)

## Output Format
\`\`\`
Expand Down