8 changes: 4 additions & 4 deletions README.md
@@ -19,9 +19,8 @@ If you want to build practical agentic apps like Claude Code or OpenClaw, a harn

Perstack is a harness for agentic apps. It aims to:

- **Do big things with small models**: If a smaller model can do the job, there's no reason to use a bigger one.
- **Quality is a system property, not a model property**: Building agentic software people actually use doesn't require an AI science degree—just a solid understanding of the problems you're solving.
- **Keep your app simple and reliable**: The harness is inevitably complex—Perstack absorbs that complexity so your agentic app doesn't have to.
- **Quality is a system property, not a model property**: The harness provides hard signals — compiler output, test results, screenshot diffs — that let agents detect and fix their own mistakes. You provide the domain knowledge that defines what "correct" means. The combination sets both the floor and the ceiling.
- **Do big things with small models**: When the system handles verification, a focused agent on an affordable model outperforms a generalist on a frontier one.

## Getting Started

@@ -44,7 +43,7 @@ docker run --pull always --rm -it \
"Form a team named ai-gaming to build a Bun-based CLI indie game playable on Bash for AI."
```

`create-expert` is a built-in expert. It generates a `perstack.toml` that defines a team of micro-agents, runs them, evaluates the results, and iterates until the setup works. Each agent has a single responsibility and its own context window. Complex tasks are broken down and delegated to specialists.
`create-expert` is a built-in expert that embodies the hard signal approach. It generates a `perstack.toml`, runs the generated experts against a test query, then verifies the results using deterministic checks — compiler output, test pass/fail, structural validation — not LLM judgment. A dedicated verifier expert runs each check twice to confirm reproducibility. The cycle repeats until all signals pass. Each agent has a single responsibility and its own context window.

```toml
[experts."ai-gaming"]
@@ -199,6 +198,7 @@ Perstack is a harness for micro-agents — purpose-specific agents with a single
- **Cost-Effective**: Purpose-specific experts are designed to run on affordable models. A focused agent with the right domain knowledge on a cheap model outperforms a generalist on an expensive one.
- **Fast**: Smaller models generate faster. Fine-grained tasks broken into delegates run concurrently via parallel delegation.
- **Maintainable**: A monolithic system prompt is like refactoring without tests — every change risks breaking something. Single-responsibility experts are independently testable. Test each one, then compose them.
- **Verifiable**: When each agent has a single responsibility, its output is a discrete artifact that can be verified independently — by a compiler, a test suite, or a screenshot diff. This is what makes quality a system property: the system provides verification that does not depend on LLM judgment.

## Prerequisites

7 changes: 7 additions & 0 deletions docs/guides/going-to-production.md
@@ -131,6 +131,8 @@ docker run --rm ... my-expert "assistant" "query" | your-log-collector

You get full execution traces without any instrumentation code.

Production observability is not just logging — it is the infrastructure for [hard signal](../understanding-perstack/hard-signals.md) verification. The JSON event stream is deterministic: the same execution always produces the same events. An external verifier can process these events, compare outputs against baselines, and produce pass/fail signals without any LLM involvement. This is what separates production-grade agent monitoring from hope-based logging.
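
An external verifier over that stream can be sketched in a few lines of shell. The event field names and file layout below are illustrative assumptions, not Perstack's actual event schema:

```shell
# Sketch of baseline verification over a captured JSON event stream.
# baseline.jsonl is a known-good run; events.jsonl is the run under test.
# The event shape shown here is an assumption for illustration only.
cat > baseline.jsonl <<'EOF'
{"type":"result","output":"done"}
EOF
cat > events.jsonl <<'EOF'
{"type":"result","output":"done"}
EOF
# Because identical executions produce identical events, a byte-level diff
# of the two streams is itself a hard pass/fail signal — no LLM involved.
if diff baseline.jsonl events.jsonl >/dev/null; then
  verdict=PASS
else
  verdict=FAIL
fi
echo "$verdict"
```

In a real pipeline the two files would come from your log collector rather than heredocs; the comparison step stays the same.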

> [!NOTE]
> Events are also written to `workspace/perstack/` as checkpoints. You can replay any execution for debugging or auditing.
@@ -173,6 +175,11 @@ The Expert operates freely within its container. Your infrastructure controls wh
- [ ] Execution time limits (via container timeout)
- [ ] Workspace size limits

**Verification loop**:
- [ ] Expert outputs verified by [hard signals](../understanding-perstack/hard-signals.md) (tests, compilation, diffs)
- [ ] Verification independent of the LLM that generated the output
- [ ] Verification procedure is deterministic and reproducible

## Scaling patterns

**Job queue**: Push queries to a queue, workers pull and execute in containers.
3 changes: 3 additions & 0 deletions docs/guides/rapid-prototyping.md
@@ -98,6 +98,8 @@ npx perstack start create-expert "The reviewer missed the SQL injection in the r

This is the feedback loop that matters: **write a scenario the agent should handle, test it, fix the instruction when it fails, repeat.** By the time you build the app around it, you already know what the agent can and can't do.

This feedback loop is powerful but relies on your judgment — a [soft signal](../understanding-perstack/hard-signals.md). As the prototype matures, convert manual observations into hard signals: write an automated test that runs the Expert's output. If the Expert generates code, compile it and run its test suite. If it generates a document, validate it against a schema. The earlier you introduce hard signals, the earlier you stop oscillating and start converging.
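
A first hard signal can be as small as a script that checks the Expert's artifact for required structure. The path `out/report.json` and its fields below are hypothetical stand-ins:

```shell
# Hypothetical hard-signal test for a document-generating Expert: assert
# required fields exist instead of eyeballing the output. The path and
# field names are assumptions for illustration, not a Perstack convention.
mkdir -p out
printf '{"title":"Q3 report","status":"final"}\n' > out/report.json  # stand-in for the Expert's output
status=PASS
for field in title status; do
  grep -q "\"$field\"" out/report.json || status=FAIL
done
echo "$status"
```

A schema validator would be stricter than `grep`, but even this crude check converts "it looked right to me" into a reproducible pass/fail.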

## Evaluate with others

At some point you need feedback beyond your own testing. `perstack start` makes this easy — hand someone the `perstack.toml` and they can run the Expert themselves:
@@ -126,6 +128,7 @@ At some point, your prototype will need more. The same `perstack.toml` scales

- **The agent needs tools** — search the web, query a database, call an API → [Extending with Tools](./extending-with-tools.md)
- **The prompt is getting long** — split into multiple Experts that collaborate → [Taming Prompt Sprawl](./taming-prompt-sprawl.md)
- **Verification is manual** — build hard signal loops that run automatically → [Testing Experts](../making-experts/testing.md), [Hard Signals](../understanding-perstack/hard-signals.md)
- **The prototype works** — embed it into your application → [Adding AI to Your App](./adding-ai-to-your-app.md)

## What's next
5 changes: 5 additions & 0 deletions docs/guides/taming-prompt-sprawl.md
@@ -165,6 +165,10 @@ The lesson: when a task requires multiple areas of expertise, splitting into spe

With 50-100 lines of instruction instead of 500+, each Expert operates within its attention budget. The model can actually follow all your instructions — because there aren't that many to follow.

### Each Expert's output becomes independently verifiable

When you split a monolith into delegates, you gain something beyond attention management: each delegate's output becomes a discrete artifact that can be verified independently. A monolithic agent produces one giant output that can only be evaluated holistically — "does the overall result look right?" is a [soft signal](../understanding-perstack/hard-signals.md). A team of focused delegates produces individual outputs, each of which can be checked against ground truth by the coordinator or by external tooling — "did each component pass its specific test?" is a [hard signal](../understanding-perstack/hard-signals.md). This transforms verification from judgment-dependent to judgment-independent.
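
The per-component idea can be sketched as a loop over check scripts — one per delegate artifact. The component names and scripts below are illustrative only:

```shell
# Illustrative only: each delegate's artifact gets its own check script, so a
# failure is localized to one component instead of "the overall result".
mkdir -p checks
for c in parser renderer storage; do
  printf 'exit 0\n' > "checks/$c.sh"   # stand-in for a real per-component test
done
overall=PASS
for c in parser renderer storage; do
  if sh "checks/$c.sh"; then
    echo "$c: PASS"
  else
    echo "$c: FAIL"
    overall=FAIL
  fi
done
echo "overall: $overall"
```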

## How delegation works

When you define `delegates`, Perstack presents them as callable tools:
@@ -188,6 +192,7 @@ The coordinator reads the customer's message, decides which specialist to call,
- You can describe what each Expert does in one sentence
- Experts don't need to know about each other's internal logic
- You can test each Expert independently
- Each Expert's output can be verified by a process independent of the LLM (compiler, test suite, screenshot diff)

## What's next

1 change: 1 addition & 0 deletions docs/index.mdx
@@ -15,6 +15,7 @@ Perstack is a harness for micro-agents.

- [**Getting Started**](/docs/getting-started/walkthrough/) — create your first expert and walk through the core workflow
- [**Concepts**](/docs/understanding-perstack/concept/) — understand the architecture behind experts, runtime, isolation, and the boundary model
- [**Hard Signals**](/docs/understanding-perstack/hard-signals/) — the design philosophy behind output quality: why hard verification signals make agents reliable

### Build

57 changes: 42 additions & 15 deletions docs/making-experts/best-practices.md
@@ -7,7 +7,7 @@ sidebar:
These principles help you avoid common pitfalls in agent development: monoliths, complexity explosions, debugging nightmares, and fragile systems. Building a large agent head-on almost always fails.

> [!NOTE]
> The key insight: we tend to over-control, but LLMs work best when you trust their reasoning and define goals rather than procedures.
> The key insight: we tend to over-control, but LLMs work best when you trust their reasoning and define goals rather than procedures. These principles are grounded in the [Hard Signal Framework](../understanding-perstack/hard-signals.md) — the design philosophy behind Perstack's architecture.

## The Five Principles

@@ -105,31 +105,58 @@ Modular Experts unlock collaboration — between Experts, and between people. Th

## Keep It Verifiable

**Pitfall**: Instructions that only the author can understand.
**Pitfall**: Experts whose output can only be checked by another LLM.

If others can't verify what an Expert does, it's neither safe nor reusable.
If the only way to verify an Expert's output is to have another model (or the same model) review it, the verification loop is [soft](../understanding-perstack/hard-signals.md). The agent will oscillate — "looks good" one iteration, "has issues" the next — without converging.

**Bad** — A third party can't verify what this Expert actually does:
**Bad** — Verification relies solely on LLM judgment:
```toml
instruction = """
Handle expense reports appropriately.
Use your judgment for edge cases.
"""
[experts."code-generator"]
instruction = "Generate TypeScript code for the given task."
delegates = ["code-reviewer"]

[experts."code-reviewer"]
instruction = "Review the generated code for correctness and style."
```

**Good** — Anyone reading this knows exactly what to expect:
The reviewer uses the same kind of judgment as the generator. It can miss the same bugs, approve the same anti-patterns, and disagree with itself across runs. As the sole gate, this is a soft signal loop — the system oscillates.

**Good** — Hard signals as the final authority, with an optional soft gate for semantic checks:
```toml
[experts."builder"]
delegates = ["code-writer", "reviewer", "verifier"]

[experts."reviewer"]
description = "Checks whether the code reflects the requirements. Returns PASS or CONTINUE."
instruction = """
You are an expense report reviewer.
Read the requirements and the generated code.
Check whether each requirement is addressed. Flag omissions.
Do NOT evaluate code quality — that is the verifier's job.
"""

[experts."reviewer".skills."@perstack/base"]
type = "mcpStdioSkill"
command = "npx"
packageName = "@perstack/base"
pick = ["readTextFile", "attemptCompletion"]

Approval rules:
- Under $100: Auto-approve with receipt
- $100-$500: Approve if business purpose is clear
- Over $500: Flag for manager review
[experts."verifier"]
description = "Executes hard signal checks against the code. Returns PASS or CONTINUE with specific failures."
instruction = """
Run the verification commands. Compare actual output against expected.
Report pass/fail per check. Do NOT read the code and form opinions.
"""

[experts."verifier".skills."@perstack/base"]
type = "mcpStdioSkill"
command = "npx"
packageName = "@perstack/base"
pick = ["readTextFile", "exec", "attemptCompletion"]
```

If someone else can't read your Expert and predict its behavior, it's not verifiable.
The reviewer (soft gate) catches semantic misalignment early — "does the code address the requirements?" is a qualitative judgment that only an LLM can make. The verifier (hard gate) provides the final pass/fail — compiler errors, test failures, and structural checks that are deterministic and independent of LLM judgment. The reviewer has no `exec`; the verifier has `exec`. Neither replaces the other. See [combining soft and hard signals](../understanding-perstack/hard-signals.md#combining-soft-and-hard-signals) for the full pattern.

"Verifiable" means the Expert's output is ultimately checked by a process that does not depend on LLM judgment. Soft signals can supplement hard signals — catching semantic drift that no compiler can detect — but the **final gate must be hard**. When designing an Expert, ask: **what hard signal can verify this Expert's output?** If the only answer is "another LLM reads it," look for something harder — a compiler, a test suite, a schema validator, a screenshot diff.

---

96 changes: 96 additions & 0 deletions docs/making-experts/examples.md
@@ -6,12 +6,108 @@ sidebar:

Patterns for defining Experts. Each example highlights a specific skill type or integration approach.

- [Hard Signal Verification](#hard-signal-verification)
- [GitHub Issue Bot](#github-issue-bot)
- [Web Search](#web-search)
- [Custom MCP Server](#custom-mcp-server)
- [Interactive Wizard](#interactive-wizard)
- [Application Integration](#application-integration)

## Hard Signal Verification

**Pattern**: Combine a soft review gate with hard signal verification in the delegation tree. The reviewer catches semantic issues early; the verifier provides the final deterministic pass/fail. See [combining soft and hard signals](../understanding-perstack/hard-signals.md#combining-soft-and-hard-signals) for the full rationale.

```toml
# Delegation Tree
#
# app-builder — coordinator: build → review → verify cycle
# ├── @app-builder/build — writes code to workspace
# ├── @app-builder/review — checks requirements alignment (soft gate)
# └── @app-builder/verify — runs hard signal checks, reports PASS/CONTINUE

[experts."app-builder"]
description = "Builds a web application with verified output"
instruction = """
Coordinate the build-review-verify cycle:
1. Delegate to build with the user's requirements
2. Delegate to review with the requirements and build output
3. If review returns CONTINUE: delegate to build with review feedback, restart from 2
4. If review returns PASS: delegate to verify with the build result
5. If verify returns CONTINUE: delegate to build with failure feedback, restart from 2
6. If verify returns PASS: done
"""
delegates = ["@app-builder/build", "@app-builder/review", "@app-builder/verify"]

[experts."app-builder".skills."@perstack/base"]
type = "mcpStdioSkill"
command = "npx"
packageName = "@perstack/base"
pick = ["readTextFile", "attemptCompletion"]

[experts."@app-builder/build"]
description = "Writes application code to the workspace. Provide requirements or failure feedback to address."
instruction = """
Write working application code. Focus on correctness over style.
If failure feedback is provided, fix the specific issues.
"""

[experts."@app-builder/build".skills."@perstack/base"]
type = "mcpStdioSkill"
command = "npx"
packageName = "@perstack/base"
pick = ["readTextFile", "writeTextFile", "editTextFile", "exec", "attemptCompletion"]

[experts."@app-builder/review"]
description = """
Reviews the build output against requirements for completeness and alignment.
Provide: requirements and path to built code. Returns PASS or CONTINUE with specific gaps.
"""
instruction = """
Read the requirements and the generated code.
Check whether each requirement is addressed. Flag omissions or misinterpretations.
Do NOT evaluate code quality or run any checks — that is the verifier's job.
"""

[experts."@app-builder/review".skills."@perstack/base"]
type = "mcpStdioSkill"
command = "npx"
packageName = "@perstack/base"
pick = ["readTextFile", "attemptCompletion"]

[experts."@app-builder/verify"]
description = """
Runs hard signal checks against the build output.
Provide: what was built and where. Returns PASS or CONTINUE with specific failures.
"""
instruction = """
You are a verifier. Run commands and compare outputs. Do NOT read code and form opinions.

Checks:
- TypeScript compiles: `npx tsc --noEmit` → exit code 0
- Tests pass: `npm test` → exit code 0
- App starts: `timeout 5 node dist/index.js` → no crash within 5 seconds

Run each check twice. If results differ between runs, report CONTINUE — the signal is non-deterministic.

Report per check: command, expected, actual, PASS/FAIL.
"""

[experts."@app-builder/verify".skills."@perstack/base"]
type = "mcpStdioSkill"
command = "npx"
packageName = "@perstack/base"
pick = ["readTextFile", "exec", "attemptCompletion"]
```

Key design decisions:
- **Reviewer is read-only** — `pick` has no `exec`. It reads files and judges semantic alignment. This is a [soft signal](../understanding-perstack/hard-signals.md) — qualitative judgment that only an LLM can provide — but it catches requirement gaps before the expensive verify cycle.
- **Verifier has `exec`** — it runs commands and compares outputs. No LLM judgment involved. This is the [hard signal](../understanding-perstack/hard-signals.md) that provides the final pass/fail.
- **Review before verify** — soft gate catches semantic drift early. Hard gate catches runtime failures. Neither replaces the other.
- **Verifier is a direct child of the coordinator** — not nested under `build`. This guarantees [context separation](../understanding-perstack/hard-signals.md#2-context-separation).
- **Reproducibility check** — each command runs twice. If results differ, the signal is non-deterministic and cannot be trusted.
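
The run-twice rule from the verifier's instruction can be sketched directly. Here `check.sh` is a stand-in for a real command such as `npx tsc --noEmit`:

```shell
# Run the same hard check twice and trust the signal only when both runs
# agree. check.sh stands in for a real verification command.
cat > check.sh <<'EOF'
#!/bin/sh
exit 0
EOF
first=$(sh check.sh; echo $?)    # capture exit code of run 1
second=$(sh check.sh; echo $?)   # capture exit code of run 2
if [ "$first" != "$second" ]; then
  verdict="CONTINUE: non-deterministic signal"
elif [ "$first" = "0" ]; then
  verdict=PASS
else
  verdict="CONTINUE: check failed (exit $first)"
fi
echo "$verdict"
```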

For a production example of this pattern, see [`create-expert`](https://github.com/perstack-ai/perstack/blob/main/definitions/create-expert/perstack.toml) — Perstack's built-in Expert for generating Expert definitions, which uses `review-definition` (soft gate for plan alignment) followed by `test-expert` → `verify-test` (hard signal checks with reproducibility verification).

## GitHub Issue Bot

**Pattern**: Use `requiredEnv` to pass environment variables to tools like `gh` CLI.