From cb1cb45582a0875a71218fd54d8fc3da986dd621 Mon Sep 17 00:00:00 2001 From: HiranoMasaaki Date: Sat, 14 Mar 2026 07:46:33 +0900 Subject: [PATCH 1/3] docs: introduce Hard Signal Framework across documentation Add hard-signals.md as the canonical definition of the framework: - Soft vs hard signal distinction with three conditions (ground truth, context separation, determinism) - Built-in verifier pattern with soft gate + hard verification - Converting soft checks to hard checks - Reproducibility check for confirming determinism - Combining soft and hard signals: soft gates for semantic judgment, hard verifiers as the final authority Weave the framework throughout existing documentation: - concept.mdx: design philosophy section - experts.md: context isolation as context separation - runtime.md: deterministic state as verification infrastructure - sandbox-integration.md: post-hoc verification as hard signal - best-practices.md: rewrite "Keep It Verifiable" with reviewer + verifier pattern - testing.md: signal quality section with strategy classification - examples.md: full hard signal verification example with soft gate - taming-prompt-sprawl.md: independent verifiability as split benefit - going-to-production.md: verification loop checklist - rapid-prototyping.md: soft-to-hard signal progression - README.md: hard signal framing in project aims and create-expert description Co-Authored-By: Claude Opus 4.6 (1M context) --- README.md | 5 +- docs/guides/going-to-production.md | 7 + docs/guides/rapid-prototyping.md | 3 + docs/guides/taming-prompt-sprawl.md | 5 + docs/index.mdx | 1 + docs/making-experts/best-practices.md | 57 +++-- docs/making-experts/examples.md | 96 ++++++++ docs/making-experts/testing.md | 22 ++ docs/understanding-perstack/boundary-model.md | 2 +- docs/understanding-perstack/concept.mdx | 16 ++ docs/understanding-perstack/experts.md | 9 +- docs/understanding-perstack/hard-signals.md | 215 ++++++++++++++++++ docs/understanding-perstack/runtime.md | 5 +- 
.../sandbox-integration.md | 6 +- 14 files changed, 427 insertions(+), 22 deletions(-) create mode 100644 docs/understanding-perstack/hard-signals.md diff --git a/README.md b/README.md index 6fd474f0..2b74c91b 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@ Perstack is a harness for agentic apps. It aims to: - **Do big things with small models**: If a smaller model can do the job, there's no reason to use a bigger one. - **Quality is a system property, not a model property**: Building agentic software people actually use doesn't require an AI science degree—just a solid understanding of the problems you're solving. -- **Keep your app simple and reliable**: The harness is inevitably complex—Perstack absorbs that complexity so your agentic app doesn't have to. +- **Hard signals make agents reliable**: Output quality depends on verification, not model capability. When each micro-agent's output can be checked by a compiler, test suite, or screenshot diff—independent of LLM judgment—the system self-corrects and converges instead of oscillating. ## Getting Started @@ -44,7 +44,7 @@ docker run --pull always --rm -it \ "Form a team named ai-gaming to build a Bun-based CLI indie game playable on Bash for AI." ``` -`create-expert` is a built-in expert. It generates a `perstack.toml` that defines a team of micro-agents, runs them, evaluates the results, and iterates until the setup works. Each agent has a single responsibility and its own context window. Complex tasks are broken down and delegated to specialists. +`create-expert` is a built-in expert that embodies the hard signal approach. It generates a `perstack.toml`, runs the generated experts against a test query, then verifies the results using deterministic checks — compiler output, test pass/fail, structural validation — not LLM judgment. A dedicated verifier expert runs each check twice to confirm reproducibility. The cycle repeats until all signals pass. 
Each agent has a single responsibility and its own context window. ```toml [experts."ai-gaming"] @@ -199,6 +199,7 @@ Perstack is a harness for micro-agents — purpose-specific agents with a single - **Cost-Effective**: Purpose-specific experts are designed to run on affordable models. A focused agent with the right domain knowledge on a cheap model outperforms a generalist on an expensive one. - **Fast**: Smaller models generate faster. Fine-grained tasks broken into delegates run concurrently via parallel delegation. - **Maintainable**: A monolithic system prompt is like refactoring without tests — every change risks breaking something. Single-responsibility experts are independently testable. Test each one, then compose them. +- **Verifiable**: When each agent has a single responsibility, its output is a discrete artifact that can be verified independently — by a compiler, a test suite, or a screenshot diff. This is what makes quality a system property: the system provides verification that does not depend on LLM judgment. ## Prerequisites diff --git a/docs/guides/going-to-production.md b/docs/guides/going-to-production.md index 9350c100..211d8e10 100644 --- a/docs/guides/going-to-production.md +++ b/docs/guides/going-to-production.md @@ -131,6 +131,8 @@ docker run --rm ... my-expert "assistant" "query" | your-log-collector You get full execution traces without any instrumentation code. +Production observability is not just logging — it is the infrastructure for [hard signal](../understanding-perstack/hard-signals.md) verification. The JSON event stream is deterministic: the same execution always produces the same events. An external verifier can process these events, compare outputs against baselines, and produce pass/fail signals without any LLM involvement. This is what separates production-grade agent monitoring from hope-based logging. + > [!NOTE] > Events are also written to `workspace/perstack/` as checkpoints. 
You can replay any execution for debugging or auditing. @@ -173,6 +175,11 @@ The Expert operates freely within its container. Your infrastructure controls wh - [ ] Execution time limits (via container timeout) - [ ] Workspace size limits +**Verification loop**: +- [ ] Expert outputs verified by [hard signals](../understanding-perstack/hard-signals.md) (tests, compilation, diffs) +- [ ] Verification independent of the LLM that generated the output +- [ ] Verification procedure is deterministic and reproducible + ## Scaling patterns **Job queue**: Push queries to a queue, workers pull and execute in containers. diff --git a/docs/guides/rapid-prototyping.md b/docs/guides/rapid-prototyping.md index ca6da3c6..d605a980 100644 --- a/docs/guides/rapid-prototyping.md +++ b/docs/guides/rapid-prototyping.md @@ -98,6 +98,8 @@ npx perstack start create-expert "The reviewer missed the SQL injection in the r This is the feedback loop that matters: **write a scenario the agent should handle, test it, fix the instruction when it fails, repeat.** By the time you build the app around it, you already know what the agent can and can't do. +This feedback loop is powerful but relies on your judgment — a [soft signal](../understanding-perstack/hard-signals.md). As the prototype matures, convert manual observations into hard signals: write an automated test that runs the Expert's output. If the Expert generates code, compile it and run its test suite. If it generates a document, validate it against a schema. The earlier you introduce hard signals, the earlier you stop oscillating and start converging. + ## Evaluate with others At some point you need feedback beyond your own testing. `perstack start` makes this easy — hand someone the `perstack.toml` and they can run the Expert themselves: @@ -126,6 +128,7 @@ At some point, your prototype will need more. 
The same `perstack.toml` scales - **The agent needs tools** — search the web, query a database, call an API → [Extending with Tools](./extending-with-tools.md) - **The prompt is getting long** — split into multiple Experts that collaborate → [Taming Prompt Sprawl](./taming-prompt-sprawl.md) +- **Verification is manual** — build hard signal loops that run automatically → [Testing Experts](../making-experts/testing.md), [Hard Signals](../understanding-perstack/hard-signals.md) - **The prototype works** — embed it into your application → [Adding AI to Your App](./adding-ai-to-your-app.md) ## What's next diff --git a/docs/guides/taming-prompt-sprawl.md b/docs/guides/taming-prompt-sprawl.md index e66c1702..1ecf3283 100644 --- a/docs/guides/taming-prompt-sprawl.md +++ b/docs/guides/taming-prompt-sprawl.md @@ -165,6 +165,10 @@ The lesson: when a task requires multiple areas of expertise, splitting into spe With 50-100 lines of instruction instead of 500+, each Expert operates within its attention budget. The model can actually follow all your instructions — because there aren't that many to follow. +### Each Expert's output becomes independently verifiable + +When you split a monolith into delegates, you gain something beyond attention management: each delegate's output becomes a discrete artifact that can be verified independently. A monolithic agent produces one giant output that can only be evaluated holistically — "does the overall result look right?" is a [soft signal](../understanding-perstack/hard-signals.md). A team of focused delegates produces individual outputs, each of which can be checked against ground truth by the coordinator or by external tooling — "did each component pass its specific test?" is a [hard signal](../understanding-perstack/hard-signals.md). This transforms verification from judgment-dependent to judgment-independent. 
+ ## How delegation works When you define `delegates`, Perstack presents them as callable tools: @@ -188,6 +192,7 @@ The coordinator reads the customer's message, decides which specialist to call, - You can describe what each Expert does in one sentence - Experts don't need to know about each other's internal logic - You can test each Expert independently +- Each Expert's output can be verified by a process independent of the LLM (compiler, test suite, screenshot diff) ## What's next diff --git a/docs/index.mdx b/docs/index.mdx index b31b1b24..69e42b7b 100644 --- a/docs/index.mdx +++ b/docs/index.mdx @@ -15,6 +15,7 @@ Perstack is a harness for micro-agents. - [**Getting Started**](/docs/getting-started/walkthrough/) — create your first expert and walk through the core workflow - [**Concepts**](/docs/understanding-perstack/concept/) — understand the architecture behind experts, runtime, isolation, and the boundary model +- [**Hard Signals**](/docs/understanding-perstack/hard-signals/) — the design philosophy behind output quality — why hard verification signals make agents reliable ### Build diff --git a/docs/making-experts/best-practices.md b/docs/making-experts/best-practices.md index dbe35499..eba5f056 100644 --- a/docs/making-experts/best-practices.md +++ b/docs/making-experts/best-practices.md @@ -7,7 +7,7 @@ sidebar: These principles help you avoid common pitfalls in agent development: monoliths, complexity explosions, debugging nightmares, and fragile systems. Building a large agent head-on almost always fails. > [!NOTE] -> The key insight: we tend to over-control, but LLMs work best when you trust their reasoning and define goals rather than procedures. +> The key insight: we tend to over-control, but LLMs work best when you trust their reasoning and define goals rather than procedures. These principles are grounded in the [Hard Signal Framework](../understanding-perstack/hard-signals.md) — the design philosophy behind Perstack's architecture. 
## The Five Principles @@ -105,31 +105,58 @@ Modular Experts unlock collaboration — between Experts, and between people. Th ## Keep It Verifiable -**Pitfall**: Instructions that only the author can understand. +**Pitfall**: Experts whose output can only be checked by another LLM. -If others can't verify what an Expert does, it's neither safe nor reusable. +If the only way to verify an Expert's output is to have another model (or the same model) review it, the verification loop is [soft](../understanding-perstack/hard-signals.md). The agent will oscillate — "looks good" one iteration, "has issues" the next — without converging. -**Bad** — A third party can't verify what this Expert actually does: +**Bad** — Verification relies solely on LLM judgment: ```toml -instruction = """ -Handle expense reports appropriately. -Use your judgment for edge cases. -""" +[experts."code-generator"] +instruction = "Generate TypeScript code for the given task." +delegates = ["code-reviewer"] + +[experts."code-reviewer"] +instruction = "Review the generated code for correctness and style." ``` -**Good** — Anyone reading this knows exactly what to expect: +The reviewer uses the same kind of judgment as the generator. It can miss the same bugs, approve the same anti-patterns, and disagree with itself across runs. As the sole gate, this is a soft signal loop — the system oscillates. + +**Good** — Hard signals as the final authority, with an optional soft gate for semantic checks: ```toml +[experts."builder"] +delegates = ["code-writer", "reviewer", "verifier"] + +[experts."reviewer"] +description = "Checks whether the code reflects the requirements. Returns PASS or CONTINUE." instruction = """ -You are an expense report reviewer. +Read the requirements and the generated code. +Check whether each requirement is addressed. Flag omissions. +Do NOT evaluate code quality — that is the verifier's job. 
+""" + +[experts."reviewer".skills."@perstack/base"] +type = "mcpStdioSkill" +command = "npx" +packageName = "@perstack/base" +pick = ["readTextFile", "attemptCompletion"] -Approval rules: -- Under $100: Auto-approve with receipt -- $100-$500: Approve if business purpose is clear -- Over $500: Flag for manager review +[experts."verifier"] +description = "Executes hard signal checks against the code. Returns PASS or CONTINUE with specific failures." +instruction = """ +Run the verification commands. Compare actual output against expected. +Report pass/fail per check. Do NOT read the code and form opinions. """ + +[experts."verifier".skills."@perstack/base"] +type = "mcpStdioSkill" +command = "npx" +packageName = "@perstack/base" +pick = ["readTextFile", "exec", "attemptCompletion"] ``` -If someone else can't read your Expert and predict its behavior, it's not verifiable. +The reviewer (soft gate) catches semantic misalignment early — "does the code address the requirements?" is a qualitative judgment that only an LLM can make. The verifier (hard gate) provides the final pass/fail — compiler errors, test failures, and structural checks that are deterministic and independent of LLM judgment. The reviewer has no `exec`; the verifier has `exec`. Neither replaces the other. See [combining soft and hard signals](../understanding-perstack/hard-signals.md#combining-soft-and-hard-signals) for the full pattern. + +"Verifiable" means the Expert's output is ultimately checked by a process that does not depend on LLM judgment. Soft signals can supplement hard signals — catching semantic drift that no compiler can detect — but the **final gate must be hard**. When designing an Expert, ask: **what hard signal can verify this Expert's output?** If the only answer is "another LLM reads it," look for something harder — a compiler, a test suite, a schema validator, a screenshot diff. 
--- diff --git a/docs/making-experts/examples.md b/docs/making-experts/examples.md index cf8acfb5..1abab49a 100644 --- a/docs/making-experts/examples.md +++ b/docs/making-experts/examples.md @@ -6,12 +6,108 @@ sidebar: Patterns for defining Experts. Each example highlights a specific skill type or integration approach. +- [Hard Signal Verification](#hard-signal-verification) - [GitHub Issue Bot](#github-issue-bot) - [Web Search](#web-search) - [Custom MCP Server](#custom-mcp-server) - [Interactive Wizard](#interactive-wizard) - [Application Integration](#application-integration) +## Hard Signal Verification + +**Pattern**: Combine a soft review gate with hard signal verification in the delegation tree. The reviewer catches semantic issues early; the verifier provides the final deterministic pass/fail. See [combining soft and hard signals](../understanding-perstack/hard-signals.md#combining-soft-and-hard-signals) for the full rationale. + +```toml +# Delegation Tree +# +# app-builder — coordinator: build → review → verify cycle +# ├── @app-builder/build — writes code to workspace +# ├── @app-builder/review — checks requirements alignment (soft gate) +# └── @app-builder/verify — runs hard signal checks, reports PASS/CONTINUE + +[experts."app-builder"] +description = "Builds a web application with verified output" +instruction = """ +Coordinate the build-review-verify cycle: +1. Delegate to build with the user's requirements +2. Delegate to review with the requirements and build output +3. If review returns CONTINUE: delegate to build with review feedback, restart from 2 +4. If review returns PASS: delegate to verify with the build result +5. If verify returns CONTINUE: delegate to build with failure feedback, restart from 2 +6. 
If verify returns PASS: done
+"""
+delegates = ["@app-builder/build", "@app-builder/review", "@app-builder/verify"]
+
+[experts."app-builder".skills."@perstack/base"]
+type = "mcpStdioSkill"
+command = "npx"
+packageName = "@perstack/base"
+pick = ["readTextFile", "attemptCompletion"]
+
+[experts."@app-builder/build"]
+description = "Writes application code to the workspace. Provide requirements or failure feedback to address."
+instruction = """
+Write working application code. Focus on correctness over style.
+If failure feedback is provided, fix the specific issues.
+"""
+
+[experts."@app-builder/build".skills."@perstack/base"]
+type = "mcpStdioSkill"
+command = "npx"
+packageName = "@perstack/base"
+pick = ["readTextFile", "writeTextFile", "editTextFile", "exec", "attemptCompletion"]
+
+[experts."@app-builder/review"]
+description = """
+Reviews the build output against requirements for completeness and alignment.
+Provide: requirements and path to built code. Returns PASS or CONTINUE with specific gaps.
+"""
+instruction = """
+Read the requirements and the generated code.
+Check whether each requirement is addressed. Flag omissions or misinterpretations.
+Do NOT evaluate code quality or run any checks — that is the verifier's job.
+"""
+
+[experts."@app-builder/review".skills."@perstack/base"]
+type = "mcpStdioSkill"
+command = "npx"
+packageName = "@perstack/base"
+pick = ["readTextFile", "attemptCompletion"]
+
+[experts."@app-builder/verify"]
+description = """
+Runs hard signal checks against the build output.
+Provide: what was built and where. Returns PASS or CONTINUE with specific failures.
+"""
+instruction = """
+You are a verifier. Run commands and compare outputs. Do NOT read code and form opinions.
+
+Checks:
+- TypeScript compiles: `npx tsc --noEmit` → exit code 0
+- Tests pass: `npm test` → exit code 0
+- App starts: `timeout 5 node dist/index.js` → exit code 124 (the app was still running when `timeout` killed it); any other nonzero code means it crashed within 5 seconds
+
+Run each check twice.
If results differ between runs, report CONTINUE — the signal is non-deterministic. + +Report per check: command, expected, actual, PASS/FAIL. +""" + +[experts."@app-builder/verify".skills."@perstack/base"] +type = "mcpStdioSkill" +command = "npx" +packageName = "@perstack/base" +pick = ["readTextFile", "exec", "attemptCompletion"] +``` + +Key design decisions: +- **Reviewer is read-only** — `pick` has no `exec`. It reads files and judges semantic alignment. This is a [soft signal](../understanding-perstack/hard-signals.md) — qualitative judgment that only an LLM can provide — but it catches requirement gaps before the expensive verify cycle. +- **Verifier has `exec`** — it runs commands and compares outputs. No LLM judgment involved. This is the [hard signal](../understanding-perstack/hard-signals.md) that provides the final pass/fail. +- **Review before verify** — soft gate catches semantic drift early. Hard gate catches runtime failures. Neither replaces the other. +- **Verifier is a direct child of the coordinator** — not nested under `build`. This guarantees [context separation](../understanding-perstack/hard-signals.md#2-context-separation). +- **Reproducibility check** — each command runs twice. If results differ, the signal is non-deterministic and cannot be trusted. + +For a production example of this pattern, see [`create-expert`](https://github.com/perstack-ai/perstack/blob/main/definitions/create-expert/perstack.toml) — Perstack's built-in Expert for generating Expert definitions, which uses `review-definition` (soft gate for plan alignment) followed by `test-expert` → `verify-test` (hard signal checks with reproducibility verification). + ## GitHub Issue Bot **Pattern**: Use `requiredEnv` to pass environment variables to tools like `gh` CLI. 
diff --git a/docs/making-experts/testing.md b/docs/making-experts/testing.md index e92e5f6b..c2a7631d 100644 --- a/docs/making-experts/testing.md +++ b/docs/making-experts/testing.md @@ -6,6 +6,25 @@ sidebar: Experts are probabilistic — the same query can produce different results. This guide covers strategies for testing effectively. +## Signal quality + +Not all tests produce equal confidence. The [Hard Signal Framework](../understanding-perstack/hard-signals.md) distinguishes two kinds of verification: + +- **Soft signal test**: Run the Expert, have an LLM evaluate the output, check if it "looks right." The result depends on evaluator judgment and may vary across runs. +- **Hard signal test**: Run the Expert, execute the output artifact — compile it, run its tests, take a screenshot and diff it. The result is deterministic and independent of LLM judgment. + +| Strategy | Signal type | Why | +| --- | --- | --- | +| Manual observation (`perstack start`) | Soft | Human judgment evaluates output | +| LLM-as-judge evaluation | Soft | Evaluator judgment is as fallible as the generator's | +| Soft review gate (requirements alignment) | Soft (valuable) | Catches semantic drift before expensive hard verification | +| Checkpoint replay | Hard (runtime) | Deterministic state replay, same input → same output | +| Mock testing (deterministic tool assertions) | Hard (runtime) | Verifies tool execution sequence deterministically | +| Built-in verifier delegate | Hard (full) | Separate Expert runs checks with `exec`, no shared context with generator | +| E2E with artifact execution (compile, test, run) | Hard (full) | Verifies the actual artifact with a deterministic procedure | + +Aim for hard signals wherever possible. When you must use soft signals — for example, evaluating natural language quality, checking requirements alignment, or reviewing semantic correctness — place them **before** the hard signal checks, not after. 
A soft review gate catches semantic drift early; the hard verifier provides the final pass/fail. See [combining soft and hard signals](../understanding-perstack/hard-signals.md#combining-soft-and-hard-signals) for the full pattern. + ## Local testing Run your Expert locally before publishing: @@ -87,7 +106,10 @@ Before publishing: - [ ] Skills work as expected - [ ] Error messages are helpful - [ ] Description accurately reflects behavior +- [ ] At least one hard signal test exists (compiler, e2e, screenshot diff) +- [ ] Verification is independent of the LLM that generated the output ## What's next +- [Hard Signals](../understanding-perstack/hard-signals.md) — the framework behind signal quality - [Best Practices](./best-practices.md) — design guidelines diff --git a/docs/understanding-perstack/boundary-model.md b/docs/understanding-perstack/boundary-model.md index 268a4dea..423ec8e1 100644 --- a/docs/understanding-perstack/boundary-model.md +++ b/docs/understanding-perstack/boundary-model.md @@ -1,7 +1,7 @@ --- title: "The Boundary Model" sidebar: - order: 4 + order: 5 --- [Isolation](./concept.md#isolation) is one of Perstack's three core goals. Experts are isolated from everything except their role — model access, context, tools, and dependencies are all mediated by the runtime. diff --git a/docs/understanding-perstack/concept.mdx b/docs/understanding-perstack/concept.mdx index 9d6f02ac..f6ac1bd8 100644 --- a/docs/understanding-perstack/concept.mdx +++ b/docs/understanding-perstack/concept.mdx @@ -11,6 +11,22 @@ Perstack is built on a concept called **Expert Stack** — the harness that enab > [!NOTE] > The name "Perstack" combines the Latin word "perītus" (meaning "expert") with "stack". Perstack = Expert Stack. +## Design philosophy + +Perstack is built on a single thesis: **output quality is determined by the strength of verification signals in the loop**, not by the capability of the model. 
+ +When an agent generates output, something checks whether that output is correct. That check produces a **signal**. Signals come in two kinds: + +- **Soft signals** depend on LLM judgment — requirements alignment checks, semantic review, ad hoc manual testing. These signals are valuable for qualitative evaluation that only an LLM can provide, but cannot serve as the final gate — the verifier's judgment can vary across runs. +- **Hard signals** are independent of LLM judgment — compiler errors, E2E test results, screenshot diffs. These signals are binary, deterministic, and unambiguous. Agents looping on hard signals detect breakage reliably and self-correct. + +Every architectural decision in Perstack — context isolation, deterministic runtime, sandbox boundaries, full observability — exists to maximize hard signal opportunities. Context isolation ensures the verifier never sees the generator's reasoning (context separation). Deterministic state ensures the same verification always produces the same result (determinism). The workspace provides a controlled space where actual artifacts can be built and tested against ground truth. + +This is what "quality is a system property" means in practice. The system provides the verification infrastructure. The model provides the reasoning. Quality emerges from the interaction. + +> [!TIP] +> For the full framework — the three conditions that make a signal hard, how Perstack's architecture maps to each, and practical guidelines for designing verifiable Experts — see [Hard Signals](./hard-signals.md). For a real-world example, see [`create-expert`](https://github.com/perstack-ai/perstack/blob/main/definitions/create-expert/perstack.toml) — Perstack's own Expert for generating Expert definitions, which embeds a hard signal verifier in its delegation tree. + ## Expert Stack An agent harness needs a broad set of capabilities — almost like an operating system. 
Perstack organizes them into a simple stack with clear separation of concerns: diff --git a/docs/understanding-perstack/experts.md b/docs/understanding-perstack/experts.md index e35347ff..1475780b 100644 --- a/docs/understanding-perstack/experts.md +++ b/docs/understanding-perstack/experts.md @@ -1,7 +1,7 @@ --- title: "Experts" sidebar: - order: 2 + order: 3 --- Experts are the core building block of Perstack — modular micro-agents designed for reuse. @@ -135,6 +135,8 @@ Job > [!WARNING] > **Context is never shared between Experts.** The delegate receives only the query — no message history, no parent context. This is a security boundary, not a limitation. See [Why context isolation matters](./concept.md#isolation). +Context isolation is also the architectural foundation for [hard signal](./hard-signals.md) verification. When Expert A delegates to Expert B, Expert B operates in a completely separate context. When Expert A receives the result, it evaluates only the artifact returned — not Expert B's reasoning process, not its intermediate states. This is **context separation**: the verifier (Expert A) is structurally separated from the generator (Expert B). The runtime enforces this — you cannot accidentally leak the generator's context to the verifier. + ### Parallel delegation When the LLM calls multiple delegate tools in a single response, the runtime executes them in parallel: @@ -178,6 +180,8 @@ Note the asymmetry: Expert A sees Expert B's `description` (public interface), b | **State exchange** | Workspace files | Persistent, inspectable, works across restarts | | **Interactive tools** | Coordinator only | See [below](#why-no-interactive-tools-for-delegates) | +These decisions are not just architectural preferences — they are structural prerequisites for [hard signal](./hard-signals.md) verification. When message history is not shared, context separation is guaranteed. When state exchange uses workspace files, artifacts become inspectable ground truth. 
+ This is intentional. See [Why context isolation matters](./concept.md#isolation) for the security rationale. ### Why no interactive tools for delegates? @@ -219,6 +223,8 @@ Job (continues running) This design provides resilience — a single delegate failure doesn't crash the entire Job. The Coordinator has full control over error handling. +This also satisfies [context separation](./hard-signals.md#2-context-separation). When a delegate fails, the Coordinator receives only the error message — it cannot see the delegate's chain of reasoning that led to the failure. The Coordinator must decide what to do based solely on the observable outcome, not on the delegate's intent. The failure is a ground-truth artifact, evaluated in a separated context. + ### Workspace The workspace is a shared filesystem where Experts read and write files. Unlike message history, the workspace persists across Expert boundaries — this is how Experts exchange complex data. @@ -229,6 +235,7 @@ How you organize workspace files is up to you. The runtime reserves `perstack/` Ready to build Experts? See the Making Experts guide: +- [Hard Signals](./hard-signals.md) — how context isolation enables hard signal verification - [Making Experts](../making-experts/best-practices.md) — defining Experts in `perstack.toml` - [Best Practices](../making-experts/best-practices.md) — design guidelines for effective Experts - [Skills](../making-experts/skills.md) — adding MCP tools to your Experts diff --git a/docs/understanding-perstack/hard-signals.md b/docs/understanding-perstack/hard-signals.md new file mode 100644 index 00000000..11309f67 --- /dev/null +++ b/docs/understanding-perstack/hard-signals.md @@ -0,0 +1,215 @@ +--- +title: "Hard Signals" +sidebar: + order: 2 +--- + +Agentic AI output quality is determined by the strength of verification signals in the loop. + +When an agent generates output, something needs to check whether that output is correct. 
The nature of that check — the **signal** — determines whether the agent can self-correct and converge on a good result, or oscillate indefinitely. + +## Soft signals vs hard signals + +There are two kinds of verification signals: + +**Soft signals** depend on LLM judgment. Code review by another LLM. Self-evaluation ("does this look right?"). Ad hoc manual testing where a human eyeballs the output. These signals are valuable when the verification requires semantic understanding that no deterministic check can provide — but they share a fundamental limitation: the verifier's judgment can vary across runs. + +**Hard signals** are independent of LLM judgment. Compiler errors. E2E test pass/fail. Screenshot diffs against a baseline. These signals are binary, deterministic, and unambiguous. The agent either produced working code or it didn't. + +The practical difference: + +| | Soft signal | Hard signal | +| --- | --- | --- | +| Example | LLM reviews generated code | Compiler rejects generated code | +| Depends on | Evaluator's judgment | Artifact's actual behavior | +| Same input, different runs | May produce different verdicts | Always produces the same verdict | +| Agent behavior | Oscillates — "looks fine" one iteration, "has issues" the next | Converges — broken means broken, fixed means fixed | + +When you build a loop with soft signals, the agent appears to work during demos but fails unpredictably in production. When you build a loop with hard signals, the agent detects breakage reliably and self-corrects. + +## Three conditions for a hard signal + +A signal is hard when all three conditions hold. They are a logical AND — if any one is missing, the signal degrades to soft. + +### 1. Ground truth + +Verify the actual artifact, not a proxy. 
+ +| Ground truth | Proxy | +| --- | --- | +| Screenshot of the rendered page | DOM tree structure | +| Full user flow from login to checkout | Return value of a single function | +| Application starts and serves requests | Compilation passes | +| E2E test exercises the real system | Unit test with mocked dependencies | + +Proxies are cheaper to check, but they can pass while the actual artifact is broken. A function can return the right value while the UI that calls it is unusable. Code can compile while the application crashes on startup. + +Ground truth means going to the source: does the thing actually work? + +### 2. Context separation + +The generator and the verifier must be structurally separated. The verifier sees only the artifact — not the generator's intent, reasoning, or intermediate state. + +Why this matters: when the same context window generates output and then evaluates it, the evaluation validates against **intent** ("I meant to do X, and this does X") rather than against the **artifact** ("does this actually work?"). The generator's reasoning biases the verification. + +Context separation breaks this loop. The verifier has no access to why the artifact was created — it can only judge what the artifact is. + +In practice: +- A compiler checking generated code has no knowledge of the LLM's reasoning +- An E2E test running against a deployed app doesn't know what the agent intended +- A screenshot diff compares pixels, not intentions +- A delegated Expert receiving only a query (not the parent's message history) evaluates independently + +### 3. Determinism + +The verification procedure must be fixed. Same artifact in, same verdict out. Every time. + +If the verification itself varies — different results on different runs, or results that depend on evaluator mood — then it produces noise, not signal. An agent cannot correct course based on noise. 
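+
+A quick way to probe this in practice: run the same verification command twice on an unchanged artifact and compare the outputs. The sketch below is illustrative, not part of Perstack; the file name and the stand-in check are assumptions.
+
+```shell
+# Probe whether a verification command is deterministic:
+# run it twice on the same artifact and compare verdicts.
+# The artifact and the stand-in check are illustrative.
+set -u
+
+printf 'export const answer = 42\n' > artifact.ts
+
+run_check() {
+  # Stand-in "verification command": byte count of the artifact.
+  wc -c < artifact.ts
+}
+
+first=$(run_check)
+second=$(run_check)
+
+if [ "$first" = "$second" ]; then
+  echo "deterministic: same artifact, same verdict"
+else
+  echo "non-deterministic: the check produces noise, not signal"
+fi
+```
+
+If the two runs disagree, the check itself is the problem: no agent can converge on a signal that changes under its feet.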
+ +Deterministic verification: +- Compiler: same code always produces the same errors (or none) +- Test suite: same implementation always passes or fails the same tests +- Screenshot diff: same render always produces the same pixel comparison + +Non-deterministic verification: +- LLM evaluation: same code may get "looks good" or "has issues" depending on the run +- Manual review: depends on the reviewer's attention and context +- Flaky tests: sometimes pass, sometimes fail on the same code + +When verification is deterministic, the agent gets a stable signal it can act on. When verification is non-deterministic, the agent cannot distinguish real problems from noise. + +## How Perstack enables hard signals + +Perstack's architecture is designed to maximize hard signal opportunities. Each architectural decision maps to one of the three conditions: + +### Context isolation → Context separation + +When Expert A delegates to Expert B, the delegate runs in a completely separate context — empty message history, its own instruction, no access to the parent's reasoning. When Expert A receives the result, it evaluates only the returned artifact. + +This is context separation by construction. The runtime enforces it — you cannot accidentally leak the generator's context to the verifier. + +### Deterministic runtime → Determinism + +The runtime draws a clear boundary between probabilistic (LLM reasoning) and deterministic (state management). Events are recorded deterministically. Checkpoints capture complete state. Replaying from a checkpoint produces identical results. + +This means any verification process built on the runtime's state — event stream analysis, checkpoint comparison, artifact diffing — inherits determinism automatically. + +### Sandbox and workspace → Ground truth + +Experts write artifacts to the workspace — files, code, configurations. The sandbox isolates the execution environment. 
Together, they create a controlled space where artifacts can be built and verified against ground truth. + +An Expert that generates code can write it to the workspace. Another Expert (or an external process) can compile it, run its tests, or start the application — verifying the actual artifact, not a proxy. + +### Observability → Deterministic audit trail + +The full event stream is a deterministic record of everything that happened. Same execution always produces the same events. An external verifier can process these events, compare outputs against baselines, and produce a verdict — without any LLM involvement. + +## Practical impact + +The difference between soft and hard signal loops compounds over time: + +**Soft signal loop**: Agent generates output → LLM evaluates ("looks good") → agent generates more → LLM evaluates ("actually, this part is wrong") → agent fixes → LLM evaluates ("looks good now, but this other part...") → oscillation. The agent appears productive but never converges. + +**Hard signal loop**: Agent generates code → compiler rejects it → agent reads the error → agent fixes the specific issue → compiler accepts → tests fail → agent reads the failure → agent fixes → tests pass → done. Each iteration makes measurable progress because the signal is unambiguous. + +Hard signals don't make agents smarter. They make the feedback loop trustworthy. An agent with mediocre reasoning but hard signals in the loop will outperform a brilliant agent with only soft signals — because the first one knows when it's wrong. + +## Combining soft and hard signals + +Soft signals are not useless — they are essential when the verification requires semantic judgment that no deterministic check can provide. + +Some questions only an LLM can answer: +- "Does this instruction faithfully reflect the domain constraints from the requirements?" +- "Is this generated content appropriate for the target audience?" 
+- "Does this API design follow the conventions of the existing codebase?" + +These are inherently qualitative evaluations. Trying to force them into binary checks would lose the nuance that makes them valuable. The key is **where you place them in the loop** and **what you combine them with**. + +### The soft gate + hard verification pattern + +The most effective architecture uses soft signals as an early gate and hard signals as the final authority: + +``` +write → review (soft) → test → verify (hard) + ↑ | + └── fix ←────────────────┘ +``` + +The soft gate catches semantic misalignment early — before the expensive test-verify cycle runs. The hard verifier provides the final pass/fail decision. Neither replaces the other: + +- **Without the soft gate**: hard signals catch runtime failures but miss semantic drift. The artifact compiles and passes tests, but doesn't reflect the requirements. You iterate through expensive test cycles to discover what a quick LLM review would have caught. +- **Without the hard verifier**: soft reviews confirm alignment but miss actual breakage. The Expert "looks correct" but the generated artifact crashes on startup. The LLM reviewer can't catch what only execution reveals. + +Perstack's own [`create-expert`](https://github.com/perstack-ai/perstack/blob/main/definitions/create-expert/perstack.toml) uses this exact pattern: `review-definition` (soft gate) checks whether the generated perstack.toml faithfully reflects plan.md's domain constraints — a semantic judgment that requires LLM reasoning. Only after review passes does the loop proceed to `test-expert` → `verify-test` (hard verification). The soft reviewer has no `exec` — it reads files and judges alignment. The hard verifier has `exec` — it runs commands and compares outputs. + +### Guidelines for soft signal placement + +When using soft signals: + +1. **Place them before hard signals** — catch semantic issues early, before investing in expensive execution and verification. +2. 
**Give the soft reviewer only read access** — no `exec`, no file writes. This keeps its role pure: it judges, it doesn't act. +3. **Never use soft signals as the final gate** — the last check before completion must be hard. A soft "looks good" is not a shipping signal. +4. **Context-separate the reviewer from the generator** — just as with hard verification, the soft reviewer should be a separate Expert that sees only the artifacts, not the generator's reasoning. + +## Designing for hard signals + +When building Experts, ask: **what hard signal can verify this Expert's output?** + +- If the Expert generates code → compiler errors, test suite, application startup +- If the Expert generates configuration → validation schema, dry-run deployment +- If the Expert generates UI → screenshot diff, accessibility audit +- If the Expert generates data → schema validation, constraint checks +- If the Expert generates natural language → this is genuinely hard to verify with hard signals; acknowledge the limitation and supplement with hard signals on adjacent properties (e.g., format validation, length constraints, required keyword presence) + +If the only answer is "another LLM reads it," the verification loop is soft. The system will oscillate rather than converge. Look for a way to make the signal harder — even a partial hard signal (format validation, schema check) is better than none. + +### Converting soft checks to hard checks + +Many verification criteria start as subjective questions: "is this code clean?", "is this instruction concise?", "is this output high quality?". These are soft signals — the LLM always judges its own output favorably. + +The conversion strategy: replace subjective evaluation with **binary checks** that have unambiguous yes/no answers. + +| Soft check (LLM opinion) | Hard check (binary) | +| --- | --- | +| "Is the instruction concise?" | `wc -l instruction` ≤ 15 lines | +| "Does the code follow best practices?" 
| `npx tsc --noEmit` exits 0, `npm test` exits 0 |
+| "Is the output well-structured?" | `grep -c '```'` = 0 (no fenced code blocks in instructions) |
+| "Are all dependencies declared?" | `grep 'delegates'` succeeds for every Expert that delegates to others |
+
+Each binary check has a clear pass/fail result, a clear remediation action, and produces the same verdict every time. Subjective checks ("would removing this make the output worse?") always pass because the LLM cannot judge its own output objectively.
+
+### The built-in verifier pattern
+
+An Expert doesn't have to rely on external processes to get hard signals. You can embed the verifier inside the delegation tree — a dedicated verifier Expert that executes hard signal checks, structurally separated from the generator.
+
+```
+coordinator
+├── generator — produces the artifact
+├── reviewer — checks semantic alignment (soft gate, read-only)
+├── executor — runs the artifact (pure execution, no evaluation)
+└── verifier — executes hard signal checks against the result
+```
+
+The key design constraints:
+
+1. **Verifier as a direct child of the coordinator** — not nested under the generator. This guarantees context separation: the verifier shares no context with the generator.
+2. **Verifier needs `exec` capability** — without it, verification degrades to file reading, which is a soft signal. Hard signals require running commands that produce deterministic output.
+3. **Executor and verifier are separate** — the executor runs the artifact and reports what happened (facts only). The verifier runs checks and reports pass/fail. Combining them leaks execution context into verification.
+4. **Reviewer has no `exec`** — the soft gate reads files and judges alignment. Keeping it read-only prevents it from accidentally becoming a verifier.
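+
+As a concrete sketch, a verifier Expert's hard checks can be chained into a single script whose exit status is the verdict. Everything here is illustrative: the file name, the line-count threshold, and the required keyword are assumptions, not Perstack conventions.
+
+```shell
+# Illustrative hard-check script: every check is binary, and the
+# script's exit status is the verdict. File name, threshold, and
+# keyword are examples only.
+set -u
+status=0
+
+printf 'Summarize the ticket in plain English.\nKeep it under 100 words.\n' > instruction.txt
+
+# "Is the instruction concise?" as a hard check: a line-count bound.
+if [ "$(wc -l < instruction.txt)" -gt 15 ]; then
+  echo "FAIL: instruction exceeds 15 lines"
+  status=1
+fi
+
+# "Does it state the output language?" as a hard check: keyword presence.
+if ! grep -q 'English' instruction.txt; then
+  echo "FAIL: output language not specified"
+  status=1
+fi
+
+if [ "$status" -eq 0 ]; then echo "PASS"; fi
+exit "$status"
+```
+
+Because each check is a fixed command over the artifact, running the script twice on the same file yields the same verdict, which is exactly the determinism condition.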
+ +Perstack's own [`create-expert`](https://github.com/perstack-ai/perstack/blob/main/definitions/create-expert/perstack.toml) uses this pattern: `review-definition` (soft gate) checks plan alignment with read-only access, then `test-expert` executes the generated expert (pure executor, no evaluation), then `verify-test` runs hard signal checks, re-runs them a second time to confirm reproducibility, and performs structural checks — all deterministic, all independent of LLM judgment. + +### Confirming determinism: the reproducibility check + +A signal is only as hard as its consistency. If a check passes once but fails on re-execution with the same artifact, the signal is non-deterministic — it produces noise, not information. + +The practical fix: **re-run every verification command a second time and compare results**. If the output is identical, the signal is deterministic (hard). If it differs, the signal or the artifact needs fixing before you can trust it. + +This is a cheap check that catches a common failure mode: flaky tests, environment-dependent behavior, or time-sensitive assertions that break reproducibility. + +## What's next + +- [Experts](./experts.md) — how context isolation enables context separation +- [Runtime](./runtime.md) — how deterministic state enables deterministic verification +- [Testing Experts](../making-experts/testing.md) — applying signal quality to your test strategy +- [Best Practices](../making-experts/best-practices.md) — the "Keep It Verifiable" principle diff --git a/docs/understanding-perstack/runtime.md b/docs/understanding-perstack/runtime.md index 07eded01..b2c216c1 100644 --- a/docs/understanding-perstack/runtime.md +++ b/docs/understanding-perstack/runtime.md @@ -1,7 +1,7 @@ --- title: "Runtime" sidebar: - order: 3 + order: 4 --- The Perstack runtime combines probabilistic LLM reasoning with deterministic state management — making agent execution predictable, reproducible, and auditable. 
@@ -104,6 +104,7 @@ LLMs are probabilistic — same input can produce different outputs. Perstack dr The "thinking" is probabilistic; the "doing" and "recording" are deterministic. This boundary enables: - **Reproducibility**: Replay from any checkpoint with identical state - **Testability**: Mock the LLM, test the runtime deterministically +- **Hard signal verification**: Any verification process that replays from a checkpoint produces the same result every time. This satisfies the [determinism condition](./hard-signals.md#3-determinism) — verification follows a fixed procedure, same input always produces same output ### Event, Step, Checkpoint @@ -117,6 +118,8 @@ Runtime state is built on three concepts: This combines **Event Sourcing** (complete history) with **Checkpoint/Restore** (efficient resume). +Together, these create the infrastructure for deterministic verification. An external verifier can replay any execution from any checkpoint and confirm whether the outcome matches expectations. Because the replay is deterministic, the verification result is reproducible — it does not depend on when or how many times you run it. This is what makes Perstack's observability a [hard signal](./hard-signals.md) rather than a soft one. + ### The `perstack/` directory The runtime stores execution history in `perstack/jobs/` within the workspace: diff --git a/docs/understanding-perstack/sandbox-integration.md b/docs/understanding-perstack/sandbox-integration.md index 5ac5309c..d27aba58 100644 --- a/docs/understanding-perstack/sandbox-integration.md +++ b/docs/understanding-perstack/sandbox-integration.md @@ -1,7 +1,7 @@ --- title: "Sandbox Integration" sidebar: - order: 5 + order: 6 --- ## Why sandbox integration? 
@@ -23,7 +23,7 @@ Agent security can be understood as four boundary layers: | Layer | Threat | Perstack's approach | | -------------- | ----------------------------------- | -------------------------------------------------------------- | -| **Prompt** | Injection, context pollution | Observability — full event history for post-hoc verification | +| **Prompt** | Injection, context pollution | Observability — full event history for post-hoc [hard signal](./hard-signals.md) verification | | **Tool (MCP)** | Privilege abuse | `requiredEnv` for minimal privilege, explicit skill assignment | | **Runtime** | Host compromise | Designed for sandboxed infrastructure (Docker, ECS, Workers) | | **External** | Data exfiltration, lateral movement | Event-based messaging with controlled boundaries | @@ -36,6 +36,8 @@ Sandboxing inverts this model. Instead of trying to prevent all bad outcomes upf This is why **Observability** is one of Perstack's three core principles (alongside Isolation and Reusability). It's not a nice-to-have — it's a prerequisite for sandbox integration to work. Full event history, deterministic checkpoints, and transparent execution make post-hoc verification possible. +But post-hoc verification is only meaningful if it produces [hard signals](./hard-signals.md). Three conditions must hold: the verification examines the actual artifact, not a proxy ([ground truth](./hard-signals.md#1-ground-truth)); the verifier is separate from the agent that produced the artifact ([context separation](./hard-signals.md#2-context-separation)); and the verification procedure is deterministic — same artifact always produces the same verdict ([determinism](./hard-signals.md#3-determinism)). Perstack's full event history and deterministic checkpoints satisfy all three. Without them, sandbox integration produces no usable verification signal. + ### The messaging problem Sandboxing introduces a challenge: how does the agent communicate with the outside world? 
From 0e5bac308d76ae714136f4e268f5f9801d1d110d Mon Sep 17 00:00:00 2001 From: HiranoMasaaki Date: Sat, 14 Mar 2026 07:57:06 +0900 Subject: [PATCH 2/3] docs: add quality floor vs ceiling framework MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The harness provides the quality floor (hard signals guarantee the output works). The Expert author provides the quality ceiling (domain knowledge ensures it solves the right problem). Architecture choice does not determine quality — the combination of verification and knowledge does. Co-Authored-By: Claude Opus 4.6 (1M context) --- README.md | 2 +- docs/understanding-perstack/concept.mdx | 2 +- docs/understanding-perstack/hard-signals.md | 17 +++++++++++++++++ 3 files changed, 19 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 2b74c91b..44e79fe5 100644 --- a/README.md +++ b/README.md @@ -20,7 +20,7 @@ If you want to build practical agentic apps like Claude Code or OpenClaw, a harn Perstack is a harness for agentic apps. It aims to: - **Do big things with small models**: If a smaller model can do the job, there's no reason to use a bigger one. -- **Quality is a system property, not a model property**: Building agentic software people actually use doesn't require an AI science degree—just a solid understanding of the problems you're solving. +- **Quality is a system property, not a model property**: The harness provides the quality floor—hard signals that guarantee the output works. You provide the quality ceiling—domain knowledge that ensures it solves the right problem. Neither requires a frontier model or an AI science degree. - **Hard signals make agents reliable**: Output quality depends on verification, not model capability. When each micro-agent's output can be checked by a compiler, test suite, or screenshot diff—independent of LLM judgment—the system self-corrects and converges instead of oscillating. 
## Getting Started diff --git a/docs/understanding-perstack/concept.mdx b/docs/understanding-perstack/concept.mdx index f6ac1bd8..e19c0822 100644 --- a/docs/understanding-perstack/concept.mdx +++ b/docs/understanding-perstack/concept.mdx @@ -22,7 +22,7 @@ When an agent generates output, something checks whether that output is correct. Every architectural decision in Perstack — context isolation, deterministic runtime, sandbox boundaries, full observability — exists to maximize hard signal opportunities. Context isolation ensures the verifier never sees the generator's reasoning (context separation). Deterministic state ensures the same verification always produces the same result (determinism). The workspace provides a controlled space where actual artifacts can be built and tested against ground truth. -This is what "quality is a system property" means in practice. The system provides the verification infrastructure. The model provides the reasoning. Quality emerges from the interaction. +This is what "quality is a system property" means in practice. The harness provides the verification floor — hard signals that guarantee the output works. The Expert author provides the knowledge ceiling — domain constraints that ensure the output solves the right problem. Quality emerges from the combination, not from model capability or architecture choice. > [!TIP] > For the full framework — the three conditions that make a signal hard, how Perstack's architecture maps to each, and practical guidelines for designing verifiable Experts — see [Hard Signals](./hard-signals.md). For a real-world example, see [`create-expert`](https://github.com/perstack-ai/perstack/blob/main/definitions/create-expert/perstack.toml) — Perstack's own Expert for generating Expert definitions, which embeds a hard signal verifier in its delegation tree. 
diff --git a/docs/understanding-perstack/hard-signals.md b/docs/understanding-perstack/hard-signals.md index 11309f67..cce67502 100644 --- a/docs/understanding-perstack/hard-signals.md +++ b/docs/understanding-perstack/hard-signals.md @@ -114,6 +114,23 @@ The difference between soft and hard signal loops compounds over time: Hard signals don't make agents smarter. They make the feedback loop trustworthy. An agent with mediocre reasoning but hard signals in the loop will outperform a brilliant agent with only soft signals — because the first one knows when it's wrong. +## Quality floor and quality ceiling + +Hard signals establish a **quality floor** — the minimum bar that the output must clear. The code compiles. The tests pass. The app starts. The screenshot matches the baseline. If any of these fail, the agent keeps iterating. The floor is the harness's responsibility. + +The **quality ceiling** is determined by **domain knowledge** — the constraints, rules, and context baked into the Expert's instruction. Does the generated API follow the team's naming conventions? Does the game have balanced difficulty progression? Does the customer support agent know the company's refund policy? No hard signal can verify these. They are the Expert author's responsibility. + +This separation matters because it defines who owns what: + +| | Quality floor | Quality ceiling | +| --- | --- | --- | +| **What determines it** | Hard signals in the verification loop | Domain knowledge in the instruction | +| **Who owns it** | The harness (Perstack) | The Expert author (you) | +| **How it fails** | Artifact doesn't work | Artifact works but doesn't solve the right problem | +| **How it improves** | Better verification signals | Better domain constraints | + +Architecture — monolithic or micro-agent — does not determine quality. A monolithic agent with hard signals will produce better output than a micro-agent team without them. 
What makes quality a system property is the combination: the harness provides the verification floor, and the author provides the knowledge ceiling. Neither alone is sufficient. + ## Combining soft and hard signals Soft signals are not useless — they are essential when the verification requires semantic judgment that no deterministic check can provide. From ee29fdcec28acb7471c40ecd4bfafb95c35391f5 Mon Sep 17 00:00:00 2001 From: HiranoMasaaki Date: Sat, 14 Mar 2026 08:00:28 +0900 Subject: [PATCH 3/3] docs: consolidate README aims from three bullets to two MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Merge "hard signals make agents reliable" into "quality is a system property" — they were saying the same thing from different angles. Keep "do big things with small models" as a consequence, not a separate thesis. Co-Authored-By: Claude Opus 4.6 (1M context) --- README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 44e79fe5..9f76ec88 100644 --- a/README.md +++ b/README.md @@ -19,9 +19,8 @@ If you want to build practical agentic apps like Claude Code or OpenClaw, a harn Perstack is a harness for agentic apps. It aims to: -- **Do big things with small models**: If a smaller model can do the job, there's no reason to use a bigger one. -- **Quality is a system property, not a model property**: The harness provides the quality floor—hard signals that guarantee the output works. You provide the quality ceiling—domain knowledge that ensures it solves the right problem. Neither requires a frontier model or an AI science degree. -- **Hard signals make agents reliable**: Output quality depends on verification, not model capability. When each micro-agent's output can be checked by a compiler, test suite, or screenshot diff—independent of LLM judgment—the system self-corrects and converges instead of oscillating. 
+- **Quality is a system property, not a model property**: The harness provides hard signals — compiler output, test results, screenshot diffs — that let agents detect and fix their own mistakes. You provide the domain knowledge that defines what "correct" means. The combination sets both the floor and the ceiling. +- **Do big things with small models**: When the system handles verification, a focused agent on an affordable model outperforms a generalist on a frontier one. ## Getting Started