Why workflows exist, how they work, and how to design them.
Same prompt, different quality outputs. Without constraints, agents satisfice: "looks done" without being done. They rationalize shortcuts ("it's just a small change", "I'll test later", "the user will catch errors in review").
The intuition: we're searching for the best deliverable, and workflows are the constraints that keep the search from settling into a local optimum.
Agents don't skip steps out of laziness or rebellion. They skip steps because their training drives — helpfulness, competence, efficiency, approval-seeking — push them toward shortcuts that appear to serve the user faster. The agent wants to do well. Its drives just produce bad strategies without structure.
| Drive | How It Causes Drift | Example |
|---|---|---|
| Helpfulness | "Faster = more helpful, so skip the ceremony" | Jumping to a fix without investigation |
| Competence | "I already know the answer, proving it is redundant" | Skipping triage because the bug "looks familiar" |
| Efficiency | "The protocol is overhead I can eliminate" | Bypassing iteration to save time |
| Approval | "The user seems frustrated, I should deliver fast" | Skipping tests because the user wants results NOW |
The key insight: enforcement works best when the consequence of violation is framed as a failure of the drive that motivated the shortcut. "Don't skip steps" fights the agent's drives. "Skipping steps makes you anti-helpful" aligns the agent's drives with the protocol.
This is why Drive-Aligned Framing (enforcement pattern #9) is disproportionately effective — it targets the helpfulness drive (the strongest), framing shortcuts as anti-helpful to the user rather than merely incorrect.
The RL framing gives precise vocabulary for workflow design decisions. The workflow IS the policy. But it's a lens, not a universal truth — it maps tightly to dev workflows and loosely to writing and DS.
| RL Concept | Workflow Equivalent | Where It Fits |
|---|---|---|
| Policy | The workflow itself | All domains |
| State | Current phase + artifacts (SPEC.md, PLAN.md) | All domains — state is in files, not context |
| Action masking | Iron Laws — actions removed from the space | Dev (strong), DS/Writing (weaker) |
| Reward hacking | "Looks done" without being done | All domains — the core failure mode |
| Reward shaping | Gate functions (intermediate checkpoints) | Dev (deterministic), DS/Writing (judgment-based) |
| High ε / Low ε | Brainstorm (explore) vs. Implement (exploit) | All domains |
| Replay buffer | LEARNINGS.md + continuous-learning | All domains |
| Policy transfer | Workflow creator skill | Cross-domain |
- Dev gates are deterministic ("tests pass" is binary). Writing and DS gates are judgment-based ("does the argument hold?" requires a reader). This doesn't mean writing/DS can't have gates — it means their gates are agent-assessed or human-assessed, not machine-verified.
- Action masking is clean in dev (you literally can't edit code before design). In writing, the equivalent ("you can't draft before outlining") is softer — the agent can always produce something without an outline.
- Reward is sparse everywhere (only know quality when the human reviews), but the signal quality differs: a failing test is unambiguous, "this paragraph is weak" is not.
The RL framing is most useful for: identifying reward hacking, designing action masks, and recognizing which phases need exploration vs. exploitation. It's least useful for: defining quality in subjective domains.
Break work into phases with single responsibilities. Each phase answers ONE question. Phases are sequential: you can't design before exploring, can't implement before designing.
The shape varies by domain:
- Dev: 7 linear phases (brainstorm → explore → clarify → design → implement → review → verify)
- DS: 5 linear phases (brainstorm → plan → implement → review → verify)
- Writing: Branching (quick vs. project, domain routing, progressive expansion)
Gates prevent drift. The strongest gates are machine-verifiable:
- SPEC.md exists → can enter explore
- All tests pass → can enter verify
But not all domains have machine-verifiable gates. Writing and DS rely on judgment gates — the agent or human evaluates quality:
- Outline covers all thesis points → can enter draft
- Results pass sanity checks → can enter review
Judgment gates are weaker than deterministic gates but stronger than no gates. The design principle: use the strongest gate available for the domain. Deterministic when possible, judgment-based when not, honor-system never.
A gate that exists only as instructional text ("you must run X before Y") is advisory — the consuming phase has no way to verify the gate ran. Advisory gates are the honor system by another name.
The principle: every mandatory inter-phase gate must produce a concrete artifact that the consuming phase checks before starting. If the main chat skips the gate skill entirely and jumps straight to the next phase, the next phase must REFUSE to start.
The pattern:
- Gate skill writes a marker file (e.g., `.planning/PLAN_REVIEWED.md`) with `status: APPROVED` frontmatter
- Consuming phase checks for the file at startup — missing file = STOP
- The marker contains the reviewer's actual output (not just a flag), so it can't be fabricated without running the reviewer
Why instructions fail: context pressure causes the main chat to shortcut past "mandatory" steps. The agent that skips the gate is the same agent reading the instruction not to skip. Why artifacts work: the check is in the PREREQUISITES section, read before any work starts — binary pass/fail, no rationalization possible.
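The artifact check itself is only a few lines. A minimal sketch, assuming the marker file uses YAML-style frontmatter as in the example above (the exact schema is illustrative):

```python
import re
from pathlib import Path

MARKER = Path(".planning/PLAN_REVIEWED.md")  # gate artifact (path illustrative)

def gate_passed(marker: Path = MARKER) -> bool:
    """True only if the marker exists and its frontmatter says status: APPROVED."""
    if not marker.exists():
        return False
    # Parse the leading `--- ... ---` frontmatter block.
    match = re.match(r"---\n(.*?)\n---", marker.read_text(), re.DOTALL)
    if match is None:
        return False
    return bool(re.search(r"^status:\s*APPROVED\s*$", match.group(1), re.MULTILINE))

# Consuming phase, in its PREREQUISITES step:
#   if not gate_passed(): STOP — return to the review gate before doing any work.
```

Because the check reads the frontmatter rather than just testing file existence, a fabricated empty marker still fails the gate.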
Hook-enforced gates (strongest): Even artifact checks in instructional text can be compressed away during context compaction or rationalized past ("the file probably exists"). The strongest enforcement is a skill-scoped PreToolUse hook that blocks code-modifying tools (Write, Edit, Agent) until the gate artifact exists. Claude Code fires the hook on every tool call — no escape, no rationalization, no context dependency. Use the generic phase-gate-guard.py hook with environment variables to configure per-phase gates:
```yaml
# In consuming phase's SKILL.md frontmatter:
hooks:
  PreToolUse:
    - matcher: "Write|Edit|Agent"
      hooks:
        - type: command
          command: >-
            GATE_ARTIFACT=.planning/PLAN_REVIEWED.md
            GATE_STATUS=APPROVED
            GATE_DESCRIPTION="Plan review"
            GATE_REMEDY="Return to dev-design and run dev-plan-reviewer"
            python3 ${CLAUDE_PLUGIN_ROOT}/hooks/phase-gate-guard.py
```

The enforcement gradient for gates: hook-enforced > artifact check in instructions > advisory text. Design for hook-enforced; fall back to artifact checks only when hooks can't express the constraint.
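For illustration, a sketch of what a generic phase-gate-guard script could look like — the real script ships with the plugin; this version assumes Claude Code's hook convention that exit code 2 blocks the tool call and feeds stderr back to the agent:

```python
import os
import sys
from pathlib import Path

def gate_check() -> int:
    """Return 0 to allow the tool call, 2 to block it (Claude Code convention)."""
    artifact = Path(os.environ.get("GATE_ARTIFACT", ""))
    status = os.environ.get("GATE_STATUS", "APPROVED")
    description = os.environ.get("GATE_DESCRIPTION", "gate")
    remedy = os.environ.get("GATE_REMEDY", "run the gate skill")

    if artifact.is_file() and f"status: {status}" in artifact.read_text():
        return 0  # gate satisfied — allow Write/Edit/Agent

    # stderr is fed back to the agent as the reason for the block
    print(
        f"BLOCKED: {description} gate not passed "
        f"({artifact} missing or not {status}). {remedy}.",
        file=sys.stderr,
    )
    return 2

# Real hook entry point: sys.exit(gate_check())
```

Because the configuration arrives via environment variables, one script serves every phase gate — only the frontmatter changes per skill.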
Enforcement rules split into two categories with fundamentally different natures:
Conventions are ex-ante behavioral guidance loaded before work begins. They shape how the agent approaches work — style, tone, methodology, judgment calls. A convention requires reading and interpreting. "Use active voice in documentation" is a convention. "Match the codebase's existing patterns" is a convention. You can't write a script that returns pass/fail for these — they require judgment.
Constraints are ex-post deterministic checks run after work completes. They are mechanically testable: a script reads the output and returns pass or fail. "No file exceeds 500 lines" is a constraint. "Every public function has a test" is a constraint. "No agent resume — always spawn fresh" is a constraint.
The litmus test: can you write a script that returns pass/fail? If yes → constraint. If it requires reading and judging → convention.
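As a worked example of the litmus test, the 500-line rule qualifies as a constraint because it fits in a short pass/fail script (root directory and threshold are illustrative):

```python
from pathlib import Path

MAX_LINES = 500  # the constraint's threshold

def check(root: str = "src") -> list[str]:
    """Return violations; an empty list means the constraint passes."""
    violations = []
    for path in Path(root).rglob("*.py"):
        n = sum(1 for _ in path.open())
        if n > MAX_LINES:
            violations.append(f"{path}: {n} lines (max {MAX_LINES})")
    return violations

# Runner contract: print violations and exit 1 on any, exit 0 otherwise.
```

No equivalent script exists for "use active voice" — which is exactly why that rule stays a convention.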
| | Conventions | Constraints |
|---|---|---|
| When | Ex-ante (loaded before work) | Ex-post (checked after work) |
| Nature | Subjective, judgment-based | Deterministic, mechanically testable |
| Analogy | Style guide | Unit test |
| Enforcement | Prompt-loaded, reviewer-scored | Script-executed, pass/fail |
| Failure mode | Soft block (score below threshold) | Hard block (script fails) |
Both are necessary. Constraints catch what's mechanically checkable. Conventions guide what isn't. A workflow with only constraints has no taste. A workflow with only conventions has no teeth.
Co-located architecture: Constraints live as paired files — foo.md (the rule) + foo.py (the check script) — in the same constraints/ directory. Conventions are foo.md files without a paired script. An auto-discovering runner (check-all.py) globs all *.py files and executes them — no manual wiring, no registration. Adding a check script = automatically tested.
Two-leg verification: The verification stage runs both legs:
- Constraint checks — `check-all.py` runs all check scripts. Hard block on any failure.
- Convention scoring — a reviewer subagent scores work against loaded conventions. Soft block below threshold.
Neither leg alone is sufficient. Constraint checks without convention scoring miss qualitative issues. Convention scoring without constraint checks relies entirely on prompt compliance.
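A sketch of how the auto-discovering runner might work, under the conventions described above (details illustrative):

```python
import subprocess
import sys
from pathlib import Path

def run_all(directory: str) -> int:
    """Run every check script in the directory; return the number of failures."""
    failures = 0
    for script in sorted(Path(directory).glob("*.py")):
        if script.name == "check-all.py":
            continue  # don't recurse into the runner itself
        result = subprocess.run(
            [sys.executable, str(script)], capture_output=True, text=True
        )
        print(f"[{'PASS' if result.returncode == 0 else 'FAIL'}] {script.name}")
        if result.returncode != 0:
            print(result.stdout, end="")  # surface the violation details
            failures += 1
    return failures

# Runner contract: exit 1 if any check failed — a hard block on verification.
```

The glob is the registration mechanism: dropping a new `.py` file into the directory is all the wiring a new check needs.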
The implementer should never verify its own work. Self-review is proofreading — the agent that did the work shares all the same context, biases, and sunk-cost attachment. True verification requires structural independence: the verifier has no memory of the implementation journey.
| Self-review (weak) | Independent verification (strong) |
|---|---|
| Same agent reviews its own work | Fresh subagent sees only spec + output |
| Reviewer shares implementer's biases | Reviewer has no implementation context |
| "Did I do this right?" | "Does this meet the spec? Find problems." |
| Incentivized to approve (sunk cost) | Incentivized to find issues (that's its whole job) |
Implementation: The "team" iteration topology (section 6) is independent verification applied as a pattern — fresh subagents with specialized reviewer roles, no shared context with the implementer.
The spectrum of verification strength:
| Method | Independence | When to Use |
|---|---|---|
| Self-review | None — same agent, same context | Never sufficient alone |
| Fresh subagent reviewer | Structural — no shared context | Default for all verification |
| Multiple specialized reviewers | Structural + diverse perspectives | High-stakes or subjective output |
| Human review | Full independence + domain judgment | Final gate for subjective quality |
| Machine verification | Full independence + deterministic | Tests, linters, type checkers |
The design principle: use the most independent verifier available. Machine verification when possible (tests pass), independent subagent review when judgment is needed, human review for final quality on subjective work. Self-review is never the answer.
Workflows produce intermediate artifacts — specs, plans, outlines, hypotheses. Downstream phases consume these artifacts and build on them. If the artifact is flawed, everything downstream inherits the flaw.
The principle: no downstream phase should consume an unreviewed artifact.
A spec with a missing edge case survives into exploration (exploring the wrong areas), design (designing around incomplete requirements), and implementation (building the wrong thing). A plan with tasks that are too coarse survives into implementation (subagents struggle with 500-line tasks). Catching these at the artifact stage costs minutes; catching them during implementation costs hours.
| Artifact | Produced By | Consumed By | Review Gate |
|---|---|---|---|
| SPEC.md | Brainstorm | Explore, Clarify, Design | Independent reviewer checks completeness, consistency, clarity |
| PLAN.md | Design | Implement | Independent reviewer checks task decomposition, spec alignment |
| OUTLINE.md | Brainstorm | Draft | Independent reviewer checks coverage, structure |
| HYPOTHESES.md | Investigate | Test | Self-review acceptable (serial iteration, not final) |
Review mechanism: Dispatch a fresh subagent with the artifact and a review checklist. The reviewer has no implementation context — it sees only the artifact and the checklist. This is independent verification applied to documents, not just code.
Chunking large artifacts: When a plan exceeds ~15 tasks, break it into ordered chunks (logically self-contained groups). Review each chunk separately. This prevents reviewer fatigue and ensures each section gets focused attention.
Model tier guidance for delegation: When dispatching implementation subagents, match model capability to task complexity:
- Mechanical tasks (isolated functions, boilerplate, 1-2 files): Use the cheapest capable model
- Integration tasks (multi-file coordination, pattern matching): Use a standard model
- Architecture/review tasks (design judgment, broad codebase understanding): Use the most capable model
This is advisory — Claude Code doesn't yet support model routing — but documenting the intent prevents over-allocating expensive models to trivial tasks.
When both the entry point and midpoint evaluate the same quality dimensions, the enforcement rules must live in a single shared location — not inlined independently in each skill.
The drift problem: Entry and midpoint skills evolve at different times. A new check added to the midpoint doesn't automatically appear in the entry point's verification step. Over time, the two diverge — the midpoint catches issues the entry point doesn't, which means the entry point ships broken work that requires midpoint re-entry to fix.
The solution: Co-located constraint/convention files in a shared constraints/ directory. Both entry phases and midpoint skills load from the same directory. Constraints have paired check scripts that run automatically via the auto-discovering runner. Conventions are loaded as prompt context for reviewer subagents.
```
constraints/
├── no-agent-resume.md       ← constraint rule
├── no-agent-resume.py       ← check script (auto-discovered)
├── match-codebase-style.md  ← convention (no .py = judgment-based)
└── check-all.py             ← runner: globs *.py, runs all checks
         ↑ runs                     ↑ loads
         │                          │
  entry verification          midpoint audit
  (hard block on fail)        (re-runs same checks)
```
The design principle: If the same quality dimension is checked in both the entry point and midpoint, the enforcement rule MUST be shared. Inlining the same check in two places guarantees drift. The co-located architecture makes sharing automatic — both entry and midpoint point at the same constraints/ directory.
Shared constraints (the section above) address only one layer of cross-skill enforcement. In practice, skill families need consistency across three layers:
| Layer | What It Covers | Drift Mechanism |
|---|---|---|
| Constraints (prompt) | Iron Laws, Rationalization Tables, Red Flags in shared file | Skills edited independently, new rules added to one but not shared |
| Hooks (structural) | PreToolUse/PostToolUse in YAML frontmatter | Hook added to fix a failure in one skill, not propagated to siblings |
| Script wiring (gate) | Check scripts referenced by hooks, batch orchestrator, and check definitions | New script created but not wired into all three invocation points |
The constraint-only trap: A workflow can have perfect constraint sharing (all skills Read() common-constraints.md) while hooks and script wiring are inconsistent. The constraints tell the agent what to do; the hooks force it; the scripts verify it. If only constraints are shared, enforcement is prompt-only — the weakest layer.
The incident pattern: A failure mode is discovered in skill A. The fix adds (1) a constraint to common-constraints.md, (2) a hook to skill A's frontmatter, (3) a check script. Steps 2 and 3 happen only in skill A. Skills B, C, D get the constraint (via shared file) but not the hook or the script wiring. The constraint fires via prompt; the hook never fires; the script never runs automatically. The user discovers the gap when skill B ships work that skill A would have caught.
The design principle: When adding enforcement to any skill in a family, propagate across all three layers — or document why a layer doesn't apply to a specific skill.
Superpowers enforcement patterns are the regularization that counteracts drift:
- Iron Laws — absolute constraints, not guidelines
- Rationalization Tables — preempt the agent's excuses before they form
- Red Flags + STOP — pattern interrupts for common failure modes
- Gate Functions — multi-step verification (can't claim done without evidence)
- Drive-Aligned Framing — frame violation as failure of the motivating drive, targeting helpfulness first (the strongest drive)
- Delete & Restart — nuclear option for protocol violations
Enforcement density should be proportional to drift risk: dev-implement (high risk of skipping tests) needs more enforcement than ds-brainstorm (lower risk).
Prompt-based enforcement has a ceiling. At some point, adding more Iron Laws and Rationalization Tables produces diminishing returns — the skill gets so long that the agent can't internalize all of it, or contradictory rules emerge.
When you hit the ceiling, the next step is structural enforcement: hooks that block actions, separate processes that verify independently, fresh subagents that can't access the parent's rationalizations. The dev-debug rewrite (v4.0) is an example: 700 lines of prompt enforcement failed; replacing the honor-system exit with a test-based gate succeeded.
The gradient: prompt enforcement → structural enforcement → human judgment. Use the lightest mechanism that works. Escalate when it doesn't.
Skills and agents can scope PreToolUse and PostToolUse hooks to their own lifetime. A hook that fires on every Read call during a skill's execution is strictly more reliable than an Iron Law telling the agent not to Read images — the hook runs whether the agent "remembers" the rule or not.
The principle: if a constraint is mechanically checkable, enforce it with a hook, not a prompt.
| Constraint Type | Enforcement | Example |
|---|---|---|
| File extension guard | Hook (PreToolUse Read) | Block Read on .png/.pdf, suggest look-at |
| Path guard | Hook (PreToolUse Edit/Write) | Block edits to cache directories |
| Tool parameter validation | Hook (PreToolUse Bash) | Require description parameter |
| Sequence enforcement | Hook (PreToolUse + state) | Require test file before source edit |
| Post-subagent restrictions | Hook (PostToolUse Agent → PreToolUse) | Block Read/Grep on source after subagent returns |
| Quality judgment | Prompt (Iron Law) | "Is this outline complete?" |
| Domain knowledge | Prompt (Rationalization Table) | Why TDD matters |
| Creative guidance | Prompt (Red Flags) | "Don't over-engineer" |
Why hooks win:
- Zero prompt tokens — the constraint doesn't consume context window
- No drift — hooks fire every time, regardless of context length or compression
- No rationalization — the agent can't talk itself out of a hook
- Composable — hooks from different skills stack without conflicting prompt text
When to keep prompt enforcement:
- The constraint requires judgment (subjective quality, design decisions)
- The constraint is educational (rationalization tables teach why, not just what)
- The constraint depends on conversation context (topic detection, intent classification)
- The hook would have too many false positives without semantic understanding
Design rule: Write the hook first. If the hook can't express the constraint, write the prompt enforcement. Never write 50 lines of prompt for something a 15-line hook handles better.
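As an example of hook-first design, a sketch of the file-extension guard from the table above — assuming the PreToolUse payload arrives as JSON on stdin with the tool's input under a `tool_input.file_path` field:

```python
import json
import sys

BLOCKED_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".pdf")

def decide(payload: dict) -> int:
    """Return 0 to allow the Read, 2 to block it."""
    file_path = payload.get("tool_input", {}).get("file_path", "")
    if file_path.lower().endswith(BLOCKED_EXTENSIONS):
        print(
            f"BLOCKED: don't Read {file_path} directly — use the look-at skill.",
            file=sys.stderr,
        )
        return 2
    return 0

# Real hook entry point: sys.exit(decide(json.load(sys.stdin)))
```

Fifteen lines, zero prompt tokens, and it fires on every Read regardless of how compressed the context has become.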
Long-running agent sessions suffer from context pollution: each failed attempt, abandoned approach, and partial reasoning stays in conversation history, degrading reasoning quality. The original Ralph Wiggum technique (`while :; do cat PROMPT.md | claude-code; done`) solved this by starting each iteration as a fresh process.
Fresh subagents achieve the same effect within a single session. Each subagent gets clean context. The filesystem is the durable memory.
Core principle: Progress lives in files, not in conversation.
| Topology | When to Use | Example |
|---|---|---|
| Serial | Approaches must build on each other | Debugging: each hypothesis builds on what was ruled out |
| Parallel | Multiple perspectives can be gathered independently | Data science: robustness checks run simultaneously |
| Team | Output needs multi-faceted review from specialized roles | Writing: copy editor + critic + fact checker in parallel |
- Fresh subagent per iteration — no context pollution across attempts
- Filesystem as memory — state files (HYPOTHESES.md, REVIEW.md) persist across iterations
- Progress-gated escalation — loop runs autonomously while making progress, involves the human only when stuck
- No honor-system exits — the agent doesn't decide when it's done. Convergence, test results, or the human decides.
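The properties above can be sketched as a loop skeleton — the `attempt` callable stands in for a fresh subagent dispatch, and all names here are illustrative:

```python
from pathlib import Path
from typing import Callable

def run_loop(
    attempt: Callable[[str], bool],  # fresh "subagent": reads/updates state file, True = progress
    is_done: Callable[[], bool],     # convergence check: tests or artifacts, not agent opinion
    state_file: str = "HYPOTHESES.md",
    max_stalls: int = 2,
    max_iterations: int = 20,
) -> str:
    """Serial iteration: fresh attempt per round, filesystem as memory, escalate when stuck."""
    Path(state_file).touch()  # durable memory lives here, not in conversation
    stalls = 0
    for _ in range(max_iterations):
        if is_done():
            return "done"        # machine-verified exit, not honor system
        if attempt(state_file):
            stalls = 0           # progress resets the stall counter
        else:
            stalls += 1
        if stalls > max_stalls:
            break
    return "escalate"            # surface specific questions to the human
```

Note that the loop's exits are convergence, stall, or iteration cap — the attempt itself never gets to declare the work finished.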
The human is not just the reward signal at the end of an episode. The human is a collaborator with three distinct roles:
Trigger. The human decides when to invoke a workflow and which one. A variable rename doesn't need /dev. A new feature does. This judgment stays with the human — the agent doesn't self-activate workflows.
Domain oracle. The agent hits questions it can't answer from the codebase alone: "what's the business intent?", "is this the right tradeoff?", "what does the user actually want?" Progress-gated escalation is the mechanism — the agent works autonomously on what it can, and surfaces specific questions when it's stuck. Not "please verify manually" but "I've ruled out X, Y, Z — is the problem in area W?"
Quality judge. For domains where gates aren't deterministic (writing, DS), the human is the final gate. The workflow's job is to present the deliverable in a state where the human's judgment is efficient — not "here's a rough draft, good luck" but "here's a polished draft with tracked changes and a review summary."
The failure mode is on both extremes: too little human involvement (agent goes in circles for 19MB) or too much (human babysits every iteration). The sweet spot is autonomy on mechanical work, human judgment on direction and quality.
Not all phases are equally constrained. Exploration needs vary by domain:
- Dev brainstorm: Question-first. Don't look at code until you understand requirements. The codebase biases thinking.
- DS brainstorm: Needs data exploration + hypothesis generation. You can't ask the right questions without seeing the data.
- Writing brainstorm: Needs source interrogation (Readwise, NLM) and debate. The argument emerges from the material.
The constraint is not "no exploration" but "no implementation without understanding."
A workflow succeeds when the human receives a deliverable that requires minimal rework. This reframes the goal:
- Not "did the agent follow all steps?" (process compliance)
- But "how much did the human need to change?" (outcome quality)
Good workflows produce deliverables where the human says "this is basically done" not "I'll take it from here."
Workflows improve through a feedback loop, not through upfront design:
1. OBSERVE — A session produces a low-quality deliverable (high human rework)
2. DIAGNOSE — Where did the agent drift? Which phase? Which drive caused it?
3. UPDATE — Add enforcement at the drift point (rationalization table, gate, red flag)
4. TEST — Run the workflow again. Did the same failure mode recur?
5. TRANSFER — Does this failure mode exist in other workflows? Apply the fix there too.
Dev is the most mature workflow because it has had the most gradient updates — each session reveals new rationalization patterns, new failure modes, new gates. Writing and DS are less mature because they've had fewer iterations, not because they're less important.
Most enforcement starts as a convention — someone notices a failure mode and writes a rule: "don't inline check definitions." The rule works when the agent reads it. It fails when the context window is full, the rule gets compressed, or the agent rationalizes around it.
Graduation is the moment someone figures out how to test the convention mechanically. The convention "don't inline check definitions" graduates to a constraint when you write a script that greps for inlined definitions and returns pass/fail. Adding the .py file next to the existing .md file is the graduation ceremony — no file moves, no directory changes, just a new paired script.
Before graduation:
```
constraints/no-inline-checks.md   ← convention (judgment-based)
```
After graduation:
```
constraints/no-inline-checks.md   ← constraint rule (unchanged)
constraints/no-inline-checks.py   ← check script (new, auto-discovered)
```
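The graduation script for this example might look like the following sketch — the pattern it greps for is purely illustrative:

```python
import re
from pathlib import Path

# Illustrative pattern — whatever an "inlined check definition" looks like in the repo.
INLINE_PATTERN = re.compile(r"^\s*check:", re.MULTILINE)

def find_violations(root: str = "skills") -> list[str]:
    """Return every SKILL.md that inlines a check definition instead of sharing it."""
    return [
        str(path)
        for path in Path(root).rglob("SKILL.md")
        if INLINE_PATTERN.search(path.read_text())
    ]

# Runner contract: print violations and exit 1 on any — the convention is now a constraint.
```

Once this file lands next to the `.md`, the auto-discovering runner picks it up with no further wiring.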
Not every convention can graduate. "Use active voice" requires judgment. "Match the codebase's existing patterns" requires reading and interpreting. These stay as conventions — and that's fine. The goal isn't to eliminate conventions but to graduate every one that can be tested.
The improvement signal: When a convention keeps failing (the agent keeps violating it despite prompt enforcement), that's the signal to invest in graduation. If you can't write a pass/fail script, escalate to a hook. If you can't write a hook, the convention needs stronger prompt enforcement (rationalization tables, red flags). The gradient is: convention → graduated constraint → hook → structural enforcement.
The workflow-creator skill accelerates this by transferring lessons from mature workflows to immature ones. The continuous-learning skill captures patterns from sessions. When you see a failure mode in one workflow, ask: "does this same failure mode exist in the others?"
The enforcement checklist (references/enforcement-checklist.md) is the accumulation of these gradient updates — 12 patterns discovered through repeated failure, not designed in advance.