feat(runbooks): runbook execution engine + pup workflows PoC #146

Draft
platinummonkey wants to merge 4 commits into main from feat/runbooks-poc

platinummonkey (Collaborator) commented Mar 3, 2026

Summary

Proof-of-concept implementation of the runbook execution engine proposed in #143 (discussion). Adds two new command groups — pup runbooks and pup workflows — with no new dependencies.

What's included

pup runbooks

YAML runbooks live in ~/.config/pup/runbooks/. Each file defines sequential steps that mix pup commands, shell tools, Datadog Workflow triggers, HTTP calls, and interactive confirm gates.

pup runbooks list [--tag=key:value ...]   # discover by tag
pup runbooks describe <name>              # show steps + vars
pup runbooks run <name> [--set K=V ...]   # execute
pup runbooks validate <name>              # lint without running
pup runbooks import <path-or-url>         # fetch into runbooks dir

Example runbook (~/.config/pup/runbooks/hello.yaml):

name: hello
description: Test runbook
vars:
  NAME:
    default: world
steps:
  - name: Say hello
    kind: shell
    run: echo "Hello, {{ NAME }}!"
  - name: List monitors
    kind: pup
    run: monitors list --limit=3

Output while running (with --set NAME=pup):

runbook: hello  (2 steps)  2026-03-03 00:16:08 UTC
  Test runbook

[1/2] Say hello  (shell)  2026-03-03 00:16:08 UTC
  $ echo "Hello, pup!"
  stdout:
Hello, pup!
  ✓ done  10ms  ·  next: step 2/2 — List monitors (pup)

[2/2] List monitors  (pup)  2026-03-03 00:16:08 UTC
  $ monitors list --limit=3
  stdout:
{ ... }
  ✓ done  320ms  ·  last step

✓ done  hello  2/2 steps  330ms  2026-03-03 00:16:08 UTC

pup workflows

Raw REST access to the Datadog Workflows API (not covered by the typed SDK client):

pup workflows run --id=<id> [--input k=v ...] [--watch]
pup workflows instances list --workflow-id=<id>
pup workflows instances get  --workflow-id=<id> --instance-id=<id>

--watch polls every 15 s until the instance reaches a terminal state, printing elapsed time and status to stderr. Without --watch, agent-mode output includes a watch_command metadata hint.
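
The --watch behavior can be pictured as a poll-until-terminal loop. The following is a minimal sketch, not the code in this PR: the status strings, the `watch` and `is_terminal` names, and the injected `sleep` closure are all assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical terminal states; the real API's status strings may differ.
fn is_terminal(status: &str) -> bool {
    matches!(status, "SUCCEEDED" | "FAILED" | "CANCELLED")
}

/// Polls `fetch_status` until it reports a terminal state, logging elapsed
/// time and status to stderr. `sleep` is injected so the loop can be
/// exercised without real delays.
fn watch<F, S>(mut fetch_status: F, mut sleep: S, interval: Duration) -> Result<String, String>
where
    F: FnMut() -> Result<String, String>,
    S: FnMut(Duration),
{
    let started = Instant::now();
    loop {
        let status = fetch_status()?;
        eprintln!("[{:>4}s] status: {}", started.elapsed().as_secs(), status);
        if is_terminal(&status) {
            return Ok(status);
        }
        sleep(interval);
    }
}

fn main() {
    // Canned responses stand in for GET requests against the instances endpoint.
    let mut responses = vec!["RUNNING", "RUNNING", "SUCCEEDED"].into_iter();
    let result = watch(
        || {
            responses
                .next()
                .map(String::from)
                .ok_or_else(|| "no more responses".to_string())
        },
        |_| {}, // a real caller would pass std::thread::sleep here
        Duration::from_secs(15),
    );
    println!("{:?}", result);
}
```

Keeping progress on stderr leaves stdout free for the final JSON result, which matches the split described above.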

Step kinds

Kind              What it does
pup               Shells out to the current pup binary with --output json; supports poll: loops
shell             Runs sh -c "..." with template rendering; surfaces stderr even on success
datadog-workflow  POST trigger + auto-poll to terminal state; emits watch_command hint in agent mode
confirm           Prompts [y/N]; bypassed by --yes / agent mode
http              Authenticated GET/POST via existing client::raw_get / raw_post
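
To make the shell kind concrete, here is a rough sketch of running a script through sh -c and surfacing stderr even when the exit code is zero. It uses the blocking std::process::Command as a stand-in for the tokio::process::Command the engine relies on. `StepOutput` and `run_shell` are hypothetical names, not types from this PR.

```rust
use std::process::Command;

/// Captured result of one shell step (hypothetical type for illustration).
struct StepOutput {
    success: bool,
    stdout: String,
    stderr: String,
}

/// Runs `script` via `sh -c` and captures both output streams.
fn run_shell(script: &str) -> std::io::Result<StepOutput> {
    let out = Command::new("sh").arg("-c").arg(script).output()?;
    Ok(StepOutput {
        success: out.status.success(),
        stdout: String::from_utf8_lossy(&out.stdout).into_owned(),
        stderr: String::from_utf8_lossy(&out.stderr).into_owned(),
    })
}

fn main() -> std::io::Result<()> {
    let out = run_shell("echo hello; echo 'careful' 1>&2")?;
    println!("  stdout:\n{}", out.stdout.trim_end());
    // Print stderr whenever it is non-empty, even if the exit code was 0.
    if !out.stderr.trim().is_empty() {
        println!("  stderr:\n{}", out.stderr.trim_end());
    }
    println!("  {}", if out.success { "✓ done" } else { "✗ failed" });
    Ok(())
}
```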

Control flow

  • {{ VAR }} and {{ VAR | default: "x" }} template substitution in all string fields
  • on_failure: warn | confirm | fail per step
  • when: always | on_success to run cleanup steps after failure
  • optional: true to swallow errors silently
  • capture: VAR_NAME to pipe step stdout into a variable for later steps
  • poll: { interval, timeout, until } with conditions: empty, status == X, value < N, decreasing
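
The template substitution in the first bullet can be approximated with a small standard-library-only renderer. This is a sketch under assumptions, not the contents of src/runbooks/template.rs: the `render` name and signature are invented here, and treating an undefined variable without a default as an error is a guess at the behavior.

```rust
use std::collections::HashMap;

/// Renders `{{ VAR }}` and `{{ VAR | default: "x" }}` placeholders in `input`.
/// Assumption: a variable with neither a value nor a default is an error.
fn render(input: &str, vars: &HashMap<String, String>) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = input;
    while let Some(start) = rest.find("{{") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after.find("}}").ok_or("unclosed placeholder in template")?;
        let expr = after[..end].trim();
        // Split an optional `| default: "..."` filter off the variable name.
        let (name, default) = match expr.split_once('|') {
            Some((n, filter)) => {
                let value = filter
                    .trim()
                    .strip_prefix("default:")
                    .ok_or_else(|| format!("unsupported filter in `{}`", expr))?;
                (n.trim(), Some(value.trim().trim_matches('"')))
            }
            None => (expr, None),
        };
        // An explicit value wins over the default.
        match vars.get(name).map(String::as_str).or(default) {
            Some(v) => out.push_str(v),
            None => return Err(format!("undefined variable `{}`", name)),
        }
        rest = &after[end + 2..];
    }
    out.push_str(rest);
    Ok(out)
}

fn main() {
    let mut vars = HashMap::new();
    vars.insert("NAME".to_string(), "pup".to_string());
    println!("{:?}", render("echo \"Hello, {{ NAME }}!\"", &vars));
    println!("{:?}", render("--env={{ ENV | default: \"staging\" }}", &vars));
}
```

In this sketch a value supplied via --set or capture: would simply be an entry in `vars`, so later steps see it through the same lookup.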

Reference runbooks

Three annotated examples in docs/examples/runbooks/:

  • deploy-service.yaml — SLO check → incident gate → DD Workflow trigger → monitor poll → Slack notify
  • incident-triage.yaml — fetch incident → search logs → check monitors → auto-mitigation workflow → shell diagnostics
  • maintenance-window.yaml — create downtime (capture ID) → drain → metric poll → confirm → delete downtime

Not in this PoC

  • Parallel step execution
  • Step retry logic
  • Remote runbook registry / sync
  • Web UI

Platform support

pup runbooks and pup workflows are native-only and excluded from all wasm builds via #[cfg(not(target_arch = "wasm32"))]. They rely on capabilities that wasm targets don't provide:

Dependency                     Why it rules out wasm
tokio::process::Command        Subprocess spawning; unavailable in wasm
std::fs + dirs                 Filesystem access for runbook discovery and import; unavailable in browser wasm, restricted in WASI
chrono::Utc (step timestamps)  Fine in native/WASI but not pulled in for browser
client::raw_post / raw_get     These work in wasm, but they're only reached via the engine, which can't run

The gating covers every touch point — the mod runbooks declaration, both pub mod entries in commands/mod.rs, the Commands::Runbooks and Commands::Workflows enum variants, the three subcommand enums, and the dispatch arms in main_inner(). All three build targets pass cleanly:

  • cargo build (native) ✓
  • cargo check --target wasm32-wasip2 --features wasi ✓
  • cargo check --target wasm32-unknown-unknown --lib --features browser ✓
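
The gating pattern can be shown in miniature. In this simplified sketch the enum and `dispatch` function are stand-ins for the real Commands enum and main_inner(); the `Monitors` variant and the string return values are illustrative assumptions.

```rust
// Command groups that need subprocesses and filesystem access are compiled
// only for non-wasm targets. `Monitors` stands in for an always-available command.
#[derive(Debug)]
enum Commands {
    Monitors,
    #[cfg(not(target_arch = "wasm32"))]
    Runbooks,
    #[cfg(not(target_arch = "wasm32"))]
    Workflows,
}

fn dispatch(cmd: &Commands) -> &'static str {
    match cmd {
        Commands::Monitors => "monitors",
        // Each gated variant needs an equally gated match arm; otherwise the
        // wasm build would reference a variant that does not exist.
        #[cfg(not(target_arch = "wasm32"))]
        Commands::Runbooks => "runbooks",
        #[cfg(not(target_arch = "wasm32"))]
        Commands::Workflows => "workflows",
    }
}

fn main() {
    println!("{}", dispatch(&Commands::Monitors));
    #[cfg(not(target_arch = "wasm32"))]
    println!("{}", dispatch(&Commands::Runbooks));
    #[cfg(not(target_arch = "wasm32"))]
    println!("{}", dispatch(&Commands::Workflows));
}
```

Because the variant and its arm disappear together, the match stays exhaustive on every target without a catch-all arm.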

Testing

# Build
cargo build

# Create a test runbook
mkdir -p ~/.config/pup/runbooks
cat > ~/.config/pup/runbooks/hello.yaml <<'YAML'
name: hello
description: Test runbook
vars:
  NAME:
    default: world
steps:
  - name: Say hello
    kind: shell
    run: echo "Hello, {{ NAME }}!"
YAML

pup runbooks list
pup runbooks validate hello
pup runbooks run hello --set NAME=pup

All existing tests pass (cargo test, cargo clippy -- -D warnings, cargo fmt --check).


Discussion: #143

🤖 Generated with Claude Code

platinummonkey and others added 4 commits March 2, 2026 18:03
Implements the runbooks PoC as specified:

- `pup runbooks list/describe/run/validate/import` — execute YAML
  runbooks from ~/.config/pup/runbooks/ with {{ VAR }} templating,
  sequential step execution, poll loops, and confirm gates
- `pup workflows run/instances list/get` — trigger Datadog Workflows
  and poll to completion via raw REST (POST/GET /api/v2/workflows/...)

New files:
- src/runbooks/mod.rs     — Runbook, Step, VarDef, PollConfig types
- src/runbooks/template.rs — {{ VAR }} and | default: "x" rendering
- src/runbooks/loader.rs  — scan runbooks dir, load/import by name
- src/runbooks/engine.rs  — sequential executor with polling, confirm,
                             on_failure handling, variable capture
- src/commands/runbooks.rs — list/describe/run/validate/import CLI
- src/commands/workflows.rs — trigger + watch, instances list/get
- docs/examples/runbooks/  — deploy-service, incident-triage,
                              maintenance-window reference templates

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…hints

Each step now shows:
- Header: ► Step N/M  ·  <name>  ·  <kind>  [HH:MM:SS]  with command preview
- Labeled sections: ── stdout ── and ── stderr ── blocks wrapping output
- Footer: ✓/✗/⊘  <elapsed>  ·  next: step N/M — <name> (<kind>)
- Summary line with total elapsed time and pass/fail count

Shell steps surface non-empty stderr even on success so warnings
from curl, grep, etc. aren't silently dropped.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Remove RULE constant and all long separator lines
- Timestamps now show full UTC date+time: 2026-03-02 18:11:01 UTC
- Step header: [N/M] <name>  (<kind>)  <timestamp>
- Output labeled with indented "stdout:" / "stderr:" markers
- Summary line: ✓/⚠ done  <name>  N/M steps  <elapsed>  <timestamp>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ch = "wasm32"))]

Neither feature is compatible with wasm targets:
- loader.rs uses std::fs and reqwest::Client::new()
- engine.rs uses tokio::process::Command and chrono::Utc
- both require dirs-based config path resolution (native-only)

Gated items:
- mod runbooks; in main.rs
- pub mod runbooks/workflows; in commands/mod.rs
- Commands::Runbooks/Workflows variants and their subcommand enums
- dispatch arms in main_inner()

Verified: native build, wasm32-wasip2 (wasi feature), and
wasm32-unknown-unknown --lib (browser feature) all pass.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Labels

enhancement (New feature or request), product:automation