KIM3310/stage-pilot


StagePilot: Stage-Gated Tool-Calling Reliability Runtime


StagePilot is a TypeScript runtime and benchmark harness for stabilizing tool calls across provider families. The name comes from stage-gating unstable runs through parse, repair, replay, and review so reliability claims stay inspectable instead of hand-wavy.

The repo brings together three connected surfaces:

  1. @ai-sdk-tool/parser: AI SDK middleware for parsing tool calls from models that do not natively support tools.
  2. StagePilot: a multi-agent orchestration vertical with benchmark, API, demo UI, and Cloud Run path.
  3. BenchLab: prompt-mode BFCL experiment tooling, forensics, and local operator APIs.

Portfolio posture

  • Read this repo as a reliability runtime with evidence, not as a generic agent demo.
  • The strongest signals are checked-in benchmark lift, parser middleware behavior, replayable review APIs, and explicit regression and perf surfaces.
  • It is the clearest first-pass repo for AI engineer interviews, while still carrying useful platform and solutions-architecture signal.

Role signals

  • AI engineer: parser hardening, benchmark discipline, explicit failure classes, replayable traces, and bounded retry posture.
  • Solutions architect: package, API, benchmark, and review surfaces stay separate enough to discuss production posture honestly.
  • Recruiter / hiring manager: this is the fastest repo for showing I can turn unstable model behavior into reviewable runtime evidence.

Portfolio context

  • Portfolio family: AI reliability and incident systems
  • This repo's role: flagship reliability and tool-calling runtime in the portfolio.
  • Related repos: AegisOps, Aegis-Air, ogx

Big-Tech Elevation Track

  • Canonical execution plan: docs/BIGTECH_ELEVATION_PLAN.md
  • Goal: turn this repo into the clearest public proof of tool-runtime reliability across unstable providers.

Best target-team fit

This repo is strongest for frontier/runtime reliability and platform-infrastructure teams. For broader enterprise platform discussions, pair it with enterprise-llm-adoption-kit.

| Team lens | What should stand out fast | Start here |
| --- | --- | --- |
| Frontier / runtime reliability | parser hardening, benchmark lift, provider tradeoff posture, replayable traces, explicit regression gates, bounded retry posture, explicit failure classes, checked-in perf rehearsal | docs/reviewer-proof-guide.md, GET /v1/provider-benchmark-scorecard, GET /v1/trace-observability-pack, GET /v1/regression-gate-pack, GET /v1/perf-evidence-pack, GET /v1/failure-taxonomy, GET /v1/review-pack, docs/benchmarks/stagepilot-latest.json |
| Big tech / AI infra | runtime/API separation, contract visibility, developer-ops posture, honest benchmark boundary, bounded load evidence, replay-backed debugging posture | GET /v1/runtime-brief, GET /v1/perf-evidence-pack, GET /v1/trace-observability-pack, GET /v1/regression-gate-pack, GET /v1/provider-benchmark-scorecard, GET /v1/failure-taxonomy, GET /v1/schema/plan-report, GET /v1/developer-ops-pack |
| Evaluation systems | benchmark artifacts, variant comparison, local experiment loop, failure forensics | GET /v1/benchlab/review-pack, docs/benchlab/LOCAL_OLLAMA_SWEEP_20260311.md, experiments/ |
| High-trust workflow systems | reviewable workflow history, replayable proof surfaces, explicit parser-to-handoff boundary | GET /v1/workflow-runs, GET /v1/workflow-run-replay, docs/solution-architecture.md |

Project links

60-second reviewer start

```sh
pnpm install
pnpm review:proof
pnpm api:stagepilot
# open http://127.0.0.1:8080/demo
```

Then work through the Reviewer Front Door and Review Flow sections below.

If you only want the flagship story: treat this repo as the public proof that I can evaluate unreliable tool calls, harden the parser/runtime boundary, and present the results in a reviewer-safe way.

Reviewer Front Door

  • Recruiter / hiring manager: read docs/reviewer-proof-guide.md, then inspect docs/benchmarks/stagepilot-latest.json.
  • Bounded public live lane: POST /v1/live-review-run with a fixed scenarioId when you want reviewer-safe OpenAI proof instead of the full operator mutation surface.
  • AI engineer: open GET /v1/provider-benchmark-scorecard -> GET /v1/trace-observability-pack -> GET /v1/regression-gate-pack -> GET /v1/perf-evidence-pack -> GET /v1/failure-taxonomy -> GET /v1/review-pack -> GET /v1/schema/plan-report -> src/.
  • Platform / solutions architect: open GET /v1/runtime-brief -> GET /v1/perf-evidence-pack -> GET /v1/trace-observability-pack -> GET /v1/regression-gate-pack -> GET /v1/provider-benchmark-scorecard -> GET /v1/failure-taxonomy -> GET /v1/developer-ops-pack -> docs/solution-architecture.md.
  • Eval / research reviewer: open GET /v1/benchlab/review-pack -> docs/benchlab/ -> experiments/.

Choose Your First Lane

  • Parser package (real shipping surface): start in src/ + package.json; this is the npm/runtime path that actually hardens tool-call parsing.
  • StagePilot runtime/API (real runtime surface): start with /v1/runtime-brief and /v1/review-pack, then trace the implementation in src/stagepilot/ and src/api/.
  • BenchLab proof surface: use docs/benchmarks/, docs/benchlab/, and /v1/benchlab/* for checked-in experiment evidence.
  • Static/docs-only helpers: docs/review-pack.svg, site/, and narrative docs are reviewer aids, not the benchmark or runtime source of truth.

Review Pack At A Glance

  • StagePilot reviewer API: GET /v1/runtime-brief, GET /v1/review-pack, GET /v1/schema/plan-report
  • Provider benchmark scorecard: GET /v1/provider-benchmark-scorecard
  • Runtime perf evidence pack: GET /v1/perf-evidence-pack
  • Trace observability pack: GET /v1/trace-observability-pack
  • Regression gate pack: GET /v1/regression-gate-pack
  • Failure review surface: GET /v1/failure-taxonomy
  • StagePilot developer workflow pack: GET /v1/developer-ops-pack
  • StagePilot workflow history: GET /v1/workflow-runs, GET /v1/workflow-runs/:requestId
  • StagePilot workflow replay surface: GET /v1/workflow-run-replay
  • Local reviewer summary: pnpm review:proof
  • BenchLab reviewer API: GET /v1/benchlab/runtime-brief, GET /v1/benchlab/review-pack, GET /v1/benchlab/schema/job-report
  • Checked-in 24-case benchmark proof: baseline 29.17% -> middleware 87.50% -> Ralph loop 100.00%
  • Checked-in BenchLab claims: runtime compare, variant leaderboard, best artifacts, and failure forensics
  • Latest no-key local validation: llama3.1:8b, llama3.2:latest, qwen3.5:4b all moved from 7.83 to 8.33 with tuned RALPH variants on a 5 cases/category sweep
  • Llama follow-up hunt: on llama3.2:latest, schema-lock stayed ahead while parallel-safe, coverage, strict, call-count, and compact all stayed flat in a 3 cases/category search; a wider 10 cases/category replay still kept schema-lock positive at 7.50 -> 7.75 (+0.25pp)

Review Flow

  1. GET /v1/runtime-brief -> confirm orchestration readiness and integration posture.
  2. GET /v1/perf-evidence-pack -> inspect checked-in k6 rehearsal, latency posture, and release guardrails before scale claims.
  3. GET /v1/trace-observability-pack -> inspect replayable traces and operator escalation posture before frontier-runtime claims.
  4. GET /v1/regression-gate-pack -> inspect explicit promotion logic, watch items, and release posture.
  5. GET /v1/provider-benchmark-scorecard -> inspect provider-family contract confidence, latency/cost posture, and strongest protocol surfaces.
  6. GET /v1/failure-taxonomy -> inspect parser drift, retry exhaustion, delivery gaps, and observed runtime regressions in one place.
  7. GET /v1/developer-ops-pack -> inspect MR / pipeline / release lanes before demoing automation.
  8. GET /v1/workflow-runs -> verify recent developer workflow runs.
  9. GET /v1/workflow-run-replay -> inspect replay-ready proof routes and recent workflow timeline.
  10. GET /v1/review-pack -> inspect benchmark lift and parser/handoff boundary.
  11. GET /v1/schema/plan-report -> verify contract before trusting downstream routing output.
  12. GET /v1/benchlab/review-pack -> inspect checked-in runtime and artifact claims.
  13. docs/reviewer-proof-guide.md + docs/review-pack.svg + docs/benchmarks/stagepilot-latest.json + docs/benchmarks/stagepilot-runtime-load-latest.json + docs/benchmarks/stagepilot-trace-observability-latest.json + docs/benchmarks/stagepilot-regression-gate-latest.json -> read the strongest proof assets first.

StagePilot Review Pack

Further Reading

References and attribution

This repo keeps attribution explicit while treating stage-pilot as the canonical working surface for new development.

Project context

This repo focuses on tool-calling reliability, benchmarked success-rate improvement, and operational handoff readiness.

If you need API integration, the provided endpoints (/v1/plan, /v1/benchmark, /v1/insights, /v1/whatif, /v1/notify) can be used immediately, either locally or on Cloud Run.

Why this repo exists

Many models still output tool calls as loose text (<tool_call>...</tool_call>, relaxed JSON, trailing tokens, mixed formatting). This project hardens that path so tool execution remains stable instead of silently failing.

For the parser layer, this means:

  • parsing malformed tool-call text safely
  • coercing payloads to schema-compatible shapes
  • streaming tool inputs without depending on native provider tooling
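As an illustration of the kind of tolerant parsing involved, a minimal sketch that extracts a `<tool_call>` block and repairs relaxed JSON might look like the following. This is illustrative only, not the library's actual implementation; the function names here are hypothetical.

```typescript
// Illustrative only: a tolerant extractor for <tool_call>...</tool_call> text.
// The real @ai-sdk-tool/parser middleware is far more thorough.
function extractToolCall(raw: string): { name: string; args: unknown } | null {
  const match = raw.match(/<tool_call>([\s\S]*?)<\/tool_call>/);
  if (!match) return null;
  let body = match[1].trim();
  try {
    // First attempt: strict JSON.
    return normalize(JSON.parse(body));
  } catch {
    // Repair pass: strip trailing commas, a common relaxed-JSON failure.
    body = body.replace(/,\s*([}\]])/g, "$1");
    try {
      return normalize(JSON.parse(body));
    } catch {
      return null;
    }
  }
}

function normalize(payload: any): { name: string; args: unknown } {
  // Coerce the common payload shapes ("arguments" vs "args") into one.
  return { name: payload.name, args: payload.arguments ?? payload.args ?? {} };
}
```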

For StagePilot, this directly improves operation routing reliability by:

  • applying a bounded Ralph-loop retry when the first call is invalid.
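The bounded retry posture can be sketched generically. The attempt cap and names below are illustrative, not StagePilot's internals:

```typescript
// Illustrative bounded retry loop in the spirit of the Ralph loop:
// re-invoke the model at most `maxAttempts` times, stopping at the first valid call.
async function boundedRetry<T>(
  invoke: (attempt: number) => Promise<T>,
  isValid: (result: T) => boolean,
  maxAttempts = 2,
): Promise<{ result: T; attempts: number }> {
  let last!: T;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await invoke(attempt);
    if (isValid(last)) return { result: last, attempts: attempt };
  }
  // Bounded: after maxAttempts the caller sees the final (still invalid) result.
  return { result: last, attempts: maxAttempts };
}
```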

For BenchLab, it creates a repeatable environment to test prompt-mode tool-calling strategies and inspect error buckets instead of relying on anecdotal wins.

Repository layout

```
stage-pilot/
  src/
    api/
    bin/
    stagepilot/
  tests/
  docs/
    benchmarks/
    benchlab/
  experiments/
  scripts/
```

StagePilot benchmark (latest)

Source: docs/benchmarks/stagepilot-latest.json
Generated at: 2026-03-02T11:15:13.733Z
Cases: 24 (BENCHMARK_SEED=20260228, BENCHMARK_LOOP_ATTEMPTS=2)

| Strategy | Parse/Plan Success | Success Rate | Avg Latency (ms) | P95 Latency (ms) | Avg Attempts |
| --- | --- | --- | --- | --- | --- |
| baseline | 7 / 24 | 29.17% | 0.02 | 0.03 | 1.00 |
| middleware | 21 / 24 | 87.50% | 0.14 | 0.45 | 1.00 |
| middleware+ralph-loop | 24 / 24 | 100.00% | 0.04 | 0.08 | 1.13 |

Improvement deltas:

  • Middleware vs Baseline: +58.33pp
  • Ralph Loop vs Middleware: +12.50pp
  • Ralph Loop vs Baseline: +70.83pp
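The deltas above are plain percentage-point differences between success rates, which can be checked directly:

```typescript
// Percentage-point delta between two success rates, e.g. 21/24 vs 7/24.
function deltaPp(
  a: { ok: number; total: number },
  b: { ok: number; total: number },
): number {
  const rate = (x: { ok: number; total: number }) => (100 * x.ok) / x.total;
  // Round to two decimals to match the benchmark report.
  return Number((rate(a) - rate(b)).toFixed(2));
}
```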

Ralph-loop point (what changed):

  • middleware is already robust on malformed payloads.
  • middleware+ralph-loop adds one bounded retry pass (default max 2 attempts), letting the second corrected output recover remaining failures.
  • In the current checked-in 24-case benchmark, that closes the gap from 87.50% to 100.00%.

Latency note: these numbers come from deterministic in-process benchmark harness execution (parser + planning), not network LLM round-trip latency.

Review-pack surfaces now expose this benchmark delta directly through /v1/review-pack so reviewers can inspect the lift without parsing the raw JSON file first.

Supporting Files

  • docs/reviewer-proof-guide.md
  • docs/review-pack.svg
  • docs/DEVELOPER_OPS_PACK.md
  • docs/benchmarks/stagepilot-latest.json
  • docs/STAGEPILOT.md
  • docs/benchlab/TOOL_CALLING_GAINS.md
  • docs/benchlab/FAILURE_TAXONOMY.md

Quick start

1) Install

```sh
pnpm install
```

2) Print the reviewer proof summary

```sh
pnpm review:proof
```

3) Run StagePilot demo flow

```sh
pnpm demo:stagepilot
```

4) Run local API + judge demo UI

```sh
pnpm api:stagepilot
# open http://127.0.0.1:8080/demo
```

5) Reproduce benchmark

```sh
pnpm bench:stagepilot
```

Optional benchmark knobs:

```sh
BENCHMARK_CASES=24 BENCHMARK_SEED=20260228 BENCHMARK_LOOP_ATTEMPTS=2 pnpm bench:stagepilot
```
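A harness might read such knobs with safe numeric defaults along these lines (a sketch under assumed defaults matching the values above; the actual benchmark script may differ):

```typescript
// Read a numeric benchmark knob from the environment, falling back to a default
// when the variable is unset or not a valid integer.
function envInt(name: string, fallback: number): number {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) ? parsed : fallback;
}

const benchmarkConfig = {
  cases: envInt("BENCHMARK_CASES", 24),
  seed: envInt("BENCHMARK_SEED", 20260228),
  loopAttempts: envInt("BENCHMARK_LOOP_ATTEMPTS", 2),
};
```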

BenchLab quick start

Run the local BenchLab operator API:

```sh
pnpm api:benchlab
# open http://127.0.0.1:8090/benchlab
```

BenchLab surfaces:

  • GET /benchlab
  • GET /health
  • GET /v1/benchlab/runtime-brief
  • GET /v1/benchlab/review-pack
  • GET /v1/benchlab/schema/job-report
  • GET /v1/benchlab/configs
  • GET /v1/benchlab/jobs
  • GET /v1/benchlab/jobs/:id
  • GET /v1/benchlab/jobs/:id/logs
  • POST /v1/benchlab/jobs/:id/cancel

BenchLab repo assets:

  • research notes under docs/benchlab/
  • runnable prompt-mode experiments under experiments/
  • local operator scripts under scripts/

StagePilot architecture (high-level)

  • EligibilityAgent: triage eligibility and constraints
  • SafetyAgent: risk and urgency assessment
  • PlannerAgent: route/action plan generation
  • OutreachAgent: execution-ready outreach actions
  • JudgeAgent: final consistency gate
  • StagePilotEngine: orchestration runtime
  • simulateStagePilotTwin: what-if simulation for staffing/demand/contact-rate deltas
  • GeminiGateway (optional): narrative summarization layer

Core files:

  • src/stagepilot/types.ts
  • src/stagepilot/ontology.ts
  • src/stagepilot/agents.ts
  • src/stagepilot/orchestrator.ts
  • src/stagepilot/twin.ts
  • src/stagepilot/benchmark.ts

API surface

Run API:

```sh
pnpm api:stagepilot
```

Endpoints:

  • GET /demo
  • GET /health
  • GET /v1/meta
  • GET /v1/runtime-brief
  • GET /v1/review-pack
  • GET /v1/schema/plan-report
  • POST /v1/plan
  • POST /v1/benchmark
  • POST /v1/insights
  • POST /v1/whatif
  • POST /v1/notify
  • POST /v1/openclaw/inbox

See full behavior and payload examples in docs/STAGEPILOT.md.

Service-Grade Surfaces

  • /v1/runtime-brief, /v1/review-pack, and /v1/schema/plan-report expose StagePilot readiness, benchmark proof, parser/orchestration posture, and report contract.
  • /v1/benchlab/runtime-brief, /v1/benchlab/review-pack, and /v1/benchlab/schema/job-report expose BenchLab evidence counts, checked-in claim proof, dominant failure buckets, and job-report expectations.
  • /demo and /benchlab now render review-pack surfaces directly in the UI so reviewers can validate posture without reading code first.

BenchLab API entrypoint:

```sh
pnpm api:benchlab
```

Cloud Run deployment (Google-only)

```sh
pnpm deploy:stagepilot
```

Post-deploy smoke test:

```sh
STAGEPILOT_BASE_URL="https://<your-cloud-run-url>" pnpm smoke:stagepilot
```

Runtime notes:

  • CPU-only enforced: USE_GPU=0
  • Secret Manager key mapping expected for GEMINI_API_KEY
  • safety timeouts supported:
    • GEMINI_HTTP_TIMEOUT_MS
    • STAGEPILOT_REQUEST_BODY_TIMEOUT_MS
    • OPENCLAW_WEBHOOK_TIMEOUT_MS
    • OPENCLAW_CLI_TIMEOUT_MS

@ai-sdk-tool/parser usage

Install package only:

```sh
pnpm add @ai-sdk-tool/parser
```

Quick example:

```ts
import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
import { morphXmlToolMiddleware } from "@ai-sdk-tool/parser";
import { stepCountIs, streamText, wrapLanguageModel } from "ai";
import { z } from "zod";

const model = createOpenAICompatible({
  name: "openrouter",
  apiKey: process.env.OPENROUTER_API_KEY,
  baseURL: "https://openrouter.ai/api/v1",
})("arcee-ai/trinity-large-preview:free");

const result = streamText({
  model: wrapLanguageModel({
    model,
    middleware: morphXmlToolMiddleware,
  }),
  stopWhen: stepCountIs(4),
  prompt: "What is the weather in Seoul?",
  tools: {
    get_weather: {
      description: "Get weather by city name",
      inputSchema: z.object({ city: z.string() }),
      execute: async ({ city }) => ({ city, condition: "sunny", celsius: 23 }),
    },
  },
});

for await (const part of result.fullStream) {
  // text-delta / tool-input-start / tool-input-delta / tool-input-end / tool-call / tool-result
}
```

Preconfigured middleware exports:

| Middleware | Best for |
| --- | --- |
| hermesToolMiddleware | JSON-style tool payloads |
| morphXmlToolMiddleware | XML-style payloads + schema-aware coercion |
| yamlXmlToolMiddleware | XML tool tags + YAML bodies |
| qwen3CoderToolMiddleware | Qwen/UI-TARS style `<tool_call>` markup |

AI SDK compatibility

Fact-checked against this repo's CHANGELOG.md and npm metadata (as of 2026-02-18).

| @ai-sdk-tool/parser major | AI SDK major | Status |
| --- | --- | --- |
| v1.x | v4.x | Legacy |
| v2.x | v5.x | Legacy |
| v3.x | v6.x | Legacy |
| v4.x | v6.x | Active (latest) |

Local development

```sh
pnpm fmt:biome
pnpm check
pnpm test
pnpm build
```

One-command verification:

```sh
pnpm verify
```

If pnpm is not available:

```sh
corepack enable
corepack prepare pnpm@9.14.4 --activate
```

Docs map

License

Apache-2.0

Local Verification

```sh
pnpm install
pnpm verify
```

Expanded form:

```sh
pnpm check
pnpm typecheck
pnpm test
pnpm build
```

Repository Hygiene

  • Keep runtime artifacts out of commits (.codex_runs/, cache folders, temporary venvs).
  • Prefer running verification commands above before opening a PR.

About

Reliable tool calling for non-native models: parser middleware, StagePilot orchestration, and BenchLab prompt-mode eval tooling.
