StagePilot is a TypeScript runtime and benchmark harness for stabilizing tool calls across provider families. The name comes from stage-gating unstable runs through parse, repair, replay, and review so reliability claims stay inspectable instead of hand-wavy.
The repo brings together three connected surfaces:
- @ai-sdk-tool/parser: AI SDK middleware for parsing tool calls from models that do not natively support tools.
- StagePilot: a multi-agent orchestration vertical with benchmark, API, demo UI, and Cloud Run path.
- BenchLab: prompt-mode BFCL experiment tooling, forensics, and local operator APIs.
- Read this repo as a reliability runtime with evidence, not as a generic agent demo.
- The strongest signals are checked-in benchmark lift, parser middleware behavior, replayable review APIs, and explicit regression and perf surfaces.
- It is the clearest first-pass repo for AI engineer interviews, while still carrying useful platform and solutions-architecture signal.
- AI engineer: parser hardening, benchmark discipline, explicit failure classes, replayable traces, and bounded retry posture.
- Solutions architect: package, API, benchmark, and review surfaces stay separate enough to discuss production posture honestly.
- Recruiter / hiring manager: this is the fastest repo for showing I can turn unstable model behavior into reviewable runtime evidence.
- Portfolio family: AI reliability and incident systems
- This repo's role: flagship reliability and tool-calling runtime in the portfolio.
- Related repos: AegisOps, Aegis-Air, ogx
- Canonical execution plan: docs/BIGTECH_ELEVATION_PLAN.md
- Goal: turn this repo into the clearest public proof of tool-runtime reliability across unstable providers.
This repo is strongest for frontier/runtime reliability and platform-infrastructure teams. For broader enterprise platform discussions, pair it with enterprise-llm-adoption-kit.
| Team lens | What should stand out fast | Start here |
|---|---|---|
| Frontier / runtime reliability | parser hardening, benchmark lift, provider tradeoff posture, replayable traces, explicit regression gates, bounded retry posture, explicit failure classes, checked-in perf rehearsal | docs/reviewer-proof-guide.md, GET /v1/provider-benchmark-scorecard, GET /v1/trace-observability-pack, GET /v1/regression-gate-pack, GET /v1/perf-evidence-pack, GET /v1/failure-taxonomy, GET /v1/review-pack, docs/benchmarks/stagepilot-latest.json |
| Big tech / AI infra | runtime/API separation, contract visibility, developer-ops posture, honest benchmark boundary, bounded load evidence, replay-backed debugging posture | GET /v1/runtime-brief, GET /v1/perf-evidence-pack, GET /v1/trace-observability-pack, GET /v1/regression-gate-pack, GET /v1/provider-benchmark-scorecard, GET /v1/failure-taxonomy, GET /v1/schema/plan-report, GET /v1/developer-ops-pack |
| Evaluation systems | benchmark artifacts, variant comparison, local experiment loop, failure forensics | GET /v1/benchlab/review-pack, docs/benchlab/LOCAL_OLLAMA_SWEEP_20260311.md, experiments/ |
| High-trust workflow systems | reviewable workflow history, replayable proof surfaces, explicit parser-to-handoff boundary | GET /v1/workflow-runs, GET /v1/workflow-run-replay, docs/solution-architecture.md |
- GitHub profile: https://github.com/KIM3310
- GitHub repository: https://github.com/KIM3310/stage-pilot
- Demo video: https://youtu.be/6trgTH1vX4M
pnpm install
pnpm review:proof
pnpm api:stagepilot
# open http://127.0.0.1:8080/demo

Then read, in order:

- docs/reviewer-proof-guide.md
- docs/benchmarks/stagepilot-latest.json
- docs/executive-one-pager.md
- docs/solution-architecture.md
If you only want the flagship story: treat this repo as the public proof that I can evaluate unreliable tool calls, harden the parser/runtime boundary, and present the results in a reviewer-safe way.
- Recruiter / hiring manager: read docs/reviewer-proof-guide.md, then inspect docs/benchmarks/stagepilot-latest.json.
- Bounded public live lane: POST /v1/live-review-run with a fixed scenarioId when you want reviewer-safe OpenAI proof instead of the full operator mutation surface.
- AI engineer: open GET /v1/provider-benchmark-scorecard -> GET /v1/trace-observability-pack -> GET /v1/regression-gate-pack -> GET /v1/perf-evidence-pack -> GET /v1/failure-taxonomy -> GET /v1/review-pack -> GET /v1/schema/plan-report -> src/.
- Platform / solutions architect: open GET /v1/runtime-brief -> GET /v1/perf-evidence-pack -> GET /v1/trace-observability-pack -> GET /v1/regression-gate-pack -> GET /v1/provider-benchmark-scorecard -> GET /v1/failure-taxonomy -> GET /v1/developer-ops-pack -> docs/solution-architecture.md.
- Eval / research reviewer: open GET /v1/benchlab/review-pack -> docs/benchlab/ -> experiments/.
- Parser package (real shipping surface): start in src/ + package.json; this is the npm/runtime path that actually hardens tool-call parsing.
- StagePilot runtime/API (real runtime surface): start with /v1/runtime-brief and /v1/review-pack, then trace the implementation in src/stagepilot/ and src/api/.
- BenchLab proof surface: use docs/benchmarks/, docs/benchlab/, and /v1/benchlab/* for checked-in experiment evidence.
- Static/docs-only helpers: docs/review-pack.svg, site/, and narrative docs are reviewer aids, not the benchmark or runtime source of truth.
- StagePilot reviewer API: GET /v1/runtime-brief, GET /v1/review-pack, GET /v1/schema/plan-report
- Provider benchmark scorecard: GET /v1/provider-benchmark-scorecard
- Runtime perf evidence pack: GET /v1/perf-evidence-pack
- Trace observability pack: GET /v1/trace-observability-pack
- Regression gate pack: GET /v1/regression-gate-pack
- Failure review surface: GET /v1/failure-taxonomy
- StagePilot developer workflow pack: GET /v1/developer-ops-pack
- StagePilot workflow history: GET /v1/workflow-runs, GET /v1/workflow-runs/:requestId
- StagePilot workflow replay surface: GET /v1/workflow-run-replay
- Local reviewer summary: pnpm review:proof
- BenchLab reviewer API: GET /v1/benchlab/runtime-brief, GET /v1/benchlab/review-pack, GET /v1/benchlab/schema/job-report
- Checked-in 24-case benchmark proof: baseline 29.17% -> middleware 87.50% -> Ralph loop 100.00%
- Checked-in BenchLab claims: runtime compare, variant leaderboard, best artifacts, and failure forensics
- Latest no-key local validation: llama3.1:8b, llama3.2:latest, and qwen3.5:4b all moved from 7.83 to 8.33 with tuned RALPH variants on a 5 cases/category sweep
- Llama follow-up hunt: on llama3.2:latest, schema-lock stayed ahead while parallel-safe, coverage, strict, call-count, and compact all stayed flat in a 3 cases/category search; a wider 10 cases/category replay still kept schema-lock positive at 7.50 -> 7.75 (+0.25pp)
- GET /v1/runtime-brief -> confirm orchestration readiness and integration posture.
- GET /v1/perf-evidence-pack -> inspect checked-in k6 rehearsal, latency posture, and release guardrails before scale claims.
- GET /v1/trace-observability-pack -> inspect replayable traces and operator escalation posture before frontier-runtime claims.
- GET /v1/regression-gate-pack -> inspect explicit promotion logic, watch items, and release posture.
- GET /v1/provider-benchmark-scorecard -> inspect provider-family contract confidence, latency/cost posture, and strongest protocol surfaces.
- GET /v1/failure-taxonomy -> inspect parser drift, retry exhaustion, delivery gaps, and observed runtime regressions in one place.
- GET /v1/developer-ops-pack -> inspect MR / pipeline / release lanes before demoing automation.
- GET /v1/workflow-runs -> verify recent developer workflow runs.
- GET /v1/workflow-run-replay -> inspect replay-ready proof routes and recent workflow timeline.
- GET /v1/review-pack -> inspect benchmark lift and parser/handoff boundary.
- GET /v1/schema/plan-report -> verify contract before trusting downstream routing output.
- GET /v1/benchlab/review-pack -> inspect checked-in runtime and artifact claims.
- docs/reviewer-proof-guide.md + docs/review-pack.svg + docs/benchmarks/stagepilot-latest.json + docs/benchmarks/stagepilot-runtime-load-latest.json + docs/benchmarks/stagepilot-trace-observability-latest.json + docs/benchmarks/stagepilot-regression-gate-latest.json -> read the strongest proof assets first.
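The StagePilot portion of that walk can be captured as data. The helper below is purely illustrative (it is not part of the repo) and assumes the StagePilot API is serving on its default local port; BenchLab routes live on a separate API and are omitted.

```typescript
// Illustrative only: walkUrls is not part of the repo; the endpoint paths
// come from this README's reviewer walkthrough.
const reviewerWalk = [
  "/v1/runtime-brief",
  "/v1/perf-evidence-pack",
  "/v1/trace-observability-pack",
  "/v1/regression-gate-pack",
  "/v1/provider-benchmark-scorecard",
  "/v1/failure-taxonomy",
  "/v1/developer-ops-pack",
  "/v1/workflow-runs",
  "/v1/workflow-run-replay",
  "/v1/review-pack",
  "/v1/schema/plan-report",
] as const;

// Resolve each path against a base URL so a reviewer can curl or fetch
// them in the documented order.
function walkUrls(baseUrl: string): string[] {
  return reviewerWalk.map((path) => new URL(path, baseUrl).toString());
}

walkUrls("http://127.0.0.1:8080");
```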
- Reviewer guide: docs/reviewer-proof-guide.md
- Architecture: docs/solution-architecture.md
- Overview: docs/executive-one-pager.md
- Discovery notes: docs/discovery-guide.md
- Local no-key sweep: docs/benchlab/LOCAL_OLLAMA_SWEEP_20260311.md
- Earlier fork / baseline reference: https://github.com/KIM3310/ai-sdk-tool-call-middleware
- Upstream source lineage: https://github.com/minpeter/ai-sdk-tool-call-middleware
This repo keeps attribution explicit while treating stage-pilot as the canonical working surface for new development.
This repo focuses on tool-calling reliability, benchmarked success-rate improvement, and operational handoff readiness.
If you need API integration, the provided endpoints (/v1/plan, /v1/benchmark, /v1/insights, /v1/whatif, /v1/notify) are usable immediately, either locally or on Cloud Run.
Many models still output tool calls as loose text (<tool_call>...</tool_call>, relaxed JSON, trailing tokens, mixed formatting). This project hardens that path so tool execution remains stable instead of silently failing.
For the parser layer, this means:
- parsing malformed tool-call text safely
- coercing payloads to schema-compatible shapes
- streaming tool inputs without depending on native provider tooling
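As an illustration of that posture, here is a minimal sketch of tolerant extraction plus relaxed-JSON repair. The names (`extractToolCall`, `ToolCall`) and the repair rule are assumptions for this README, not the package's actual API.

```typescript
// Hypothetical sketch only: not the @ai-sdk-tool/parser implementation.
type ToolCall = { name: string; args: Record<string, unknown> };

function extractToolCall(raw: string): ToolCall | null {
  // Tolerate prose and trailing tokens around the <tool_call> tag.
  const match = raw.match(/<tool_call>([\s\S]*?)<\/tool_call>/);
  if (!match) return null;
  // Relaxed-JSON repair: strip trailing commas before } or ].
  const body = match[1].trim().replace(/,\s*([}\]])/g, "$1");
  try {
    const parsed = JSON.parse(body) as { name?: unknown; arguments?: unknown };
    if (typeof parsed.name !== "string") return null;
    const args =
      parsed.arguments && typeof parsed.arguments === "object"
        ? (parsed.arguments as Record<string, unknown>)
        : {};
    return { name: parsed.name, args };
  } catch {
    return null; // unrecoverable payload: surface a parse failure, don't crash
  }
}
```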
For StagePilot, this directly improves operation routing reliability by:
- applying a bounded Ralph-loop retry when the first call is invalid.
For BenchLab, it creates a repeatable environment to test prompt-mode tool-calling strategies and inspect error buckets instead of relying on anecdotal wins.
stage-pilot/
src/
api/
bin/
stagepilot/
tests/
docs/
benchmarks/
benchlab/
experiments/
scripts/
Source: docs/benchmarks/stagepilot-latest.json
Generated at: 2026-03-02T11:15:13.733Z
Cases: 24 (BENCHMARK_SEED=20260228, BENCHMARK_LOOP_ATTEMPTS=2)
| Strategy | Parse/Plan Success | Success Rate | Avg Latency (ms) | P95 Latency (ms) | Avg Attempts |
|---|---|---|---|---|---|
| baseline | 7 / 24 | 29.17% | 0.02 | 0.03 | 1.00 |
| middleware | 21 / 24 | 87.50% | 0.14 | 0.45 | 1.00 |
| middleware+ralph-loop | 24 / 24 | 100.00% | 0.04 | 0.08 | 1.13 |
Improvement deltas:

- Middleware vs Baseline: +58.33pp
- Ralph Loop vs Middleware: +12.50pp
- Ralph Loop vs Baseline: +70.83pp
Ralph-loop point (what changed):

- middleware is already robust on malformed payloads.
- middleware+ralph-loop adds one bounded retry pass (default max 2 attempts), letting the second corrected output recover remaining failures.
- In the current checked-in 24-case benchmark, that closes the gap from 87.50% to 100.00%.
Latency note: these numbers come from deterministic in-process benchmark harness execution (parser + planning), not network LLM round-trip latency.
Review-pack surfaces now expose this benchmark delta directly through /v1/review-pack so reviewers can inspect the lift without parsing the raw JSON file first.
- docs/reviewer-proof-guide.md
- docs/review-pack.svg
- docs/DEVELOPER_OPS_PACK.md
- docs/benchmarks/stagepilot-latest.json
- docs/STAGEPILOT.md
- docs/benchlab/TOOL_CALLING_GAINS.md
- docs/benchlab/FAILURE_TAXONOMY.md
pnpm install
pnpm review:proof
pnpm demo:stagepilot
pnpm api:stagepilot
# open http://127.0.0.1:8080/demo
pnpm bench:stagepilot

Optional benchmark knobs:

BENCHMARK_CASES=24 BENCHMARK_SEED=20260228 BENCHMARK_LOOP_ATTEMPTS=2 pnpm bench:stagepilot

Run the local BenchLab operator API:

pnpm api:benchlab
# open http://127.0.0.1:8090/benchlab

BenchLab surfaces:

- GET /benchlab
- GET /health
- GET /v1/benchlab/runtime-brief
- GET /v1/benchlab/review-pack
- GET /v1/benchlab/schema/job-report
- GET /v1/benchlab/configs
- GET /v1/benchlab/jobs
- GET /v1/benchlab/jobs/:id
- GET /v1/benchlab/jobs/:id/logs
- POST /v1/benchlab/jobs/:id/cancel
BenchLab repo assets:
- research notes under docs/benchlab/
- runnable prompt-mode experiments under experiments/
- local operator scripts under scripts/
- EligibilityAgent: triage eligibility and constraints
- SafetyAgent: risk and urgency assessment
- PlannerAgent: route/action plan generation
- OutreachAgent: execution-ready outreach actions
- JudgeAgent: final consistency gate
- StagePilotEngine: orchestration runtime
- simulateStagePilotTwin: what-if simulation for staffing/demand/contact-rate deltas
- GeminiGateway (optional): narrative summarization layer
Core files:
- src/stagepilot/types.ts
- src/stagepilot/ontology.ts
- src/stagepilot/agents.ts
- src/stagepilot/orchestrator.ts
- src/stagepilot/twin.ts
- src/stagepilot/benchmark.ts
Run API:

pnpm api:stagepilot

Endpoints:

- GET /demo
- GET /health
- GET /v1/meta
- GET /v1/runtime-brief
- GET /v1/review-pack
- GET /v1/schema/plan-report
- POST /v1/plan
- POST /v1/benchmark
- POST /v1/insights
- POST /v1/whatif
- POST /v1/notify
- POST /v1/openclaw/inbox
See full behavior and payload examples in docs/STAGEPILOT.md.
- /v1/runtime-brief, /v1/review-pack, and /v1/schema/plan-report expose StagePilot readiness, benchmark proof, parser/orchestration posture, and report contract.
- /v1/benchlab/runtime-brief, /v1/benchlab/review-pack, and /v1/benchlab/schema/job-report expose BenchLab evidence counts, checked-in claim proof, dominant failure buckets, and job-report expectations.
- /demo and /benchlab now render review-pack surfaces directly in the UI so reviewers can validate posture without reading code first.
BenchLab API entrypoint:

pnpm api:benchlab

Deploy:

pnpm deploy:stagepilot

Post-deploy smoke test:

STAGEPILOT_BASE_URL="https://<your-cloud-run-url>" pnpm smoke:stagepilot

Runtime notes:

- CPU-only enforced: USE_GPU=0
- Secret Manager key mapping expected for GEMINI_API_KEY
- safety timeouts supported: GEMINI_HTTP_TIMEOUT_MS, STAGEPILOT_REQUEST_BODY_TIMEOUT_MS, OPENCLAW_WEBHOOK_TIMEOUT_MS, OPENCLAW_CLI_TIMEOUT_MS
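The safety-timeout posture can be sketched as a small environment reader. The helper name and the fallback value below are illustrative assumptions, not the repo's actual defaults.

```typescript
// Illustrative sketch: read a millisecond timeout from the environment,
// falling back when the variable is unset or invalid.
function timeoutMs(name: string, fallbackMs: number): number {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number(raw);
  // Reject non-numeric or non-positive values instead of hanging on bad config.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}

// Hypothetical usage; 10_000 ms is an assumed fallback, not the repo default.
const geminiTimeoutMs = timeoutMs("GEMINI_HTTP_TIMEOUT_MS", 10_000);
```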
Install package only:

pnpm add @ai-sdk-tool/parser

Quick example:
import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
import { morphXmlToolMiddleware } from "@ai-sdk-tool/parser";
import { stepCountIs, streamText, wrapLanguageModel } from "ai";
import { z } from "zod";
const model = createOpenAICompatible({
name: "openrouter",
apiKey: process.env.OPENROUTER_API_KEY,
baseURL: "https://openrouter.ai/api/v1",
})("arcee-ai/trinity-large-preview:free");
const result = streamText({
model: wrapLanguageModel({
model,
middleware: morphXmlToolMiddleware,
}),
stopWhen: stepCountIs(4),
prompt: "What is the weather in Seoul?",
tools: {
get_weather: {
description: "Get weather by city name",
inputSchema: z.object({ city: z.string() }),
execute: async ({ city }) => ({ city, condition: "sunny", celsius: 23 }),
},
},
});
for await (const part of result.fullStream) {
// text-delta / tool-input-start / tool-input-delta / tool-input-end / tool-call / tool-result
}

Preconfigured middleware exports:
| Middleware | Best for |
|---|---|
| hermesToolMiddleware | JSON-style tool payloads |
| morphXmlToolMiddleware | XML-style payloads + schema-aware coercion |
| yamlXmlToolMiddleware | XML tool tags + YAML bodies |
| qwen3CoderToolMiddleware | Qwen/UI-TARS style <tool_call> markup |
Fact-checked from this repo CHANGELOG.md and npm metadata (as of 2026-02-18).
| @ai-sdk-tool/parser major | AI SDK major | Status |
|---|---|---|
| v1.x | v4.x | Legacy |
| v2.x | v5.x | Legacy |
| v3.x | v6.x | Legacy |
| v4.x | v6.x | Active (latest) |
pnpm fmt:biome
pnpm check
pnpm test
pnpm build

One-command verification:

pnpm verify

If pnpm is not available:
corepack enable
corepack prepare pnpm@9.14.4 --activate

- Reviewer guide: docs/reviewer-proof-guide.md
- StagePilot guide: docs/STAGEPILOT.md
- Latest benchmark artifact: docs/benchmarks/stagepilot-latest.json
- BenchLab gains: docs/benchlab/TOOL_CALLING_GAINS.md
- BenchLab failure taxonomy: docs/benchlab/FAILURE_TAXONOMY.md
- Parser core examples: examples/parser-core/README.md
- RXML examples: examples/rxml-core/README.md
- Prompt-mode experiments: experiments/*
Apache-2.0
pnpm install
pnpm verify

Expanded form:
pnpm check
pnpm typecheck
pnpm test
pnpm build

- Keep runtime artifacts out of commits (.codex_runs/, cache folders, temporary venvs).
- Prefer running the verification commands above before opening a PR.