StagePilot is a TypeScript runtime and benchmark harness for stabilizing tool calls across provider families. The name comes from stage-gating unstable runs through parse, repair, replay, and review so reliability claims stay inspectable instead of hand-wavy.
The repo brings together three connected surfaces:
- @ai-sdk-tool/parser: AI SDK middleware for parsing tool calls from models that do not natively support tools.
- StagePilot: a multi-agent orchestration vertical with benchmark, API, demo UI, and Cloud Run path.
- BenchLab: prompt-mode BFCL experiment tooling, forensics, and local operator APIs.
- Read this repo as a reliability runtime with evidence, not as a generic agent demo.
- The strongest signals are checked-in benchmark lift, parser middleware behavior, replayable review APIs, and explicit regression and perf surfaces.
- It is the clearest first-pass repo for AI engineer interviews, while still carrying useful platform and solutions-architecture signal.
- AI engineer: parser hardening, benchmark discipline, explicit failure classes, replayable traces, and bounded retry posture.
- Solutions architect: package, API, benchmark, and review surfaces stay separate enough to discuss production posture honestly.
- Recruiter / hiring manager: this is the fastest repo for showing I can turn unstable model behavior into reviewable runtime evidence.
- Portfolio family: AI reliability and incident systems
- This repo's role: flagship reliability and tool-calling runtime in the portfolio.
- Related repos: AegisOps, Aegis-Air, ogx
- Canonical execution plan: docs/BIGTECH_ELEVATION_PLAN.md
- Goal: turn this repo into the clearest public proof of tool-runtime reliability across unstable providers.
This repo is strongest for frontier/runtime reliability and platform-infrastructure teams. For broader enterprise platform discussions, pair it with enterprise-llm-adoption-kit.
| Team lens | What should stand out fast | Start here |
|---|---|---|
| Frontier / runtime reliability | parser hardening, benchmark lift, provider tradeoff posture, replayable traces, explicit regression gates, bounded retry posture, explicit failure classes, checked-in perf rehearsal | docs/reviewer-proof-guide.md, GET /v1/provider-benchmark-scorecard, GET /v1/trace-observability-pack, GET /v1/regression-gate-pack, GET /v1/perf-evidence-pack, GET /v1/failure-taxonomy, GET /v1/review-pack, docs/benchmarks/stagepilot-latest.json |
| Big tech / AI infra | runtime/API separation, contract visibility, developer-ops posture, honest benchmark boundary, bounded load evidence, replay-backed debugging posture | GET /v1/runtime-brief, GET /v1/perf-evidence-pack, GET /v1/trace-observability-pack, GET /v1/regression-gate-pack, GET /v1/provider-benchmark-scorecard, GET /v1/failure-taxonomy, GET /v1/schema/plan-report, GET /v1/developer-ops-pack |
| Evaluation systems | benchmark artifacts, variant comparison, local experiment loop, failure forensics | GET /v1/benchlab/review-pack, docs/benchlab/LOCAL_OLLAMA_SWEEP_20260311.md, experiments/ |
| High-trust workflow systems | reviewable workflow history, replayable proof surfaces, explicit parser-to-handoff boundary | GET /v1/workflow-runs, GET /v1/workflow-run-replay, docs/solution-architecture.md |
- GitHub profile: https://github.com/KIM3310
- GitHub repository: https://github.com/KIM3310/stage-pilot
- Demo video: https://youtu.be/6trgTH1vX4M
pnpm install
pnpm review:proof
pnpm api:stagepilot
# open http://127.0.0.1:8080/demo

Then read, in order:

- docs/reviewer-proof-guide.md
- docs/benchmarks/stagepilot-latest.json
- docs/executive-one-pager.md
- docs/solution-architecture.md
If you only want the flagship story: treat this repo as the public proof that I can evaluate unreliable tool calls, harden the parser/runtime boundary, and present the results in a reviewer-safe way.
- Recruiter / hiring manager: read docs/reviewer-proof-guide.md, then inspect docs/benchmarks/stagepilot-latest.json.
- Bounded public live lane: POST /v1/live-review-run with a fixed scenarioId when you want reviewer-safe OpenAI proof instead of the full operator mutation surface.
- AI engineer: open GET /v1/provider-benchmark-scorecard -> GET /v1/trace-observability-pack -> GET /v1/regression-gate-pack -> GET /v1/perf-evidence-pack -> GET /v1/failure-taxonomy -> GET /v1/review-pack -> GET /v1/schema/plan-report -> src/.
- Platform / solutions architect: open GET /v1/runtime-brief -> GET /v1/perf-evidence-pack -> GET /v1/trace-observability-pack -> GET /v1/regression-gate-pack -> GET /v1/provider-benchmark-scorecard -> GET /v1/failure-taxonomy -> GET /v1/developer-ops-pack -> docs/solution-architecture.md.
- Eval / research reviewer: open GET /v1/benchlab/review-pack -> docs/benchlab/ -> experiments/.
- Parser package (real shipping surface): start in src/ + package.json; this is the npm/runtime path that actually hardens tool-call parsing.
- StagePilot runtime/API (real runtime surface): start with /v1/runtime-brief and /v1/review-pack, then trace the implementation in src/stagepilot/ and src/api/.
- BenchLab proof surface: use docs/benchmarks/, docs/benchlab/, and /v1/benchlab/* for checked-in experiment evidence.
- Static/docs-only helpers: docs/review-pack.svg, site/, and narrative docs are reviewer aids, not the benchmark or runtime source of truth.
- StagePilot reviewer API: GET /v1/runtime-brief, GET /v1/review-pack, GET /v1/schema/plan-report
- Provider benchmark scorecard: GET /v1/provider-benchmark-scorecard
- Runtime perf evidence pack: GET /v1/perf-evidence-pack
- Trace observability pack: GET /v1/trace-observability-pack
- Regression gate pack: GET /v1/regression-gate-pack
- Failure review surface: GET /v1/failure-taxonomy
- StagePilot developer workflow pack: GET /v1/developer-ops-pack
- StagePilot workflow history: GET /v1/workflow-runs, GET /v1/workflow-runs/:requestId
- StagePilot workflow replay surface: GET /v1/workflow-run-replay
- Local reviewer summary: pnpm review:proof
- BenchLab reviewer API: GET /v1/benchlab/runtime-brief, GET /v1/benchlab/review-pack, GET /v1/benchlab/schema/job-report
- Checked-in 24-case benchmark proof: baseline 29.17% -> middleware 87.50% -> Ralph loop 100.00%
- Checked-in BenchLab claims: runtime compare, variant leaderboard, best artifacts, and failure forensics
- Latest no-key local validation: llama3.1:8b, llama3.2:latest, and qwen3.5:4b all moved from 7.83 to 8.33 with tuned RALPH variants on a 5 cases/category sweep
- Llama follow-up hunt: on llama3.2:latest, schema-lock stayed ahead while parallel-safe, coverage, strict, call-count, and compact all stayed flat in a 3 cases/category search; a wider 10 cases/category replay still kept schema-lock positive at 7.50 -> 7.75 (+0.25pp)
- GET /v1/runtime-brief -> confirm orchestration readiness and integration posture.
- GET /v1/perf-evidence-pack -> inspect checked-in k6 rehearsal, latency posture, and release guardrails before scale claims.
- GET /v1/trace-observability-pack -> inspect replayable traces and operator escalation posture before frontier-runtime claims.
- GET /v1/regression-gate-pack -> inspect explicit promotion logic, watch items, and release posture.
- GET /v1/provider-benchmark-scorecard -> inspect provider-family contract confidence, latency/cost posture, and strongest protocol surfaces.
- GET /v1/failure-taxonomy -> inspect parser drift, retry exhaustion, delivery gaps, and observed runtime regressions in one place.
- GET /v1/developer-ops-pack -> inspect MR / pipeline / release lanes before demoing automation.
- GET /v1/workflow-runs -> verify recent developer workflow runs.
- GET /v1/workflow-run-replay -> inspect replay-ready proof routes and recent workflow timeline.
- GET /v1/review-pack -> inspect benchmark lift and parser/handoff boundary.
- GET /v1/schema/plan-report -> verify contract before trusting downstream routing output.
- GET /v1/benchlab/review-pack -> inspect checked-in runtime and artifact claims.
- docs/reviewer-proof-guide.md + docs/review-pack.svg + docs/benchmarks/stagepilot-latest.json + docs/benchmarks/stagepilot-runtime-load-latest.json + docs/benchmarks/stagepilot-trace-observability-latest.json + docs/benchmarks/stagepilot-regression-gate-latest.json -> read the strongest proof assets first.
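The StagePilot portion of that walk can be captured as data. The helper below is purely illustrative (it is not part of the repo) and assumes the StagePilot API is serving on its default local port; BenchLab routes live on a separate API and are omitted.

```typescript
// Illustrative only: walkUrls is not part of the repo; the endpoint paths
// come from this README's reviewer walkthrough.
const reviewerWalk = [
  "/v1/runtime-brief",
  "/v1/perf-evidence-pack",
  "/v1/trace-observability-pack",
  "/v1/regression-gate-pack",
  "/v1/provider-benchmark-scorecard",
  "/v1/failure-taxonomy",
  "/v1/developer-ops-pack",
  "/v1/workflow-runs",
  "/v1/workflow-run-replay",
  "/v1/review-pack",
  "/v1/schema/plan-report",
] as const;

// Resolve each path against a base URL so a reviewer can curl or fetch
// them in the documented order.
function walkUrls(baseUrl: string): string[] {
  return reviewerWalk.map((path) => new URL(path, baseUrl).toString());
}

walkUrls("http://127.0.0.1:8080");
```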
- Reviewer guide: docs/reviewer-proof-guide.md
- Architecture: docs/solution-architecture.md
- Overview: docs/executive-one-pager.md
- Discovery notes: docs/discovery-guide.md
- Local no-key sweep: docs/benchlab/LOCAL_OLLAMA_SWEEP_20260311.md
- Earlier fork / baseline reference: https://github.com/KIM3310/ai-sdk-tool-call-middleware
- Upstream source lineage: https://github.com/minpeter/ai-sdk-tool-call-middleware
This repo keeps attribution explicit while treating stage-pilot as the canonical working surface for new development.
This repo focuses on tool-calling reliability, benchmarked success-rate improvement, and operational handoff readiness.
If you need API integration, the provided endpoints (/v1/plan, /v1/benchmark, /v1/insights, /v1/whatif, /v1/notify) are usable immediately, either locally or on Cloud Run.
Many models still output tool calls as loose text (<tool_call>...</tool_call>, relaxed JSON, trailing tokens, mixed formatting). This project hardens that path so tool execution remains stable instead of silently failing.
For the parser layer, this means:
- parsing malformed tool-call text safely
- coercing payloads to schema-compatible shapes
- streaming tool inputs without depending on native provider tooling
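As an illustration of that posture, here is a minimal sketch of tolerant extraction plus relaxed-JSON repair. The names (`extractToolCall`, `ToolCall`) and the repair rule are assumptions for this README, not the package's actual API.

```typescript
// Hypothetical sketch only: not the @ai-sdk-tool/parser implementation.
type ToolCall = { name: string; args: Record<string, unknown> };

function extractToolCall(raw: string): ToolCall | null {
  // Tolerate prose and trailing tokens around the <tool_call> tag.
  const match = raw.match(/<tool_call>([\s\S]*?)<\/tool_call>/);
  if (!match) return null;
  // Relaxed-JSON repair: strip trailing commas before } or ].
  const body = match[1].trim().replace(/,\s*([}\]])/g, "$1");
  try {
    const parsed = JSON.parse(body) as { name?: unknown; arguments?: unknown };
    if (typeof parsed.name !== "string") return null;
    const args =
      parsed.arguments && typeof parsed.arguments === "object"
        ? (parsed.arguments as Record<string, unknown>)
        : {};
    return { name: parsed.name, args };
  } catch {
    return null; // unrecoverable payload: surface a parse failure, don't crash
  }
}
```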
For StagePilot, this directly improves operation routing reliability by:
- applying a bounded Ralph-loop retry when the first call is invalid.
For BenchLab, it creates a repeatable environment to test prompt-mode tool-calling strategies and inspect error buckets instead of relying on anecdotal wins.
stage-pilot/
src/
api/
bin/
stagepilot/
tests/
docs/
benchmarks/
benchlab/
experiments/
scripts/
Source: docs/benchmarks/stagepilot-latest.json
Generated at: 2026-03-02T11:15:13.733Z
Cases: 24 (BENCHMARK_SEED=20260228, BENCHMARK_LOOP_ATTEMPTS=2)
| Strategy | Parse/Plan Success | Success Rate | Avg Latency (ms) | P95 Latency (ms) | Avg Attempts |
|---|---|---|---|---|---|
| baseline | 7 / 24 | 29.17% | 0.02 | 0.03 | 1.00 |
| middleware | 21 / 24 | 87.50% | 0.14 | 0.45 | 1.00 |
| middleware+ralph-loop | 24 / 24 | 100.00% | 0.04 | 0.08 | 1.13 |
Improvement deltas:

- Middleware vs Baseline: +58.33pp
- Ralph Loop vs Middleware: +12.50pp
- Ralph Loop vs Baseline: +70.83pp
Ralph-loop point (what changed):

- middleware is already robust on malformed payloads.
- middleware+ralph-loop adds one bounded retry pass (default max 2 attempts), letting the second corrected output recover remaining failures.
- In the current checked-in 24-case benchmark, that closes the gap from 87.50% to 100.00%.
Latency note: these numbers come from deterministic in-process benchmark harness execution (parser + planning), not network LLM round-trip latency.
Review-pack surfaces now expose this benchmark delta directly through /v1/review-pack so reviewers can inspect the lift without parsing the raw JSON file first.
- docs/reviewer-proof-guide.md
- docs/review-pack.svg
- docs/DEVELOPER_OPS_PACK.md
- docs/benchmarks/stagepilot-latest.json
- docs/STAGEPILOT.md
- docs/benchlab/TOOL_CALLING_GAINS.md
- docs/benchlab/FAILURE_TAXONOMY.md
pnpm install
pnpm review:proof
pnpm demo:stagepilot
pnpm api:stagepilot
# open http://127.0.0.1:8080/demo
pnpm bench:stagepilot

Optional benchmark knobs:

BENCHMARK_CASES=24 BENCHMARK_SEED=20260228 BENCHMARK_LOOP_ATTEMPTS=2 pnpm bench:stagepilot

Run the local BenchLab operator API:

pnpm api:benchlab
# open http://127.0.0.1:8090/benchlab

BenchLab surfaces:

- GET /benchlab
- GET /health
- GET /v1/benchlab/runtime-brief
- GET /v1/benchlab/review-pack
- GET /v1/benchlab/schema/job-report
- GET /v1/benchlab/configs
- GET /v1/benchlab/jobs
- GET /v1/benchlab/jobs/:id
- GET /v1/benchlab/jobs/:id/logs
- POST /v1/benchlab/jobs/:id/cancel
BenchLab repo assets:
- research notes under docs/benchlab/
- runnable prompt-mode experiments under experiments/
- local operator scripts under scripts/
- EligibilityAgent: triage eligibility and constraints
- SafetyAgent: risk and urgency assessment
- PlannerAgent: route/action plan generation
- OutreachAgent: execution-ready outreach actions
- JudgeAgent: final consistency gate
- StagePilotEngine: orchestration runtime
- simulateStagePilotTwin: what-if simulation for staffing/demand/contact-rate deltas
- GeminiGateway (optional): narrative summarization layer
Core files:
- src/stagepilot/types.ts
- src/stagepilot/ontology.ts
- src/stagepilot/agents.ts
- src/stagepilot/orchestrator.ts
- src/stagepilot/twin.ts
- src/stagepilot/benchmark.ts
Run API:

pnpm api:stagepilot

Endpoints:

- GET /demo
- GET /health
- GET /v1/meta
- GET /v1/runtime-brief
- GET /v1/review-pack
- GET /v1/schema/plan-report
- POST /v1/plan
- POST /v1/benchmark
- POST /v1/insights
- POST /v1/whatif
- POST /v1/notify
- POST /v1/openclaw/inbox
See full behavior and payload examples in docs/STAGEPILOT.md.
- /v1/runtime-brief, /v1/review-pack, and /v1/schema/plan-report expose StagePilot readiness, benchmark proof, parser/orchestration posture, and report contract.
- /v1/benchlab/runtime-brief, /v1/benchlab/review-pack, and /v1/benchlab/schema/job-report expose BenchLab evidence counts, checked-in claim proof, dominant failure buckets, and job-report expectations.
- /demo and /benchlab now render review-pack surfaces directly in the UI so reviewers can validate posture without reading code first.
BenchLab API entrypoint:

pnpm api:benchlab

Deploy:

pnpm deploy:stagepilot

Post-deploy smoke test:

STAGEPILOT_BASE_URL="https://<your-cloud-run-url>" pnpm smoke:stagepilot

Runtime notes:

- CPU-only enforced: USE_GPU=0
- Secret Manager key mapping expected for GEMINI_API_KEY
- safety timeouts supported: GEMINI_HTTP_TIMEOUT_MS, STAGEPILOT_REQUEST_BODY_TIMEOUT_MS, OPENCLAW_WEBHOOK_TIMEOUT_MS, OPENCLAW_CLI_TIMEOUT_MS
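The safety-timeout posture can be sketched as a small environment reader. The helper name and the fallback value below are illustrative assumptions, not the repo's actual defaults.

```typescript
// Illustrative sketch: read a millisecond timeout from the environment,
// falling back when the variable is unset or invalid.
function timeoutMs(name: string, fallbackMs: number): number {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number(raw);
  // Reject non-numeric or non-positive values instead of hanging on bad config.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}

// Hypothetical usage; 10_000 ms is an assumed fallback, not the repo default.
const geminiTimeoutMs = timeoutMs("GEMINI_HTTP_TIMEOUT_MS", 10_000);
```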
Install package only:

pnpm add @ai-sdk-tool/parser

Quick example:
import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
import { morphXmlToolMiddleware } from "@ai-sdk-tool/parser";
import { stepCountIs, streamText, wrapLanguageModel } from "ai";
import { z } from "zod";
const model = createOpenAICompatible({
name: "openrouter",
apiKey: process.env.OPENROUTER_API_KEY,
baseURL: "https://openrouter.ai/api/v1",
})("arcee-ai/trinity-large-preview:free");
const result = streamText({
model: wrapLanguageModel({
model,
middleware: morphXmlToolMiddleware,
}),
stopWhen: stepCountIs(4),
prompt: "What is the weather in Seoul?",
tools: {
get_weather: {
description: "Get weather by city name",
inputSchema: z.object({ city: z.string() }),
execute: async ({ city }) => ({ city, condition: "sunny", celsius: 23 }),
},
},
});
for await (const part of result.fullStream) {
// text-delta / tool-input-start / tool-input-delta / tool-input-end / tool-call / tool-result
}

Preconfigured middleware exports:
| Middleware | Best for |
|---|---|
| hermesToolMiddleware | JSON-style tool payloads |
| morphXmlToolMiddleware | XML-style payloads + schema-aware coercion |
| yamlXmlToolMiddleware | XML tool tags + YAML bodies |
| qwen3CoderToolMiddleware | Qwen/UI-TARS style <tool_call> markup |
Fact-checked from this repo CHANGELOG.md and npm metadata (as of 2026-02-18).
| @ai-sdk-tool/parser major | AI SDK major | Status |
|---|---|---|
| v1.x | v4.x | Legacy |
| v2.x | v5.x | Legacy |
| v3.x | v6.x | Legacy |
| v4.x | v6.x | Active (latest) |
pnpm fmt:biome
pnpm check
pnpm test
pnpm build

One-command verification:

pnpm verify

If pnpm is not available:
corepack enable
corepack prepare pnpm@9.14.4 --activate

- Reviewer guide: docs/reviewer-proof-guide.md
- StagePilot guide: docs/STAGEPILOT.md
- Latest benchmark artifact: docs/benchmarks/stagepilot-latest.json
- BenchLab gains: docs/benchlab/TOOL_CALLING_GAINS.md
- BenchLab failure taxonomy: docs/benchlab/FAILURE_TAXONOMY.md
- Parser core examples: examples/parser-core/README.md
- RXML examples: examples/rxml-core/README.md
- Prompt-mode experiments: experiments/*
Apache-2.0
pnpm install
pnpm verify

Expanded form:
pnpm check
pnpm typecheck
pnpm test
pnpm build

- Keep runtime artifacts out of commits (.codex_runs/, cache folders, temporary venvs).
- Prefer running the verification commands above before opening a PR.