Turn failed agent traces into replayable regression cases.
ReplayKit is a local-first CLI tool that takes a failed agent trace and converts it into a versioned regression case you can replay, assert against, and commit to your repo.
Teams debug failed agent runs, inspect traces, write postmortems — and then repeat the same failure a week later because nobody converts the investigation into a reusable test.
ReplayKit closes that gap: debug once, replay forever.
Failed trace → Replay case → Run against candidate → Did the failure reproduce?
ReplayKit has four first-class concepts:
- TraceRun — a normalized record of what happened in the original agent run
- ReplayCase — a versioned artifact with scenario inputs, failure signatures, and assertions
- ReplayResult — the outcome of running a candidate against a replay case
- FailureSignature — a deterministic pattern describing what went wrong
ReplayKit is not a tracing platform. It is not an eval dashboard. It promotes painful failures into replayable tests.
```bash
pip install -e ".[dev]"

# Scaffold a case from a failed trace
replaykit scaffold examples/traces/search-loop.json --out cases/search-loop

# Lint the case
replaykit lint cases/search-loop/case.yaml

# Replay against the bad runner (reproduces the failure)
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/reproduce_failure.py"

# Replay against the fixed runner (failure no longer reproduces)
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/fixed_runner.py"
```

A failed trace looks like this:

```json
{
  "run_id": "run-billing-2024-0312",
  "status": "error",
  "events": [
    {"type": "input_message", "content": "Why was I charged $49.99?"},
    {"type": "tool_call", "tool_name": "lookup_billing", "tool_args": {"user_id": "usr_8821"}},
    {"type": "tool_result", "output": "Error: billing service timeout"},
    ...5 retries, same timeout...
    {"type": "error", "error": "Max tool call limit exceeded. Agent terminated."}
  ]
}
```

Scaffold it into a case:

```bash
replaykit scaffold examples/traces/search-loop.json --out cases/search-loop
```

This creates:
```
cases/search-loop/
  case.yaml           # assertions, signatures, scenario
  source-trace.json   # original trace
  fixtures/           # frozen tool outputs
  notes.md            # timeline summary
```
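The loop signature above is exactly the kind of pattern a scaffolded case captures. As an illustrative (not built-in) check, here is how repeated tool calls in a trace like `source-trace.json` could be detected:

```python
from collections import Counter


def detect_tool_loop(trace: dict, threshold: int = 5) -> list[str]:
    """Return tool names called at least `threshold` times (a loop heuristic)."""
    calls = Counter(
        e["tool_name"]
        for e in trace.get("events", [])
        if e.get("type") == "tool_call"
    )
    return [name for name, count in calls.items() if count >= threshold]


# A trimmed-down version of the billing trace: five identical tool calls.
trace = {
    "run_id": "run-billing-2024-0312",
    "status": "error",
    "events": [{"type": "tool_call", "tool_name": "lookup_billing",
                "tool_args": {"user_id": "usr_8821"}}] * 5,
}
print(detect_tool_loop(trace))  # → ['lookup_billing']
```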
```bash
# Does the failure still happen?
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/reproduce_failure.py"
```

Output:

```
┌─────────────────────────── ReplayKit Result ───────────────────────────┐
│ Run ended with error; Tool 'lookup_billing' called 5 times; No final   │
│ output                                                                 │
│ Result: FAILURE REPRODUCED                                             │
└────────────────────────────────────────────────────────────────────────┘
Observed failure reproduced: Yes
Desired assertions: 0/3 passed
Semantic checks needing review: 1
```
```bash
# Try the fixed version
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/fixed_runner.py"
```

Output:

```
Observed failure reproduced: No
Desired assertions: 3/3 passed
Semantic checks needing review: 1
```
Observed failure signatures describe what went wrong:

- `status_is: error`
- `max_tool_calls: 5` (tool called 5 times, loop detected)
- `final_output_exists: false`
Desired behavior assertions describe what should happen:

- `status_is: success`
- `max_tool_calls: 3`
- `final_output_exists: true`
Both use the same assertion engine. Observed signatures tell you if the original failure still reproduces. Desired assertions tell you if the fix works.
Semantic checks (e.g., "response addresses the user's question correctly") are preserved but marked needs_human — no LLM grading in v0.1.
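One way to picture the shared engine (a sketch with assumed check semantics, not the real implementation): the same deterministic predicates run against a runner result, and any key the engine does not recognize is passed through as `needs_human`:

```python
def evaluate(result: dict, checks: dict) -> dict:
    """Evaluate deterministic checks against a runner result dict.

    Unknown check keys are treated as semantic checks and marked
    'needs_human' (no LLM grading in v0.1).
    """
    outcomes = {}
    for key, expected in checks.items():
        if key == "status_is":
            outcomes[key] = result["status"] == expected
        elif key == "max_tool_calls":
            outcomes[key] = len(result.get("tool_calls", [])) <= expected
        elif key == "final_output_exists":
            outcomes[key] = bool(result.get("final_output")) == expected
        else:
            outcomes[key] = "needs_human"
    return outcomes


# The bad runner's result: errored out after five tool calls, no output.
bad_run = {"status": "error", "tool_calls": [{}] * 5, "final_output": None}
desired = {"status_is": "success", "max_tool_calls": 3, "final_output_exists": True}
print(evaluate(bad_run, desired))  # all False: 0/3 passed, as in the report above
```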
Runners are arbitrary subprocess commands. ReplayKit sets these environment variables for them:
| Variable | Description |
|---|---|
| `REPLAYKIT_CASE_PATH` | Path to `case.yaml` |
| `REPLAYKIT_OUTPUT_PATH` | Where to write result JSON |
| `REPLAYKIT_FIXTURES_DIR` | Path to fixtures directory |
| `REPLAYKIT_SOURCE_TRACE_PATH` | Path to source trace |
The runner writes a JSON file to `REPLAYKIT_OUTPUT_PATH`:

```json
{
  "run_id": "run-001",
  "status": "success",
  "final_output": "Here is the answer.",
  "tool_calls": [{"tool_name": "lookup", "tool_args": {}, "tool_output": "..."}],
  "errors": []
}
```

Repo layout:

```
cases/                          # committed — your regression suite
  search-loop/
    case.yaml
    source-trace.json
    fixtures/
    notes.md

.replaykit/runs/                # generated — replay evidence
  20240312-143200-search-loop/
    result.json
    summary.md
    runner-output.json
    stdout.log
    stderr.log
```
Out of scope for v0.1:

- Vendor-specific trace importers (LangSmith, OpenAI, Anthropic)
- Real LLM-based semantic grading
- Dashboard or hosted service
- GitHub PR integration
- Distributed execution
- Plugin system
These have clean extension points in the code; v0.1 proves the core loop. Planned extension points:
- Trace adapters — import from LangSmith, OpenAI, Anthropic traces
- Semantic graders — plug in LLM judges for semantic checks
- Export — emit cases as Promptfoo, OpenAI Evals, or pytest fixtures
- CI integration — run replay suites in GitHub Actions
- Redaction — sanitize PII from traces before committing cases
Development setup:

```bash
pip install -e ".[dev]"
pytest
```

License: MIT