Turn failed agent traces into replayable regression cases.
ReplayKit is a local-first CLI tool that takes a failed agent trace and converts it into a versioned regression case you can replay, assert against, and commit to your repo.
Teams debug failed agent runs, inspect traces, write postmortems — and then repeat the same failure a week later because nobody converts the investigation into a reusable test.
ReplayKit closes that gap: debug once, replay forever.
Failed trace → Replay case → Run against candidate → Did the failure reproduce?
ReplayKit has four first-class concepts:
- TraceRun — a normalized record of what happened in the original agent run
- ReplayCase — a versioned artifact with scenario inputs, failure signatures, and assertions
- ReplayResult — the outcome of running a candidate against a replay case
- FailureSignature — a deterministic pattern describing what went wrong
ReplayKit is not a tracing platform. It is not an eval dashboard. It promotes painful failures into replayable tests.
```bash
pip install -e ".[dev]"

# Scaffold a case from a failed trace
replaykit scaffold examples/traces/search-loop.json --out cases/search-loop

# Lint the case
replaykit lint cases/search-loop/case.yaml

# Replay against the bad runner (reproduces the failure)
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/reproduce_failure.py"

# Replay against the fixed runner (failure no longer reproduces)
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/fixed_runner.py"
```

A failed trace looks like this:

```json
{
  "run_id": "run-billing-2024-0312",
  "status": "error",
  "events": [
    {"type": "input_message", "content": "Why was I charged $49.99?"},
    {"type": "tool_call", "tool_name": "lookup_billing", "tool_args": {"user_id": "usr_8821"}},
    {"type": "tool_result", "output": "Error: billing service timeout"},
    ...5 retries, same timeout...
    {"type": "error", "error": "Max tool call limit exceeded. Agent terminated."}
  ]
}
```

Scaffold it into a case:

```bash
replaykit scaffold examples/traces/search-loop.json --out cases/search-loop
```

This creates:
```
cases/search-loop/
  case.yaml           # assertions, signatures, scenario
  source-trace.json   # original trace
  fixtures/           # frozen tool outputs
  notes.md            # timeline summary
```
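The loop signature above is exactly the kind of pattern a scaffolded case captures. As an illustrative (not built-in) check, here is how repeated tool calls in a trace like `source-trace.json` could be detected:

```python
from collections import Counter


def detect_tool_loop(trace: dict, threshold: int = 5) -> list[str]:
    """Return tool names called at least `threshold` times (a loop heuristic)."""
    calls = Counter(
        e["tool_name"]
        for e in trace.get("events", [])
        if e.get("type") == "tool_call"
    )
    return [name for name, count in calls.items() if count >= threshold]


# A trimmed-down version of the billing trace: five identical tool calls.
trace = {
    "run_id": "run-billing-2024-0312",
    "status": "error",
    "events": [{"type": "tool_call", "tool_name": "lookup_billing",
                "tool_args": {"user_id": "usr_8821"}}] * 5,
}
print(detect_tool_loop(trace))  # → ['lookup_billing']
```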
```bash
# Does the failure still happen?
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/reproduce_failure.py"
```

Output:

```
┌─────────────────────────── ReplayKit Result ───────────────────────────┐
│ Run ended with error; Tool 'lookup_billing' called 5 times; No final   │
│ output                                                                 │
│ Result: FAILURE REPRODUCED                                             │
└────────────────────────────────────────────────────────────────────────┘
Observed failure reproduced: Yes
Desired assertions: 0/3 passed
Semantic checks needing review: 1
```
```bash
# Try the fixed version
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/fixed_runner.py"
```

Output:

```
Observed failure reproduced: No
Desired assertions: 3/3 passed
Semantic checks needing review: 1
```
Observed failure signatures describe what went wrong:

- `status_is: error`
- `max_tool_calls: 5` (tool called 5 times, loop detected)
- `final_output_exists: false`
Desired behavior assertions describe what should happen:

- `status_is: success`
- `max_tool_calls: 3`
- `final_output_exists: true`
Both use the same assertion engine. Observed signatures tell you if the original failure still reproduces. Desired assertions tell you if the fix works.
Semantic checks (e.g., "response addresses the user's question correctly") are preserved but marked needs_human — no LLM grading in v0.1.
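One way to picture the shared engine (a sketch with assumed check semantics, not the real implementation): the same deterministic predicates run against a runner result, and any key the engine does not recognize is passed through as `needs_human`:

```python
def evaluate(result: dict, checks: dict) -> dict:
    """Evaluate deterministic checks against a runner result dict.

    Unknown check keys are treated as semantic checks and marked
    'needs_human' (no LLM grading in v0.1).
    """
    outcomes = {}
    for key, expected in checks.items():
        if key == "status_is":
            outcomes[key] = result["status"] == expected
        elif key == "max_tool_calls":
            outcomes[key] = len(result.get("tool_calls", [])) <= expected
        elif key == "final_output_exists":
            outcomes[key] = bool(result.get("final_output")) == expected
        else:
            outcomes[key] = "needs_human"
    return outcomes


# The bad runner's result: errored out after five tool calls, no output.
bad_run = {"status": "error", "tool_calls": [{}] * 5, "final_output": None}
desired = {"status_is": "success", "max_tool_calls": 3, "final_output_exists": True}
print(evaluate(bad_run, desired))  # all False: 0/3 passed, as in the report above
```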
Runners are arbitrary subprocess commands. ReplayKit sets these environment variables for them:
| Variable | Description |
|---|---|
| `REPLAYKIT_CASE_PATH` | Path to `case.yaml` |
| `REPLAYKIT_OUTPUT_PATH` | Where to write result JSON |
| `REPLAYKIT_FIXTURES_DIR` | Path to fixtures directory |
| `REPLAYKIT_SOURCE_TRACE_PATH` | Path to source trace |
The runner writes a JSON file to `REPLAYKIT_OUTPUT_PATH`:

```json
{
  "run_id": "run-001",
  "status": "success",
  "final_output": "Here is the answer.",
  "tool_calls": [{"tool_name": "lookup", "tool_args": {}, "tool_output": "..."}],
  "errors": []
}
```

Repo layout:

```
cases/                          # committed — your regression suite
  search-loop/
    case.yaml
    source-trace.json
    fixtures/
    notes.md

.replaykit/runs/                # generated — replay evidence
  20240312-143200-search-loop/
    result.json
    summary.md
    runner-output.json
    stdout.log
    stderr.log
```
Out of scope for v0.1:

- Vendor-specific trace importers (LangSmith, OpenAI, Anthropic)
- Real LLM-based semantic grading
- Dashboard or hosted service
- GitHub PR integration
- Distributed execution
- Plugin system
These have clean extension points in the code; v0.1 proves the core loop. Planned extension points:
- Trace adapters — import from LangSmith, OpenAI, Anthropic traces
- Semantic graders — plug in LLM judges for semantic checks
- Export — emit cases as Promptfoo, OpenAI Evals, or pytest fixtures
- CI integration — run replay suites in GitHub Actions
- Redaction — sanitize PII from traces before committing cases
Development setup:

```bash
pip install -e ".[dev]"
pytest
```

License: MIT