JohnODowdAI/replaykit
ReplayKit

Turn failed agent traces into replayable regression cases.

ReplayKit is a local-first CLI tool that takes a failed agent trace and converts it into a versioned regression case you can replay, assert against, and commit to your repo.

Why this exists

Teams debug failed agent runs, inspect traces, write postmortems — and then repeat the same failure a week later because nobody converts the investigation into a reusable test.

ReplayKit closes that gap: debug once, replay forever.

Mental model

Failed trace → Replay case → Run against candidate → Did the failure reproduce?

ReplayKit has four first-class concepts:

  • TraceRun — a normalized record of what happened in the original agent run
  • ReplayCase — a versioned artifact with scenario inputs, failure signatures, and assertions
  • ReplayResult — the outcome of running a candidate against a replay case
  • FailureSignature — a deterministic pattern describing what went wrong

ReplayKit is not a tracing platform. It is not an eval dashboard. It promotes painful failures into replayable tests.
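
The four concepts above can be sketched as plain data types. This is an illustrative sketch only; the field names are assumptions for exposition, not ReplayKit's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative sketch -- field names are assumptions, not ReplayKit's real schema.

@dataclass
class FailureSignature:
    """A deterministic pattern describing what went wrong."""
    check: str        # e.g. "status_is", "max_tool_calls"
    expected: object  # e.g. "error", 5

@dataclass
class TraceRun:
    """Normalized record of what happened in the original agent run."""
    run_id: str
    status: str
    events: list = field(default_factory=list)

@dataclass
class ReplayCase:
    """Versioned artifact: scenario inputs, failure signatures, assertions."""
    scenario: dict
    observed_failure: list  # FailureSignature items for the original failure
    desired_behavior: list  # assertions a fix must satisfy

@dataclass
class ReplayResult:
    """Outcome of running a candidate against a replay case."""
    failure_reproduced: bool
    assertions_passed: int
    assertions_total: int
```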

Quick start

pip install -e ".[dev]"

# Scaffold a case from a failed trace
replaykit scaffold examples/traces/search-loop.json --out cases/search-loop

# Lint the case
replaykit lint cases/search-loop/case.yaml

# Replay against the bad runner (reproduces the failure)
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/reproduce_failure.py"

# Replay against the fixed runner (failure no longer reproduces)
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/fixed_runner.py"

Example workflow

1. You have a failed trace

{
  "run_id": "run-billing-2024-0312",
  "status": "error",
  "events": [
    {"type": "input_message", "content": "Why was I charged $49.99?"},
    {"type": "tool_call", "tool_name": "lookup_billing", "tool_args": {"user_id": "usr_8821"}},
    {"type": "tool_result", "output": "Error: billing service timeout"},
    ...5 retries, same timeout...
    {"type": "error", "error": "Max tool call limit exceeded. Agent terminated."}
  ]
}

2. Scaffold a replay case

replaykit scaffold examples/traces/search-loop.json --out cases/search-loop

This creates:

cases/search-loop/
  case.yaml          # assertions, signatures, scenario
  source-trace.json  # original trace
  fixtures/          # frozen tool outputs
  notes.md           # timeline summary
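
A scaffolded case.yaml might look roughly like the following. This is a hypothetical sketch inferred from the assertion names used elsewhere in this README; the real schema may differ:

```yaml
# Hypothetical sketch -- not the exact ReplayKit schema.
case_id: search-loop
scenario:
  input_message: "Why was I charged $49.99?"
observed_failure:            # signatures of the original failure
  - status_is: error
  - max_tool_calls: 5
  - final_output_exists: false
desired_behavior:            # assertions a fix must satisfy
  - status_is: success
  - max_tool_calls: 3
  - final_output_exists: true
semantic_checks:
  - description: "Response addresses the user's billing question"
    mode: needs_human        # no LLM grading in v0.1
```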

3. Replay and compare

# Does the failure still happen?
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/reproduce_failure.py"

Output:

┌─────────────────────────── ReplayKit Result ───────────────────────────┐
│ Run ended with error; Tool 'lookup_billing' called 5 times; No final   │
│ output                                                                 │
│ Result: FAILURE REPRODUCED                                             │
└────────────────────────────────────────────────────────────────────────┘

  Observed failure reproduced: Yes
  Desired assertions: 0/3 passed
  Semantic checks needing review: 1

# Try the fixed version
replaykit replay cases/search-loop/case.yaml \
  --runner-cmd "python examples/runners/fixed_runner.py"

Output:

  Observed failure reproduced: No
  Desired assertions: 3/3 passed
  Semantic checks needing review: 1

How ReplayKit models failure vs. desired behavior

Observed failure signatures describe what went wrong:

  • status_is: error
  • max_tool_calls: 5 (tool called 5 times — loop detected)
  • final_output_exists: false

Desired behavior assertions describe what should happen:

  • status_is: success
  • max_tool_calls: 3
  • final_output_exists: true

Both use the same assertion engine. Observed signatures tell you if the original failure still reproduces. Desired assertions tell you if the fix works.

Semantic checks (e.g., "response addresses the user's question correctly") are preserved but marked needs_human — no LLM grading in v0.1.
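
Because both signature sets run through the same engine, the core check can be sketched as a single function. This is illustrative only, not ReplayKit's actual code, and it assumes "max_tool_calls" means an upper bound on tool-call count:

```python
# Illustrative sketch of a shared assertion engine -- not ReplayKit's code.

def evaluate(assertion: dict, result: dict) -> bool:
    """Check one assertion (e.g. {"status_is": "error"}) against a runner result."""
    (name, expected), = assertion.items()
    if name == "status_is":
        return result["status"] == expected
    if name == "max_tool_calls":
        return len(result.get("tool_calls", [])) <= expected
    if name == "final_output_exists":
        return bool(result.get("final_output")) == expected
    raise ValueError(f"unknown assertion: {name}")

failed_run = {"status": "error", "tool_calls": [{}] * 5, "final_output": None}

# Observed-failure signatures match the failed run...
assert evaluate({"status_is": "error"}, failed_run)
# ...while desired-behavior assertions do not.
assert not evaluate({"status_is": "success"}, failed_run)
```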

Runner contract

Runners are subprocess commands. ReplayKit sets environment variables:

Variable                      Description
REPLAYKIT_CASE_PATH           Path to case.yaml
REPLAYKIT_OUTPUT_PATH         Where to write result JSON
REPLAYKIT_FIXTURES_DIR        Path to fixtures directory
REPLAYKIT_SOURCE_TRACE_PATH   Path to source trace

The runner writes a JSON file to REPLAYKIT_OUTPUT_PATH:

{
  "run_id": "run-001",
  "status": "success",
  "final_output": "Here is the answer.",
  "tool_calls": [{"tool_name": "lookup", "tool_args": {}, "tool_output": "..."}],
  "errors": []
}
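
A minimal runner honoring this contract could look like the sketch below. It assumes only the environment variables and output shape documented above; the stubbed result stands in for real agent work:

```python
import json
import os

# Minimal runner sketch: read the paths ReplayKit provides via environment
# variables, do the (stubbed) agent work, and write the expected result JSON.

def main() -> None:
    out_path = os.environ["REPLAYKIT_OUTPUT_PATH"]
    # Optional inputs a real runner might consume:
    case_path = os.environ.get("REPLAYKIT_CASE_PATH")
    fixtures_dir = os.environ.get("REPLAYKIT_FIXTURES_DIR")

    result = {
        "run_id": "run-001",
        "status": "success",
        "final_output": "Here is the answer.",
        "tool_calls": [
            {"tool_name": "lookup", "tool_args": {}, "tool_output": "..."}
        ],
        "errors": [],
    }
    with open(out_path, "w") as f:
        json.dump(result, f)
```

A real runner script would call main() under an `if __name__ == "__main__":` guard and be passed to replaykit via --runner-cmd.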

Case directory structure

cases/              # committed — your regression suite
  search-loop/
    case.yaml
    source-trace.json
    fixtures/
    notes.md

.replaykit/runs/    # generated — replay evidence
  20240312-143200-search-loop/
    result.json
    summary.md
    runner-output.json
    stdout.log
    stderr.log

What is intentionally not built yet

  • Vendor-specific trace importers (LangSmith, OpenAI, Anthropic)
  • Real LLM-based semantic grading
  • Dashboard or hosted service
  • GitHub PR integration
  • Distributed execution
  • Plugin system

These have clean extension points in the code. v0.1 proves the core loop.

Roadmap

  • Trace adapters — import from LangSmith, OpenAI, Anthropic traces
  • Semantic graders — plug in LLM judges for semantic checks
  • Export — emit cases as Promptfoo, OpenAI Evals, or pytest fixtures
  • CI integration — run replay suites in GitHub Actions
  • Redaction — sanitize PII from traces before committing cases

Development

pip install -e ".[dev]"
pytest

License

MIT
