
Add deterministic replay gate for agent tests #500

Open

ZackMitchell910 wants to merge 1 commit into truffle-ai:main from ZackMitchell910:runledger/replay-gate

Conversation


@ZackMitchell910 ZackMitchell910 commented Dec 19, 2025

Summary

  • add a replay-only RunLedger eval suite (suite/case/schema/cassette + stub agent)
  • add a baseline file for regression gating
  • add a GitHub Actions workflow using runledger/Runledger@v0.1
  • add a small README note + ignore runledger_out/

How to run locally

runledger run evals/runledger --mode replay --baseline baselines/runledger-demo.json

Notes

  • no external calls; replay-only cassette
  • feel free to remove the suite/workflow if it is not desired
  • GitHub Actions note: workflows from first-time contributors/forks may require a maintainer to click “Approve and run” before checks will execute.


vercel bot commented Dec 19, 2025

@ZackMitchell910 is attempting to deploy a commit to the Shaunak's projects Team on Vercel.

A member of the Team first needs to authorize it.


coderabbitai bot commented Dec 19, 2025

📝 Walkthrough

The changes introduce a complete RunLedger CI gate infrastructure. A GitHub Actions workflow executes deterministic evaluations in replay mode against a baseline snapshot, detecting regressions in tool schema, behavior, and budget constraints. Supporting files include an agent implementation, evaluation suite configuration, test cases, cassettes, and baseline data.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| **CI Infrastructure**: `.github/workflows/runledger.yml`, `.gitignore` | Added GitHub Actions workflow (`runledger-gate`) to run deterministic evals in replay mode against the baseline; added `runledger_out/` to `.gitignore`. |
| **Documentation**: `README.md` | Added a "RunLedger CI gate" section describing the deterministic CI gate for tool-using agents, with usage examples. |
| **Baseline Snapshot**: `baselines/runledger-demo.json` | Added a complete RunLedger run-ledger snapshot containing aggregates, case results, metrics, policy configuration, and run metadata for baseline comparison. |
| **Agent Implementation**: `evals/runledger/agent/agent.py` | Added a simple line-oriented JSON agent that processes `task_start` and `tool_result` messages, calls the `search_docs` tool, and emits final output. |
| **Evaluation Suite & Schema**: `evals/runledger/suite.yaml`, `evals/runledger/schema.json` | Added suite configuration defining the agent workflow, tool registration (`search_docs`), budget constraints, and a 100% pass-rate policy; added a JSON schema requiring `category` and `reply` fields. |
| **Test Case & Cassette**: `evals/runledger/cases/t1.yaml`, `evals/runledger/cassettes/t1.jsonl` | Added evaluation case (`t1`) for login-ticket triage and a corresponding cassette with a mocked `search_docs` tool response. |
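For reviewers new to the protocol, a stub agent of the kind summarized above might look like the following. This is a sketch assuming the message shapes described in the summary (`task_start`, `tool_result`, a `search_docs` tool, and a final output with `category` and `reply` fields); the field names `input`, `tool_call`, and `final`, and the output values, are illustrative, not the PR's actual `agent.py`:

```python
import json
import sys


def handle(msg):
    """Map one incoming protocol message to the agent's reply, or None."""
    if msg.get("type") == "task_start":
        # Ask the harness to run the search_docs tool on the task input.
        return {"type": "tool_call", "tool": "search_docs",
                "args": {"query": msg.get("input", "")}}
    if msg.get("type") == "tool_result":
        # Emit the final structured output the schema requires.
        return {"type": "final",
                "output": {"category": "login", "reply": "See the linked docs."}}
    return None


def main():
    # Line-oriented JSON: one message in per line, one message out per line.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        reply = handle(json.loads(line))
        if reply is not None:
            sys.stdout.write(json.dumps(reply) + "\n")
            sys.stdout.flush()  # flush so the replay harness sees output immediately


if __name__ == "__main__":
    main()
```

In replay mode the harness feeds this agent the recorded cassette lines on stdin, so the run is deterministic and never reaches a real API.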

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • Review agent.py for correct message handling and tool invocation logic
  • Verify suite.yaml correctly references all configuration files and paths
  • Validate baseline snapshot structure and integrity
  • Confirm schema.json matches assertions in test cases

Poem

🐰 A gate deterministic, so grand and so true,
Replaying each call on the things agents do,
With budgets and schemas, we catch every slip,
No regressions shall sneak through this ledger-tracked trip! 🏃‍♂️✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a deterministic replay gate for agent tests via RunLedger integration. |
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
✨ Finishing touches
  • 📝 Generate docstrings


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
.github/workflows/runledger.yml (1)

16-16: Consider pinning the RunLedger version.

Installing the latest RunLedger without version pinning may introduce non-deterministic CI behavior if breaking changes are released.

🔎 Optional enhancement
```diff
-          python -m pip install runledger
+          python -m pip install runledger==0.1.0
```

Replace 0.1.0 with the current stable version.

evals/runledger/agent/agent.py (1)

13-13: Consider adding error handling for malformed JSON.

While the replay mode should provide valid JSON, adding try-except around json.loads() would make the agent more robust for debugging scenarios.

🔎 Optional enhancement
```diff
 def main():
     for line in sys.stdin:
         line = line.strip()
         if not line:
             continue
-        msg = json.loads(line)
+        try:
+            msg = json.loads(line)
+        except json.JSONDecodeError as e:
+            sys.stderr.write(f"Invalid JSON: {e}\n")
+            continue
         if msg.get("type") == "task_start":
```
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63fe0a6 and ac6d4a4.

📒 Files selected for processing (9)
  • .github/workflows/runledger.yml (1 hunks)
  • .gitignore (1 hunks)
  • README.md (1 hunks)
  • baselines/runledger-demo.json (1 hunks)
  • evals/runledger/agent/agent.py (1 hunks)
  • evals/runledger/cases/t1.yaml (1 hunks)
  • evals/runledger/cassettes/t1.jsonl (1 hunks)
  • evals/runledger/schema.json (1 hunks)
  • evals/runledger/suite.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

Document agent architecture and design patterns in markdown files (e.g., AGENTS.md)

Files:

  • README.md
🧠 Learnings (2)
📚 Learning: 2025-12-11T14:06:07.500Z
Learnt from: CR
Repo: truffle-ai/dexto PR: 0
File: packages/cli/src/cli/ink-cli/GEMINI.md:0-0
Timestamp: 2025-12-11T14:06:07.500Z
Learning: Refer to AGENTS.md for detailed agent documentation and guidelines

Applied to files:

  • evals/runledger/agent/agent.py
📚 Learning: 2025-11-24T19:37:07.870Z
Learnt from: CR
Repo: truffle-ai/dexto PR: 0
File: GEMINI.md:0-0
Timestamp: 2025-11-24T19:37:07.870Z
Learning: Applies to AGENTS.md : Review AGENTS.md for agent-related development guidelines

Applied to files:

  • evals/runledger/agent/agent.py
🔇 Additional comments (9)
.gitignore (1)

83-83: LGTM!

Correctly ignores the runledger_out/ directory to prevent committing generated evaluation artifacts.

evals/runledger/schema.json (1)

1-15: LGTM!

The JSON schema is well-defined and aligns with the agent output structure. It enforces the required category and reply fields for validation during replay.

README.md (1)

642-651: LGTM!

Clear and concise documentation for the RunLedger CI gate. The command example is correct and the explanation effectively communicates the purpose of deterministic replay-based regression detection.

evals/runledger/cases/t1.yaml (1)

1-10: LGTM!

The test case configuration is well-structured. The input, cassette reference, and assertions are properly aligned with the schema and cassette data.

evals/runledger/cassettes/t1.jsonl (1)

1-1: LGTM!

The cassette data is correctly formatted for replay mode and contains appropriate mock search results that align with the test case input.

.github/workflows/runledger.yml (1)

1-24: LGTM!

The workflow is well-structured for a CI gate. The use of if: always() for artifact uploads ensures debugging information is available even when evaluations fail.

evals/runledger/suite.yaml (1)

1-18: LGTM!

The evaluation suite configuration is comprehensive and well-structured. The budgets are appropriate for a replay-mode demo, and the strict pass rate (100%) correctly enforces regression detection.

evals/runledger/agent/agent.py (1)

1-22: LGTM!

The stub agent correctly implements the line-oriented JSON protocol for replay mode. The send() function appropriately flushes output, and the message handling logic aligns with the cassette and schema requirements.

baselines/runledger-demo.json (1)

1-112: Baseline structure is correct and internally consistent.

The JSON is valid and all aggregate statistics correctly roll up from the single test case: cases_total (1), pass_rate (1.0), and tool_calls aggregates all match their constituent values. The cassette SHA256 is present for integrity verification, and the policy threshold (min_pass_rate: 1.0) is satisfied.

However, please verify the following points:

  1. Null metrics in replay mode (Lines 8–34): The cost_usd, steps, tokens_in, and tokens_out metrics are all null while tool_calls, tool_errors, and wall_ms contain actual values. Confirm this is intentional for replay-mode baselines where actual model/tool execution doesn't occur.

  2. git_sha null in CI context (Line 96): The baseline has git_sha: null, which may complicate reproducibility tracing. Verify whether the CI workflow (.github/workflows/runledger.yml) populates this field before storing the baseline.

  3. suite_config_hash null (Line 109): Confirm whether this field should be computed from the suite config or left null by design.
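The roll-up check described in this comment is easy to script. A minimal sketch, assuming field names that follow the comment above (`cases`, `aggregates`, `cases_total`, `pass_rate`, a per-case `passed` flag); real RunLedger baselines may use different keys:

```python
def check_baseline(baseline):
    """Verify that the aggregate stats roll up from the per-case results.

    A mismatch means the baseline was edited by hand or produced by a
    different tool version, and the replay gate cannot be trusted.
    """
    cases = baseline["cases"]
    agg = baseline["aggregates"]
    if agg["cases_total"] != len(cases):
        return False
    passed = sum(1 for c in cases if c.get("passed"))
    return agg["pass_rate"] == passed / len(cases)
```

Running such a check in CI alongside the replay itself would catch a stale or hand-edited baseline before it silently weakens the gate.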

@rahulkarajgikar
Collaborator

what is this?

@ZackMitchell910
Author

This adds an optional CI check that records one known-good run and then replays it in CI, so it's deterministic and doesn't call external APIs. It helps catch regressions like tool order/args changing, output shape drifting, or unexpected tool errors before merge. For Dexto, you could wire it to a small agent flow (e.g., a simple task with one tool call) and get a quick gate that protects that behavior. It's fully removable if you don't want it.
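The record/replay mechanism described here can be sketched in a few lines of Python. This is an illustration of the general technique, not RunLedger's actual implementation; the JSONL cassette fields (`tool`, `args`, `result`) mirror the cassette file in this PR, but the function and error names are made up:

```python
import json


class ReplayError(Exception):
    """Raised when a live tool call diverges from the recorded cassette."""


def replay_tool_calls(cassette_path):
    """Load recorded tool interactions and serve them back in order.

    Any call whose name or arguments differ from the recording is treated
    as a regression and fails the gate instead of hitting a real API.
    """
    with open(cassette_path) as f:
        recorded = [json.loads(line) for line in f if line.strip()]
    index = 0

    def call_tool(name, args):
        nonlocal index
        if index >= len(recorded):
            raise ReplayError(f"unexpected extra tool call: {name}")
        entry = recorded[index]
        index += 1
        if entry["tool"] != name or entry["args"] != args:
            raise ReplayError(
                f"tool call drifted: expected {entry['tool']}({entry['args']}), "
                f"got {name}({args})")
        return entry["result"]

    return call_tool
```

Because every tool result comes from the cassette, the same agent code produces byte-identical output on every CI run, which is what makes baseline comparison meaningful.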
