
Add deterministic replay gate for agent tests #500

Open

ZackMitchell910 wants to merge 1 commit into truffle-ai:main from ZackMitchell910:runledger/replay-gate

Conversation


@ZackMitchell910 ZackMitchell910 commented Dec 19, 2025

Summary

  • add a replay-only RunLedger eval suite (suite/case/schema/cassette + stub agent)
  • add a baseline file for regression gating
  • add a GitHub Actions workflow using runledger/Runledger@v0.1
  • add a small README note + ignore runledger_out/

How to run locally

runledger run evals/runledger --mode replay --baseline baselines/runledger-demo.json

Notes

  • no external calls; replay-only cassette
  • feel free to remove the suite/workflow if it is not desired
  • GitHub Actions note: workflows from first-time contributors/forks may require a maintainer to click “Approve and run” before checks will execute.


vercel bot commented Dec 19, 2025

@ZackMitchell910 is attempting to deploy a commit to the Shaunak's projects Team on Vercel.

A member of the Team first needs to authorize it.


coderabbitai bot commented Dec 19, 2025

📝 Walkthrough

The changes introduce a complete RunLedger CI gate infrastructure. A GitHub Actions workflow executes deterministic evaluations in replay mode against a baseline snapshot, detecting regressions in tool schema, behavior, and budget constraints. Supporting files include an agent implementation, evaluation suite configuration, test cases, cassettes, and baseline data.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| **CI Infrastructure**: `.github/workflows/runledger.yml`, `.gitignore` | Added GitHub Actions workflow (`runledger-gate`) to run deterministic evals in replay mode against the baseline; added `runledger_out/` to `.gitignore`. |
| **Documentation**: `README.md` | Added a "RunLedger CI gate" section describing the deterministic CI gate for tool-using agents, with usage examples. |
| **Baseline Snapshot**: `baselines/runledger-demo.json` | Added a complete RunLedger run-ledger snapshot containing aggregates, case results, metrics, policy configuration, and run metadata for baseline comparison. |
| **Agent Implementation**: `evals/runledger/agent/agent.py` | Added a simple line-oriented JSON agent that processes `task_start` and `tool_result` messages, calls the `search_docs` tool, and emits final output. |
| **Evaluation Suite & Schema**: `evals/runledger/suite.yaml`, `evals/runledger/schema.json` | Added suite configuration defining the agent workflow, tool registration (`search_docs`), budget constraints, and a 100% pass-rate policy; added a JSON schema requiring `category` and `reply` fields. |
| **Test Case & Cassette**: `evals/runledger/cases/t1.yaml`, `evals/runledger/cassettes/t1.jsonl` | Added evaluation case (`t1`) for login-ticket triage and a corresponding cassette with a mocked `search_docs` tool response. |
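For reviewers new to the protocol, a stub agent of the kind summarized above might look like the following. This is a sketch assuming the message shapes described in the summary (`task_start`, `tool_result`, a `search_docs` tool, and a final output with `category` and `reply` fields); the field names `input`, `tool_call`, and `final`, and the output values, are illustrative, not the PR's actual `agent.py`:

```python
import json
import sys


def handle(msg):
    """Map one incoming protocol message to the agent's reply, or None."""
    if msg.get("type") == "task_start":
        # Ask the harness to run the search_docs tool on the task input.
        return {"type": "tool_call", "tool": "search_docs",
                "args": {"query": msg.get("input", "")}}
    if msg.get("type") == "tool_result":
        # Emit the final structured output the schema requires.
        return {"type": "final",
                "output": {"category": "login", "reply": "See the linked docs."}}
    return None


def main():
    # Line-oriented JSON: one message in per line, one message out per line.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        reply = handle(json.loads(line))
        if reply is not None:
            sys.stdout.write(json.dumps(reply) + "\n")
            sys.stdout.flush()  # flush so the replay harness sees output immediately


if __name__ == "__main__":
    main()
```

In replay mode the harness feeds this agent the recorded cassette lines on stdin, so the run is deterministic and never reaches a real API.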

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • Review agent.py for correct message handling and tool invocation logic
  • Verify suite.yaml correctly references all configuration files and paths
  • Validate baseline snapshot structure and integrity
  • Confirm schema.json matches assertions in test cases

Poem

🐰 A gate deterministic, so grand and so true,
Replaying each call on the things agents do,
With budgets and schemas, we catch every slip,
No regressions shall sneak through this ledger-tracked trip! 🏃‍♂️✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a deterministic replay gate for agent tests via RunLedger integration. |
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
✨ Finishing touches
  • 📝 Generate docstrings


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
.github/workflows/runledger.yml (1)

16-16: Consider pinning the RunLedger version.

Installing the latest RunLedger without version pinning may introduce non-deterministic CI behavior if breaking changes are released.

🔎 Optional enhancement
```diff
-          python -m pip install runledger
+          python -m pip install runledger==0.1.0
```

Replace 0.1.0 with the current stable version.

evals/runledger/agent/agent.py (1)

13-13: Consider adding error handling for malformed JSON.

While the replay mode should provide valid JSON, adding try-except around json.loads() would make the agent more robust for debugging scenarios.

🔎 Optional enhancement
```diff
 def main():
     for line in sys.stdin:
         line = line.strip()
         if not line:
             continue
-        msg = json.loads(line)
+        try:
+            msg = json.loads(line)
+        except json.JSONDecodeError as e:
+            sys.stderr.write(f"Invalid JSON: {e}\n")
+            continue
         if msg.get("type") == "task_start":
```
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63fe0a6 and ac6d4a4.

📒 Files selected for processing (9)
  • .github/workflows/runledger.yml (1 hunks)
  • .gitignore (1 hunks)
  • README.md (1 hunks)
  • baselines/runledger-demo.json (1 hunks)
  • evals/runledger/agent/agent.py (1 hunks)
  • evals/runledger/cases/t1.yaml (1 hunks)
  • evals/runledger/cassettes/t1.jsonl (1 hunks)
  • evals/runledger/schema.json (1 hunks)
  • evals/runledger/suite.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

Document agent architecture and design patterns in markdown files (e.g., AGENTS.md)

Files:

  • README.md
🧠 Learnings (2)
📚 Learning: 2025-12-11T14:06:07.500Z
Learnt from: CR
Repo: truffle-ai/dexto PR: 0
File: packages/cli/src/cli/ink-cli/GEMINI.md:0-0
Timestamp: 2025-12-11T14:06:07.500Z
Learning: Refer to AGENTS.md for detailed agent documentation and guidelines

Applied to files:

  • evals/runledger/agent/agent.py
📚 Learning: 2025-11-24T19:37:07.870Z
Learnt from: CR
Repo: truffle-ai/dexto PR: 0
File: GEMINI.md:0-0
Timestamp: 2025-11-24T19:37:07.870Z
Learning: Applies to AGENTS.md : Review AGENTS.md for agent-related development guidelines

Applied to files:

  • evals/runledger/agent/agent.py
🔇 Additional comments (9)
.gitignore (1)

83-83: LGTM!

Correctly ignores the runledger_out/ directory to prevent committing generated evaluation artifacts.

evals/runledger/schema.json (1)

1-15: LGTM!

The JSON schema is well-defined and aligns with the agent output structure. It enforces the required category and reply fields for validation during replay.

README.md (1)

642-651: LGTM!

Clear and concise documentation for the RunLedger CI gate. The command example is correct and the explanation effectively communicates the purpose of deterministic replay-based regression detection.

evals/runledger/cases/t1.yaml (1)

1-10: LGTM!

The test case configuration is well-structured. The input, cassette reference, and assertions are properly aligned with the schema and cassette data.

evals/runledger/cassettes/t1.jsonl (1)

1-1: LGTM!

The cassette data is correctly formatted for replay mode and contains appropriate mock search results that align with the test case input.

.github/workflows/runledger.yml (1)

1-24: LGTM!

The workflow is well-structured for a CI gate. The use of if: always() for artifact uploads ensures debugging information is available even when evaluations fail.

evals/runledger/suite.yaml (1)

1-18: LGTM!

The evaluation suite configuration is comprehensive and well-structured. The budgets are appropriate for a replay-mode demo, and the strict pass rate (100%) correctly enforces regression detection.

evals/runledger/agent/agent.py (1)

1-22: LGTM!

The stub agent correctly implements the line-oriented JSON protocol for replay mode. The send() function appropriately flushes output, and the message handling logic aligns with the cassette and schema requirements.

baselines/runledger-demo.json (1)

1-112: Baseline structure is correct and internally consistent.

The JSON is valid and all aggregate statistics correctly roll up from the single test case: cases_total (1), pass_rate (1.0), and tool_calls aggregates all match their constituent values. The cassette SHA256 is present for integrity verification, and the policy threshold (min_pass_rate: 1.0) is satisfied.

However, please verify the following points:

  1. Null metrics in replay mode (Lines 8–34): The cost_usd, steps, tokens_in, and tokens_out metrics are all null while tool_calls, tool_errors, and wall_ms contain actual values. Confirm this is intentional for replay-mode baselines where actual model/tool execution doesn't occur.

  2. git_sha null in CI context (Line 96): The baseline has git_sha: null, which may complicate reproducibility tracing. Verify whether the CI workflow (.github/workflows/runledger.yml) populates this field before storing the baseline.

  3. suite_config_hash null (Line 109): Confirm whether this field should be computed from the suite config or left null by design.
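The roll-up check described in this comment is easy to script. A minimal sketch, assuming field names that follow the comment above (`cases`, `aggregates`, `cases_total`, `pass_rate`, a per-case `passed` flag); real RunLedger baselines may use different keys:

```python
def check_baseline(baseline):
    """Verify that the aggregate stats roll up from the per-case results.

    A mismatch means the baseline was edited by hand or produced by a
    different tool version, and the replay gate cannot be trusted.
    """
    cases = baseline["cases"]
    agg = baseline["aggregates"]
    if agg["cases_total"] != len(cases):
        return False
    passed = sum(1 for c in cases if c.get("passed"))
    return agg["pass_rate"] == passed / len(cases)
```

Running such a check in CI alongside the replay itself would catch a stale or hand-edited baseline before it silently weakens the gate.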

@rahulkarajgikar
Collaborator

what is this?

@ZackMitchell910
Author

This adds an optional CI check that records one known-good run and then replays it in CI, so it's deterministic and doesn't call external APIs. It helps catch regressions like tool order/args changing, output shape drifting, or unexpected tool errors before merge. For Dexto, you could wire it to a small agent flow (e.g., a simple task with one tool call) and get a quick gate that protects that behavior. It's fully removable if you don't want it.
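The record/replay mechanism described here can be sketched in a few lines of Python. This is an illustration of the general technique, not RunLedger's actual implementation; the JSONL cassette fields (`tool`, `args`, `result`) mirror the cassette file in this PR, but the function and error names are made up:

```python
import json


class ReplayError(Exception):
    """Raised when a live tool call diverges from the recorded cassette."""


def replay_tool_calls(cassette_path):
    """Load recorded tool interactions and serve them back in order.

    Any call whose name or arguments differ from the recording is treated
    as a regression and fails the gate instead of hitting a real API.
    """
    with open(cassette_path) as f:
        recorded = [json.loads(line) for line in f if line.strip()]
    index = 0

    def call_tool(name, args):
        nonlocal index
        if index >= len(recorded):
            raise ReplayError(f"unexpected extra tool call: {name}")
        entry = recorded[index]
        index += 1
        if entry["tool"] != name or entry["args"] != args:
            raise ReplayError(
                f"tool call drifted: expected {entry['tool']}({entry['args']}), "
                f"got {name}({args})")
        return entry["result"]

    return call_tool
```

Because every tool result comes from the cassette, the same agent code produces byte-identical output on every CI run, which is what makes baseline comparison meaningful.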
