Add deterministic replay gate for agent tests #500
ZackMitchell910 wants to merge 1 commit into truffle-ai:main from
Conversation
@ZackMitchell910 is attempting to deploy a commit to the Shaunak's projects Team on Vercel. A member of the Team first needs to authorize it.
📝 Walkthrough

The changes introduce a complete RunLedger CI gate infrastructure. A GitHub Actions workflow executes deterministic evaluations in replay mode against a baseline snapshot, detecting regressions in tool schema, behavior, and budget constraints. Supporting files include an agent implementation, evaluation suite configuration, test cases, cassettes, and baseline data.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Actionable comments posted: 0
🧹 Nitpick comments (2)
.github/workflows/runledger.yml (1)
16-16: Consider pinning the RunLedger version. Installing the latest RunLedger without version pinning may introduce non-deterministic CI behavior if breaking changes are released.

🔎 Optional enhancement

```diff
-          python -m pip install runledger
+          python -m pip install runledger==0.1.0
```

Replace `0.1.0` with the current stable version.

evals/runledger/agent/agent.py (1)
13-13: Consider adding error handling for malformed JSON. While the replay mode should provide valid JSON, adding try-except around `json.loads()` would make the agent more robust for debugging scenarios.

🔎 Optional enhancement

```diff
 def main():
     for line in sys.stdin:
         line = line.strip()
         if not line:
             continue
-        msg = json.loads(line)
+        try:
+            msg = json.loads(line)
+        except json.JSONDecodeError as e:
+            sys.stderr.write(f"Invalid JSON: {e}\n")
+            continue
         if msg.get("type") == "task_start":
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
.github/workflows/runledger.yml (1 hunks)
.gitignore (1 hunks)
README.md (1 hunks)
baselines/runledger-demo.json (1 hunks)
evals/runledger/agent/agent.py (1 hunks)
evals/runledger/cases/t1.yaml (1 hunks)
evals/runledger/cassettes/t1.jsonl (1 hunks)
evals/runledger/schema.json (1 hunks)
evals/runledger/suite.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
Document agent architecture and design patterns in markdown files (e.g., AGENTS.md)
Files:
README.md
🧠 Learnings (2)
📚 Learning: 2025-12-11T14:06:07.500Z
Learnt from: CR
Repo: truffle-ai/dexto PR: 0
File: packages/cli/src/cli/ink-cli/GEMINI.md:0-0
Timestamp: 2025-12-11T14:06:07.500Z
Learning: Refer to AGENTS.md for detailed agent documentation and guidelines
Applied to files:
evals/runledger/agent/agent.py
📚 Learning: 2025-11-24T19:37:07.870Z
Learnt from: CR
Repo: truffle-ai/dexto PR: 0
File: GEMINI.md:0-0
Timestamp: 2025-11-24T19:37:07.870Z
Learning: Applies to AGENTS.md : Review AGENTS.md for agent-related development guidelines
Applied to files:
evals/runledger/agent/agent.py
🔇 Additional comments (9)
.gitignore (1)
83-83: LGTM! Correctly ignores the `runledger_out/` directory to prevent committing generated evaluation artifacts.

evals/runledger/schema.json (1)
1-15: LGTM! The JSON schema is well-defined and aligns with the agent output structure. It enforces the required `category` and `reply` fields for validation during replay.

README.md (1)
642-651: LGTM! Clear and concise documentation for the RunLedger CI gate. The command example is correct and the explanation effectively communicates the purpose of deterministic replay-based regression detection.
evals/runledger/cases/t1.yaml (1)
1-10: LGTM! The test case configuration is well-structured. The input, cassette reference, and assertions are properly aligned with the schema and cassette data.
evals/runledger/cassettes/t1.jsonl (1)
1-1: LGTM! The cassette data is correctly formatted for replay mode and contains appropriate mock search results that align with the test case input.
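To illustrate how such a JSONL cassette might be consumed, here is a hedged sketch of loading it and checking each record for the schema-required fields. The `category`/`reply` field names come from the schema comment above; the loading logic is an assumption, not RunLedger's implementation:

```python
import io
import json

REQUIRED_FIELDS = ("category", "reply")

def load_cassette(fp) -> list[dict]:
    """Read one JSON object per line, skipping blanks, as a replay cassette is read."""
    records = []
    for line in fp:
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))
    return records

def check_record(rec: dict) -> bool:
    """True if the record carries the string fields the output schema requires."""
    return all(isinstance(rec.get(f), str) for f in REQUIRED_FIELDS)

# In-memory stand-in for a one-line cassette like evals/runledger/cassettes/t1.jsonl:
fake = io.StringIO('{"category": "search", "reply": "mock result"}\n')
records = load_cassette(fake)
```

The one-object-per-line format is what makes the cassette diffable and easy to hash for the baseline integrity check.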
.github/workflows/runledger.yml (1)
1-24: LGTM! The workflow is well-structured for a CI gate. The use of `if: always()` for artifact uploads ensures debugging information is available even when evaluations fail.

evals/runledger/suite.yaml (1)
1-18: LGTM! The evaluation suite configuration is comprehensive and well-structured. The budgets are appropriate for a replay-mode demo, and the strict pass rate (100%) correctly enforces regression detection.
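The strict pass-rate policy amounts to a few lines of logic. A sketch, assuming results arrive as a list of booleans and the `min_pass_rate: 1.0` threshold mentioned above:

```python
def meets_policy(results: list[bool], min_pass_rate: float = 1.0) -> bool:
    """Gate fails on an empty suite or any pass rate below the policy threshold."""
    if not results:
        return False
    return sum(results) / len(results) >= min_pass_rate
```

With a single passing case, as in this demo suite, the gate passes; a single failure fails a 1.0 threshold, which is exactly the regression-detection behavior the reviewer notes.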
evals/runledger/agent/agent.py (1)
1-22: LGTM! The stub agent correctly implements the line-oriented JSON protocol for replay mode. The `send()` function appropriately flushes output, and the message handling logic aligns with the cassette and schema requirements.

baselines/runledger-demo.json (1)
1-112: Baseline structure is correct and internally consistent. The JSON is valid and all aggregate statistics correctly roll up from the single test case: cases_total (1), pass_rate (1.0), and tool_calls aggregates all match their constituent values. The cassette SHA256 is present for integrity verification, and the policy threshold (min_pass_rate: 1.0) is satisfied.
However, please verify the following points:
Null metrics in replay mode (Lines 8–34): The `cost_usd`, `steps`, `tokens_in`, and `tokens_out` metrics are all null while `tool_calls`, `tool_errors`, and `wall_ms` contain actual values. Confirm this is intentional for replay-mode baselines where actual model/tool execution doesn't occur.

git_sha null in CI context (Line 96): The baseline has `git_sha: null`, which may complicate reproducibility tracing. Verify whether the CI workflow (.github/workflows/runledger.yml) populates this field before storing the baseline.

suite_config_hash null (Line 109): Confirm whether this field should be computed from the suite config or left null by design.
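A sketch of the integrity check the reviewer alludes to: recompute the cassette file's SHA-256 and compare it to the digest stored in the baseline. The `cassette_sha256` field name is an assumption based on the review notes, not a confirmed part of the baseline format:

```python
import hashlib

def cassette_sha256(path: str) -> str:
    """Hash the cassette bytes so replay inputs can be pinned to the baseline."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def baseline_matches(baseline: dict, cassette_path: str) -> bool:
    # Field name "cassette_sha256" is assumed. A mismatch means the recorded
    # inputs changed, so the baseline should be re-recorded rather than trusted.
    return baseline.get("cassette_sha256") == cassette_sha256(cassette_path)
```

Checking the hash before replay is what lets the gate distinguish "the agent regressed" from "someone edited the cassette".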
what is this?
This adds an optional CI check that records one known-good run and then replays it in CI, so it's deterministic and doesn't call external APIs. It helps catch regressions like tool order/args changing, output shape drifting, or unexpected tool errors before merge. For Dexto, you could wire it to a small agent flow (e.g., a simple task with one tool call) and get a quick gate that protects that behavior. It's fully removable if you don't want it.
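One way to picture the regression check described above: diff the baseline's recorded tool-call sequence against the current run's. This is an illustrative sketch under assumed record shapes (`tool`/`args` keys), not RunLedger's actual implementation:

```python
def detect_drift(baseline_calls: list[dict], current_calls: list[dict]) -> list[str]:
    """Report tool order, argument, and call-count changes against the baseline."""
    issues = []
    for i, (old, new) in enumerate(zip(baseline_calls, current_calls)):
        if old.get("tool") != new.get("tool"):
            # Tool order changed: a different tool ran at this position.
            issues.append(f"call {i}: tool changed {old.get('tool')!r} -> {new.get('tool')!r}")
        elif old.get("args") != new.get("args"):
            # Same tool, different arguments: output shape or prompt drifted.
            issues.append(f"call {i}: args drifted for {old.get('tool')!r}")
    if len(current_calls) != len(baseline_calls):
        issues.append(f"call count changed: {len(baseline_calls)} -> {len(current_calls)}")
    return issues
```

An empty issue list means the replayed run matched the recorded one, which is the condition the CI gate enforces before merge.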
Summary
runledger/
Runledger@v0.1
runledger_out/

How to run locally

Notes