
bparees commented Jan 30, 2026

Description

Adds support for checking tool call results, in addition to tool call names and arguments.

This is needed for aladdin evaluation in particular, because some of the MCP tools are non-deterministic.
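
As a rough illustration of what this enables, the sketch below shows expected tool calls that carry a result pattern alongside name and arguments. The field names and values are hypothetical, not the exact schema used in config/evaluation_data.yaml (see docs/EVALUATION_GUIDE.md for the real format):

```python
# Hypothetical expected-tool-call entries; illustrative field names only.
# The actual schema lives in config/evaluation_data.yaml and is documented
# in docs/EVALUATION_GUIDE.md.
expected_tool_calls = [
    {
        "tool_name": "get_cluster_version",   # checked as before
        "arguments": {},                      # checked as before
        "result": r"4\.\d+\.\d+",             # new: regex for non-deterministic output
    },
    {
        "tool_name": "get_namespaces",
        "arguments": {"limit": 10},
        # "result" is optional; omitting it keeps the old behavior of
        # checking only the tool name and arguments.
    },
]
```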

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Generated by: Cursor

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

Release Notes

  • New Features

    • Evaluation system now supports validating tool call results alongside arguments, using exact matching and regex pattern verification.
    • Added optional setup and cleanup script support in evaluation configurations.
  • Documentation

    • Enhanced evaluation guide with new "Result Validation" section, including practical YAML examples and implementation patterns.
  • Tests

    • Comprehensive test coverage for result validation, including regex matching, exact matches, and error handling scenarios.


coderabbitai bot commented Jan 30, 2026

Walkthrough

This PR adds optional result field validation support to the evaluation framework. Changes include configuration examples, parsing and client API updates to handle results, metrics evaluation logic for result comparison with regex patterns, and comprehensive tests validating the new functionality across the system.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration & Documentation**<br>`config/evaluation_data.yaml`, `docs/EVALUATION_GUIDE.md` | Added new conversation group demonstrating tool call result validation with exact and regex-based matching; expanded YAML examples and metrics descriptions to document optional result fields and validation behavior. |
| **API Layer**<br>`src/lightspeed_evaluation/core/api/streaming_parser.py`, `src/lightspeed_evaluation/core/api/client.py` | Extended parsing to extract optional result fields from tool calls; `formatted_tool` dictionary now includes result when present; updated docstrings to document new optional result field. |
| **Metrics Evaluation**<br>`src/lightspeed_evaluation/core/metrics/custom/tool_eval.py` | Introduced `_compare_tool_result` helper to validate optional result fields using regex-based matching; updated `_compare_single_tool_call` to invoke result comparison alongside argument validation; added debug logging for mismatches (see the sketch after this table). |
| **Validation Rules**<br>`src/lightspeed_evaluation/core/system/validator.py` | Updated `custom:tool_eval` metric field requirements to document optional result field in validation schema. |
| **Test Coverage**<br>`tests/unit/core/api/test_streaming_parser.py`, `tests/unit/core/metrics/custom/test_tool_eval.py` | Added 36 lines of parser tests validating presence/absence of result fields; added 204 lines of metrics tests covering result comparison logic, regex matching, mismatches, and evaluation scenarios with mixed tool lists. |
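
The Metrics Evaluation row above describes a `_compare_tool_result` helper that validates optional result fields with regex-based matching. The snippet below is a minimal sketch of that idea, not the actual implementation in `tool_eval.py`; the real helper's name, signature, and matching rules may differ:

```python
import logging
import re

logger = logging.getLogger(__name__)


def compare_tool_result(expected: dict, actual: dict) -> bool:
    """Minimal sketch of regex-based result validation.

    The expected "result" field is optional: when it is absent, the check
    passes so existing evaluation data without result fields keeps working.
    """
    expected_result = expected.get("result")
    if expected_result is None:
        return True  # nothing to validate

    actual_result = str(actual.get("result", ""))
    # re.search accepts both plain strings and regex patterns; anchor the
    # pattern (^...$) in the evaluation data if a full match is required.
    if re.search(str(expected_result), actual_result, flags=re.DOTALL):
        return True

    logger.debug(
        "Tool result mismatch: expected pattern %r, actual %r",
        expected_result,
        actual_result,
    )
    return False
```

Treating an exact expected string as a regex also covers literal matches as long as the string contains no regex metacharacters, which is one way to reconcile the "exact matching and regex pattern verification" behavior described in the release notes.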

Sequence Diagram(s)

sequenceDiagram
    participant Parser as Streaming Parser
    participant Client as Client API
    participant ToolEval as Tool Eval Metrics
    
    Parser->>Parser: Extract tool_name, arguments, result (optional)
    Parser->>Client: Return parsed tool_call dict
    Client->>Client: Format tool_call with result field
    Client->>ToolEval: Pass formatted_tool to evaluation
    ToolEval->>ToolEval: Compare tool_name
    ToolEval->>ToolEval: Compare arguments (regex)
    ToolEval->>ToolEval: Compare result (regex, if present)
    ToolEval->>ToolEval: Return match status
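
Following the flow in the diagram, the client-side formatting step only attaches a result key when the parser actually extracted one. The sketch below illustrates that conditional shape with made-up names; the real code in streaming_parser.py and client.py may organize this differently:

```python
from typing import Any, Optional


def format_tool_call(
    tool_name: str,
    arguments: dict[str, Any],
    result: Optional[str] = None,
) -> dict[str, Any]:
    """Sketch of the formatting step: include "result" only when present."""
    formatted_tool: dict[str, Any] = {
        "tool_name": tool_name,
        "arguments": arguments,
    }
    if result is not None:
        formatted_tool["result"] = result
    return formatted_tool
```

Keeping the key optional means downstream consumers that predate this change see the same dictionary shape as before.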

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • VladimirKadlec
  • tisnik
  • asamal4

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately describes the main change: adding support for evaluating tool call results, which aligns with the comprehensive changes across configuration, documentation, and code. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
