
bparees commented Jan 30, 2026

Description

Adds support for checking tool call results, in addition to tool call names and arguments.

This is needed for aladdin evaluation in particular, because some of the MCP tools are non-deterministic.
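
As a rough illustration of what this enables, the sketch below shows expected tool calls that carry a result pattern alongside name and arguments. The field names and values are hypothetical, not the exact schema used in config/evaluation_data.yaml (see docs/EVALUATION_GUIDE.md for the real format):

```python
# Hypothetical expected-tool-call entries; illustrative field names only.
# The actual schema lives in config/evaluation_data.yaml and is documented
# in docs/EVALUATION_GUIDE.md.
expected_tool_calls = [
    {
        "tool_name": "get_cluster_version",   # checked as before
        "arguments": {},                      # checked as before
        "result": r"4\.\d+\.\d+",             # new: regex for non-deterministic output
    },
    {
        "tool_name": "get_namespaces",
        "arguments": {"limit": 10},
        # "result" is optional; omitting it keeps the old behavior of
        # checking only the tool name and arguments.
    },
]
```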

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Generated by: Cursor

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

Release Notes

  • New Features

    • Evaluation system now supports validating tool call results alongside arguments, using exact matching and regex pattern verification.
    • Added optional setup and cleanup script support in evaluation configurations.
  • Documentation

    • Enhanced evaluation guide with new "Result Validation" section, including practical YAML examples and implementation patterns.
  • Tests

    • Comprehensive test coverage for result validation, including regex matching, exact matches, and error handling scenarios.


coderabbitai bot commented Jan 30, 2026

Walkthrough

This PR adds optional result field validation support to the evaluation framework. Changes include configuration examples, parsing and client API updates to handle results, metrics evaluation logic for result comparison with regex patterns, and comprehensive tests validating the new functionality across the system.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration & Documentation**<br>`config/evaluation_data.yaml`, `docs/EVALUATION_GUIDE.md` | Added new conversation group demonstrating tool call result validation with exact and regex-based matching; expanded YAML examples and metrics descriptions to document optional result fields and validation behavior. |
| **API Layer**<br>`src/lightspeed_evaluation/core/api/streaming_parser.py`, `src/lightspeed_evaluation/core/api/client.py` | Extended parsing to extract optional result fields from tool calls; `formatted_tool` dictionary now includes result when present; updated docstrings to document new optional result field. |
| **Metrics Evaluation**<br>`src/lightspeed_evaluation/core/metrics/custom/tool_eval.py` | Introduced `_compare_tool_result` helper to validate optional result fields using regex-based matching; updated `_compare_single_tool_call` to invoke result comparison alongside argument validation; added debug logging for mismatches (see the sketch after this table). |
| **Validation Rules**<br>`src/lightspeed_evaluation/core/system/validator.py` | Updated `custom:tool_eval` metric field requirements to document optional result field in validation schema. |
| **Test Coverage**<br>`tests/unit/core/api/test_streaming_parser.py`, `tests/unit/core/metrics/custom/test_tool_eval.py` | Added 36 lines of parser tests validating presence/absence of result fields; added 204 lines of metrics tests covering result comparison logic, regex matching, mismatches, and evaluation scenarios with mixed tool lists. |
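
The Metrics Evaluation row above describes a `_compare_tool_result` helper that validates optional result fields with regex-based matching. The snippet below is a minimal sketch of that idea, not the actual implementation in `tool_eval.py`; the real helper's name, signature, and matching rules may differ:

```python
import logging
import re

logger = logging.getLogger(__name__)


def compare_tool_result(expected: dict, actual: dict) -> bool:
    """Minimal sketch of regex-based result validation.

    The expected "result" field is optional: when it is absent, the check
    passes so existing evaluation data without result fields keeps working.
    """
    expected_result = expected.get("result")
    if expected_result is None:
        return True  # nothing to validate

    actual_result = str(actual.get("result", ""))
    # re.search accepts both plain strings and regex patterns; anchor the
    # pattern (^...$) in the evaluation data if a full match is required.
    if re.search(str(expected_result), actual_result, flags=re.DOTALL):
        return True

    logger.debug(
        "Tool result mismatch: expected pattern %r, actual %r",
        expected_result,
        actual_result,
    )
    return False
```

Treating an exact expected string as a regex also covers literal matches as long as the string contains no regex metacharacters, which is one way to reconcile the "exact matching and regex pattern verification" behavior described in the release notes.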

Sequence Diagram(s)

sequenceDiagram
    participant Parser as Streaming Parser
    participant Client as Client API
    participant ToolEval as Tool Eval Metrics
    
    Parser->>Parser: Extract tool_name, arguments, result (optional)
    Parser->>Client: Return parsed tool_call dict
    Client->>Client: Format tool_call with result field
    Client->>ToolEval: Pass formatted_tool to evaluation
    ToolEval->>ToolEval: Compare tool_name
    ToolEval->>ToolEval: Compare arguments (regex)
    ToolEval->>ToolEval: Compare result (regex, if present)
    ToolEval->>ToolEval: Return match status
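
Following the flow in the diagram, the client-side formatting step only attaches a result key when the parser actually extracted one. The sketch below illustrates that conditional shape with made-up names; the real code in streaming_parser.py and client.py may organize this differently:

```python
from typing import Any, Optional


def format_tool_call(
    tool_name: str,
    arguments: dict[str, Any],
    result: Optional[str] = None,
) -> dict[str, Any]:
    """Sketch of the formatting step: include "result" only when present."""
    formatted_tool: dict[str, Any] = {
        "tool_name": tool_name,
        "arguments": arguments,
    }
    if result is not None:
        formatted_tool["result"] = result
    return formatted_tool
```

Keeping the key optional means downstream consumers that predate this change see the same dictionary shape as before.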

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • VladimirKadlec
  • tisnik
  • asamal4

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately describes the main change: adding support for evaluating tool call results, which aligns with the comprehensive changes across configuration, documentation, and code. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
