feat: add Eval System for end-to-end agent evaluation #6506
Open

singhhnitin wants to merge 2 commits into aden-hive:main from
Conversation
added 2 commits on March 15, 2026 at 22:31
- New framework.eval module with EvalCase, EvalSuite, EvalScorer, EvalRunner, EvalReport
- Multi-dimension scoring: content, performance, tool usage, LLM-as-judge
- YAML-defined eval suites with tag filtering and weighted scoring
- CLI: hive eval run --suite <yaml> --agent <path>
- CLI: hive eval report <json>
- JSON and Markdown report export
- CI-friendly --fail-under threshold for pass rate gating
- Example eval suite in core/framework/eval/basic_agent_eval.yaml

Closes #eval-system roadmap item
PR Requirements Warning

This PR does not meet the contribution requirements. Missing: no linked issue found. To fix, either link an issue, request an exception to bypass this requirement, or satisfy the micro-fix requirements (must meet ALL). Why is this required? See #472 for details.
Description
Implements the Eval System from the official roadmap: the first end-to-end agent graph evaluation framework for Aden Hive. It defines a new framework.eval module that lets developers benchmark agent quality across content correctness, latency, cost, tool usage, and semantic quality via LLM-as-judge.

Type of Change
Related Issues
Closes eval-system item from roadmap.md
Changes Made
- core/framework/eval/ module (6 files, zero new dependencies)
- EvalCase / EvalSuite: YAML-driven eval definitions with tag filtering and weighted scoring
- EvalScorer: multi-dimension scoring across content, latency, cost, tool usage, and LLM-as-judge
- EvalRunner: async runner with concurrency control via asyncio.Semaphore
- EvalReport: aggregate report with JSON and Markdown export
- hive eval run and hive eval report CLI commands wired into framework/cli.py
- --fail-under flag for CI/CD pass rate gating
- Example suite: core/framework/eval/basic_agent_eval.yaml
Testing

- cd core && pytest tests/test_eval/ (17/17 passed)
- cd core && ruff check .

Checklist
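As a rough illustration of the concurrency control listed under Changes Made, an async runner bounded by asyncio.Semaphore could be sketched as follows. All names here (run_suite, the case callables) are hypothetical stand-ins, not the PR's actual EvalRunner API:

```python
import asyncio

async def run_suite(cases, max_concurrency=4):
    """Run async eval cases with at most max_concurrency in flight."""
    # The semaphore caps how many cases execute concurrently.
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(case):
        async with sem:
            return await case()  # each case is an async callable returning a score

    # gather preserves the input order of the cases.
    return await asyncio.gather(*(run_one(c) for c in cases))

async def _demo():
    async def make_case(score):
        await asyncio.sleep(0)  # stand-in for real agent execution
        return score

    cases = [lambda s=s: make_case(s) for s in (0.9, 0.8, 1.0)]
    return await run_suite(cases, max_concurrency=2)

scores = asyncio.run(_demo())
print(scores)  # [0.9, 0.8, 1.0]
```

The semaphore pattern keeps the runner simple while preventing a large suite from launching every agent invocation at once.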
Usage
```shell
hive eval run --suite core/framework/eval/basic_agent_eval.yaml --agent exports/my-agent --verbose
hive eval run --suite my_suite.yaml --agent exports/agent --fail-under 0.8 --output-json report.json
hive eval report report.json
```
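To show how a --fail-under threshold can gate CI, here is a minimal sketch. The result shape and the gate function are assumptions for illustration, not the PR's EvalReport implementation:

```python
def gate(results, fail_under):
    """Return a process exit code: 0 if pass rate meets the threshold, else 1."""
    passed = sum(1 for r in results if r["passed"])
    pass_rate = passed / len(results) if results else 0.0
    print(f"pass rate: {pass_rate:.2f} (threshold {fail_under})")
    return 0 if pass_rate >= fail_under else 1

# Hypothetical results: 3 of 4 cases passed -> pass rate 0.75, below 0.8.
results = [{"passed": True}, {"passed": True}, {"passed": False}, {"passed": True}]
exit_code = gate(results, fail_under=0.8)
# In a real CLI, this exit code would be propagated via sys.exit(exit_code)
# so CI pipelines fail the build when quality regresses.
```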