feat: add Eval System for end-to-end agent evaluation #6506

Open
singhhnitin wants to merge 2 commits into aden-hive:main from singhhnitin:feature/eval-system

@singhhnitin
Contributor

Description

Implements the Eval System item from the official roadmap: the first end-to-end agent graph evaluation framework for Aden Hive. Adds a new framework.eval module that lets developers benchmark agent quality across content correctness, latency, cost, tool usage, and semantic quality (scored via LLM-as-judge).

Type of Change

  • New feature (non-breaking change that adds functionality)

Related Issues

Closes eval-system item from roadmap.md

Changes Made

  • New core/framework/eval/ module (6 files, zero new dependencies)
  • EvalCase / EvalSuite — YAML-driven eval definitions with tag filtering and weighted scoring
  • EvalScorer — multi-dimension scoring: content, latency, cost, tool usage, LLM-as-judge
  • EvalRunner — async runner with concurrency control via asyncio.Semaphore
  • EvalReport — aggregate report with JSON and Markdown export
  • hive eval run and hive eval report CLI commands wired into framework/cli.py
  • --fail-under flag for CI/CD pass rate gating
  • Example eval suite at core/framework/eval/basic_agent_eval.yaml
  • 17 unit tests covering all scoring dimensions
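
For reviewers unfamiliar with the pattern, the concurrency control mentioned for EvalRunner can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `CaseResult`, `run_case`, `run_suite`, `max_concurrency`, and the `stats` dict are hypothetical names, and the agent call is replaced by a short sleep.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CaseResult:
    # Hypothetical result record; the real EvalRunner output may differ.
    case_id: str
    output: str


async def run_case(case_id: str, sem: asyncio.Semaphore, stats: dict) -> CaseResult:
    # The semaphore caps how many cases are in flight at the same time.
    async with sem:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the actual agent invocation
        stats["active"] -= 1
        return CaseResult(case_id=case_id, output=f"answer for {case_id}")


async def run_suite(case_ids: list[str], max_concurrency: int, stats: dict) -> list[CaseResult]:
    sem = asyncio.Semaphore(max_concurrency)
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(run_case(c, sem, stats) for c in case_ids))


stats = {"active": 0, "peak": 0}
results = asyncio.run(run_suite(["c1", "c2", "c3", "c4", "c5"], 2, stats))
```

With `max_concurrency=2`, at most two simulated cases run at once, while the results still come back in suite order.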

Testing

  • Unit tests pass (cd core && pytest tests/test_eval/ — 17/17 passed)
  • Lint passes (cd core && ruff check .)
  • Manual testing performed

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Usage

hive eval run --suite core/framework/eval/basic_agent_eval.yaml --agent exports/my-agent --verbose
hive eval run --suite my_suite.yaml --agent exports/agent --fail-under 0.8 --output-json report.json
hive eval report report.json
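
As a rough sketch of how `--fail-under` gating could work in CI: compute the pass rate from the report and return a non-zero exit code when it falls below the threshold. The report schema here (a `results` list with a boolean `passed` field) and the helper names `pass_rate` and `gate_exit_code` are assumptions for illustration; the JSON written by `--output-json` may be shaped differently.

```python
import json


def pass_rate(report: dict) -> float:
    # Fraction of cases marked as passed; an empty report counts as 0.0.
    results = report.get("results", [])
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.get("passed"))
    return passed / len(results)


def gate_exit_code(report: dict, fail_under: float) -> int:
    # 0 means the gate passes, 1 means CI should fail the job.
    return 0 if pass_rate(report) >= fail_under else 1


# Hypothetical report with 4 of 5 cases passing.
report = json.loads(
    '{"results": ['
    '{"id": "a", "passed": true}, {"id": "b", "passed": true}, '
    '{"id": "c", "passed": true}, {"id": "d", "passed": true}, '
    '{"id": "e", "passed": false}]}'
)

code_at_80 = gate_exit_code(report, 0.8)
code_at_90 = gate_exit_code(report, 0.9)
```

Treating the threshold as inclusive (`>=`) means a suite that lands exactly on `--fail-under 0.8` still passes; whether the PR does the same is worth confirming in review.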

Nitin Singh added 2 commits March 15, 2026 22:31
- New framework.eval module with EvalCase, EvalSuite, EvalScorer, EvalRunner, EvalReport
- Multi-dimension scoring: content, performance, tool usage, LLM-as-judge
- YAML-defined eval suites with tag filtering and weighted scoring
- CLI: hive eval run --suite <yaml> --agent <path>
- CLI: hive eval report <json>
- JSON and Markdown report export
- CI-friendly --fail-under threshold for pass rate gating
- Example eval suite in core/framework/eval/basic_agent_eval.yaml

Closes #eval-system roadmap item
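
To make the "tag filtering and weighted scoring" from the commit message concrete, here is a minimal sketch of one way to do it. It is an assumption-laden illustration: `Case`, `filter_by_tags`, and `weighted_score` are hypothetical names, and the actual EvalSuite and EvalScorer in this PR may combine dimensions differently.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    # Hypothetical stand-in for an eval case definition.
    name: str
    tags: set[str] = field(default_factory=set)


def filter_by_tags(cases: list[Case], wanted: set[str]) -> list[Case]:
    # Keep cases that share at least one tag; no filter keeps everything.
    if not wanted:
        return list(cases)
    return [c for c in cases if c.tags & wanted]


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    # Weighted mean over the scored dimensions; unweighted dimensions count as 0.
    total = sum(weights.get(dim, 0.0) for dim in scores)
    if total == 0:
        return 0.0
    return sum(scores[dim] * weights.get(dim, 0.0) for dim in scores) / total


cases = [Case("greet", {"smoke"}), Case("search", {"tools", "slow"}), Case("plan", {"tools"})]
selected = filter_by_tags(cases, {"tools"})
score = weighted_score({"content": 1.0, "latency": 0.5}, {"content": 3.0, "latency": 1.0})
```

Here content is weighted three times as heavily as latency, so a perfect content score with a mediocre latency score still yields a high overall score.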
@github-actions

PR Requirements Warning

This PR does not meet the contribution requirements.
If the issue is not fixed within ~24 hours, it may be automatically closed.

Missing: No linked issue found.

To fix:

  1. Create or find an existing issue for this work
  2. Assign yourself to the issue
  3. Re-open this PR and add Fixes #123 in the description

Exception: To bypass this requirement, you can:

  • Add the micro-fix label or include micro-fix in your PR title for trivial fixes
  • Add the documentation label or include doc/docs in your PR title for documentation changes

Micro-fix requirements (must meet ALL):

| Qualifies | Disqualifies |
| --- | --- |
| < 20 lines changed | Any functional bug fix |
| Typos & Documentation & Linting | Refactoring for "clean code" |
| No logic/API/DB changes | New features (even tiny ones) |

Why is this required? See #472 for details.

@github-actions bot added the pr-requirements-warning label (PR doesn't follow contribution guidelines. Please fix or it will be auto-closed.) on Mar 15, 2026

Labels

pr-requirements-warning PR doesn't follow contribution guidelines. Please fix or it will be auto-closed.
