
AI_Eval


A comprehensive LLM evaluation and benchmarking framework for comparing local and cloud-hosted language models across standardized test suites with hardware-aware performance profiling.

Python 3.12+ · License: MIT · Status: Stable


Overview

Selecting the right language model for a production workload requires more than reading leaderboard scores. Published benchmarks rarely account for the hardware the model will actually run on, the specific task domain it needs to serve, or the cost trade-offs between local inference and API calls. AI_Eval addresses this gap by providing a structured, reproducible evaluation pipeline that benchmarks LLMs under real-world conditions and produces actionable, hardware-contextualized recommendations.

AI_Eval supports both local models via Ollama and cloud API models from Google, Anthropic, and OpenAI. Every benchmark run captures a complete hardware profile — chip architecture, memory bandwidth, GPU cores, thermal state — so that results are always grounded in the environment that produced them. This makes it possible to compare a 7B model running on a MacBook Air against the same model on a workstation GPU, or to quantify exactly when a cloud API becomes more cost-effective than local inference.

The framework evaluates models across five benchmark categories (text generation, code generation, document analysis, conversational ability, and structured output), applies three distinct scoring methodologies, and generates fitness scores weighted to specific use cases such as RAG knowledge engines, code assistants, and document processing pipelines.


Key Features

Multi-Provider Architecture

A unified provider abstraction layer enables seamless evaluation across inference backends. Each provider implements a standardized interface for generation, chat, model metadata retrieval, and resource measurement — making it straightforward to add new backends.

Provider         | Backend                                           | Status
Ollama           | Local inference (Apple Silicon, NVIDIA, AMD, CPU) | Implemented
Google Gemini    | Cloud API via google-genai SDK                    | Implemented
Anthropic Claude | Cloud API via anthropic SDK                       | Extensible — implement BaseProvider
OpenAI GPT       | Cloud API via openai SDK                          | Extensible — implement BaseProvider
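
The rows marked Extensible only need a provider subclass. The sketch below is a hypothetical example rather than code from the repository: the abstract method names (generate(), chat(), get_model_info()) are taken from the Design Principles section, the import path is assumed from the project layout, and the ProviderFactory registration call is a guess at the registry API.

# Hypothetical sketch of a new backend; only the method names come from this README.
from src.providers.base import BaseProvider, ProviderFactory  # assumed import path

class ExampleCloudProvider(BaseProvider):
    """Minimal provider for an imaginary cloud API."""

    async def generate(self, model: str, prompt: str, **kwargs) -> str:
        # Call the backend's completion endpoint and return the generated text.
        ...

    async def chat(self, model: str, messages: list[dict], **kwargs) -> str:
        # Multi-turn chat; messages use the usual role/content dictionaries.
        ...

    def get_model_info(self, model: str) -> dict:
        # Return model metadata (context window, parameter count, quantization, ...).
        ...

# Registration is assumed to look roughly like:
# ProviderFactory.register("example-cloud", ExampleCloudProvider)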

Scoring Methodology

AI_Eval employs three complementary scoring approaches, each designed for a different class of evaluation task:

  • pass@k (Code Generation) — HumanEval-style functional correctness scoring. Generated code is executed in a sandboxed subprocess with timeout enforcement, and pass rates are calculated across k samples using the unbiased estimator from the Evaluating Large Language Models Trained on Code paper (a short reference sketch follows this list).

  • LLM-as-Judge (Subjective Quality) — Automated quality assessment using a separate judge model with bias mitigation protocols per TD-011. Position shuffling eliminates ordering bias in A/B comparisons, explicit rubrics penalize verbosity bias, and self-evaluation is prohibited (the judge model never scores its own output). Multi-evaluation consensus across multiple judge calls further reduces noise.

  • RAG Metrics (Retrieval-Augmented Generation) — Integration with DeepEval for RAGAS-based evaluation of retrieval pipelines. Metrics include Answer Relevancy (embedding-based semantic similarity), Faithfulness (factual grounding against retrieved context), Contextual Precision (retrieval signal-to-noise ratio), and Hallucination detection.
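
The pass@k estimator itself is small enough to show inline. This is a minimal reference sketch of the standard unbiased estimator from the paper cited above, not the project's scoring module (that lives in src/scoring/pass_k.py):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated per problem, c of them passed the tests."""
    if n - c < k:
        return 1.0  # every size-k subset is guaranteed to contain a passing sample
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 samples, 3 pass -> pass@1 is 0.3 (the plain pass rate), pass@5 is ~0.92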

Benchmark Categories

Models are evaluated across five standardized categories, each with defined subcategories, test prompts, and scoring rubrics:

Category          | Weight | Subcategories                                                             | Focus
Text Generation   | 20%    | Instruction following, reasoning, summarization, creative writing        | General writing quality and coherence
Code Generation   | 25%    | Python, SQL, bug fixing, code explanation, multi-language                | Functional correctness and code quality
Document Analysis | 30%    | Information extraction, classification, question answering, long context | Comprehension and extraction accuracy
Conversational    | 10%    | Multi-turn context, persona consistency, clarification handling          | Dialogue quality and context retention
Structured Output | 15%    | JSON/YAML generation, schema compliance, output consistency              | Format adherence and reliability

Full test definitions, scoring rubrics, and pass criteria are documented in docs/TEST_SUITE_SPEC.md.

Fitness Profiles

Raw category scores are aggregated into fitness scores — weighted composites tuned to specific deployment scenarios. This allows the same benchmark data to answer different questions: "Which model is best for my RAG pipeline?" produces a different ranking than "Which model is best for code generation?"

Profile              | Primary Categories                                | Use Case
RAG Knowledge Engine | Document Analysis (40%), Structured Output (25%)  | Retrieval-augmented Q&A systems
Code Assistant       | Code Generation (50%), Conversational (15%)       | IDE integration, code review, generation
Document Processor   | Document Analysis (50%), Structured Output (30%)  | Classification, extraction, compliance
Chat Application     | Conversational (40%), Text Generation (30%)       | Customer support, conversational agents
Data Pipeline        | Structured Output (35%), Document Analysis (30%)  | ETL, schema transformation, data cleaning
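
As an illustration of the aggregation, a fitness score is a weighted average of the five category scores. The sketch below uses the two primary-category weights from the RAG Knowledge Engine row above; the remaining weights are placeholders, and the authoritative values live in configs/default.yaml:

# Primary weights taken from the table above; the rest are placeholders for illustration.
RAG_KNOWLEDGE_ENGINE_WEIGHTS = {
    "document_analysis": 0.40,
    "structured_output": 0.25,
    "text_generation":   0.15,  # placeholder
    "code_generation":   0.10,  # placeholder
    "conversational":    0.10,  # placeholder
}

def fitness_score(category_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted composite of per-category scores (same 0-100 scale as the inputs)."""
    total = sum(weights.values())
    return sum(category_scores.get(cat, 0.0) * w for cat, w in weights.items()) / total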

Hardware-Aware Profiling

Every benchmark run begins with automatic hardware detection, embedding a complete system profile in the results:

  • Apple Silicon — Chip identification (M1–M4, Pro/Max/Ultra tiers), GPU core count, Neural Engine cores, unified memory bandwidth
  • NVIDIA GPU — VRAM, CUDA cores, driver version via pynvml / nvidia-smi
  • AMD GPU — ROCm detection via rocm-smi
  • CPU Fallback — Core count, clock speed, available RAM, architecture

This enables cross-machine comparison: the same model evaluated on different hardware produces results that can be meaningfully compared when normalized against the hardware profile.
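
The snippet below gives a rough idea of what that detection step looks like; it is a simplified approximation using only standard-library probes, not the project's profiler (see src/profiling/hardware.py for the real implementation):

import os
import platform
import shutil
import subprocess

def detect_hardware() -> dict:
    """Very small hardware probe; the real profiler captures far more detail."""
    profile = {"os": platform.system(), "arch": platform.machine(), "cpu_count": os.cpu_count()}
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        # Apple Silicon: chip name via sysctl, e.g. "Apple M4 Max"
        profile["chip"] = subprocess.run(
            ["sysctl", "-n", "machdep.cpu.brand_string"],
            capture_output=True, text=True).stdout.strip()
    elif shutil.which("nvidia-smi"):
        # NVIDIA GPU present: name and total VRAM via nvidia-smi
        profile["gpu"] = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True).stdout.strip()
    return profile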


Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              AI_Eval CLI                                    │
│  run | quick-test | compare | hardware | list-models | models | evaluate    │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           Orchestrator                                      │
│  Config loading · Model discovery · Test coordination · Result aggregation  │
└─────────────────────────────────────────────────────────────────────────────┘
          │                    │                    │                    │
          ▼                    ▼                    ▼                    ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Providers   │    │  Benchmarks  │    │  Profiling   │    │   Scoring    │
│              │    │              │    │              │    │              │
│  Ollama      │    │  Suites      │    │  Hardware    │    │  pass@k      │
│  Google      │    │  Datasets    │    │  Memory      │    │  LLM Judge   │
│  Anthropic*  │    │  Runner      │    │  Thermal     │    │  RAG Metrics │
│  OpenAI*     │    │              │    │              │    │              │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
          │                    │                                     │
          ▼                    ▼                                     ▼
┌──────────────┐    ┌─────────────────────────────────────────────────────────┐
│  Evaluation  │    │                   Reporting & Export                     │
│              │    │  Jinja2 reports · MODEL_CATALOG · DECISION_MATRIX · JSON │
│  Config      │    └─────────────────────────────────────────────────────────┘
│  Runner      │                                            * = Extensible
│  Scorers     │
│  Discovery   │
└──────────────┘

Design Principles

  • Provider abstraction — BaseProvider abstract class with a ProviderFactory registry. Adding a new backend requires implementing generate(), chat(), and get_model_info().
  • YAML-driven configuration — Test suites, model lists, scoring thresholds, and fitness weights are all defined in configs/default.yaml, making evaluations fully reproducible.
  • Marker-based catalog updates — Export module uses <!-- AI_EVAL:BEGIN --> / <!-- AI_EVAL:END --> markers to update shared catalog files without overwriting manual annotations (see the sketch after this list).
  • Statistical rigor — Bootstrap confidence intervals, minimum sample sizes per category, and paired-difference analysis for model comparisons (see TD-010).
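
The marker mechanism is simple enough to sketch. The snippet below shows the general idea with a regular expression; the project's actual implementation is utils/marker_parser.py, and the function name here is hypothetical:

import re

AI_EVAL_BLOCK = re.compile(r"(<!-- AI_EVAL:BEGIN -->).*?(<!-- AI_EVAL:END -->)", re.DOTALL)

def update_catalog(path: str, new_block: str) -> None:
    """Replace only the text between the AI_EVAL markers, leaving manual notes untouched."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    updated = AI_EVAL_BLOCK.sub(lambda m: f"{m.group(1)}\n{new_block}\n{m.group(2)}", text)
    with open(path, "w", encoding="utf-8") as f:
        f.write(updated)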

Requirements

  • Python 3.12 or higher
  • macOS (Apple Silicon optimized), Linux, or Windows
  • Ollama installed and running (for local model evaluation)
  • API keys for cloud providers (optional — only required for API model evaluation)

Installation

# Clone the repository
git clone https://github.com/yourusername/AI_Eval.git
cd AI_Eval

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate   # Windows

# Install runtime dependencies
pip install -r requirements.txt

# Install development dependencies (testing, linting, type checking)
pip install -r requirements-dev.txt

# Or install as editable package with dev extras
pip install -e ".[dev]"

# Configure environment variables
cp .env.example .env
# Edit .env with your API keys (optional for local-only evaluation)

Quick Start

# Run a quick sanity test on a single model
python -m src quick-test --provider ollama --model qwen2.5:7b

# Execute a full benchmark suite
python -m src run --config configs/default.yaml

# Compare two models head-to-head
python -m src compare --models "qwen2.5:7b,llama3:8b" --provider ollama

# List available models from a provider
python -m src list-models --provider ollama

# Display hardware profile for the current machine
python -m src hardware

Configuration

Evaluation behavior is controlled through two configuration layers:

File                 | Purpose
.env                 | API keys, data directory path, log level (gitignored — copy from .env.example)
configs/default.yaml | Model lists, test suite definitions, scoring thresholds, fitness profile weights, output formats

See .env.example for all supported environment variables and configs/default.yaml for the full configuration schema.
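
A quick way to sanity-check a configuration before a run is to load it directly. This sketch assumes PyYAML is available (implied by the YAML-driven configuration); the section names printed depend entirely on the schema in configs/default.yaml:

import yaml

with open("configs/default.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# List the top-level sections the evaluation run will use
for section, value in config.items():
    print(f"{section}: {type(value).__name__}")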


Project Structure

AI_Eval/
├── src/                        # Source code
│   ├── cli.py                  # CLI entry point and command definitions
│   ├── providers/              # LLM provider abstraction layer
│   │   ├── base.py             #   BaseProvider, ProviderFactory, data classes
│   │   ├── ollama_provider.py  #   Ollama local inference with throughput metrics
│   │   └── google_provider.py  #   Google Gemini API integration
│   ├── benchmarks/             # Benchmark execution engine
│   │   ├── runner.py           #   BenchmarkRunner with warmup and concurrency
│   │   └── datasets.py         #   Dataset management and test case loading
│   ├── scoring/                # Scoring methodologies
│   │   ├── pass_k.py           #   HumanEval-style pass@k for code generation
│   │   ├── llm_judge.py        #   LLM-as-Judge with bias mitigation (TD-011)
│   │   └── rag_metrics.py      #   DeepEval RAG evaluation metrics
│   ├── profiling/              # Hardware detection and resource monitoring
│   │   └── hardware.py         #   Platform-aware hardware profiling
│   ├── evaluation/             # Config-driven evaluation workflow
│   │   ├── config.py           #   YAML config loading, dataclass definitions
│   │   ├── runner.py           #   EvaluationRunner orchestration
│   │   ├── scorers.py          #   Custom scorers (JSON, latency, pattern, citation)
│   │   ├── model_discovery.py  #   Model catalog management, Ollama integration
│   │   └── report.py           #   Evaluation report generator
│   ├── reporting/              # Report generation
│   │   ├── report_generator.py #   Jinja2-based markdown/JSON report generation
│   │   ├── readme_updater.py   #   README.md results table updater
│   │   └── templates/          #   Jinja2 report templates
│   └── export/                 # Catalog export to shared markdown files
├── utils/                      # Shared utilities
│   ├── exceptions.py           #   Custom exception hierarchy (AiEvalError base, ReportingError)
│   ├── marker_parser.py        #   Marker-based content replacement for catalog updates
│   ├── logging_config.py       #   Multi-handler structured logging
│   ├── retry.py                #   Retry with backoff, circuit breaker, retry budget
│   ├── rate_limiter.py         #   Sliding window API rate limiting
│   ├── state_machine.py        #   Async state machine for workflow management
│   └── plugin_loader.py        #   Dynamic plugin discovery system
├── configs/                    # Evaluation configuration files (YAML)
│   └── default.yaml            #   Default evaluation configuration
├── evaluations/                # Evaluation configs and results (gitignored except README)
├── tests/                      # Test suite (pytest, 203 tests)
├── docs/                       # Specifications and research
│   ├── TEST_SUITE_SPEC.md      #   Detailed test definitions and scoring rubrics
│   └── RESEARCH_SYNTHESIS.md   #   Synthesized research findings
├── config.py                   # Environment and path configuration
├── pyproject.toml              # Package metadata, tool configuration
├── DevPlan.md                  # Development roadmap and technical decisions
├── StartHere.md                # Developer onboarding guide
└── SECURITY.md                 # Security policy

Benchmark Results

Model       | Provider | Score | Tokens/sec | Pass Rate | Hardware            | Date
qwen2.5:32b | OLLAMA   | 80.3  | 20.5       | 5/5       | Apple M4 Max (48GB) | 2026-02-07
llama3.1:8b | OLLAMA   | 78.5  | 88.1       | 5/5       | Apple M4 Max (48GB) | 2026-02-07
gemma2:27b  | OLLAMA   | 67.1  | 18.2       | 4/5       | Apple M4 Max (48GB) | 2026-02-07

Development Status

AI_Eval is stable and complete as an evaluation framework. The core infrastructure — provider abstraction, benchmark runner, scoring engine, hardware detection, evaluation workflow, reporting, and catalog export — is fully implemented. The architecture is extensible: adding new providers requires only implementing the BaseProvider interface and registering with ProviderFactory.

Component                    | Status     | Notes
Provider abstraction layer   | Complete   | BaseProvider + ProviderFactory pattern
Ollama provider              | Complete   | Async generate/chat with throughput metrics
Google Gemini provider       | Complete   | Cloud inference via google-genai SDK
Anthropic / OpenAI providers | Extensible | Implement BaseProvider to add
Benchmark runner             | Complete   | Warmup phase, concurrent execution, timeout handling
Scoring — pass@k             | Complete   | HumanEval methodology with sandboxed execution
Scoring — LLM-as-Judge       | Complete   | Bias-mitigated evaluation (TD-011)
Scoring — RAG metrics        | Complete   | DeepEval integration (RAGAS)
Hardware detection           | Complete   | Apple Silicon, NVIDIA, AMD, CPU fallback
CLI                          | Complete   | 7 commands: run, quick-test, compare, list-models, hardware, models, evaluate
Evaluation workflow          | Complete   | Config-driven runner, custom scorers, model discovery
Reporting                    | Complete   | Jinja2 markdown/JSON reports, README results table
Catalog export               | Complete   | Marker-based updates to MODEL_CATALOG, DECISION_MATRIX, HARDWARE_PROFILES

See DevPlan.md for the full development roadmap, technical decisions (TD-001 through TD-013), and task tracker.


Testing

AI_Eval follows test-driven development (TDD) practices with the Red-Green-Refactor cycle. All new tests must be observed to fail before writing the implementation that makes them pass.

# Run the full test suite with coverage reporting
pytest

# Run a specific test file
pytest tests/test_report_generator.py

# Skip slow or integration tests
pytest -m "not slow"
pytest -m "not integration"

# Run mutation testing to validate test effectiveness
mutmut run
mutmut results

Testing is configured in pyproject.toml with --cov=src --cov-report=term-missing and asyncio_mode = "auto".


Documentation

Document                   | Description
DevPlan.md                 | Development phases, technical decisions (TD-001–TD-013), task tracker, and roadmap
StartHere.md               | Comprehensive developer onboarding guide with architecture walkthrough
docs/TEST_SUITE_SPEC.md    | Detailed test definitions, scoring rubrics, and performance thresholds
docs/RESEARCH_SYNTHESIS.md | Synthesized findings from multi-source evaluation methodology research
SECURITY.md                | Security policy, secrets management, and incident response procedures

Contributing

Contributions are welcome. Please follow these guidelines:

  1. Code style — Format with black (100 char line length), lint with ruff, type check with mypy
  2. Testing — Follow the Red-Green-Refactor TDD cycle. Write the test first, observe it fail, then implement
  3. Commit format — Use conventional commits: <type>: <description> (feat, fix, docs, refactor, test, chore, perf)
  4. Pre-commit hooks — Install with pre-commit install to run automated checks before each commit
  5. Security — Read SECURITY.md before committing. No secrets, PII, or hardcoded paths in tracked files

License

MIT License — See LICENSE for details.
