LightSpeed Evaluation Framework

A comprehensive framework for evaluating GenAI applications.

This is a WIP. We’re actively adding features, fixing issues, and expanding examples. Please give it a try, share feedback, and report bugs.

🎯 Key Features

  • Multi-Framework Support: Seamlessly use metrics from Ragas, DeepEval, and custom implementations
  • Turn & Conversation-Level Evaluation: Support for both individual queries and multi-turn conversations
  • Evaluation Types: Response, Context, Tool Call, Overall Conversation, and Script-based evaluation
  • LLM Provider Flexibility: OpenAI, Watsonx, Gemini, vLLM and others
  • API Integration: Direct integration with external API for real-time data generation (if enabled)
  • Setup/Cleanup Scripts: Support for running setup and cleanup scripts before/after each conversation evaluation (applicable when API is enabled)
  • Token Usage Tracking: Track input/output tokens for both API calls and Judge LLM evaluations
  • Streaming Performance Metrics: Capture time-to-first-token (TTFT), streaming duration, and tokens/second when using streaming endpoint
  • Statistical Analysis: Statistics for every metric with score distribution analysis
  • Rich Output: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)
  • Flexible Configuration: Configurable environment and metric metadata; global defaults with per-conversation/per-turn metric overrides
  • Early Validation: Catch configuration errors before expensive LLM calls
  • Concurrent Evaluation: Multi-threaded evaluation with configurable thread count
  • Caching: LLM, embedding, and API response caching for faster re-runs
  • Skip on Failure: Optionally skip remaining evaluations in a conversation when a turn evaluation fails (configurable globally or per conversation). If an API call or setup script fails, the affected metrics are always marked as ERROR.

🚀 Quick Start

Installation

# From Git
pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git

# Additional steps for local development
pip install uv
make install-tools

Optional: Local Embedding Models (HuggingFace)

By default, lightspeed-evaluation uses remote embedding providers (OpenAI, Gemini). If you need local embedding models (HuggingFace/sentence-transformers), install with:

# Using pip
pip install 'lightspeed-evaluation[local-embeddings]'

# Using uv (from already cloned repo for local development)
uv sync --extra local-embeddings

Note: Local embeddings require PyTorch and related packages (~6GB). Only install if you need embedding.provider: huggingface in your configuration.
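
For reference, here is a minimal sketch of how the local embedding provider might be selected in system.yaml; only embedding.provider: huggingface is documented above, while the model key and its value are illustrative assumptions:

embedding:
  provider: huggingface
  model: sentence-transformers/all-MiniLM-L6-v2  # hypothetical local model identifier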

Basic Usage

# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"

# Optional: For script-based evaluations requiring Kubernetes access
export KUBECONFIG="/path/to/your/kubeconfig"

# Run evaluation
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --output-dir <OUTPUT_DIR>

# Run subset of evaluations (filter by tag or conversation ID)
lightspeed-eval --tags basic advanced              # Filter by tags
lightspeed-eval --conv-ids conv_1 conv_2           # Filter by conversation IDs
lightspeed-eval --tags basic --conv-ids special    # Filter by either (OR logic)

Usage Scenarios

Please make any necessary modifications to system.yaml and evaluation_data.yaml. The evaluation_data.yaml file includes sample data for guidance.

1. API-Enabled Real-Time Data Collection

# Set required environment variable(s) for both Judge-LLM and API authentication (for MCP)
export OPENAI_API_KEY="your-evaluation-llm-key"

export API_KEY="your-api-endpoint-key"

# Ensure API is running at configured endpoint
# Default: http://localhost:8080/v1/

# Run with API-enabled configuration
lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml

2. Static Data Evaluation (API Disabled)

# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"

# Use system configuration with api.enabled: false
# Pre-populate response, contexts & tool_calls in the input evaluation data file
lightspeed-eval --system-config config/system_api_disabled.yaml --eval-data config/evaluation_data.yaml

📊 Supported Metrics

Turn-Level (Single Query)

Conversation-Level (Multi-turn)

Custom Metrics with GEval (from DeepEval)

GEval lets you define custom evaluation metrics using natural-language criteria. Define metrics in system.yaml under metrics_metadata:

metrics_metadata:
  turn_level:
    "geval:custom_metric_name":
      criteria: |
        Specific criteria for the evaluation.
      evaluation_params:
        - query
        - response
        - expected_response  # optional
      evaluation_steps:
        - "Step 1: Check if..."
        - "Step 2: Verify that..."
      threshold: 0.7
      description: "Metric description"

See the sample system config for complete examples of geval:technical_accuracy and geval:conversation_coherence.

⚙️ Configuration

System Config (config/system.yaml)

The default system config file is config/system.yaml. See docs/configuration.md for a detailed description.
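
To give a feel for its shape, here is a hedged sketch of a minimal system config. Only keys mentioned elsewhere in this README (api.enabled, api.endpoint_type, cache_enabled, embedding.provider, metrics_metadata, max_tokens, default) are grounded; the exact nesting and the judge LLM section name are assumptions, so treat config/system.yaml and docs/configuration.md as authoritative:

llm:                           # judge LLM section (name assumed)
  provider: openai
  model: gpt-4o-mini           # illustrative model choice
  max_tokens: 2048             # raise if context-heavy metrics hit parsing errors
api:
  enabled: true                # fetch response/contexts/tool_calls from the running API
  endpoint_type: streaming     # enables streaming performance metrics
cache_enabled: true            # cached responses live under .caches/
embedding:
  provider: openai             # or huggingface with the local-embeddings extra
metrics_metadata:
  turn_level:
    "ragas:faithfulness":
      threshold: 0.8
      default: true            # runs unless a turn overrides turn_metrics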

Input File Data Structure (config/evaluation_data.yaml)

- conversation_group_id: "test_conversation"
  description: "Sample evaluation"
  tag: "basic"  # Optional: Tag for grouping eval conversations (default: "eval")
  
  # Optional: Environment setup/cleanup scripts, when API is enabled
  setup_script: "scripts/setup_env.sh"      # Run before conversation
  cleanup_script: "scripts/cleanup_env.sh"  # Run after conversation
  
  # Conversation-level metrics   
  conversation_metrics:
    - "deepeval:conversation_completeness"
  
  conversation_metrics_metadata:
    "deepeval:conversation_completeness":
      threshold: 0.8
  
  turns:
    - turn_id: id1
      query: What is OpenShift Virtualization?
      response: null                    # Populated by API if enabled, otherwise provide
      contexts:
        - OpenShift Virtualization is an extension of the OpenShift ...
      attachments: []                   # Attachments (Optional)
      expected_keywords: [["virtualization"], ["openshift"]]  # For keywords_eval evaluation
      expected_response: OpenShift Virtualization is an extension of the OpenShift Container Platform that allows running virtual machines alongside containers
      expected_intent: "explain a concept"  # Expected intent for intent evaluation
      
      # Per-turn metrics (overrides system defaults)
      turn_metrics:
        - "ragas:faithfulness"
        - "custom:keywords_eval"
        - "custom:answer_correctness"
        - "custom:intent_eval"
      
      # Per-turn metric configuration
      turn_metrics_metadata:
        "ragas:faithfulness": 
          threshold: 0.9  # Override system default
      # turn_metrics: null (omitted) → Use system defaults (metrics with default=true)
      
    - turn_id: id2
      query: Skip this turn evaluation
      turn_metrics: []                  # Skip evaluation for this turn

    - turn_id: id3
      query: Create a namespace called test-ns
      verify_script: "scripts/verify_namespace.sh"  # Script-based verification
      turn_metrics:
        - "script:action_eval"          # Script-based evaluation (if API is enabled)

Input File Data Structure Details

Conversation Data Fields

  • conversation_group_id (string, required): Unique identifier for the conversation
  • description (string, optional): Description of the conversation
  • tag (string, optional): Tag for grouping eval conversations (default: "eval")
  • setup_script (string, optional): Path to setup script (used when API is enabled)
  • cleanup_script (string, optional): Path to cleanup script (used when API is enabled)
  • conversation_metrics (list[string], optional): Conversation-level metrics, provide only to override system defaults
  • conversation_metrics_metadata (dict, optional): Conversation-level metric configuration, provide only to override system defaults
  • turns (list[TurnData], required): List of conversation turns

Turn Data Fields

  • turn_id (string, required): Unique identifier for the turn
  • query (string, required): The question/prompt to evaluate
  • response (string, 📋): Actual response from the system (✅ populated by API if enabled)
  • contexts (list[string], 📋): Context information for evaluation (✅ populated by API if enabled)
  • attachments (list[string], optional): Attachments
  • expected_keywords (list[list[string]], 📋): Expected keywords for keyword evaluation (list of alternatives)
  • expected_response (string or list[string], 📋): Expected response for comparison
  • expected_intent (string, 📋): Expected intent for intent evaluation
  • expected_tool_calls (list[list[list[dict]]], 📋): Expected tool call sequences (multiple alternative sets)
  • tool_calls (list[list[dict]]): Actual tool calls (✅ populated by API if enabled)
  • verify_script (string, 📋): Path to verification script
  • turn_metrics (list[string], optional): Turn-specific metrics to evaluate
  • turn_metrics_metadata (dict, optional): Turn-specific metric configuration

📋 Required based on metrics: Some fields are required only when using specific metrics

Examples

  • expected_keywords: Required for custom:keywords_eval (case-insensitive matching)
  • expected_response: Required for custom:answer_correctness
  • expected_intent: Required for custom:intent_eval
  • expected_tool_calls: Required for custom:tool_eval (multiple alternative sets format)
  • verify_script: Required for script:action_eval (used when API is enabled)
  • response: Required for most metrics (auto-populated if API enabled)

Multiple expected responses: For metrics that include expected_response in their required_fields (defined in METRIC_REQUIREMENTS), you can provide expected_response as a list of strings. The evaluator tests each expected response until one passes; if all fail, it returns the maximum score across attempts and records every score and its reason in the reason field. Note: this only works for metrics explicitly listed in METRIC_REQUIREMENTS; for other metrics (e.g. GEval), only the first item in the list is used. See the example config for multiple expected responses (evaluation_data_multiple_expected_responses.yaml).
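
A turn with multiple acceptable answers might then carry a list (illustrative wording):

expected_response:
  - OpenShift Virtualization lets you run virtual machines alongside containers on OpenShift.
  - OpenShift Virtualization is an extension of OpenShift Container Platform for running VMs next to containers.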

Metrics override behavior

  • null (or omitted): Use system global metrics (metrics with default: true)
  • [] (empty list): Skip evaluation for this turn
  • ["metric1", ...]: Use only the specified metrics, ignoring global metrics
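
In evaluation data, the three cases look like this (turn IDs, queries, and the metric name are illustrative):

turns:
  - turn_id: id_a
    query: What is OpenShift?              # turn_metrics omitted, system default metrics run
  - turn_id: id_b
    query: Skip this one
    turn_metrics: []                       # evaluation skipped for this turn
  - turn_id: id_c
    query: Check faithfulness only
    turn_metrics:
      - "ragas:faithfulness"               # only this metric runs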

Tool Evaluation

The custom:tool_eval metric supports flexible matching with multiple alternative patterns:

  • Format: [[[tool_calls, ...]], [[tool_calls]], ...] (list of list of list)
  • Matching: Tries each alternative until one matches
  • Use Cases: Optional tools, multiple approaches, default arguments, skip scenarios
  • Empty Sets: [] represents "no tools" and must come after primary alternatives
  • Options:
    • ordered (default: true) — sequence order must match when true, ignored when false
    • full_match (default: true) — exact 1:1 match when true, partial match when false
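
As a sketch, these options could be tuned per turn; whether they belong under turn_metrics_metadata as shown here (rather than in the global metrics_metadata) is an assumption:

turn_metrics_metadata:
  "custom:tool_eval":
    ordered: false      # accept the expected tool calls in any order
    full_match: false   # tolerate additional, unexpected tool calls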

Tool Call Structure

# Multiple alternative sets format: [[[tool_calls, ...]], [[tool_calls]], ...]
expected_tool_calls:
  - # Alternative 1: Primary approach
    - # Sequence 1
      - tool_name: oc_get
        arguments:
          kind: pod
          name: openshift-light*    # Regex patterns supported
    - # Sequence 2 (if multiple parallel tool calls needed)
      - tool_name: oc_describe
        arguments:
          kind: pod
  - # Alternative 2: Different approach
    - # Sequence 1
      - tool_name: kubectl_get
        arguments:
          resource: pods
  - # Alternative 3: Skip scenario (optional)
    []  # When model has information from previous conversation

Script-Based Evaluations

The framework supports script-based evaluations. Note: scripts only execute when the API is enabled; they are designed to test against actual environment changes.

  • Setup scripts: Run before conversation evaluation (e.g., create failed deployment for troubleshoot query)
  • Cleanup scripts: Run after conversation evaluation (e.g., cleanup failed deployment)
  • Verify scripts: Run per turn for the script:action_eval metric (e.g., validate whether a pod has been created)

# Example: evaluation_data.yaml
- conversation_group_id: infrastructure_test
  setup_script: ./scripts/setup_cluster.sh
  cleanup_script: ./scripts/cleanup_cluster.sh
  turns:
    - turn_id: turn_id
      query: Create a new cluster
      verify_script: ./scripts/verify_cluster.sh
      turn_metrics:
        - script:action_eval
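
A verify script is typically a small shell script that checks the resulting environment state. The assumption here, not spelled out in this README, is that a zero exit code counts as pass and a non-zero exit code as fail; the script name and check are hypothetical:

#!/bin/bash
# Hypothetical verify script (verify_namespace.sh): check that the namespace
# requested in the turn actually exists after the model's tool calls ran.
# Assumption: exit code 0 = pass, non-zero = fail.
set -euo pipefail
oc get namespace test-ns >/dev/null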

Script Path Resolution

Script paths in evaluation data can be specified in multiple ways:

  • Relative Paths: Resolved relative to the evaluation data YAML file location, not the current working directory
  • Absolute Paths: Used as-is
  • Home Directory Paths: Expanded to the user's home directory
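
For instance (illustrative paths):

setup_script: scripts/setup_env.sh            # relative: resolved next to the evaluation data YAML
cleanup_script: /opt/eval/cleanup_env.sh      # absolute: used as-is
verify_script: ~/eval/verify_namespace.sh     # home directory: expanded to the user's home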

🔑 Authentication & Environment

Required Environment Variables

For LLM as a Judge Evaluation (Always Required)

# Hosted vLLM (provider: hosted_vllm)
export HOSTED_VLLM_API_KEY="your-key"
export HOSTED_VLLM_API_BASE="https://your-vllm-endpoint/v1"

# OpenAI (provider: openai)
export OPENAI_API_KEY="your-openai-key"

# IBM Watsonx (provider: watsonx)
export WATSONX_API_KEY="your-key"
export WATSONX_API_BASE="https://us-south.ml.cloud.ibm.com"
export WATSONX_PROJECT_ID="your-project-id"

# Gemini (provider: gemini)
export GEMINI_API_KEY="your-key"

# Azure OpenAI (provider: azure)
export AZURE_API_KEY="your-azure-key"
export AZURE_API_BASE="https://your-resource.openai.azure.com/"
# AZURE_API_VERSION is optional

Note for Azure: the model field should be the Azure deployment name, not the model name (when these differ).
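
For example, with a gpt-4o deployment named my-gpt4o-deployment (illustrative names; the exact judge LLM section/key layout is an assumption):

llm:
  provider: azure
  model: my-gpt4o-deployment   # the Azure deployment name, not "gpt-4o"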

For Lightspeed Core API Integration (When api.enabled: true)

# API authentication for external system (MCP)
export API_KEY="your-api-endpoint-key"

📈 Output & Visualization

Generated Reports

  • CSV: Detailed results with status, scores, reasons
  • JSON: Summary statistics with score distributions
  • TXT: Human-readable summary
  • PNG: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown)

Key Metrics in Output

  • Status: PASS/FAIL/ERROR/SKIPPED
  • Actual Reasons: Reason for evaluation status/result
  • Score Statistics: Mean, median, standard deviation, min/max for every metric

Streaming Performance Metrics

When using the streaming endpoint (api.endpoint_type: streaming), the framework captures additional performance metrics:

  • time_to_first_token: Time in seconds from request start to the first content token received
  • streaming_duration: Total time in seconds to receive all tokens
  • tokens_per_second: Output throughput (tokens generated per second, excluding TTFT)

These metrics are included in:

  • CSV output: Per-result columns for each metric
  • JSON output: Per-result fields and aggregate statistics in streaming_performance
  • TXT output: Aggregate statistics (mean, median, min/max) in the summary
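
Enabling this capture only requires the endpoint type referenced above; a minimal sketch, with the other api keys omitted:

api:
  endpoint_type: streaming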

🧪 Development

Development Tools

# Install dev dependencies and git hooks
make install-deps-test

# Format code
make black-format

# Run all pre-commit checks at once (same as CI)
make pre-commit      # Runs: bandit, check-types, pyright, docstyle, ruff, pylint, black-check
# Or run each quality check individually:
make bandit          # Security scan
make check-types     # Type check
make pyright         # Type check
make docstyle        # Docstring style
make ruff            # Lint check
make pylint          # Lint check
make black-check     # Check formatting

# Run tests
make test            # Or: uv run pytest tests --cov=src

🔧 Troubleshooting

  • Parsing errors with context-related metrics (e.g., faithfulness): Increase max_tokens (e.g., to 2048 or higher, depending on the number and size of contexts)
  • API responses not changing after updates: Disable caching (cache_enabled: false) or delete the cache folder (.caches/)

Generate Answers (optional, for creating test data)

To generate answers, see README-generate-answers.

📄 License & Contributing

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Contributions are welcome; see the development setup above for code quality tools.
