A comprehensive framework for evaluating GenAI applications.
This is a WIP. We’re actively adding features, fixing issues, and expanding examples. Please give it a try, share feedback, and report bugs.
- Multi-Framework Support: Seamlessly use metrics from Ragas, DeepEval, and custom implementations
- Turn & Conversation-Level Evaluation: Support for both individual queries and multi-turn conversations
- Evaluation Types: Response, Context, Tool Call, Overall Conversation, and Script-based evaluation
- LLM Provider Flexibility: OpenAI, Watsonx, Gemini, vLLM and others
- API Integration: Direct integration with external API for real-time data generation (if enabled)
- Setup/Cleanup Scripts: Support for running setup and cleanup scripts before/after each conversation evaluation (applicable when API is enabled)
- Token Usage Tracking: Track input/output tokens for both API calls and Judge LLM evaluations
- Streaming Performance Metrics: Capture time-to-first-token (TTFT), streaming duration, and tokens/second when using streaming endpoint
- Statistical Analysis: Statistics for every metric with score distribution analysis
- Rich Output: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)
- Flexible Configuration: Configurable environment & metric metadata; global defaults with per-conversation/per-turn metric overrides
- Early Validation: Catch configuration errors before expensive LLM calls
- Concurrent Evaluation: Multi-threaded evaluation with configurable thread count
- Caching: LLM, embedding, and API response caching for faster re-runs
- Skip on Failure: Optionally skip remaining evaluations in a conversation when a turn evaluation fails (configurable globally or per conversation). When an API call or setup script fails, the affected metrics are always marked as ERROR.
# From Git
pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git
# Additional steps for local development
pip install uv
make install-tools

By default, lightspeed-evaluation uses remote embedding providers (OpenAI, Gemini). If you need local embedding models (HuggingFace/sentence-transformers), install with:
# Using pip
pip install 'lightspeed-evaluation[local-embeddings]'
# Using uv (from already cloned repo for local development)
uv sync --extra local-embeddings

Note: Local embeddings require PyTorch and related packages (~6GB). Only install if you need `embedding.provider: huggingface` in your configuration.
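If you do install the local-embeddings extra, the provider is switched in the system config. A minimal sketch follows, assuming the model is selected via a key next to provider (the model key and value below are illustrative, not confirmed by this README):

embedding:
  provider: huggingface                          # requires the local-embeddings extra
  model: sentence-transformers/all-MiniLM-L6-v2  # hypothetical local model identifier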
# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Optional: For script-based evaluations requiring Kubernetes access
export KUBECONFIG="/path/to/your/kubeconfig"
# Run evaluation
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --output-dir <OUTPUT_DIR>
# Run subset of evaluations (filter by tag or conversation ID)
lightspeed-eval --tags basic advanced # Filter by tags
lightspeed-eval --conv-ids conv_1 conv_2 # Filter by conversation IDs
lightspeed-eval --tags basic --conv-ids special   # Filter by either (OR logic)

Please make any necessary modifications to system.yaml and evaluation_data.yaml. The evaluation_data.yaml file includes sample data for guidance.
# Set required environment variable(s) for both Judge-LLM and API authentication (for MCP)
export OPENAI_API_KEY="your-evaluation-llm-key"
export API_KEY="your-api-endpoint-key"
# Ensure API is running at configured endpoint
# Default: http://localhost:8080/v1/
# Run with API-enabled configuration
lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml

# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Use system configuration with api.enabled: false
# You have to pre-generate response, contexts & tool_calls data in the input evaluation data file
lightspeed-eval --system-config config/system_api_disabled.yaml --eval-data config/evaluation_data.yaml

- Ragas -- docs on Ragas website
  - Response Evaluation
  - Context Evaluation
- Custom
  - Response Evaluation
    - `answer_correctness` - Compares the response against the expected response
    - `intent_eval` - Evaluates whether the response demonstrates the expected intent or purpose
    - `keywords_eval` - Keywords evaluation with alternatives (ALL keywords must match, case insensitive)
  - Tool Evaluation
    - `tool_eval` - Validates tool calls and arguments with regex pattern matching
- Script-based
  - Action Evaluation
    - `script:action_eval` - Executes verification scripts to validate actions (e.g., infrastructure changes)
- DeepEval -- docs on DeepEval website
GEval allows us to define custom evaluation metrics using natural language criteria. Define metrics in system.yaml under metrics_metadata:
metrics_metadata:
  turn_level:
    "geval:custom_metric_name":
      criteria: |
        Specific criteria for the evaluation.
      evaluation_params:
        - query
        - response
        - expected_response  # optional
      evaluation_steps:
        - "Step 1: Check if..."
        - "Step 2: Verify that..."
      threshold: 0.7
      description: "Metric description"

See sample system config for complete examples of geval:technical_accuracy and geval:conversation_coherence.
The default system config file is config/system.yaml.
See docs/configuration.md for the detailed description.
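For orientation, here is a hedged sketch of the kinds of settings referenced throughout this README. Section and key names beyond those explicitly quoted in this document (for example the judge LLM block layout) are assumptions; treat config/system.yaml and docs/configuration.md as the source of truth.

# Illustrative sketch only - consult config/system.yaml for the actual schema
llm:                    # judge LLM (section name is an assumption)
  provider: openai      # openai, watsonx, gemini, hosted_vllm, azure, ...
  model: gpt-4o-mini    # hypothetical model name
  max_tokens: 2048      # raise if context-heavy metrics hit parsing errors

api:
  enabled: true         # fetch response/contexts/tool_calls from the running API

embedding:
  provider: openai      # or huggingface for local embeddings

metrics_metadata:
  turn_level:
    "ragas:faithfulness":
      threshold: 0.8
      default: true     # evaluated whenever a turn does not override turn_metrics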
- conversation_group_id: "test_conversation"
  description: "Sample evaluation"
  tag: "basic"  # Optional: Tag for grouping eval conversations (default: "eval")
  # Optional: Environment setup/cleanup scripts, when API is enabled
  setup_script: "scripts/setup_env.sh"      # Run before conversation
  cleanup_script: "scripts/cleanup_env.sh"  # Run after conversation
  # Conversation-level metrics
  conversation_metrics:
    - "deepeval:conversation_completeness"
  conversation_metrics_metadata:
    "deepeval:conversation_completeness":
      threshold: 0.8
  turns:
    - turn_id: id1
      query: What is OpenShift Virtualization?
      response: null  # Populated by API if enabled, otherwise provide
      contexts:
        - OpenShift Virtualization is an extension of the OpenShift ...
      attachments: []  # Attachments (Optional)
      expected_keywords: [["virtualization"], ["openshift"]]  # For keywords_eval evaluation
      expected_response: OpenShift Virtualization is an extension of the OpenShift Container Platform that allows running virtual machines alongside containers
      expected_intent: "explain a concept"  # Expected intent for intent evaluation
      # Per-turn metrics (overrides system defaults)
      turn_metrics:
        - "ragas:faithfulness"
        - "custom:keywords_eval"
        - "custom:answer_correctness"
        - "custom:intent_eval"
      # Per-turn metric configuration
      turn_metrics_metadata:
        "ragas:faithfulness":
          threshold: 0.9  # Override system default
      # turn_metrics: null (omitted) → Use system defaults (metrics with default=true)
    - turn_id: id2
      query: Skip this turn evaluation
      turn_metrics: []  # Skip evaluation for this turn
    - turn_id: id3
      query: Create a namespace called test-ns
      verify_script: "scripts/verify_namespace.sh"  # Script-based verification
      turn_metrics:
- "script:action_eval" # Script-based evaluation (if API is enabled)| Field | Type | Required | Description |
|---|---|---|---|
conversation_group_id |
string | ✅ | Unique identifier for conversation |
description |
string | ❌ | Optional description |
tag |
string | ❌ | Tag for grouping eval conversations (default: "eval") |
setup_script |
string | ❌ | Path to setup script (Optional, used when API is enabled) |
cleanup_script |
string | ❌ | Path to cleanup script (Optional, used when API is enabled) |
conversation_metrics |
list[string] | ❌ | Conversation-level metrics (Optional, if override is required) |
conversation_metrics_metadata |
dict | ❌ | Conversation-level metric config (Optional, if override is required) |
turns |
list[TurnData] | ✅ | List of conversation turns |
| Field | Type | Required | Description | API Populated |
|---|---|---|---|---|
| `turn_id` | string | ✅ | Unique identifier for the turn | ❌ |
| `query` | string | ✅ | The question/prompt to evaluate | ❌ |
| `response` | string | 📋 | Actual response from system | ✅ (if API enabled) |
| `contexts` | list[string] | 📋 | Context information for evaluation | ✅ (if API enabled) |
| `attachments` | list[string] | ❌ | Attachments | ❌ |
| `expected_keywords` | list[list[string]] | 📋 | Expected keywords for keyword evaluation (list of alternatives) | ❌ |
| `expected_response` | string or list[string] | 📋 | Expected response for comparison | ❌ |
| `expected_intent` | string | 📋 | Expected intent for intent evaluation | ❌ |
| `expected_tool_calls` | list[list[list[dict]]] | 📋 | Expected tool call sequences (multiple alternative sets) | ❌ |
| `tool_calls` | list[list[dict]] | ❌ | Actual tool calls from API | ✅ (if API enabled) |
| `verify_script` | string | 📋 | Path to verification script | ❌ |
| `turn_metrics` | list[string] | ❌ | Turn-specific metrics to evaluate | ❌ |
| `turn_metrics_metadata` | dict | ❌ | Turn-specific metric configuration | ❌ |
📋 Required based on metrics: Some fields are required only when using specific metrics
Examples
- `expected_keywords`: Required for `custom:keywords_eval` (case insensitive matching)
- `expected_response`: Required for `custom:answer_correctness`
- `expected_intent`: Required for `custom:intent_eval`
- `expected_tool_calls`: Required for `custom:tool_eval` (multiple alternative sets format)
- `verify_script`: Required for `script:action_eval` (used when API is enabled)
- `response`: Required for most metrics (auto-populated if API enabled)
Multiple expected responses: For metrics that include `expected_response` in their required_fields (defined in METRIC_REQUIREMENTS), you can provide `expected_response` as a list of strings. The evaluator tests each expected response until one passes. If all fail, it returns the maximum score across all attempts and logs every score with its reason in the reason field. Note: This feature only works for metrics explicitly listed in METRIC_REQUIREMENTS; for other metrics (e.g., GEval), only the first item in the list is used. See the example config for multiple expected responses (evaluation_data_multiple_expected_responses.yaml).
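A minimal sketch of a turn entry supplying several acceptable answers (field names come from the table above; the answer wording is illustrative):

- turn_id: id1
  query: What is OpenShift Virtualization?
  expected_response:
    - OpenShift Virtualization is an extension of the OpenShift Container Platform for running virtual machines alongside containers.
    - An OpenShift add-on that lets you run and manage virtual machines next to containers.
  turn_metrics:
    - "custom:answer_correctness"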
| Override Value | Behavior |
|---|---|
| `null` (or omitted) | Use system global metrics (metrics with default: true) |
| `[]` (empty list) | Skip evaluation for this turn |
| `["metric1", ...]` | Use specified metrics only, ignore global metrics |
The custom:tool_eval metric supports flexible matching with multiple alternative patterns:
- Format: `[[[tool_calls, ...]], [[tool_calls]], ...]` (list of list of list)
- Matching: Tries each alternative until one matches
- Use Cases: Optional tools, multiple approaches, default arguments, skip scenarios
- Empty Sets: `[]` represents "no tools" and must come after primary alternatives
- Options (see the sketch after this list):
  - `ordered` (default: true) — sequence order must match when true, ignored when false
  - `full_match` (default: true) — exact 1:1 match when true, partial match when false
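This README does not show where ordered and full_match are set; the sketch below assumes they are passed as per-metric options under turn_metrics_metadata, which is an unverified assumption:

turn_metrics_metadata:
  "custom:tool_eval":
    ordered: false      # placement is an assumption; accept the expected sequences in any order
    full_match: false   # allow extra tool calls beyond the expected ones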
# Multiple alternative sets format: [[[tool_calls, ...]], [[tool_calls]], ...]
expected_tool_calls:
  - # Alternative 1: Primary approach
    - # Sequence 1
      - tool_name: oc_get
        arguments:
          kind: pod
          name: openshift-light*  # Regex patterns supported
    - # Sequence 2 (if multiple parallel tool calls needed)
      - tool_name: oc_describe
        arguments:
          kind: pod
  - # Alternative 2: Different approach
    - # Sequence 1
      - tool_name: kubectl_get
        arguments:
          resource: pods
  - # Alternative 3: Skip scenario (optional)
    []  # When model has information from previous conversation

The framework supports script-based evaluations. Note: Scripts only execute when the API is enabled - they're designed to test with actual environment changes.
- Setup scripts: Run before conversation evaluation (e.g., create failed deployment for troubleshoot query)
- Cleanup scripts: Run after conversation evaluation (e.g., cleanup failed deployment)
- Verify scripts: Run per turn for the `script:action_eval` metric (e.g., validate whether a pod has been created)
# Example: evaluation_data.yaml
- conversation_group_id: infrastructure_test
  setup_script: ./scripts/setup_cluster.sh
  cleanup_script: ./scripts/cleanup_cluster.sh
  turns:
    - turn_id: turn_id
      query: Create a new cluster
      verify_script: ./scripts/verify_cluster.sh
      turn_metrics:
        - script:action_eval

Script Path Resolution
Script paths in evaluation data can be specified in multiple ways:
- Relative Paths: Resolved relative to the evaluation data YAML file location, not the current working directory
- Absolute Paths: Used as-is
- Home Directory Paths: ~ expands to the user's home directory
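A small sketch combining the three forms (the script names are hypothetical):

- conversation_group_id: path_examples
  setup_script: scripts/setup_env.sh       # relative: resolved against the evaluation data YAML's directory
  cleanup_script: ~/eval/cleanup_env.sh    # home: "~" expands to the user's home directory
  turns:
    - turn_id: t1
      query: Create a pod
      verify_script: /opt/eval/verify_pod.sh   # absolute: used as-is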
# Hosted vLLM (provider: hosted_vllm)
export HOSTED_VLLM_API_KEY="your-key"
export HOSTED_VLLM_API_BASE="https://your-vllm-endpoint/v1"
# OpenAI (provider: openai)
export OPENAI_API_KEY="your-openai-key"
# IBM Watsonx (provider: watsonx)
export WATSONX_API_KEY="your-key"
export WATSONX_API_BASE="https://us-south.ml.cloud.ibm.com"
export WATSONX_PROJECT_ID="your-project-id"
# Gemini (provider: gemini)
export GEMINI_API_KEY="your-key"
# Azure OpenAI (provider: azure)
export AZURE_API_KEY="your-azure-key"
export AZURE_API_BASE="https://your-resource.openai.azure.com/"
# AZURE_API_VERSION is optional

Note for Azure: The `model` field should be the Azure deployment name, not the model name (when these differ).
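A sketch of the corresponding judge-LLM settings (assuming the config exposes the provider and model keys referenced above; the deployment name is hypothetical):

provider: azure
model: my-gpt-4o-deployment   # the Azure deployment name, not the underlying model name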
# API authentication for external system (MCP)
export API_KEY="your-api-endpoint-key"

- CSV: Detailed results with status, scores, reasons
- JSON: Summary statistics with score distributions
- TXT: Human-readable summary
- PNG: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown)
- Status: PASS/FAIL/ERROR/SKIPPED
- Actual Reasons: Reason for evaluation status/result
- Score Statistics: Mean, median, standard deviation, min/max for every metric
When using the streaming endpoint (api.endpoint_type: streaming), the framework captures additional performance metrics:
| Metric | Description |
|---|---|
| `time_to_first_token` | Time in seconds from request start to first content token received |
| `streaming_duration` | Total time in seconds to receive all tokens |
| `tokens_per_second` | Output throughput (tokens generated per second, excluding TTFT) |
These metrics are included in:
- CSV output: Per-result columns for each metric
- JSON output: Per-result fields and aggregate statistics in `streaming_performance`
- TXT output: Aggregate statistics (mean, median, min/max) in the summary
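To collect these, point the evaluation at the streaming endpoint in the system config (api.endpoint_type is quoted above; api.enabled follows the same dotted notation used earlier in this README):

api:
  enabled: true
  endpoint_type: streaming   # captures time_to_first_token, streaming_duration, tokens_per_second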
# Install dev dependencies and git hooks
make install-deps-test
# Format code
make black-format
# Run all pre-commit checks at once (same as CI)
make pre-commit # Runs: bandit, check-types, pyright, docstyle, ruff, pylint, black-check
# Or run each quality check individually:
make bandit # Security scan
make check-types # Type check
make pyright # Type check
make docstyle # Docstring style
make ruff # Lint check
make pylint # Lint check
make black-check # Check formatting
# Run tests
make test        # Or: uv run pytest tests --cov=src

| Issue | Solution |
|---|---|
| Parsing error with context-related metrics (e.g., faithfulness) | Increase `max_tokens` to a higher value (e.g., 2048 or more, depending on the number and size of contexts) |
| API responses not changing after updates | Disable caching (`cache_enabled: false`) or delete the cache folders (`.caches/`) |
For generating answers (optional), refer to README-generate-answers.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Contributions welcome - see development setup above for code quality tools.