A comprehensive framework for evaluating GenAI applications.
This is a WIP. We’re actively adding features, fixing issues, and expanding examples. Please give it a try, share feedback, and report bugs.
- Multi-Framework Support: Seamlessly use metrics from Ragas, DeepEval, and custom implementations
- Turn & Conversation-Level Evaluation: Support for both individual queries and multi-turn conversations
- Evaluation Types: Response, Context, Tool Call, Overall Conversation, and Script-based evaluation
- LLM Provider Flexibility: OpenAI, Watsonx, Gemini, vLLM and others
- API Integration: Direct integration with external API for real-time data generation (if enabled)
- Setup/Cleanup Scripts: Support for running setup and cleanup scripts before/after each conversation evaluation (applicable when API is enabled)
- Token Usage Tracking: Track input/output tokens for both API calls and Judge LLM evaluations
- Streaming Performance Metrics: Capture time-to-first-token (TTFT), streaming duration, and tokens/second when using streaming endpoint
- Statistical Analysis: Statistics for every metric with score distribution analysis
- Rich Output: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)
- Flexible Configuration: Configurable environment & metric metadata; global defaults with per-conversation/per-turn metric overrides
- Early Validation: Catch configuration errors before expensive LLM calls
- Concurrent Evaluation: Multi-threaded evaluation with configurable thread count
- Caching: LLM, embedding, and API response caching for faster re-runs
- Skip on Failure: Optionally skip remaining evaluations in a conversation when a turn evaluation fails (configurable globally or per conversation). When an API call or setup script fails, the affected metrics are always marked as ERROR.
# From Git
pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git
# Additional steps for local development
pip install uv
make install-tools

By default, lightspeed-evaluation uses remote embedding providers (OpenAI, Gemini). If you need local embedding models (HuggingFace/sentence-transformers), install with:
# Using pip
pip install 'lightspeed-evaluation[local-embeddings]'
# Using uv (from already cloned repo for local development)
uv sync --extra local-embeddings

Note: Local embeddings require PyTorch and related packages (~6GB). Only install if you need `embedding.provider: huggingface` in your configuration.
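If you do install the local-embeddings extra, the provider is switched in the system config. A minimal sketch follows, assuming the model is selected via a key next to provider (the model key and value below are illustrative, not confirmed by this README):

embedding:
  provider: huggingface                          # requires the local-embeddings extra
  model: sentence-transformers/all-MiniLM-L6-v2  # hypothetical local model identifier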
# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Optional: For script-based evaluations requiring Kubernetes access
export KUBECONFIG="/path/to/your/kubeconfig"
# Run evaluation
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --output-dir <OUTPUT_DIR>
# Run subset of evaluations (filter by tag or conversation ID)
lightspeed-eval --tags basic advanced # Filter by tags
lightspeed-eval --conv-ids conv_1 conv_2 # Filter by conversation IDs
lightspeed-eval --tags basic --conv-ids special   # Filter by either (OR logic)

Please make any necessary modifications to system.yaml and evaluation_data.yaml. The evaluation_data.yaml file includes sample data for guidance.
# Set required environment variable(s) for both Judge-LLM and API authentication (for MCP)
export OPENAI_API_KEY="your-evaluation-llm-key"
export API_KEY="your-api-endpoint-key"
# Ensure API is running at configured endpoint
# Default: http://localhost:8080/v1/
# Run with API-enabled configuration
lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml

# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Use system configuration with api.enabled: false
# You have to pre-generate response, contexts & tool_calls data in the input evaluation data file
lightspeed-eval --system-config config/system_api_disabled.yaml --eval-data config/evaluation_data.yaml

- Ragas -- docs on Ragas website
  - Response Evaluation
  - Context Evaluation
- Custom
  - Response Evaluation
    - `answer_correctness` - Compares the response against the expected response
    - `intent_eval` - Evaluates whether the response demonstrates the expected intent or purpose
    - `keywords_eval` - Keywords evaluation with alternatives (ALL keywords must match, case insensitive)
  - Tool Evaluation
    - `tool_eval` - Validates tool calls and arguments with regex pattern matching
- Script-based
  - Action Evaluation
    - `script:action_eval` - Executes verification scripts to validate actions (e.g., infrastructure changes)
- DeepEval -- docs on DeepEval website
GEval allows us to define custom evaluation metrics using natural language criteria. Define metrics in system.yaml under metrics_metadata:
metrics_metadata:
  turn_level:
    "geval:custom_metric_name":
      criteria: |
        Specific criteria for the evaluation.
      evaluation_params:
        - query
        - response
        - expected_response  # optional
      evaluation_steps:
        - "Step 1: Check if..."
        - "Step 2: Verify that..."
      threshold: 0.7
      description: "Metric description"

See sample system config for complete examples of geval:technical_accuracy and geval:conversation_coherence.
The default system config file is config/system.yaml.
See docs/configuration.md for the detailed description.
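For orientation, here is a hedged sketch of the kinds of settings referenced throughout this README. Section and key names beyond those explicitly quoted in this document (for example the judge LLM block layout) are assumptions; treat config/system.yaml and docs/configuration.md as the source of truth.

# Illustrative sketch only - consult config/system.yaml for the actual schema
llm:                    # judge LLM (section name is an assumption)
  provider: openai      # openai, watsonx, gemini, hosted_vllm, azure, ...
  model: gpt-4o-mini    # hypothetical model name
  max_tokens: 2048      # raise if context-heavy metrics hit parsing errors

api:
  enabled: true         # fetch response/contexts/tool_calls from the running API

embedding:
  provider: openai      # or huggingface for local embeddings

metrics_metadata:
  turn_level:
    "ragas:faithfulness":
      threshold: 0.8
      default: true     # evaluated whenever a turn does not override turn_metrics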
- conversation_group_id: "test_conversation"
  description: "Sample evaluation"
  tag: "basic"  # Optional: Tag for grouping eval conversations (default: "eval")
  # Optional: Environment setup/cleanup scripts, when API is enabled
  setup_script: "scripts/setup_env.sh"      # Run before conversation
  cleanup_script: "scripts/cleanup_env.sh"  # Run after conversation
  # Conversation-level metrics
  conversation_metrics:
    - "deepeval:conversation_completeness"
  conversation_metrics_metadata:
    "deepeval:conversation_completeness":
      threshold: 0.8
  turns:
    - turn_id: id1
      query: What is OpenShift Virtualization?
      response: null  # Populated by API if enabled, otherwise provide
      contexts:
        - OpenShift Virtualization is an extension of the OpenShift ...
      attachments: []  # Attachments (Optional)
      expected_keywords: [["virtualization"], ["openshift"]]  # For keywords_eval evaluation
      expected_response: OpenShift Virtualization is an extension of the OpenShift Container Platform that allows running virtual machines alongside containers
      expected_intent: "explain a concept"  # Expected intent for intent evaluation
      # Per-turn metrics (overrides system defaults)
      turn_metrics:
        - "ragas:faithfulness"
        - "custom:keywords_eval"
        - "custom:answer_correctness"
        - "custom:intent_eval"
      # Per-turn metric configuration
      turn_metrics_metadata:
        "ragas:faithfulness":
          threshold: 0.9  # Override system default
      # turn_metrics: null (omitted) → Use system defaults (metrics with default=true)
    - turn_id: id2
      query: Skip this turn evaluation
      turn_metrics: []  # Skip evaluation for this turn
    - turn_id: id3
      query: Create a namespace called test-ns
      verify_script: "scripts/verify_namespace.sh"  # Script-based verification
      turn_metrics:
- "script:action_eval" # Script-based evaluation (if API is enabled)| Field | Type | Required | Description |
|---|---|---|---|
conversation_group_id |
string | ✅ | Unique identifier for conversation |
description |
string | ❌ | Optional description |
tag |
string | ❌ | Tag for grouping eval conversations (default: "eval") |
setup_script |
string | ❌ | Path to setup script (Optional, used when API is enabled) |
cleanup_script |
string | ❌ | Path to cleanup script (Optional, used when API is enabled) |
conversation_metrics |
list[string] | ❌ | Conversation-level metrics (Optional, if override is required) |
conversation_metrics_metadata |
dict | ❌ | Conversation-level metric config (Optional, if override is required) |
turns |
list[TurnData] | ✅ | List of conversation turns |
| Field | Type | Required | Description | API Populated |
|---|---|---|---|---|
| `turn_id` | string | ✅ | Unique identifier for the turn | ❌ |
| `query` | string | ✅ | The question/prompt to evaluate | ❌ |
| `response` | string | 📋 | Actual response from system | ✅ (if API enabled) |
| `contexts` | list[string] | 📋 | Context information for evaluation | ✅ (if API enabled) |
| `attachments` | list[string] | ❌ | Attachments | ❌ |
| `expected_keywords` | list[list[string]] | 📋 | Expected keywords for keyword evaluation (list of alternatives) | ❌ |
| `expected_response` | string or list[string] | 📋 | Expected response for comparison | ❌ |
| `expected_intent` | string | 📋 | Expected intent for intent evaluation | ❌ |
| `expected_tool_calls` | list[list[list[dict]]] | 📋 | Expected tool call sequences (multiple alternative sets) | ❌ |
| `tool_calls` | list[list[dict]] | ❌ | Actual tool calls from API | ✅ (if API enabled) |
| `verify_script` | string | 📋 | Path to verification script | ❌ |
| `turn_metrics` | list[string] | ❌ | Turn-specific metrics to evaluate | ❌ |
| `turn_metrics_metadata` | dict | ❌ | Turn-specific metric configuration | ❌ |
📋 Required based on metrics: Some fields are required only when using specific metrics
Examples
- `expected_keywords`: Required for `custom:keywords_eval` (case insensitive matching)
- `expected_response`: Required for `custom:answer_correctness`
- `expected_intent`: Required for `custom:intent_eval`
- `expected_tool_calls`: Required for `custom:tool_eval` (multiple alternative sets format)
- `verify_script`: Required for `script:action_eval` (used when API is enabled)
- `response`: Required for most metrics (auto-populated if API enabled)
Multiple expected responses: For metrics that include `expected_response` in their required_fields (defined in METRIC_REQUIREMENTS), you can provide `expected_response` as a list of strings. The evaluator tests each expected response until one passes. If all fail, it returns the maximum score across all attempts and logs every score with its reason in the reason field. Note: This feature only works for metrics explicitly listed in METRIC_REQUIREMENTS; for other metrics (e.g., GEval), only the first item in the list is used. See the example config for multiple expected responses (evaluation_data_multiple_expected_responses.yaml).
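A minimal sketch of a turn entry supplying several acceptable answers (field names come from the table above; the answer wording is illustrative):

- turn_id: id1
  query: What is OpenShift Virtualization?
  expected_response:
    - OpenShift Virtualization is an extension of the OpenShift Container Platform for running virtual machines alongside containers.
    - An OpenShift add-on that lets you run and manage virtual machines next to containers.
  turn_metrics:
    - "custom:answer_correctness"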
| Override Value | Behavior |
|---|---|
| `null` (or omitted) | Use system global metrics (metrics with default: true) |
| `[]` (empty list) | Skip evaluation for this turn |
| `["metric1", ...]` | Use specified metrics only, ignore global metrics |
The custom:tool_eval metric supports flexible matching with multiple alternative patterns:
- Format: `[[[tool_calls, ...]], [[tool_calls]], ...]` (list of list of list)
- Matching: Tries each alternative until one matches
- Use Cases: Optional tools, multiple approaches, default arguments, skip scenarios
- Empty Sets: `[]` represents "no tools" and must come after primary alternatives
- Options (see the sketch after this list):
  - `ordered` (default: true) — sequence order must match when true, ignored when false
  - `full_match` (default: true) — exact 1:1 match when true, partial match when false
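This README does not show where ordered and full_match are set; the sketch below assumes they are passed as per-metric options under turn_metrics_metadata, which is an unverified assumption:

turn_metrics_metadata:
  "custom:tool_eval":
    ordered: false      # placement is an assumption; accept the expected sequences in any order
    full_match: false   # allow extra tool calls beyond the expected ones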
# Multiple alternative sets format: [[[tool_calls, ...]], [[tool_calls]], ...]
expected_tool_calls:
  - # Alternative 1: Primary approach
    - # Sequence 1
      - tool_name: oc_get
        arguments:
          kind: pod
          name: openshift-light*  # Regex patterns supported
    - # Sequence 2 (if multiple parallel tool calls needed)
      - tool_name: oc_describe
        arguments:
          kind: pod
  - # Alternative 2: Different approach
    - # Sequence 1
      - tool_name: kubectl_get
        arguments:
          resource: pods
  - # Alternative 3: Skip scenario (optional)
    []  # When model has information from previous conversation

The framework supports script-based evaluations. Note: Scripts only execute when the API is enabled - they're designed to test with actual environment changes.
- Setup scripts: Run before conversation evaluation (e.g., create failed deployment for troubleshoot query)
- Cleanup scripts: Run after conversation evaluation (e.g., cleanup failed deployment)
- Verify scripts: Run per turn for the `script:action_eval` metric (e.g., validate whether a pod has been created)
# Example: evaluation_data.yaml
- conversation_group_id: infrastructure_test
  setup_script: ./scripts/setup_cluster.sh
  cleanup_script: ./scripts/cleanup_cluster.sh
  turns:
    - turn_id: turn_id
      query: Create a new cluster
      verify_script: ./scripts/verify_cluster.sh
      turn_metrics:
        - script:action_eval

Script Path Resolution
Script paths in evaluation data can be specified in multiple ways:
- Relative Paths: Resolved relative to the evaluation data YAML file location, not the current working directory
- Absolute Paths: Used as-is
- Home Directory Paths: ~ expands to the user's home directory
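A small sketch combining the three forms (the script names are hypothetical):

- conversation_group_id: path_examples
  setup_script: scripts/setup_env.sh       # relative: resolved against the evaluation data YAML's directory
  cleanup_script: ~/eval/cleanup_env.sh    # home: "~" expands to the user's home directory
  turns:
    - turn_id: t1
      query: Create a pod
      verify_script: /opt/eval/verify_pod.sh   # absolute: used as-is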
# Hosted vLLM (provider: hosted_vllm)
export HOSTED_VLLM_API_KEY="your-key"
export HOSTED_VLLM_API_BASE="https://your-vllm-endpoint/v1"
# OpenAI (provider: openai)
export OPENAI_API_KEY="your-openai-key"
# IBM Watsonx (provider: watsonx)
export WATSONX_API_KEY="your-key"
export WATSONX_API_BASE="https://us-south.ml.cloud.ibm.com"
export WATSONX_PROJECT_ID="your-project-id"
# Gemini (provider: gemini)
export GEMINI_API_KEY="your-key"
# Azure OpenAI (provider: azure)
export AZURE_API_KEY="your-azure-key"
export AZURE_API_BASE="https://your-resource.openai.azure.com/"
# AZURE_API_VERSION is optional

Note for Azure: The `model` field should be the Azure deployment name, not the model name (when these differ).
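A sketch of the corresponding judge-LLM settings (assuming the config exposes the provider and model keys referenced above; the deployment name is hypothetical):

provider: azure
model: my-gpt-4o-deployment   # the Azure deployment name, not the underlying model name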
# API authentication for external system (MCP)
export API_KEY="your-api-endpoint-key"

- CSV: Detailed results with status, scores, reasons
- JSON: Summary statistics with score distributions
- TXT: Human-readable summary
- PNG: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown)
- Status: PASS/FAIL/ERROR/SKIPPED
- Actual Reasons: Reason for evaluation status/result
- Score Statistics: Mean, median, standard deviation, min/max for every metric
When using the streaming endpoint (api.endpoint_type: streaming), the framework captures additional performance metrics:
| Metric | Description |
|---|---|
| `time_to_first_token` | Time in seconds from request start to first content token received |
| `streaming_duration` | Total time in seconds to receive all tokens |
| `tokens_per_second` | Output throughput (tokens generated per second, excluding TTFT) |
These metrics are included in:
- CSV output: Per-result columns for each metric
- JSON output: Per-result fields and aggregate statistics in `streaming_performance`
- TXT output: Aggregate statistics (mean, median, min/max) in the summary
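To collect these, point the evaluation at the streaming endpoint in the system config (api.endpoint_type is quoted above; api.enabled follows the same dotted notation used earlier in this README):

api:
  enabled: true
  endpoint_type: streaming   # captures time_to_first_token, streaming_duration, tokens_per_second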
# Install dev dependencies and git hooks
make install-deps-test
# Format code
make black-format
# Run all pre-commit checks at once (same as CI)
make pre-commit # Runs: bandit, check-types, pyright, docstyle, ruff, pylint, black-check
# Or run each quality check individually:
make bandit # Security scan
make check-types # Type check
make pyright # Type check
make docstyle # Docstring style
make ruff # Lint check
make pylint # Lint check
make black-check # Check formatting
# Run tests
make test        # Or: uv run pytest tests --cov=src

| Issue | Solution |
|---|---|
| Parsing error with context-related metrics (e.g., faithfulness) | Increase `max_tokens` to a higher value (e.g., 2048 or more, depending on the number and size of contexts) |
| API responses not changing after updates | Disable caching (`cache_enabled: false`) or delete the cache folders (`.caches/`) |
For generating answers (optional), refer to README-generate-answers.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Contributions welcome - see development setup above for code quality tools.