A production-grade framework for evaluating Large Language Model (LLM) responses using multiple metrics, including classical NLP metrics, semantic similarity, RAG-specific metrics, and LLM-as-a-Judge evaluations.
This framework provides a comprehensive suite of evaluation metrics for assessing LLM-generated responses. It's designed for researchers, ML engineers, and developers who need reliable, reproducible evaluation of language model outputs.
Key Capabilities:
- 7 Built-in Metrics: BLEU, ROUGE-L, BERTScore, Faithfulness, Context Relevancy, Answer Relevancy, Multi-dimensional Judge
- Rich Reporting: JSON results, markdown reports, radar charts, score histograms
- Fully Containerized: Docker and docker-compose ready
- CI/CD Integrated: GitHub Actions workflow with automated testing
- 80%+ Test Coverage: Comprehensive test suite with pytest
- Flexible Configuration: YAML-based configuration with Pydantic validation
- Multi-Provider Support: OpenAI and Groq LLM providers
- Beautiful Visualizations: matplotlib and seaborn charts
- BLEU: Bilingual Evaluation Understudy score for n-gram overlap
- ROUGE-L: Longest Common Subsequence-based recall metric
- BERTScore: Neural similarity using BERT embeddings
- Faithfulness: How well the response is grounded in retrieved contexts
- Context Relevancy: How relevant retrieved contexts are to the query
- Answer Relevancy: How relevant the response is to the original query
- Multi-dimensional Judge: Customizable dimensions (accuracy, clarity, completeness, etc.)
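To make the classical metrics concrete, here is a minimal, self-contained sketch of the longest-common-subsequence recall that underlies ROUGE-L (illustrative only; the framework itself delegates to the rouge-score package):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_recall(response: str, reference: str) -> float:
    """ROUGE-L recall: LCS length divided by reference token count."""
    resp, ref = response.split(), reference.split()
    return lcs_length(resp, ref) / len(ref) if ref else 0.0

# 5 of the 6 reference tokens appear in order in the response, so recall is 5/6
score = rouge_l_recall(
    "Python is a programming language.",
    "Python is a high-level programming language.",
)
```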
- JSON Results: Structured output for programmatic access
- Markdown Reports: Human-readable evaluation summaries
- Radar Charts: Multi-metric performance visualization
- Score Histograms: Distribution analysis for each metric
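To show how per-example scores roll up into such a report, here is a minimal sketch that renders score dicts as a markdown summary table; `summary_table` is a hypothetical helper for illustration, not the framework's markdown generator:

```python
def summary_table(rows: list[dict[str, float]]) -> str:
    """Render per-example metric scores as a markdown table with an average row."""
    metrics = sorted(rows[0])
    lines = [
        "| Example | " + " | ".join(metrics) + " |",
        "|---------|" + "------|" * len(metrics),
    ]
    for i, row in enumerate(rows, 1):
        lines.append(f"| {i} | " + " | ".join(f"{row[m]:.2f}" for m in metrics) + " |")
    # Close with a per-metric average row
    avgs = [sum(r[m] for r in rows) / len(rows) for m in metrics]
    lines.append("| **Avg** | " + " | ".join(f"{a:.2f}" for a in avgs) + " |")
    return "\n".join(lines)

table = summary_table([
    {"BLEU": 0.75, "ROUGE-L": 0.82},
    {"BLEU": 0.70, "ROUGE-L": 0.76},
])
```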
- Python 3.13 or higher
- Poetry (for dependency management)
- Docker & docker-compose (optional, for containerized deployment)
```bash
# Clone the repository
git clone https://github.com/AkhileshMalthi/llm-eval-framework.git
cd llm-eval-framework

# Install dependencies
poetry install

# Run commands with poetry run (e.g., poetry run llm-eval)
```

```bash
# Clone the repository
git clone https://github.com/AkhileshMalthi/llm-eval-framework.git
cd llm-eval-framework

# Install dependencies
pip install -e .
```

```bash
# Clone the repository
git clone https://github.com/AkhileshMalthi/llm-eval-framework.git
cd llm-eval-framework

# Build and run with docker-compose
docker-compose up
```

Create a `.env` file with your API keys:

```env
OPENAI_API_KEY=your_openai_key_here
GROQ_API_KEY=your_groq_key_here
```

Create a JSONL file with your evaluation data (see `benchmarks/rag_benchmark.jsonl` for the format):
```jsonl
{"query": "What is Python?", "response": "Python is a programming language.", "expected_answer": "Python is a high-level programming language.", "retrieved_contexts": ["Python is a popular programming language."]}
{"query": "What is the capital of France?", "response": "Paris", "expected_answer": "Paris is the capital of France.", "retrieved_contexts": ["Paris is the capital city of France."]}
```

Create a `config.yaml` file:
```yaml
eval_name: "My Evaluation"
dataset_path: "benchmarks/rag_benchmark.jsonl"
metrics:
  - bleu
  - rouge
  - bertscore
  - faithfulness
  - context_relevancy
  - answer_relevancy
  - judge
llm_judge:
  provider: "groq"                  # or "openai"
  model: "llama-3.3-70b-versatile"  # or "gpt-4"
```
```bash
# Using Poetry
poetry run llm-eval --config examples/config.yaml

# Using installed CLI
llm-eval --config examples/config.yaml

# Using Docker
docker-compose run llm-eval llm-eval --config examples/config.yaml
```

Results are saved to the `results/` directory:

- `results.json`: Raw evaluation data
- `report.md`: Human-readable markdown report
- `radar_chart.png`: Multi-metric radar visualization
- `score_histograms.png`: Score distribution charts
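Both the JSONL benchmark file and `results.json` are plain JSON, so they are easy to post-process with the standard library. A small sketch (the `load_jsonl` helper is illustrative, and the field names are taken from the examples above):

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# One benchmark row in the format shown above
row = json.loads(
    '{"query": "What is Python?", '
    '"response": "Python is a programming language.", '
    '"expected_answer": "Python is a high-level programming language.", '
    '"retrieved_contexts": ["Python is a popular programming language."]}'
)
```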
```bash
# Run with default config
llm-eval --config config.yaml

# Specify a custom benchmark file
llm-eval --config config.yaml --benchmark benchmarks/custom.jsonl

# Specify a custom output directory
llm-eval --config config.yaml --output-dir results/experiment1

# Adjust parallel workers
llm-eval --config config.yaml --max-workers 8
```

```python
from pathlib import Path

from llm_eval import Evaluator
from llm_eval.metrics.classical import BleuMetric, RougeLMetric
from llm_eval.metrics.semantic import BERTScoreMetric
from llm_eval.reporting.markdown_gen import generate_markdown_report
from llm_eval.reporting.visualizer import generate_radar_chart, generate_score_histograms

# Initialize metrics
metrics = [
    BleuMetric(),
    RougeLMetric(),
    BERTScoreMetric(),
]

# Create evaluator
evaluator = Evaluator(metrics=metrics, max_workers=4)

# Run evaluation
results_df = evaluator.run("benchmarks/rag_benchmark.jsonl")

# Generate reports
output_dir = Path("results")
output_dir.mkdir(exist_ok=True)

# Generate markdown report
report_md = generate_markdown_report(results_df, "My Evaluation")
with open(output_dir / "report.md", "w") as f:
    f.write(report_md)

# Generate visualizations
generate_radar_chart(results_df, output_dir / "radar_chart.png")
generate_score_histograms(results_df, output_dir)
```
```bash
# Build the container
docker-compose build

# Run evaluation
docker-compose up

# Run with custom config
docker-compose run llm-eval llm-eval --config /app/examples/custom_config.yaml

# Enter container for debugging
docker-compose run llm-eval bash
```

```
llm-eval-framework/
├── src/llm_eval/              # Main package
│   ├── __init__.py            # Package exports
│   ├── cli.py                 # Command-line interface
│   ├── config.py              # Configuration models
│   ├── evaluator.py           # Core evaluation engine
│   ├── metrics/               # Metric implementations
│   │   ├── base.py            # Base metric class
│   │   ├── classical.py       # BLEU, ROUGE
│   │   ├── semantic.py        # BERTScore
│   │   ├── rag.py             # RAG-specific metrics
│   │   └── judge.py           # LLM-as-a-Judge
│   ├── reporting/             # Report generation
│   │   ├── markdown_gen.py    # Markdown reports
│   │   └── visualizer.py      # Charts and plots
│   └── utils/                 # Utilities
│       └── llm_client.py      # LLM API client
├── tests/                     # Test suite (80%+ coverage)
├── benchmarks/                # Example benchmark data
├── examples/                  # Example configurations
├── results/                   # Evaluation outputs
├── Dockerfile                 # Container definition
├── docker-compose.yml         # Service orchestration
├── pyproject.toml             # Dependencies and metadata
└── .github/workflows/         # CI/CD pipelines
```
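As a rough illustration of how a core evaluation engine can fan examples out across `max_workers` threads, here is a sketch with a stand-in metric; this is an assumption about the design, not the actual `evaluator.py`:

```python
from concurrent.futures import ThreadPoolExecutor

def length_ratio(example: dict) -> dict:
    """Stand-in 'metric': response length relative to the expected answer, capped at 1."""
    score = min(len(example["response"]) / max(len(example["expected_answer"]), 1), 1.0)
    return {"query": example["query"], "score": round(score, 4)}

def run_parallel(examples: list[dict], max_workers: int = 4) -> list[dict]:
    """Score every example concurrently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(length_ratio, examples))

results = run_parallel([
    {"query": "q1", "response": "Paris", "expected_answer": "Paris is the capital of France."},
])
```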
All metrics inherit from `BaseMetric` and implement the `compute()` method:

```python
from llm_eval.metrics.base import BaseMetric, MetricResult

class CustomMetric(BaseMetric):
    def compute(self, query: str, response: str,
                reference: str = None, contexts: list = None) -> MetricResult:
        # Your metric logic here
        score = self._calculate_score(response, reference)
        return MetricResult(
            name="Custom Metric",
            score=score,
            details={"additional_info": "value"},
        )
```

Configuration is validated using Pydantic models:
```python
from llm_eval.config import EvalConfig, LLMJudgeConfig

config = EvalConfig(
    eval_name="My Evaluation",
    dataset_path="data.jsonl",
    metrics=["bleu", "rouge"],
    llm_judge=LLMJudgeConfig(
        provider="openai",
        model="gpt-4",
    ),
)
```

```python
# src/llm_eval/metrics/custom.py
from llm_eval.metrics.base import BaseMetric, MetricResult

class MyCustomMetric(BaseMetric):
    """Description of your metric."""

    def __init__(self, **kwargs):
        super().__init__()
        # Initialize any required resources

    def compute(self, query: str, response: str,
                reference: str = None, contexts: list = None) -> MetricResult:
        """Compute your metric."""
        # Implement your metric logic
        score = self._your_calculation(response, reference)
        return MetricResult(
            name="My Custom Metric",
            score=score,
            details={"key": "value"},
        )
```

Add your metric to the registry in `src/llm_eval/evaluator.py`:

```python
from llm_eval.metrics.custom import MyCustomMetric

METRIC_REGISTRY = {
    "bleu": BleuMetric,
    "rouge": RougeLMetric,
    "my_custom": MyCustomMetric,  # Add your metric
    # ...
}
```

Then enable it in your config:

```yaml
metrics:
  - my_custom
  - bleu
  - rouge
```

The framework includes comprehensive tests with 80%+ coverage:
```bash
# Run all tests
poetry run pytest

# Run with coverage report
poetry run pytest --cov=src/llm_eval --cov-report=html

# Run specific test file
poetry run pytest tests/test_classical_metrics.py

# Run with verbose output
poetry run pytest -v
```

```json
{
  "query": "What is Python?",
  "response": "Python is a programming language.",
  "expected_answer": "Python is a high-level programming language.",
  "metrics": {
    "BLEU": {
      "score": 0.7532,
      "details": {}
    },
    "ROUGE-L": {
      "score": 0.8246,
      "details": {}
    },
    "BERTScore": {
      "score": 0.9134,
      "details": {}
    },
    "Faithfulness": {
      "score": 0.85,
      "details": {}
    }
  }
}
```

```markdown
# LLM Evaluation Report

## Summary

- Total Examples: 28
- Average BLEU: 0.7234
- Average ROUGE-L: 0.7891
- Average BERTScore: 0.8923

## Detailed Results

| Query | BLEU | ROUGE-L | BERTScore |
|-------|------|---------|-----------|
| What is Python? | 0.75 | 0.82 | 0.91 |
| ... | ... | ... | ... |
```

Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Add tests for new functionality
- Ensure all tests pass (`poetry run pytest`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
```bash
# Install development dependencies
poetry install --with dev

# Run tests with coverage
poetry run pytest --cov=src/llm_eval

# Run linting
poetry run ruff check src/

# Format code
poetry run black src/
```

This project is licensed under the MIT License - see the LICENSE file for details.
- NLTK for BLEU implementation
- rouge-score for ROUGE metrics
- BERTScore for semantic similarity
- OpenAI and Groq for LLM-as-a-Judge capabilities
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: akhileshmalthi2299@gmail.com
Built with ❤️ for the ML community
