This project is developed as part of Google Summer of Code 2025, mentored by Google DeepMind.
An open-source evaluation framework for large language models, providing systematic benchmarking across standardized academic tasks.
OpenEvals provides infrastructure for:
- Evaluating open-weight models on established benchmarks (MMLU, GSM8K, MATH, HumanEval, ARC, TruthfulQA, and more)
- Comparing performance across model families (Gemma, Llama, Mistral, Qwen, DeepSeek, and arbitrary HuggingFace models)
- Measuring computational efficiency metrics (latency, throughput, memory utilization)
- Generating statistical analyses with confidence intervals
- Producing publication-ready visualizations and reports
Requirements:

- Python 3.10+
- CUDA 11.8+ (for GPU acceleration)
- 50 GB+ disk space (models and datasets)
These are rule-of-thumb estimates for dense decoder-only models at moderate context lengths; with long contexts or large batch sizes, the KV cache can come to dominate VRAM. A back-of-envelope estimator is sketched after the table.
| Model Size | Minimum VRAM | Recommended VRAM | VRAM with Quantization |
|---|---|---|---|
| 1-3B | 4 GB | 8 GB | 2 GB |
| 7-9B | 8 GB | 16 GB | 5 GB |
| 13-14B | 16 GB | 24 GB | 8 GB |
| 70B+ | 40 GB | 80 GB+ | 20 GB |
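As a rough sanity check on the table, VRAM is approximately model weights plus KV cache. The sketch below uses generic rule-of-thumb formulas; the layer count, head dimensions, and context length are illustrative placeholders, not values from any particular model:

```python
def estimate_vram_gb(params_billion, bytes_per_param=2,   # 2 bytes = fp16/bf16, ~0.5 = 4-bit
                     layers=32, kv_heads=8, head_dim=128,
                     context_len=4096, batch_size=1):
    """Back-of-envelope VRAM estimate: model weights + KV cache."""
    weights = params_billion * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, each kv_heads * head_dim wide,
    # one entry per context token, 2 bytes each in fp16/bf16.
    kv_cache = 2 * layers * kv_heads * head_dim * context_len * batch_size * 2
    return (weights + kv_cache) / 1e9

print(f"{estimate_vram_gb(8):.1f} GB")        # ~8B model, bf16 weights
print(f"{estimate_vram_gb(8, 0.5):.1f} GB")   # same model with 4-bit weights
```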
Installation:

git clone https://github.com/heilcheng/openevals.git
cd openevals
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

┌─────────────────────────────────────────────────────────────────────────────┐
│ OpenEvals Pipeline │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────────┐ ┌──────────────────────────────┐│
│ │ Input │ │ Model Under │ │ Model Response ││
│ │ Benchmark │───▶│ Test │───▶│ (Generated Answer) ││
│ │ Dataset │ │ (HuggingFace) │ │ ││
│ └─────────────┘ └─────────────────┘ └──────────────┬───────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Evaluation Engine ││
│ ├─────────────────────────────────────────────────────────────────────────┤│
│ │ ││
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────────────────┐││
│ │ │ Knowledge │ │ Code │ │ Efficiency │││
│ │ │ Tasks │ │ Generation │ │ Metrics │││
│ │ │ │ │ │ │ │││
│ │ │ • MMLU │ │ • HumanEval │ │ • Latency (ms) │││
│ │ │ • TruthfulQA │ │ • MBPP │ │ • Throughput (tok/s) │││
│ │ │ • ARC │ │ • Pass@k │ │ • Memory (GB) │││
│ │ │ • GPQA │ │ │ │ • GPU utilization │││
│ │ └────────────────┘ └────────────────┘ └────────────────────────────┘││
│ │ ││
│ │ ┌────────────────┐ ┌────────────────┐ ││
│ │ │ Mathematical │ │ Reasoning │ ││
│ │ │ Tasks │ │ Tasks │ ││
│ │ │ │ │ │ ││
│ │ │ • GSM8K │ │ • HellaSwag │ ││
│ │ │ • MATH │ │ • WinoGrande │ ││
│ │ │ • Chain-of- │ │ • BBH │ ││
│ │ │ Thought │ │ • IFEval │ ││
│ │ └────────────────┘ └────────────────┘ ││
│ │ ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Output: Scores, Reports, Visualizations ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
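The efficiency metrics in the diagram (latency, throughput, memory) can be measured along these lines. This is a generic sketch using transformers and torch rather than OpenEvals' internal instrumentation; the model ID is only an example and a CUDA GPU is assumed:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model, requires HF access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Explain the Pythagorean theorem.", return_tensors="pt").to(model.device)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Latency:    {elapsed * 1000:.1f} ms")
print(f"Throughput: {new_tokens / elapsed:.1f} tok/s")
print(f"Peak VRAM:  {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```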
Model naming convention: use either full HuggingFace model IDs or the shorthand names defined in the configuration (see the example after the table below).
| Family | Variants | Sizes | Notes |
|---|---|---|---|
| Gemma | Gemma, Gemma 2, Gemma 3 | 1B - 27B | Google DeepMind |
| Llama 3 | Llama 3, 3.1, 3.2 | 1B - 405B | Meta |
| Mistral | Mistral, Mixtral | 7B, 8x7B, 8x22B | Mistral AI |
| Qwen | Qwen 2, Qwen 2.5 | 0.5B - 72B | Alibaba |
| DeepSeek | DeepSeek, DeepSeek-R1 | 1.5B - 671B | DeepSeek |
| Phi | Phi-3 | Mini, Small, Medium | Microsoft |
| OLMo | OLMo | 1B, 7B | Allen AI |
| HuggingFace | Any model on Hub | Custom | Any compatible model |
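Per the naming convention above, either a configured shorthand or a full HuggingFace model ID should be usable wherever a model name is expected. A small illustration using the Python API shown later; the exact shorthand names depend on your configuration:

```python
from openevals.core.benchmark import Benchmark

benchmark = Benchmark("config.yaml")
benchmark.load_models([
    "llama3-8b",                            # shorthand defined in the configuration
    "meta-llama/Meta-Llama-3-8B-Instruct",  # full HuggingFace model ID
])
```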
| Benchmark | Category | Description | Metrics |
|---|---|---|---|
| MMLU | Knowledge | 57 subjects spanning STEM, humanities, social sciences | Per-subject accuracy |
| MMLU-Pro | Knowledge | Enhanced MMLU with more challenging questions | Accuracy |
| GSM8K | Mathematical | Grade school math word problems | Exact match |
| MATH | Mathematical | Competition math (AMC, AIME, Olympiad) | Accuracy |
| HumanEval | Code Generation | Python function completion tasks | Pass@k (see the sketch below the table) |
| MBPP | Code Generation | Mostly Basic Python Problems | Pass@k |
| ARC | Reasoning | Science questions (Easy and Challenge splits) | Accuracy |
| HellaSwag | Reasoning | Commonsense reasoning about situations | Accuracy |
| WinoGrande | Reasoning | Large-scale Winograd Schema Challenge | Accuracy |
| TruthfulQA | Truthfulness | Questions probing common misconceptions | MC1/MC2 accuracy |
| GPQA | Expert Knowledge | Graduate-level physics, biology, chemistry | Accuracy |
| IFEval | Instruction Following | Instruction following with verifiable constraints | Strict/loose accuracy |
| BBH | Reasoning | BIG-Bench Hard - 23 challenging tasks | Accuracy |
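HumanEval and MBPP are scored with Pass@k. Below is a minimal sketch of the standard unbiased estimator: the probability that at least one of k samples passes, given n generated samples of which c pass. It illustrates the metric itself, not necessarily OpenEvals' exact implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))   # 0.15
print(pass_at_k(n=20, c=3, k=10))  # much higher chance with 10 attempts
```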
Quick start:

# Set HuggingFace authentication
export HF_TOKEN=your_token_here
# Download benchmark datasets
python -m openevals.scripts.download_data --all
# Run evaluation
python -m openevals.scripts.run_benchmark \
--config configs/benchmark_config.yaml \
--models llama3-8b \
--tasks mmlu gsm8k \
--visualize

Example configuration (configs/benchmark_config.yaml):

models:
  llama3-8b:
    type: llama3
    size: 8b
    variant: instruct
    quantization: true
  qwen2.5-7b:
    type: qwen2.5
    size: 7b
    variant: instruct

tasks:
  mmlu:
    type: mmlu
    subset: all
    shot_count: 5
  gsm8k:
    type: gsm8k
    shot_count: 8
    use_chain_of_thought: true

evaluation:
  runs: 1
  batch_size: auto
  statistical_tests: true

output:
  path: ./results
  visualize: true
  export_formats: [json, yaml]

hardware:
  device: auto
  precision: bfloat16
  quantization: true
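The quantization and precision options reduce memory roughly as shown in the hardware table earlier. For reference, 4-bit loading with transformers and bitsandbytes looks like the sketch below; this is illustrative and not necessarily how OpenEvals loads models internally:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # store weights in 4-bit, compute in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    quantization_config=bnb_config,
    device_map="auto",
)
```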
Evaluations can also be driven directly from Python:

from openevals.core.benchmark import Benchmark

# Initialize
benchmark = Benchmark("config.yaml")
benchmark.load_models(["llama3-8b", "qwen2.5-7b"])
benchmark.load_tasks(["mmlu", "gsm8k"])
# Run evaluation
results = benchmark.run_benchmarks()
benchmark.save_results("results.yaml")
# Access results
for model_name, model_results in results.items():
    for task_name, task_results in model_results.items():
        accuracy = task_results["overall"]["accuracy"]
        print(f"{model_name} on {task_name}: {accuracy:.4f}")
An interactive web interface is available for browser-based evaluation:

# Backend
cd web/backend
pip install -r requirements.txt
uvicorn app.main:app --port 8000
# Frontend
cd web/frontend
npm install && npm run dev

The interface is then available at http://localhost:3000.
Features:
- Dashboard: Overview of benchmark runs and statistics
- Benchmarks: Configure and run evaluations
- Models: Manage model configurations
- Leaderboard: Compare model performance across tasks
- Results: Visualize scores and export reports
Each run writes its outputs to a timestamped directory:

results/
├── 20250128_143022/
│ ├── results.yaml
│ ├── summary.json
│ ├── visualizations/
│ │ ├── performance_overview.png
│ │ ├── model_comparison.png
│ │ └── task_breakdown.png
│ └── report.md
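Run artifacts can be loaded directly for downstream analysis. A minimal sketch, where the run directory name follows the timestamped example above and the exact keys inside the files depend on the run:

```python
import json
from pathlib import Path

import yaml  # PyYAML

run_dir = Path("results/20250128_143022")  # example run from the layout above
summary = json.loads((run_dir / "summary.json").read_text())
results = yaml.safe_load((run_dir / "results.yaml").read_text())
print(sorted(summary))  # inspect top-level keys before digging in
```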
Run the test suite:

pytest tests/

# With coverage
pytest --cov=openevals tests/openevals/

Repository layout:
├── openevals/ # Core Python library
│ ├── core/ # Orchestration and model loading
│ ├── tasks/ # Benchmark implementations
│ ├── utils/ # Metrics, validation, utilities
│ ├── visualization/ # Charts and reporting
│ └── scripts/ # CLI entry points
├── web/ # Web platform (Next.js + FastAPI)
├── configs/ # Configuration templates
├── data/ # Benchmark datasets
├── tests/ # Test suite
├── docs/ # Sphinx documentation
└── examples/ # Usage examples
Full documentation is available at: https://heilcheng.github.io/openevals/
If you use OpenEvals in your research, please cite:
@software{openevals2025,
author = {Cheng Hei Lam},
title = {OpenEvals: An Open-Source Evaluation Framework for Large Language Models},
year = {2025},
url = {https://github.com/heilcheng/openevals}
}

MIT License. See LICENSE for details.