LlamaBench is a comprehensive benchmarking framework for evaluating and comparing Large Language Models (LLMs). It provides an easy-to-use interface for running standardized tests across different models and generating detailed performance reports.
- Cloud and Local Model Support: Benchmark models from major providers such as OpenAI and Anthropic, as well as local models via HuggingFace Transformers and LlamaCpp (see the configuration sketch after this list)
- Pre-defined Task Suites: Evaluate models on reasoning, coding, factual knowledge, safety, and more
- Custom Task Creation: Easily define your own benchmark tasks with custom examples and evaluation metrics
- Parallel Execution: Run benchmarks across multiple models simultaneously
- Flexible Output Formats: Generate reports in JSON, CSV, Markdown, and HTML formats
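As a minimal sketch of the local-model path, the snippet below reuses `ModelConfig`, `run`, and `get_suite` from the quick start further down; the `"huggingface"` and `"llamacpp"` provider strings, the model identifiers and GGUF path, and the `"coding"` suite name are assumptions for illustration rather than confirmed API values.

```python
from llamabench import run, ModelConfig
from llamabench.suites import get_suite

# Hypothetical local-model configurations: the provider strings and the
# model identifiers/paths below are assumptions, not confirmed values.
local_models = [
    ModelConfig(provider="huggingface", model="mistralai/Mistral-7B-Instruct-v0.2"),
    ModelConfig(provider="llamacpp", model="models/llama-2-7b.Q4_K_M.gguf"),
]

# Local models go through the same entry point as cloud models
# (the "coding" suite name is also an assumption).
results = run(models=local_models, suite=get_suite("coding"))
print(results.summary())
```

If the API permits it, local and cloud configurations could be mixed in one models list so a single run compares both.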
Install from PyPI:

```bash
pip install llamabench
```

Quick start:

```python
from llamabench import run, ModelConfig
from llamabench.suites import get_suite
# Define models to benchmark
models = [
ModelConfig(provider="openai", model="gpt-4-turbo", temperature=0.0),
ModelConfig(provider="anthropic", model="claude-3-opus-20240229", temperature=0.0),
]
# Get a predefined benchmark suite
reasoning_suite = get_suite("reasoning")
# Run the benchmark
results = run(models=models, suite=reasoning_suite)
# Print results
print(results.summary())
```

Check out the examples directory for more usage examples:
- Basic benchmark: Simple comparison of cloud models
- Custom tasks: Creating your own benchmark tasks (see the sketch after this list)
- Local models: Using HuggingFace and LlamaCpp models
- CLI usage: Command line interface examples
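The sketch below illustrates what a custom task definition might look like; the `llamabench.tasks.Task` import path, its constructor fields, the `"exact_match"` metric name, the `tasks=` argument to `run`, and the `to_markdown` export call are all hypothetical and only meant to mirror the features listed above.

```python
from llamabench import run, ModelConfig
from llamabench.tasks import Task  # hypothetical import path

# Hypothetical custom task definition: the Task class, its fields, and the
# "exact_match" metric name are assumptions for illustration.
capitals = Task(
    name="capital-cities",
    prompt_template="What is the capital of {country}? Answer with the city name only.",
    examples=[
        {"country": "France", "expected": "Paris"},
        {"country": "Japan", "expected": "Tokyo"},
    ],
    metric="exact_match",
)

models = [ModelConfig(provider="openai", model="gpt-4-turbo", temperature=0.0)]

# The tasks= argument and the to_markdown() export are likewise assumptions;
# the README only states that JSON, CSV, Markdown, and HTML reports are supported.
results = run(models=models, tasks=[capitals])
results.to_markdown("capital_report.md")
```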
This project is licensed under the MIT License.