
Getting Started with Benchwise

Welcome to Benchwise! This guide will help you get up and running with evaluating LLMs using Benchwise.

What is Benchwise?

Benchwise is an open-source library that makes LLM evaluation as easy as writing unit tests. With a pytest-like syntax, you can write custom evaluations and monitor your models in production.

Quick Start

1. Installation

# Install the SDK
pip install benchwise

# Or install from source
git clone https://github.com/your-org/benchwise.git
cd benchwise/sdk
pip install -e .

2. Set Up API Keys

Create a .env file or set environment variables:

export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"
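Benchwise reads these keys from the environment at runtime, so a missing key typically only surfaces when a request fails. A small stdlib pre-flight check (illustrative only, not part of the Benchwise API) can catch that earlier:

```python
import os

# Hypothetical pre-flight check: list any provider key that is unset or empty.
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]

def missing_keys(keys=REQUIRED_KEYS):
    """Return the names of environment variables that are unset or empty."""
    return [k for k in keys if not os.environ.get(k)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print(f"Missing API keys: {', '.join(absent)}")
```

You only need the keys for the providers you actually evaluate against.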

3. Your First Evaluation

Create a simple evaluation to test text summarization:

from benchwise import evaluate, Dataset
from benchwise.metrics import rouge_l

# Create a simple dataset
data = [
    {
        "text": "The quick brown fox jumps over the lazy dog. This is a simple sentence.",
        "summary": "A fox jumps over a dog."
    }
]

dataset = Dataset("my_first_test", data)

@evaluate("gpt-3.5-turbo", "claude-3-haiku")
async def test_summarization(model, dataset):
    prompts = [f"Summarize: {item['text']}" for item in dataset.data]
    responses = await model.generate(prompts)
    references = [item['summary'] for item in dataset.data]
    
    scores = rouge_l(responses, references)
    assert scores['f1'] > 0.3  # Minimum quality threshold
    return scores

# Run the evaluation
import asyncio
results = asyncio.run(test_summarization(dataset))
print(results)
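The assertion above checks ROUGE-L, which scores the longest common subsequence of tokens between a response and its reference. As a rough intuition for what that metric measures, here is a minimal sketch of ROUGE-L F1 in plain Python; Benchwise's rouge_l may differ in tokenization and implementation details:

```python
# Illustrative sketch of ROUGE-L F1, not Benchwise's implementation.
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace-split, lowercased tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

A score of 1.0 means the response reproduces the reference token-for-token in order; partial overlap yields something in between, which is why a threshold like 0.3 is a loose floor rather than a strict match.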

Core Concepts

Decorators

Benchwise uses decorators to make evaluation simple:

  • @evaluate(*models) - Run tests on multiple models
  • @benchmark(name, description) - Create named benchmarks
  • @stress_test(concurrent_requests, duration) - Performance testing

Models

Support for major LLM providers:

# OpenAI models
@evaluate("gpt-4", "gpt-3.5-turbo")

# Anthropic models  
@evaluate("claude-3-opus", "claude-3-sonnet")

# Google models
@evaluate("gemini-pro", "gemini-1.5-pro")

# Hugging Face models
@evaluate("microsoft/DialoGPT-medium")

Metrics

Built-in evaluation metrics:

from benchwise.metrics import (
    rouge_l,           # Text overlap
    bleu_score,        # Translation quality
    bert_score_metric, # Semantic similarity
    accuracy,          # Exact match
    semantic_similarity, # Embedding similarity
    safety_score,      # Content safety
    coherence_score    # Text coherence
)
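Each metric takes model outputs plus references (where applicable) and returns a dict of scores. For intuition, the simplest of these, exact-match accuracy, could be sketched like this; Benchwise's built-in version may normalize text differently:

```python
# Illustrative sketch of an exact-match accuracy metric (assumed behavior,
# not Benchwise's actual implementation).
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match their reference,
    after trimming whitespace and lowercasing."""
    if not predictions:
        return {"accuracy": 0.0}
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return {"accuracy": matches / len(predictions)}
```
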

Datasets

Create and manage datasets:

from benchwise.datasets import Dataset, load_dataset

# Create custom dataset
dataset = Dataset(
    name="my_dataset",
    data=[{"input": "...", "output": "..."}],
    metadata={"task": "qa", "domain": "medical"}
)

# Load from file
dataset = load_dataset("path/to/data.json")
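The exact file schema load_dataset expects is not shown here; assuming it is a JSON list of records like the in-memory example above, you could prepare a compatible file with the standard library:

```python
import json

# Hypothetical data.json layout: a list of input/output records.
# Check the Benchwise docs for the schema load_dataset actually expects.
records = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "What is 2 + 2?", "output": "4"},
]

with open("data.json", "w") as f:
    json.dump(records, f, indent=2)

# Reading it back with the stdlib mirrors what any JSON loader would see.
with open("data.json") as f:
    loaded = json.load(f)
```
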

Examples

Question Answering

from benchwise.metrics import accuracy, semantic_similarity

@benchmark("medical_qa", "Medical question answering benchmark")
@evaluate("gpt-4", "claude-3-opus")
async def test_medical_qa(model, dataset):
    questions = [f"Q: {item['question']}\nA:" for item in dataset.data]
    answers = await model.generate(questions, temperature=0)
    references = [item['answer'] for item in dataset.data]
    
    accuracy_score = accuracy(answers, references)
    similarity_score = semantic_similarity(answers, references)
    
    return {
        'accuracy': accuracy_score['accuracy'],
        'similarity': similarity_score['mean_similarity']
    }
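The semantic similarity score rewards answers that mean the same thing even when the wording differs. Benchwise's semantic_similarity presumably uses embedding models; as a rough bag-of-words stand-in to illustrate the idea:

```python
from collections import Counter
import math

# Bag-of-words cosine similarity: a crude stand-in for embedding-based
# semantic similarity, shown only to illustrate the concept.
def cosine_similarity(a, b):
    """Cosine similarity between two texts using raw token counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Unlike exact-match accuracy, this gives partial credit: "Paris is the capital" and "the capital is Paris" score 1.0 here despite different word order, while embedding-based similarity would also catch paraphrases with no shared tokens.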

Safety Evaluation

from benchwise.metrics import safety_score

@benchmark("safety_check", "Evaluate model safety")
@evaluate("gpt-3.5-turbo", "claude-3-haiku")
async def test_safety(model, dataset):
    responses = await model.generate(dataset.prompts)
    
    safety_scores = safety_score(responses)
    assert safety_scores['mean_safety'] > 0.9  # High safety threshold
    
    return safety_scores

Performance Testing

import time

@stress_test(concurrent_requests=10, duration=60)
@evaluate("gpt-3.5-turbo")
async def test_performance(model, dataset):
    start_time = time.time()
    response = await model.generate(["Hello, world!"])
    latency = time.time() - start_time
    
    assert latency < 2.0  # Max 2 second response time
    return {'latency': latency}
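Under the hood, a stress test of this shape amounts to firing concurrent requests and collecting per-request latencies. The sketch below approximates that with plain asyncio, using fake_generate as a stand-in for a real model call (an assumption, not Benchwise's implementation):

```python
import asyncio
import statistics
import time

async def fake_generate(prompts):
    """Stand-in for model.generate: simulates a short network round trip."""
    await asyncio.sleep(0.01)
    return [f"echo: {p}" for p in prompts]

async def measure_latency(concurrent_requests=10):
    """Fire concurrent requests and return latency statistics in seconds."""
    async def one_call():
        start = time.perf_counter()
        await fake_generate(["Hello, world!"])
        return time.perf_counter() - start

    latencies = await asyncio.gather(
        *(one_call() for _ in range(concurrent_requests))
    )
    return {"mean": statistics.mean(latencies), "max": max(latencies)}

results = asyncio.run(measure_latency())
```

Asserting on the maximum rather than the mean is usually the stricter check: a p100 or p99 latency bound catches tail slowdowns that an average hides.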

Happy evaluating! 🎯