
BibleBench Quick Start Guide

Get up and running with BibleBench in 5 minutes!

Step 1: Install Dependencies

# Make sure you have Node.js 18+ and pnpm installed
pnpm install

Step 2: Set Up OpenRouter API Key

BibleBench uses OpenRouter exclusively - you only need one API key for all models!

# Copy the example environment file
cp .env.example .env

# Edit .env and add your OpenRouter API key

Your .env should contain:

OPENROUTER_API_KEY=your_openrouter_key_here

Get your API key: https://openrouter.ai/keys
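If you want to sanity-check that the key is being picked up before running a full evaluation, a minimal standalone sketch like the one below works. This is illustrative only and not part of BibleBench; it assumes the dotenv package is available, and the project's own tooling may already handle .env loading for you.

// check-key.ts - hypothetical sanity check, not part of BibleBench
import "dotenv/config"; // loads .env into process.env (assumes dotenv is installed)

const key = process.env.OPENROUTER_API_KEY;

if (!key) {
  console.error("OPENROUTER_API_KEY is missing - check your .env file");
  process.exit(1);
}

console.log(`Found OpenRouter key ending in ...${key.slice(-4)}`);

Run it with a TypeScript runner such as tsx if you have one installed.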

Why OpenRouter?

  • ✅ One key for GPT, Claude, Llama, Grok, Gemini, and hundreds more
  • ✅ Pay-as-you-go pricing
  • ✅ Automatic failover
  • ✅ No need for multiple provider accounts

Step 3: Run Your First Evaluation

# Run all evaluations with the UI
pnpm eval:dev

This will:

  1. Start the Evalite development server
  2. Run all evaluation suites
  3. Open a UI at http://localhost:3006

Step 4: Explore Results

The Evalite UI shows:

  • Overall scores for each model on each evaluation
  • Detailed breakdowns by scorer
  • Traces showing exact inputs/outputs
  • Metadata with reasoning from LLM-as-judge scorers

Running Specific Evaluations

# Run only scripture tests
pnpm eval evals/scripture/

# Run only a specific test file
pnpm eval evals/theology/core-doctrines.eval.ts

# Run without caching (for production)
pnpm eval --no-cache

Understanding the Results

Score Interpretation

  • 1.0 = Perfect score
  • 0.7-0.9 = Good, with minor issues
  • 0.4-0.6 = Partial correctness
  • 0.0-0.3 = Significant problems
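If you export raw scores and want to skim them quickly, a trivial helper that mirrors the ranges above can attach a label to each score. This is a hypothetical convenience function, not part of BibleBench.

// Hypothetical helper mapping a 0-1 score to the labels above.
// Scores between 0.9 and 1.0 are treated as "Good" here; adjust to taste.
const interpretScore = (score: number): string => {
  if (score >= 1.0) return "Perfect";
  if (score >= 0.7) return "Good, with minor issues";
  if (score >= 0.4) return "Partial correctness";
  return "Significant problems";
};

// interpretScore(0.85) -> "Good, with minor issues"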

Evaluation Categories

  1. Scripture/Exact Scripture Matching: Tests word-for-word recall of verses across multiple translations
  2. Scripture/Reference Knowledge: Tests knowledge of where verses are found
  3. Scripture/Context Understanding: Tests understanding of biblical context
  4. Theology/Core Doctrines: Tests understanding of key Christian doctrines
  5. Theology/Heresy Detection: Tests ability to identify false teachings
  6. Theology/Denominational Nuance: Tests fair representation of different traditions
  7. Theology/Pastoral Application: Tests application of theology to real situations

Scorers Explained

  • Exact Match: Binary 0 or 1 (exact text match)
  • Contains: Binary 0 or 1 (substring match)
  • Levenshtein: 0-1 similarity based on edit distance
  • Theological Accuracy Judge: LLM evaluates theological correctness (0-1)
  • Heresy Detection Judge: LLM detects heterodox teaching (1 = orthodox, 0 = heretical)
  • Custom scorers: Various domain-specific metrics
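As an illustration of what the string-based scorers measure, here are hypothetical standalone versions of the first three. The real implementations live in evals/lib/scorers.ts and may differ in detail.

// Hypothetical sketches of the string-based scorers; not the project's actual code.

const exactMatch = (output: string, expected: string): number =>
  output.trim() === expected.trim() ? 1 : 0;

const contains = (output: string, expected: string): number =>
  output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;

// Classic dynamic-programming edit distance.
const levenshtein = (a: string, b: string): number => {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
};

// Similarity in [0, 1]: 1 means identical, 0 means completely different.
const levenshteinSimilarity = (a: string, b: string): number =>
  1 - levenshtein(a, b) / Math.max(a.length, b.length, 1);

The two judge scorers are different in kind: they send the model's answer to another LLM with a grading prompt and return that model's 0-1 verdict, with its reasoning surfaced in the metadata.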

Customizing Tests

Test Fewer Models

Use the MODELS environment variable to filter which models to test - no code changes needed:

# Run only specific models
MODELS="gpt" pnpm eval              # Only GPT models
MODELS="claude" pnpm eval           # Only Claude models
MODELS="opus,sonnet" pnpm eval      # Only Opus and Sonnet models
MODELS="gpt-5.2" pnpm eval:dev      # Only GPT-5.2

# Run multiple providers
MODELS="gpt,claude,grok" pnpm eval

The pattern matching is case-insensitive and matches partial names. For example:

  • MODELS="gpt" matches all models with "gpt" in the name
  • MODELS="claude haiku" matches "Claude Haiku 4.5"
  • MODELS="5.2,opus" matches "GPT-5.2" and "Claude Opus 4.5"

Add More Test Cases

Edit any .eval.ts file and add to the data array:

const verseRecallData = [
  // ... existing test cases
  {
    input: "Your new question",
    expected: "Expected answer",
    reference: "Scripture reference"
  }
];

Change the Judge Model

Edit evals/lib/models.ts:

// Use a different model for judging
export const defaultJudgeModel = sonnet45; // Instead of gpt-5-mini

Common Issues

"API key not found"

Make sure your .env file contains a valid OPENROUTER_API_KEY (see Step 2).

"Module not found" errors

Run pnpm install to ensure all dependencies are installed.

Slow evaluation

  • Use --no-cache flag only when needed
  • With caching enabled, repeated runs are much faster
  • Use MODELS to test fewer models: MODELS="gpt-5.2" pnpm eval

High API costs

  • Caching helps reduce costs significantly
  • Start with just one or two models
  • Use smaller/cheaper models as judges (e.g., gpt4oMini)

Next Steps

  • Read the full README.md for detailed documentation
  • Check CONTRIBUTING.md to add your own evaluations
  • Explore the evaluation files in evals/ to understand the structure
  • Customize scorers in evals/lib/scorers.ts

Tips for Best Results

  1. Use caching during development - It saves time and money
  2. Check the metadata - LLM-as-judge scorers include detailed rationales
  3. Run multiple times - Some models have non-deterministic outputs
  4. Compare across models - The UI makes cross-model comparison easy
  5. Export results - Use pnpm eval:ui to view past results

Getting Help


Happy evaluating! May your models know their scripture well. 📖