Get up and running with BibleBench in 5 minutes!
```bash
# Make sure you have Node.js 18+ and pnpm installed
pnpm install
```

BibleBench uses OpenRouter exclusively, so you only need one API key for all models!
```bash
# Copy the example environment file
cp .env.example .env

# Edit .env and add your OpenRouter API key
```

Your .env should contain:

```bash
OPENROUTER_API_KEY=your_openrouter_key_here
```

Get your API key: https://openrouter.ai/keys
Why OpenRouter?
- ✅ One key for GPT, Claude, Llama, Grok, Gemini, and hundreds more
- ✅ Pay-as-you-go pricing
- ✅ Automatic failover
- ✅ No need for multiple provider accounts
```bash
# Run all evaluations with the UI
pnpm eval:dev
```

This will:

- Start the Evalite development server
- Run all evaluation suites
- Open a UI at http://localhost:3006
The Evalite UI shows:
- Overall scores for each model on each evaluation
- Detailed breakdowns by scorer
- Traces showing exact inputs/outputs
- Metadata with reasoning from LLM-as-judge scorers
```bash
# Run only scripture tests
pnpm eval evals/scripture/

# Run only a specific test file
pnpm eval evals/theology/core-doctrines.eval.ts

# Run without caching (for production)
pnpm eval --no-cache
```

Scores range from 0 to 1:

- 1.0 = Perfect score
- 0.7-0.9 = Good, with minor issues
- 0.4-0.6 = Partial correctness
- 0.0-0.3 = Significant problems
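The bands above can be read as simple thresholds. As a hypothetical helper (not part of BibleBench, just an illustration of the banding):

```typescript
// Map a 0-1 score to the interpretation bands listed above.
// The >= thresholds are an assumption about how gaps between
// bands (e.g. 0.65) should be classified.
function scoreBand(score: number): string {
  if (score >= 1.0) return "Perfect";
  if (score >= 0.7) return "Good, with minor issues";
  if (score >= 0.4) return "Partial correctness";
  return "Significant problems";
}
```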
- Scripture/Exact Scripture Matching: Tests precise recall of verses across multiple translations with exact wording
- Scripture/Reference Knowledge: Tests knowledge of where verses are found
- Scripture/Context Understanding: Tests understanding of biblical context
- Theology/Core Doctrines: Tests understanding of key Christian doctrines
- Theology/Heresy Detection: Tests ability to identify false teachings
- Theology/Denominational Nuance: Tests fair representation of different traditions
- Theology/Pastoral Application: Tests application of theology to real situations
- Exact Match: Binary 0 or 1 (exact text match)
- Contains: Binary 0 or 1 (substring match)
- Levenshtein: 0-1 similarity based on edit distance
- Theological Accuracy Judge: LLM evaluates theological correctness (0-1)
- Heresy Detection Judge: LLM detects heterodox teaching (1 = orthodox, 0 = heretical)
- Custom scorers: Various domain-specific metrics
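To illustrate how the Levenshtein scorer produces a 0-1 similarity, here is a minimal sketch. The actual scorer lives in `evals/lib/scorers.ts` and may differ; in particular, normalizing by the longer string's length is an assumption:

```typescript
// Classic edit distance with a single rolling row of the DP table.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // dp[i-1][j-1] for the inner loop
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j]; // old dp[i-1][j]
      dp[j] = Math.min(
        dp[j] + 1,     // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Convert edit distance to a 0-1 similarity score.
function levenshteinScore(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length);
  if (maxLen === 0) return 1; // two empty strings match exactly
  return 1 - levenshtein(output, expected) / maxLen;
}
```

An identical answer scores 1, and each character of difference costs `1 / maxLen`, so near-miss verse recall still earns partial credit.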
Use the MODELS environment variable to filter which models to test - no code changes needed:
```bash
# Run only specific models
MODELS="gpt" pnpm eval          # Only GPT models
MODELS="claude" pnpm eval       # Only Claude models
MODELS="opus,sonnet" pnpm eval  # Only Opus and Sonnet models
MODELS="gpt-5.2" pnpm eval:dev  # Only GPT-5.2

# Run multiple providers
MODELS="gpt,claude,grok" pnpm eval
```

The pattern matching is case-insensitive and matches partial names. For example:

- `MODELS="gpt"` matches all models with "gpt" in the name
- `MODELS="claude haiku"` matches "Claude Haiku 4.5"
- `MODELS="5.2,opus"` matches "GPT-5.2" and "Claude Opus 4.5"
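The filtering logic itself lives in the repo; as a rough sketch of the matching behavior described above (the `filterModels` name and signature are illustrative, not the actual API):

```typescript
// Filter model names by a comma-separated MODELS pattern string.
// Matching is case-insensitive and substring-based; an unset or
// empty pattern selects every model.
function filterModels(models: string[], patterns?: string): string[] {
  if (!patterns) return models;
  const needles = patterns
    .split(",")
    .map((p) => p.trim().toLowerCase())
    .filter(Boolean);
  return models.filter((name) =>
    needles.some((needle) => name.toLowerCase().includes(needle))
  );
}
```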
Edit any `.eval.ts` file and add to the data array:

```typescript
const verseRecallData = [
  // ... existing test cases
  {
    input: "Your new question",
    expected: "Expected answer",
    reference: "Scripture reference"
  }
];
```

Edit `evals/lib/models.ts`:

```typescript
// Use a different model for judging
export const defaultJudgeModel = sonnet45; // Instead of gpt-5-mini
```

Make sure your .env file has the correct API keys for the models you're testing.
Run `pnpm install` to ensure all dependencies are installed.
- Use the `--no-cache` flag only when needed - with caching enabled, repeated runs are much faster
- Use `MODELS` to test fewer models: `MODELS="gpt-5.2" pnpm eval`
- Caching helps reduce costs significantly
- Start with just one or two models
- Use smaller/cheaper models as judges (e.g., `gpt4oMini`)
- Read the full README.md for detailed documentation
- Check CONTRIBUTING.md to add your own evaluations
- Explore the evaluation files in `evals/` to understand the structure
- Customize scorers in `evals/lib/scorers.ts`
- Use caching during development - It saves time and money
- Check the metadata - LLM-as-judge scorers include detailed rationales
- Run multiple times - Some models have non-deterministic outputs
- Compare across models - The UI makes cross-model comparison easy
- Export results - Use `pnpm eval:ui` to view past results
- Read the README.md for full documentation
- Check existing GitHub Issues
- Open a new issue for bugs or questions
- See CONTRIBUTING.md for how to contribute
Happy evaluating! May your models know their scripture well. 📖