NicheBench is a lightweight, extensible CLI framework for benchmarking AI models on framework-specific tasks. Starting with Drupal, it features LLM-as-a-Judge evaluation, configuration-driven model management, and a rich CLI for interactive reporting.
Provided by HumanFace Tech.
- 🎯 LLM-as-a-Judge: All tasks scored by a second LLM with custom prompts (no regex/heuristics)
- 📊 3-Value Scoring: Pass (>66%), Partial (33-66%), Fail (<33%) for nuanced evaluation (see the scoring sketch after this list)
- 🤖 Multi-Turn Conversations: Agentic code generation with iterative refinement
- ⚡ Parallel Execution: Configurable parallelism with beautiful worker progress visualization
- 🛡️ Runaway Protection: Automatic detection and handling of repetitive model responses
- 📦 Framework Packs: Plug-and-play support for frameworks (Drupal first, others to follow)
- ⚙️ Configuration-Driven: YAML-based configuration with profiles for different evaluation scenarios
- 🔧 Provider Agnostic: Works with OpenAI, Groq, Anthropic, etc. via litellm
- 🎨 Rich CLI: Beautiful progress bars, tables, and interactive reporting with stacked results
- 🔍 Auto-Discovery: New frameworks and tasks discovered automatically
- 📄 Structured Results: Detailed JSON/JSONL output for analysis and reproducibility
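The three labels map directly onto the judge's percentage score. A minimal sketch of that mapping, assuming a 0-100 score from the judge (the function name and label strings are illustrative, not NicheBench's internal API):

def classify_score(score_percent: float) -> str:
    """Map a judge score (0-100) onto the 3-value scale."""
    # Bands follow the documented thresholds: Pass > 66%, Partial 33-66%, Fail < 33%.
    if score_percent > 66:
        return "pass"
    if score_percent >= 33:
        return "partial"
    return "fail"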
# Install and run
poetry install
poetry run nichebench list
# View available tasks
poetry run nichebench list drupal
# Run evaluations (uses configuration defaults)
poetry run nichebench run drupal quiz
# Run specific test cases by ID
poetry run nichebench run drupal quiz --ids="drupal_quiz_001,drupal_quiz_003"
# Run with specific models
poetry run nichebench run drupal quiz --model groq/llama-3.1-8b-instant --judge openai/gpt-4o
# Use configuration profiles
poetry run nichebench run drupal quiz --profile fast # Groq models for speed
poetry run nichebench run drupal quiz --profile reasoning # OpenAI o1 for complex tasks
poetry run nichebench run drupal quiz --profile anthropic # Claude models
# Run tests
poetry run pytest
Copy the sample configuration and customize for your needs:
cp nichebench.sample.yml nichebench.yml
# Edit nichebench.yml with your API keys and preferred models
NicheBench uses a nichebench.yml configuration file with intelligent defaults and a profile system:
# Model Under Test (MUT) configuration
mut:
  provider: "groq"
  model: "gemma2-9b-it"
  parameters:
    temperature: 0.0
    max_tokens: 4096

# Judge model configuration
judge:
  provider: "openai"
  model: "gpt-4o"
  parameters:
    temperature: 1.0
    max_tokens: 1024

# Evaluation settings
evaluation:
  parallelism: 1  # Number of parallel workers (1 = sequential)

# Configuration profiles for different scenarios
profiles:
  fast:  # Cost-effective Groq models
    mut: {provider: "groq", model: "llama-3.1-8b-instant"}
    judge: {provider: "groq", model: "llama-3.1-70b-versatile"}
  reasoning:  # OpenAI o1 models with reasoning
    mut:
      provider: "openai"
      model: "o1-preview"
      parameters:
        reasoning_effort: "high"
        reasoning_format: "hidden"
    judge: {provider: "openai", model: "o1-mini"}

Configuration Precedence: CLI args > Environment variables > Profile > Defaults
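Conceptually, each setting is resolved by layering these sources from lowest to highest priority. A rough sketch of that merge, assuming every source has already been read into a flat dict (names are illustrative, not the actual NicheBench internals):

from typing import Any

def resolve_config(defaults: dict[str, Any], profile: dict[str, Any],
                   env_overrides: dict[str, Any], cli_args: dict[str, Any]) -> dict[str, Any]:
    """Merge config layers so higher-priority sources win."""
    resolved: dict[str, Any] = {}
    # Lowest to highest priority: defaults < profile < environment < CLI.
    for layer in (defaults, profile, env_overrides, cli_args):
        resolved.update({key: value for key, value in layer.items() if value is not None})
    return resolved

For example, --model on the command line overrides the model chosen by --profile, which in turn overrides the mut defaults from nichebench.yml.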
Latest benchmark results from September 2025
Test models on Drupal-specific knowledge across multiple choice questions covering APIs, best practices, and advanced concepts.
Key Insights:
- GPT-4o leads with 80% accuracy on complex Drupal knowledge
- Strong mid-tier performance from Llama 3.1 70B (60%) and Claude Sonnet (60%)
- Clear performance tiers emerge across model sizes and training approaches
- Specialized knowledge matters - general coding ability doesn't always translate to framework expertise
Evaluate models on multi-step, production-ready Drupal module development with comprehensive criteria-based judging.
Key Insights:
- GPT-4o dominates complex code generation with 67% success rate
- Significant gap between top performers and smaller models
- Code structure & best practices prove challenging even for capable models
- Multi-turn agentic approach enables more sophisticated implementations than single-shot generation
💡 Why This Matters: These results demonstrate that framework-specific evaluation reveals performance patterns invisible in general coding benchmarks. NicheBench helps you choose the right model for your specific technical domain.
- ✅ Drupal Framework Pack: 2 quiz + 1 code generation task (more coming)
- ✅ Multi-Turn Code Generation: Agentic conversations with up to 5 turns (see the conversation sketch after this list)
- ✅ 3-Value Scoring System: Pass/Partial/Fail evaluation with percentage thresholds
- ✅ Runaway Protection: Automatic detection of repetitive/infinite loop responses
- ✅ Configuration System: Profile-based model management with YAML configuration
- ✅ LLM Integration: Full litellm support with parameter filtering and error handling
- ✅ Judge-driven Evaluation: DeepEval-compatible metrics with structured JSON responses
- ✅ Rich CLI: Interactive reporting with stacked results and average scores
- ✅ Test Coverage: Comprehensive test suite with mocked LLM responses
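For code-generation tasks, the model under test is driven through a bounded conversation rather than a single prompt. A simplified sketch of such a loop using litellm's OpenAI-style completion interface (the helper name, stop marker, and refinement prompt are illustrative; the real conversation manager lives under src/nichebench/providers/):

import litellm

MAX_TURNS = 5  # matches the "up to 5 turns" limit noted above

def run_code_task(system_prompt: str, task_prompt: str, model: str) -> str:
    """Drive the model under test through a bounded multi-turn conversation."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task_prompt},
    ]
    answer = ""
    for _ in range(MAX_TURNS):
        response = litellm.completion(model=model, messages=messages)
        answer = response.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        if "TASK_COMPLETE" in answer:  # illustrative stop marker, not the real protocol
            break
        # Ask the model to keep iterating on its previous output.
        messages.append({"role": "user", "content": "Continue refining your implementation."})
    return answer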
nichebench/
├── nichebench.sample.yml      # Sample configuration file
├── results/                   # Evaluation outputs
└── src/nichebench/
    ├── cli/                   # CLI commands + Rich UI
    ├── config/                # Configuration management
    ├── core/                  # Discovery, datamodel, loaders
    ├── providers/             # LLM client + judge adapters + conversation management
    ├── metrics/               # DeepEval-compatible metrics with 3-value scoring
    ├── frameworks/            # Framework packs
    │   └── drupal/
    │       ├── data/          # YAML test cases
    │       └── prompts/       # System prompts (MUT + Judge)
    └── utils/                 # Helpers
Test data for frameworks is stored in private repositories as Git submodules to prevent it from being crawled and incorporated into AI training datasets. This preserves benchmark integrity while keeping the evaluation framework open source.
For new users: Clone with submodules to get test data access:
git clone --recursive git@github.com:HumanFace-Tech/nichebench.git
For existing clones: Initialize submodules:
git submodule update --init --recursive
Note: Access to private test data repositories is granted on a collaboration basis. The framework structure and evaluation prompts remain publicly available.
# Setup development environment
poetry install
poetry run pre-commit install
# Copy sample configuration
cp nichebench.sample.yml nichebench.yml
# Run tests
poetry run pytest
# Code quality checks
poetry run pre-commit run --all-files
# CLI development
poetry run nichebench --help
Requirements: Python 3.10+, Poetry for dependency management
# frameworks/drupal/data/quiz/my_quiz.yaml
id: "drupal_quiz_006"
context: "You're building a custom module..."
question: "Which API should you use for custom entities in Drupal 11?"
choices:
  - "hook_entity_info()"
  - "EntityTypeInterface annotation"
  - "EntityInterface::create()"
  - "Custom entity plugins"
correct_choice: "B"
And here's an example for a code generation task. Note that the context should give the model everything it needs to complete the task: file tree, file contents, acceptance criteria, etc.
# frameworks/drupal/data/code_generation/my_code_task.yaml
id: "drupal_code_013"
context: |
Project: "Summit"
Stack:
- Drupal 11.1 (standard install)
- PHP 8.2
Scenario:
- Marketing wants a short-lived promo message on the homepage.
- Editors need to toggle the message without deploying code.
summary: Provide a configurable promo block that only renders when enabled.
prompt: |
Create a `summit_promo` module exposing a `SummitPromoBlock` block plugin.
The block should render a promo message pulled from `summit_promo.settings` and only
appear when the feature is enabled. Add a ConfigFormBase settings page so editors can
toggle the feature and update the message. Ensure the block bubbles cache tags/contexts
for the config and respects max-age. Include an automated test that flips the config
and asserts the block output.
judge_checklist:
- "Block plugin class annotated with `@Block`, uses dependency injection, and returns a render array with the configured message when enabled."
- "Settings form extends ConfigFormBase, writes to `summit_promo.settings`, and prevents saving an empty message while enabled."
- "Typed config schema present; default config installed via `config/install` with sensible initial values."
- "Block attaches cache metadata (config tag + user permissions) using CacheableMetadata to avoid stale output."
- "Functional test (BrowserTestBase or Kernel plus Block plugin build) toggles the config and asserts the block text visibility."
judge_notes: |
β’ Block plugins live under `Drupal\summit_promo\Plugin\Block` and extend BlockBase.
β’ Inject config with `ConfigFactoryInterface` or `ImmutableConfig`, not `\Drupal::config()`.
β’ Settings form should clear caches after save so block output updates immediately.
β’ Tests may use `drupalPlaceBlock()` or build the block plugin manually; ensure assertions cover both enabled and disabled states.# frameworks/drupal/prompts/QUIZ.py
QUIZ_SYSTEM_PROMPT = """You are a senior Drupal developer...
Respond with only the letter of your choice (A, B, C, D, or E)."""
# frameworks/drupal/prompts/judges/JUDGE_QUIZ.py
JUDGE_QUIZ_SYSTEM_PROMPT = """You are an expert evaluator...
Respond with JSON: {"pass": true/false, "selected": "B", "score": 1, "explanation": "..."}"""
- Framework-Specific: Focus on niche technical domains (Drupal, WordPress, etc.) vs. generic benchmarks
- 3-Value Scoring: Pass/Partial/Fail evaluation provides nuanced performance insights
- Multi-Turn Capable: Supports iterative code generation with conversation management
- Runaway Protection: Handles model misbehavior (infinite loops, repetitive responses) gracefully (see the detection sketch after this list)
- Judge-Centric: Every evaluation uses LLM-as-a-Judge with custom prompts, not regex matching
- Configuration-Driven: Profile system eliminates CLI parameter overload
- Modular: Plug-and-play framework packs with auto-discovery
- Rich UX: Beautiful CLI with stacked results, average scores, and progress tracking
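A simple way to catch the runaway behavior mentioned above is to check whether the tail of a response keeps repeating. A rough sketch of that heuristic (the function name and thresholds are illustrative, not NicheBench's actual detector):

def looks_runaway(text: str, chunk_size: int = 80, max_repeats: int = 5) -> bool:
    """Heuristically flag responses whose ending repeats over and over."""
    tail = text[-chunk_size:]
    if len(tail) < chunk_size:
        return False  # too short to judge
    # Count how often the final chunk already appears in the full response.
    return text.count(tail) > max_repeats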
- Add new framework packs under src/nichebench/frameworks/<name>/
- Create YAML test cases in data/<category>/ directories
- Define system prompts in prompts/ for both MUT and judge
- Follow the existing Drupal pack structure (a minimal sketch follows this list)
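As an illustration, the prompt module for a hypothetical WordPress pack could start out like this (the framework name, path, and prompt text are invented for this example; mirror the Drupal pack's real files when building a pack):

# src/nichebench/frameworks/wordpress/prompts/QUIZ.py  (hypothetical pack)
QUIZ_SYSTEM_PROMPT = """You are a senior WordPress developer answering multiple-choice questions.
Respond with only the letter of your choice (A, B, C, D, or E)."""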
MIT License - see LICENSE file for details.

