
🎙️ VoiceBench

Open-source voice AI evaluation workbench for dev teams

MIT License · Next.js 16 · TypeScript · Tailwind CSS · Drizzle ORM

Quick Start · Features · Screenshots · Architecture · Providers · Contributing


What is VoiceBench?

VoiceBench is an open-source evaluation workbench for teams building voice AI applications. Hook up your voice providers, run real conversations against evaluation prompts, and get comprehensive quality metrics — both auto-detected and human-rated — in one place.

If you're building a voice agent and your current eval process is "listen to it and see if it sounds okay," this is for you.

The Problem

Voice AI teams lack standardized tooling to evaluate their agents. Text LLMs have dozens of eval frameworks. Voice agents? You're on your own. You can't easily answer:

  • How fast does my agent respond? (TTFB, latency)
  • How accurate is the speech recognition? (WER)
  • Does it sound natural? Handle interruptions well? (Human judgment)
  • How does Provider A compare to Provider B on the same prompts?
  • Are we getting better or worse over time?

The Solution

VoiceBench gives you a structured workflow: pick a provider → pick a prompt → have a conversation → rate the responses → analyze across sessions. Auto-detected metrics (latency, WER, speech rate) are captured automatically. Human quality metrics (naturalness, emotion, turn-taking) are one-click ratings per turn. Everything feeds into a cross-session analytics dashboard you can filter by provider, prompt, or metric.

📸 Screenshots

Live Eval — Conversation with Real-Time Metrics

Start a session with any configured provider. Auto metrics update live. Rate each turn on 8 quality dimensions with one click.


Results — Cross-Session Analytics

Compare providers, analyze by prompt, break down by metric. Export to CSV.


Prompts — 75+ Evaluation Scenarios

Built-in scenarios across task completion, information retrieval, and conversation flow. Create your own or import from YAML.


Settings — Provider Configuration

Add and manage voice AI providers. Test connections before running evals.


🚀 Quick Start

Prerequisites

  • Node.js 20+
  • npm or pnpm
  • SQLite (bundled via better-sqlite3)

Installation

```bash
git clone https://github.com/mhmdez/voicebench.git
cd voicebench

npm install

cp .env.example .env.local
# Add your provider API keys to .env.local

npm run db:push

npm run dev
```

Open http://localhost:3000 — you'll land on the Live Eval page.

Environment Variables

```bash
# Database (SQLite, local by default)
DATABASE_URL=./data/voicebench.db

# Provider keys (add via Settings UI or env)
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=...
RETELL_API_KEY=...

# Judge model for automated scoring (optional)
JUDGE_MODEL=gpt-4o
JUDGE_API_KEY=sk-...

# Whisper for transcription/WER (optional)
WHISPER_API_KEY=sk-...
```

✨ Features

🎯 Live Eval

The core workflow:

  1. Choose provider + prompt — Select from configured providers and 75+ built-in scenarios, or type a freestyle prompt
  2. Converse — Multi-turn conversation with the voice agent
  3. Rate per turn — Quick thumbs up/down on 8 quality dimensions
  4. Watch metrics live — Auto-detected metrics and sparkline trends update after each turn
  5. End & save — Session saved with full metrics for cross-session analysis
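The same flow is exposed over HTTP by the `api/eval/` routes (sessions, turns, ratings). Here is a sketch of driving it programmatically; the paths and payload fields are assumptions for illustration, so check the handlers under `src/app/api/eval/` for the real contract:

```typescript
// Illustrative only: route paths and payload fields are assumptions,
// not the documented API contract (see src/app/api/eval/).
const base = 'http://localhost:3000/api/eval';

// Start a session against a configured provider and scenario.
const res = await fetch(`${base}/sessions`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ providerId: 'prov_123', scenarioId: 'scn_456' }),
});
const session = await res.json();

// Rate turn 1 on one of the eight human-quality dimensions.
await fetch(`${base}/sessions/${session.id}/turns/1/ratings`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ metric: 'naturalness', rating: 'positive' }),
});
```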

📊 Auto-Detected Metrics

Measured automatically during every conversation — no manual work required:

| Metric | What it measures |
| --- | --- |
| TTFB | Time to first byte — how fast the agent starts responding |
| Response Time | Total end-to-end response latency |
| Word Count | Response verbosity per turn |
| Speech Rate | Words per minute — pacing analysis |
| WER | Word Error Rate — transcription accuracy vs. expected |
| Audio Duration | Length of audio responses in seconds |

Real-time sparkline charts show trends across turns so you can spot degradation mid-conversation.
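For reference, WER is the word-level edit distance between the expected transcript and the recognized one, normalized by reference length: WER = (S + D + I) / N. A minimal sketch of such a calculator (illustrative; the project's own version lives under `src/lib/eval/`):

```typescript
// Word Error Rate: (substitutions + deletions + insertions) / reference length.
// Minimal dynamic-programming sketch; not the repo's implementation.
function wer(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + sub);
    }
  }
  return dp[ref.length][hyp.length] / ref.length;
}

// wer('book a table for two', 'book table for two tonight') => 0.4
```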

👤 Human Rating Metrics

Some things only a human can judge. One click per turn, per metric:

| Metric | What it captures |
| --- | --- |
| Naturalness | Does it sound like a real person talking? |
| Prosody | Rhythm, stress, intonation — the musicality of speech |
| Emotion | Appropriate emotional expression for the context |
| Accuracy | Factual correctness of the agent's responses |
| Helpfulness | Did it actually help accomplish the task? |
| Efficiency | Got to the point without unnecessary rambling? |
| Turn-taking | Natural conversational flow and response timing |
| Interruption Handling | Graceful handling when the user interrupts |

Three-state rating: 👍 positive, 👎 negative, or — neutral (skip). Designed for speed — you can rate a full conversation in seconds.
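In code, a plausible shape for those ratings (the type and field names below are illustrative, not the repo's actual types): each turn carries at most one three-state value per metric, and the analytics views aggregate the share of positives among non-neutral ratings.

```typescript
// Three-state rating per turn, per metric. Names are illustrative.
type Rating = 'positive' | 'negative' | 'neutral';
type HumanMetric =
  | 'naturalness' | 'prosody' | 'emotion' | 'accuracy'
  | 'helpfulness' | 'efficiency' | 'turnTaking' | 'interruptionHandling';

type TurnRatings = Partial<Record<HumanMetric, Rating>>;

// Share of positive ratings among rated (non-neutral) turns, 0-100.
function positiveRate(turns: TurnRatings[], metric: HumanMetric): number | null {
  const rated = turns
    .map((t) => t[metric])
    .filter((r): r is Rating => r === 'positive' || r === 'negative');
  if (rated.length === 0) return null; // metric never rated
  return (100 * rated.filter((r) => r === 'positive').length) / rated.length;
}
```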

📈 Analytics Dashboard

The Results page aggregates data across all your eval sessions with three analysis views:

Overview — Provider comparison with horizontal bar charts showing average TTFB and human ratings side-by-side. Instantly see which provider performs best.

By Prompt — Which scenarios each provider handles well or poorly. Sorted by usage count. Click a prompt to filter the session table.

By Metric — Per-metric distribution across all sessions. Color-coded: green (>70%), orange (40-70%), red (<40%). See exactly where each provider falls short.
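Those color bands follow fixed thresholds on the positive-rating percentage. A minimal sketch (the thresholds come from the dashboard; the function itself is illustrative):

```typescript
// Bucket a 0-100 metric score into the dashboard's color bands.
type Band = 'green' | 'orange' | 'red';

function band(score: number): Band {
  if (score > 70) return 'green';   // strong
  if (score >= 40) return 'orange'; // mixed
  return 'red';                     // weak
}
```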

Plus: CSV export of any filtered view, date range filtering (7d/30d/all), provider and status filters.

📝 Prompts Library

75+ built-in evaluation scenarios organized by category:

  • Task Completion — Restaurant booking, appointment scheduling, order placement, travel planning
  • Information Retrieval — FAQ lookup, product details, weather queries, knowledge questions
  • Conversation Flow — Multi-turn dialogue, context retention, topic switching, error recovery

Each scenario includes difficulty rating (easy/medium/hard), expected outcome, and tags. Create custom prompts via the UI or bulk import from YAML.
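A YAML import file might look like this; the field names are an educated guess at the shape rather than the documented schema, so check the importer under `src/app/api/scenarios/` before relying on it:

```yaml
# Hypothetical import format: field names are illustrative.
scenarios:
  - title: Book a table for two
    category: task-completion
    difficulty: medium
    prompt: >
      You want a table for two at an Italian restaurant tomorrow at 7pm.
      Negotiate an alternative if that slot is unavailable.
    expectedOutcome: Agent confirms a reservation time, party size, and name.
    tags: [booking, negotiation]
```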

🔌 Providers

Extensible adapter architecture — add any voice AI provider:

| Provider | Type | Pipeline | Status |
| --- | --- | --- | --- |
| OpenAI Realtime | `openai` | Whisper → GPT-4o → TTS | ✅ Built-in |
| Google Gemini | `gemini` | Gemini multimodal → Cloud TTS | ✅ Built-in |
| Retell AI | `retell` | End-to-end voice agent API | ✅ Built-in |
| ElevenLabs | `elevenlabs` | — | 🔜 Coming soon |
| Custom | `custom` | Bring your own endpoint | ✅ Supported |

Configure providers through the Settings UI or programmatically:

```bash
# OpenAI Realtime
curl -X POST http://localhost:3000/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "GPT-4o Realtime",
    "type": "openai",
    "config": { "apiKey": "sk-...", "model": "gpt-4o-realtime", "voiceId": "nova" }
  }'
```

Writing a Custom Adapter

```typescript
import { ProviderAdapter } from '@/lib/providers/base-adapter';
import type { AudioPrompt, ProviderResponse } from '@/lib/providers/types';

// Stand-ins for your own provider client (illustrative only):
declare function callMyProvider(prompt: AudioPrompt): Promise<ProviderResponse>;
declare function pingMyProvider(): Promise<boolean>;

export class MyAdapter extends ProviderAdapter {
  async generateResponse(prompt: AudioPrompt): Promise<ProviderResponse> {
    // Your voice provider logic here; return the common response shape.
    const { audioBuffer, transcript, metadata } = await callMyProvider(prompt);
    return { audioBuffer, transcript, metadata };
  }

  async healthCheck(): Promise<{ healthy: boolean; latencyMs: number }> {
    // Ping the provider and report reachability plus round-trip latency.
    const start = Date.now();
    const healthy = await pingMyProvider();
    return { healthy, latencyMs: Date.now() - start };
  }
}
```

🏗️ Architecture

```text
voicebench/
├── src/
│   ├── app/
│   │   ├── api/
│   │   │   ├── eval/         # Sessions, turns, ratings, analytics
│   │   │   ├── providers/    # Provider CRUD + health checks
│   │   │   └── scenarios/    # Prompt management + YAML import
│   │   ├── eval/
│   │   │   ├── live/         # Live conversation + real-time metrics
│   │   │   └── demo/         # Demo with sample data
│   │   ├── results/          # Analytics dashboard + session detail
│   │   ├── prompts/          # Scenario library management
│   │   └── settings/         # Provider configuration
│   ├── components/
│   │   ├── layout/           # Sidebar navigation
│   │   ├── settings/         # Provider form + list
│   │   └── ui/               # shadcn/ui components
│   ├── db/                   # Drizzle ORM schemas (SQLite)
│   ├── lib/
│   │   ├── providers/        # Adapter system (OpenAI, Gemini, Retell)
│   │   ├── eval/             # WER calculator, metrics collector, LLM judge
│   │   └── services/         # Business logic
│   └── types/                # TypeScript interfaces
├── data/                     # SQLite database (gitignored)
└── docs/                     # Screenshots, architecture docs
```

Tech Stack

| Layer | Technology |
| --- | --- |
| Framework | Next.js 16 (App Router, React 19) |
| Language | TypeScript 5 |
| Database | SQLite + Drizzle ORM |
| UI | shadcn/ui, Tailwind CSS 4 |
| State | Zustand |
| Validation | Zod |
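
For a feel of the data layer: Drizzle schemas for SQLite look roughly like this (a hypothetical sessions table for illustration; the real schemas live in `src/db/`):

```typescript
import { sqliteTable, text, integer, real } from 'drizzle-orm/sqlite-core';

// Hypothetical eval-session table; the actual schemas live in src/db/.
export const sessions = sqliteTable('sessions', {
  id: text('id').primaryKey(),
  providerId: text('provider_id').notNull(),
  scenarioId: text('scenario_id'),
  status: text('status').notNull().default('active'),
  avgTtfbMs: real('avg_ttfb_ms'),
  createdAt: integer('created_at', { mode: 'timestamp' }).notNull(),
});
```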

📜 Scripts

```bash
npm run dev          # Development server (localhost:3000)
npm run build        # Production build
npm run db:push      # Push schema to database
npm run db:seed      # Seed sample data + demo prompts
npm run db:studio    # Open Drizzle Studio (DB browser)
```

🤝 Contributing

Contributions welcome:

  1. Fork the repo
  2. Create a feature branch (`git checkout -b feat/my-feature`)
  3. Make sure `npm run build` passes
  4. Open a PR

Ideas for contributions:

  • New provider adapters (ElevenLabs, PlayHT, Deepgram)
  • Additional auto-detected metrics
  • Batch evaluation mode (run N prompts sequentially)
  • Real-time audio waveform visualization
  • Team/workspace features

📄 License

MIT — see LICENSE for details.

📬 Contact

Questions, feedback, or want to collaborate? Reach out at mhmdez@me.com.


Built for teams shipping voice AI
