A framework for benchmarking Speech-to-Text services with TTFS (Time To Final Segment) latency and Semantic WER (Word Error Rate) accuracy measurement.
Benchmark results on 1000 samples from the pipecat-ai/smart-turn-data-v3.1-train dataset.
| Service | Transcripts | Perfect | WER Mean | Pooled WER | TTFS Median | TTFS P95 | TTFS P99 |
|---|---|---|---|---|---|---|---|
| assemblyai | 99.8% | 66.8% | 3.49% | 3.02% | 256ms | 362ms | 417ms |
| aws | 100.0% | 77.4% | 1.68% | 1.75% | 1136ms | 1527ms | 1897ms |
| azure | 100.0% | 82.9% | 1.21% | 1.18% | 1016ms | 1345ms | 1791ms |
| cartesia | 99.9% | 60.5% | 3.92% | 4.36% | 266ms | 364ms | 898ms |
| deepgram | 99.8% | 76.5% | 1.71% | 1.62% | 247ms | 298ms | 326ms |
| elevenlabs | 99.7% | 81.3% | 3.16% | 3.12% | 281ms | 348ms | 407ms |
|  | 100.0% | 69.0% | 2.84% | 2.85% | 878ms | 1155ms | 1570ms |
| openai | 99.3% | 75.9% | 3.24% | 3.06% | 637ms | 965ms | 1655ms |
| soniox | 99.8% | 84.1% | 1.25% | 1.29% | 249ms | 281ms | 310ms |
| speechmatics | 99.7% | 83.2% | 1.40% | 1.07% | 495ms | 676ms | 736ms |
Typical Latency (Median)
Worst-Case Latency (P95)
The Pareto frontier shows services that offer the best trade-off between latency and accuracy—no other service is better on both metrics. Services on the frontier represent efficient choices depending on your priorities.
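As a rough illustration of that domination check, the sketch below keeps only the services that no other service beats on both median TTFS and pooled WER. The service names and numbers are placeholders, not benchmark output.

```python
# Sketch: Pareto check on (median TTFS ms, pooled WER %); lower is better on both.
# Service names and numbers are placeholders, not benchmark output.
services = {
    "svc_fast": (250, 1.6),      # low latency, decent accuracy
    "svc_accurate": (1000, 1.2), # slower, most accurate
    "svc_middling": (600, 3.0),  # dominated by svc_fast on both axes
}

def pareto_frontier(points):
    """Keep every service that no other service beats on both metrics."""
    frontier = []
    for name, (lat, wer) in points.items():
        dominated = any(
            o_lat <= lat and o_wer <= wer and (o_lat, o_wer) != (lat, wer)
            for other, (o_lat, o_wer) in points.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(services))  # ['svc_fast', 'svc_accurate']
```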
For production voice agents, P95 latency matters more than median. Even occasional high latency (5% of interactions) can break the conversational flow. A service with great median but poor P95 indicates inconsistent performance.
| Metric | Description |
|---|---|
| Transcripts | Percentage of samples where STT successfully returned a transcription |
| Perfect | Percentage of benchmark runs with a perfect transcription (0% semantic WER) |
| WER Mean | Average semantic word error rate across all samples |
| Pooled WER | Weighted WER (total errors / total reference words) |
| TTFS Median | Median time from user stops speaking to final transcription segment |
| TTFS P95 | 95th percentile TTFS - worst 5% of samples have latency above this |
| TTFS P99 | 99th percentile TTFS - worst 1% of samples have latency above this |
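The two WER aggregates differ only in weighting: WER Mean averages each sample's error rate, while Pooled WER divides total errors by total reference words, so longer samples count for more. A toy illustration with invented numbers:

```python
# Toy illustration of WER Mean vs Pooled WER (numbers invented).
# Each sample: (semantic word errors, reference word count).
samples = [(0, 5), (1, 5), (2, 40)]

wer_mean = sum(e / w for e, w in samples) / len(samples)              # average of per-sample rates
pooled_wer = sum(e for e, _ in samples) / sum(w for _, w in samples)  # total errors / total words

print(f"WER Mean:   {wer_mean:.1%}")    # 8.3% -- every sample weighs the same
print(f"Pooled WER: {pooled_wer:.1%}")  # 6.0% -- the long sample weighs more
```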
Semantic WER measures only transcription errors that would impact an LLM agent's understanding. Punctuation, contractions, filler words, and equivalent phrasings are ignored.
TTFS (Time To Final Segment) is measured from when the user stops speaking to when the final transcription segment is received. For streaming voice agents, lower TTFS means faster response times.
If you're using Pipecat and want TTFS latency numbers for your STT service and configuration, see Measuring TTFS for a quick start guide. The P95/P99 values from this tool can be used directly in Pipecat's ttfs_p99_latency service configuration (Pipecat 0.0.102+).
# Install dependencies
uv sync
# Download audio samples
uv run stt-benchmark download --num-samples 100
# Run benchmarks
uv run stt-benchmark run --services deepgram,openai
# Generate ground truth (Gemini)
uv run stt-benchmark ground-truth
# Calculate semantic WER (Claude)
uv run stt-benchmark wer
# View results
uv run stt-benchmark report

Requires Python 3.12+ and uv.
git clone <repo-url>
cd stt-benchmark
uv sync

Copy env.example to .env and set your API keys:
cp env.example .env

TTFS for STT is different from typical request/response latency. Since STT services receive continuous audio input, there's no discrete request to measure from. Instead, we measure from when the user stops speaking to when the final transcription arrives.
┌─────────────────────────────────────────────────────────────────────────────┐
│ VADUserStartedSpeaking Actual speech VADUserStopped │
│ t=0 ends SpeakingFrame │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ═══════╪══════════════════════════════╪══════════════════╪════ │
│ │ Audio streaming to STT │ VAD stop_secs │ │
│ │ │◄────────────────►│ │
│ │ │ │ │
│ │ └──── TTFS ────────┼────────► │
│ │ speech_end_time │ T3 │
│ │ │ (final │
│ │ T1 T2 │ transcript) │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ transcript transcript │ │
└─────────────────────────────────────────────────────────────────────────────┘
Key points:
- `speech_end_time` = `VADUserStoppedSpeakingFrame` timestamp − VAD `stop_secs`
- TTFS = final `TranscriptionFrame` receipt time − `speech_end_time`
- Streaming services emit multiple partial transcripts; we use the final one
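A minimal sketch of that arithmetic, with made-up timestamps (the real values come from frame timestamps captured by the observers):

```python
# TTFS arithmetic with made-up timestamps (seconds since pipeline start).
vad_stop_secs = 0.2                  # VAD silence window for this run
vad_stopped_frame_ts = 12.70         # when VADUserStoppedSpeakingFrame was emitted
final_transcript_ts = 12.95          # when the final TranscriptionFrame arrived

# The VAD only fires after stop_secs of silence, so speech actually ended earlier.
speech_end_time = vad_stopped_frame_ts - vad_stop_secs  # 12.50

ttfs_ms = (final_transcript_ts - speech_end_time) * 1000
print(f"TTFS: {ttfs_ms:.0f} ms")  # 450 ms
```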
Why the final transcript? For LLM/TTS there is a single discrete input→output exchange, which makes latency measurement simple. For streaming STT, audio flows continuously and generates multiple TranscriptionFrames. We can't know when the STT service finalized audio for intermediate transcripts, so we measure from the final one and use the VAD signal to determine when the user actually stopped speaking.
Traditional WER penalizes every word difference equally. "gonna" vs "going to" counts as 2 errors.
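For comparison, a plain word-level edit-distance WER (not the scorer this project uses) shows how harshly that paraphrase gets scored:

```python
# Plain word-level edit distance (traditional WER), not this project's semantic scorer.
def word_errors(reference: str, hypothesis: str) -> int:
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution / match
            )
    return dp[-1][-1]

ref = "I am going to renew my license"
hyp = "I am gonna renew my license"
errors = word_errors(ref, hyp)
print(errors, f"{errors / len(ref.split()):.0%}")  # 2 errors -> ~29% WER for a harmless paraphrase
```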
Semantic WER uses Claude to evaluate whether differences actually matter:
| Ignored (not errors) | Counted (errors) |
|---|---|
| Punctuation, capitalization | Word substitutions that change meaning |
| Contractions ("don't" → "do not") | Nonsense/hallucinated words |
| Singular/plural ("license" → "licenses") | Missing words that change intent |
| Filler words ("um", "uh") | Wrong names, numbers, negations |
| Number formats ("3" → "three") | Factual errors |
This gives accuracy metrics that reflect real-world impact on downstream LLM applications.
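The judgment is delegated to Claude rather than to fixed rules, but a rough normalization pass over the "ignored" categories gives a feel for what gets discounted. The sketch below is only an approximation, not the actual scorer:

```python
import re

# Rough normalization over the "ignored" categories; illustration only --
# the benchmark asks Claude to make this call, it does not use rules like these.
CONTRACTIONS = {"don't": "do not", "gonna": "going to", "it's": "it is"}
FILLERS = {"um", "uh"}
NUMBERS = {"2": "two", "3": "three"}

def normalize(text: str) -> list[str]:
    words = re.sub(r"[^\w\s']", "", text.lower()).split()  # drop punctuation and caps
    out: list[str] = []
    for w in words:
        if w in FILLERS:          # filler words ignored
            continue
        w = NUMBERS.get(w, w)     # number formats treated as equivalent
        out.extend(CONTRACTIONS.get(w, w).split())  # contractions expanded
    return out

print(normalize("Um, I don't need 3 copies."))
print(normalize("I do not need three copies"))
# Both print ['i', 'do', 'not', 'need', 'three', 'copies'] -> no semantic errors
```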
assemblyai, aws, azure, cartesia, deepgram, deepgram_flux, elevenlabs, fal, gladia, google, gradium, groq, hathora, nvidia, openai, sambanova, sarvam, soniox, speechmatics, whisper
See env.example for required API keys.
# Benchmark specific services
uv run stt-benchmark run --services deepgram,openai
# Benchmark all configured services
uv run stt-benchmark run --services all
# Limit samples and adjust VAD
uv run stt-benchmark run --services deepgram --limit 50 --vad-stop-secs 0.3

# Generate ground truth for all samples
uv run stt-benchmark ground-truth
# Interactive review with audio playback
uv run stt-benchmark ground-truth review <run_id>

# Calculate for all services
uv run stt-benchmark wer
# Force recalculate
uv run stt-benchmark wer --services deepgram --force-recalculate

# Compare all services
uv run stt-benchmark report
# Detailed report for one service
uv run stt-benchmark report --service deepgram
# Show worst samples
uv run stt-benchmark report --service deepgram --errors 10

See docs/cli.md for complete CLI reference.
stt_benchmark_data/
├── audio/ # Downloaded audio files
├── results.db # SQLite database
├── ground_truth_runs/ # Iteration JSONL files
├── validation_summary.txt # Generated reports
└── validation_full.csv
| Table | Description |
|---|---|
| `samples` | Audio sample metadata |
| `benchmark_results` | TTFS and transcription results |
| `ground_truths` | Reference transcriptions (Gemini) |
| `wer_metrics` | Semantic WER calculations |
| `semantic_wer_traces` | Full Claude reasoning traces |
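Since everything lands in SQLite, you can also query results directly. The column names below (service, wer) are assumptions; check the real schema with `.schema wer_metrics` before relying on this sketch:

```python
import sqlite3

# Sketch: per-service average WER straight from results.db.
# Column names are assumed, not confirmed against the actual schema.
conn = sqlite3.connect("stt_benchmark_data/results.db")
for service, n, mean_wer in conn.execute(
    "SELECT service, COUNT(*), AVG(wer) FROM wer_metrics GROUP BY service"
):
    print(f"{service:15s} samples={n:4d} mean WER={mean_wer:.2%}")
conn.close()
```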
┌──────────────────────────────────────────────────────────┐
│ PipelineTask │
│ observers=[MetricsCollector, TranscriptionCollector] │
├──────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────┐ ┌───────────────┐ │
│ │ SyntheticInput │───▶│ STTService │ │
│ │ Transport │ │ │ │
│ │ │ │ Emits: │ │
│ │ - Plays audio │ │ - Transcript │ │
│ │ - Silero VAD │ │ - MetricsFrame│ │
│ │ - Real-time pace │ │ (TTFS) │ │
│ └──────────────────┘ └───────────────┘ │
│ │ │
│ Observers capture frames │
└──────────────────────────────────────────────────────────┘
The benchmark dataset (audio samples and ground truth transcriptions) is publicly available on Hugging Face:
Audio samples are sourced from the pipecat-ai/smart-turn-data-v3.1-train dataset. Ground truth transcriptions are generated with Gemini and human-reviewed.
- CLI Reference - Complete command documentation
- Running Analysis - Step-by-step analysis guide
MIT

