Replay - The first open source framework for end-to-end benchmarking of voice applications

An open source simulator-driven evaluation setup for end-to-end testing of voice-to-voice applications.

There is currently no established way to test voice-to-voice apps comprehensively, especially for latency, interruption handling, and other production-critical behavior. Replay fills that gap by letting you test any agent against another agent at scale for production use.

Architecture

Replay consists of two main components: a simulator for running conversations and an analysis pipeline for evaluating conversation quality.

Simulator

The simulator orchestrates real-time voice conversations between agents deployed on Pipecat Cloud. It uses the Pipecat Cloud Session API to instantiate both a simulator agent and your tested agent, connecting them to the same Daily.co room for bidirectional audio communication. The simulator handles room creation, token generation, and session management, allowing agents to interact naturally while recording the full conversation audio.

Key features:

Cloud-native deployment: Agents run on Pipecat Cloud infrastructure, enabling scalable testing
Real-time audio: Uses Daily.co WebRTC for low-latency voice communication
Session orchestration: Programmatically starts and manages agent sessions via the Pipecat Cloud API
Room coordination: Ensures both agents connect to the same room by passing room URLs via session parameters
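
The room coordination step can be illustrated with a minimal sketch. It does not call the real Pipecat Cloud API: `start_session` is a hypothetical callable standing in for whatever wrapper you write around the Session API, and the `room_url` parameter name, agent names, and room URL are assumptions for illustration only. The point it shows is that both sessions receive the same room URL in their session parameters.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (agent_name, session_params) -> session id.
# In a real run this would wrap a call to the Pipecat Cloud Session API.
StartSession = Callable[[str, Dict[str, str]], str]


@dataclass
class Conversation:
    room_url: str
    simulator_session: str
    tested_session: str


def start_conversation(
    room_url: str,
    simulator_agent: str,
    tested_agent: str,
    start_session: StartSession,
) -> Conversation:
    """Start both agents with the same room URL in their session parameters."""
    params = {"room_url": room_url}  # hypothetical parameter name
    simulator_session = start_session(simulator_agent, dict(params))
    tested_session = start_session(tested_agent, dict(params))
    return Conversation(room_url, simulator_session, tested_session)


# Stand-in for the real API call, used only to make the sketch runnable.
calls: List[Tuple[str, Dict[str, str]]] = []


def fake_start_session(agent_name: str, session_params: Dict[str, str]) -> str:
    calls.append((agent_name, session_params))
    return f"session-{len(calls)}"


convo = start_conversation(
    "https://example.daily.co/test-room",  # placeholder room URL
    "simulator-agent",                     # placeholder agent names
    "tested-agent",
    fake_start_session,
)
print(convo)
```

Injecting `start_session` keeps the coordination logic testable without network access; see simulator/README.md for the actual deployment and configuration steps.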

Analysis Pipeline

The analysis pipeline processes recorded conversation audio to extract quantitative metrics about conversation quality. It uses speaker diarization (pyannote.ai) to segment audio by speaker, then analyzes three critical dimensions:

Latency Analysis: Measures response times between speaker turns, identifying slow responses (>2s threshold) and intra-speaker pauses. Uses Voxtral (audio LLM) to classify pauses as natural conversational pauses vs unnatural processing delays.
Overlap Analysis: Detects interruptions and simultaneous speech by identifying temporal overlaps between speakers. Analyzes overlap duration and uses Voxtral to distinguish natural backchanneling from problematic interruptions.
Repetition Analysis: Identifies consecutive n-gram repetitions (1-grams through 10-grams) within speaker segments, flagging potential stuttering, processing errors, or system failures.
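
As a rough illustration of the latency and overlap checks above, the sketch below works on diarization output reduced to (speaker, start, end) segments. The `Segment` type, the function names, and the sample timings are assumptions made for this example, not Replay's actual code; only the 2-second threshold comes from the description above. It covers gaps between different speakers and skips the intra-speaker pause and Voxtral classification steps.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds


SLOW_RESPONSE_S = 2.0  # slow-response threshold stated above


def response_latencies(segments: List[Segment]) -> List[Tuple[str, float]]:
    """Gap between the end of one speaker's turn and the start of the next speaker's."""
    ordered = sorted(segments, key=lambda s: s.start)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        # Only count speaker changes that do not overlap the previous turn.
        if cur.speaker != prev.speaker and cur.start >= prev.end:
            gaps.append((cur.speaker, cur.start - prev.end))
    return gaps


def overlaps(segments: List[Segment]) -> List[Tuple[float, float]]:
    """Intervals during which two different speakers talk at once."""
    ordered = sorted(segments, key=lambda s: s.start)
    found = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # segments are sorted, so later ones cannot overlap `a`
            if b.speaker != a.speaker:
                found.append((b.start, min(a.end, b.end)))
    return found


# Made-up timings for illustration.
segments = [
    Segment("A", 0.0, 3.0),
    Segment("B", 3.5, 6.0),
    Segment("A", 8.5, 10.0),
    Segment("B", 9.5, 11.0),
]

latencies = response_latencies(segments)
slow = [(spk, gap) for spk, gap in latencies if gap > SLOW_RESPONSE_S]
print(latencies)           # [('B', 0.5), ('A', 2.5)]
print(slow)                # [('A', 2.5)]
print(overlaps(segments))  # [(9.5, 10.0)]
```

In the pipeline described here, each flagged pause or overlap would then be passed to Voxtral to decide whether it sounds natural; that step is not shown.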

The pipeline outputs consolidated metrics including event counts, average latencies, natural vs unnatural classifications, and detailed event timelines. All analysis uses audio-level understanding via Voxtral rather than transcription, enabling evaluation of prosody, timing, and naturalness that text-based metrics miss.
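
The consecutive n-gram repetition check from the list above can be sketched as follows. This assumes a token sequence is already available for a speaker segment; how Replay obtains those tokens is not specified here, and the function name and event tuple layout are hypothetical.

```python
from typing import List, Tuple


def consecutive_repetitions(tokens: List[str], max_n: int = 10) -> List[Tuple[int, int, int]]:
    """Find n-grams that are immediately followed by one or more identical n-grams.

    Returns (start_index, n, repeat_count) tuples, for n from 1 to max_n.
    """
    events = []
    for n in range(1, max_n + 1):
        i = 0
        while i + 2 * n <= len(tokens):
            gram = tokens[i:i + n]
            repeats = 1
            j = i + n
            # Count how many times the same n-gram follows back to back.
            while tokens[j:j + n] == gram:
                repeats += 1
                j += n
            if repeats > 1:
                events.append((i, n, repeats))
                i = j  # skip past the whole repeated run
            else:
                i += 1
    return events


tokens = "i can i can i can help you you".split()
events = consecutive_repetitions(tokens)
print(events)  # [(7, 1, 2), (0, 2, 3)]
```

Here "you you" is reported as a 1-gram repeated twice starting at index 7, and "i can" as a 2-gram repeated three times starting at index 0.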

Simulator

End-to-end voice evaluation simulator for testing agents in conversation. Deploy both a simulator agent and your tested agent to Pipecat Cloud, then use run_conversation.py to put them in the same Daily room for conversation testing.

Quick start:

Deploy simulator and tested agents to Pipecat Cloud (see simulator/README.md for details)
Configure environment variables in simulator/.env
Run the conversation:

cd simulator
uv run run_conversation.py

See simulator/README.md for detailed setup and deployment instructions.

Run the analysis script on a recorded conversation:

uv run final_submission/main.py audio/convo.wav --output results.json

Gradio demo:

uv run s2s_eval_gradio.py

Note: this runs much faster if you have a CUDA or MPS device.

Diarization

Scripts for speaker diarization using pyannote.ai.

Upload a local audio file:

uv run diarization/upload.py path/to/audio.wav

Diarize an audio file (public URL or uploaded media):

uv run diarization/diarize.py 'media://your-object-key'
# or with a public URL
uv run diarization/diarize.py 'https://example.com/audio.wav'

Contributing

To avoid merge conflicts, each contributor should push their work to their own folder.
