Awesome Voice Agents

A curated list of voice AI agent frameworks, tools, resources, and best practices.

Welcome to Awesome Voice Agents! This is a curated collection of resources for building voice AI agents, covering core technologies such as endpoint detection, turn-taking management, real-time speech recognition, and speech synthesis.

Maintainer: 云中江树

💬 Join the Voice Agent Community
Add WeChat: 1796060717
Voice Agent practitioners and enthusiasts are welcome to join the discussion!

Contents

Frameworks & Platforms
VAD (Voice Activity Detection)
Turn Detection & Endpointing
STT (Speech-to-Text)
TTS (Text-to-Speech)
Developer Communities & Resources
Learning Resources
Contributing

Frameworks & Platforms

Comprehensive Frameworks

Name	Description	Notes
TEN Framework	Open-source framework for conversational voice AI agents with multimodal capabilities (voice, vision, avatar). Low-latency, high-quality real-time assistant.	Multimodal support; low latency, high quality. Demo
Pipecat	Open-source framework for voice and multimodal conversational AI. Modular design with support for multiple STT, LLM, and TTS services.	Modular design; multi-platform SDK support
LiveKit Agents	Framework for building real-time voice AI agents. Fully open source, WebRTC support, built-in semantic turn detection.	Fully open source; built-in semantic turn detection
OpenAI Realtime Agents	Advanced agentic patterns built on the OpenAI Realtime API. Multi-agent collaboration, handoffs, tool use.	Official OpenAI example; multi-agent support
openai-agents-js	A lightweight framework for multi-agent workflows and voice agents.	Official OpenAI example; supports multi-agent workflows and voice agents
call-center-ai	Place a phone call from an AI agent with a single API call, or call the bot directly at the configured phone number.	AI-powered call center solution built on Azure and OpenAI GPT
Vocode Core	Build voice-based LLM agents. Modular and open source. Real-time streaming conversations.	Supports deployment for phone calls, Zoom, and similar scenarios
Bolna	End-to-end, open-source, production-ready voice agent platform. Build voice assistants through JSON config.	Production-ready; supports Twilio/Plivo

Specialized Solutions

Name	Description	Notes
BentoVoiceAgent	Build phone-calling voice agents fully powered by open-source models. Uses BentoML for deployment.	Built entirely on open-source models

VAD (Voice Activity Detection)

Voice Activity Detection (VAD) identifies whether human speech is present in an audio stream. It is used to filter out silence, reduce compute cost, and improve the accuracy of downstream processing.
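As a rough illustration of the idea, the sketch below labels fixed-size audio frames as speech-like or silent using a plain energy threshold. This is a toy baseline, not how Silero VAD or the WebRTC VAD work internally; the function names, the 160-sample frame size (10 ms at 16 kHz), and the 0.02 threshold are all assumptions chosen for the example.

```python
import math

def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_speech(samples, frame_size=160, threshold=0.02):
    """Label each full frame True (speech-like) or False (silence).

    frame_size and threshold are illustrative assumptions:
    160 samples is 10 ms at 16 kHz, and 0.02 is an arbitrary cutoff.
    """
    flags = []
    # Step through the signal one whole frame at a time; a trailing
    # partial frame is ignored.
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_rms(frame) >= threshold)
    return flags

# Synthetic input: 20 ms of silence followed by 20 ms of a 440 Hz tone.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
flags = detect_speech(silence + tone)
print(flags)
```

An energy threshold also fires on loud non-speech noise, which is one reason the model-based detectors listed in this section are usually preferred in practice.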

Core VAD Models

Name	Description	Notes
Silero VAD	Pre-trained enterprise-grade voice activity detector. High-performance, low-latency, multi-language support.	⭐ Most popular VAD model; WebAssembly support
VAD (Browser)	Voice activity detector for the browser with a simple API. Pure frontend implementation.	Browser-side VAD with zero backend dependencies
py-webrtcvad	Python interface to the Google WebRTC Voice Activity Detector. Classic signal-processing approach.	Classic lightweight option; fast

Noise Cancellation

Name	Description	Notes
Krisp AI	AI-powered background voice and noise cancellation. Krisp Server SDK for voice agents.	Significantly improves VAD accuracy and reduces false triggers

Turn Detection & Endpointing

Turn detection (also called endpointing) determines when a user has finished speaking. It is a core component of natural conversation.
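The simplest form of endpointing is a silence timeout on top of VAD output: once speech has been heard, the turn is declared over after a fixed run of silent frames. The sketch below shows that baseline only. The function name and the `min_silence_frames` parameter are hypothetical, and the semantic models listed in this section exist precisely because a fixed timeout cannot tell a mid-sentence pause from a finished turn.

```python
def find_end_of_turn(vad_flags, min_silence_frames=3):
    """Return the index of the frame at which the turn is declared over,
    or None if the speaker has not finished yet.

    vad_flags is a sequence of per-frame booleans (True = speech).
    min_silence_frames is an illustrative assumption, not a recommended value.
    """
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1         # count silence only after speech began
            if silent_run >= min_silence_frames:
                return i
    return None

# A short pause at index 3 is ignored; the longer pause ends the turn.
flags = [False, True, True, False, True, False, False, False, True]
end_index = find_end_of_turn(flags)
print(end_index)
```

Raising `min_silence_frames` reduces interruptions of slow speakers but adds response latency, which is the trade-off that semantic turn detectors try to ease.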

Intelligent Turn Detection Models

Name	Description	Notes
Smart Turn (Pipecat)	Open-source turn detection model (BSD 2-Clause). Supports 23 languages. Uses semantic and audio features. No GPU required for real-time inference.	Fully open source; hosted service available on Fal (free)
LiveKit Turn Detector	Transformer-based semantic analysis. 85% true positive rate, ~50 ms inference time.	Transformer-based; works best combined with VAD

Commercial Solutions

Name	Description	Notes
OpenAI Realtime API	Semantic VAD with context-aware turn detection. Built into the end-to-end model.	Context-aware; end-to-end
AssemblyAI Universal-Streaming	Intelligent endpointing with semantic analysis. Real-time streaming transcription.	Semantic endpointing; real-time streaming
Retell AI Turn-Taking	Enterprise-grade turn-taking management. Adaptive endpointing; handles complex noise environments.	Enterprise-grade; adapts to noisy environments

STT (Speech-to-Text)

Real-Time Whisper Implementations

OpenAI Whisper is a widely used open-source speech recognition model, but it does not natively support real-time streaming. The following projects implement streaming transcription on top of it:

Name	Description	Notes
Whisper Streaming (UFAL)	Real-time Whisper streaming for long-form speech-to-text. Local agreement policy with self-adaptive latency. 3.3 s latency.	Uses a local agreement policy with self-adaptive latency
WhisperLive	Nearly-live implementation of OpenAI's Whisper. TensorRT acceleration support. Browser extensions and iOS client.	TensorRT acceleration; multi-platform support
Whisper Real Time	Real-time transcription with OpenAI Whisper. Continuously records and concatenates audio.	Simple, easy-to-use real-time transcription demo
VoiceStreamAI	Near-real-time audio transcription using self-hosted Whisper and WebSocket. Supports Faster Whisper, integrated VAD.	WebSocket architecture; chunked processing strategy
speech-to-text (reriiasu)	Real-time transcription using faster-whisper. HTML GUI with WebSocket support. SRT subtitle generation.	Provides a GUI; supports subtitle generation
Whispering	Streaming transcriber built on Whisper. Client-server architecture support.	Supports client-server architecture
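The local agreement policy mentioned for Whisper Streaming can be sketched as follows: the model re-transcribes a growing audio buffer, and a word is committed only once two consecutive hypotheses agree on it as part of their common prefix. This is a simplified illustration of the idea with hypothetical class and method names, not the project's actual code.

```python
def common_prefix(a, b):
    """Longest shared prefix of two word lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix

class LocalAgreement:
    """Commit words only when two consecutive hypotheses agree on them.

    Simplification (assumption): committed words are never revised,
    and each hypothesis covers the whole audio buffer so far.
    """

    def __init__(self):
        self.committed = []   # words already emitted to the user
        self.previous = []    # hypothesis from the previous update

    def update(self, hypothesis):
        """Take the latest hypothesis (list of words); return newly committed words."""
        agreed = common_prefix(self.previous, hypothesis)
        self.previous = hypothesis
        # Emit only the part of the agreed prefix not yet committed.
        new_words = agreed[len(self.committed):]
        self.committed.extend(new_words)
        return new_words

policy = LocalAgreement()
out1 = policy.update(["the", "cat"])               # nothing to compare against yet
out2 = policy.update(["the", "cat", "sat"])        # "the cat" confirmed
out3 = policy.update(["the", "cat", "sat", "on"])  # "sat" confirmed
print(out1, out2, out3)
```

Waiting for agreement trades some latency for stability: the most recent words stay tentative until the next transcription pass confirms them.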

Commercial STT APIs

Name	Description	Notes
OpenAI Whisper API	Cloud-based Whisper API. Easy integration, pay-as-you-go pricing.	Cloud-hosted Whisper; pay-as-you-go
Azure Speech Services	Microsoft's STT service. Multi-language support, custom model training.	Multi-language support; custom models
Deepgram	Enterprise-grade real-time STT API. Ultra-low latency, streaming transcription.	Ultra-low latency; streaming transcription
AssemblyAI	Speech AI API with the Universal-Streaming model. Immutable transcripts, intelligent endpointing.	Immutable transcripts; intelligent endpointing

TTS (Text-to-Speech)

Open Source TTS Models

Name	Description	Notes
GPT-SoVITS	Few-shot voice cloning TTS. High-quality synthesis.	Few-shot voice cloning
Bark	Transformer-based TTS model. Generates highly realistic audio, including music and sound effects.	Can generate music and sound effects
Coqui TTS	Deep learning TTS toolkit. 1100+ pre-trained models. Voice cloning support.	⭐ Most comprehensive open-source TTS toolkit
Piper TTS	Fast, local neural TTS. Real-time synthesis with low resource usage.	Real-time synthesis; low resource usage
Silero Models	Pre-trained text-to-speech models made embarrassingly simple. Multi-language support.	Easy to use; high quality

Commercial TTS APIs

Name	Description	Notes
ElevenLabs	High-quality AI voice generation. Ultra-realistic speech, voice cloning, multi-language support.	Highest-quality commercial TTS
OpenAI TTS	OpenAI's text-to-speech API. High-quality, low-latency, multiple voice options.	High quality; low latency
Azure Speech Services	Microsoft TTS with neural voices. Custom voice creation, SSML support.	Neural voices; custom voice creation
Cartesia	Real-time streaming TTS. Ultra-low latency, natural intonation.	Ultra-low-latency streaming TTS
Deepgram Aura	Real-time text-to-speech. Conversational voice quality.	Conversational voices; low latency

Developer Communities & Resources

Communities

Name	Description	Notes
RTE Community	Real-Time Engagement developer community. Technical articles, developer exchange, best practices.	Developer community for real-time engagement technology
Voice Agent Knowledge Base (Feishu)	Comprehensive Voice Agent knowledge base in Chinese. Systematic tutorials, practical experience.	Chinese-language Voice Agent knowledge base

Platforms & Tools

Name	Description	Notes
Vapi	Platform for quickly building voice AI agents. Low-code, rich integrations, telephony support.	Low-code platform for rapid development
Retell AI	Conversational AI platform with enterprise-grade turn-taking management.	Enterprise-grade turn-taking management
Tavus	Real-time conversational video API. Transformer-based turn detection, multimodal video and voice.	Multimodal video and voice

Technical Blogs & Documentation

Name	Description	Notes
Voice AI & Voice Agents Primer	Comprehensive illustrated guide to voice AI. Architecture design, technical overview, best practices.	Comprehensive illustrated guide to voice AI
AssemblyAI Blog: Turn Detection	In-depth analysis of turn detection. Algorithm comparison, latency analysis.	Deep dive into turn detection
LiveKit Blog: Transformer Turn Detection	Using transformers to improve endpointing. Technical details, performance comparison.	Improving endpointing with transformers
Krisp Blog: Turn-Taking	How background noise cancellation improves turn-taking. Evaluation and results.	Background noise cancellation improves turn-taking
Speechmatics: Semantic Turn Detection	Semantic turn detection with a small language model (SLM). Implementation guide, threshold tuning.	Semantic turn detection using an SLM
Agora: TEN VAD & Turn Detection	Making voice agents more human with TEN VAD and Turn Detection.	TEN's VAD and turn detection

Related Awesome Lists

License

To the extent possible under law, the contributors have waived all copyright and related rights to this work.

Acknowledgments

Thanks to all open-source contributors making voice AI technology more accessible and powerful!

If this list helps you, please give it a ⭐️!

Maintainer: 云中江树
WeChat: 1796060717 (join the Voice Agent discussion group)

Feel free to discuss Voice Agent technology through Issues or PRs!
