A curated list of voice AI agent frameworks, tools, resources, and best practices | 精选的语音 AI Agent 框架、工具、资源和最佳实践
Welcome to Awesome Voice Agents! This is a carefully curated collection of resources related to voice AI agents, covering core technologies such as endpoint detection, turn-taking management, real-time speech recognition, and speech synthesis.
本项目精心收录语音 AI Agent 相关的优质资源,涵盖端点检测、话轮管理、实时语音识别、语音合成等核心技术。
Maintainer | 维护者: 云中江树
💬 Join Voice Agent Community | 加入 Voice Agent 交流群
Add WeChat | 添加微信: 1796060717
欢迎 Voice Agent 从业者和爱好者加入交流!
- Frameworks & Platforms | 框架与平台
- VAD (Voice Activity Detection) | 语音活动检测
- Turn Detection & Endpointing | 话轮检测与端点检测
- STT (Speech-to-Text) | 语音转文本
- TTS (Text-to-Speech) | 文本转语音
- Developer Communities & Resources | 开发者社区与资源
- Learning Resources | 学习资源
- Contributing | 贡献指南
| Name | Stars | Description | Notes |
|---|---|---|---|
| TEN Framework | Open-source framework for conversational voice AI agents with multimodal capabilities (voice, vision, avatar). Low-latency, high-quality real-time assistant. | 支持多模态,低延迟高质量。Demo | |
| Pipecat | Open Source framework for voice and multimodal conversational AI. Modular design with support for multiple STT, LLM, TTS services. | 模块化设计,支持多平台 SDK | |
| LiveKit Agents | Powerful framework for building realtime voice AI agents. Fully open-source, WebRTC support, built-in semantic turn detection. | 完全开源,内置语义话轮检测 | |
| OpenAI Realtime Agents | Advanced agentic patterns built on OpenAI Realtime API. Multi-agent collaboration, handoffs, tool use. | OpenAI 官方示例,支持多 Agent | |
| openai-agents-js | A lightweight, powerful framework for multi-agent workflows and voice agents. | OpenAI 官方示例,支持多 Agent 和语音Agent | |
| call-center-ai | Send a phone call from AI agent, in an API call. Or, directly call the bot from the configured phone number! | 基于 Azure 和 OpenAI GPT 的 AI 驱动呼叫中心解决方案。 | |
| Vocode Core | Build voice-based LLM agents. Modular and open source. Real-time streaming conversations. | 支持电话、Zoom 等场景部署 | |
| Bolna | End-to-end open source production-ready voice agents platform. Build voice assistants through JSON config. | 生产就绪,支持 Twilio/Plivo |
| Name | Stars | Description | Notes |
|---|---|---|---|
| BentoVoiceAgent | Build phone calling voice agents fully powered by open source models. Uses BentoML for deployment. | 完全基于开源模型 |
Voice Activity Detection (VAD) is a key technology for identifying the presence of human speech in audio streams.
VAD 是识别音频流中是否存在人声的关键技术,用于过滤静音、减少计算成本、改善下游处理准确性。
| Name | Stars | Description | Notes |
|---|---|---|---|
| Silero VAD | Pre-trained enterprise-grade Voice Activity Detector. High-performance, low-latency, multi-language support. | ⭐ 最流行的 VAD 模型,支持 WebAssembly | |
| VAD (Browser) | Voice activity detector for the browser with a simple API. Pure frontend implementation. | 浏览器端 VAD,零后端依赖 | |
| py-webrtcvad | Python interface to Google WebRTC Voice Activity Detector. Classic signal processing approach. | 经典轻量方案,快速 |
| Name | Stars | Description | Notes |
|---|---|---|---|
| Krisp AI | - | AI-powered background voice and noise cancellation. Krisp Server SDK for voice agents. | 显著改善 VAD 准确性,减少误触发 |
Turn Detection/Endpointing determines when a user has finished speaking - a core component of natural conversation.
话轮检测判断用户何时结束说话,是实现自然对话的核心技术。
| Name | Stars | Description | Notes |
|---|---|---|---|
| Smart Turn (Pipecat) | Open-source turn detection model (BSD 2-clause). Supports 23 languages. Semantic and audio feature-based. No GPU required for real-time inference. | 完全开源,提供 Fal 托管服务(免费) | |
| LiveKit Turn Detector | Transformer-based semantic analysis. 85% true positive rate, ~50ms inference time. | 基于 Transformer,结合 VAD 效果最佳 |
| Name | Description | Notes |
|---|---|---|
| OpenAI Realtime API | Semantic VAD with context-aware turn detection. Built into end-to-end model. | 上下文感知,端到端 |
| AssemblyAI Universal-Streaming | Intelligent endpointing with semantic analysis. Real-time streaming transcription. | 语义端点检测,实时流式 |
| Retell AI Turn-Taking | Enterprise-grade turn-taking management. Adaptive endpointing, handles complex noise environments. | 企业级方案,适应噪音环境 |
OpenAI Whisper is the most powerful open-source speech recognition model, but doesn't natively support real-time streaming. The following projects implement streaming transcription:
| Name | Stars | Description | Notes |
|---|---|---|---|
| Whisper Streaming (UFAL) | Whisper realtime streaming for long speech-to-text. Local agreement policy with self-adaptive latency. 3.3s latency. | 使用局部一致性策略,自适应延迟 | |
| WhisperLive | Nearly-live implementation of OpenAI's Whisper. TensorRT acceleration support. Browser extensions and iOS client. | 支持 TensorRT 加速和多平台 | |
| Whisper Real Time | Real time transcription with OpenAI Whisper. Continuously records and concatenates audio. | 简单易用的实时转录演示 | |
| VoiceStreamAI | Near-realtime audio transcription using self-hosted Whisper and WebSocket. Supports Faster Whisper, integrated VAD. | WebSocket 架构,分块处理策略 | |
| speech-to-text (reriiasu) | Real-time transcription using faster-whisper. HTML GUI with WebSocket support. SRT subtitle generation. | 提供 GUI,支持字幕生成 | |
| Whispering | Streaming transcriber with whisper. Client-server architecture support. | 支持客户端-服务器架构 |
| Name | Description | Notes |
|---|---|---|
| OpenAI Whisper API | Cloud-based Whisper API. Easy integration, pay-as-you-go pricing. | 云端 Whisper,按需付费 |
| Azure Speech Services | Microsoft's STT service. Multi-language support, custom model training. | 支持多语言,自定义模型 |
| Deepgram | Enterprise-grade real-time STT API. Ultra-low latency, streaming transcription. | 超低延迟,流式转录 |
| AssemblyAI | Speech AI API with Universal-Streaming model. Immutable transcription, intelligent endpointing. | 不可变转录,智能端点检测 |
| Name | Stars | Description | Notes |
|---|---|---|---|
| GPT-SoVITS | Few-shot voice cloning TTS. High-quality synthesis. | 少样本语音克隆 | |
| Bark | Transformer-based TTS model. Generates highly realistic audio including music, sound effects. | 可生成音乐和音效 | |
| Coqui TTS | Deep learning TTS toolkit. 1100+ pre-trained models. Voice cloning support. | ⭐ 最全面的开源 TTS 工具箱 | |
| Piper TTS | Fast, local neural TTS. Real-time synthesis with low resource usage. | 实时合成,低资源占用 | |
| Silero Models | Pre-trained text-to-speech models made embarrassingly simple. Multi-language support. | 简单易用,高质量 |
| Name | Description | Notes |
|---|---|---|
| ElevenLabs | High-quality AI voice generation. Ultra-realistic speech, voice cloning, multi-language support. | 最高质量的商业 TTS |
| OpenAI TTS | OpenAI's text-to-speech API. High-quality, low-latency, multiple voice options. | 高质量低延迟 |
| Azure Speech Services | Microsoft TTS with neural voices. Custom voice creation, SSML support. | 神经语音,自定义音色 |
| Cartesia | Real-time streaming TTS. Ultra-low latency, natural intonation. | 超低延迟流式 TTS |
| Deepgram Aura | Real-time text-to-speech. Conversational voice quality. | 对话式语音,低延迟 |
| Name | Description | Notes |
|---|---|---|
| RTE Community | Real-Time Engagement developer community. Technical articles, developer exchange, best practices. | 实时互动技术开发者社区 |
| Voice Agent Knowledge Base (Feishu) | Comprehensive Voice Agent knowledge base in Chinese. Systematic tutorials, practical experience. | Voice Agent 中文知识库 |
| Name | Description | Notes |
|---|---|---|
| Vapi | Platform for quickly building voice AI agents. Low-code, rich integrations, telephony support. | 低代码快速构建平台 |
| Retell AI | Conversational AI platform with enterprise-grade turn-taking management. | 企业级话轮管理 |
| Tavus | Real-time conversational video API. Transformer-based turn detection, multimodal video+voice. | 视频+语音多模态 |
| Name | Description | Notes |
|---|---|---|
| Voice AI & Voice Agents Primer | Comprehensive illustrated guide to voice AI. Architecture design, technical overview, best practices. | 全面的语音 AI 图解指南 |
| AssemblyAI Blog: Turn Detection | In-depth analysis of turn detection. Algorithm comparison, latency analysis. | 话轮检测深度解析 |
| LiveKit Blog: Transformer Turn Detection | Using transformers to improve endpointing. Technical details, performance comparison. | Transformer 改进端点检测 |
| Krisp Blog: Turn-Taking | Background noise cancellation improves turn-taking. Evaluation and results. | 背景噪音消除改善话轮 |
| Speechmatics: Semantic Turn Detection | Semantic turn detection with SLM. Implementation guide, threshold tuning. | 使用 SLM 的语义话轮检测 |
| Agora: TEN VAD & Turn Detection | Making voice agents more human with TEN VAD and Turn Detection. | TEN 的 VAD 和话轮检测 |
To the extent possible under law, the contributors have waived all copyright and related rights to this work.
Thanks to all open-source contributors making voice AI technology more accessible and powerful!
If this list helps you, please give it a ⭐️!
Maintainer | 维护者: 云中江树
WeChat | 微信: 1796060717 (加入 Voice Agent 交流群)
Welcome to exchange and discuss Voice Agent technology through Issues or PRs!
欢迎通过 Issues 或 PR 交流讨论 Voice Agent 相关技术!
