A WebSocket server library for building voice AI agents. It handles the full speech pipeline: receive audio from a client, transcribe it (STT), pass the text to your agent, synthesize the response (TTS), and stream the audio back.
```
Client → [binary audio chunks] → WebSocket → STT → AgentHandler → TTS → [WAV audio] → Client
```
- Client streams raw audio as binary WebSocket frames, then sends `{ type: "audio_end" }`.
- Server transcribes the audio via a Speaches-compatible STT API.
- The transcript and conversation history are passed to your `AgentHandler`.
- The agent's text response is synthesized to WAV audio via a Speaches TTS API.
- Audio is sent back to the client as base64-encoded chunks.
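The client's send side of this flow can be sketched as below. `streamUtterance` and the mock socket are illustrative, not part of the library; a real client would pass a `WebSocket` from the `ws` package or the browser.

```typescript
// Minimal sketch of the client's send side. Anything with a `send` method
// works, so the flow can be exercised without a network connection.
interface SendLike {
  send(data: Buffer | string): void;
}

function streamUtterance(socket: SendLike, chunks: Buffer[]): void {
  // 1. Stream raw audio (WAV) as binary frames.
  for (const chunk of chunks) socket.send(chunk);
  // 2. Signal end of utterance with a JSON text frame.
  socket.send(JSON.stringify({ type: "audio_end" }));
}

// Exercise the flow with a mock socket that records outgoing frames.
const sent: (Buffer | string)[] = [];
const mockSocket: SendLike = { send: (data) => sent.push(data) };
streamUtterance(mockSocket, [Buffer.from("RIFF"), Buffer.from("data")]);
// sent now holds two binary frames followed by the audio_end message
```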
```bash
pnpm add @myeungdev/voice-server
```

```ts
import { startServer } from "@myeungdev/voice-server";
import type { AgentHandler } from "@myeungdev/voice-server";

const handler: AgentHandler = async (transcript, history) => {
  // Call your LLM, use history for context, return a response
  const text = `You said: ${transcript}`;
  return { text, updatedHistory: history };
};

startServer(3000, handler);
```

`startServer(port, handler, options?)` starts a WebSocket server on the given port.
| Parameter | Type | Description |
|---|---|---|
| `port` | `number` | Port to listen on |
| `handler` | `AgentHandler` | Default handler called for each utterance |
| `options` | `ServerOptions` | Optional lifecycle hooks (see below) |
Directly synthesize text to a WAV Buffer using the configured TTS service.
```ts
type AgentHandler = (
  transcript: string,
  history: BaseMessage[], // LangChain message history
) => Promise<{ text: string; updatedHistory: BaseMessage[] }>;
```

```ts
interface ServerOptions {
  // Called on new connection. Return an AgentHandler to override the default for this session.
  onConnect?: (ws: WebSocket, session: Session) => Promise<AgentHandler | void>;

  // Called when a client disconnects.
  onDisconnect?: (session: Session) => Promise<void>;
}
```
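As a hedged sketch of the lifecycle hooks, an `onConnect` override might look like the following. The `WebSocket` and `Session` types are stubbed locally so the snippet runs standalone, and `session.id` is an assumed field used only for illustration.

```typescript
// Local stand-ins for the library types so this sketch is self-contained.
// In real code, import AgentHandler and ServerOptions from the package.
type Handler = (
  transcript: string,
  history: unknown[],
) => Promise<{ text: string; updatedHistory: unknown[] }>;

type Options = {
  onConnect?: (ws: unknown, session: { id?: string }) => Promise<Handler | void>;
  onDisconnect?: (session: { id?: string }) => Promise<void>;
};

const options: Options = {
  // Returning a Handler overrides the default handler for this session only;
  // returning nothing keeps the default.
  onConnect: async (_ws, session) => {
    const sessionHandler: Handler = async (transcript, history) => ({
      text: `Session ${session.id ?? "?"} heard: ${transcript}`,
      updatedHistory: history,
    });
    return sessionHandler;
  },
  // Tear down any per-session state here.
  onDisconnect: async (_session) => {},
};
```

An object shaped like this would be passed as the third argument to `startServer`.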
Client → Server

| Message | Description |
|---|---|
| Binary frame | Raw audio data (WAV) |
| `{ type: "audio_end" }` | Signals end of utterance, triggers processing |
Server → Client
| Message | Description |
|---|---|
| `{ type: "ready" }` | Server is ready to receive audio |
| `{ type: "processing" }` | Utterance received, transcription started |
| `{ type: "transcription", text }` | STT result |
| `{ type: "response_text", text }` | Agent text response |
| `{ type: "audio_start" }` | TTS audio stream begins |
| `{ type: "audio_chunk", data }` | Base64-encoded WAV chunk |
| `{ type: "audio_end" }` | TTS audio stream complete |
| `{ type: "error", message }` | Error during any stage |
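Putting the message table into practice, a client might collect the base64 `audio_chunk` payloads between `audio_start` and `audio_end` into a single WAV buffer. The collector below is an illustrative sketch, not library API:

```typescript
// Discriminated union mirroring the server → client messages above.
type ServerMessage =
  | { type: "ready" }
  | { type: "processing" }
  | { type: "transcription"; text: string }
  | { type: "response_text"; text: string }
  | { type: "audio_start" }
  | { type: "audio_chunk"; data: string }
  | { type: "audio_end" }
  | { type: "error"; message: string };

// Accumulates decoded chunks; result() yields the full WAV after audio_end.
function createAudioCollector() {
  let chunks: Buffer[] = [];
  let wav: Buffer | null = null;
  return {
    handle(msg: ServerMessage): void {
      switch (msg.type) {
        case "audio_start":
          chunks = []; // a new TTS stream begins
          break;
        case "audio_chunk":
          chunks.push(Buffer.from(msg.data, "base64")); // decode base64 WAV bytes
          break;
        case "audio_end":
          wav = Buffer.concat(chunks); // stream complete
          break;
      }
    },
    result: () => wav,
  };
}
```

A real client would call `collector.handle(JSON.parse(event.data))` for each text frame and hand `result()` to an audio player once `audio_end` arrives.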
All configuration is via environment variables.
| Variable | Default | Description |
|---|---|---|
| `SPEACHES_BASE_URL` | `http://speaches:8000` | Base URL for both STT and TTS services |
| `STT_MODEL` | `Systran/faster-whisper-base.en` | Whisper model for transcription |
| `TTS_MODEL` | `speaches-ai/Kokoro-82M-v1.0-ONNX-int8` | Kokoro TTS model |
| `TTS_VOICE` | `af_heart` | TTS voice ID |
| `TTS_SPEED` | `1.0` | TTS playback speed multiplier |
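For example, pointing the server at a locally running Speaches instance might look like this (the host is illustrative; the model values are the defaults from the table above):

```shell
# Override defaults before starting the server (host value illustrative)
export SPEACHES_BASE_URL=http://localhost:8000
export STT_MODEL=Systran/faster-whisper-base.en
export TTS_MODEL=speaches-ai/Kokoro-82M-v1.0-ONNX-int8
export TTS_VOICE=af_heart
export TTS_SPEED=1.0
```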
- `ws` — WebSocket server
- `@langchain/core` — Message history types