feat(serve): mlx-live-style voice pipeline (#134)#135

Merged
crrow merged 1 commit into main from issue-134-voice-pipeline
Apr 9, 2026

Conversation


@crrow crrow commented Apr 9, 2026

Summary

Complete rewrite of the voice chat demo to match mlx-live's architecture. Everything runs server-side through a single WebSocket.

Architecture

```
Browser: AudioWorklet → 16kHz PCM int16 → WS binary
Server (/ws/voice):
VAD (RMS energy) → speech end → ASR (Whisper API) → LLM (OpenAI stream)
→ sentence buffer → TTS (Kokoro) → float32 PCM → WS binary
Browser: float32 PCM → AudioContext scheduled playback
```
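
The VAD stage operates directly on this 16 kHz int16 stream. A minimal sketch of the two primitives that stage needs, decoding little-endian PCM bytes and computing per-frame RMS energy (function names are illustrative, not the actual `voice.rs` API):

```rust
/// Decode little-endian 16-bit PCM bytes into normalized f32 samples in [-1, 1).
fn pcm16le_to_f32(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        .map(|b| i16::from_le_bytes([b[0], b[1]]) as f32 / 32768.0)
        .collect()
}

/// Root-mean-square energy of a frame; a threshold on this value
/// gives a simple "is someone speaking?" signal.
fn rms_energy(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}
```

Speech start/end detection then reduces to comparing `rms_energy` of each incoming frame against a tuned threshold, with some hangover frames before declaring the end of an utterance.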

What changed

  • New `src/serve/voice.rs` (779 lines) — full pipeline: VAD, ASR proxy, LLM streaming proxy, TTS, interruption, chat history
  • New `src/serve/pcm_worklet.js` — AudioWorklet for 16kHz recording
  • Rewritten `src/serve/demo.html` — Siri-like orb UI, no browser-side STT/LLM, settings modal
  • New routes: `WS /ws/voice`, `GET /static/pcm-recorder-worklet.js`

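The chat history mentioned above is a per-connection rolling window. A hedged sketch of how such a capped window can be kept (types and names are hypothetical, not the actual `voice.rs` implementation):

```rust
use std::collections::VecDeque;

/// One user/assistant exchange.
struct Turn {
    user: String,
    assistant: String,
}

/// Rolling chat history capped at `max_turns` exchanges per connection.
struct ChatHistory {
    turns: VecDeque<Turn>,
    max_turns: usize,
}

impl ChatHistory {
    fn new(max_turns: usize) -> Self {
        Self { turns: VecDeque::new(), max_turns }
    }

    /// Record an exchange, evicting the oldest once the window is full.
    fn push(&mut self, user: String, assistant: String) {
        if self.turns.len() == self.max_turns {
            self.turns.pop_front();
        }
        self.turns.push_back(Turn { user, assistant });
    }

    fn len(&self) -> usize {
        self.turns.len()
    }
}
```

With `max_turns = 10` this matches the 10-turn window described in the commit message: the oldest exchange is dropped as each new one arrives, keeping LLM prompt size bounded.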
Why

The previous demo (browser Web Speech API + fetch LLM + /ws/tts) was:

  • Slow: 5-10s first-audio latency due to multiple round trips
  • Buggy: WS race conditions, suspended AudioContext, null `.send` errors
  • Browser-dependent: Web Speech API only works in Chrome

The server-side pipeline eliminates all three problems.

Test plan

  • 26 tests pass (including 2 new: demo content + worklet JS)
  • Manual: start kotoba serve + Whisper server + Ollama, open /demo, speak
  • Manual: interruption works (speak while AI is talking)

Closes #134

…LM/TTS (#134)

Replace browser-side orchestration with a server-side full pipeline,
matching mlx-live's architecture. Single WebSocket, raw PCM streaming.

Server pipeline (/ws/voice):
- VAD: RMS energy threshold, speech start/end detection
- ASR: proxy to configurable Whisper-compatible endpoint
- LLM: streaming proxy to OpenAI-compatible endpoint with SSE parsing
- TTS: Kokoro via existing TtsBackend, output as float32 PCM frames
- Interruption: VAD detects speech during TTS, cancels remaining output
- Chat history: 10-turn rolling window per connection
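
The sentence buffer sits between the streaming LLM and TTS: tokens accumulate until a sentence boundary appears, then the complete sentence is handed off for synthesis so audio can start before the full reply arrives. A minimal sketch of that buffering logic (names illustrative; the real code may use different boundary rules):

```rust
/// Accumulates streamed LLM text and yields complete sentences for TTS.
struct SentenceBuffer {
    buf: String,
}

impl SentenceBuffer {
    fn new() -> Self {
        Self { buf: String::new() }
    }

    /// Push a streamed token; return any complete sentences now available.
    fn push(&mut self, token: &str) -> Vec<String> {
        self.buf.push_str(token);
        let mut out = Vec::new();
        // Split at sentence-ending punctuation, keeping the remainder buffered.
        while let Some(idx) = self.buf.find(|c: char| matches!(c, '.' | '!' | '?')) {
            let rest = self.buf.split_off(idx + 1);
            let sentence = std::mem::replace(&mut self.buf, rest);
            let sentence = sentence.trim().to_string();
            if !sentence.is_empty() {
                out.push(sentence);
            }
        }
        out
    }

    /// Flush whatever remains when the LLM stream ends.
    fn flush(&mut self) -> Option<String> {
        let rest = std::mem::take(&mut self.buf).trim().to_string();
        (!rest.is_empty()).then_some(rest)
    }
}
```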

Frontend (demo.html rewrite):
- Siri-like orb animation (idle/recording/generating states)
- AudioWorklet for 16kHz PCM recording (pcm_worklet.js)
- Float32 PCM playback via AudioContext scheduling
- Settings modal with localStorage persistence
- No Web Speech API, no browser-side LLM fetch, no CORS issues

New endpoints:
- WS /ws/voice — full voice conversation pipeline
- GET /static/pcm-recorder-worklet.js — AudioWorklet module

Closes #134

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@crrow crrow added the enhancement New feature or request label Apr 9, 2026
@crrow crrow merged commit 97fce46 into main Apr 9, 2026
0 of 5 checks passed
