feat(serve): mlx-live-style voice pipeline (#134)#135

Merged
crrow merged 1 commit into main from issue-134-voice-pipeline
Apr 9, 2026

Conversation


@crrow crrow commented Apr 9, 2026

Summary

Complete rewrite of the voice chat demo to match mlx-live's architecture. Everything runs server-side through a single WebSocket.

Architecture

```
Browser: AudioWorklet → 16kHz PCM int16 → WS binary
Server (/ws/voice):
VAD (RMS energy) → speech end → ASR (Whisper API) → LLM (OpenAI stream)
→ sentence buffer → TTS (Kokoro) → float32 PCM → WS binary
Browser: float32 PCM → AudioContext scheduled playback
```
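
The VAD stage operates directly on this 16 kHz int16 stream. A minimal sketch of the two primitives that stage needs, decoding little-endian PCM bytes and computing per-frame RMS energy (function names are illustrative, not the actual `voice.rs` API):

```rust
/// Decode little-endian 16-bit PCM bytes into normalized f32 samples in [-1, 1).
fn pcm16le_to_f32(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        .map(|b| i16::from_le_bytes([b[0], b[1]]) as f32 / 32768.0)
        .collect()
}

/// Root-mean-square energy of a frame; a threshold on this value
/// gives a simple "is someone speaking?" signal.
fn rms_energy(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}
```

Speech start/end detection then reduces to comparing `rms_energy` of each incoming frame against a tuned threshold, with some hangover frames before declaring the end of an utterance.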

What changed

  • New `src/serve/voice.rs` (779 lines) — full pipeline: VAD, ASR proxy, LLM streaming proxy, TTS, interruption, chat history
  • New `src/serve/pcm_worklet.js` — AudioWorklet for 16kHz recording
  • Rewritten `src/serve/demo.html` — Siri-like orb UI, no browser-side STT/LLM, settings modal
  • New routes: `WS /ws/voice`, `GET /static/pcm-recorder-worklet.js`

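The chat history mentioned above is a per-connection rolling window. A hedged sketch of how such a capped window can be kept (types and names are hypothetical, not the actual `voice.rs` implementation):

```rust
use std::collections::VecDeque;

/// One user/assistant exchange.
struct Turn {
    user: String,
    assistant: String,
}

/// Rolling chat history capped at `max_turns` exchanges per connection.
struct ChatHistory {
    turns: VecDeque<Turn>,
    max_turns: usize,
}

impl ChatHistory {
    fn new(max_turns: usize) -> Self {
        Self { turns: VecDeque::new(), max_turns }
    }

    /// Record an exchange, evicting the oldest once the window is full.
    fn push(&mut self, user: String, assistant: String) {
        if self.turns.len() == self.max_turns {
            self.turns.pop_front();
        }
        self.turns.push_back(Turn { user, assistant });
    }

    fn len(&self) -> usize {
        self.turns.len()
    }
}
```

With `max_turns = 10` this matches the 10-turn window described in the commit message: the oldest exchange is dropped as each new one arrives, keeping LLM prompt size bounded.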
Why

The previous demo (browser Web Speech API + fetch LLM + /ws/tts) was:

  • Slow: 5-10s first-audio latency due to multiple round trips
  • Buggy: WS race conditions, suspended AudioContext, null `.send` errors
  • Browser-dependent: Web Speech API only works in Chrome

The server-side pipeline eliminates all three problems.

Test plan

  • 26 tests pass (including 2 new: demo content + worklet JS)
  • Manual: start kotoba serve + Whisper server + Ollama, open /demo, speak
  • Manual: interruption works (speak while AI is talking)

Closes #134

…LM/TTS (#134)

Replace browser-side orchestration with a server-side full pipeline,
matching mlx-live's architecture. Single WebSocket, raw PCM streaming.

Server pipeline (/ws/voice):
- VAD: RMS energy threshold, speech start/end detection
- ASR: proxy to configurable Whisper-compatible endpoint
- LLM: streaming proxy to OpenAI-compatible endpoint with SSE parsing
- TTS: Kokoro via existing TtsBackend, output as float32 PCM frames
- Interruption: VAD detects speech during TTS, cancels remaining output
- Chat history: 10-turn rolling window per connection
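
The sentence buffer sits between the streaming LLM and TTS: tokens accumulate until a sentence boundary appears, then the complete sentence is handed off for synthesis so audio can start before the full reply arrives. A minimal sketch of that buffering logic (names illustrative; the real code may use different boundary rules):

```rust
/// Accumulates streamed LLM text and yields complete sentences for TTS.
struct SentenceBuffer {
    buf: String,
}

impl SentenceBuffer {
    fn new() -> Self {
        Self { buf: String::new() }
    }

    /// Push a streamed token; return any complete sentences now available.
    fn push(&mut self, token: &str) -> Vec<String> {
        self.buf.push_str(token);
        let mut out = Vec::new();
        // Split at sentence-ending punctuation, keeping the remainder buffered.
        while let Some(idx) = self.buf.find(|c: char| matches!(c, '.' | '!' | '?')) {
            let rest = self.buf.split_off(idx + 1);
            let sentence = std::mem::replace(&mut self.buf, rest);
            let sentence = sentence.trim().to_string();
            if !sentence.is_empty() {
                out.push(sentence);
            }
        }
        out
    }

    /// Flush whatever remains when the LLM stream ends.
    fn flush(&mut self) -> Option<String> {
        let rest = std::mem::take(&mut self.buf).trim().to_string();
        (!rest.is_empty()).then_some(rest)
    }
}
```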

Frontend (demo.html rewrite):
- Siri-like orb animation (idle/recording/generating states)
- AudioWorklet for 16kHz PCM recording (pcm_worklet.js)
- Float32 PCM playback via AudioContext scheduling
- Settings modal with localStorage persistence
- No Web Speech API, no browser-side LLM fetch, no CORS issues

New endpoints:
- WS /ws/voice — full voice conversation pipeline
- GET /static/pcm-recorder-worklet.js — AudioWorklet module

Closes #134

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@crrow crrow added the enhancement New feature or request label Apr 9, 2026
@crrow crrow merged commit 97fce46 into main Apr 9, 2026
0 of 5 checks passed
