Add design document for three interaction modes: press-to-talk (existing), hold-to-talk, and hands-free with silence detection and barge-in support.
Pull request overview
Adds a design document describing new voice interaction modes (hold-to-talk and hands-free) and a latency plan for faster TTS, outlining a hybrid Swift (real-time audio/state) + Python (Whisper/Kokoro) architecture.
Changes:
- Documents three interaction modes (press-to-talk, hold-to-talk, hands-free) and their expected UX behavior.
- Proposes an app/server responsibility split and a state machine for recording/transcribing/TTS.
- Lists TTS latency optimizations (pre-warm, reduce hook overhead, reduce redundant health checks) and barge-in via keyword detection.
> ### Swift app responsibilities
>
> - Mic capture and audio buffering (all modes).
> - Mode state machine: idle, recording, waiting-for-silence, playing-tts.
The state list here (idle, recording, waiting-for-silence, playing-tts) doesn’t match the later diagram states (IDLE, RECORDING, TRANSCRIBING, WAITING FOR AGENT, PLAYING TTS). To avoid implementation drift, please align the names/definitions (and include or remove “waiting-for-silence” explicitly in the diagram if it’s a distinct state).
Suggested change:
```diff
- - Mode state machine: idle, recording, waiting-for-silence, playing-tts.
+ - Mode state machine: IDLE, RECORDING, TRANSCRIBING, WAITING FOR AGENT, PLAYING TTS.
```
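To make the alignment concrete, the unified state set could be captured as an explicit transition table so that any drift between the list and the diagram becomes a code review error rather than an implementation surprise. A minimal sketch (in Python for illustration; the real state machine lives in the Swift app, and the specific transitions shown are assumptions the design doc would need to pin down):

```python
from enum import Enum, auto

class Mode(Enum):
    """The five states from the review's suggested unified list."""
    IDLE = auto()
    RECORDING = auto()
    TRANSCRIBING = auto()
    WAITING_FOR_AGENT = auto()
    PLAYING_TTS = auto()

# Hypothetical allowed transitions; IDLE targets model cancellation,
# PLAYING_TTS -> RECORDING models barge-in.
TRANSITIONS = {
    Mode.IDLE: {Mode.RECORDING},
    Mode.RECORDING: {Mode.TRANSCRIBING, Mode.IDLE},
    Mode.TRANSCRIBING: {Mode.WAITING_FOR_AGENT, Mode.IDLE},
    Mode.WAITING_FOR_AGENT: {Mode.PLAYING_TTS, Mode.IDLE},
    Mode.PLAYING_TTS: {Mode.IDLE, Mode.RECORDING},
}

def transition(current: Mode, target: Mode) -> Mode:
    """Move to `target`, rejecting transitions the table does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

A "waiting-for-silence" condition, if kept, would then be either a sixth entry here or an internal detail of RECORDING, and the table forces that decision to be made explicitly.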
> - Kill TTS on barge-in signal from Swift app (existing `kill_tts`).
>
> ### Communication
>
> - Swift → Python: HTTP POST `/v1/audio/transcriptions` (existing).
> - Swift → Python: signal to kill TTS (new — could be HTTP endpoint or direct `pkill`).
Calling out “direct pkill” from the Swift app to stop TTS has security and reliability drawbacks (process-name collisions, killing unrelated processes, needing shell access). Prefer defining a dedicated server API (or existing kill_tts) with clear auth/validation and a documented request/response and failure behavior.
Suggested change:
```diff
- - Kill TTS on barge-in signal from Swift app (existing `kill_tts`).
+ - Kill TTS on barge-in signal from Swift app via dedicated API wrapping existing `kill_tts`.
  ### Communication
  - Swift → Python: HTTP POST `/v1/audio/transcriptions` (existing).
- - Swift → Python: signal to kill TTS (new — could be HTTP endpoint or direct `pkill`).
+ - Swift → Python: HTTP POST `/v1/tts/kill` to stop TTS (new; calls server-side `kill_tts` with auth/validation).
```
> | Hands-free | Instant submit | — | Auto-submit |
For hands-free, the table says “Key down: Instant submit,” while earlier you describe “tapping” the PTT key to submit instantly. Submitting on key-down can fire even if the user presses-and-holds (or if key-repeat/keydown events are noisy). Please specify whether it’s a tap (down+up within a threshold) or key-down, and how debouncing/long-press is handled across modes.
Suggested change:
```diff
- | Hands-free | Instant submit | — | Auto-submit |
+ | Hands-free | Start tap window (no submit yet) | If tap (down+up within threshold) → instant submit | Auto-submit |
+
+ Tap = key-down followed by key-up within a short threshold (e.g., ~200–500 ms). In hands-free mode, long-presses beyond this window do **not** submit, and key-repeat/extra keydown events are debounced so a single tap produces at most one instant submit.
```
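The tap-vs-long-press rule above can be sketched as a small detector (shown in Python for illustration; the real implementation would be Swift-side, and the threshold and debounce constants are assumed values within the review's suggested range):

```python
TAP_THRESHOLD_S = 0.35   # hypothetical; review suggests ~200-500 ms
DEBOUNCE_S = 0.05        # ignore bounce/extra events within this window


class TapDetector:
    """Classifies key events as tap (instant submit) vs long-press (no submit)."""

    def __init__(self) -> None:
        self._down_at: float | None = None
        self._last_up = float("-inf")

    def key_down(self, t: float) -> None:
        # Key-repeat delivers extra key-down events while the key is held;
        # only the first one starts the tap window.
        if self._down_at is None:
            self._down_at = t

    def key_up(self, t: float) -> bool:
        """Return True iff this key-up completes a tap."""
        if self._down_at is None:
            return False
        held = t - self._down_at
        self._down_at = None
        if t - self._last_up < DEBOUNCE_S:
            return False  # bounce: a tap already fired just now
        self._last_up = t
        return held <= TAP_THRESHOLD_S
```

Because the decision is made on key-up, a noisy stream of key-down events cannot fire a submit on its own, which is exactly the ambiguity the review asks the doc to resolve.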
> Optimizations:
> 1. **Pre-warm Kokoro model** — generate a silent/dummy request on server start so the model and pipeline are loaded before the first real request.
> 2. **Skip health check in hook** — cache server status; if confirmed alive within last 30s, skip the curl check to `/models`.
The “skip health check” optimization relies on cached liveness for 30s, but the doc doesn’t specify a fallback if the server died or became unhealthy within that window. Please document the failure handling (e.g., on request failure/timeout, retry once with a fresh health check and surface a clear error) so latency improvements don’t increase flakiness.
Suggested change:
```diff
- 2. **Skip health check in hook** — cache server status; if confirmed alive within last 30s, skip the curl check to `/models`.
+ 2. **Skip health check in hook** — cache server status; if confirmed alive within last 30s, skip the curl check to `/models`. If a TTS request fails or times out while using cached liveness, retry once after forcing a fresh health check; if that retry also fails, surface a clear error to the user/logs and mark the server as unhealthy instead of continuing to skip health checks.
```
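The cache-plus-retry-once policy could look like the following sketch, with the health check and request sender injected so the hook's `curl` calls stay outside the core logic (class name, injection style, and error handling are assumptions, not the project's code):

```python
import time

HEALTH_TTL_S = 30.0  # from the doc: skip the check if confirmed alive within 30 s


class TTSClient:
    """Hook-side liveness cache with the review's retry-once fallback."""

    def __init__(self, check_health, send_request, now=time.monotonic):
        self._check_health = check_health   # e.g. wraps `curl GET /models`
        self._send = send_request           # e.g. wraps the actual TTS POST
        self._now = now
        self._last_ok = float("-inf")       # never confirmed alive yet

    def speak(self, text: str):
        # Fast path: skip the health check while the cache is fresh.
        if self._now() - self._last_ok > HEALTH_TTL_S:
            if not self._check_health():
                raise RuntimeError("TTS server unhealthy")
            self._last_ok = self._now()
        try:
            return self._send(text)
        except Exception:
            # Cached liveness may be stale: force one fresh check, retry once.
            if self._check_health():
                self._last_ok = self._now()
                return self._send(text)      # a second failure propagates
            self._last_ok = float("-inf")    # mark unhealthy; stop skipping checks
            raise RuntimeError("TTS server unhealthy")
```

The key property is that the latency win (no health check on the hot path) never costs more than one extra round trip when the cache turns out to be stale.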
> Alternative if Speech framework is too heavy: use audio energy spike detection as a simpler barge-in trigger (any loud input during TTS = interrupt). Less precise but near-zero cost.
Using SFSpeechRecognizer generally requires user Speech Recognition authorization and has platform/availability constraints. The design should note the permission flow and what happens if authorization is denied/unavailable (e.g., fall back to energy-spike barge-in or disable barge-in with UI messaging).
Suggested change:
```diff
- Alternative if Speech framework is too heavy: use audio energy spike detection as a simpler barge-in trigger (any loud input during TTS = interrupt). Less precise but near-zero cost.
+ - Permission flow:
+   - On first entry into hands-free mode (or first attempt to enable keyword barge-in), request Speech Recognition authorization via `SFSpeechRecognizer.requestAuthorization`.
+   - Enable keyword spotting only if authorization is `.authorized` **and** the recognizer for the current locale reports `isAvailable == true`.
+   - If authorization is denied/restricted/not granted or `SFSpeechRecognizer` is unavailable on the current platform:
+     - Disable Speech-framework-based keyword barge-in.
+     - Fall back to an energy-spike-based barge-in trigger (if enabled in settings), or disable barge-in entirely and show non-intrusive UI messaging (e.g., tooltip/banner: "Barge-in unavailable: speech recognition permission not granted").
+ - Platform constraints:
+   - Only expose Speech-based keyword barge-in on macOS versions that support `SFSpeechRecognizer` in the target app context; hide or gray out the option on unsupported systems.
+
+ Alternative (or fallback) if Speech framework is too heavy or unavailable: use audio energy spike detection as a simpler barge-in trigger (any loud input during TTS = interrupt). Less precise but near-zero cost.
```
> Use RMS (root mean square) energy of audio buffer frames.
>
> - Threshold calibrated against ambient noise level (sample on mode activation).
The silence section includes calibrating the threshold against ambient noise on mode activation, but the Out of Scope list says “Automatic silence threshold tuning.” Please reconcile this by clarifying whether one-time baseline sampling is considered in-scope (and what it does), or by adjusting the out-of-scope item to match the proposed behavior.
Suggested change:
```diff
- - Threshold calibrated against ambient noise level (sample on mode activation).
+ - One-time baseline sampling on mode activation to set the initial silence threshold relative to current ambient noise.
+ - Threshold is not continuously re-tuned during a session; fully automatic/ongoing silence threshold tuning remains out of scope (see Out of Scope).
```
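The RMS computation and one-time baseline described above are small enough to sketch directly (Python for illustration; the `margin` multiplier is a hypothetical tuning constant, not a value from the doc):

```python
import math


def rms(frame: list[float]) -> float:
    """Root-mean-square energy of one audio buffer frame (samples in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def calibrate_threshold(ambient_frames: list[list[float]], margin: float = 2.0) -> float:
    """One-time baseline on mode activation: threshold = margin x ambient RMS.

    Per the suggested wording, the returned threshold is then fixed for the
    session; there is no ongoing automatic re-tuning.
    """
    baseline = sum(rms(f) for f in ambient_frames) / len(ambient_frames)
    return baseline * margin


def is_silence(frame: list[float], threshold: float) -> bool:
    """True when the frame's energy sits below the calibrated threshold."""
    return rms(frame) < threshold
```

Framing the calibration as a pure function also makes the in-scope/out-of-scope boundary easy to state: it runs exactly once, on mode activation.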
> **During agent TTS playback:**
> - Mic is muted to prevent echo/feedback.
> - Lightweight keyword detection stays active, listening for "hold on."
The doc says the mic is muted during agent TTS playback, but also that keyword detection stays active listening for "hold on." If the mic is truly muted at the input level, keyword spotting can’t work; if only Whisper/recording is muted while still capturing audio for keyword spotting, please clarify that distinction (e.g., separate capture path and what exactly is muted).
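One way the doc could resolve this: keep a single capture path live and define "muted" as dropping frames from the recorder/Whisper branch only, while the keyword spotter keeps receiving audio. A minimal router sketch (Python for illustration; the class and consumer names are hypothetical, and the real routing would live in the Swift capture layer):

```python
class CaptureRouter:
    """One mic capture path feeding two consumers.

    "Muting" during TTS cuts only the recorder/transcription branch; the
    lightweight keyword spotter stays fed so barge-in still works.
    """

    def __init__(self, recorder, keyword_spotter):
        self._recorder = recorder        # frames destined for Whisper
        self._spotter = keyword_spotter  # lightweight "hold on" detector
        self.tts_playing = False

    def on_frame(self, frame):
        if not self.tts_playing:
            self._recorder(frame)        # normal path: buffer for transcription
        self._spotter(frame)             # always on, enabling barge-in
```

Whether echo suppression then happens inside the spotter (e.g., by ignoring the TTS's own audio) is a separate question the design would still need to answer.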
Four-phase plan covering hold-to-talk, TTS latency optimization, hands-free mode with silence detection and barge-in, and polish. Phases 1 and 2 can run in parallel.
Summary:
- Design doc: `docs/plans/2026-03-10-hands-free-mode-design.md`