Add design document for three interaction modes: press-to-talk (existing), hold-to-talk, and hands-free with silence detection and barge-in support.
Pull request overview
Adds a design document describing new voice interaction modes (hold-to-talk and hands-free) and a latency plan for faster TTS, outlining a hybrid Swift (real-time audio/state) + Python (Whisper/Kokoro) architecture.
Changes:
- Documents three interaction modes (press-to-talk, hold-to-talk, hands-free) and their expected UX behavior.
- Proposes an app/server responsibility split and a state machine for recording/transcribing/TTS.
- Lists TTS latency optimizations (pre-warm, reduce hook overhead, reduce redundant health checks) and barge-in via keyword detection.
> ### Swift app responsibilities
>
> - Mic capture and audio buffering (all modes).
> - Mode state machine: idle, recording, waiting-for-silence, playing-tts.
The state list here (idle, recording, waiting-for-silence, playing-tts) doesn’t match the later diagram states (IDLE, RECORDING, TRANSCRIBING, WAITING FOR AGENT, PLAYING TTS). To avoid implementation drift, please align the names/definitions (and include or remove “waiting-for-silence” explicitly in the diagram if it’s a distinct state).
Suggested change:
```diff
- - Mode state machine: idle, recording, waiting-for-silence, playing-tts.
+ - Mode state machine: IDLE, RECORDING, TRANSCRIBING, WAITING FOR AGENT, PLAYING TTS.
```
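To make the alignment concrete, the unified state set could be captured as an explicit transition table so that any drift between the list and the diagram becomes a code review error rather than an implementation surprise. A minimal sketch (in Python for illustration; the real state machine lives in the Swift app, and the specific transitions shown are assumptions the design doc would need to pin down):

```python
from enum import Enum, auto

class Mode(Enum):
    """The five states from the review's suggested unified list."""
    IDLE = auto()
    RECORDING = auto()
    TRANSCRIBING = auto()
    WAITING_FOR_AGENT = auto()
    PLAYING_TTS = auto()

# Hypothetical allowed transitions; IDLE targets model cancellation,
# PLAYING_TTS -> RECORDING models barge-in.
TRANSITIONS = {
    Mode.IDLE: {Mode.RECORDING},
    Mode.RECORDING: {Mode.TRANSCRIBING, Mode.IDLE},
    Mode.TRANSCRIBING: {Mode.WAITING_FOR_AGENT, Mode.IDLE},
    Mode.WAITING_FOR_AGENT: {Mode.PLAYING_TTS, Mode.IDLE},
    Mode.PLAYING_TTS: {Mode.IDLE, Mode.RECORDING},
}

def transition(current: Mode, target: Mode) -> Mode:
    """Move to `target`, rejecting transitions the table does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

A "waiting-for-silence" condition, if kept, would then be either a sixth entry here or an internal detail of RECORDING, and the table forces that decision to be made explicitly.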
> - Kill TTS on barge-in signal from Swift app (existing `kill_tts`).
>
> ### Communication
>
> - Swift → Python: HTTP POST `/v1/audio/transcriptions` (existing).
> - Swift → Python: signal to kill TTS (new — could be HTTP endpoint or direct `pkill`).
Calling out “direct pkill” from the Swift app to stop TTS has security and reliability drawbacks (process-name collisions, killing unrelated processes, needing shell access). Prefer defining a dedicated server API (or existing kill_tts) with clear auth/validation and a documented request/response and failure behavior.
Suggested change:
```diff
- - Kill TTS on barge-in signal from Swift app (existing `kill_tts`).
+ - Kill TTS on barge-in signal from Swift app via dedicated API wrapping existing `kill_tts`.
  ### Communication
  - Swift → Python: HTTP POST `/v1/audio/transcriptions` (existing).
- - Swift → Python: signal to kill TTS (new — could be HTTP endpoint or direct `pkill`).
+ - Swift → Python: HTTP POST `/v1/tts/kill` to stop TTS (new; calls server-side `kill_tts` with auth/validation).
```
> | Hands-free | Instant submit | — | Auto-submit |
For hands-free, the table says “Key down: Instant submit,” while earlier you describe “tapping” the PTT key to submit instantly. Submitting on key-down can fire even if the user presses-and-holds (or if key-repeat/keydown events are noisy). Please specify whether it’s a tap (down+up within a threshold) or key-down, and how debouncing/long-press is handled across modes.
Suggested change:
```diff
- | Hands-free | Instant submit | — | Auto-submit |
+ | Hands-free | Start tap window (no submit yet) | If tap (down+up within threshold) → instant submit | Auto-submit |
+
+ Tap = key-down followed by key-up within a short threshold (e.g., ~200–500 ms). In hands-free mode, long-presses beyond this window do **not** submit, and key-repeat/extra keydown events are debounced so a single tap produces at most one instant submit.
```
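The tap-vs-long-press rule above can be sketched as a small detector (shown in Python for illustration; the real implementation would be Swift-side, and the threshold and debounce constants are assumed values within the review's suggested range):

```python
TAP_THRESHOLD_S = 0.35   # hypothetical; review suggests ~200-500 ms
DEBOUNCE_S = 0.05        # ignore bounce/extra events within this window


class TapDetector:
    """Classifies key events as tap (instant submit) vs long-press (no submit)."""

    def __init__(self) -> None:
        self._down_at: float | None = None
        self._last_up = float("-inf")

    def key_down(self, t: float) -> None:
        # Key-repeat delivers extra key-down events while the key is held;
        # only the first one starts the tap window.
        if self._down_at is None:
            self._down_at = t

    def key_up(self, t: float) -> bool:
        """Return True iff this key-up completes a tap."""
        if self._down_at is None:
            return False
        held = t - self._down_at
        self._down_at = None
        if t - self._last_up < DEBOUNCE_S:
            return False  # bounce: a tap already fired just now
        self._last_up = t
        return held <= TAP_THRESHOLD_S
```

Because the decision is made on key-up, a noisy stream of key-down events cannot fire a submit on its own, which is exactly the ambiguity the review asks the doc to resolve.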
> Optimizations:
> 1. **Pre-warm Kokoro model** — generate a silent/dummy request on server start so the model and pipeline are loaded before the first real request.
> 2. **Skip health check in hook** — cache server status; if confirmed alive within last 30s, skip the curl check to `/models`.
The “skip health check” optimization relies on cached liveness for 30s, but the doc doesn’t specify a fallback if the server died or became unhealthy within that window. Please document the failure handling (e.g., on request failure/timeout, retry once with a fresh health check and surface a clear error) so latency improvements don’t increase flakiness.
Suggested change:
```diff
- 2. **Skip health check in hook** — cache server status; if confirmed alive within last 30s, skip the curl check to `/models`.
+ 2. **Skip health check in hook** — cache server status; if confirmed alive within last 30s, skip the curl check to `/models`. If a TTS request fails or times out while using cached liveness, retry once after forcing a fresh health check; if that retry also fails, surface a clear error to the user/logs and mark the server as unhealthy instead of continuing to skip health checks.
```
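The cache-plus-retry-once policy could look like the following sketch, with the health check and request sender injected so the hook's `curl` calls stay outside the core logic (class name, injection style, and error handling are assumptions, not the project's code):

```python
import time

HEALTH_TTL_S = 30.0  # from the doc: skip the check if confirmed alive within 30 s


class TTSClient:
    """Hook-side liveness cache with the review's retry-once fallback."""

    def __init__(self, check_health, send_request, now=time.monotonic):
        self._check_health = check_health   # e.g. wraps `curl GET /models`
        self._send = send_request           # e.g. wraps the actual TTS POST
        self._now = now
        self._last_ok = float("-inf")       # never confirmed alive yet

    def speak(self, text: str):
        # Fast path: skip the health check while the cache is fresh.
        if self._now() - self._last_ok > HEALTH_TTL_S:
            if not self._check_health():
                raise RuntimeError("TTS server unhealthy")
            self._last_ok = self._now()
        try:
            return self._send(text)
        except Exception:
            # Cached liveness may be stale: force one fresh check, retry once.
            if self._check_health():
                self._last_ok = self._now()
                return self._send(text)      # a second failure propagates
            self._last_ok = float("-inf")    # mark unhealthy; stop skipping checks
            raise RuntimeError("TTS server unhealthy")
```

The key property is that the latency win (no health check on the hot path) never costs more than one extra round trip when the cache turns out to be stale.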
> Alternative if Speech framework is too heavy: use audio energy spike detection as a simpler barge-in trigger (any loud input during TTS = interrupt). Less precise but near-zero cost.
Using SFSpeechRecognizer generally requires user Speech Recognition authorization and has platform/availability constraints. The design should note the permission flow and what happens if authorization is denied/unavailable (e.g., fall back to energy-spike barge-in or disable barge-in with UI messaging).
Suggested change:
```diff
- Alternative if Speech framework is too heavy: use audio energy spike detection as a simpler barge-in trigger (any loud input during TTS = interrupt). Less precise but near-zero cost.
+ - Permission flow:
+   - On first entry into hands-free mode (or first attempt to enable keyword barge-in), request Speech Recognition authorization via `SFSpeechRecognizer.requestAuthorization`.
+   - Enable keyword spotting only if authorization is `.authorized` **and** the recognizer for the current locale reports `isAvailable == true`.
+   - If authorization is denied/restricted/not granted or `SFSpeechRecognizer` is unavailable on the current platform:
+     - Disable Speech-framework-based keyword barge-in.
+     - Fall back to an energy-spike-based barge-in trigger (if enabled in settings), or disable barge-in entirely and show non-intrusive UI messaging (e.g., tooltip/banner: "Barge-in unavailable: speech recognition permission not granted").
+ - Platform constraints:
+   - Only expose Speech-based keyword barge-in on macOS versions that support `SFSpeechRecognizer` in the target app context; hide or gray out the option on unsupported systems.
+
+ Alternative (or fallback) if Speech framework is too heavy or unavailable: use audio energy spike detection as a simpler barge-in trigger (any loud input during TTS = interrupt). Less precise but near-zero cost.
```
> Use RMS (root mean square) energy of audio buffer frames.
>
> - Threshold calibrated against ambient noise level (sample on mode activation).
The silence section includes calibrating the threshold against ambient noise on mode activation, but the Out of Scope list says “Automatic silence threshold tuning.” Please reconcile this by clarifying whether one-time baseline sampling is considered in-scope (and what it does), or by adjusting the out-of-scope item to match the proposed behavior.
Suggested change:
```diff
- - Threshold calibrated against ambient noise level (sample on mode activation).
+ - One-time baseline sampling on mode activation to set the initial silence threshold relative to current ambient noise.
+ - Threshold is not continuously re-tuned during a session; fully automatic/ongoing silence threshold tuning remains out of scope (see Out of Scope).
```
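The RMS computation and one-time baseline described above are small enough to sketch directly (Python for illustration; the `margin` multiplier is a hypothetical tuning constant, not a value from the doc):

```python
import math


def rms(frame: list[float]) -> float:
    """Root-mean-square energy of one audio buffer frame (samples in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def calibrate_threshold(ambient_frames: list[list[float]], margin: float = 2.0) -> float:
    """One-time baseline on mode activation: threshold = margin x ambient RMS.

    Per the suggested wording, the returned threshold is then fixed for the
    session; there is no ongoing automatic re-tuning.
    """
    baseline = sum(rms(f) for f in ambient_frames) / len(ambient_frames)
    return baseline * margin


def is_silence(frame: list[float], threshold: float) -> bool:
    """True when the frame's energy sits below the calibrated threshold."""
    return rms(frame) < threshold
```

Framing the calibration as a pure function also makes the in-scope/out-of-scope boundary easy to state: it runs exactly once, on mode activation.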
> **During agent TTS playback:**
> - Mic is muted to prevent echo/feedback.
> - Lightweight keyword detection stays active, listening for "hold on."
The doc says the mic is muted during agent TTS playback, but also that keyword detection stays active listening for "hold on." If the mic is truly muted at the input level, keyword spotting can’t work; if only Whisper/recording is muted while still capturing audio for keyword spotting, please clarify that distinction (e.g., separate capture path and what exactly is muted).
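One way the doc could resolve this: keep a single capture path live and define "muted" as dropping frames from the recorder/Whisper branch only, while the keyword spotter keeps receiving audio. A minimal router sketch (Python for illustration; the class and consumer names are hypothetical, and the real routing would live in the Swift capture layer):

```python
class CaptureRouter:
    """One mic capture path feeding two consumers.

    "Muting" during TTS cuts only the recorder/transcription branch; the
    lightweight keyword spotter stays fed so barge-in still works.
    """

    def __init__(self, recorder, keyword_spotter):
        self._recorder = recorder        # frames destined for Whisper
        self._spotter = keyword_spotter  # lightweight "hold on" detector
        self.tts_playing = False

    def on_frame(self, frame):
        if not self.tts_playing:
            self._recorder(frame)        # normal path: buffer for transcription
        self._spotter(frame)             # always on, enabling barge-in
```

Whether echo suppression then happens inside the spotter (e.g., by ignoring the TTS's own audio) is a separate question the design would still need to answer.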
Four-phase plan covering hold-to-talk, TTS latency optimization, hands-free mode with silence detection and barge-in, and polish. Phases 1 and 2 can run in parallel.
Summary:
- Design doc: `docs/plans/2026-03-10-hands-free-mode-design.md`