Design: hands-free voice mode#2

Open
hakanensari wants to merge 2 commits into PerIPan:main from hakanensari:design/hands-free-mode

Conversation

@hakanensari
Summary

  • Design document for three interaction modes: press-to-talk (existing), hold-to-talk (new), and hands-free (new)
  • Hands-free: mic always on, 3s silence auto-submits, "hold on" for barge-in, mic mutes during TTS
  • Hold-to-talk: hold key to record, release to submit
  • TTS latency target: sub-1-second (pre-warm model, skip redundant health checks)
  • Hybrid architecture: Swift handles mic/keyword detection, Python handles Whisper/Kokoro

Design doc

docs/plans/2026-03-10-hands-free-mode-design.md

Add design document for three interaction modes: press-to-talk (existing),
hold-to-talk, and hands-free with silence detection and barge-in support.
Copilot AI review requested due to automatic review settings March 10, 2026 11:19
Copilot AI left a comment


Pull request overview

Adds a design document describing new voice interaction modes (hold-to-talk and hands-free) and a latency plan for faster TTS, outlining a hybrid Swift (real-time audio/state) + Python (Whisper/Kokoro) architecture.

Changes:

  • Documents three interaction modes (press-to-talk, hold-to-talk, hands-free) and their expected UX behavior.
  • Proposes an app/server responsibility split and a state machine for recording/transcribing/TTS.
  • Lists TTS latency optimizations (pre-warm, reduce hook overhead, reduce redundant health checks) and barge-in via keyword detection.


### Swift app responsibilities

- Mic capture and audio buffering (all modes).
- Mode state machine: idle, recording, waiting-for-silence, playing-tts.

Copilot AI Mar 10, 2026


The state list here (idle, recording, waiting-for-silence, playing-tts) doesn’t match the later diagram states (IDLE, RECORDING, TRANSCRIBING, WAITING FOR AGENT, PLAYING TTS). To avoid implementation drift, please align the names/definitions (and include or remove “waiting-for-silence” explicitly in the diagram if it’s a distinct state).

Suggested change
- Mode state machine: idle, recording, waiting-for-silence, playing-tts.
- Mode state machine: IDLE, RECORDING, TRANSCRIBING, WAITING FOR AGENT, PLAYING TTS.

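To make the aligned state names concrete, here is a minimal, mode-agnostic sketch of the state machine the review suggestion describes. The state names follow the suggested change (IDLE, RECORDING, TRANSCRIBING, WAITING FOR AGENT, PLAYING TTS); the specific transition edges are assumptions, since the doc does not enumerate them.

```python
from enum import Enum, auto

class Mode(Enum):
    IDLE = auto()
    RECORDING = auto()
    TRANSCRIBING = auto()
    WAITING_FOR_AGENT = auto()
    PLAYING_TTS = auto()

# Allowed transitions. Edges are assumptions inferred from the design's
# flow (record -> transcribe -> wait for agent -> play TTS), plus
# cancel/error paths back to IDLE and barge-in from PLAYING_TTS.
TRANSITIONS = {
    Mode.IDLE: {Mode.RECORDING},
    Mode.RECORDING: {Mode.TRANSCRIBING, Mode.IDLE},          # submit or cancel
    Mode.TRANSCRIBING: {Mode.WAITING_FOR_AGENT, Mode.IDLE},  # success or error
    Mode.WAITING_FOR_AGENT: {Mode.PLAYING_TTS, Mode.IDLE},
    Mode.PLAYING_TTS: {Mode.IDLE, Mode.RECORDING},           # finished or barge-in
}

def transition(current: Mode, nxt: Mode) -> Mode:
    """Validate and apply a state transition; reject anything not in the table."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

Encoding the edges in one table makes the Swift implementation easy to diff against the design doc later.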
Comment on lines +55 to +60
- Kill TTS on barge-in signal from Swift app (existing `kill_tts`).

### Communication

- Swift → Python: HTTP POST `/v1/audio/transcriptions` (existing).
- Swift → Python: signal to kill TTS (new — could be HTTP endpoint or direct `pkill`).

Copilot AI Mar 10, 2026


Calling out “direct pkill” from the Swift app to stop TTS has security and reliability drawbacks (process-name collisions, killing unrelated processes, needing shell access). Prefer defining a dedicated server API (or existing kill_tts) with clear auth/validation and a documented request/response and failure behavior.

Suggested change
- Kill TTS on barge-in signal from Swift app (existing `kill_tts`).
### Communication
- Swift → Python: HTTP POST `/v1/audio/transcriptions` (existing).
- Swift → Python: signal to kill TTS (new — could be HTTP endpoint or direct `pkill`).
- Kill TTS on barge-in signal from Swift app via dedicated API wrapping existing `kill_tts`.
### Communication
- Swift → Python: HTTP POST `/v1/audio/transcriptions` (existing).
- Swift → Python: HTTP POST `/v1/tts/kill` to stop TTS (new; calls server-side `kill_tts` with auth/validation).

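A server-side handler along the lines of the suggested `/v1/tts/kill` endpoint could look like the sketch below. The shared-secret check and the function names (`handle_kill_tts`, `EXPECTED_TOKEN`) are assumptions for illustration; the review only asks for a dedicated API with auth/validation wrapping the existing `kill_tts`.

```python
import hmac

EXPECTED_TOKEN = "change-me"  # assumed shared secret between Swift app and server

def handle_kill_tts(auth_token: str, kill_tts) -> tuple[int, str]:
    """Return (http_status, body) for a POST /v1/tts/kill request.

    kill_tts is the existing server-side stop function; calling it
    in-process avoids the pkill process-name-collision problem.
    """
    # Constant-time comparison to avoid timing side channels on the token.
    if not hmac.compare_digest(auth_token, EXPECTED_TOKEN):
        return 403, "forbidden"
    try:
        kill_tts()
    except Exception as exc:  # surface failure instead of silently dropping it
        return 500, f"kill_tts failed: {exc}"
    return 200, "stopped"
```

A documented status code for each outcome (403/500/200) gives the Swift side a defined failure behavior, which the review comment calls for.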
Comment on lines +71 to +72
| Hands-free | Instant submit | — | Auto-submit |


Copilot AI Mar 10, 2026


For hands-free, the table says “Key down: Instant submit,” while earlier you describe “tapping” the PTT key to submit instantly. Submitting on key-down can fire even if the user presses-and-holds (or if key-repeat/keydown events are noisy). Please specify whether it’s a tap (down+up within a threshold) or key-down, and how debouncing/long-press is handled across modes.

Suggested change
| Hands-free | Instant submit | — | Auto-submit |
| Hands-free | Start tap window (no submit yet) | If tap (down+up within threshold) → instant submit | Auto-submit |
Tap = key-down followed by key-up within a short threshold (e.g., ~200–500 ms). In hands-free mode, long-presses beyond this window do **not** submit, and key-repeat/extra keydown events are debounced so a single tap produces at most one instant submit.

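The tap-vs-hold distinction with key-repeat debouncing can be sketched as a small detector. The 0.35 s threshold is an assumption picked from the ~200–500 ms range mentioned in the review comment.

```python
class TapDetector:
    """Classify a press as 'tap' (instant submit) or 'hold', debouncing key-repeat."""

    def __init__(self, threshold_s: float = 0.35):
        self.threshold_s = threshold_s
        self.down_at = None

    def key_down(self, t: float) -> None:
        # Ignore repeated keydown events while the key is already held.
        if self.down_at is None:
            self.down_at = t

    def key_up(self, t: float):
        if self.down_at is None:
            return None  # stray key-up with no matching key-down
        held = t - self.down_at
        self.down_at = None  # reset so a single tap submits at most once
        return "tap" if held <= self.threshold_s else "hold"
```

Routing submission off the `key_up` classification (rather than `key_down`) is what prevents a long-press or noisy keydown stream from firing an instant submit.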

Optimizations:
1. **Pre-warm Kokoro model** — generate a silent/dummy request on server start so the model and pipeline are loaded before the first real request.
2. **Skip health check in hook** — cache server status; if confirmed alive within last 30s, skip the curl check to `/models`.

Copilot AI Mar 10, 2026


The “skip health check” optimization relies on cached liveness for 30s, but the doc doesn’t specify a fallback if the server died or became unhealthy within that window. Please document the failure handling (e.g., on request failure/timeout, retry once with a fresh health check and surface a clear error) so latency improvements don’t increase flakiness.

Suggested change
2. **Skip health check in hook** — cache server status; if confirmed alive within last 30s, skip the curl check to `/models`.
2. **Skip health check in hook** — cache server status; if confirmed alive within last 30s, skip the curl check to `/models`. If a TTS request fails or times out while using cached liveness, retry once after forcing a fresh health check; if that retry also fails, surface a clear error to the user/logs and mark the server as unhealthy instead of continuing to skip health checks.

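The cached-liveness idea plus the fallback the review asks for can be sketched as follows; the class name and the injectable clock are illustration-only, and the 30 s TTL comes from the doc.

```python
import time

HEALTH_TTL_S = 30.0  # from the design doc: skip the check if alive within 30s

class CachedHealth:
    """Skip the /models health check when the server was seen alive recently."""

    def __init__(self, check, now=time.monotonic):
        self.check = check  # callable returning True if the server is healthy
        self.now = now
        self.last_ok = None

    def is_alive(self) -> bool:
        t = self.now()
        if self.last_ok is not None and t - self.last_ok < HEALTH_TTL_S:
            return True  # cached liveness, no curl needed
        if self.check():
            self.last_ok = self.now()
            return True
        self.last_ok = None
        return False

    def invalidate(self) -> None:
        """Call when a TTS request fails/times out, forcing a fresh check next time."""
        self.last_ok = None
```

The hook would call `invalidate()` on a failed request and retry once; if `is_alive()` then returns False, it surfaces an error instead of continuing to trust the cache.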
Comment on lines +97 to +98

Alternative if Speech framework is too heavy: use audio energy spike detection as a simpler barge-in trigger (any loud input during TTS = interrupt). Less precise but near-zero cost.

Copilot AI Mar 10, 2026


Using SFSpeechRecognizer generally requires user Speech Recognition authorization and has platform/availability constraints. The design should note the permission flow and what happens if authorization is denied/unavailable (e.g., fall back to energy-spike barge-in or disable barge-in with UI messaging).

Suggested change
Alternative if Speech framework is too heavy: use audio energy spike detection as a simpler barge-in trigger (any loud input during TTS = interrupt). Less precise but near-zero cost.
- Permission flow:
- On first entry into hands-free mode (or first attempt to enable keyword barge-in), request Speech Recognition authorization via `SFSpeechRecognizer.requestAuthorization`.
- Enable keyword spotting only if authorization is `.authorized` **and** the recognizer for the current locale reports `isAvailable == true`.
- If authorization is denied/restricted/not granted or `SFSpeechRecognizer` is unavailable on the current platform:
- Disable Speech-framework-based keyword barge-in.
- Fall back to an energy-spike-based barge-in trigger (if enabled in settings), or disable barge-in entirely and show non-intrusive UI messaging (e.g., tooltip/banner: “Barge-in unavailable: speech recognition permission not granted”).
- Platform constraints:
- Only expose Speech-based keyword barge-in on macOS versions that support `SFSpeechRecognizer` in the target app context; hide or gray out the option on unsupported systems.
Alternative (or fallback) if Speech framework is too heavy or unavailable: use audio energy spike detection as a simpler barge-in trigger (any loud input during TTS = interrupt). Less precise but near-zero cost.

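The fallback ladder in the suggested change (keyword spotting → energy spike → disabled with UI messaging) reduces to a small decision function; the function and parameter names are assumptions for illustration.

```python
def choose_barge_in(speech_authorized: bool,
                    recognizer_available: bool,
                    energy_fallback_enabled: bool) -> str:
    """Pick the barge-in mechanism per the permission/availability fallback rules."""
    if speech_authorized and recognizer_available:
        return "keyword"   # Speech-framework keyword spotting for "hold on"
    if energy_fallback_enabled:
        return "energy"    # any loud input during TTS interrupts playback
    return "disabled"      # show non-intrusive UI messaging instead
```

On the Swift side the two inputs would come from `SFSpeechRecognizer.requestAuthorization` and `isAvailable`, as the suggested change describes.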

Use RMS (root mean square) energy of audio buffer frames.

- Threshold calibrated against ambient noise level (sample on mode activation).

Copilot AI Mar 10, 2026


The silence section includes calibrating the threshold against ambient noise on mode activation, but the Out of Scope list says “Automatic silence threshold tuning.” Please reconcile this by clarifying whether one-time baseline sampling is considered in-scope (and what it does), or by adjusting the out-of-scope item to match the proposed behavior.

Suggested change
- Threshold calibrated against ambient noise level (sample on mode activation).
- One-time baseline sampling on mode activation to set the initial silence threshold relative to current ambient noise.
- Threshold is not continuously re-tuned during a session; fully automatic/ongoing silence threshold tuning remains out of scope (see Out of Scope).

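The RMS energy measure and the one-time baseline calibration can be sketched directly; the ×3 safety factor over ambient RMS is an assumption, since the doc only says the threshold is "calibrated against ambient noise level".

```python
import math

def rms(frame) -> float:
    """Root-mean-square energy of one audio frame (samples as floats in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def calibrate_threshold(ambient_frames, factor: float = 3.0) -> float:
    """One-time baseline on mode activation: threshold = factor x mean ambient RMS.

    Runs once per activation; the threshold is not re-tuned during the
    session, keeping ongoing automatic tuning out of scope.
    """
    baseline = sum(rms(f) for f in ambient_frames) / len(ambient_frames)
    return factor * baseline

def is_silence(frame, threshold: float) -> bool:
    return rms(frame) < threshold
```

In hands-free mode, 3 s of consecutive frames with `is_silence(...) == True` triggers the auto-submit.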
Comment on lines +27 to +30
**During agent TTS playback:**
- Mic is muted to prevent echo/feedback.
- Lightweight keyword detection stays active, listening for "hold on."


Copilot AI Mar 10, 2026


The doc says the mic is muted during agent TTS playback, but also that keyword detection stays active listening for "hold on." If the mic is truly muted at the input level, keyword spotting can’t work; if only Whisper/recording is muted while still capturing audio for keyword spotting, please clarify that distinction (e.g., separate capture path and what exactly is muted).

Four-phase plan covering hold-to-talk, TTS latency optimization,
hands-free mode with silence detection and barge-in, and polish.
Phases 1 and 2 can run in parallel.