diff --git a/docs/plans/2026-03-10-hands-free-mode-design.md b/docs/plans/2026-03-10-hands-free-mode-design.md new file mode 100644 index 0000000..c554b55 --- /dev/null +++ b/docs/plans/2026-03-10-hands-free-mode-design.md @@ -0,0 +1,160 @@ +# Hands-Free Mode Design + +## Goal + +Add hands-free voice interaction to Claude Whisperer so users can have a conversation with the agent without pressing buttons. Also add hold-to-talk as a lighter-weight alternative. Improve TTS latency to sub-one-second. + +## Interaction Modes + +Three modes, selectable from the menubar: + +### 1. Press-to-talk (existing) + +Press PTT key to start recording, press again to stop. Transcribes and auto-submits. + +### 2. Hold-to-talk (new) + +Hold PTT key to record. Release to stop, transcribe, and auto-submit. Feels immediate — no second keypress needed. + +### 3. Hands-free (new) + +Mic listens continuously. No buttons required for normal conversation flow. + +**End of turn:** +- 3 seconds of silence triggers transcription and submit. +- Tapping the PTT key submits instantly (skip the silence wait). + +**During agent TTS playback:** +- Mic is muted to prevent echo/feedback. +- Lightweight keyword detection stays active, listening for "hold on." + +**Barge-in:** +- Say "hold on" while the agent is speaking. +- TTS stops immediately, mic unmutes, starts capturing new input. +- The interrupted response remains visible on screen. + +## Architecture + +Hybrid approach — Swift handles real-time audio, Python handles heavy ML. + +### Swift app responsibilities + +- Mic capture and audio buffering (all modes). +- Mode state machine: idle, recording, waiting-for-silence, playing-tts. +- Silence detection using audio energy levels (3-second threshold). +- Keyword spotting during TTS playback via Apple Speech framework ("hold on"). +- Hold-to-talk: detect key-down/key-up events (not just toggles). +- Send buffered audio to Python server for transcription. 
+- UI: mode selector in menubar, overlay updates per mode. + +### Python server responsibilities + +- Whisper transcription (unchanged). +- Kokoro TTS (unchanged). +- Auto-submit: press Enter after transcription (already implemented). +- Kill TTS on barge-in signal from Swift app (existing `kill_tts`). + +### Communication + +- Swift → Python: HTTP POST `/v1/audio/transcriptions` (existing). +- Swift → Python: signal to kill TTS (new — could be HTTP endpoint or direct `pkill`). +- Python → Swift: transcription result in JSON response (existing). + +## PTT Key + +Same configurable key (Control, Option, Cmd, Fn) across all three modes. Already configurable in the app. Behavior changes based on selected mode: + +| Mode | Key down | Key up | Silence (3s) | +|------|----------|--------|--------------| +| Press-to-talk | Toggle recording | Toggle recording | — | +| Hold-to-talk | Start recording | Stop + submit | — | +| Hands-free | Instant submit | — | Auto-submit | + +## TTS Latency + +Target: sub-one-second from response to first audio. + +Current breakdown (~1.5-2s): +- Hook startup (bash, jq, curl health check): ~300ms +- TTS generation (Kokoro): ~1000ms +- Sequencing overhead: ~200ms + +Optimizations: +1. **Pre-warm Kokoro model** — generate a silent/dummy request on server start so the model and pipeline are loaded before the first real request. +2. **Skip health check in hook** — cache server status; if confirmed alive within last 30s, skip the curl check to `/models`. +3. **Reduce hook startup** — consider moving TTS request logic from bash into the Python server (eliminate shell/jq/curl overhead entirely). + +Stretch goal: streaming TTS (start playback while still generating) — depends on mlx_audio support. + +## Barge-in: Keyword Detection + +Use Apple's Speech framework (`SFSpeechRecognizer`) for lightweight, on-device keyword spotting. + +- Only active during TTS playback. +- Listens for the phrase "hold on." 
+- On detection: send kill signal to TTS, transition to recording state. +- Low resource usage — no Whisper inference needed, runs on Apple's built-in speech engine. + +Alternative if Speech framework is too heavy: use audio energy spike detection as a simpler barge-in trigger (any loud input during TTS = interrupt). Less precise but near-zero cost. + +## Silence Detection + +Use RMS (root mean square) energy of audio buffer frames. + +- Threshold calibrated against ambient noise level (sample on mode activation). +- Timer starts when energy drops below threshold. +- After 3 seconds of continuous silence, trigger transcription and submit. +- Any speech resets the timer. +- PTT key tap bypasses the timer for instant submit. + +## State Machine + +``` + ┌─────────────────────────┐ + │ IDLE │ + │ (mic on in hands-free) │ + └──────┬──────────────────┘ + │ voice detected / key press + v + ┌─────────────────────────┐ + │ RECORDING │ + │ (buffering audio) │ + └──────┬──────────────────┘ + │ silence 3s / key tap / key release + v + ┌─────────────────────────┐ + │ TRANSCRIBING │ + │ (Whisper processing) │ + └──────┬──────────────────┘ + │ text submitted + v + ┌─────────────────────────┐ + │ WAITING FOR AGENT │ + │ (mic muted) │ + └──────┬──────────────────┘ + │ TTS starts + v + ┌─────────────────────────┐ + │ PLAYING TTS │ + │ (keyword spotting on) │ + └──────┬──────────────────┘ + │ TTS ends / "hold on" detected + v + back to IDLE +``` + +## Out of Scope (v1) + +- Full echo cancellation (mic open during TTS without keyword gating). +- Automatic silence threshold tuning. +- Multiple keyword phrases. +- Streaming TTS playback (stretch goal, depends on mlx_audio). +- Wake word to activate hands-free from sleep. + +## Success Criteria + +- Hold-to-talk works reliably: hold key, speak, release, text appears and submits. +- Hands-free works for a multi-turn conversation: speak, wait 3s or tap key, agent responds via TTS, speak again. 
+- Saying "hold on" during TTS stops playback and starts listening. +- TTS latency under 1 second for typical VOICE tag content. +- No echo: agent's TTS output is not transcribed as user input. diff --git a/docs/plans/2026-03-10-hands-free-mode-implementation.md b/docs/plans/2026-03-10-hands-free-mode-implementation.md new file mode 100644 index 0000000..5bb655e --- /dev/null +++ b/docs/plans/2026-03-10-hands-free-mode-implementation.md @@ -0,0 +1,224 @@ +# Hands-Free Mode — Implementation Plan + +Companion to [hands-free-mode-design.md](./2026-03-10-hands-free-mode-design.md). + +## Prerequisites + +`HotkeyManager`, `DictationManager`, and `AudioRecorder` are compiled into the app binary but their source is not in the repo. These classes handle hotkey capture, recording lifecycle, and audio engine management. The plan below works within that constraint — extending behavior through the existing callback pattern rather than modifying those classes directly. + +If source becomes available, steps marked **(source-only)** can be done more cleanly. + +## Phase 1: Hold-to-Talk + +**Goal:** Hold PTT key to record, release to submit. Minimal change, high value. + +### Step 1.1 — Add interaction mode enum and persistence + +- Create `InteractionMode` enum: `.pressToTalk`, `.holdToTalk`, `.handsFree` +- Add `Paths.interactionMode` file path (e.g., `~/Library/Application Support/ClaudeWhisperer/interaction_mode`) +- Read/write mode on app launch and settings change. + +**Files:** `Paths.swift` + +### Step 1.2 — Add mode selector to menubar + +- Add a `Picker` in `MenuBarView.swift` within the Push-to-Talk section. +- Three options: "Press to Talk", "Hold to Talk", "Hands-Free". +- Persist selection via the file from Step 1.1. +- Publish mode to `AppDelegate` via environment or notification. + +**Files:** `MenuBarView.swift` + +### Step 1.3 — Wire key-up callback in AppDelegate + +- `HotkeyManager` currently exposes `onKeyDown` and `onToggle`. 
+- For hold-to-talk, we need `onKeyUp` (or repurpose `onToggle` based on mode). +- **If source available (source-only):** add `onKeyUp` callback to `HotkeyManager`. +- **Without source:** `onToggle` fires on both press and release in toggle mode. We can track timing in `AppDelegate` — if the gap between two toggles is short and mode is hold-to-talk, treat the second toggle as "release → submit." +- On release in hold-to-talk mode: call `dictationManager.toggle()` to stop recording, then trigger transcription + submit. + +**Files:** `AppDelegate.swift` + +### Step 1.4 — Auto-submit on transcription complete + +- Already implemented: `press_enter()` in `unified_server.py` fires when `auto_submit` flag exists. +- Verify hold-to-talk triggers the same transcription → submit flow. +- The Swift app's `DictationManager` posts audio to `/v1/audio/transcriptions` and pastes the result. The server's `press_enter()` then submits it. + +**Files:** Verification only, no changes expected. + +### Step 1.5 — Update overlay status text + +- In `TranscriptionOverlay.swift`, update `statusText` to reflect mode: + - Hold-to-talk recording: "Recording... release to submit" + - Hold-to-talk idle: "Hold [key] to talk" + +**Files:** `TranscriptionOverlay.swift` + +--- + +## Phase 2: TTS Latency + +**Goal:** Sub-one-second from response to first audio. + +### Step 2.1 — Pre-warm Kokoro model + +- On server startup, fire a dummy TTS request (empty or single-word) to load the model and pipeline into memory. +- This eliminates the "Fetching 63 files" and "Creating new KokoroPipeline" delay on first real request. + +**Files:** `servers/unified_server.py` (add `@app.on_event("startup")` handler) + +### Step 2.2 — Cache server health in TTS hook + +- Currently `tts-hook.sh` curls `/models` on every invocation (~150ms). +- Write a timestamp file on successful health check. Skip the check if timestamp is less than 30 seconds old. 
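The caching logic in Step 2.2 can be sketched as follows (Python for illustration only — the real hook is bash, and the status-file path is a placeholder):

```python
import os
import time

STATUS_FILE = "/tmp/claude_whisperer_server_ok"  # placeholder path, not the real one
TTL_SECONDS = 30


def record_health_check() -> None:
    """Touch the status file after a successful check of /models."""
    with open(STATUS_FILE, "w"):
        pass


def server_recently_healthy() -> bool:
    """True if a successful health check was recorded within the TTL,
    meaning the hook can skip the ~150ms curl to /models."""
    try:
        age = time.time() - os.path.getmtime(STATUS_FILE)
    except OSError:  # no status file yet
        return False
    return age < TTL_SECONDS
```

In the bash hook, the equivalent is comparing the timestamp file's mtime against the current time before deciding whether to curl.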
+ +**Files:** `hooks/tts-hook.sh` + +### Step 2.3 — Reduce hook shell overhead + +- The hook spawns bash, loads jq, builds JSON, runs curl — ~300ms before the TTS request even fires. +- Alternative: add a `/v1/audio/speak` endpoint to `unified_server.py` that accepts text directly (no audio file upload) and returns audio. The hook becomes a single curl POST. +- Or: have the server itself watch for response completion and generate TTS proactively (eliminates the hook entirely for hands-free mode). + +**Files:** `servers/unified_server.py`, `hooks/tts-hook.sh` + +### Step 2.4 — Measure and validate + +- Add timing logs at each stage (hook start, TTS request sent, audio received, playback started). +- Benchmark against the 1-second target. + +**Files:** `hooks/tts-hook.sh`, `servers/unified_server.py` + +--- + +## Phase 3: Hands-Free Mode + +**Goal:** Continuous mic, silence-based submit, barge-in with "hold on." + +### Step 3.1 — Continuous mic capture + +- In hands-free mode, `AudioRecorder` should start on mode activation and stay running. +- **Without source:** We can toggle the recorder on via `dictationManager.toggle()` when entering hands-free mode, and keep it running across transcription cycles. +- **With source (source-only):** Add a `startContinuous()` method to `AudioRecorder` that captures without the stop-on-toggle behavior. + +**Files:** `AppDelegate.swift` (or `AudioRecorder.swift` if source available) + +### Step 3.2 — Silence detection + +- Use `AudioRecorder.levelHistory` (already tracks audio energy for the waveform) to detect silence. +- Add a timer in `AppDelegate` or a new `HandsFreeController`: + - Sample audio level every 100ms. + - If level stays below threshold for 3 seconds, trigger transcription + submit. + - Any spike above threshold resets the timer. +- Calibrate threshold: sample ambient level for 1 second when entering hands-free mode. 
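The timer logic above is small enough to sketch (Python for illustration; the real version would live in Swift and read `AudioRecorder.levelHistory` — the class names come from the steps above, the margin value is illustrative):

```python
class SilenceDetector:
    """Feed one audio level per poll (e.g. every 100 ms); returns True once
    the level has stayed below threshold for the full silence window."""

    def __init__(self, threshold: float, silence_secs: float = 3.0,
                 poll_interval: float = 0.1) -> None:
        self.threshold = threshold
        self.required_quiet_polls = round(silence_secs / poll_interval)
        self.quiet_polls = 0

    def feed(self, level: float) -> bool:
        if level >= self.threshold:
            self.quiet_polls = 0  # any speech resets the timer
        else:
            self.quiet_polls += 1
        return self.quiet_polls >= self.required_quiet_polls


def calibrate_threshold(ambient_levels: list[float], margin: float = 2.0) -> float:
    """Sample ambient levels for ~1s on mode entry; the threshold sits a
    margin above the ambient average."""
    return margin * sum(ambient_levels) / len(ambient_levels)
```

A PTT key tap simply bypasses `feed` entirely and submits directly (Step 3.5).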
+ +**Files:** New `HandsFreeController.swift` (or inline in `AppDelegate.swift`) + +### Step 3.3 — Mic muting during TTS + +- When TTS playback starts (detected via `tts_playing.lock` file, already polled in `TranscriptionOverlay`), pause mic capture. +- When TTS ends (lock file removed), resume mic capture. +- This prevents Whisper from transcribing the agent's voice. + +**Files:** `AppDelegate.swift` or `HandsFreeController.swift` + +### Step 3.4 — Barge-in keyword detection + +- Use `SFSpeechRecognizer` (Apple Speech framework) for lightweight on-device keyword spotting. +- Only active during TTS playback (mic is otherwise muted, but keyword spotter listens on a separate tap). +- On detecting "hold on": + 1. Kill TTS playback (HTTP to server or direct `pkill afplay`). + 2. Remove `tts_playing.lock`. + 3. Unmute mic, transition to recording state. + +**Implementation notes:** +- `SFSpeechRecognizer` works with `SFSpeechAudioBufferRecognitionRequest` for live audio. +- Set `shouldReportPartialResults = true` to detect "hold on" as soon as it's spoken. +- Requires Speech Recognition permission (add to Info.plist: `NSSpeechRecognitionUsageDescription`). + +**Files:** New `KeywordSpotter.swift`, `Info.plist` + +### Step 3.5 — PTT key as instant submit in hands-free + +- In hands-free mode, tapping the PTT key should: + 1. Cancel the silence timer. + 2. Immediately stop recording and trigger transcription + submit. +- Same as press-to-talk "stop" behavior, just without needing a "start" press first (mic is already on). + +**Files:** `AppDelegate.swift` + +### Step 3.6 — Update overlay for hands-free + +- New status states: + - "Listening..." (idle, mic on, waiting for speech) + - "Recording..." (speech detected, buffering) + - "Submitting..." (silence detected or key tapped) + - "Agent thinking..." (waiting for response, mic muted) + - "Agent speaking..." (TTS playing, keyword spotter active) +- Show silence countdown in the last 1-2 seconds before auto-submit. 
+ +**Files:** `TranscriptionOverlay.swift` + +### Step 3.7 — Mode selector wiring + +- Connect the mode selector from Step 1.2 to the hands-free controller. +- On mode change: + - `pressToTalk` → stop continuous mic, use toggle behavior. + - `holdToTalk` → stop continuous mic, use hold behavior. + - `handsFree` → start continuous mic, activate silence detection. +- Persist across app restart. + +**Files:** `AppDelegate.swift`, `MenuBarView.swift` + +--- + +## Phase 4: Polish and Edge Cases + +### Step 4.1 — Graceful mode transitions + +- Switching modes while recording: stop current recording, discard audio, enter new mode cleanly. +- Switching modes while TTS playing: let TTS finish, then enter new mode. + +### Step 4.2 — Error recovery + +- If server crashes during hands-free mode, detect via health check failure, show error in overlay, pause mic until server recovers. +- If keyword spotter fails to initialize (permission denied), fall back to key-only barge-in. + +### Step 4.3 — Power and resource management + +- Hands-free mode keeps mic and possibly Speech framework active continuously. +- Add auto-sleep: if no interaction for N minutes, pause mic and show "Paused" in overlay. Any keypress resumes. +- Monitor energy impact and optimize polling intervals. + +### Step 4.4 — Testing + +- Manual test matrix: each mode × each action (start, stop, submit, barge-in, mode switch). +- Edge cases: rapid mode switching, server restart during recording, very long silence, very short utterances. + +--- + +## Dependency Graph + +``` +Phase 1 (Hold-to-Talk) + 1.1 → 1.2 → 1.3 → 1.4 → 1.5 + +Phase 2 (TTS Latency) — independent of Phase 1 + 2.1 → 2.2 → 2.3 → 2.4 + +Phase 3 (Hands-Free) — depends on Phase 1 + 1.5 → 3.1 → 3.2 → 3.3 → 3.4 → 3.5 → 3.6 → 3.7 + +Phase 4 (Polish) — depends on Phase 3 + 3.7 → 4.1 → 4.2 → 4.3 → 4.4 +``` + +Phases 1 and 2 can be worked on in parallel. + +## Open Questions + +1. **HotkeyManager source** — Is source available? 
Hold-to-talk is much cleaner with a real `onKeyUp` callback. Without it, we rely on timing heuristics.
+2. **AudioRecorder source** — Continuous capture for hands-free ideally needs a mode where the recorder doesn't stop on toggle. Without source, we work around it.
+3. **SFSpeechRecognizer resource usage** — Needs benchmarking. If too heavy for continuous keyword spotting, fall back to energy-spike detection for barge-in.
+4. **Silence threshold** — 3 seconds is the starting point. May need a user-configurable setting after testing.
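For question 3, the energy-spike fallback is cheap to prototype. A sketch (Python for illustration; the real trigger would run in the Swift audio tap, and the frame count is an illustrative guess):

```python
class SpikeBargeIn:
    """Fallback barge-in trigger: sustained loud input during TTS playback
    counts as an interrupt. Requiring several consecutive loud frames
    filters out clicks and short noises."""

    def __init__(self, threshold: float, consecutive_frames: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_frames = consecutive_frames
        self.loud_frames = 0

    def feed(self, level: float) -> bool:
        if level >= self.threshold:
            self.loud_frames += 1
        else:
            self.loud_frames = 0  # quiet frame resets the run
        return self.loud_frames >= self.consecutive_frames
```

On a True result, the controller would do exactly what keyword detection does in Step 3.4: kill TTS, remove `tts_playing.lock`, and unmute the mic.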