# Hands-Free Mode Design

## Goal

Add hands-free voice interaction to Claude Whisperer so users can have a conversation with the agent without pressing buttons. Also add hold-to-talk as a lighter-weight alternative. Improve TTS latency to sub-one-second.

## Interaction Modes

Three modes, selectable from the menubar:

### 1. Press-to-talk (existing)

Press PTT key to start recording, press again to stop. Transcribes and auto-submits.

### 2. Hold-to-talk (new)

Hold PTT key to record. Release to stop, transcribe, and auto-submit. Feels immediate — no second keypress needed.

### 3. Hands-free (new)

Mic listens continuously. No buttons required for normal conversation flow.

**End of turn:**
- 3 seconds of silence triggers transcription and submit.
- Tapping the PTT key submits instantly (skip the silence wait).

**During agent TTS playback:**
- Recording for transcription is paused so the agent's own voice is never fed to Whisper (prevents echo/feedback).
- A separate lightweight capture tap stays active for keyword detection, listening for "hold on."

**Barge-in:**
- Say "hold on" while the agent is speaking.
- TTS stops immediately, mic unmutes, starts capturing new input.
- The interrupted response remains visible on screen.

## Architecture

Hybrid approach — Swift handles real-time audio, Python handles heavy ML.

### Swift app responsibilities

- Mic capture and audio buffering (all modes).
- Mode state machine: IDLE, RECORDING, TRANSCRIBING, WAITING FOR AGENT, PLAYING TTS (matching the State Machine diagram below; waiting for silence happens inside RECORDING).
- Silence detection using audio energy levels (3-second threshold).
- Keyword spotting during TTS playback via Apple Speech framework ("hold on").
- Hold-to-talk: detect key-down/key-up events (not just toggles).
- Send buffered audio to Python server for transcription.
- UI: mode selector in menubar, overlay updates per mode.

### Python server responsibilities

- Whisper transcription (unchanged).
- Kokoro TTS (unchanged).
- Auto-submit: press Enter after transcription (already implemented).
- Kill TTS on barge-in signal from Swift app (existing `kill_tts`).

### Communication

- Swift → Python: HTTP POST `/v1/audio/transcriptions` (existing).
- Swift → Python: signal to kill TTS (new — preferably a dedicated HTTP endpoint wrapping the existing `kill_tts`, since a direct `pkill` risks process-name collisions and killing unrelated processes).
- Python → Swift: transcription result in JSON response (existing).

## PTT Key

Same configurable key (Control, Option, Cmd, Fn) across all three modes. Already configurable in the app. Behavior changes based on selected mode:

| Mode | Key down | Key up | Silence (3s) |
|------|----------|--------|--------------|
| Press-to-talk | Toggle recording | Toggle recording | — |
| Hold-to-talk | Start recording | Stop + submit | — |
| Hands-free | Start tap window (no submit yet) | If tap (down + up within threshold) → instant submit | Auto-submit |

A tap is a key-down followed by a key-up within a short threshold (roughly 200-500 ms). In hands-free mode, presses held past the window do not submit, and key-repeat or duplicate key-down events are debounced so a single tap produces at most one submit.
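
The tap-versus-hold distinction can be sketched roughly like this (Python for brevity; the real logic would live in the Swift key-event handler, and the threshold and debounce values are illustrative):

```python
import time

TAP_THRESHOLD_S = 0.3   # illustrative; anything in ~200-500 ms is plausible
DEBOUNCE_S = 0.05       # ignore key bounce within this window

class TapDetector:
    """Classifies a key-down/key-up pair as a tap (submit) or a hold (ignore)."""
    def __init__(self):
        self._down_at = None
        self._last_up = 0.0

    def key_down(self, now=None):
        now = time.monotonic() if now is None else now
        # Key-repeat delivers extra key-down events while held; keep the first.
        if self._down_at is None:
            self._down_at = now

    def key_up(self, now=None) -> bool:
        """Return True exactly when this release completes a tap."""
        now = time.monotonic() if now is None else now
        down_at, self._down_at = self._down_at, None
        if down_at is None or now - self._last_up < DEBOUNCE_S:
            return False
        self._last_up = now
        return (now - down_at) <= TAP_THRESHOLD_S
```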

## TTS Latency

Target: sub-one-second from response to first audio.

Current breakdown (~1.5-2s):
- Hook startup (bash, jq, curl health check): ~300ms
- TTS generation (Kokoro): ~1000ms
- Sequencing overhead: ~200ms

Optimizations:
1. **Pre-warm Kokoro model** — generate a silent/dummy request on server start so the model and pipeline are loaded before the first real request.
2. **Skip health check in hook** — cache server status; if confirmed alive within the last 30s, skip the curl check to `/models`. If a request then fails or times out, force a fresh health check and retry once; if that also fails, surface a clear error and mark the server unhealthy instead of continuing on the stale cache.
3. **Reduce hook startup** — consider moving TTS request logic from bash into the Python server (eliminate shell/jq/curl overhead entirely).

Stretch goal: streaming TTS (start playback while still generating) — depends on mlx_audio support.
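
The cached-liveness check plus the retry fallback can be sketched as follows (Python for brevity; the hook itself is a bash script, and the helper names here are illustrative):

```python
import time

CACHE_TTL_S = 30.0

class HealthCache:
    """Skip the health check while a recent success is cached; on request
    failure, force a fresh check and retry once before giving up."""
    def __init__(self, check_health, send_request):
        self._check = check_health      # e.g. a curl to /models
        self._send = send_request       # e.g. the TTS POST
        self._last_ok = -float("inf")

    def _ensure_alive(self, now, force=False):
        if not force and now - self._last_ok < CACHE_TTL_S:
            return True                 # recent success: skip the check
        if self._check():
            self._last_ok = now
            return True
        return False

    def request(self, payload, now=None):
        now = time.monotonic() if now is None else now
        if not self._ensure_alive(now):
            raise RuntimeError("TTS server unhealthy")
        try:
            return self._send(payload)
        except Exception:
            # Cached liveness may be stale: re-check, then retry once.
            if self._ensure_alive(now, force=True):
                return self._send(payload)
            raise
```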

## Barge-in: Keyword Detection

Use Apple's Speech framework (`SFSpeechRecognizer`) for lightweight, on-device keyword spotting.

- Only active during TTS playback.
- Listens for the phrase "hold on."
- On detection: send kill signal to TTS, transition to recording state.
- Low resource usage — no Whisper inference needed, runs on Apple's built-in speech engine.
- Requires Speech Recognition authorization (`SFSpeechRecognizer.requestAuthorization`) and an available recognizer for the current locale. If permission is denied or the recognizer is unavailable, fall back to energy-spike barge-in (below) or disable barge-in with clear UI messaging.

Alternative (and fallback) if the Speech framework is too heavy or unavailable: use audio energy spike detection as a simpler barge-in trigger (any loud input during TTS = interrupt). Less precise but near-zero cost.

## Silence Detection

Use RMS (root mean square) energy of audio buffer frames.

- One-time baseline sampling on mode activation calibrates the threshold against the ambient noise level; the threshold is not re-tuned continuously during a session (see Out of Scope).
- Timer starts when energy drops below threshold.
- After 3 seconds of continuous silence, trigger transcription and submit.
- Any speech resets the timer.
- PTT key tap bypasses the timer for instant submit.
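
A minimal sketch of the RMS gate, the ambient calibration, and the 3-second timer (Python for brevity; the real code would live in the Swift audio path, and the margin factor is an assumption):

```python
import math

def rms(frame):
    """Root-mean-square energy of one buffer of samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

class SilenceDetector:
    def __init__(self, ambient_frames, margin=3.0, silence_s=3.0, frame_s=0.1):
        # One-time calibration: threshold sits a margin above ambient RMS.
        ambient = sum(rms(f) for f in ambient_frames) / len(ambient_frames)
        self.threshold = ambient * margin
        self.silence_s = silence_s
        self.frame_s = frame_s
        self._quiet = 0.0

    def feed(self, frame) -> bool:
        """Feed one frame; return True once 3 s of silence has elapsed."""
        if rms(frame) < self.threshold:
            self._quiet += self.frame_s
        else:
            self._quiet = 0.0            # any speech resets the timer
        return self._quiet >= self.silence_s

    def skip(self):
        """PTT tap: bypass the timer for instant submit."""
        self._quiet = self.silence_s
```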

## State Machine

```
┌─────────────────────────┐
│ IDLE │
│ (mic on in hands-free) │
└──────┬──────────────────┘
│ voice detected / key press
v
┌─────────────────────────┐
│ RECORDING │
│ (buffering audio) │
└──────┬──────────────────┘
│ silence 3s / key tap / key release
v
┌─────────────────────────┐
│ TRANSCRIBING │
│ (Whisper processing) │
└──────┬──────────────────┘
│ text submitted
v
┌─────────────────────────┐
│ WAITING FOR AGENT │
│ (mic muted) │
└──────┬──────────────────┘
│ TTS starts
v
┌─────────────────────────┐
│ PLAYING TTS │
│ (keyword spotting on) │
└──────┬──────────────────┘
│ TTS ends / "hold on" detected
v
back to IDLE
```
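
The diagram above can be restated as a transition table (a sketch; the event names are invented for illustration):

```python
# (state, event) -> next state, matching the diagram.
TRANSITIONS = {
    ("IDLE", "voice_detected"): "RECORDING",
    ("IDLE", "key_press"): "RECORDING",
    ("RECORDING", "silence_3s"): "TRANSCRIBING",
    ("RECORDING", "key_tap"): "TRANSCRIBING",
    ("RECORDING", "key_release"): "TRANSCRIBING",
    ("TRANSCRIBING", "text_submitted"): "WAITING_FOR_AGENT",
    ("WAITING_FOR_AGENT", "tts_started"): "PLAYING_TTS",
    # Both TTS exits return to IDLE; after barge-in, the new speech
    # immediately re-enters RECORDING via voice_detected.
    ("PLAYING_TTS", "tts_ended"): "IDLE",
    ("PLAYING_TTS", "hold_on_detected"): "IDLE",
}

def step(state: str, event: str) -> str:
    """Apply one event; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```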

## Out of Scope (v1)

- Full echo cancellation (mic open during TTS without keyword gating).
- Continuous/automatic silence threshold re-tuning (the one-time ambient baseline on mode activation is in scope).
- Multiple keyword phrases.
- Streaming TTS playback (stretch goal, depends on mlx_audio).
- Wake word to activate hands-free from sleep.

## Success Criteria

- Hold-to-talk works reliably: hold key, speak, release, text appears and submits.
- Hands-free works for a multi-turn conversation: speak, wait 3s or tap key, agent responds via TTS, speak again.
- Saying "hold on" during TTS stops playback and starts listening.
- TTS latency under 1 second for typical VOICE tag content.
- No echo: agent's TTS output is not transcribed as user input.
# Hands-Free Mode — Implementation Plan

Companion to [hands-free-mode-design.md](./2026-03-10-hands-free-mode-design.md).

## Prerequisites

`HotkeyManager`, `DictationManager`, and `AudioRecorder` are compiled into the app binary but their source is not in the repo. These classes handle hotkey capture, recording lifecycle, and audio engine management. The plan below works within that constraint — extending behavior through the existing callback pattern rather than modifying those classes directly.

If source becomes available, steps marked **(source-only)** can be done more cleanly.

## Phase 1: Hold-to-Talk

**Goal:** Hold PTT key to record, release to submit. Minimal change, high value.

### Step 1.1 — Add interaction mode enum and persistence

- Create `InteractionMode` enum: `.pressToTalk`, `.holdToTalk`, `.handsFree`
- Add `Paths.interactionMode` file path (e.g., `~/Library/Application Support/ClaudeWhisperer/interaction_mode`)
- Read/write mode on app launch and settings change.

**Files:** `Paths.swift`

### Step 1.2 — Add mode selector to menubar

- Add a `Picker` in `MenuBarView.swift` within the Push-to-Talk section.
- Three options: "Press to Talk", "Hold to Talk", "Hands-Free".
- Persist selection via the file from Step 1.1.
- Publish mode to `AppDelegate` via environment or notification.

**Files:** `MenuBarView.swift`

### Step 1.3 — Wire key-up callback in AppDelegate

- `HotkeyManager` currently exposes `onKeyDown` and `onToggle`.
- For hold-to-talk, we need `onKeyUp` (or repurpose `onToggle` based on mode).
- **If source available (source-only):** add `onKeyUp` callback to `HotkeyManager`.
- **Without source:** `onToggle` fires on both press and release in toggle mode. In hold-to-talk mode, `AppDelegate` can treat alternating toggle events as press (start recording) and release (stop → submit), dropping any duplicate events that arrive within a short debounce window.
- On release in hold-to-talk mode: call `dictationManager.toggle()` to stop recording, then trigger transcription + submit.

**Files:** `AppDelegate.swift`
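
One way the without-source workaround could look (Python for brevity; this logic would live in `AppDelegate.swift`, applies only when the mode is hold-to-talk, and the debounce value is illustrative):

```python
import time

HOLD_MIN_S = 0.15   # gaps shorter than this are treated as key bounce

class HoldToTalkBridge:
    """Maps paired onToggle events (press, then release) onto
    start-recording and stop-plus-submit actions."""
    def __init__(self, start_recording, stop_and_submit):
        self._start = start_recording
        self._stop = stop_and_submit
        self._pressed_at = None

    def on_toggle(self, now=None):
        now = time.monotonic() if now is None else now
        if self._pressed_at is None:
            self._pressed_at = now       # first toggle: key went down
            self._start()
        elif now - self._pressed_at >= HOLD_MIN_S:
            self._pressed_at = None      # second toggle: key released
            self._stop()
```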

### Step 1.4 — Auto-submit on transcription complete

- Already implemented: `press_enter()` in `unified_server.py` fires when `auto_submit` flag exists.
- Verify hold-to-talk triggers the same transcription → submit flow.
- The Swift app's `DictationManager` posts audio to `/v1/audio/transcriptions` and pastes the result. The server's `press_enter()` then submits it.

**Files:** Verification only, no changes expected.

### Step 1.5 — Update overlay status text

- In `TranscriptionOverlay.swift`, update `statusText` to reflect mode:
- Hold-to-talk recording: "Recording... release to submit"
- Hold-to-talk idle: "Hold [key] to talk"

**Files:** `TranscriptionOverlay.swift`

---

## Phase 2: TTS Latency

**Goal:** Sub-one-second from response to first audio.

### Step 2.1 — Pre-warm Kokoro model

- On server startup, fire a dummy TTS request (empty or single-word) to load the model and pipeline into memory.
- This eliminates the "Fetching 63 files" and "Creating new KokoroPipeline" delay on first real request.

**Files:** `servers/unified_server.py` (add `@app.on_event("startup")` handler)
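
The pre-warm step can be sketched as follows (a stand-in engine replaces the real Kokoro pipeline so the snippet is self-contained; in the actual server this would run inside the startup handler):

```python
import time

class TTSEngine:
    """Stand-in for the Kokoro pipeline: the first call pays a one-time
    load cost, mimicking the 'Creating new KokoroPipeline' delay."""
    def __init__(self):
        self._loaded = False

    def synthesize(self, text: str) -> bytes:
        if not self._loaded:
            time.sleep(0.05)             # model + pipeline load, paid once
            self._loaded = True
        return b"\x00" * max(len(text), 1)

def prewarm(engine: TTSEngine) -> float:
    """Fire a throwaway request at startup; returns the cold-start cost
    so it can be logged."""
    t0 = time.perf_counter()
    engine.synthesize("warmup")
    return time.perf_counter() - t0
```

After `prewarm` runs at server start, the first real request no longer pays the load cost.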

### Step 2.2 — Cache server health in TTS hook

- Currently `tts-hook.sh` curls `/models` on every invocation (~150ms).
- Write a timestamp file on successful health check. Skip the check if timestamp is less than 30 seconds old.

**Files:** `hooks/tts-hook.sh`

### Step 2.3 — Reduce hook shell overhead

- The hook spawns bash, loads jq, builds JSON, runs curl — ~300ms before the TTS request even fires.
- Alternative: add a `/v1/audio/speak` endpoint to `unified_server.py` that accepts text directly (no audio file upload) and returns audio. The hook becomes a single curl POST.
- Or: have the server itself watch for response completion and generate TTS proactively (eliminates the hook entirely for hands-free mode).

**Files:** `servers/unified_server.py`, `hooks/tts-hook.sh`

### Step 2.4 — Measure and validate

- Add timing logs at each stage (hook start, TTS request sent, audio received, playback started).
- Benchmark against the 1-second target.

**Files:** `hooks/tts-hook.sh`, `servers/unified_server.py`

---

## Phase 3: Hands-Free Mode

**Goal:** Continuous mic, silence-based submit, barge-in with "hold on."

### Step 3.1 — Continuous mic capture

- In hands-free mode, `AudioRecorder` should start on mode activation and stay running.
- **Without source:** We can toggle the recorder on via `dictationManager.toggle()` when entering hands-free mode, and keep it running across transcription cycles.
- **With source (source-only):** Add a `startContinuous()` method to `AudioRecorder` that captures without the stop-on-toggle behavior.

**Files:** `AppDelegate.swift` (or `AudioRecorder.swift` if source available)

### Step 3.2 — Silence detection

- Use `AudioRecorder.levelHistory` (already tracks audio energy for the waveform) to detect silence.
- Add a timer in `AppDelegate` or a new `HandsFreeController`:
- Sample audio level every 100ms.
- If level stays below threshold for 3 seconds, trigger transcription + submit.
- Any spike above threshold resets the timer.
- Calibrate threshold: sample ambient level for 1 second when entering hands-free mode.

**Files:** New `HandsFreeController.swift` (or inline in `AppDelegate.swift`)

### Step 3.3 — Mic muting during TTS

- When TTS playback starts (detected via `tts_playing.lock` file, already polled in `TranscriptionOverlay`), pause mic capture.
- When TTS ends (lock file removed), resume mic capture.
- This prevents Whisper from transcribing the agent's voice.

**Files:** `AppDelegate.swift` or `HandsFreeController.swift`

### Step 3.4 — Barge-in keyword detection

- Use `SFSpeechRecognizer` (Apple Speech framework) for lightweight on-device keyword spotting.
- Only active during TTS playback (mic is otherwise muted, but keyword spotter listens on a separate tap).
- On detecting "hold on":
  1. Kill TTS playback, preferably via the server's kill endpoint (a direct `pkill afplay` works as a fallback but risks killing unrelated processes).
2. Remove `tts_playing.lock`.
3. Unmute mic, transition to recording state.

**Implementation notes:**
- `SFSpeechRecognizer` works with `SFSpeechAudioBufferRecognitionRequest` for live audio.
- Set `shouldReportPartialResults = true` to detect "hold on" as soon as it's spoken.
- Requires Speech Recognition permission (add to Info.plist: `NSSpeechRecognitionUsageDescription`).

**Files:** New `KeywordSpotter.swift`, `Info.plist`
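
The matching on partial results can be sketched in Python (illustrative only; in the app this would consume partial transcripts from `SFSpeechAudioBufferRecognitionRequest`):

```python
import re

KEYWORD = "hold on"

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'Hold on!' still matches."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z\s]", " ", text.lower())).strip()

def keyword_heard(partial_transcript: str) -> bool:
    """True once the keyword appears as whole words in the running transcript."""
    return re.search(rf"\b{KEYWORD}\b", normalize(partial_transcript)) is not None
```

With `shouldReportPartialResults` enabled, this check runs on every partial update, so the barge-in fires as soon as the phrase is spoken rather than at the end of an utterance.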

### Step 3.5 — PTT key as instant submit in hands-free

- In hands-free mode, tapping the PTT key should:
1. Cancel the silence timer.
2. Immediately stop recording and trigger transcription + submit.
- Same as press-to-talk "stop" behavior, just without needing a "start" press first (mic is already on).

**Files:** `AppDelegate.swift`

### Step 3.6 — Update overlay for hands-free

- New status states:
- "Listening..." (idle, mic on, waiting for speech)
- "Recording..." (speech detected, buffering)
- "Submitting..." (silence detected or key tapped)
- "Agent thinking..." (waiting for response, mic muted)
- "Agent speaking..." (TTS playing, keyword spotter active)
- Show silence countdown in the last 1-2 seconds before auto-submit.

**Files:** `TranscriptionOverlay.swift`

### Step 3.7 — Mode selector wiring

- Connect the mode selector from Step 1.2 to the hands-free controller.
- On mode change:
- `pressToTalk` → stop continuous mic, use toggle behavior.
- `holdToTalk` → stop continuous mic, use hold behavior.
- `handsFree` → start continuous mic, activate silence detection.
- Persist across app restart.

**Files:** `AppDelegate.swift`, `MenuBarView.swift`

---

## Phase 4: Polish and Edge Cases

### Step 4.1 — Graceful mode transitions

- Switching modes while recording: stop current recording, discard audio, enter new mode cleanly.
- Switching modes while TTS playing: let TTS finish, then enter new mode.

### Step 4.2 — Error recovery

- If server crashes during hands-free mode, detect via health check failure, show error in overlay, pause mic until server recovers.
- If keyword spotter fails to initialize (permission denied), fall back to key-only barge-in.

### Step 4.3 — Power and resource management

- Hands-free mode keeps mic and possibly Speech framework active continuously.
- Add auto-sleep: if no interaction for N minutes, pause mic and show "Paused" in overlay. Any keypress resumes.
- Monitor energy impact and optimize polling intervals.

### Step 4.4 — Testing

- Manual test matrix: each mode × each action (start, stop, submit, barge-in, mode switch).
- Edge cases: rapid mode switching, server restart during recording, very long silence, very short utterances.

---

## Dependency Graph

```
Phase 1 (Hold-to-Talk)
1.1 → 1.2 → 1.3 → 1.4 → 1.5

Phase 2 (TTS Latency) — independent of Phase 1
2.1 → 2.2 → 2.3 → 2.4

Phase 3 (Hands-Free) — depends on Phase 1
1.5 → 3.1 → 3.2 → 3.3 → 3.4 → 3.5 → 3.6 → 3.7

Phase 4 (Polish) — depends on Phase 3
3.7 → 4.1 → 4.2 → 4.3 → 4.4
```

Phases 1 and 2 can be worked on in parallel.

## Open Questions

1. **HotkeyManager source** — Is source available? Hold-to-talk is much cleaner with a real `onKeyUp` callback. Without it, we rely on timing heuristics.
2. **AudioRecorder source** — Continuous capture for hands-free ideally needs a mode where the recorder doesn't stop on toggle. Without source, we work around it.
3. **SFSpeechRecognizer resource usage** — Need to benchmark. If too heavy for continuous keyword spotting, fall back to energy-spike detection for barge-in.
4. **Silence threshold** — 3 seconds is the starting point. May need user-configurable setting after testing.