
Streaming voice input with trigger word to auto-send #284

@brendanlong


Summary

Replace the current push-to-talk voice input (record → stop → transcribe → paste) with continuous streaming transcription using the OpenAI Realtime API. The user taps mic to start, speaks freely, and sees their transcript appear in real-time. When they say a configurable trigger word (default: "Over.") the accumulated transcript is automatically submitted to Claude. Without a trigger word configured, the user taps mic again to stop and sends manually — same flow as today but with live transcript preview.

Motivation

Currently voice input requires: tap to start → speak → tap to stop → wait for transcription → manually send. This is clunky on mobile. Streaming transcription with a trigger word enables fully hands-free interaction — the user just speaks and says "Over." to send.

Even without a trigger word, streaming is better than the current batch approach because users see their transcript building in real-time instead of waiting after stop.

Technical Plan

Architecture: Browser-Direct WebSocket

The browser connects directly to OpenAI's Realtime API using an ephemeral token generated by the backend. This avoids relaying audio through our server.

Browser                         Backend                    OpenAI
  │                               │                          │
  │── POST /api/voice/rt-token ──►│                          │
  │                               │── transcriptionSessions  │
  │                               │   .create() ────────────►│
  │◄── { client_secret } ─────────│◄── { client_secret } ────│
  │                               │                          │
  │── WebSocket (ephemeral key) ────────────────────────────►│
  │── input_audio_buffer.append ────────────────────────────►│
  │◄─ conversation.item.input_audio_transcription.completed ─│
  │   (transcript after each pause)                          │
  │                                                          │
  │ [detects trigger word → auto-submit to Claude]           │

Implementation Steps

1. Backend: Ephemeral Token Endpoint

New file: src/app/api/voice/realtime-token/route.ts

  • POST /api/voice/realtime-token — authenticated endpoint
  • Uses openai.beta.realtime.transcriptionSessions.create() with:
    • input_audio_format: 'pcm16'
    • input_audio_transcription: { model: 'gpt-4o-transcribe', language: 'en' }
    • turn_detection: { type: 'semantic_vad' } (or server_vad — needs testing)
    • client_secret: { expires_after: { anchor: 'created_at', seconds: 120 } }
  • Returns { client_secret: string, expires_at: number }
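A rough sketch of the handler body, written against the REST form of session creation rather than the SDK helper. The URL, payload fields, and response shape here are assumptions to verify against the current Realtime API docs:

```typescript
// Session settings from the plan above. Field names follow the issue text;
// verify against the current transcription-session schema.
const SESSION_CONFIG = {
  input_audio_format: "pcm16",
  input_audio_transcription: { model: "gpt-4o-transcribe", language: "en" },
  turn_detection: { type: "semantic_vad" }, // or "server_vad"; needs testing
  client_secret: { expires_after: { anchor: "created_at", seconds: 120 } },
};

// Core of the POST handler: trade our server-side API key for a short-lived
// client secret the browser can use directly. The real key never reaches the client.
async function createEphemeralToken(
  apiKey: string
): Promise<{ client_secret: string; expires_at: number }> {
  const res = await fetch("https://api.openai.com/v1/realtime/transcription_sessions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify(SESSION_CONFIG),
  });
  if (!res.ok) throw new Error(`token request failed: ${res.status}`);
  const session = await res.json();
  // Assumed response shape: { client_secret: { value, expires_at }, ... }
  return {
    client_secret: session.client_secret.value,
    expires_at: session.client_secret.expires_at,
  };
}
```

The actual `route.ts` would wrap this in the project's authenticated POST handler.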

2. Browser: Audio Capture as PCM16

New file: src/lib/pcm-audio-worklet.ts (AudioWorklet processor)

The Realtime API requires PCM16 at 24 kHz; MediaRecorder outputs webm/opus, which the API won't accept. Instead:

  • Use navigator.mediaDevices.getUserMedia({ audio: true })
  • Create AudioContext with sampleRate: 24000
  • Use an AudioWorkletNode to capture raw PCM16 samples
  • Send samples to the WebSocket via input_audio_buffer.append
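The core of the worklet's job is the sample-format conversion: Web Audio hands back Float32 samples in [-1, 1], and the API wants signed 16-bit integers. A sketch of that conversion (the worklet would post these buffers out for sending):

```typescript
// Convert Web Audio Float32 samples ([-1, 1]) to 16-bit signed PCM,
// the format the Realtime API expects for input_audio_buffer.append.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp to avoid wraparound
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;      // scale asymmetrically to int16 range
  }
  return out;
}
```

The resulting buffer is then base64-encoded and sent as `{ type: "input_audio_buffer.append", audio: <base64> }`.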

3. Browser: WebSocket Management

Replace hook: src/hooks/useVoiceRecording.ts

  • Fetches ephemeral token from backend
  • Opens WebSocket to wss://api.openai.com/v1/realtime?intent=transcription
  • Sends transcription_session.update to configure the session
  • Streams PCM16 audio chunks via input_audio_buffer.append
  • Listens for conversation.item.input_audio_transcription.completed events
  • Accumulates transcript text across multiple utterances
  • Provides live transcript preview to the UI
  • Checks for trigger word at end of each transcript completion
  • On trigger: stops recording, submits accumulated text (minus trigger word) to Claude
  • Without trigger word: user taps mic to stop, transcript is placed in input box for manual send
  • Handles token expiry (re-fetch if needed), errors, and cleanup
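The transcript-accumulation step can stay a pure function, which keeps it testable apart from the WebSocket plumbing. A sketch, assuming each completed utterance arrives with a `transcript` field on the completion event named above:

```typescript
interface TranscriptState {
  committed: string; // text accumulated across completed utterances
}

// Fold one parsed server event into the transcript state. Events other than
// transcription completions pass through unchanged.
function reduceTranscript(
  state: TranscriptState,
  event: { type: string; transcript?: string }
): TranscriptState {
  if (event.type !== "conversation.item.input_audio_transcription.completed") return state;
  const utterance = (event.transcript ?? "").trim();
  if (!utterance) return state;
  return {
    committed: state.committed ? `${state.committed} ${utterance}` : utterance,
  };
}
```

The hook would call this from `ws.onmessage` with the JSON-parsed event, update the live preview from `committed`, then run trigger-word detection on the new value.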

4. Trigger Word Detection

  • Configurable trigger word, default: "Over."
  • Stored in global settings (new field on GlobalSettings model)
  • Detection: after each transcript completion, check if the accumulated text ends with the trigger phrase as a standalone phrase (case-insensitive, with some fuzzy matching for punctuation variations like "over", "Over.", "over.")
  • "Standalone" means it's its own sentence/utterance — "it's over" should NOT trigger, but "fix the bug. Over." should
  • When triggered: strip the trigger word from the text and auto-submit
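One plausible implementation of the standalone check (the exact matching rules are still an open question below): accept the trigger only when it ends the text and is preceded by nothing or by sentence-ending punctuation, ignoring case and trailing punctuation on the trigger itself:

```typescript
// Returns whether `text` ends with `trigger` as a standalone utterance, plus
// the text with the trigger stripped. "it's over" must not trigger;
// "fix the bug. Over." must.
function detectTrigger(
  text: string,
  trigger: string
): { triggered: boolean; cleaned: string } {
  const word = trigger.trim().replace(/[.!?,]+$/, ""); // "Over." -> "Over"
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Standalone = at the start of the text, or preceded by sentence-ending punctuation.
  const re = new RegExp(`(^|[.!?]\\s*)${escaped}[.!?]*\\s*$`, "i");
  const m = re.exec(text);
  if (m === null) return { triggered: false, cleaned: text };
  const prefixLen = m.index + (m[1] ?? "").length;
  return { triggered: true, cleaned: text.slice(0, prefixLen).trim() };
}
```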

5. UI Changes

Update: src/components/voice/VoiceMicButton.tsx

  • Replace current push-to-talk with streaming mode
  • When recording:
    • Mic button shows "listening" state (pulsing animation)
    • Live transcript preview appears in/above the input box
    • Each completed utterance appends to the preview
    • Trigger word auto-submits (if configured)
    • User can tap mic to stop recording
      • With trigger word configured: stops without sending (cancel)
      • Without trigger word: places transcript in input box

Update: Settings UI to add trigger word configuration

6. Settings: Trigger Word

Schema change: Add voiceTriggerWord field to GlobalSettings model (nullable string, default null)

  • When null/empty: streaming voice still works, but user must tap mic to stop and send manually
  • When set: saying the trigger word auto-submits
  • UI: text input in Settings → Voice section

7. Cleanup

  • Remove old POST /api/voice/transcribe endpoint
  • Remove old MediaRecorder-based recording logic from useVoiceRecording.ts
  • Remove gpt-4o-mini-transcribe usage from voice.ts service (keep TTS)

Key Design Decisions

  1. Single mode, not two — Streaming replaces push-to-talk entirely. Without a trigger word it behaves the same (tap to start, tap to stop) but with live preview. No mode toggle needed.

  2. Browser-direct WebSocket — Audio never touches our server. Backend only generates ephemeral tokens. Lower latency, simpler server.

  3. semantic_vad vs server_vad — semantic_vad uses ML to detect natural speech boundaries (smarter than volume-based server_vad). Should test both; semantic_vad may work better at recognizing "Over." as a distinct utterance. Fall back to server_vad if semantic_vad proves too slow.

  4. PCM16 via AudioWorklet — Required by the Realtime API. AudioWorklet is well-supported in modern browsers (Safari 14.5+, Chrome Android). The worklet runs off-main-thread so it won't impact UI performance.

Open Questions

  • Mobile browser AudioWorklet support — AudioWorklet is supported in Safari 14.5+ and Chrome Android. Need to verify it works well on the target mobile browsers.
  • Token refresh — Transcription sessions have a 30-minute limit. For long conversations, we may need to transparently reconnect.
  • Cost — Realtime API charges per minute of audio. Should we show a warning or auto-stop after some duration?
  • Trigger word variations — How fuzzy should matching be? "over", "Over.", "over.", "Over" should all work. What about "it's over" (shouldn't trigger)?
