Summary
Replace the current push-to-talk voice input (record → stop → transcribe → paste) with continuous streaming transcription using the OpenAI Realtime API. The user taps mic to start, speaks freely, and sees their transcript appear in real-time. When they say a configurable trigger word (default: "Over.") the accumulated transcript is automatically submitted to Claude. Without a trigger word configured, the user taps mic again to stop and sends manually — same flow as today but with live transcript preview.
Motivation
Currently voice input requires: tap to start → speak → tap to stop → wait for transcription → manually send. This is clunky on mobile. Streaming transcription with a trigger word enables fully hands-free interaction — the user just speaks and says "Over." to send.
Even without a trigger word, streaming is better than the current batch approach because users see their transcript building in real-time instead of waiting after stop.
Technical Plan
Architecture: Browser-Direct WebSocket
The browser connects directly to OpenAI's Realtime API using an ephemeral token generated by the backend. This avoids relaying audio through our server.
```
Browser                          Backend                        OpenAI
   │                                │                              │
   │── POST /api/voice/rt-token ───►│                              │
   │                                │── transcriptionSessions      │
   │                                │   .create() ────────────────►│
   │◄── { client_secret } ──────────│◄── { client_secret } ────────│
   │                                │                              │
   │── WebSocket (ephemeral key) ────────────────────────────────►│
   │── input_audio_buffer.append ────────────────────────────────►│
   │◄── conversation.item.input_audio_transcription.completed ───│
   │    (transcript after each pause)                             │
   │                                                              │
   │    [detects trigger word → auto-submit to Claude]            │
```
Implementation Steps
1. Backend: Ephemeral Token Endpoint
New file: src/app/api/voice/realtime-token/route.ts
POST /api/voice/realtime-token — authenticated endpoint
- Uses `openai.beta.realtime.transcriptionSessions.create()` with:
  - `input_audio_format: 'pcm16'`
  - `input_audio_transcription: { model: 'gpt-4o-transcribe', language: 'en' }`
  - `turn_detection: { type: 'semantic_vad' }` (or `server_vad` — needs testing)
  - `client_secret: { expires_after: { anchor: 'created_at', seconds: 120 } }`
- Returns `{ client_secret: string, expires_at: number }`
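A minimal sketch of the endpoint's core logic. This version calls the Realtime REST endpoint directly with `fetch` rather than the SDK helper named above; the URL, response shape, and function name are assumptions to verify against the OpenAI API reference, and the real route would also run the app's auth middleware:

```typescript
// Session configuration mirroring the plan above.
const transcriptionSessionConfig = {
  input_audio_format: "pcm16",
  input_audio_transcription: { model: "gpt-4o-transcribe", language: "en" },
  turn_detection: { type: "semantic_vad" }, // or "server_vad" — needs testing
  client_secret: { expires_after: { anchor: "created_at", seconds: 120 } },
};

// Creates a transcription session server-side and returns only the short-lived
// client secret to the browser (assumed REST endpoint; verify against docs).
async function createEphemeralToken(
  apiKey: string
): Promise<{ client_secret: string; expires_at: number }> {
  const res = await fetch(
    "https://api.openai.com/v1/realtime/transcription_sessions",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(transcriptionSessionConfig),
    }
  );
  if (!res.ok) throw new Error(`Token request failed: ${res.status}`);
  const session = await res.json();
  // Never return the raw API key — only the ephemeral secret.
  return {
    client_secret: session.client_secret.value,
    expires_at: session.client_secret.expires_at,
  };
}
```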
2. Browser: Audio Capture as PCM16
New file: src/lib/pcm-audio-worklet.ts (AudioWorklet processor)
The Realtime API requires PCM16 at 24kHz. MediaRecorder outputs webm/opus which won't work. Instead:
- Use `navigator.mediaDevices.getUserMedia({ audio: true })`
- Create an `AudioContext` with `sampleRate: 24000`
- Use an `AudioWorkletNode` to capture raw PCM16 samples
- Send samples to the WebSocket via `input_audio_buffer.append`
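The worklet's core job reduces to a pure conversion from the Float32 samples the Web Audio API provides to little-endian PCM16. A sketch (function name is illustrative):

```typescript
// Convert Web Audio Float32 samples (range [-1, 1]) to 16-bit little-endian
// PCM, the format the Realtime API expects for input_audio_buffer.append.
function floatTo16BitPCM(input: Float32Array): ArrayBuffer {
  const view = new DataView(new ArrayBuffer(input.length * 2));
  for (let i = 0; i < input.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, input[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return view.buffer;
}
```

Inside the worklet's `process()` callback, each 128-sample render quantum would be converted this way and posted to the main thread for sending.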
3. Browser: WebSocket Management
Replace hook: src/hooks/useVoiceRecording.ts
- Fetches ephemeral token from backend
- Opens WebSocket to `wss://api.openai.com/v1/realtime?intent=transcription`
- Sends `transcription_session.update` to configure the session
- Streams PCM16 audio chunks via `input_audio_buffer.append`
- Listens for `conversation.item.input_audio_transcription.completed` events
- Accumulates transcript text across multiple utterances
- Provides live transcript preview to the UI
- Checks for trigger word at end of each transcript completion
- On trigger: stops recording, submits accumulated text (minus trigger word) to Claude
- Without trigger word: user taps mic to stop, transcript is placed in input box for manual send
- Handles token expiry (re-fetch if needed), errors, and cleanup
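The hook's wire protocol can be sketched as two message builders. Event and field names follow this plan; treat the exact shapes as assumptions to verify against the Realtime API reference:

```typescript
// Sent once after connecting, to configure the transcription session.
function sessionUpdateMessage(language = "en"): string {
  return JSON.stringify({
    type: "transcription_session.update",
    session: {
      input_audio_format: "pcm16",
      input_audio_transcription: { model: "gpt-4o-transcribe", language },
      turn_detection: { type: "semantic_vad" },
    },
  });
}

// Each PCM16 chunk is base64-encoded and appended to the input audio buffer.
function audioAppendMessage(pcm16: ArrayBuffer): string {
  const bytes = new Uint8Array(pcm16);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return JSON.stringify({ type: "input_audio_buffer.append", audio: btoa(binary) });
}
```

The hook would call `ws.send(sessionUpdateMessage())` on open, then `ws.send(audioAppendMessage(chunk))` for each chunk posted by the worklet.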
4. Trigger Word Detection
- Configurable trigger word, default: `"Over."`
- Stored in global settings (new field on the `GlobalSettings` model)
- Detection: after each transcript completion, check if the accumulated text ends with the trigger phrase as a standalone phrase (case-insensitive, with some fuzzy matching for punctuation variations like "over", "Over.", "over.")
- "Standalone" means it's its own sentence/utterance — "it's over" should NOT trigger, but "fix the bug. Over." should
- When triggered: strip the trigger word from the text and auto-submit
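The standalone-phrase check might look like this sketch (function names are illustrative; real logic would live in the hook):

```typescript
// True when the transcript ends with the trigger as its own utterance:
// it must be preceded by sentence-ending punctuation (or be the whole
// transcript), so "fix the bug. Over." matches but "it's over" does not.
function endsWithTrigger(transcript: string, trigger: string): boolean {
  const word = trigger.replace(/[.!?,\s]+$/g, "").trim();
  if (!word) return false;
  const pattern = new RegExp(
    `(?:^|[.!?])\\s*${escapeRegExp(word)}[.!?]?\\s*$`,
    "i"
  );
  return pattern.test(transcript);
}

// Strip the trailing trigger phrase before submitting to Claude.
function stripTrigger(transcript: string, trigger: string): string {
  const word = trigger.replace(/[.!?,\s]+$/g, "").trim();
  const pattern = new RegExp(`\\s*${escapeRegExp(word)}[.!?]?\\s*$`, "i");
  return transcript.replace(pattern, "").trim();
}

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}
```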
5. UI Changes
Update: src/components/voice/VoiceMicButton.tsx
- Replace current push-to-talk with streaming mode
- When recording:
- Mic button shows "listening" state (pulsing animation)
- Live transcript preview appears in/above the input box
- Each completed utterance appends to the preview
- Trigger word auto-submits (if configured)
- User can tap mic to stop recording
- With trigger word configured: stops without sending (cancel)
- Without trigger word: places transcript in input box
Update: Settings UI to add trigger word configuration
6. Settings: Trigger Word
Schema change: Add voiceTriggerWord field to GlobalSettings model (nullable string, default null)
- When null/empty: streaming voice still works, but user must tap mic to stop and send manually
- When set: saying the trigger word auto-submits
- UI: text input in Settings → Voice section
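Assuming the schema lives in Prisma (an assumption based on the `GlobalSettings` model naming), the change is a single nullable field:

```prisma
model GlobalSettings {
  // ...existing fields unchanged...
  voiceTriggerWord String? // null/empty = tap-to-stop, manual send
}
```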
7. Cleanup
- Remove old `POST /api/voice/transcribe` endpoint
- Remove old MediaRecorder-based recording logic from `useVoiceRecording.ts`
- Remove `gpt-4o-mini-transcribe` usage from `voice.ts` service (keep TTS)
Key Design Decisions
- Single mode, not two — Streaming replaces push-to-talk entirely. Without a trigger word it behaves the same (tap to start, tap to stop) but with live preview. No mode toggle needed.
- Browser-direct WebSocket — Audio never touches our server. Backend only generates ephemeral tokens. Lower latency, simpler server.
- `semantic_vad` vs `server_vad` — `semantic_vad` uses ML to detect natural speech boundaries (smarter than volume-based). Should test both; `semantic_vad` may work better for detecting "Over." as a distinct utterance. Fall back to `server_vad` if semantic is too slow.
- PCM16 via AudioWorklet — Required by the Realtime API. AudioWorklet is well-supported in modern browsers (Safari 14.5+, Chrome Android). The worklet runs off the main thread, so it won't impact UI performance.
Open Questions
- Mobile browser AudioWorklet support — AudioWorklet is supported in Safari 14.5+ and Chrome Android. Need to verify it works well on the target mobile browsers.
- Token refresh — Transcription sessions have a 30-minute limit. For long conversations, we may need to transparently reconnect.
- Cost — Realtime API charges per minute of audio. Should we show a warning or auto-stop after some duration?
- Trigger word variations — How fuzzy should matching be? "over", "Over.", "over.", "Over" should all work. What about "it's over" (shouldn't trigger)?
References
- src/hooks/useVoiceRecording.ts
- src/components/voice/VoiceMicButton.tsx
- src/server/services/voice.ts