
Streaming voice input with trigger word to auto-send #284

@brendanlong


Summary

Replace the current push-to-talk voice input (record → stop → transcribe → paste) with continuous streaming transcription using the OpenAI Realtime API. The user taps mic to start, speaks freely, and sees their transcript appear in real-time. When they say a configurable trigger word (default: "Over.") the accumulated transcript is automatically submitted to Claude. Without a trigger word configured, the user taps mic again to stop and sends manually — same flow as today but with live transcript preview.

Motivation

Currently voice input requires: tap to start → speak → tap to stop → wait for transcription → manually send. This is clunky on mobile. Streaming transcription with a trigger word enables fully hands-free interaction — the user just speaks and says "Over." to send.

Even without a trigger word, streaming is better than the current batch approach because users see their transcript building in real-time instead of waiting after stop.

Technical Plan

Architecture: Browser-Direct WebSocket

The browser connects directly to OpenAI's Realtime API using an ephemeral token generated by the backend. This avoids relaying audio through our server.

Browser                         Backend                    OpenAI
  │                               │                          │
  │── POST /api/voice/rt-token ──►│                          │
  │                               │── transcriptionSessions  │
  │                               │   .create() ────────────►│
  │◄── { client_secret } ─────────│◄── { client_secret } ────│
  │                               │                          │
  │── WebSocket (ephemeral key) ────────────────────────────►│
  │── input_audio_buffer.append ────────────────────────────►│
  │◄─ conversation.item.input_audio_transcription.completed ─│
  │   (transcript after each pause)                          │
  │                                                          │
  │ [detects trigger word → auto-submit to Claude]           │

Implementation Steps

1. Backend: Ephemeral Token Endpoint

New file: src/app/api/voice/realtime-token/route.ts

  • POST /api/voice/realtime-token — authenticated endpoint
  • Uses openai.beta.realtime.transcriptionSessions.create() with:
    • input_audio_format: 'pcm16'
    • input_audio_transcription: { model: 'gpt-4o-transcribe', language: 'en' }
    • turn_detection: { type: 'semantic_vad' } (or server_vad — needs testing)
    • client_secret: { expires_after: { anchor: 'created_at', seconds: 120 } }
  • Returns { client_secret: string, expires_at: number }
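A rough sketch of the handler body, written against the REST form of session creation rather than the SDK helper. The URL, payload fields, and response shape here are assumptions to verify against the current Realtime API docs:

```typescript
// Session settings from the plan above. Field names follow the issue text;
// verify against the current transcription-session schema.
const SESSION_CONFIG = {
  input_audio_format: "pcm16",
  input_audio_transcription: { model: "gpt-4o-transcribe", language: "en" },
  turn_detection: { type: "semantic_vad" }, // or "server_vad"; needs testing
  client_secret: { expires_after: { anchor: "created_at", seconds: 120 } },
};

// Core of the POST handler: trade our server-side API key for a short-lived
// client secret the browser can use directly. The real key never reaches the client.
async function createEphemeralToken(
  apiKey: string
): Promise<{ client_secret: string; expires_at: number }> {
  const res = await fetch("https://api.openai.com/v1/realtime/transcription_sessions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify(SESSION_CONFIG),
  });
  if (!res.ok) throw new Error(`token request failed: ${res.status}`);
  const session = await res.json();
  // Assumed response shape: { client_secret: { value, expires_at }, ... }
  return {
    client_secret: session.client_secret.value,
    expires_at: session.client_secret.expires_at,
  };
}
```

The actual `route.ts` would wrap this in the project's authenticated POST handler.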

2. Browser: Audio Capture as PCM16

New file: src/lib/pcm-audio-worklet.ts (AudioWorklet processor)

The Realtime API requires PCM16 at 24 kHz; MediaRecorder outputs webm/opus, which the API won't accept. Instead:

  • Use navigator.mediaDevices.getUserMedia({ audio: true })
  • Create AudioContext with sampleRate: 24000
  • Use an AudioWorkletNode to capture raw PCM16 samples
  • Send samples to the WebSocket via input_audio_buffer.append
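The core of the worklet's job is the sample-format conversion: Web Audio hands back Float32 samples in [-1, 1], and the API wants signed 16-bit integers. A sketch of that conversion (the worklet would post these buffers out for sending):

```typescript
// Convert Web Audio Float32 samples ([-1, 1]) to 16-bit signed PCM,
// the format the Realtime API expects for input_audio_buffer.append.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp to avoid wraparound
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;      // scale asymmetrically to int16 range
  }
  return out;
}
```

The resulting buffer is then base64-encoded and sent as `{ type: "input_audio_buffer.append", audio: <base64> }`.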

3. Browser: WebSocket Management

Replace hook: src/hooks/useVoiceRecording.ts

  • Fetches ephemeral token from backend
  • Opens WebSocket to wss://api.openai.com/v1/realtime?intent=transcription
  • Sends transcription_session.update to configure the session
  • Streams PCM16 audio chunks via input_audio_buffer.append
  • Listens for conversation.item.input_audio_transcription.completed events
  • Accumulates transcript text across multiple utterances
  • Provides live transcript preview to the UI
  • Checks for trigger word at end of each transcript completion
  • On trigger: stops recording, submits accumulated text (minus trigger word) to Claude
  • Without trigger word: user taps mic to stop, transcript is placed in input box for manual send
  • Handles token expiry (re-fetch if needed), errors, and cleanup
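The transcript-accumulation step can stay a pure function, which keeps it testable apart from the WebSocket plumbing. A sketch, assuming each completed utterance arrives with a `transcript` field on the completion event named above:

```typescript
interface TranscriptState {
  committed: string; // text accumulated across completed utterances
}

// Fold one parsed server event into the transcript state. Events other than
// transcription completions pass through unchanged.
function reduceTranscript(
  state: TranscriptState,
  event: { type: string; transcript?: string }
): TranscriptState {
  if (event.type !== "conversation.item.input_audio_transcription.completed") return state;
  const utterance = (event.transcript ?? "").trim();
  if (!utterance) return state;
  return {
    committed: state.committed ? `${state.committed} ${utterance}` : utterance,
  };
}
```

The hook would call this from `ws.onmessage` with the JSON-parsed event, update the live preview from `committed`, then run trigger-word detection on the new value.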

4. Trigger Word Detection

  • Configurable trigger word, default: "Over."
  • Stored in global settings (new field on GlobalSettings model)
  • Detection: after each transcript completion, check if the accumulated text ends with the trigger phrase as a standalone phrase (case-insensitive, with some fuzzy matching for punctuation variations like "over", "Over.", "over.")
  • "Standalone" means it's its own sentence/utterance — "it's over" should NOT trigger, but "fix the bug. Over." should
  • When triggered: strip the trigger word from the text and auto-submit
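One plausible implementation of the standalone check (the exact matching rules are still an open question below): accept the trigger only when it ends the text and is preceded by nothing or by sentence-ending punctuation, ignoring case and trailing punctuation on the trigger itself:

```typescript
// Returns whether `text` ends with `trigger` as a standalone utterance, plus
// the text with the trigger stripped. "it's over" must not trigger;
// "fix the bug. Over." must.
function detectTrigger(
  text: string,
  trigger: string
): { triggered: boolean; cleaned: string } {
  const word = trigger.trim().replace(/[.!?,]+$/, ""); // "Over." -> "Over"
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Standalone = at the start of the text, or preceded by sentence-ending punctuation.
  const re = new RegExp(`(^|[.!?]\\s*)${escaped}[.!?]*\\s*$`, "i");
  const m = re.exec(text);
  if (m === null) return { triggered: false, cleaned: text };
  const prefixLen = m.index + (m[1] ?? "").length;
  return { triggered: true, cleaned: text.slice(0, prefixLen).trim() };
}
```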

5. UI Changes

Update: src/components/voice/VoiceMicButton.tsx

  • Replace current push-to-talk with streaming mode
  • When recording:
    • Mic button shows "listening" state (pulsing animation)
    • Live transcript preview appears in/above the input box
    • Each completed utterance appends to the preview
    • Trigger word auto-submits (if configured)
    • User can tap mic to stop recording
      • With trigger word configured: stops without sending (cancel)
      • Without trigger word: places transcript in input box

Update: Settings UI to add trigger word configuration

6. Settings: Trigger Word

Schema change: Add voiceTriggerWord field to GlobalSettings model (nullable string, default null)

  • When null/empty: streaming voice still works, but user must tap mic to stop and send manually
  • When set: saying the trigger word auto-submits
  • UI: text input in Settings → Voice section

7. Cleanup

  • Remove old POST /api/voice/transcribe endpoint
  • Remove old MediaRecorder-based recording logic from useVoiceRecording.ts
  • Remove gpt-4o-mini-transcribe usage from voice.ts service (keep TTS)

Key Design Decisions

  1. Single mode, not two — Streaming replaces push-to-talk entirely. Without a trigger word it behaves the same (tap to start, tap to stop) but with live preview. No mode toggle needed.

  2. Browser-direct WebSocket — Audio never touches our server. Backend only generates ephemeral tokens. Lower latency, simpler server.

  3. semantic_vad vs server_vad — semantic_vad uses ML to detect natural speech boundaries (smarter than volume-based server_vad). Should test both; semantic_vad may work better at recognizing "Over." as a distinct utterance. Fall back to server_vad if semantic_vad proves too slow.

  4. PCM16 via AudioWorklet — Required by the Realtime API. AudioWorklet is well-supported in modern browsers (Safari 14.5+, Chrome Android). The worklet runs off-main-thread so it won't impact UI performance.

Open Questions

  • Mobile browser AudioWorklet support — AudioWorklet is supported in Safari 14.5+ and Chrome Android. Need to verify it works well on the target mobile browsers.
  • Token refresh — Transcription sessions have a 30-minute limit. For long conversations, we may need to transparently reconnect.
  • Cost — Realtime API charges per minute of audio. Should we show a warning or auto-stop after some duration?
  • Trigger word variations — How fuzzy should matching be? "over", "Over.", "over.", "Over" should all work. What about "it's over" (shouldn't trigger)?
