
fix: filter silence hallucinations from cloud transcription models#23

Merged
SeoFood merged 1 commit into main from issue-19 on Apr 5, 2026

@SeoFood SeoFood commented Apr 3, 2026

Summary

Fixes #19. Certain cloud transcription models (Whisper Large V3 Turbo, GPT-4o Transcribe) hallucinate random multi-language text when given silent audio. This PR adds two layers of silence detection:

  • Energy gate (pre-filter): Tracks pre-gain peak RMS per audio chunk. If no chunk during the recording exceeds the speech energy threshold (0.01 RMS), transcription is skipped entirely and "No speech detected" is shown. This works correctly even when AGC/WhisperMode amplifies the audio, since it checks the raw microphone level.
  • no_speech_prob filter (post-filter): Parses the no_speech_prob field from verbose_json Whisper API responses (OpenAI whisper-1, Groq, OpenAI Compatible). If every segment reports a no_speech_prob above 0.8, the result is discarded as a hallucination.
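
For readers who want a concrete picture of the energy gate, here is a minimal Python sketch of the idea. It is not the code from this PR: the names `EnergyGate`, `chunk_rms`, `observe`, and `has_speech` are hypothetical, and samples are assumed to be floats in the range [-1.0, 1.0] taken before any gain is applied. Only the 0.01 RMS threshold comes from the description above.

```python
import math
from typing import Sequence

# Threshold quoted in the PR description; everything else here is illustrative.
SPEECH_RMS_THRESHOLD = 0.01


def chunk_rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one chunk of raw (pre-gain) samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class EnergyGate:
    """Tracks the loudest chunk seen during one recording (hypothetical helper)."""

    def __init__(self, threshold: float = SPEECH_RMS_THRESHOLD) -> None:
        self.threshold = threshold
        self.peak_rms = 0.0

    def observe(self, raw_chunk: Sequence[float]) -> None:
        # Feed the microphone samples before AGC or any other amplification,
        # so a boosted noise floor cannot open the gate.
        self.peak_rms = max(self.peak_rms, chunk_rms(raw_chunk))

    def has_speech(self) -> bool:
        # True only if at least one chunk exceeded the threshold.
        return self.peak_rms > self.threshold
```

In this sketch the caller would check `has_speech()` once recording stops and skip the cloud request when it returns `False`, showing "No speech detected" instead.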

Both filters apply to the final transcription path (ProcessSingleJobAsync) and the live polling fallback (RunPollingFallbackAsync). WebSocket streaming providers (Deepgram, AssemblyAI) handle silence server-side and are unaffected.
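
The no_speech_prob post-filter can be sketched in a similar way. This is an illustration in Python rather than the code in this PR; `is_silence_hallucination` and `filter_transcript` are hypothetical names. It assumes the verbose_json response carries a `segments` array whose entries each include a numeric `no_speech_prob`, and it assumes that a response with no segments, or with a segment missing that field, should be kept rather than discarded.

```python
import json
from typing import Any, Mapping, Optional

# Cut-off quoted in the PR description.
NO_SPEECH_PROB_THRESHOLD = 0.8


def is_silence_hallucination(
    response: Mapping[str, Any],
    threshold: float = NO_SPEECH_PROB_THRESHOLD,
) -> bool:
    """True when every segment reports a no_speech_prob above the threshold."""
    segments = response.get("segments") or []
    if not segments:
        # Assumption: without segment data there is no evidence, so keep the text.
        return False
    return all(
        isinstance(seg.get("no_speech_prob"), (int, float))
        and seg["no_speech_prob"] > threshold
        for seg in segments
    )


def filter_transcript(raw_json: str) -> Optional[str]:
    """Return the transcript text, or None if it looks like a silence hallucination."""
    response = json.loads(raw_json)
    if is_silence_hallucination(response):
        return None
    return response.get("text")
```

A single confident segment is enough to keep the whole result in this sketch, which matches the "all segments" wording above and errs on the side of not dropping real speech.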

Test plan

  • Record 2-3 seconds of silence with Groq Whisper V3 Turbo - should show "No speech detected"
  • Record silence with GPT-4o Transcribe - should show "No speech detected"
  • Record normal speech - should transcribe correctly (no false positives)
  • Verify whispered speech is still captured (low energy but above threshold)
  • Verify local models (Parakeet, Canary) still work correctly via VAD path

…ixes #19)

Add two-layer silence detection to prevent Whisper models from
hallucinating random multi-language text when given silent audio:

1. Client-side energy gate: check pre-gain peak RMS against a threshold
   before sending audio to cloud APIs. Skips transcription entirely when
   no speech energy is detected.

2. Server-side no_speech_prob filter: parse the no_speech_prob field from
   verbose_json Whisper API responses (OpenAI, Groq, OpenAI Compatible)
   and discard results where all segments report high non-speech
   probability (> 0.8).
@SeoFood SeoFood merged commit 5b753cb into main Apr 5, 2026


Development

Successfully merging this pull request may close these issues.

Silence yields random text for certain models.