An OpenAI API-compatible speech-to-text server for audio transcription and translation, aka Whisper.
- Compatible with the OpenAI audio/transcriptions and audio/translations API
- Does not connect to the OpenAI API and does not require an OpenAI API Key
- Not affiliated with OpenAI in any way
- NEW: Automatic alert tone and silence skipping with Silero VAD
Documentation:
- Quick Start Guide - Get running in 5 minutes
- Installation Guide - Detailed setup for all platforms
- Dependencies Explained - What gets installed and why
- Change Log - What's new in v0.2.0
Quick Start:
pip install -r requirements.txt
python whisper_server.py --model small
curl http://localhost:8000/health
API Compatibility:
- /v1/audio/transcriptions
- /v1/audio/translations
Parameter Support:
- file
- model (only whisper-1 exists, so this is ignored)
- language
- prompt (FULLY SUPPORTED - guides transcription with custom terminology, formatting, etc.)
- temperature
- response_format (see the example below):
  - json
  - text
  - srt
  - vtt
  - verbose_json
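For example, requesting SRT subtitles with the OpenAI Python client (a minimal sketch; the file path is a placeholder and the dummy API key is not checked by this server):
from openai import OpenAI
client = OpenAI(api_key='sk-1111', base_url='http://localhost:8000/v1')
audio_file = open("/path/to/file/audio.mp3", "rb")
# response_format can be json, text, srt, vtt, or verbose_json
srt = client.audio.transcriptions.create(model="whisper-1", file=audio_file, response_format="srt")
print(srt)  # SRT-formatted subtitles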
Details:
- CUDA or CPU support (automatically detected)
- float32, float16 or bfloat16 support (automatically detected)
- Silero VAD tone skipping - Automatically detects and skips alert tones and silence at the beginning of audio files using Silero Voice Activity Detection
Tested whisper models:
- large-v3 (the default)
- large-v2
- large
- medium
- small
- base
- tiny
Version: 0.2.0, Last update: 2026-01-11
Requirements:
- Python 3.8 or higher
- FFmpeg (for audio processing)
- (Optional) CUDA-capable GPU for faster transcription
- Install FFmpeg
  # Ubuntu/Debian
  sudo apt install ffmpeg
  # macOS (using Homebrew)
  brew install ffmpeg
  # Windows (using Chocolatey)
  choco install ffmpeg
- Install Python Dependencies
  pip install -r requirements.txt
This will install:
- FastAPI and Uvicorn (API server)
- OpenAI Whisper (transcription engine)
- PyTorch and torchaudio (deep learning framework)
- Silero VAD (automatic tone/silence detection, downloaded on first run)
- Python-multipart (file upload support)
- (Optional) CUDA Support
  For GPU acceleration, install CUDA for your operating system. PyTorch will automatically detect and use CUDA if available.
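To see what will be auto-detected on your machine, you can ask PyTorch directly (a small sketch; the server's own detection logic may differ in details):
import torch
# Device: CUDA if a compatible GPU is visible, otherwise CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Precision: bfloat16 where supported, float16 on other GPUs, float32 on CPU
if device.startswith("cuda"):
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    dtype = torch.float32
print(f"device={device}, dtype={dtype}")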
On the first run, the Silero VAD model is automatically downloaded from torch.hub (~2MB). This is a one-time operation.
Note: This implementation uses the official OpenAI Whisper library which has full prompt support built-in!
Usage: whisper_server.py [-m <model_name>] [-d <device>] [-P <port>] [-H <host>] [--preload]
Description:
OpenedAI Whisper API Server (Silero VAD tone skipping)
Options:
-h, --help Show this help message and exit.
-m MODEL, --model MODEL
The model to use for transcription.
Options: tiny, base, small, medium, large, large-v2, large-v3 (default: large-v3)
-d DEVICE, --device DEVICE
Set the torch device for the model. Ex. cuda:0 or cpu (default: auto)
-P PORT, --port PORT Server tcp port (default: 8000)
-H HOST, --host HOST Host to listen on, Ex. 0.0.0.0 (default: 0.0.0.0)
--preload Preload model and exit. (default: False)
This server includes Silero VAD (Voice Activity Detection) which automatically:
- Detects the start of speech in audio files
- Skips alert tones, beeps, and silence at the beginning of recordings
- Preserves a 150ms buffer before speech starts to maintain context
- Improves transcription accuracy by removing non-speech audio
This feature is especially useful for:
- Radio dispatch recordings with alert tones
- Pager recordings with notification beeps
- Any audio with leading silence or tones
The VAD processing is automatic and requires no configuration. It processes the audio at 16kHz mono and uses a 250ms minimum speech duration threshold with 0.5 confidence.
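In outline, the VAD step works roughly like the sketch below (illustrative only, not the server's exact code; the file name is a placeholder and reading mp3 assumes a working audio backend):
import torch
# Load Silero VAD from torch.hub (cached after the first download, ~2MB)
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils
SAMPLE_RATE = 16000
wav = read_audio("audio.mp3", sampling_rate=SAMPLE_RATE)  # 16kHz mono
# Speech regions at 0.5 confidence with a 250ms minimum speech duration
speech = get_speech_timestamps(wav, model, sampling_rate=SAMPLE_RATE, threshold=0.5, min_speech_duration_ms=250)
if speech:
    # Keep a 150ms buffer before the first detected speech; drop tones/silence before it
    start = max(0, speech[0]["start"] - int(0.150 * SAMPLE_RATE))
    wav = wav[start:]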
Check if the server is running and ready:
curl http://localhost:8000/health
Response: {"status":"ok"}
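You can also run the same check from Python before sending transcription requests (a minimal sketch using only the standard library):
import json, urllib.request
with urllib.request.urlopen("http://localhost:8000/health", timeout=5) as resp:
    status = json.load(resp)
print(status)  # {'status': 'ok'} when the server is ready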
You can use it like this:
curl -s http://localhost:8000/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F model="whisper-1" -F file="@audio.mp3" -F response_format=text
Or just like this:
curl -s http://localhost:8000/v1/audio/transcriptions -F model="whisper-1" -F file="@audio.mp3"
Or like this example from the OpenAI Speech to text guide Quickstart:
from openai import OpenAI
client = OpenAI(api_key='sk-1111', base_url='http://localhost:8000/v1')
audio_file = open("/path/to/file/audio.mp3", "rb")
transcription = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
print(transcription.text)
The prompt parameter helps guide Whisper's transcription by providing context, terminology, and formatting preferences. This is especially useful for domain-specific audio like radio communications, medical terminology, or technical jargon.
Example with curl:
curl -s http://localhost:8000/v1/audio/transcriptions \
-F model="whisper-1" \
-F file="@audio.mp3" \
-F prompt="Emergency radio dispatch communications. Common units: MEDIC, ENGINE, TRUCK, LADDER. Radio procedure: COPY, CLEAR, EN ROUTE, ON SCENE."
Example with Python:
from openai import OpenAI
client = OpenAI(api_key='sk-1111', base_url='http://localhost:8000/v1')
# Recommended prompt for radio dispatch transcription
prompt = """Emergency radio dispatch. CRITICAL: Never repeat. Common units: MEDIC, ENGINE, TRUCK, LADDER, SQUAD, BATTALION. Radio words: COPY, CLEAR, EN ROUTE, ON SCENE. Phonetic: ADAM, BAKER, CHARLES, DAVID, FRANK, GEORGE, KING, LINCOLN, MARY, OCEAN, QUEEN, SAM, VICTOR, X-RAY. Ages: NUMBER YEAR OLD MALE/FEMALE. Medical: GSW, SOB, CPR, AED, MVA. Use periods between statements."""
audio_file = open("/path/to/file/radio_audio.mp3", "rb")
transcription = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
prompt=prompt
)
print(transcription.text)
Recommended Radio Dispatch Prompt:
For emergency radio dispatch communications, this prompt has been tested and works well:
Emergency radio dispatch. CRITICAL: Never repeat. Common units: MEDIC, ENGINE, TRUCK, LADDER, SQUAD, BATTALION. Radio words: COPY, CLEAR, EN ROUTE, ON SCENE. Phonetic: ADAM, BAKER, CHARLES, DAVID, FRANK, GEORGE, KING, LINCOLN, MARY, OCEAN, QUEEN, SAM, VICTOR, X-RAY. Ages: NUMBER YEAR OLD MALE/FEMALE. Medical: GSW, SOB, CPR, AED, MVA. Use periods between statements.
This prompt:
- Prevents repetitive hallucinations with "CRITICAL: Never repeat"
- Provides common emergency service terminology
- Includes phonetic alphabet for call signs
- Guides proper formatting for ages and medical terms
- Achieves ~95% accuracy on radio dispatch audio
Important Notes:
- The prompt provides guidance, not restrictions - Whisper will still transcribe all audio
- Prompts improve accuracy on domain-specific terms and reduce hallucinations
- Keep prompts under 400 characters - longer prompts can trigger hallucinations (see the sketch after this list)
- Especially helpful with poor audio quality or background noise
- Customize the prompt based on your specific use case (medical, legal, technical, etc.)
- If you see repeated words (hallucinations), try shortening or removing the prompt
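As a simple illustration of the 400-character guideline above, here is a small, hypothetical helper (not part of the server) that warns about and trims over-long prompts:
MAX_PROMPT_CHARS = 400  # longer prompts can trigger hallucinations

def check_prompt(prompt: str) -> str:
    # Hypothetical helper: warn and truncate if the prompt is too long
    if len(prompt) > MAX_PROMPT_CHARS:
        print(f"Warning: prompt is {len(prompt)} chars, truncating to {MAX_PROMPT_CHARS}")
        return prompt[:MAX_PROMPT_CHARS]
    return prompt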
You can run the server via docker like so:
docker compose build
docker compose up
Options can be set via whisper.env.
- The Silero VAD model will be automatically downloaded on first run (~2MB)
- Models are cached in the hf_home directory, which is mounted as a volume
- GPU support requires the NVIDIA Docker runtime and a compatible GPU
- For CPU-only Docker, remove the runtime: nvidia and deploy sections from docker-compose.yml
If you get errors about downloading the Silero VAD model:
- Ensure you have internet connectivity
- Check that torch.hub has write access to its cache directory
- The model is downloaded from GitHub (snakers4/silero-vad)
- The first run may take 1-2 minutes to download the model (a manual pre-download sketch follows below)
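If the automatic download keeps failing, you can try pre-fetching the model into the torch.hub cache yourself once you have connectivity (a sketch):
import torch
# Downloads snakers4/silero-vad into the torch.hub cache (~2MB, one time)
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
print("Cached under:", torch.hub.get_dir())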
If you get FileNotFoundError related to ffmpeg:
# Ubuntu/Debian
sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Windows
choco install ffmpeg
If you get CUDA out of memory errors:
- Use a smaller model (small, base, or tiny; see the sketch below)
- Use CPU mode: --device cpu
- Reduce concurrent requests
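To gauge the memory trade-off outside the server, you can load a smaller model directly with the whisper library (a sketch; the file path is a placeholder):
import whisper
# "small", "base", and "tiny" need far less memory than "large-v3"
model = whisper.load_model("small", device="cpu")  # or a CUDA device if it fits
result = model.transcribe("/path/to/file/audio.mp3")
print(result["text"])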
If you get import errors:
pip install -r requirements.txt --upgrade
Make sure you have Python 3.8 or higher:
python --version