A real-time, offline (self-hosted) speech recognition tool based on the OpenAI Whisper model.
Turns an audio recording such as `en_chunk.wav` into live text, word by word (see the demo animation at the top of the project page).
Why did I make it:
- to get a better grasp of modern speech-to-text tools: their capabilities and limitations
- to reuse this project as a sub-module in my other projects
Pros:
- speech-to-text transcription that works fully offline
  - an internet connection is still required for the first launch, to download the models
- very responsive output: the words appear on the screen with minimal delay
- efficient use of system resources
Cons:
- cold start takes quite some time (20-80 sec with an SSD)
- consumes a lot of RAM
  - occupies 4-8 GB with the default preset (Whisper models: tiny.en + small.en)
- CUDA support is strongly advised:
  - works 5-10 times slower without the CUDA Toolkit
  - incapable of running in real-time mode without CUDA
- may require a lot of fine-tuning at first, to adjust to your microphone
- deployment may be tricky
Audio data is stored in, and converted between, three different formats (a conversion sketch follows the list):
- WAVE PCM encoding, as implemented by the `wave` module of the Python standard library
- Pydub `AudioSegment`, convenient for sound processing
- NumPy array of floats, the input format Whisper expects
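For illustration, the conversion chain between these three representations can be sketched roughly as follows (the file name and the scaling to `[-1.0, 1.0]` are illustrative details, not necessarily the project's exact code):

```python
# Illustrative conversion chain: WAVE PCM bytes -> pydub AudioSegment ->
# NumPy float array; details (file name, scaling) are assumptions.
import wave

import numpy as np
from pydub import AudioSegment

# 1. WAVE PCM encoding, read via the standard-library wave module.
with wave.open("en_chunk.wav", "rb") as wav:
    pcm_bytes = wav.readframes(wav.getnframes())
    sample_width = wav.getsampwidth()
    frame_rate = wav.getframerate()
    channels = wav.getnchannels()

# 2. Pydub AudioSegment, convenient for sound processing.
segment = AudioSegment(
    data=pcm_bytes,
    sample_width=sample_width,
    frame_rate=frame_rate,
    channels=channels,
)

# 3. NumPy array of floats in [-1.0, 1.0], roughly the input Whisper expects
#    (mono, 16000 Hz).
segment = segment.set_channels(1).set_frame_rate(16000)
samples = np.array(segment.get_array_of_samples(), dtype=np.float32)
samples /= float(1 << (8 * segment.sample_width - 1))
```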
The workflow can be roughly described as follows:
- We attempt to split the audio stream into sentences based on the pauses in speech. Each such sentence is stored in a separate block.
- The last block constantly changes, because we iteratively append small chunks of sound (0.5-1.5 sec) while the user speaks. It is referred to as the ongoing block.
- All the previous blocks will not change anymore. They are referred to as finalized blocks.
- We apply a slow, high-quality transcription to each of the finalized blocks, only once.
- We apply a fast, low-quality transcription to the ongoing block at each iteration.
This approach is a tradeoff between quality and speed: the user immediately gets a rough transcription of the current phrase, and that transcription is refined once the phrase is finished (see the sketch below).
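A highly simplified sketch of that loop, assuming the default tiny.en/small.en preset (the helper arguments and function name are illustrative assumptions, not the project's actual API):

```python
import whisper


def realtime_transcribe(chunks, split_blocks, to_float_array):
    """Two-tier transcription loop: fast drafts of the ongoing block,
    slow one-off transcription of blocks that will no longer change.

    `chunks` is an iterable of PCM byte chunks (the Listener output),
    `split_blocks` splits the accumulated audio on silence, and
    `to_float_array` converts one block into a NumPy float array.
    """
    fast_model = whisper.load_model("tiny.en")   # low quality, low latency
    slow_model = whisper.load_model("small.en")  # high quality, run once per block

    finalized_text: list[str] = []
    ongoing = b""
    for chunk in chunks:
        ongoing += chunk
        *finalized, ongoing = split_blocks(ongoing)

        # Finalized blocks are transcribed exactly once, with the slow model.
        for block in finalized:
            finalized_text.append(slow_model.transcribe(to_float_array(block))["text"])

        # The ongoing block is re-transcribed on every iteration, with the fast model.
        draft = fast_model.transcribe(to_float_array(ongoing))["text"]
        yield " ".join(finalized_text + [draft])
```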
```mermaid
flowchart TB
    A([Sound source]) -->|sound| B(Listener)
    B -->|data chunks| C(Accumulation)
    C -->|binary data| D{{Preparations}}
    D -->|NumPy f16 array| G(Transcribing)
    G --> C
    G -->|text| H([Text accumulated])
```
where:
- `Listener` provides a stream of fixed-size PCM-encoded binary chunks (sketched below).
- `Accumulation` adds new binary chunks to the currently ongoing audio block.
- `Preparations` transforms the binary data into a NumPy array and applies a variety of measures to increase the quality of transcription.
- `Transcribing` applies a low-quality transcription to the ongoing audio block, applies a high-quality transcription to the blocks to be finalized (if any), and sends the ongoing block back to `Accumulation`.
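For illustration, a Listener over a WAVE PCM file could look like this minimal sketch (it reads `en_chunk.wav`; the chunk length is an assumed value, and the real project can also listen to the microphone):

```python
# Minimal, illustrative Listener over a WAVE PCM file.
import wave
from typing import Iterator


def listen(path: str = "en_chunk.wav", chunk_seconds: float = 1.0) -> Iterator[bytes]:
    """Yield fixed-size PCM-encoded binary chunks, as the Listener node does."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = int(wav.getframerate() * chunk_seconds)
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data
```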
The `Preparations` stage in more detail:

```mermaid
flowchart LR
    C([in])
    C -->|binary| D(Adjustment)
    D -->|seg| E(Split on silence)
    E -->|seg| F(Refinement)
    F -->|NumPy array| G([out])
```
where (a Pydub-based sketch follows the list):
- `Adjustment`:
  - transforms `binary` (WAVE PCM encoding) to `seg` (Pydub AudioSegment)
  - may apply volume normalization
- `Split on silence`:
  - splits audio blocks on silence
  - all except the last block are marked to be finalized
- `Refinement`:
  - may apply human voice frequency amplification
  - may speed up the audio
  - applies normalization
  - converts to mono, 16000 Hz
  - converts to a NumPy array of floats
  - may apply noise suppression
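A rough Pydub-based sketch of these stages (the silence thresholds and other parameter values are assumptions, not the project's settings):

```python
# Rough illustration of the Preparations stages with pydub; silence
# thresholds and durations are assumed values, not the project's settings.
import numpy as np
from pydub import AudioSegment, effects
from pydub.silence import split_on_silence


def prepare(pcm_bytes: bytes, sample_width: int = 2,
            frame_rate: int = 16000, channels: int = 1):
    # Adjustment: binary (WAVE PCM) -> AudioSegment, plus volume normalization.
    seg = AudioSegment(data=pcm_bytes, sample_width=sample_width,
                       frame_rate=frame_rate, channels=channels)
    seg = effects.normalize(seg)

    # Split on silence: everything except the last piece will be finalized.
    pieces = split_on_silence(seg, min_silence_len=700, silence_thresh=-40,
                              keep_silence=200) or [seg]
    *to_finalize, ongoing = pieces

    # Refinement (simplified): convert to mono, 16000 Hz, then to float arrays.
    def to_whisper_input(s: AudioSegment) -> np.ndarray:
        s = s.set_channels(1).set_frame_rate(16000)
        samples = np.array(s.get_array_of_samples(), dtype=np.float32)
        return samples / float(1 << (8 * s.sample_width - 1))

    return [to_whisper_input(s) for s in to_finalize], to_whisper_input(ongoing)
```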
Requirements:
- Python 3.11
- Poetry
- (optional) the make tool
- (optional) CUDA Toolkit
If you are on Windows with CUDA v11.4, just run `make init` (or `poetry install`).
Otherwise you may need to adjust the `pyproject.toml` file: the `torch` dependency in the `[tool.poetry.dependencies]` section.
Before launching the app, you might want to experiment with the `workflow_debugging.ipynb` notebook.
To launch the app, run `make run` (or just `make`, or `poetry run python -m speech2text`). It will transcribe the `en_chunk.wav` file. The result should resemble the animation at the top of the project page.
To transcribe your mic input instead, replace `demo_console_realtime(IN_FILE_PATH)` with `demo_console_realtime()` in `__main__.py`, as in the snippet below.
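For example, the relevant call in `__main__.py` (a sketch; both identifiers are taken from the project, the surrounding file contents may differ):

```python
# __main__.py (sketch): pick one of the two demo calls.
demo_console_realtime(IN_FILE_PATH)  # transcribe the bundled en_chunk.wav file
# demo_console_realtime()            # transcribe live microphone input instead
```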
You also might want to have a look at `config.yaml` in order to fine-tune the settings. If you run into performance troubles:
- replace `model_name: small.en` with `model_name: tiny.en`
- delete everything inside the `noisereduce` sub-section (i.e. replace it with just `noisereduce:`)
- make sure CUDA is available:
```python
from torch import cuda

assert cuda.is_available()
```

