Hi,
First: I sent a tip via Ko-fi and offered to donate my voice for a Dutch TTS voice. Kokoro's quality is impressive.
I'm blind and use NVDA (a screen reader). I've been testing Kokoro to see if it could replace the 20+ year-old TTS engines we currently use (eSpeak, Tiflotecnica/old Nuance). The voice quality difference is night and day - but latency is a blocker.
My benchmarks (Core Ultra 7 258V, 32GB RAM, Intel Arc 140V GPU):
| Model | Short-phrase latency | Verdict |
| --- | --- | --- |
| FP32 ONNX (CPU) | ~500ms | Too slow |
| INT8 ONNX (CPU) | ~1100ms | Even slower (wrong codepath?) |
| OpenVINO GPU | Failed | Dynamic STFT shapes not supported |
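
For context, the numbers above come from roughly the harness below - a minimal sketch using plain onnxruntime. The model path, input names, and shapes are placeholders for whatever the actual Kokoro ONNX export expects (check `sess.get_inputs()` for the real ones):

```python
import time

import numpy as np
import onnxruntime as ort

# Placeholder model path -- adjust to the real Kokoro export.
sess = ort.InferenceSession("kokoro.onnx", providers=["CPUExecutionProvider"])

def synthesize(tokens: np.ndarray, style: np.ndarray, speed: float = 1.0):
    # Input names are assumptions; sess.get_inputs() lists the real ones.
    return sess.run(None, {
        "tokens": tokens,
        "style": style,
        "speed": np.array([speed], dtype=np.float32),
    })

tokens = np.zeros((1, 12), dtype=np.int64)    # a short phrase like "button"
style = np.zeros((1, 256), dtype=np.float32)  # voice/style embedding

synthesize(tokens, style)  # warm-up: the first call pays graph optimization

times = []
for _ in range(50):
    t0 = time.perf_counter()
    synthesize(tokens, style)
    times.append((time.perf_counter() - t0) * 1000)

print(f"median {np.median(times):.0f}ms, p95 {np.percentile(times, 95):.0f}ms")
```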
Why screen readers are different:
Most TTS use cases (audiobooks, podcasts, video narration) tolerate 500ms+ latency easily. Screen readers are unique: we generate thousands of tiny utterances per hour ("button", "edit", "link", "checkbox checked"). Each must feel instant or navigation becomes unbearable.
Target latency: <200ms on an average laptop CPU (i5/Ryzen 5, no GPU).
Reference points: eSpeak achieves 5-10ms; Tiflotecnica ~50-150ms.
The gap:
The blind community is stuck with 20-year-old robotic voices because neural TTS is too slow. We don't typically have gaming PCs with dedicated GPUs. A neural TTS optimized for screen readers would help millions of users worldwide.
What might help (rough sketches for a few of these follow the list):
• Static shapes export for OpenVINO compatibility
• Streaming mode (start audio while still generating)
• Lighter model variant specifically for low-latency use
• Guidance on optimal CPU inference settings
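
On the static-shapes point, here's the kind of thing I mean - a sketch using OpenVINO's Python API to pin every dynamic dimension before compiling for GPU. The input names and the 64-token padding length are hypothetical (the real names come from the exported graph), and fixed-length padding does waste some compute on very short phrases:

```python
import openvino as ov

model = ov.convert_model("kokoro.onnx")
for inp in model.inputs:  # discover the real input names and dynamic dims
    print(inp.get_any_name(), inp.get_partial_shape())

# Hypothetical names/sizes: pad or truncate every utterance to 64 tokens so
# static shapes can propagate through the STFT ops that currently fail.
model.reshape({
    "tokens": [1, 64],
    "style": [1, 256],
})
ov.save_model(model, "kokoro_static.xml")

compiled = ov.Core().compile_model("kokoro_static.xml", "GPU")
```

For streaming, even without model changes, chunk-level streaming would already cut perceived latency: synthesize sentence-sized pieces in a worker thread and start playback as soon as the first one is ready. A sketch - the `synthesize` callable is a stand-in for any text-to-float32-audio function, and 24000Hz assumes Kokoro's output rate:

```python
import queue
import threading

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24000  # Kokoro's output rate, as I understand it

def stream_speak(chunks, synthesize):
    q = queue.Queue(maxsize=2)  # small buffer: bounded latency, no big pre-render

    def producer():
        for text in chunks:
            q.put(synthesize(text))  # mono float32 numpy audio per chunk
        q.put(None)                  # end-of-stream sentinel

    threading.Thread(target=producer, daemon=True).start()

    # Play each chunk as soon as it is ready while the next one renders.
    with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as out:
        while (audio := q.get()) is not None:
            out.write(audio.reshape(-1, 1))

# Silent placeholder synthesizer, just to make the sketch runnable.
stream_speak(["Button.", "Edit.", "Link."],
             synthesize=lambda text: np.zeros(SAMPLE_RATE // 10, np.float32))
```

And on CPU settings, for anyone else benchmarking: these onnxruntime session options are the knobs I'd expect guidance to cover - a starting point to profile against, not a known-good Kokoro configuration:

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 4  # tiny utterances rarely scale past a few cores
opts.inter_op_num_threads = 1
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

sess = ort.InferenceSession("kokoro.onnx", sess_options=opts,
                            providers=["CPUExecutionProvider"])
```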
I understand this is a hard problem. Just wanted to share this use case and data in case it's useful for future development.
Thanks for building Kokoro - hoping one day it can power screen readers too.