A flexible text-to-speech system based on Kokoro models with ONNX inference, supporting over 50 voices. Features include a FastAPI server for network-based TTS and a CLI client for local or remote synthesis.
- 50+ high-quality voices
- FastAPI server for network-based TTS
- CLI client for local or remote synthesis
- Docker support for easy deployment
- Configurable speech speed and chunking
- Organized output directory structure
- Raw audio streaming support
- Python 3.11+
- Dependencies listed in requirements.txt
- Docker (optional, for containerized deployment)
- Clone this repository:
git clone https://github.com/JoshuaWink/tts-kokoro.git
cd tts-kokoro
- Download the Kokoro model files:
# Download and extract model files from HuggingFace
# Visit: https://huggingface.co/hexgrad/Kokoro-82M
- Install dependencies:
pip install -r requirements.txt

The simplest way to use the system locally:
python speak_cli.py "Your text to speak"
Options:
- --save output.raw: Save the raw audio data to a file
- Additional options coming soon
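A file written with --save contains headerless audio, so most players need it wrapped in a container before they can open it. The sketch below uses Python's standard `wave` module for that. The function name `raw_to_wav` is hypothetical, and the sample format is an assumption: the code treats the data as 16-bit signed mono PCM at 24000 Hz (the rate the server reports in X-Sample-Rate). If the CLI actually emits a different sample format, such as 32-bit float, the samples would need converting first.

```python
import wave
from pathlib import Path


def raw_to_wav(raw_path, wav_path, sample_rate=24000, channels=1, sample_width=2):
    """Wrap headerless PCM bytes in a WAV container and return the frame count.

    ASSUMPTION: the raw file holds 16-bit signed mono PCM (sample_width=2,
    channels=1). Adjust the arguments if the real output format differs.
    """
    pcm = Path(raw_path).read_bytes()
    frame_size = channels * sample_width
    # Drop a trailing partial frame, if any, so the header matches the data.
    usable = len(pcm) - (len(pcm) % frame_size)
    with wave.open(str(wav_path), "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm[:usable])
    return usable // frame_size
```

Calling `raw_to_wav("output.raw", "output.wav")` after a `--save output.raw` run would produce a file that ordinary audio players can open, provided the format assumption holds.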
- Using Docker (recommended):
# Build the Docker image
docker build -t tts-kokoro-server .
# Run the server
docker run -d -p 8000:8000 --name tts-kokoro-server tts-kokoro-server
- Without Docker:
python tts_server.py

The server will be available at http://localhost:8000.
- POST /speak
  - Request body: {"text": "Your text to speak"}
  - Returns: Raw audio data
  - Headers: X-Sample-Rate: 24000 (audio sample rate)
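As an illustration of the endpoint described above, here is a minimal client sketch using only the Python standard library. The helper names `build_speak_request` and `speak` are hypothetical, as are the `Content-Type: application/json` header, the timeout, and the fallback to 24000 when X-Sample-Rate is absent. Only the path, the JSON body shape, and the header name come from this README.

```python
import json
import urllib.request


def build_speak_request(text, base_url="http://localhost:8000"):
    """Build a POST request for /speak carrying {"text": ...} as JSON."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/speak",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed to be expected
        method="POST",
    )


def speak(text, base_url="http://localhost:8000", timeout=60):
    """Send text to the server and return (raw_audio_bytes, sample_rate)."""
    request = build_speak_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
        # ASSUMPTION: fall back to 24000 Hz if the header is missing.
        sample_rate = int(response.headers.get("X-Sample-Rate", "24000"))
    return audio, sample_rate
```

With the server running, `audio, rate = speak("Your text to speak")` would return the raw bytes together with the reported sample rate, which can then be written to disk or passed to an audio library.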
For more control, use the tts_streamer.py script directly:
python tts_streamer.py --text "Your text here" --voice kokoro_model/voices/af_sarah.bin --speed 1.0
Parameters:
- --text: Text to convert to speech
- --voice: Voice file to use (default: am_puck.bin)
- --speed: Speech speed (0.8=slower, 1.0=normal, 1.2=faster)
- --output: Save to file instead of playing
- --max-tokens: Maximum tokens per segment (200-300 recommended)
- --silent: Suppress debug output
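When the streamer is driven from another Python script, it can help to assemble the command line in one place. The sketch below is one way to do that. The helper names `build_streamer_command` and `run_streamer` are hypothetical, and it assumes that --silent is a plain on/off flag and that --output and --max-tokens each take a single value, which the parameter descriptions suggest but do not state.

```python
import subprocess
import sys


def build_streamer_command(text, voice=None, speed=1.0, output=None,
                           max_tokens=None, silent=False):
    """Return the argv list for one tts_streamer.py invocation."""
    command = [sys.executable, "tts_streamer.py",
               "--text", text, "--speed", str(speed)]
    if voice is not None:
        command += ["--voice", voice]
    if output is not None:
        command += ["--output", output]
    if max_tokens is not None:
        command += ["--max-tokens", str(max_tokens)]
    if silent:
        command.append("--silent")  # ASSUMPTION: boolean flag with no value
    return command


def run_streamer(text, **options):
    """Run the streamer from the repository root; raise if it exits non-zero."""
    subprocess.run(build_streamer_command(text, **options), check=True)
```

Passing the arguments as a list rather than a shell string means text containing quotes or spaces reaches the script unchanged.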
Create multiple versions with different speeds:
python tts_variant_generator.py --text "Your text here" --all-variants

Voices follow the pattern [language/accent][gender]_[name].bin:
- First letter: language/accent code (a=American, b=British, etc.)
- Second letter: gender (m=male, f=female)
- Name: Unique identifier
Example: am_puck.bin = American Male "Puck" voice
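The naming pattern can be decoded mechanically. The following sketch splits a voice file name into its three parts. The function name `parse_voice_name` is hypothetical, and only the accent codes named above (a and b) are mapped; any other code is reported as unknown rather than guessed.

```python
from pathlib import PurePath

# Only the codes documented in this README; others are left unmapped.
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"m": "male", "f": "female"}


def parse_voice_name(voice_file):
    """Split e.g. 'am_puck.bin' into accent, gender, and name."""
    stem = PurePath(voice_file).stem  # strips any directory and the .bin suffix
    prefix, separator, name = stem.partition("_")
    if len(prefix) != 2 or not separator or not name:
        raise ValueError(f"unexpected voice file name: {voice_file!r}")
    accent_code, gender_code = prefix
    if gender_code not in GENDERS:
        raise ValueError(f"unknown gender code: {gender_code!r}")
    return {
        "accent": ACCENTS.get(accent_code, f"unknown ({accent_code})"),
        "gender": GENDERS[gender_code],
        "name": name,
    }


print(parse_voice_name("am_puck.bin"))
# → {'accent': 'American', 'gender': 'male', 'name': 'puck'}
```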
tts-kokoro/
├── kokoro_model/ # Model and voice files
│ ├── onnx/ # ONNX model files
│ └── voices/ # Voice files (.bin)
├── output/ # Generated audio output
├── speak_cli.py # CLI client
├── tts_server.py # FastAPI server
├── tts_streamer.py # Core TTS engine
└── tts_variant_generator.py # Speed variant generator
Contributions are welcome! Please feel free to submit a Pull Request.
Huge thanks to everyone at Hexgrad who released this model for free. What a great time to be alive. https://huggingface.co/hexgrad/Kokoro-82M