TTS-Kokoro

A flexible text-to-speech system based on Kokoro models with ONNX inference, supporting over 50 voices. Features include a FastAPI server for network-based TTS and a CLI client for local or remote synthesis.

Features

50+ high-quality voices
FastAPI server for network-based TTS
CLI client for local or remote synthesis
Docker support for easy deployment
Configurable speech speed and chunking
Organized output directory structure
Raw audio streaming support

Requirements

Python 3.11+
Dependencies listed in requirements.txt
Docker (optional, for containerized deployment)

Installation

Clone this repository:

git clone https://github.com/JoshuaWink/tts-kokoro.git
cd tts-kokoro

Download the Kokoro model files and place them under kokoro_model/ (see Project Structure):

# Download and extract model files from HuggingFace
# Visit: https://huggingface.co/hexgrad/Kokoro-82M

Install dependencies:

pip install -r requirements.txt

Usage

Local CLI

The simplest way to use the system locally:

python speak_cli.py "Your text to speak"

Options:

--save output.raw: Save the raw audio data to a file
Additional options coming soon

Server Deployment

Using Docker (recommended):

# Build the Docker image
docker build -t tts-kokoro-server .

# Run the server
docker run -d -p 8000:8000 --name tts-kokoro-server tts-kokoro-server

Without Docker:

python tts_server.py

The server will be available at http://localhost:8000.

API Endpoints

POST /speak
- Request body: {"text": "Your text to speak"}
- Returns: Raw audio data
- Response headers: X-Sample-Rate: 24000 (audio sample rate in Hz)
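
As an illustration of calling the endpoint, here is a minimal client sketch using only the Python standard library. It assumes the server is reachable at http://localhost:8000 as described above. The function names build_speak_request and speak are made up for this example, and the byte layout of the returned raw audio is not specified in this README, so the sketch writes the bytes to disk unchanged rather than decoding them.

```python
import json
import urllib.request


def build_speak_request(text, base_url="http://localhost:8000"):
    """Build a POST /speak request with the JSON body described above."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/speak",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def speak(text, out_path="output.raw", base_url="http://localhost:8000"):
    """Send text to the server and save the raw audio bytes as returned.

    Returns (sample_rate, byte_count). The sample rate is read from the
    X-Sample-Rate response header; 24000 is used if the header is absent.
    """
    request = build_speak_request(text, base_url)
    with urllib.request.urlopen(request) as response:
        sample_rate = int(response.headers.get("X-Sample-Rate", 24000))
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return sample_rate, len(audio)


# Example (requires a running server):
#   rate, size = speak("Your text to speak")
```

Keeping request construction separate from the network call makes the request easy to inspect or log before anything is sent.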

Advanced Usage

For more control, use the tts_streamer.py script directly:

python tts_streamer.py --text "Your text here" --voice kokoro_model/voices/af_sarah.bin --speed 1.0

Parameters:

--text: Text to convert to speech
--voice: Voice file to use (default: am_puck.bin)
--speed: Speech speed (0.8=slower, 1.0=normal, 1.2=faster)
--output: Save to file instead of playing
--max-tokens: Maximum tokens per segment (200-300 recommended)
--silent: Suppress debug output
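
To give a feel for what --max-tokens controls, the sketch below splits text into segments at sentence boundaries while keeping each segment under a token budget. This is not the code in tts_streamer.py: the function name chunk_text is hypothetical, and whitespace-separated words stand in for tokens, whereas the real engine presumably counts tokens differently.

```python
import re


def chunk_text(text, max_tokens=250):
    """Group sentences into segments of at most max_tokens words.

    Assumption: one whitespace-separated word counts as one token.
    A single sentence longer than max_tokens is kept whole as its
    own segment rather than being cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Flush the current segment if this sentence would overflow it.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting at sentence boundaries rather than at a fixed token offset avoids cutting a segment in the middle of a phrase, which tends to produce audible breaks.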

Generate Speed Variants

Create multiple versions with different speeds:

python tts_variant_generator.py --text "Your text here" --all-variants
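
For comparison, a similar result can be approximated by invoking tts_streamer.py once per speed, using only the parameters documented under Advanced Usage. The sketch below is not how tts_variant_generator.py is implemented; the function name build_variant_commands and the output file names (including the .raw extension) are assumptions made for illustration.

```python
import sys


def build_variant_commands(text, voice, speeds=(0.8, 1.0, 1.2)):
    """Return one tts_streamer.py command line per requested speed."""
    commands = []
    for speed in speeds:
        # Hypothetical naming scheme for the output files.
        output = f"output/variant_{speed:.1f}.raw"
        commands.append([
            sys.executable, "tts_streamer.py",
            "--text", text,
            "--voice", voice,
            "--speed", str(speed),
            "--output", output,
        ])
    return commands
```

Each command can then be run from the repository root with subprocess.run(cmd, check=True).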

Voice Naming Convention

Voices follow the pattern [language/accent][gender]_[name].bin:

First letter: language/accent code (a=American, b=British, etc.)
Second letter: gender (m=male, f=female)
Name: Unique identifier

Example: am_puck.bin = American Male "Puck" voice
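
The convention above can be decoded mechanically. The following sketch parses a voice file name into its parts; the function name parse_voice_name is illustrative, and only the two accent codes listed in this README (a, b) are mapped. Any other code is returned as-is rather than guessed.

```python
from pathlib import Path

# Only the codes documented above; other accent codes exist ("etc.")
# but are not listed in this README, so they are left unmapped.
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"m": "male", "f": "female"}


def parse_voice_name(voice_file):
    """Split a voice file name such as 'am_puck.bin' into its parts."""
    stem = Path(voice_file).stem            # 'am_puck'
    prefix, sep, name = stem.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice file name: {voice_file!r}")
    accent_code, gender_code = prefix
    return {
        "accent": ACCENTS.get(accent_code, accent_code),
        "gender": GENDERS.get(gender_code, gender_code),
        "name": name,
    }
```

Passing a full path such as kokoro_model/voices/af_sarah.bin works too, since only the final file name is inspected.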

Project Structure

tts-kokoro/
├── kokoro_model/          # Model and voice files
│   ├── onnx/             # ONNX model files
│   └── voices/           # Voice files (.bin)
├── output/               # Generated audio output
├── speak_cli.py         # CLI client
├── tts_server.py        # FastAPI server
├── tts_streamer.py      # Core TTS engine
└── tts_variant_generator.py  # Speed variant generator

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Huge thanks to everyone at Hexgrad who released this model for free. What a great time to be alive. https://huggingface.co/hexgrad/Kokoro-82M
