TTS-Kokoro

A flexible text-to-speech system based on Kokoro models with ONNX inference, supporting over 50 voices. Features include a FastAPI server for network-based TTS and a CLI client for local or remote synthesis.

Features

  • 50+ high-quality voices
  • FastAPI server for network-based TTS
  • CLI client for local or remote synthesis
  • Docker support for easy deployment
  • Configurable speech speed and chunking
  • Organized output directory structure
  • Raw audio streaming support

Requirements

  • Python 3.11+
  • Dependencies listed in requirements.txt
  • Docker (optional, for containerized deployment)

Installation

  1. Clone this repository:
git clone https://github.com/JoshuaWink/tts-kokoro.git
cd tts-kokoro
  2. Download the Kokoro model files (see the download sketch after this list):
# Download and extract model files from HuggingFace
# Visit: https://huggingface.co/hexgrad/Kokoro-82M
  3. Install dependencies:
pip install -r requirements.txt
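
One way to fetch the model files is with the huggingface_hub library. This is a minimal sketch, assuming you then arrange the downloaded files to match the kokoro_model/ layout shown under Project Structure:

# Sketch: download the model repository with huggingface_hub (pip install huggingface_hub).
# The target directory and any rearranging into kokoro_model/onnx and kokoro_model/voices
# are assumptions -- adjust to match the layout this project expects.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="hexgrad/Kokoro-82M", local_dir="kokoro_model")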

Usage

Local CLI

The simplest way to use the system locally:

python speak_cli.py "Your text to speak"

Options:

  • --save output.raw: Save the raw audio data to a file (see the playback sketch below)
  • Additional options coming soon
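
The saved file is headerless raw audio. Below is a minimal sketch for wrapping it in a WAV container so ordinary players can open it, assuming 16-bit signed mono PCM at 24000 Hz; the 24 kHz rate matches the server's X-Sample-Rate header, but verify the sample width and channel count against your output:

# Sketch: wrap raw audio in a WAV header using only the standard library.
# The sample format below (16-bit mono) is an assumption; 24000 Hz matches
# the rate advertised by the server's X-Sample-Rate header.
import wave

with open("output.raw", "rb") as raw:
    pcm = raw.read()

with wave.open("output.wav", "wb") as wav:
    wav.setnchannels(1)      # mono (assumption)
    wav.setsampwidth(2)      # 16-bit samples (assumption)
    wav.setframerate(24000)  # sample rate per X-Sample-Rate
    wav.writeframes(pcm)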

Server Deployment

  1. Using Docker (recommended):
# Build the Docker image
docker build -t tts-kokoro-server .

# Run the server
docker run -d -p 8000:8000 --name tts-kokoro-server tts-kokoro-server
  2. Without Docker:
python tts_server.py

The server will be available at http://localhost:8000.

API Endpoints

  • POST /speak
    • Request body: {"text": "Your text to speak"}
    • Returns: Raw audio data
    • Headers: X-Sample-Rate: 24000 (audio sample rate)
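
A minimal client sketch using the requests library against the endpoint above; the output filename is illustrative:

# Sketch: call POST /speak on a locally running server and save the raw audio bytes.
import requests

resp = requests.post(
    "http://localhost:8000/speak",
    json={"text": "Your text to speak"},
)
resp.raise_for_status()

sample_rate = resp.headers.get("X-Sample-Rate", "24000")
with open("speech.raw", "wb") as out:  # filename is illustrative
    out.write(resp.content)
print(f"Saved {len(resp.content)} bytes of raw audio at {sample_rate} Hz")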

Advanced Usage

For more control, use the tts_streamer.py script directly:

python tts_streamer.py --text "Your text here" --voice kokoro_model/voices/af_sarah.bin --speed 1.0

Parameters:

  • --text: Text to convert to speech
  • --voice: Voice file to use (default: am_puck.bin)
  • --speed: Speech speed (0.8=slower, 1.0=normal, 1.2=faster)
  • --output: Save to file instead of playing
  • --max-tokens: Maximum tokens per segment (200-300 recommended)
  • --silent: Suppress debug output
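
For example, combining the flags above to render a slightly faster take straight to a file (the paths are illustrative):

python tts_streamer.py --text "Your text here" --voice kokoro_model/voices/am_puck.bin --speed 1.2 --max-tokens 250 --output output/sample_fast.raw --silent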

Generate Speed Variants

Create multiple versions with different speeds:

python tts_variant_generator.py --text "Your text here" --all-variants

Voice Naming Convention

Voices follow the pattern [language/accent][gender]_[name].bin:

  • First letter: language/accent code (a=American, b=British, etc.)
  • Second letter: gender (m=male, f=female)
  • Name: Unique identifier

Example: am_puck.bin = American Male "Puck" voice
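
A small sketch that decodes a voice filename according to this convention (the helper and its language table are hypothetical, not part of this repo):

# Hypothetical helper: split a voice filename into language, gender, and name.
from pathlib import Path

LANGUAGES = {"a": "American", "b": "British"}  # extend as needed (assumption)
GENDERS = {"m": "male", "f": "female"}

def describe_voice(path: str) -> str:
    stem = Path(path).stem                 # e.g. "am_puck"
    prefix, name = stem.split("_", 1)
    language = LANGUAGES.get(prefix[0], prefix[0])
    gender = GENDERS.get(prefix[1], prefix[1])
    return f"{language} {gender} voice '{name}'"

print(describe_voice("kokoro_model/voices/am_puck.bin"))  # -> American male voice 'puck'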

Project Structure

tts-kokoro/
├── kokoro_model/             # Model and voice files
│   ├── onnx/                 # ONNX model files
│   └── voices/               # Voice files (.bin)
├── output/                   # Generated audio output
├── speak_cli.py              # CLI client
├── tts_server.py             # FastAPI server
├── tts_streamer.py           # Core TTS engine
└── tts_variant_generator.py  # Speed variant generator

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Huge thanks to the folks at Hexgrad for releasing this model for free. What a great time to be alive. https://huggingface.co/hexgrad/Kokoro-82M
