
OpenSpeakers

  ___                   ____                 _
 / _ \ _ __   ___ _ __ / ___| _ __  ___  __ _| | _____ _ __ ___
| | | | '_ \ / _ \ '_ \\___ \| '_ \/ _ \/ _` | |/ / _ \ '__/ __|
| |_| | |_) |  __/ | | |___) | |_) |  __/ (_| |   <  __/ |  \__ \
 \___/| .__/ \___|_| |_|____/| .__/ \___|\__,_|_|\_\___|_|  |___/
      |_|                    |_|

A unified, self-hosted TTS and voice cloning application supporting 11 open-source models with GPU hot-swap, async job queuing, real-time streaming, and a modern SvelteKit UI.


Screenshots

  • Home — landing page with model gallery
  • TTS — main generation page with per-model parameters
  • Clone Voice — drag-and-drop reference audio + voice library
  • Compare — run the same text through multiple models side-by-side
  • Batch — submit up to 100 lines and download as a ZIP
  • History — searchable, filterable job history
  • Models — capability browser with VRAM estimates
  • Settings — live GPU stats, output format, OpenAI API config

Key Features

Generation

  • 11 working TTS models — Kokoro 82M, VibeVoice 0.5B/1.5B, Fish Audio S2-Pro, Qwen3 TTS 1.7B, Orpheus 3B, Dia 1.6B, F5-TTS, Chatterbox, CosyVoice 2.0, Parler TTS Mini
  • GPU hot-swap — only one model in VRAM at a time; auto-eviction with 60-second idle timer
  • Ollama-style keep_alive — per-request -1 (hold forever), 0 (evict now), or N seconds
  • Real-time audio streaming — VibeVoice 0.5B streams PCM16 chunks via Redis pub/sub → WebSocket → Web Audio API
  • Output formats — WAV, MP3, OGG (ffmpeg transcoding)

Voice Cloning

  • Zero-shot cloning — Fish Audio S2-Pro, VibeVoice 1.5B, and Qwen3 TTS clone from any 3–30 second reference clip
  • Voice profiles — store reference audio and metadata; reuse across any generation
  • 50+ built-in voices — Kokoro's preset voice library
  • Emotion and style control — Fish S2-Pro [whisper] / [excited] tags; Orpheus <laugh> / <sigh> / <gasp> tags
  • Dialogue mode — Dia 1.6B [S1] / [S2] multi-speaker scripting with nonverbal sounds

Job Management

  • Async job queue — Celery + Redis; all generation is non-blocking with immediate job_id response
  • Job cancellation — revoke pending or running jobs mid-flight via the API or UI
  • Batch generation — submit up to 100 lines in a single request; download all audio as a ZIP
  • Full job history — searchable, filterable by model / status, paginated

API

  • OpenAI-compatible /v1/audio/speech — drop-in replacement for openai.audio.speech.create()
  • WebSocket progress — real-time queued → loading → generating → audio_chunk → complete events
  • Live GPU stats — WebSocket stream of utilisation, temperature, and power draw
  • Swagger UI — interactive API docs at /docs

UI / UX

  • SvelteKit 2 + Svelte 5 runes — reactive, type-safe frontend
  • WaveSurfer.js waveform — visual audio player with keyboard seek
  • Dark mode default — toggle persists to localStorage; FOUC prevention
  • Mobile responsive — sidebar collapses to hamburger on narrow screens
  • Keyboard shortcuts — press ? for the help modal; Ctrl+Enter to submit; arrows to seek
  • Toast notifications — non-blocking success/error/warning messages
  • Model browser — capability table with VRAM estimates

Supported Models

| Model | Container | Queue | VRAM | Cloning | Streaming | Status |
|---|---|---|---|---|---|---|
| Kokoro 82M | worker-kokoro | tts.kokoro | ~0.5 GB | | | ✅ Working (standby) |
| VibeVoice 0.5B | worker | tts | ~5 GB | | PCM16 | ✅ Working |
| VibeVoice 1.5B | worker | tts | ~12 GB | Zero-shot | | ✅ Working |
| Fish Audio S2-Pro | worker-fish | tts.fish-speech | ~22 GB | Zero-shot | Chunked | ✅ Working |
| Qwen3 TTS 1.7B | worker-qwen3 | tts.qwen3 | ~10 GB | Zero-shot | | ✅ Working |
| Orpheus 3B | worker-orpheus | tts.orpheus | ~7 GB | | | ✅ Working |
| Dia 1.6B | worker-dia | tts.dia | ~10 GB | Via prompt | | ✅ Working |
| F5-TTS | worker-f5 | tts.f5-tts | ~3 GB | Zero-shot | | ✅ Working |
| Chatterbox | worker-f5 | tts.f5-tts | ~5 GB | Zero-shot | | ✅ Working |
| CosyVoice 2.0 | worker-f5 | tts.f5-tts | ~5 GB | Zero-shot | | ✅ Working |
| Parler TTS Mini | worker-f5 | tts.f5-tts | ~3 GB | | | ✅ Working |

Quick Start

One-Line Install

curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenSpeakers/main/scripts/install.sh | bash

This downloads the compose files, pulls all images from Docker Hub, and starts the app. That's it.

Requirements: Docker with Compose v2, NVIDIA GPU, NVIDIA Container Toolkit

Once running, models download automatically on first use (~1-22 GB each depending on the model).

Manual Install (from Docker Hub)

mkdir openspeakers && cd openspeakers
curl -fsSLO https://raw.githubusercontent.com/davidamacey/OpenSpeakers/main/docker-compose.prod.yml
curl -fsSLO https://raw.githubusercontent.com/davidamacey/OpenSpeakers/main/docker-compose.gpu.yml
curl -fsSLO https://raw.githubusercontent.com/davidamacey/OpenSpeakers/main/.env.example
cp .env.example .env
docker compose -f docker-compose.prod.yml -f docker-compose.gpu.yml up -d

Development Setup (from source)

git clone https://github.com/davidamacey/OpenSpeakers.git
cd OpenSpeakers
cp .env.example .env
docker compose up -d --build

Configuration

Edit .env to customize:

| Variable | Default | Description |
|---|---|---|
| GPU_DEVICE_ID | 0 | Which CUDA GPU to use |
| FRONTEND_PORT | 5200 | UI port |
| BACKEND_PORT | 8080 | API port |
| HF_TOKEN | | Required for gated models (Orpheus 3B) |
| MODEL_CACHE_DIR | ./model_cache | Where model weights are stored |
| AUDIO_OUTPUT_DIR | ./audio_output | Where generated audio is saved |

First Use

  1. Open http://localhost:5200 — the TTS page loads automatically
  2. Select a model (Kokoro is fastest for a first test), enter text, click Generate
  3. Browse Models to see all 11 models with help text on speed, quality, and use cases
  4. Try Batch to generate multiple lines at once
  5. Try Compare to hear the same text across different models side-by-side
  6. Use Clone to upload a reference audio clip and create a custom voice

Management CLI

openspeakers.sh is a self-contained management script at the repo root:

./openspeakers.sh <command> [options]
| Command | Description |
|---|---|
| start [gpu\|dev\|offline\|build] | Start all services (default: gpu mode) |
| stop | Stop all services |
| restart [service] | Restart all or one service |
| status | Show service health |
| logs [service] | Tail logs (all services or one) |
| health | Check API health endpoint |
| workers status | Show all worker container statuses |
| workers logs [name] | Tail a specific worker's logs |
| workers restart [name] | Restart a worker container |
| workers rebuild [name] | Rebuild and restart a worker |
| db migrate | Apply all pending Alembic migrations |
| db revision "message" | Generate a new Alembic migration |
| db reset | Drop and recreate the database |
| db backup | Dump PostgreSQL to a timestamped file |
| db restore <file> | Restore a PostgreSQL dump |
| build [service] | Build Docker image(s) |
| shell [service] | Open a bash shell in a container |
| test [target] | Run backend or frontend tests |
| gpu | Show live GPU stats (nvidia-smi) |
| clean | Remove stopped containers and dangling images |
| purge | Remove all containers, volumes, and images |

API Reference

TTS Jobs

| Method | Path | Description |
|---|---|---|
| POST | /api/tts/generate | Submit a generation job; returns {job_id} |
| GET | /api/tts/jobs/{id} | Get job status and metadata |
| GET | /api/tts/jobs/{id}/audio | Stream the generated audio file |
| GET | /api/tts/jobs | List jobs (page, page_size, status, model_id, search) |
| DELETE | /api/tts/jobs/{id} | Cancel a pending or running job |
| POST | /api/tts/batch | Submit up to 100 lines; returns {batch_id, job_ids[]} |
| GET | /api/tts/batches/{id} | Aggregate batch status |
| GET | /api/tts/batches/{id}/zip | Stream ZIP of all completed audio files |
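The batch endpoints can be driven from a short script. The sketch below is illustrative only: the endpoints and the 100-line limit come from the table above, but the batch request body shape ({"model_id", "lines"}) is an assumption, not documented here.

```python
def chunk_lines(lines, limit=100):
    """Split input lines into batches no larger than the API's 100-line limit."""
    return [lines[i:i + limit] for i in range(0, len(lines), limit)]

def submit_batch(lines, model_id="kokoro", base="http://localhost:8080"):
    import requests  # imported lazily so the pure helper above has no dependency
    # NOTE: the body field names here are an assumption; check /docs for the schema
    resp = requests.post(f"{base}/api/tts/batch",
                         json={"model_id": model_id, "lines": lines})
    resp.raise_for_status()
    return resp.json()  # {"batch_id": ..., "job_ids": [...]}

def download_zip(batch_id, dest="batch.zip", base="http://localhost:8080"):
    import requests
    # stream the ZIP of completed audio to disk
    with requests.get(f"{base}/api/tts/batches/{batch_id}/zip", stream=True) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(8192):
                f.write(chunk)

# Against a running instance:
#   info = submit_batch(open("lines.txt").read().splitlines())
#   download_zip(info["batch_id"])
```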

Generate request body

{
  "model_id": "kokoro",
  "text": "Hello world",
  "voice": "af_bella",
  "speed": 1.0,
  "pitch": 0,
  "format": "wav",
  "keep_alive": 60
}

keep_alive controls how long the model stays in VRAM after this request:

  • -1 — keep loaded indefinitely
  • 0 — unload immediately after generation
  • N (positive integer) — unload after N seconds of inactivity (default: 60)
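A typical client submits a job, polls until it reaches a terminal state, then fetches the audio. This sketch uses the documented endpoints and request body; the status field name and terminal values ("status", "complete"/"failed") are assumptions based on the WebSocket event names.

```python
def make_request(text, model_id="kokoro", keep_alive=60, **params):
    """Build a /api/tts/generate body using the documented fields."""
    body = {"model_id": model_id, "text": text, "format": "wav",
            "keep_alive": keep_alive}
    body.update(params)  # e.g. voice="af_bella", speed=1.0
    return body

def generate_and_wait(text, base="http://localhost:8080", poll_s=1.0):
    import time, requests  # lazy imports keep make_request dependency-free
    job = requests.post(f"{base}/api/tts/generate",
                        json=make_request(text)).json()
    job_id = job["job_id"]
    while True:
        # field name "status" is an assumption; check GET /api/tts/jobs/{id} in /docs
        status = requests.get(f"{base}/api/tts/jobs/{job_id}").json()
        if status.get("status") in ("complete", "failed"):
            break
        time.sleep(poll_s)
    # stream the finished audio file
    return requests.get(f"{base}/api/tts/jobs/{job_id}/audio").content
```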

Voice Profiles

| Method | Path | Description |
|---|---|---|
| GET | /api/voices/ | List all voice profiles |
| POST | /api/voices/ | Create a new profile (multipart: name, model_id, audio) |
| GET | /api/voices/{id} | Get a single voice profile |
| PATCH | /api/voices/{id} | Update name, description, or tags |
| GET | /api/voices/{id}/audio | Stream the reference audio file |
| DELETE | /api/voices/{id} | Delete profile and reference audio |
| GET | /api/voices/builtin/{model_id} | List built-in preset voices for a model |
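Creating a voice profile is a multipart upload with the fields listed above. A minimal sketch, assuming WAV reference audio; the 3-30 second window comes from the zero-shot cloning notes in Key Features.

```python
def clip_ok(duration_s, lo=3.0, hi=30.0):
    """Reference clips should be roughly 3-30 seconds per the cloning notes."""
    return lo <= duration_s <= hi

def create_voice(name, model_id, wav_path, base="http://localhost:8080"):
    import requests  # lazy import keeps clip_ok dependency-free
    # multipart fields (name, model_id, audio) per the table above
    with open(wav_path, "rb") as f:
        resp = requests.post(
            f"{base}/api/voices/",
            data={"name": name, "model_id": model_id},
            files={"audio": (wav_path, f, "audio/wav")},
        )
    resp.raise_for_status()
    return resp.json()

# create_voice("my-voice", "fish-speech-s2", "reference.wav")
```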

Models

| Method | Path | Description |
|---|---|---|
| GET | /api/models/ | List all models with capabilities |
| GET | /api/models/{id} | Get single model info |
| POST | /api/models/{id}/load | Pre-warm a model (optional keep_alive) |
| DELETE | /api/models/{id}/load | Force-unload a model from VRAM |

System

| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness probe |
| GET | /api/system/info | GPU + disk + model registry snapshot |

OpenAI Compatibility

| Method | Path | Description |
|---|---|---|
| POST | /v1/audio/speech | OpenAI-compatible TTS |
| GET | /v1/models | OpenAI-format model list |

Model mapping: tts-1 → Kokoro 82M, tts-1-hd → Orpheus 3B

WebSocket

| Path | Events |
|---|---|
| /ws/jobs/{id} | queued, loading, generating, audio_chunk, complete, failed |
| /ws/gpu | GPU stats every 1 second |
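A job's WebSocket stream can be consumed with the third-party websockets library. This sketch assumes events arrive as JSON text frames carrying a "type" field matching the event names above; verify the frame format against a live instance.

```python
import json

TERMINAL = {"complete", "failed"}

def is_terminal(event_type):
    """True for events after which no further job updates will arrive."""
    return event_type in TERMINAL

async def watch_job(job_id, base="ws://localhost:8080"):
    import websockets  # third-party: pip install websockets
    async with websockets.connect(f"{base}/ws/jobs/{job_id}") as ws:
        async for frame in ws:
            # frame format is an assumption: JSON text with a "type" field
            event = json.loads(frame)
            print(event)
            if is_terminal(event.get("type")):
                return event

# import asyncio; asyncio.run(watch_job("some-job-id"))
```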

OpenAI Compatibility

OpenSpeakers exposes a /v1/audio/speech endpoint compatible with the OpenAI Python SDK and any application that uses the OpenAI TTS API.

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

# Uses Kokoro 82M (tts-1 maps to the fastest model)
audio = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello from OpenSpeakers!",
)
audio.stream_to_file("output.wav")

# Uses Orpheus 3B (tts-1-hd maps to the highest quality model)
audio = client.audio.speech.create(
    model="tts-1-hd",
    voice="zoe",
    input="A higher quality voice.",
)
audio.stream_to_file("output_hd.wav")

curl

curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "voice": "alloy", "input": "Hello world"}' \
  --output speech.wav

keep_alive Usage

The keep_alive parameter on /api/tts/generate and /api/models/{id}/load controls VRAM retention, similar to Ollama's model keep-alive:

# Keep the model loaded for 5 minutes after this request
curl -X POST http://localhost:8080/api/tts/generate \
  -H "Content-Type: application/json" \
  -d '{"model_id": "fish-speech", "text": "Hello", "keep_alive": 300}'

# Pre-warm a model and keep it loaded indefinitely
curl -X POST http://localhost:8080/api/models/kokoro/load \
  -H "Content-Type: application/json" \
  -d '{"keep_alive": -1}'

# Force-unload a model immediately
curl -X DELETE http://localhost:8080/api/models/orpheus/load

Worker Architecture

Each model group runs in its own container on a dedicated Celery queue. This isolates GPU memory, Python dependencies, and container build complexity.

| Container | Queue | Models | Dockerfile |
|---|---|---|---|
| worker-kokoro | tts.kokoro | Kokoro 82M (standby — always loaded) | Dockerfile.worker |
| worker | tts | VibeVoice 0.5B (streaming), VibeVoice 1.5B | Dockerfile.worker |
| worker-fish | tts.fish-speech | Fish Audio S2-Pro | Dockerfile.worker-fish |
| worker-qwen3 | tts.qwen3 | Qwen3 TTS 1.7B | Dockerfile.worker-qwen3 |
| worker-orpheus | tts.orpheus | Orpheus 3B | Dockerfile.worker-orpheus |
| worker-dia | tts.dia | Dia 1.6B | Dockerfile.worker-dia |
| worker-f5 | tts.f5-tts | F5-TTS, Chatterbox, CosyVoice 2.0, Parler TTS Mini | Dockerfile.worker-f5 |

The FastAPI backend never touches the GPU; only Celery workers load ML models. QUEUE_MAP in backend/app/api/endpoints/tts.py is the single source of truth for queue routing.
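The routing table likely has a simple shape: model ID in, Celery queue name out. The dict below is illustrative only (the authoritative QUEUE_MAP lives in backend/app/api/endpoints/tts.py); model IDs follow the Model Downloads section and queue names follow the worker table above.

```python
# Illustrative shape only -- not the actual contents of
# backend/app/api/endpoints/tts.py. Model IDs per the Model Downloads
# section; queue names per the worker table.
QUEUE_MAP = {
    "kokoro": "tts.kokoro",
    "vibevoice": "tts",
    "vibevoice-1.5b": "tts",
    "fish-speech-s2": "tts.fish-speech",
    "qwen3-tts": "tts.qwen3",
    "orpheus-3b": "tts.orpheus",
    "dia-1b": "tts.dia",
    "f5-tts": "tts.f5-tts",
    "chatterbox": "tts.f5-tts",
    "cosyvoice-2": "tts.f5-tts",
    "parler-tts": "tts.f5-tts",
}

def queue_for(model_id):
    """Route a generation request to its worker container's Celery queue."""
    return QUEUE_MAP[model_id]
```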

All secondary workers inherit from backend/Dockerfile.base-gpu which provides:

  • PyTorch 2.10.0+cu128 and torchaudio
  • NVIDIA env vars (NVIDIA_VISIBLE_DEVICES=all, NVIDIA_DRIVER_CAPABILITIES=compute,utility)
  • Common audio/ML packages: soundfile, numpy, scipy, librosa, accelerate

Frontend Pages

| Page | Route | Description |
|---|---|---|
| TTS | /tts | Main text-to-speech generation with streaming audio |
| Clone | /clone | Upload reference audio, manage voice profiles |
| Compare | /compare | Side-by-side multi-model comparison |
| Batch | /batch | Bulk generation from pasted text or .txt file |
| History | /history | Full job history with search and filter |
| Models | /models | Model browser with capability and VRAM reference |
| Settings | /settings | Output format, live GPU stats, storage paths |
| About | /about | Model descriptions and project links |

Project Structure

open_speakers/
├── backend/
│   ├── Dockerfile.base-gpu          # Shared GPU base (PyTorch 2.10+cu128)
│   ├── Dockerfile.worker            # Main worker (Kokoro + VibeVoice)
│   ├── Dockerfile.worker-fish       # Fish Speech worker
│   ├── Dockerfile.worker-qwen3      # Qwen3 TTS worker
│   ├── Dockerfile.worker-orpheus    # Orpheus 3B worker (vLLM)
│   ├── Dockerfile.worker-dia        # Dia 1.6B worker
│   ├── Dockerfile.worker-f5         # F5-TTS / Chatterbox / CosyVoice worker
│   └── app/
│       ├── api/endpoints/           # REST API routes + OpenAI compat
│       ├── models/                  # TTS model implementations + ModelManager
│       ├── tasks/                   # Celery tasks (generation, streaming)
│       ├── db/                      # SQLAlchemy ORM models + Alembic
│       └── schemas/                 # Pydantic v2 schemas
├── frontend/src/
│   ├── routes/                      # SvelteKit pages: tts, clone, compare, batch, history, models, settings
│   ├── components/                  # AudioPlayer, ModelParams, ToastContainer, WaveformPreview, etc.
│   └── lib/                         # API clients, Svelte stores (toasts, theme)
├── configs/
│   └── models.yaml                  # Model registry — enable/disable models here
├── docs/
│   ├── PLAN.md                      # Feature roadmap and implementation status
│   └── MARKET_RESEARCH.md           # Competitor analysis
├── scripts/
│   ├── test_all_models.py           # Smoke-test all deployed models sequentially
│   ├── download-models.sh           # Download all model weights from HuggingFace
│   └── package-offline.sh           # Air-gapped install packaging
├── openspeakers.sh                  # Management CLI
├── docker-compose.yml               # Base service definitions
├── docker-compose.override.yml      # Dev build targets (auto-loaded)
├── docker-compose.gpu.yml           # NVIDIA GPU passthrough overlay
└── docker-compose.offline.yml       # Air-gapped / offline deployment

Environment Variables

Copy .env.example to .env and adjust as needed:

| Variable | Default | Description |
|---|---|---|
| GPU_DEVICE_ID | 0 | CUDA device index for all workers |
| MODEL_CACHE_DIR | ./model_cache | HuggingFace model cache root (volume-mounted) |
| AUDIO_OUTPUT_DIR | ./audio_output | Generated audio file storage |
| POSTGRES_PASSWORD | openspeakers | PostgreSQL password |
| HF_TOKEN | | HuggingFace token (required for gated models: Orpheus 3B) |
| BACKEND_PORT | 8080 | Exposed backend API port |
| FRONTEND_PORT | 5200 | Exposed frontend port |
| DATABASE_URL | auto | Full PostgreSQL connection string (overrides individual vars) |
| CELERY_BROKER_URL | auto | Redis broker URL (overrides default) |

Development

# Start all services (COMPOSE_FILE in .env selects gpu+override automatically)
docker compose up -d

# Or use the management CLI one-liners:
./openspeakers.sh start          # start with GPU
./openspeakers.sh start dev      # start core services only (no GPU workers)
./openspeakers.sh start build    # build images then start (first run)

# Rebuild one worker after Dockerfile changes
docker compose up -d --build worker-orpheus

# Tail worker logs
docker compose logs -f worker-dia

# Open a shell inside a container
docker compose exec backend bash

# Run backend tests
docker compose exec backend pytest tests/ -v

# Frontend type check (rollup native binding requires the container)
docker compose exec frontend npm run check

# Generate a new migration after ORM changes (migrations apply on next restart)
docker compose exec backend alembic revision --autogenerate -m "description"

# Smoke-test all deployed models sequentially
python3 scripts/test_all_models.py

Service URLs

| Service | URL |
|---|---|
| Frontend | http://localhost:5200 |
| Backend API | http://localhost:8080/api |
| Swagger UI | http://localhost:8080/docs |
| ReDoc | http://localhost:8080/redoc |
| PostgreSQL | localhost:5432 (127.0.0.1 only) |
| Redis | localhost:6379 (127.0.0.1 only) |

Tech Stack

| Layer | Technology |
|---|---|
| Frontend | SvelteKit 2, Svelte 5 runes, TypeScript, Tailwind CSS, WaveSurfer.js |
| Backend | FastAPI, SQLAlchemy 2.0, Alembic, Pydantic v2 |
| Queue | Celery 5 + Redis (concurrency=1 per worker for GPU serialisation) |
| Database | PostgreSQL |
| GPU | NVIDIA CUDA 12.8, PyTorch 2.10+cu128 |

Offline / Air-Gapped Installation

To install OpenSpeakers on a machine without internet access:

# ── On the SOURCE machine (with internet) ───────────────────────────────────

# 1. Download all model weights
./scripts/download-models.sh

# 2. Build images and bundle everything into a transferable package
./scripts/package-offline.sh

# 3. Transfer to the target machine
rsync -avz --progress dist/openspeakers-offline-YYYYMMDD/ user@target:/opt/openspeakers/

# ── On the TARGET machine (no internet required) ─────────────────────────────

cd /opt/openspeakers
./install.sh        # loads images, creates .env, runs docker compose up -d

install.sh will:

  1. Check Docker + NVIDIA runtime prerequisites
  2. Load all Docker images from images/*.tar.gz
  3. Create .env from the example template
  4. Start all services with docker compose up -d
  5. Wait for backend readiness (migrations run automatically on startup)

Model Downloads

To download individual models or refresh the local cache:

# Download all 11 models (~120 GB total)
./scripts/download-models.sh

# Download specific models
./scripts/download-models.sh --models kokoro,f5-tts,chatterbox

# Download to a custom cache directory
./scripts/download-models.sh --cache-dir /mnt/nas/model_cache

# For gated models (Orpheus 3B requires HF account acceptance)
HF_TOKEN=your_token ./scripts/download-models.sh --models orpheus-3b

Available model IDs: kokoro, vibevoice, vibevoice-1.5b, fish-speech-s2, qwen3-tts, f5-tts, f5-tts-vocos, chatterbox, cosyvoice-2, parler-tts, orpheus-3b, dia-1b


Contributing

  1. Fork the repository and create a feature branch
  2. Follow the existing code style: ruff for Python, Prettier + ESLint for TypeScript
  3. Add a model: see the "Adding a New Model" section in CLAUDE.md for the step-by-step guide
  4. Run pre-commit hooks before pushing: pre-commit run --all-files
  5. Use conventional commit messages: feat(models): add Parler TTS support
  6. Open a pull request against main

See CLAUDE.md for full developer architecture notes and docs/PLAN.md for the feature roadmap and completion status.