___ ____ _
/ _ \ _ __ ___ _ __ / ___| _ __ ___ __ _| | _____ _ __ ___
| | | | '_ \ / _ \ '_ \\___ \| '_ \/ _ \/ _` | |/ / _ \ '__/ __|
| |_| | |_) | __/ | | |___) | |_) | __/ (_| | < __/ | \__ \
\___/| .__/ \___|_| |_|____/| .__/ \___|\__,_|_|\_\___|_| |___/
|_| |_|
A unified, self-hosted TTS and voice cloning application supporting 11 open-source models with GPU hot-swap, async job queuing, real-time streaming, and a modern SvelteKit UI.
- 11 working TTS models — Kokoro 82M, VibeVoice 0.5B/1.5B, Fish Audio S2-Pro, Qwen3 TTS 1.7B, Orpheus 3B, Dia 1.6B, F5-TTS, Chatterbox, CosyVoice 2.0, Parler TTS Mini
- GPU hot-swap — only one model in VRAM at a time; auto-eviction with 60-second idle timer
- Ollama-style keep_alive — per-request
-1(hold forever),0(evict now), orNseconds - Real-time audio streaming — VibeVoice 0.5B streams PCM16 chunks via Redis pub/sub → WebSocket → Web Audio API
- Output formats — WAV, MP3, OGG (ffmpeg transcoding)
- Zero-shot cloning — Fish Audio S2-Pro, VibeVoice 1.5B, and Qwen3 TTS clone from any 3–30 second reference clip
- Voice profiles — store reference audio and metadata; reuse across any generation
- 50+ built-in voices — Kokoro's preset voice library
- Emotion and style control — Fish S2-Pro
[whisper]/[excited]tags; Orpheus<laugh>/<sigh>/<gasp>tags - Dialogue mode — Dia 1.6B
[S1]/[S2]multi-speaker scripting with nonverbal sounds
- Async job queue — Celery + Redis; all generation is non-blocking with immediate
job_idresponse - Job cancellation — revoke pending or running jobs mid-flight via the API or UI
- Batch generation — submit up to 100 lines in a single request; download all audio as a ZIP
- Full job history — searchable, filterable by model / status, paginated
- OpenAI-compatible
/v1/audio/speech— drop-in replacement foropenai.audio.speech.create() - WebSocket progress — real-time
queued→loading→generating→audio_chunk→completeevents - Live GPU stats — WebSocket stream of utilisation, temperature, and power draw
- Swagger UI — interactive API docs at
/docs
- SvelteKit 2 + Svelte 5 runes — reactive, type-safe frontend
- WaveSurfer.js waveform — visual audio player with keyboard seek
- Dark mode default — toggle persists to localStorage; FOUC prevention
- Mobile responsive — sidebar collapses to hamburger on narrow screens
- Keyboard shortcuts — press
?for the help modal;Ctrl+Enterto submit; arrows to seek - Toast notifications — non-blocking success/error/warning messages
- Model browser — capability table with VRAM estimates
| Model | Container | Queue | VRAM | Cloning | Streaming | Status |
|---|---|---|---|---|---|---|
| Kokoro 82M | worker-kokoro |
tts.kokoro |
~0.5 GB | — | — | ✅ Working (standby) |
| VibeVoice 0.5B | worker |
tts |
~5 GB | — | PCM16 | ✅ Working |
| VibeVoice 1.5B | worker |
tts |
~12 GB | Zero-shot | — | ✅ Working |
| Fish Audio S2-Pro | worker-fish |
tts.fish-speech |
~22 GB | Zero-shot | Chunked | ✅ Working |
| Qwen3 TTS 1.7B | worker-qwen3 |
tts.qwen3 |
~10 GB | Zero-shot | — | ✅ Working |
| Orpheus 3B | worker-orpheus |
tts.orpheus |
~7 GB | — | — | ✅ Working |
| Dia 1.6B | worker-dia |
tts.dia |
~10 GB | Via prompt | — | ✅ Working |
| F5-TTS | worker-f5 |
tts.f5-tts |
~3 GB | Zero-shot | — | ✅ Working |
| Chatterbox | worker-f5 |
tts.f5-tts |
~5 GB | Zero-shot | — | ✅ Working |
| CosyVoice 2.0 | worker-f5 |
tts.f5-tts |
~5 GB | Zero-shot | — | ✅ Working |
| Parler TTS Mini | worker-f5 |
tts.f5-tts |
~3 GB | — | — | ✅ Working |
curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenSpeakers/main/scripts/install.sh | bashThis downloads the compose files, pulls all images from Docker Hub, and starts the app. That's it.
Requirements: Docker with Compose v2, NVIDIA GPU, NVIDIA Container Toolkit
Once running:
- UI: http://localhost:5200
- API: http://localhost:8080
- Docs: http://localhost:8080/docs
Models download automatically on first use (~1-22 GB each depending on the model).
mkdir openspeakers && cd openspeakers
curl -fsSLO https://raw.githubusercontent.com/davidamacey/OpenSpeakers/main/docker-compose.prod.yml
curl -fsSLO https://raw.githubusercontent.com/davidamacey/OpenSpeakers/main/docker-compose.gpu.yml
curl -fsSLO https://raw.githubusercontent.com/davidamacey/OpenSpeakers/main/.env.example
cp .env.example .env
docker compose -f docker-compose.prod.yml -f docker-compose.gpu.yml up -dgit clone https://github.com/davidamacey/OpenSpeakers.git
cd OpenSpeakers
cp .env.example .env
docker compose up -d --buildEdit .env to customize:
| Variable | Default | Description |
|---|---|---|
GPU_DEVICE_ID |
0 |
Which CUDA GPU to use |
FRONTEND_PORT |
5200 |
UI port |
BACKEND_PORT |
8080 |
API port |
HF_TOKEN |
— | Required for gated models (Orpheus 3B) |
MODEL_CACHE_DIR |
./model_cache |
Where model weights are stored |
AUDIO_OUTPUT_DIR |
./audio_output |
Where generated audio is saved |
- Open http://localhost:5200 — the TTS page loads automatically
- Select a model (Kokoro is fastest for a first test), enter text, click Generate
- Browse Models to see all 11 models with help text on speed, quality, and use cases
- Try Batch to generate multiple lines at once
- Try Compare to hear the same text across different models side-by-side
- Use Clone to upload a reference audio clip and create a custom voice
openspeakers.sh is a self-contained management script at the repo root:
./openspeakers.sh <command> [options]| Command | Description |
|---|---|
start [gpu|dev|offline|build] |
Start all services (default: gpu mode) |
stop |
Stop all services |
restart [service] |
Restart all or one service |
status |
Show service health |
logs [service] |
Tail logs (all services or one) |
health |
Check API health endpoint |
workers status |
Show all worker container statuses |
workers logs [name] |
Tail a specific worker's logs |
workers restart [name] |
Restart a worker container |
workers rebuild [name] |
Rebuild and restart a worker |
db migrate |
Apply all pending Alembic migrations |
db revision "message" |
Generate a new Alembic migration |
db reset |
Drop and recreate the database |
db backup |
Dump PostgreSQL to a timestamped file |
db restore <file> |
Restore a PostgreSQL dump |
build [service] |
Build Docker image(s) |
shell [service] |
Open a bash shell in a container |
test [target] |
Run backend or frontend tests |
gpu |
Show live GPU stats (nvidia-smi) |
clean |
Remove stopped containers and dangling images |
purge |
Remove all containers, volumes, and images |
| Method | Path | Description |
|---|---|---|
POST |
/api/tts/generate |
Submit a generation job; returns {job_id} |
GET |
/api/tts/jobs/{id} |
Get job status and metadata |
GET |
/api/tts/jobs/{id}/audio |
Stream the generated audio file |
GET |
/api/tts/jobs |
List jobs (page, page_size, status, model_id, search) |
DELETE |
/api/tts/jobs/{id} |
Cancel a pending or running job |
POST |
/api/tts/batch |
Submit up to 100 lines; returns {batch_id, job_ids[]} |
GET |
/api/tts/batches/{id} |
Aggregate batch status |
GET |
/api/tts/batches/{id}/zip |
Stream ZIP of all completed audio files |
{
"model_id": "kokoro",
"text": "Hello world",
"voice": "af_bella",
"speed": 1.0,
"pitch": 0,
"format": "wav",
"keep_alive": 60
}keep_alive controls how long the model stays in VRAM after this request:
-1— keep loaded indefinitely0— unload immediately after generationN(positive integer) — unload after N seconds of inactivity (default: 60)
| Method | Path | Description |
|---|---|---|
GET |
/api/voices/ |
List all voice profiles |
POST |
/api/voices/ |
Create a new profile (multipart: name, model_id, audio) |
GET |
/api/voices/{id} |
Get a single voice profile |
PATCH |
/api/voices/{id} |
Update name, description, or tags |
GET |
/api/voices/{id}/audio |
Stream the reference audio file |
DELETE |
/api/voices/{id} |
Delete profile and reference audio |
GET |
/api/voices/builtin/{model_id} |
List built-in preset voices for a model |
| Method | Path | Description |
|---|---|---|
GET |
/api/models/ |
List all models with capabilities |
GET |
/api/models/{id} |
Get single model info |
POST |
/api/models/{id}/load |
Pre-warm a model (optional keep_alive) |
DELETE |
/api/models/{id}/load |
Force-unload a model from VRAM |
| Method | Path | Description |
|---|---|---|
GET |
/health |
Liveness probe |
GET |
/api/system/info |
GPU + disk + model registry snapshot |
| Method | Path | Description |
|---|---|---|
POST |
/v1/audio/speech |
OpenAI-compatible TTS |
GET |
/v1/models |
OpenAI-format model list |
Model mapping: tts-1 → Kokoro 82M, tts-1-hd → Orpheus 3B
| Path | Events |
|---|---|
/ws/jobs/{id} |
queued, loading, generating, audio_chunk, complete, failed |
/ws/gpu |
GPU stats every 1 second |
OpenSpeakers exposes a /v1/audio/speech endpoint compatible with the OpenAI Python SDK
and any application that uses the OpenAI TTS API.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed",
)
# Uses Kokoro 82M (tts-1 maps to the fastest model)
audio = client.audio.speech.create(
model="tts-1",
voice="alloy",
input="Hello from OpenSpeakers!",
)
audio.stream_to_file("output.wav")
# Uses Orpheus 3B (tts-1-hd maps to the highest quality model)
audio = client.audio.speech.create(
model="tts-1-hd",
voice="zoe",
input="A higher quality voice.",
)
audio.stream_to_file("output_hd.wav")curl http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "tts-1", "voice": "alloy", "input": "Hello world"}' \
--output speech.wavThe keep_alive parameter on /api/tts/generate and /api/models/{id}/load controls
VRAM retention, similar to Ollama's model keep-alive:
# Keep the model loaded for 5 minutes after this request
curl -X POST http://localhost:8080/api/tts/generate \
-H "Content-Type: application/json" \
-d '{"model_id": "fish-speech", "text": "Hello", "keep_alive": 300}'
# Pre-warm a model and keep it loaded indefinitely
curl -X POST http://localhost:8080/api/models/kokoro/load \
-H "Content-Type: application/json" \
-d '{"keep_alive": -1}'
# Force-unload a model immediately
curl -X DELETE http://localhost:8080/api/models/orpheus/loadEach model group runs in its own container on a dedicated Celery queue. This isolates GPU memory, Python dependencies, and container build complexity.
| Container | Queue | Models | Dockerfile |
|---|---|---|---|
worker-kokoro |
tts.kokoro |
Kokoro 82M (standby — always loaded) | Dockerfile.worker |
worker |
tts |
VibeVoice 0.5B (streaming), VibeVoice 1.5B | Dockerfile.worker |
worker-fish |
tts.fish-speech |
Fish Audio S2-Pro | Dockerfile.worker-fish |
worker-qwen3 |
tts.qwen3 |
Qwen3 TTS 1.7B | Dockerfile.worker-qwen3 |
worker-orpheus |
tts.orpheus |
Orpheus 3B | Dockerfile.worker-orpheus |
worker-dia |
tts.dia |
Dia 1.6B | Dockerfile.worker-dia |
worker-f5 |
tts.f5-tts |
F5-TTS, Chatterbox, CosyVoice 2.0, Parler TTS Mini | Dockerfile.worker-f5 |
The FastAPI backend never touches the GPU. Only Celery workers load ML models. Queue
routing is the single source of truth in QUEUE_MAP in backend/app/api/endpoints/tts.py.
All secondary workers inherit from backend/Dockerfile.base-gpu which provides:
- PyTorch 2.10.0+cu128 and torchaudio
- NVIDIA env vars (
NVIDIA_VISIBLE_DEVICES=all,NVIDIA_DRIVER_CAPABILITIES=compute,utility) - Common audio/ML packages: soundfile, numpy, scipy, librosa, accelerate
| Page | Route | Description |
|---|---|---|
| TTS | /tts |
Main text-to-speech generation with streaming audio |
| Clone | /clone |
Upload reference audio, manage voice profiles |
| Compare | /compare |
Side-by-side multi-model comparison |
| Batch | /batch |
Bulk generation from pasted text or .txt file |
| History | /history |
Full job history with search and filter |
| Models | /models |
Model browser with capability and VRAM reference |
| Settings | /settings |
Output format, live GPU stats, storage paths |
| About | /about |
Model descriptions and project links |
open_speakers/
├── backend/
│ ├── Dockerfile.base-gpu # Shared GPU base (PyTorch 2.10+cu128)
│ ├── Dockerfile.worker # Main worker (Kokoro + VibeVoice)
│ ├── Dockerfile.worker-fish # Fish Speech worker
│ ├── Dockerfile.worker-qwen3 # Qwen3 TTS worker
│ ├── Dockerfile.worker-orpheus # Orpheus 3B worker (vLLM)
│ ├── Dockerfile.worker-dia # Dia 1.6B worker
│ ├── Dockerfile.worker-f5 # F5-TTS / Chatterbox / CosyVoice worker
│ └── app/
│ ├── api/endpoints/ # REST API routes + OpenAI compat
│ ├── models/ # TTS model implementations + ModelManager
│ ├── tasks/ # Celery tasks (generation, streaming)
│ ├── db/ # SQLAlchemy ORM models + Alembic
│ └── schemas/ # Pydantic v2 schemas
├── frontend/src/
│ ├── routes/ # SvelteKit pages: tts, clone, compare, batch, history, models, settings
│ ├── components/ # AudioPlayer, ModelParams, ToastContainer, WaveformPreview, etc.
│ └── lib/ # API clients, Svelte stores (toasts, theme)
├── configs/
│ └── models.yaml # Model registry — enable/disable models here
├── docs/
│ ├── PLAN.md # Feature roadmap and implementation status
│ └── MARKET_RESEARCH.md # Competitor analysis
├── scripts/
│ ├── test_all_models.py # Smoke-test all deployed models sequentially
│ ├── download-models.sh # Download all model weights from HuggingFace
│ └── package-offline.sh # Air-gapped install packaging
├── openspeakers.sh # Management CLI
├── docker-compose.yml # Base service definitions
├── docker-compose.override.yml # Dev build targets (auto-loaded)
├── docker-compose.gpu.yml # NVIDIA GPU passthrough overlay
└── docker-compose.offline.yml # Air-gapped / offline deployment
Copy .env.example to .env and adjust as needed:
| Variable | Default | Description |
|---|---|---|
GPU_DEVICE_ID |
0 |
CUDA device index for all workers |
MODEL_CACHE_DIR |
./model_cache |
HuggingFace model cache root (volume-mounted) |
AUDIO_OUTPUT_DIR |
./audio_output |
Generated audio file storage |
POSTGRES_PASSWORD |
openspeakers |
PostgreSQL password |
HF_TOKEN |
— | HuggingFace token (required for gated models: Orpheus 3B) |
BACKEND_PORT |
8080 |
Exposed backend API port |
FRONTEND_PORT |
5200 |
Exposed frontend port |
DATABASE_URL |
auto | Full PostgreSQL connection string (overrides individual vars) |
CELERY_BROKER_URL |
auto | Redis broker URL (overrides default) |
# Start all services (COMPOSE_FILE in .env selects gpu+override automatically)
docker compose up -d
# Or use the management CLI one-liners:
./openspeakers.sh start # start with GPU
./openspeakers.sh start dev # start core services only (no GPU workers)
./openspeakers.sh start build # build images then start (first run)
# Rebuild one worker after Dockerfile changes
docker compose up -d --build worker-orpheus
# Tail worker logs
docker compose logs -f worker-dia
# Open a shell inside a container
docker compose exec backend bash
# Run backend tests
docker compose exec backend pytest tests/ -v
# Frontend type check (rollup native binding requires the container)
docker compose exec frontend npm run check
# Generate a new migration after ORM changes (migrations apply on next restart)
docker compose exec backend alembic revision --autogenerate -m "description"
# Smoke-test all deployed models sequentially
python3 scripts/test_all_models.py| Service | URL |
|---|---|
| Frontend | http://localhost:5200 |
| Backend API | http://localhost:8080/api |
| Swagger UI | http://localhost:8080/docs |
| ReDoc | http://localhost:8080/redoc |
| PostgreSQL | localhost:5432 (127.0.0.1 only) |
| Redis | localhost:6379 (127.0.0.1 only) |
| Layer | Technology |
|---|---|
| Frontend | SvelteKit 2, Svelte 5 runes, TypeScript, Tailwind CSS, WaveSurfer.js |
| Backend | FastAPI, SQLAlchemy 2.0, Alembic, Pydantic v2 |
| Queue | Celery 5 + Redis (concurrency=1 per worker for GPU serialisation) |
| Database | PostgreSQL |
| GPU | NVIDIA CUDA 12.8, PyTorch 2.10+cu128 |
To install OpenSpeakers on a machine without internet access:
# ── On the SOURCE machine (with internet) ───────────────────────────────────
# 1. Download all model weights
./scripts/download-models.sh
# 2. Build images and bundle everything into a transferable package
./scripts/package-offline.sh
# 3. Transfer to the target machine
rsync -avz --progress dist/openspeakers-offline-YYYYMMDD/ user@target:/opt/openspeakers/
# ── On the TARGET machine (no internet required) ─────────────────────────────
cd /opt/openspeakers
./install.sh # loads images, creates .env, runs docker compose up -dinstall.sh will:
- Check Docker + NVIDIA runtime prerequisites
- Load all Docker images from
images/*.tar.gz - Create
.envfrom the example template - Start all services with
docker compose up -d - Wait for backend readiness (migrations run automatically on startup)
To download individual models or refresh the local cache:
# Download all 11 models (~120 GB total)
./scripts/download-models.sh
# Download specific models
./scripts/download-models.sh --models kokoro,f5-tts,chatterbox
# Download to a custom cache directory
./scripts/download-models.sh --cache-dir /mnt/nas/model_cache
# For gated models (Orpheus 3B requires HF account acceptance)
HF_TOKEN=your_token ./scripts/download-models.sh --models orpheus-3bAvailable model IDs: kokoro, vibevoice, vibevoice-1.5b, fish-speech-s2, qwen3-tts,
f5-tts, f5-tts-vocos, chatterbox, cosyvoice-2, parler-tts, orpheus-3b, dia-1b
- Fork the repository and create a feature branch
- Follow the existing code style: ruff for Python, Prettier + eslint for TypeScript
- Add a model: see the "Adding a New Model" section in
CLAUDE.mdfor the step-by-step guide - Run pre-commit hooks before pushing:
pre-commit run --all-files - Use conventional commit messages:
feat(models): add Parler TTS support - Open a pull request against
main
See CLAUDE.md for full developer architecture notes and docs/PLAN.md for the feature
roadmap and completion status.







