OpenSpeakers is a unified TTS and voice cloning application supporting 11 open-source models with GPU hot-swap, async job queuing, real-time streaming, and a SvelteKit UI.
See docs/PLAN.md for the full implementation plan and docs/MARKET_RESEARCH.md for
competitor analysis.
- Frontend: SvelteKit 2 + Svelte 5 runes + TypeScript + Tailwind CSS (port 5200)
- Backend: FastAPI + SQLAlchemy 2.0 + Alembic (port 8080)
- Queue: Celery + Redis (concurrency=1 per worker for GPU serialization)
- Database: PostgreSQL (job history, voice profiles, batch tracking)
- Models: Hot-swapped on GPU via the `ModelManager` singleton; Kokoro stays in standby
`backend/app/models/manager.py` — `ModelManager` is a singleton that:
- Tracks `current_model_id` (which model is in GPU VRAM)
- On `load_model(id)`: unloads the current model + `torch.cuda.empty_cache()`, then loads the new one
- Serializes GPU access via a `threading.Lock` (Celery worker concurrency=1)
- Runs an idle timer (60 s) that auto-unloads non-standby models between tasks
- Keeps `standby: true` models (Kokoro) loaded permanently
- Only Celery workers load ML models; the FastAPI backend never touches the GPU
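A minimal sketch of that hot-swap flow, assuming the method names and locking pattern above; registry contents, timer bookkeeping, and the standby check are illustrative, not the actual implementation:

```python
import threading

import torch


class ModelManager:
    """Singleton sketch: one resident model, swapped under a lock."""

    _instance = None

    @classmethod
    def instance(cls) -> "ModelManager":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self) -> None:
        self._lock = threading.Lock()   # serializes all GPU access
        self._models: dict = {}         # model_id -> TTSModelBase instance
        self.current_model_id = None    # which model occupies VRAM
        self._idle_timer = None

    def load_model(self, model_id: str):
        with self._lock:
            if self.current_model_id == model_id:
                return self._models[model_id]
            # Evict whatever is resident before loading the new model.
            if self.current_model_id is not None:
                self._models[self.current_model_id].unload()
                torch.cuda.empty_cache()
            model = self._models[model_id]
            model.load(device="cuda")
            self.current_model_id = model_id
            self._reset_idle_timer()
            return model

    def _reset_idle_timer(self) -> None:
        # After 60 s with no new work, evict non-standby models.
        if self._idle_timer is not None:
            self._idle_timer.cancel()
        self._idle_timer = threading.Timer(60.0, self._unload_idle)
        self._idle_timer.daemon = True
        self._idle_timer.start()

    def _unload_idle(self) -> None:
        with self._lock:
            mid = self.current_model_id
            if mid and not getattr(self._models[mid], "standby", False):
                self._models[mid].unload()
                torch.cuda.empty_cache()
                self.current_model_id = None
```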
Each model group runs in its own container on a dedicated Celery queue:
| Container | Queue | Models | Dockerfile |
|---|---|---|---|
| `worker-kokoro` | `tts.kokoro` | Kokoro 82M (standby — always loaded) | `Dockerfile.worker` |
| `worker` | `tts` | VibeVoice 0.5B, VibeVoice 1.5B | `Dockerfile.worker` |
| `worker-fish` | `tts.fish-speech` | Fish Audio S2-Pro | `Dockerfile.worker-fish` |
| `worker-qwen3` | `tts.qwen3` | Qwen3 TTS | `Dockerfile.worker-qwen3` |
| `worker-orpheus` | `tts.orpheus` | Orpheus 3B | `Dockerfile.worker-orpheus` |
| `worker-dia` | `tts.dia` | Dia 1.6B | `Dockerfile.worker-dia` |
| `worker-f5` | `tts.f5-tts` | F5-TTS, Chatterbox, CosyVoice 2.0, Parler TTS Mini | `Dockerfile.worker-f5` |
The single source of truth for queue routing is `QUEUE_MAP` in `backend/app/api/endpoints/tts.py`.
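A sketch of the mapping's shape; the queue names match the worker table above, but the model ID keys and the dispatch comment are illustrative, not copied from the real file:

```python
# Illustrative shape of QUEUE_MAP (model_id -> Celery queue name).
QUEUE_MAP = {
    "kokoro": "tts.kokoro",
    "vibevoice-0.5b": "tts",
    "vibevoice-1.5b": "tts",
    "fish-s2-pro": "tts.fish-speech",
    "qwen3-tts": "tts.qwen3",
    "orpheus-3b": "tts.orpheus",
    "dia-1.6b": "tts.dia",
    "f5-tts": "tts.f5-tts",
}

# On job submission, the endpoint routes the task to the queue owned by
# the container hosting the requested model, conceptually:
#   generate_task.apply_async(args=[job_id], queue=QUEUE_MAP[model_id])
```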
All secondary workers inherit from backend/Dockerfile.base-gpu which provides:
- PyTorch 2.10.0+cu128 and torchaudio
- NVIDIA env vars baked in (`NVIDIA_VISIBLE_DEVICES=all`)
- Common audio/ML packages (soundfile, numpy, scipy, librosa, accelerate)
Note: flash-attn is NOT in the base image — it requires nvcc at build time which is absent from python:3.12-slim. The main worker uses a separate base with flash-attn pre-built.
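Since only the main worker image ships flash-attn, a worker can pick its attention backend at import time. A minimal sketch, assuming the transformers-style `attn_implementation` value strings; whether each model accepts them is model-specific:

```python
# Choose an attention implementation based on what the image provides.
try:
    import flash_attn  # noqa: F401  (present only in the main worker image)
    ATTN_IMPL = "flash_attention_2"
except ImportError:
    ATTN_IMPL = "sdpa"  # PyTorch scaled-dot-product attention fallback
```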
All TTS models implement `TTSModelBase` (`backend/app/models/base.py`):

```python
class TTSModelBase:
    model_id: str
    model_name: str
    description: str
    supports_voice_cloning: bool = False
    supports_streaming: bool = False
    supports_speed: bool = False    # show speed slider in UI
    supports_pitch: bool = False    # show pitch slider in UI
    vram_gb_estimate: float = 0.0

    def load(self, device: str = "cuda") -> None: ...
    def unload(self) -> None: ...
    def generate(self, request: GenerateRequest) -> GenerateResult: ...
    def stream_generate(self, request: GenerateRequest) -> Iterator[bytes]: ...
    def clone_voice(self, audio_path: str, name: str) -> dict: ...
```

To add a new model:
- Create `backend/app/models/<name>.py` implementing `TTSModelBase`
- Register it in `ModelManager._register_defaults()` in `manager.py`
- Add a config entry to `configs/models.yaml`
- If it needs a dedicated worker: add `Dockerfile.worker-<name>` and a new service in `docker-compose.yml` with the appropriate queue name
A minimal implementation:

```python
# backend/app/models/my_model.py
import torch

from app.models.base import TTSModelBase, GenerateRequest, GenerateResult


class MyModel(TTSModelBase):
    model_id = "my-model"
    model_name = "My TTS Model"
    description = "..."
    supports_voice_cloning = False
    supports_streaming = False
    supports_speed = False

    def load(self, device: str = "cuda") -> None:
        self._model = ...  # load weights
        self._loaded = True

    def unload(self) -> None:
        self._model = None
        self._loaded = False
        torch.cuda.empty_cache()

    def generate(self, request: GenerateRequest) -> GenerateResult:
        audio_bytes = ...
        return GenerateResult(audio_bytes=audio_bytes, sample_rate=24000,
                              duration_seconds=..., format="wav")
```

TTS job endpoints:
- `POST /generate` — submit job
- `GET /jobs/{id}` — poll status
- `GET /jobs/{id}/audio` — stream audio
- `GET /jobs` — list with pagination + filters (`page`, `page_size`, `status`, `model_id`, `search`)
- `DELETE /jobs/{id}` — cancel (revokes the Celery task via `celery_app.control.revoke`)
- `POST /batch` — submit up to 100 lines; returns `batch_id` + `job_ids[]`
- `GET /batches/{id}` — aggregate batch status
- `GET /batches/{id}/zip` — stream a ZIP of all complete audio files
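A hedged end-to-end sketch of the submit/poll/download flow using `requests`; the `/api` prefix, request body fields, and response keys are assumptions, not a documented contract:

```python
import time

import requests

BASE = "http://localhost:8080/api"  # assumed prefix

# Submit a job, then poll until it reaches a terminal status.
job = requests.post(f"{BASE}/generate",
                    json={"text": "Hello world", "model_id": "kokoro"}).json()
while True:
    status = requests.get(f"{BASE}/jobs/{job['id']}").json()
    if status["status"] in ("complete", "failed", "cancelled"):
        break
    time.sleep(1)

if status["status"] == "complete":
    audio = requests.get(f"{BASE}/jobs/{job['id']}/audio")
    with open("out.wav", "wb") as f:
        f.write(audio.content)
```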
Voice profile endpoints:
- `GET /` — list all voice profiles
- `POST /` — create (multipart upload of reference audio)
- `GET /{id}` — get a single profile
- `PATCH /{id}` — update name, description, tags
- `GET /{id}/audio` — stream the reference audio file
- `DELETE /{id}` — delete profile + audio file
- `GET /builtin/{model_id}` — list preset voices (e.g. Kokoro's 50+ voices)
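A sketch of creating a profile via multipart upload; the `/api/voices` path and form field names are assumptions based on the endpoint list above:

```python
import requests

# Upload a reference clip as multipart form data.
with open("reference.wav", "rb") as f:
    r = requests.post(
        "http://localhost:8080/api/voices",
        data={"name": "My Voice", "description": "Studio recording"},
        files={"audio": ("reference.wav", f, "audio/wav")},
    )
print(r.json())
```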
Model endpoints:
- `GET /` — all models with capabilities (`supports_speed`, `supports_pitch`, etc.)
- `GET /{id}` — single model info
System endpoints:
- `GET /health` — health check
- `GET /gpu` — GPU stats snapshot
OpenAI-compatible endpoints:
- `POST /audio/speech` — OpenAI-compatible; maps `tts-1` → Kokoro, `tts-1-hd` → Orpheus 3B
- `GET /models` — OpenAI-format model list
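A sketch pointing the official `openai` client at OpenSpeakers; the `/v1` base path and whether OpenAI voice names are accepted are assumptions:

```python
from openai import OpenAI

# Point the official client at OpenSpeakers instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

resp = client.audio.speech.create(
    model="tts-1",   # mapped to Kokoro per the docs above
    voice="alloy",   # assumed; voice handling is not documented here
    input="Hello from OpenSpeakers",
)
resp.write_to_file("speech.wav")
```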
WebSocket endpoints:
- `/ws/jobs/{id}` — events: `queued`, `loading`, `generating`, `audio_chunk`, `complete`, `failed`
- `/ws/gpu` — GPU stats stream (1 s interval)
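A hedged listener sketch using the `websockets` package; the JSON event shape is an assumption, and binary frames are skipped in case `audio_chunk` payloads arrive as raw bytes:

```python
import asyncio
import json

import websockets


async def watch(job_id: str) -> None:
    uri = f"ws://localhost:8080/ws/jobs/{job_id}"
    async with websockets.connect(uri) as ws:
        async for message in ws:
            if isinstance(message, bytes):  # e.g. raw audio_chunk data
                continue
            event = json.loads(message)
            print(event)
            if event.get("event") in ("complete", "failed"):
                break


asyncio.run(watch("some-job-id"))
```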
```bash
# Start all services (COMPOSE_FILE in .env auto-loads gpu+override)
docker compose up -d
# Start lightweight (no GPU workers)
docker compose up postgres redis backend frontend
# Run backend tests
docker compose exec backend pytest tests/ -v
# Generate new migration (migrations apply automatically on backend startup)
docker compose exec backend alembic revision --autogenerate -m "description"
# Frontend type check (rollup native binding requires container)
docker compose exec frontend npm run check
# Rebuild one worker
docker compose up -d --build worker-orpheus
# Tail worker logs
docker compose logs -f worker-orpheus
# Access backend shell
docker compose exec backend bash
# Smoke test all models
python3 scripts/test_all_models.py
```

Key files:

| Path | Purpose |
|---|---|
| `backend/app/models/manager.py` | `ModelManager` singleton (hot-swap + idle timer) |
| `backend/app/models/base.py` | `TTSModelBase` abstract class |
| `backend/app/models/kokoro.py` | Kokoro 82M (standby model) |
| `backend/app/models/vibevoice.py` | VibeVoice 0.5B with streaming |
| `backend/app/models/vibevoice_1p5b.py` | VibeVoice 1.5B (zero-shot cloning) |
| `backend/app/models/fish_speech.py` | Fish Audio S2-Pro |
| `backend/app/models/qwen3_tts.py` | Qwen3 TTS 1.7B |
| `backend/app/models/orpheus.py` | Orpheus 3B (vLLM backend) |
| `backend/app/models/dia_tts.py` | Dia 1.6B dialogue model |
| `backend/app/tasks/tts_tasks.py` | Celery tasks (generation + streaming) |
| `backend/app/api/endpoints/tts.py` | TTS routes + `QUEUE_MAP` |
| `backend/app/api/endpoints/openai_compat.py` | OpenAI `/v1/audio/speech` |
| `backend/app/db/models.py` | SQLAlchemy ORM (TTSJob, VoiceProfile) |
| `backend/alembic/versions/` | DB migration files |
| `configs/models.yaml` | Model registry (enable/disable/configure) |
| `frontend/src/routes/tts/+page.svelte` | Main TTS page |
| `frontend/src/routes/batch/+page.svelte` | Batch generation page |
| `frontend/src/routes/history/+page.svelte` | Job history page |
| `frontend/src/components/ModelParams.svelte` | Per-model parameter controls |
| `frontend/src/components/ToastContainer.svelte` | Toast notification system |
| `frontend/src/lib/stores/toasts.ts` | Toast store (addToast, removeToast) |
| Variable | Default | Description |
|---|---|---|
| `GPU_DEVICE_ID` | `0` | CUDA device index for all workers |
| `MODEL_CACHE_DIR` | `./model_cache` | HuggingFace cache root (mounted as volume) |
| `AUDIO_OUTPUT_DIR` | `./audio_output` | Generated audio storage |
| `DATABASE_URL` | auto | PostgreSQL connection string |
| `CELERY_BROKER_URL` | auto | Redis URL |
| `HF_TOKEN` | — | Required for gated models (Orpheus 3B) |
| `BACKEND_PORT` | `8080` | Exposed API port |
| `FRONTEND_PORT` | `5200` | Exposed UI port |
| Service | URL |
|---|---|
| Frontend | http://localhost:5200 |
| Backend API | http://localhost:8080/api |
| API Docs (Swagger) | http://localhost:8080/docs |
| ReDoc | http://localhost:8080/redoc |
| PostgreSQL | localhost:5432 (127.0.0.1 only) |
| Redis | localhost:6379 (127.0.0.1 only) |
TTSJob columns of note:
- `celery_task_id` — set at task start; used by the cancel endpoint to revoke
- `batch_id` — UUID grouping jobs created by a single batch request
- `status` — enum: `pending`, `running`, `complete`, `failed`, `cancelled`
VoiceProfile columns of note:
- `description` — optional free-text description
- `tags` — JSON array of string tags
- `reference_audio_path` — path to the uploaded reference audio file
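An illustrative SQLAlchemy 2.0 sketch of these two models; only the columns named above are grounded in the docs, while column types, table names, and defaults are assumptions:

```python
import uuid

from sqlalchemy import JSON, Enum
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class TTSJob(Base):
    __tablename__ = "tts_jobs"  # assumed name

    id: Mapped[str] = mapped_column(primary_key=True,
                                    default=lambda: str(uuid.uuid4()))
    celery_task_id: Mapped[str | None]  # set at task start; used to revoke
    batch_id: Mapped[str | None]        # groups jobs from one batch request
    status: Mapped[str] = mapped_column(
        Enum("pending", "running", "complete", "failed", "cancelled",
             name="job_status"),
        default="pending",
    )


class VoiceProfile(Base):
    __tablename__ = "voice_profiles"  # assumed name

    id: Mapped[str] = mapped_column(primary_key=True,
                                    default=lambda: str(uuid.uuid4()))
    name: Mapped[str]
    description: Mapped[str | None]            # optional free-text description
    tags: Mapped[list | None] = mapped_column(JSON)  # JSON array of string tags
    reference_audio_path: Mapped[str]          # uploaded reference audio file
```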
- Qwen3 streaming: `non_streaming_mode=False` exists, but the docs say it only "simulates" streaming of the text input; it is not true PCM streaming. Currently forced to `non_streaming_mode=True`.
- flash-attn in the base image: requires nvcc at build time, which python:3.12-slim lacks. The main worker uses a separate base image with flash-attn pre-built; secondary workers fall back to sdpa.
non_streaming_mode=Falseexists but docs say it only "simulates" streaming text input — not true PCM streaming. Currently forced tonon_streaming_mode=True. - flash-attn in base image: requires nvcc at build time (not in python:3.12-slim). Main worker uses a separate base image with flash-attn pre-built. Secondary workers use sdpa fallback.