Releases: attevon-llc/OpenSpeakers

v0.1.1 — production install hardening

18 Apr 11:48

Production install hardening

End-to-end validated on a fresh install: 11/11 TTS models pass from an empty model cache, downloading all weights from HuggingFace Hub on first run.

Highlights

  • 11/11 models validated end-to-end on a clean machine with an empty HuggingFace cache.
  • Hardened setup-openspeakers.sh: network reachability checks (github.com, hub.docker.com, huggingface.co), 3-retry download loop for every file, docker compose config validation before up, 120 s backend health poll, OPENSPEAKERS_UNATTENDED=1 env var for CI / scripted installs, OPENSPEAKERS_BRANCH override for testing pre-release branches.
  • Fish Speech upgraded to fishaudio/s2-pro — the installed fish-speech library v2.0.0 expects this DAC architecture; the older fish-speech-1.5 checkpoint was incompatible.
  • All workers forward HF_TOKEN from .env, unblocking gated-model downloads (Orpheus 3B).
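
The retry loop and the health poll described above can be sketched as follows. This is a Python illustration of the logic only, not the code in setup-openspeakers.sh (which is a shell script); the function names, the delay between attempts, and the poll interval are assumptions. Only the 3 attempts and the 120 s limit come from the notes above.

```python
import time
from typing import Callable


def retry(action: Callable[[], bool], attempts: int = 3, delay_s: float = 2.0,
          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `action` up to `attempts` times; return True on the first success."""
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            sleep(delay_s)  # pause before the next attempt (delay is assumed)
    return False


def wait_until_healthy(check: Callable[[], bool], timeout_s: float = 120.0,
                       interval_s: float = 2.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `check` until it returns True or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In an installer, `action` would wrap a single file download and `check` a request to the backend's GET /health endpoint. `clock` and `sleep` are injectable so the timing can be exercised without waiting.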

Fixed

  • Frontend port mapping (5200:80, not 5200:3000)
  • Workers could not connect to Redis/Celery (missing broker URL env vars)
  • tts.kokoro queue not registered in Celery app
  • Missing HF_HOME/HOME in secondary workers
  • VibeVoice 0.5B voice files at wrong path (/app/demo/voices/..., not /opt/vibevoice/...)
  • Fish Speech crashed on first-run download (now uses snapshot_download())
  • Fish Speech decoder filename varies by model version (now resolved by a candidate search)
  • Qwen3 TTS blocked first-run download (local_files_only changed from True to False)
  • Orpheus 3B 401 — workers now forward HF_TOKEN from .env
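
A candidate search like the one mentioned for the Fish Speech decoder could look roughly like this. It is a sketch under stated assumptions: the filenames in `DECODER_CANDIDATES` and the helper name are hypothetical, and the actual names the worker tries are not given in these notes.

```python
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical filenames, for illustration only; the real list is not
# stated in the release notes.
DECODER_CANDIDATES = ("codec.pth", "decoder.pth")


def find_first_existing(model_dir: Path, candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate file that exists in `model_dir`, else None."""
    for name in candidates:
        path = model_dir / name
        if path.is_file():
            return path
    return None
```

Ordering the candidates from newest to oldest naming scheme means the preferred file wins if a cache directory happens to contain more than one.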

Added

  • HF_TOKEN documented in .env.example with pointer to the Orpheus model license
  • scripts/fix-model-permissions.sh helper for model_cache/ ownership

See CHANGELOG.md for the full list.

Install

curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenSpeakers/main/setup-openspeakers.sh | bash

Docker images

All tagged v0.1.1 and latest on Docker Hub under davidamacey/openspeakers-*:

  • openspeakers-backend, openspeakers-frontend
  • openspeakers-worker, openspeakers-worker-kokoro, openspeakers-worker-fish, openspeakers-worker-qwen3, openspeakers-worker-orpheus, openspeakers-worker-dia, openspeakers-worker-f5

v0.1.0 — Initial Release: 11 TTS Models, GPU Hot-Swap, Voice Cloning

12 Apr 22:39

Changelog

All notable changes to OpenSpeakers are documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

0.1.0 - 2026-04-12

Overview

Initial public release of OpenSpeakers — a unified TTS and voice cloning application
supporting 11 open-source models with GPU hot-swap, async job queuing, real-time
streaming, and a SvelteKit UI.

Added

11 TTS Models

  • Kokoro 82M — Fastest model (~1s), 50+ preset voices, standby mode
  • VibeVoice 0.5B — Real-time streaming TTS, 12 built-in voices, 10 languages
  • VibeVoice 1.5B — High-quality long-form, multi-speaker dialogue, zero-shot cloning
  • Fish Audio S2-Pro — Multilingual (80+ languages), emotion tags, voice cloning
  • Qwen3 TTS 1.7B — Expressive multilingual with instruct mode and voice cloning
  • Orpheus 3B — Emotional speech with laugh/sigh/gasp tags, vLLM backend
  • F5-TTS — Fast flow-matching (15x realtime), MIT license, reference-audio cloning
  • Chatterbox — Expressive TTS with exaggeration/CFG controls, voice cloning
  • CosyVoice 2.0 — Ultra-low latency (150ms), voice design via text description
  • Parler TTS Mini — Generate any voice from a text description, no reference audio
  • Dia 1.6B — Multi-speaker dialogue with [S1]/[S2] tags and nonverbal sounds

GPU Hot-Swap Architecture

  • ModelManager singleton with threading.Lock for GPU serialization
  • Automatic model unloading between tasks (gc.collect + torch.cuda.empty_cache)
  • 60-second idle timer auto-unloads non-standby models
  • Kokoro standby mode — stays loaded permanently for instant responses
  • Ollama-style keep_alive TTL per model (-1 = indefinite, 0 = clear, N = seconds)
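
The hot-swap behaviour listed above can be illustrated with a minimal, torch-free sketch. It assumes one resident model at a time and a `keep_alive` value applied after each task; the method names (`run`, `evict_if_idle`) and the callback signatures are hypothetical and are not taken from the project's source.

```python
import gc
import threading
import time
from typing import Any, Callable, Optional


class ModelManager:
    """Process-wide singleton that keeps at most one model resident."""

    _instance: Optional["ModelManager"] = None
    _instance_lock = threading.Lock()

    def __new__(cls) -> "ModelManager":
        with cls._instance_lock:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst._gpu_lock = threading.Lock()  # serializes all GPU work
                inst._model = None
                inst._loaded_id = None
                inst._expires_at = None  # None means "no expiry"
                cls._instance = inst
            return cls._instance

    @property
    def loaded_id(self) -> Optional[str]:
        return self._loaded_id

    def run(self, model_id: str, loader: Callable[[], Any],
            task: Callable[[Any], Any], keep_alive: float = 60.0,
            clock: Callable[[], float] = time.monotonic) -> Any:
        """Load `model_id` if needed, run `task` on it, then apply keep_alive."""
        with self._gpu_lock:
            if self._loaded_id != model_id:
                self._unload_locked()  # swap out whatever was resident
                self._model = loader()
                self._loaded_id = model_id
            result = task(self._model)
            if keep_alive == 0:      # 0 = clear immediately
                self._unload_locked()
            elif keep_alive < 0:     # -1 = stay loaded indefinitely
                self._expires_at = None
            else:                    # N = unload after N idle seconds
                self._expires_at = clock() + keep_alive
            return result

    def evict_if_idle(self, clock: Callable[[], float] = time.monotonic) -> bool:
        """Unload the resident model if its TTL has passed; True if unloaded."""
        with self._gpu_lock:
            if self._loaded_id is None or self._expires_at is None:
                return False
            if clock() < self._expires_at:
                return False
            self._unload_locked()
            return True

    def _unload_locked(self) -> None:
        # Caller must hold _gpu_lock.
        self._model = None
        self._loaded_id = None
        self._expires_at = None
        gc.collect()
        # A GPU worker would also call torch.cuda.empty_cache() here.
```

In this sketch a standby model such as Kokoro would simply be run with `keep_alive=-1`, and a background timer thread would call `evict_if_idle()` periodically to implement the idle unload.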

Worker Architecture

  • 7 dedicated Celery worker containers with model-specific queues
  • QUEUE_MAP routing — single source of truth for model-to-queue mapping
  • Shared GPU base image (torch 2.10+cu128) for all secondary workers
  • nvidia runtime on all containers for reliable GPU access
  • Startup validation warns about unrouted or stale QUEUE_MAP entries
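
A sketch of QUEUE_MAP routing with a startup check, assuming "unrouted" means a registered model with no map entry and "stale" means a map entry for a model that is no longer registered. The map contents, the function names, and the extra check for queues without a consumer are illustrative; only the tts.kokoro queue name appears in these notes.

```python
from typing import Dict, Iterable, List

# Hypothetical entries; only "tts.kokoro" is a queue name from the notes.
QUEUE_MAP: Dict[str, str] = {
    "kokoro": "tts.kokoro",
    "orpheus": "tts.orpheus",
}


def queue_for(model_id: str) -> str:
    """Look up the Celery queue for a model; KeyError if it is unrouted."""
    return QUEUE_MAP[model_id]


def validate_queue_map(queue_map: Dict[str, str],
                       registered_models: Iterable[str],
                       worker_queues: Iterable[str]) -> List[str]:
    """Return human-readable warnings; an empty list means the map is clean."""
    registered = set(registered_models)
    queues = set(worker_queues)
    warnings: List[str] = []
    for model_id in sorted(registered - set(queue_map)):
        warnings.append(f"unrouted model: {model_id}")
    for model_id, queue in sorted(queue_map.items()):
        if model_id not in registered:
            warnings.append(f"stale QUEUE_MAP entry: {model_id}")
        elif queue not in queues:
            warnings.append(f"no worker consumes queue: {queue}")
    return warnings
```

Returning warnings rather than raising matches the "warns" wording above: the application can log them at startup and keep running.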

API Endpoints

  • POST /api/tts/generate — Submit TTS job (async, returns job_id)
  • GET /api/tts/jobs/{id} — Poll job status
  • GET /api/tts/jobs/{id}/audio — Download generated audio
  • GET /api/tts/jobs — List jobs with pagination, filtering, search
  • DELETE /api/tts/jobs/{id} — Cancel pending/running job (revokes Celery task)
  • POST /api/tts/batch — Submit up to 100 lines as a batch
  • GET /api/tts/batches/{id} — Batch status with aggregate counts
  • GET /api/tts/batches/{id}/zip — Download all completed audio as ZIP
  • POST /api/voices — Upload reference audio for voice cloning
  • GET /api/voices — List voice profiles
  • GET /api/voices/builtin/{model_id} — List preset voices per model
  • PATCH /api/voices/{id} — Update voice profile name/description/tags
  • DELETE /api/voices/{id} — Delete voice profile and files
  • GET /api/models — List all models with capabilities and status
  • POST /api/models/{id}/load — Pre-warm model into GPU VRAM
  • DELETE /api/models/{id}/load — Force-unload model
  • POST /v1/audio/speech — OpenAI-compatible endpoint (tts-1 → Kokoro, tts-1-hd → Orpheus)
  • GET /health — Docker health check
  • GET /api/system/info — GPU stats, disk usage, registered models
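
The async submit-then-poll flow of the first two endpoints might be driven by a client like the one below. This is a sketch, not a documented client: the base URL, the request fields (`text`, `model_id`), the `status` response field, and the terminal status values are all assumptions. Only the endpoint paths and the returned `job_id` come from the list above.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Optional

BASE_URL = "http://localhost:8080"  # assumption: backend address is not stated
TERMINAL_STATES = {"complete", "failed", "cancelled"}  # assumed status values


def http_json(method: str, path: str,
              payload: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Send a JSON request to the backend and decode the JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def generate_and_wait(text: str, model_id: str,
                      call: Callable[..., Dict[str, Any]] = http_json,
                      poll_s: float = 1.0, timeout_s: float = 300.0,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Submit a TTS job, then poll its status until it reaches a terminal state."""
    # Request field names are assumed, not taken from the API documentation.
    job = call("POST", "/api/tts/generate", {"text": text, "model_id": model_id})
    job_id = job["job_id"]
    deadline = clock() + timeout_s
    while clock() < deadline:
        status = call("GET", f"/api/tts/jobs/{job_id}")
        if status.get("status") in TERMINAL_STATES:
            return status
        sleep(poll_s)
    raise TimeoutError(f"job {job_id} did not finish within {timeout_s} s")
```

Once the job reports completion, the audio can be fetched from GET /api/tts/jobs/{id}/audio. For live progress, the /ws/jobs/{id} WebSocket avoids polling altogether.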

WebSocket Endpoints

  • /ws/jobs/{id} — Real-time job progress (queued, loading, generating, audio_chunk, complete)
  • /ws/gpu — Live GPU stats stream (1s interval)

Frontend (SvelteKit 2 + Svelte 5)

  • TTS Page — Model selector with help text, voice picker, speed/pitch/language controls
  • Dialogue Editor — Structured multi-speaker turn editor for Dia and VibeVoice 1.5B
  • Batch Page — Dynamic add/remove text entries, per-job progress, ZIP download
  • Compare Page — Side-by-side generation across up to 4 models
  • Clone Page — Upload reference audio, manage voice profiles
  • History Page — Full job history with search, filter, pagination, audio playback
  • Models Page — Model catalog with help text, capability badges, VRAM bars, filters
  • Settings Page — Live GPU stats via WebSocket, storage paths, system info
  • About Page — Model descriptions and HuggingFace links
  • Dark mode default with theme toggle
  • Mobile responsive sidebar
  • Real-time streaming audio playback (Web Audio API) for VibeVoice 0.5B
  • Per-model parameter panels with emotion tag quick-insert
  • Keyboard shortcuts modal (press ?)
  • Toast notification system

Infrastructure

  • PostgreSQL for job history, voice profiles, batch tracking
  • Redis for Celery broker and WebSocket pub/sub
  • Alembic migrations (auto-run on backend startup)
  • pynvml GPU stats in API container (no torch dependency)
  • Path traversal guard on batch ZIP downloads
  • Extension whitelist on voice profile uploads
  • CORS configuration via environment variable
  • Pre-commit hooks: ruff, bandit, shellcheck, conventional commits
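
The path traversal guard and the upload extension whitelist could each be as small as the following. This is a sketch of the general technique, not the project's implementation; the helper names and the extensions in `ALLOWED_AUDIO_EXTENSIONS` are hypothetical.

```python
from pathlib import Path

# Hypothetical whitelist; the project's actual list is not given in the notes.
ALLOWED_AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}


def safe_join(base_dir: Path, untrusted: str) -> Path:
    """Resolve `untrusted` under `base_dir`, rejecting anything that escapes it."""
    base = base_dir.resolve()
    candidate = (base / untrusted).resolve()  # collapses ".." and symlinks
    if base not in candidate.parents:
        raise ValueError(f"path escapes base directory: {untrusted!r}")
    return candidate


def has_allowed_extension(filename: str) -> bool:
    """Check only the final suffix, case-insensitively."""
    return Path(filename).suffix.lower() in ALLOWED_AUDIO_EXTENSIONS
```

Resolving both paths before comparing them is what catches inputs such as `a/../../x` or an absolute path, which a plain string prefix check would miss.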

Testing

  • 18 fast API smoke tests
  • Kokoro end-to-end generation test
  • Full-matrix parametrized test for all 11 models (TEST_ALL_MODELS=1)