Releases: attevon-llc/OpenSpeakers

v0.1.0 — Initial Release: 11 TTS Models, GPU Hot-Swap, Voice Cloning

12 Apr 22:39

Changelog

All notable changes to OpenSpeakers are documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

0.1.0 - 2026-04-12

Overview

Initial public release of OpenSpeakers — a unified TTS and voice cloning application
supporting 11 open-source models with GPU hot-swap, async job queuing, real-time
streaming, and a SvelteKit UI.

Added

11 TTS Models

  • Kokoro 82M — Fastest model (~1s), 50+ preset voices, standby mode
  • VibeVoice 0.5B — Real-time streaming TTS, 12 built-in voices, 10 languages
  • VibeVoice 1.5B — High-quality long-form, multi-speaker dialogue, zero-shot cloning
  • Fish Audio S2-Pro — Multilingual (80+ languages), emotion tags, voice cloning
  • Qwen3 TTS 1.7B — Expressive multilingual with instruct mode and voice cloning
  • Orpheus 3B — Emotional speech with laugh/sigh/gasp tags, vLLM backend
  • F5-TTS — Fast flow-matching (15x realtime), MIT license, reference-audio cloning
  • Chatterbox — Expressive TTS with exaggeration/CFG controls, voice cloning
  • CosyVoice 2.0 — Ultra-low latency (150ms), voice design via text description
  • Parler TTS Mini — Generate any voice from a text description, no reference audio
  • Dia 1.6B — Multi-speaker dialogue with [S1]/[S2] tags and nonverbal sounds

GPU Hot-Swap Architecture

  • ModelManager singleton with threading.Lock for GPU serialization
  • Automatic model unloading between tasks (gc.collect + torch.cuda.empty_cache)
  • 60-second idle timer auto-unloads non-standby models
  • Kokoro standby mode — stays loaded permanently for instant responses
  • Ollama-style keep_alive TTL per model (-1 = keep loaded indefinitely, 0 = unload immediately, N = keep loaded for N seconds)
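
The hot-swap behavior listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual ModelManager: the method names `run` and `evict_if_idle`, the `loader`/`task` callables, and the single-model-at-a-time policy are all hypothetical. Only the singleton, the threading.Lock, the gc.collect + torch.cuda.empty_cache cleanup, and the keep_alive semantics come from the notes.

```python
import gc
import threading
import time
from typing import Any, Callable, Optional


class ModelManager:
    """Hypothetical sketch: one model on the GPU at a time, guarded by a lock."""

    _instance: Optional["ModelManager"] = None
    _instance_lock = threading.Lock()

    def __new__(cls) -> "ModelManager":
        # Singleton: every caller shares the same manager and the same GPU lock.
        with cls._instance_lock:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst._gpu_lock = threading.Lock()
                inst._loaded_id = None
                inst._model = None
                inst._expires_at = None  # None means "no expiry"
                cls._instance = inst
            return cls._instance

    def _unload(self) -> None:
        # Drop the reference, then release Python and CUDA caches.
        self._model = None
        self._loaded_id = None
        self._expires_at = None
        gc.collect()
        try:
            import torch  # optional here so the sketch also runs without torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass

    def run(self, model_id: str, loader: Callable[[], Any],
            task: Callable[[Any], Any], keep_alive: float = 60.0) -> Any:
        # The lock serializes GPU work: only one task runs at a time.
        with self._gpu_lock:
            if self._loaded_id != model_id:
                self._unload()
                self._model = loader()
                self._loaded_id = model_id
            result = task(self._model)
            if keep_alive == 0:
                self._unload()            # 0: unload right after the task
            elif keep_alive < 0:
                self._expires_at = None   # -1: stay loaded indefinitely
            else:
                self._expires_at = time.monotonic() + keep_alive
            return result

    def evict_if_idle(self, now: Optional[float] = None) -> bool:
        # Intended to be called periodically by an idle timer.
        with self._gpu_lock:
            now = time.monotonic() if now is None else now
            expired = self._expires_at is not None and now >= self._expires_at
            if self._loaded_id is not None and expired:
                self._unload()
                return True
            return False
```

In this sketch a standby model such as Kokoro would simply be run with `keep_alive=-1`, so the idle timer never evicts it.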

Worker Architecture

  • 7 dedicated Celery worker containers with model-specific queues
  • QUEUE_MAP routing — single source of truth for model-to-queue mapping
  • Shared GPU base image (torch 2.10+cu128) for all secondary workers
  • nvidia runtime on all containers for reliable GPU access
  • Startup validation warns about unrouted or stale QUEUE_MAP entries
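
As a rough illustration of the QUEUE_MAP idea and its startup check, the sketch below keeps the model-to-queue mapping in one dict and reports entries that are unrouted or stale. The model ids, queue names, and the function names `validate_queue_map` and `queue_for` are assumptions made for the example; the release notes do not list the real values.

```python
from typing import Dict, List, Set

# Hypothetical mapping: real model ids and queue names are not given in the notes.
QUEUE_MAP: Dict[str, str] = {
    "kokoro": "tts.kokoro",
    "dia": "tts.dia",
    "retired-model": "tts.legacy",
}


def validate_queue_map(queue_map: Dict[str, str],
                       registered_models: Set[str]) -> List[str]:
    """Return warnings for models without a queue and for stale map entries."""
    routed = set(queue_map)
    warnings = []
    for model_id in sorted(registered_models - routed):
        warnings.append(f"unrouted model: {model_id!r} has no queue")
    for model_id in sorted(routed - registered_models):
        warnings.append(f"stale entry: {model_id!r} is not a registered model")
    return warnings


def queue_for(model_id: str, queue_map: Dict[str, str] = QUEUE_MAP) -> str:
    """Single lookup point, so routing never gets duplicated elsewhere."""
    return queue_map[model_id]
```

A startup hook could log each returned warning; with Celery, the result of `queue_for(model_id)` would typically be passed as the `queue` argument of `apply_async`.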

API Endpoints

  • POST /api/tts/generate — Submit TTS job (async, returns job_id)
  • GET /api/tts/jobs/{id} — Poll job status
  • GET /api/tts/jobs/{id}/audio — Download generated audio
  • GET /api/tts/jobs — List jobs with pagination, filtering, search
  • DELETE /api/tts/jobs/{id} — Cancel pending/running job (revokes Celery task)
  • POST /api/tts/batch — Submit up to 100 lines as a batch
  • GET /api/tts/batches/{id} — Batch status with aggregate counts
  • GET /api/tts/batches/{id}/zip — Download all completed audio as ZIP
  • POST /api/voices — Upload reference audio for voice cloning
  • GET /api/voices — List voice profiles
  • GET /api/voices/builtin/{model_id} — List preset voices per model
  • PATCH /api/voices/{id} — Update voice profile name/description/tags
  • DELETE /api/voices/{id} — Delete voice profile and files
  • GET /api/models — List all models with capabilities and status
  • POST /api/models/{id}/load — Pre-warm model into GPU VRAM
  • DELETE /api/models/{id}/load — Force-unload model
  • POST /v1/audio/speech — OpenAI-compatible endpoint (tts-1 → Kokoro, tts-1-hd → Orpheus)
  • GET /health — Docker health check
  • GET /api/system/info — GPU stats, disk usage, registered models
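
The async submit-then-poll flow implied by the first two endpoints might look like the sketch below, using only the standard library. The base URL, the request body fields (`model_id`, `text`), the `status` field, and the terminal status names are assumptions; only the paths and HTTP methods come from the list above.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

BASE_URL = "http://localhost:8000"  # assumption: host and port are not documented here
TERMINAL_STATUSES = {"complete", "failed", "cancelled"}  # assumed status names


def fetch_json(req: urllib.request.Request) -> Dict[str, Any]:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def build_generate_request(text: str, model_id: str,
                           base_url: str = BASE_URL) -> urllib.request.Request:
    """Build POST /api/tts/generate; the body field names are hypothetical."""
    body = json.dumps({"model_id": model_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/tts/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def wait_for_job(job_id: str,
                 fetch: Callable[[urllib.request.Request], Dict[str, Any]] = fetch_json,
                 base_url: str = BASE_URL,
                 interval: float = 1.0,
                 max_polls: int = 120) -> Dict[str, Any]:
    """Poll GET /api/tts/jobs/{id} until an assumed terminal status appears."""
    req = urllib.request.Request(f"{base_url}/api/tts/jobs/{job_id}", method="GET")
    for _ in range(max_polls):
        job = fetch(req)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

A caller would send `build_generate_request(...)` through `fetch_json`, read the returned job_id, and pass it to `wait_for_job` before downloading the audio. The `fetch` parameter exists so the polling loop can be exercised without a running server.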

WebSocket Endpoints

  • /ws/jobs/{id} — Real-time job progress (queued, loading, generating, audio_chunk, complete)
  • /ws/gpu — Live GPU stats stream (1s interval)
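
A client consuming /ws/jobs/{id} needs to track the progress events named above. The sketch below shows one way to fold incoming messages into a small state object. The message shape (a JSON object with an `event` key and, for audio_chunk, a base64 `data` key) is an assumption; only the five event names come from the notes.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import List

# Event names as listed for /ws/jobs/{id}.
JOB_EVENTS = ("queued", "loading", "generating", "audio_chunk", "complete")


@dataclass
class JobProgress:
    """Accumulates job state from a stream of WebSocket text messages."""

    stage: str = "queued"
    chunks: List[bytes] = field(default_factory=list)
    done: bool = False

    def apply(self, raw_message: str) -> None:
        msg = json.loads(raw_message)
        event = msg.get("event")
        if event not in JOB_EVENTS:
            return  # ignore anything this sketch does not know about
        if event == "audio_chunk":
            # Assumed encoding: base64 audio bytes under "data".
            self.chunks.append(base64.b64decode(msg["data"]))
            return  # chunks arrive during generation; the stage is unchanged
        self.stage = event
        self.done = event == "complete"
```

Each text frame received from the socket would be passed to `apply`; a streaming player could drain `chunks` as they arrive instead of waiting for `done`.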

Frontend (SvelteKit 2 + Svelte 5)

  • TTS Page — Model selector with help text, voice picker, speed/pitch/language controls
  • Dialogue Editor — Structured multi-speaker turn editor for Dia and VibeVoice 1.5B
  • Batch Page — Dynamic add/remove text entries, per-job progress, ZIP download
  • Compare Page — Side-by-side generation across up to 4 models
  • Clone Page — Upload reference audio, manage voice profiles
  • History Page — Full job history with search, filter, pagination, audio playback
  • Models Page — Model catalog with help text, capability badges, VRAM bars, filters
  • Settings Page — Live GPU stats via WebSocket, storage paths, system info
  • About Page — Model descriptions and HuggingFace links
  • Dark mode default with theme toggle
  • Mobile responsive sidebar
  • Real-time streaming audio playback (Web Audio API) for VibeVoice 0.5B
  • Per-model parameter panels with emotion tag quick-insert
  • Keyboard shortcuts modal (press ?)
  • Toast notification system

Infrastructure

  • PostgreSQL for job history, voice profiles, batch tracking
  • Redis for Celery broker and WebSocket pub/sub
  • Alembic migrations (auto-run on backend startup)
  • pynvml GPU stats in API container (no torch dependency)
  • Path traversal guard on batch ZIP downloads
  • Extension whitelist on voice profile uploads
  • CORS configuration via environment variable
  • Pre-commit hooks: ruff, bandit, shellcheck, conventional commits
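
The two file-safety items above (the path traversal guard and the extension whitelist) follow a common pattern that can be sketched as below. The allowed extensions and the function names are assumptions for illustration; the project's actual list and implementation are not shown in these notes. `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path

# Assumed whitelist; the real set of accepted formats is not documented here.
ALLOWED_AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}


def is_allowed_upload(filename: str) -> bool:
    """Accept a file only if its (case-insensitive) extension is whitelisted."""
    return Path(filename).suffix.lower() in ALLOWED_AUDIO_EXTENSIONS


def resolve_inside(root: Path, candidate: str) -> Path:
    """Resolve candidate under root, rejecting anything that escapes root."""
    root = root.resolve()
    target = (root / candidate).resolve()
    # resolve() collapses ".." segments, and an absolute candidate replaces root
    # entirely, so both kinds of escape fail this containment check.
    if not target.is_relative_to(root):
        raise ValueError(f"path escapes storage root: {candidate!r}")
    return target
```

Checking the resolved path rather than the raw string is the key design choice: string checks for ".." are easy to bypass, while the resolved path reflects where the file would actually be read from.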

Testing

  • 18 fast API smoke tests
  • Kokoro end-to-end generation test
  • Full-matrix parametrized test for all 11 models (TEST_ALL_MODELS=1)
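
The environment-variable gate on the full matrix could be sketched as follows. This uses stdlib unittest with subTest as a stand-in for whatever parametrization the project actually uses. The model ids, the `generate_stub` placeholder, and the choice to fall back to Kokoro alone when the flag is unset are all assumptions.

```python
import os
import unittest
from typing import List, Mapping

# Hypothetical ids for the 11 models named in these notes.
ALL_MODEL_IDS = [
    "kokoro", "vibevoice-0.5b", "vibevoice-1.5b", "fish-s2-pro", "qwen3-tts",
    "orpheus", "f5-tts", "chatterbox", "cosyvoice2", "parler-mini", "dia",
]


def models_under_test(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Full matrix only when TEST_ALL_MODELS=1; otherwise just the fast model."""
    if environ.get("TEST_ALL_MODELS") == "1":
        return list(ALL_MODEL_IDS)
    return ["kokoro"]


def generate_stub(model_id: str, text: str) -> bytes:
    """Placeholder for a real generation call; returns fake audio bytes."""
    return f"{model_id}:{text}".encode("utf-8")


class GenerationMatrixTest(unittest.TestCase):
    def test_each_selected_model_returns_audio(self) -> None:
        for model_id in models_under_test():
            # subTest reports each model separately, similar to parametrization.
            with self.subTest(model_id=model_id):
                self.assertTrue(generate_stub(model_id, "hello"))
```

Gating on an environment variable keeps the default run fast while still allowing a full GPU run, for example `TEST_ALL_MODELS=1 python -m unittest`.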