Releases: jamiepine/voicebox
v0.3.0
This release rewrites the backend into a modular architecture, overhauls the settings UI into routed sub-pages, fixes audio player freezing, migrates documentation to Fumadocs, and ships a batch of bug fixes targeting the most-reported issues from the tracker.
The backend's 3,000-line monolith `main.py` has been decomposed into domain routers, a services layer, and a proper database package. A style guide and ruff configuration now enforce consistency. On the frontend, settings have been split into dedicated routed pages with server logs, a changelog viewer, and an about page. The audio player no longer freezes mid-playback, and model loading status is now visible in the UI. Seven user-reported bugs have been fixed, including server crashes during sample uploads, generation list staleness, cryptic error messages, and missing CUDA support for RTX 50-series GPUs.
Settings Overhaul (#294)
- Split settings into routed sub-tabs: General, Generation, GPU, Logs, Changelog, About
- Added live server log viewer with auto-scroll
- Added in-app changelog page that parses `CHANGELOG.md` at build time
- Added About page with version info, license, and generation folder quick-open
- Extracted reusable `SettingRow` component for consistent setting layouts
Audio Player Fix (#293)
- Fixed audio player freezing during playback
- Improved playback UX with better state management and listener cleanup
- Fixed restart race condition during regeneration
- Added stable keys for audio element re-rendering
- Improved accessibility across player controls
Backend Refactor (#285)
- Extracted all routes from `main.py` into 13 domain routers under `backend/routes/` — `main.py` dropped from ~3,100 lines to ~10
- Moved CRUD and service modules into `backend/services/`, platform detection into `backend/utils/`
- Split monolithic `database.py` into a `database/` package with separate `models`, `session`, `migrations`, and `seed` modules
- Added `backend/STYLE_GUIDE.md` and `pyproject.toml` with ruff linting config
- Removed dead code: unused `_get_cuda_dll_excludes`, stale `studio.py`, `example_usage.py`, old `Makefile`
- Deduplicated shared logic across TTS backends into `backends/base.py`
- Improved startup logging with version, platform, data directory, and database stats
- Fixed startup database session leak — sessions now roll back and close in a `finally` block
- Isolated shutdown unload calls so one backend failure doesn't block the others
- Handled null duration in `story_items` migration
- Rejected model migration when the target is a subdirectory of the source cache
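The session-leak fix follows the standard acquire/rollback/close shape. A minimal sketch using stdlib `sqlite3` in place of the real SQLAlchemy session (the function name and schema here are hypothetical, not Voicebox's code):

```python
import sqlite3

def run_startup_migrations(db_path: str) -> None:
    """Run startup work inside a session that always cleans up.

    The pattern from the fix: roll back on failure, close in `finally`,
    so no database session leaks past startup.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)"
        )
        conn.execute("INSERT OR REPLACE INTO meta VALUES ('schema_version', '2')")
        conn.commit()
    except Exception:
        conn.rollback()   # leave the database untouched on failure
        raise
    finally:
        conn.close()      # always release the connection, success or not
```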
Documentation Rewrite (#288)
- Migrated docs site from Mintlify to Fumadocs (Next.js-based)
- Rewrote introduction and root page with content from README
- Added "Edit on GitHub" links and last-updated timestamps on all pages
- Generated OpenAPI spec and auto-generated API reference pages
- Removed stale planning docs (`CUDA_BACKEND_SWAP`, `EXTERNAL_PROVIDERS`, `MLX_AUDIO`, `TTS_PROVIDER_ARCHITECTURE`, etc.)
- Sidebar groups now expand by default; root redirects to `/docs`
- Added OG image metadata and `/og` preview page
UI & Frontend
- Added model loading status indicator and effects preset dropdown (3187344)
- Fixed take-label race condition during regeneration
- Added accessible focus styling to select component
- Softened select focus indicator opacity
- Addressed 4 critical and 12 major issues from CodeRabbit review
Bug Fixes (#295)
- Fixed sample uploads crashing the server — audio decoding now runs in a thread pool instead of blocking the async event loop (#278)
- Fixed generation list not updating when a generation completes — switched to `refetchQueries` for reliable cache busting, added SSE error fallback, and page reset on completion (#231)
- Fixed error toasts showing `[object Object]` instead of the actual error message (#290)
- Added Whisper model selection (`base`, `small`, `medium`, `large`, `turbo`) and expanded language support for the `/transcribe` endpoint (#233)
- Upgraded CUDA backend build from cu121 to cu126 for RTX 50-series (Blackwell) GPU support (#289)
- Handled client disconnects in SSE and streaming endpoints to suppress `[Errno 32] Broken Pipe` errors (#248)
- Fixed Docker build failure from pip hash mismatch on Qwen3-TTS dependencies (#286)
- Added 50 MB upload size limit with chunked reads to prevent unbounded memory allocation on sample uploads
- Eliminated redundant double audio decode in sample processing pipeline
Platform Fixes
- Replaced `netstat` with `TcpStream` + PowerShell for Windows port detection (#277)
- Fixed Docker frontend build and cleaned up Docker docs
- Fixed macOS download links to use `.dmg` instead of `.app.tar.gz`
- Added dynamic download redirect routes to landing site
Release Tooling
- Added `draft-release-notes` and `release-bump` agent skills
- Wired CI release workflow to extract notes from `CHANGELOG.md` for GitHub Releases
- Backfilled changelog with all historical releases
v0.2.3
The "it works in dev but not in prod" release. This version fixes a series of PyInstaller bundling issues that prevented model downloading, loading, generation, and progress tracking from working in production builds.
Model Downloads Now Actually Work
The v0.2.1/v0.2.2 builds could not download or load models that weren't already cached from a dev install. This release fixes the entire chain:
- Chatterbox, Chatterbox Turbo, and LuxTTS all download, load, and generate correctly in bundled builds
- Real-time download progress — byte-level progress bars now work in production. The root cause: `huggingface_hub` silently disables tqdm progress bars based on logger level, which prevented our progress tracker from receiving byte updates. We now force-enable the internal counter regardless.
- Fixed Python 3.12.0 `code.replace()` bug — the macOS build was on Python 3.12.0, which has a known CPython bug that corrupts bytecode when PyInstaller rewrites code objects. This caused `NameError: name 'obj' is not defined` crashes during scipy/torch imports. Upgraded to Python 3.12.13.
PyInstaller Fixes
- Collect all `inflect` files — `typeguard`'s `@typechecked` decorator calls `inspect.getsource()` at import time, which needs `.py` source files, not just bytecode. Fixes LuxTTS "could not get source code" error.
- Collect all `perth` files — bundles the pretrained watermark model (`hparams.yaml`, `.pth.tar`) needed by Chatterbox at runtime
- Collect all `piper_phonemize` files — bundles `espeak-ng-data/` (phoneme tables, language dicts) needed by LuxTTS for text-to-phoneme conversion
- Set `ESPEAK_DATA_PATH` in frozen builds so the espeak-ng C library finds the bundled data instead of looking at `/usr/share/espeak-ng-data/`
- Collect all `linacodec` files — fixes `inspect.getsource` error in Vocos codec
- Collect all `zipvoice` files — fixes source code lookup in LuxTTS voice cloning
- Copy metadata for `requests`, `transformers`, `huggingface-hub`, `tokenizers`, `safetensors`, `tqdm` — fixes `importlib.metadata` lookups in frozen binary
- Add hidden imports for `chatterbox`, `chatterbox_turbo`, `luxtts`, `zipvoice` backends
- Add `multiprocessing.freeze_support()` to fix resource_tracker subprocess crash in frozen binary
- `--noconsole` now only applied on Windows — macOS/Linux need stdout/stderr for Tauri sidecar log capture
- Hardened `sys.stdout`/`sys.stderr` devnull redirect to test writability, not just a `None` check
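Most of these bundling fixes map to a handful of PyInstaller spec hooks. An illustrative spec excerpt, assuming the project uses `collect_all` and `copy_metadata` as described (this is not the actual spec file):

```python
# Illustrative .spec excerpt; package lists come from the notes above.
import sys
from PyInstaller.utils.hooks import collect_all, copy_metadata

datas, binaries, hiddenimports = [], [], []

# collect_all pulls in data files, .py sources, and submodules —
# the .py sources are what inspect.getsource() needs in frozen builds
for pkg in ("inflect", "perth", "piper_phonemize", "linacodec", "zipvoice"):
    d, b, h = collect_all(pkg)
    datas += d; binaries += b; hiddenimports += h

# dist-info metadata keeps importlib.metadata lookups working when frozen
for dist in ("requests", "transformers", "huggingface-hub",
             "tokenizers", "safetensors", "tqdm"):
    datas += copy_metadata(dist)

hiddenimports += ["chatterbox", "chatterbox_turbo", "luxtts", "zipvoice"]

# --noconsole only on Windows; macOS/Linux keep stdout/stderr for Tauri
console = sys.platform != "win32"
```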
Updater
- Fixed updater artifact generation with `v1Compatible` for `tauri-action` signature files
- Updated `tauri-action` to v0.6 to fix updater JSON and `.sig` generation
Other Fixes
- Full traceback logging on all backend model loading errors (was just `str(e)` before)
v0.2.2
UPDATE: I'm working on a rewrite of model downloading. It's absolute hell and takes a while to test, as it always works in dev and never in prod builds. Will have a solution up ASAP. If you're eager to test 0.2.x, please compile from source. The next update will solve model downloading and the updater issue for good.
- Fix Chatterbox model support in bundled builds [SIKE fixed in 0.2.3]
- Fix LuxTTS/ZipVoice support in bundled builds [SIKE fixed in 0.2.3]
- Auto-update CUDA binary when app version changes
- CUDA download progress bar
- Fix server process staying alive on macOS (SIGHUP handling, watchdog grace period)
- Hide console window when running CUDA binary on Windows
v0.2.1
The best local voice cloning tool, just got better...
See the new website: https://voicebox.sh
Released 2026-03-15 — v0.2.1 on GitHub (version bump due to an immutable release tag on GitHub)
Voicebox v0.1.x was a single-engine voice cloning app built around Qwen3-TTS. v0.2.0 is a ground-up rethink: four TTS engines, 23 languages, paralinguistic emotion controls, a post-processing effects pipeline, unlimited generation length, an async generation queue, and support for every major GPU vendor. Plus Docker.
New TTS Engines
Multi-Engine Architecture
Voicebox now runs four independent TTS engines behind a thread-safe per-engine backend registry. Switch engines per-generation from a single dropdown — no restart required.
| Engine | Languages | Size | Key Strengths |
|---|---|---|---|
| Qwen3-TTS 1.7B | 10 | ~3.5 GB | Highest quality, delivery instructions ("speak slowly", "whisper") |
| Qwen3-TTS 0.6B | 10 | ~1.2 GB | Lighter, faster variant |
| LuxTTS | English | ~300 MB | CPU-friendly, 48 kHz output, 150x realtime |
| Chatterbox Multilingual | 23 | ~3.2 GB | Broadest language coverage, zero-shot cloning |
| Chatterbox Turbo | English | ~1.5 GB | 350M params, low latency, paralinguistic tags |
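The per-engine registry could look roughly like this (class and method names are hypothetical; the release only states that engines sit behind a thread-safe per-engine backend registry):

```python
import threading

class BackendRegistry:
    """Minimal sketch of a thread-safe per-engine backend registry."""

    def __init__(self, factories):
        self._factories = factories   # engine name -> backend constructor
        self._instances = {}
        # One lock per engine: loading one engine never blocks another
        self._locks = {name: threading.Lock() for name in factories}

    def get(self, engine: str):
        with self._locks[engine]:
            if engine not in self._instances:
                # Lazily construct the backend on first use
                self._instances[engine] = self._factories[engine]()
            return self._instances[engine]

# Stand-in constructors; real factories would load the TTS models
registry = BackendRegistry({
    "luxtts": lambda: "LuxTTS backend",
    "chatterbox": lambda: "Chatterbox backend",
})
```

Switching engines per generation then reduces to calling `registry.get(engine)` with the name from the dropdown.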
Chatterbox Multilingual — 23 Languages (#257)
Zero-shot voice cloning in Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, and Turkish. The language dropdown dynamically filters to show only languages supported by the selected engine.
LuxTTS — Lightweight English TTS (#254)
A fast, CPU-friendly English engine. ~300 MB download, 48 kHz output, runs at 150x realtime on CPU. Good for quick drafts and machines without a GPU.
Chatterbox Turbo — Expressive English (#258)
A fast 350M-parameter English model with inline paralinguistic tags.
Paralinguistic Tags Autocomplete (#265)
Type `/` in the text input with Chatterbox Turbo selected to open an autocomplete for 9 expressive tags that the model synthesizes inline with speech:
[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]
Tags render as inline badges in a rich text editor and serialize cleanly to the API.
Generation
Unlimited Generation Length — Auto-Chunking (#266)
Long text is now automatically split at sentence boundaries, generated per-chunk, and crossfaded back together. Engine-agnostic — works with all four engines.
- Auto-chunking limit slider — 100–5,000 chars (default 800)
- Crossfade slider — 0–200ms (default 50ms), or 0 for a hard cut
- Max text length raised to 50,000 characters
- Smart splitting respects abbreviations (Dr., e.g., a.m.), CJK punctuation, and never breaks inside
[tags]
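The chunking above can be sketched as follows. This is a simplified illustration (the real implementation's abbreviation list, CJK whitespace handling, and `[tags]` guard are necessarily richer than this):

```python
import re

# Illustrative subset; the real list is presumably much longer
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "a.m.", "p.m."}

def _tail(s: str) -> str:
    words = s.split()
    return words[-1].lower() if words else ""

def split_sentences(text: str) -> list[str]:
    # Split after sentence-ending punctuation (incl. CJK forms) + whitespace,
    # then merge back any false split caused by a known abbreviation
    raw = re.split(r"(?<=[.!?。！？])\s+", text.strip())
    sentences: list[str] = []
    for piece in raw:
        if sentences and _tail(sentences[-1]) in ABBREVIATIONS:
            sentences[-1] += " " + piece   # undo the split after e.g. "Dr."
        else:
            sentences.append(piece)
    return sentences

def chunk_text(text: str, limit: int = 800) -> list[str]:
    # Greedily pack whole sentences into chunks of at most `limit` characters;
    # each chunk is generated separately, then crossfaded back together
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        if current and len(current) + len(sentence) + 1 > limit:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```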
Asynchronous Generation Queue (#269)
Generation is now fully non-blocking. Submit a generation and start typing the next one immediately.
- Serial execution queue prevents GPU contention
- Real-time SSE status streaming (`generating` → `completed`/`failed`)
- Failed generations can be retried without re-entering text
- Stale generations from crashes are auto-recovered on startup
- Generating status pill shown inline in the story editor
Generation Versions
Every generation now supports multiple versions with provenance tracking:
- Original — the unprocessed TTS output, always preserved
- Effects versions — apply different effects chains to create new versions from any source
- Takes — regenerate with the same text/voice but a new seed
- Source tracking — each version records which version it was derived from
- Version pinning in stories — pin a specific version to a story track clip
- Favorites — star generations for quick access
Language Parameter Fix
Qwen TTS models now correctly receive the selected language. The generation form syncs with the voice profile's language setting.
Post-Processing Effects (#271)
A full audio effects system powered by Spotify's pedalboard library. Apply effects after generation, preview in real time, and build reusable presets.
| Effect | Description |
|---|---|
| Pitch Shift | ±12 semitones |
| Reverb | Room size, damping, wet/dry mix |
| Delay | Adjustable time, feedback, mix |
| Chorus / Flanger | Modulated delay — short for metallic, long for lush |
| Compressor | Threshold, ratio, attack, release |
| Gain | -40 to +40 dB |
| High-Pass Filter | Configurable cutoff frequency |
| Low-Pass Filter | Configurable cutoff frequency |
- 4 built-in presets — Robotic, Radio, Echo Chamber, Deep Voice
- Custom presets — create unlimited drag-and-drop effect chains
- Per-profile default effects — assign a chain to a voice profile, auto-applies to every generation
- Live preview — audition effects against existing audio before committing
- Source version selection — apply effects to any version of a generation, not just the latest
Platform Support
Windows Support (#272)
Full Windows support with CUDA GPU detection, cross-platform justfile, and clean server shutdown using `taskkill /T` for the process tree.
Linux (#262)
Pre-built Linux binaries are not available for this release — the release CI is still broken on Linux and we're working on fixing it. However, this release includes significant Linux improvements that make compiling from source much easier:
- AMD ROCm GPU acceleration with automatic `HSA_OVERRIDE_GFX_VERSION` for unlisted GPUs
- NVIDIA GBM buffer crash fix (#210)
- WebKitGTK microphone access for voice sample recording
- Cross-platform justfile with Linux-specific setup targets
- See the README for build-from-source instructions — we'll ship Linux CI builds as soon as we can
NVIDIA CUDA Backend Swap (#252)
The CPU-only release can download and swap in a CUDA-accelerated backend from within the app. The download is split into parts to work around GitHub's 2 GB asset limit, SHA-256 checksums are verified, and the server restarts automatically.
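The assemble-and-verify step can be sketched like this (the function name and part-file naming are assumptions; only the split-parts and SHA-256 details come from the notes):

```python
import hashlib
from pathlib import Path

def assemble_and_verify(parts: list[Path], out: Path, expected_sha256: str) -> None:
    """Join split download parts and verify the result's SHA-256."""
    digest = hashlib.sha256()
    with out.open("wb") as dst:
        # Parts are assumed to sort in order, e.g. backend.bin.000, .001, ...
        for part in sorted(parts):
            data = part.read_bytes()
            digest.update(data)
            dst.write(data)
    if digest.hexdigest() != expected_sha256:
        out.unlink()   # never leave a corrupt binary on disk
        raise ValueError("checksum mismatch, re-download required")
```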
Intel Arc (XPU) and DirectML
The PyTorch backend supports Intel Arc GPUs via IPEX/XPU, and any GPU on Windows via DirectML.
Docker + Web Deployment (#161)
Run Voicebox headless with `docker compose up`. 3-stage build, non-root runtime, health checks, persistent model cache. Binds to localhost only by default.
Whisper Turbo
Added `openai/whisper-large-v3-turbo` as a transcription model option.
Model Management (#268)
- Per-model unload — free GPU memory without deleting downloaded models
- Custom models directory — set `VOICEBOX_MODELS_DIR` to store models anywhere
- Model folder migration — move all models to a new location with progress tracking
- Download cancel/clear UI — cancel in-progress downloads, VS Code-style problems panel for errors (#238)
- Restructured settings UI — server settings and model management split into cleaner sections
Security & Reliability
- CORS hardening — explicit allowlist of local origins instead of wildcard `*`; extensible via `VOICEBOX_CORS_ORIGINS` (#88)
- Network access toggle — fully disable outbound requests for air-gapped deployments (#133)
- Offline crash fix — Voicebox no longer crashes when HuggingFace is unreachable (#152)
- Atomic audio saves — two-phase write prevents corrupted files on crash or disk-full (#263)
- Filesystem health endpoint — proactive disk space and directory writability checks
- Errno-specific error messages — clear feedback for permission denied, disk full, missing directory
- Chatterbox float64 dtype fix — patches S3Tokenizer and VoiceEncoder to cast float64→float32, preventing crashes on certain audio inputs (#264)
- Watchdog respects keep-server-running — `/watchdog/disable` endpoint prevents the server from shutting down when the app window closes, if configured
- Server shutdown on Windows — clean process tree termination with `taskkill /T` and `os._exit` fallback
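The atomic two-phase save from #263 is worth spelling out. A generic sketch of the pattern, not the actual Voicebox code:

```python
import os
import tempfile
from pathlib import Path

def atomic_write(path: Path, data: bytes) -> None:
    """Two-phase atomic save: temp file + fsync + rename.

    A crash or full disk mid-write leaves the previous file intact,
    because the target is only replaced by a completed temp file.
    """
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # data hits the disk before the rename
        os.replace(tmp, path)      # atomic on both POSIX and Windows
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)         # drop the partial file on failure
        raise
```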
Accessibility (#243)
- Screen reader support (tested with NVDA/Narrator) across all major UI surfaces
- Keyboard navigation for voice cards, history rows, model management, and story editor
- State-aware `aria-label` attributes on all interactive controls
UI Polish
- Redesigned landing page with animated ControlUI hero, multi-engine copy, model cards, and voice creator section (#274)
- Glassmorphic active state for sidebar buttons with accent border shine
- Voices tab overhaul with inline inspector
- Re...
v0.1.13
What's Changed
Stability and reliability
- #95 Fix: selecting 0.6B model still downloads and uses 1.7B
- #93 fix(mlx): bundle native libs and broaden error handling for Apple Silicon
- #79 fix: handle non-ASCII filenames in Content-Disposition headers
- #78 fix: guard getUserMedia call against undefined mediaDevices in non-secure contexts
- #77 fix: await for confirmation before deleting voices and channels
- #128 fix: resolve multiple issues (#96, #119, #111, #108, #121, #125, #127)
- #40 Fix: audio export path resolution
v0.1.12
Model Download UX Overhaul
- Real-time download progress tracking with accurate percentage and speed info
- No more "downloading" notifications during generation when nothing is actually being downloaded
- Better error handling and status reporting throughout the download process
Other Improvements
- Enhanced health check endpoint with GPU type information
- Improved model caching verification
- More reliable SSE progress updates
- Actual update notifications: no more going to Settings to check manually
Note: CUDA support for Windows is coming in the next update; see the issue and my plan.
v0.1.11
- Fixed transcriptions on MLX
- Fixed model download progress (finally)
v0.1.10
Faster generation on Apple Silicon
Massive speed gains, from around 20s per generation to 2-3s!
Added native MLX backend support for Apple Silicon, providing significantly faster TTS and STT generation on M-series macOS machines.
Note: this update broke transcriptions on Apple Silicon only; the patch is in the oven as we speak, and 0.1.11 will follow.
Features
- MLX Backend: New backend implementation optimized for Apple Silicon using MLX framework
- Dynamic Backend Selection: Automatically detects platform and selects between MLX (macOS) and PyTorch (other platforms)
- Improved Performance: Leverages Apple's unified memory architecture for faster model inference
Backend Changes
- Refactored TTS and STT logic into modular backend implementations (`mlx_backend.py`, `pytorch_backend.py`)
- Added platform detection system to handle backend selection at runtime
- Updated model loading and caching to support both backend types
- Enhanced health check endpoints to report active backend type
Build & Release
- Updated build process to include MLX-specific dependencies for macOS builds
- Modified release workflow to handle platform-specific backend bundling
- Added `requirements-mlx.txt` for MLX dependencies
Documentation
- Updated setup and building guides with MLX-specific instructions
- Added troubleshooting guidance for MLX-related issues
- Enhanced architecture documentation to explain backend selection
v0.1.9
Improved voice profile creation flow:
- Voice create drafts: no longer lose work if you close the modal
- Fixed Whisper only transcribing English or Chinese; all languages are now supported
Improved Stories editor:
- Added spacebar for play/pause
- Timeline now auto-scrolls to follow playhead during playback
- Fixed items misaligning with the mouse when picked up
- Fixed hitbox for selecting an item
- Fixed playhead jumping forward when pressing play (the timing anchors bug)
Generation box improvements
- Instruct mode no longer wipes prompt text
- Improved UI cleanliness
Misc
- Fixed "Model downloading" toast during generation when model is already downloaded
v0.1.8
🐛 Bug Fixes
Model Download Timeout Issues
Fixed critical issue where model downloads would fail with "Failed to fetch" errors on Windows:
- Root Cause: Multi-GB model downloads exceeded the HTTP request timeout (30-60s), causing the frontend to show errors even though downloads were continuing in the background
- Solution: Refactored download endpoints to return immediately and continue downloads in background
- `/models/download` endpoint now returns instantly, with the download starting in the background
- `/generate` and `/transcribe` endpoints now auto-start model downloads when needed
- Returns 202 Accepted status with download progress information for better UX
- Frontend can track download progress via SSE endpoint and retry when complete
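The return-immediately shape can be sketched with plain `asyncio` (endpoint and model names are illustrative; the real code uses FastAPI endpoints and streams progress over SSE):

```python
import asyncio

# model id -> progress record; the app streams this to the UI over SSE
progress: dict[str, dict] = {}

async def download_model(model_id: str) -> None:
    """Stand-in for the real multi-GB download; sleep(0) replaces network reads."""
    progress[model_id] = {"status": "downloading", "pct": 0}
    for pct in (25, 50, 75, 100):
        await asyncio.sleep(0)
        progress[model_id]["pct"] = pct
    progress[model_id]["status"] = "complete"

async def start_download(model_id: str) -> dict:
    """Return a 202-style payload immediately; the work continues in background."""
    progress[model_id] = {"status": "queued", "pct": 0}
    asyncio.create_task(download_model(model_id))
    return {"status_code": 202, "detail": f"download of {model_id} started"}

async def demo() -> dict:
    resp = await start_download("qwen3-tts-0.6b")
    assert resp["status_code"] == 202            # the caller was never blocked
    while progress["qwen3-tts-0.6b"]["status"] != "complete":
        await asyncio.sleep(0)                   # the UI polls via SSE instead
    return progress["qwen3-tts-0.6b"]
```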
Cross-Platform Cache Path Issues
- Fixed hardcoded `~/.cache/huggingface/hub` paths that don't work on Windows
- All cache paths now use `hf_constants.HF_HUB_CACHE` for proper cross-platform support
- Windows: uses `%USERPROFILE%\.cache\huggingface\hub` or `%LOCALAPPDATA%`
- macOS/Linux: uses `~/.cache/huggingface/hub`
- Ensures the HuggingFace cache directory exists on startup (defensive fix)
✨ Features
Windows Process Management
- Added `/shutdown` endpoint for graceful server shutdown on Windows
- Improved process lifecycle management for bundled server binary
GPU Detection Improvements
- Added `gpu_type` field to health check response
- Now shows specific GPU type: "CUDA (GPU Name)", "MPS (Apple Silicon)", or None
- Fixes UI showing "GPU: Not Available" when MPS/CUDA is actually detected
