Version: 1.0.0 Last Updated: 5/31/2025 Project Repository: https://github.com/devnen/Chatterbox-TTS-Server
This server is based on the architecture and UI of our Dia-TTS-Server project but uses the distinct chatterbox-tts engine.
- Visual Overview
- Introduction
- System Requirements
- Installation and Setup
- Configuration (`config.yaml`)
- Running the Server
- Feature Deep Dive
- Usage Guide
- Troubleshooting
- Project Architecture
- Testing (Conceptual)
- License and Disclaimer
This section provides a high-level visual representation of the Chatterbox TTS Server project structure and its primary components.
The following tree illustrates the organization of files and directories within the project root:
Chatterbox-TTS-Server/
│
├── config.py # Manages config.yaml, default values, accessors
├── config.yaml # PRIMARY configuration file (created/managed by server)
├── docker-compose.yml # Docker Compose setup for containerized deployment
├── Dockerfile # Docker image definition
├── documentation.md # This comprehensive documentation file
├── download_model.py # Utility to download specific model files to a local cache
├── engine.py # Core model loading (from_pretrained) & generation logic
├── models.py # Pydantic models for API request validation and structure
├── README.md # Project summary and quick start guide
├── requirements.txt # Python package dependencies
├── server.py # Main FastAPI application, API endpoints, UI routes
│
├── ui/ # Contains all files for the Web User Interface
│ ├── index.html # Main HTML template for the UI
│ ├── presets.yaml # Predefined examples for TTS generation, loaded by the UI
│ └── script.js # Frontend JavaScript for UI interactivity and API communication
│
├── model_cache/ # Default directory for download_model.py script (Note: Not the runtime Hugging Face cache)
├── outputs/ # Default directory for audio files saved from UI or API
├── reference_audio/ # Default directory for user-uploaded reference audio files for voice cloning
└── voices/ # Default directory for predefined voice audio files
This diagram illustrates the major functional components of the server and their interactions:
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────────────┐ ┌───────────────────┐
│ User (Web UI / │────→ │ FastAPI Server │────→ │ TTS Engine (engine.py) │────→ │ ChatterboxTTS │
│ API Client) │ │ (server.py) │ │ (Handles Chunks/Params) │ │ (from HF Hub) │
└───────────────────┘ └─────────┬─────────┘ └──────┬─────────┬──────────┘ └─────────┬─────────┘
↑ │ │ Calls │ │ (Uses PyTorch)
│ (Serves UI, │ Uses │ │ │
│ API data) ▼ ▼ ▼ │
│ ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │
└───────────────── │ Configuration │ ←─ │ config.yaml │ │ Utilities │ │
│ (config.py) │ └───────────────────┘ │ (utils.py) │ │
└───────────────────┘ │ - Chunking Logic │ │
▲ │ - Audio Proc. │ │
│ Uses │ - File Handling │ │
│ └──────┬────────────┘ │
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │ │
│ Web UI Files │ ←─── │ API Data / HTML │ │ Audio Libraries │←─────┘ │
│ (ui/*) │ │ (via server.py) │ │ (soundfile, librosa)│ ▼
└───────────────────┘ └───────────────────┘ └───────────────────┘ ┌───────────────────┐
│ PyTorch / CUDA │
└───────────────────┘
Diagram Legend:
- Boxes represent major software components or groups of files.
- Arrows (→) indicate the primary direction of data flow or control.
- Lines with descriptive text (e.g., "Uses", "Calls") indicate dependencies or interactions.
The Chatterbox TTS Server is a self-hostable application designed to provide an accessible and feature-rich interface to the chatterbox-tts speech synthesis engine. It aims to simplify the process of generating high-quality speech by offering:
- A user-friendly Web User Interface (Web UI) for interactive use.
- A robust Application Programming Interface (API) for programmatic integration, including an OpenAI-compatible endpoint.
- Advanced features such as voice cloning, predefined voices, large text handling through intelligent chunking, and fine-grained control over generation parameters.
The server utilizes the chatterbox-tts model, developed by Resemble AI. This model is known for its ability to produce natural-sounding speech. The server primarily interacts with this model by loading it from the Hugging Face Hub and passing plain text for synthesis.
Important Note on Text Input: The chatterbox-tts engine, as integrated into this server, processes plain text. It does not support special tags for speaker differentiation (e.g., [S1], [S2]) or explicit emotional control tags. The synthesis is single-speaker, based on the selected voice mode (predefined or cloned).
- High-Quality Single-Speaker TTS: Leverages the `chatterbox-tts` model.
- Voice Cloning: Enables voice replication from user-provided audio samples.
- Predefined Voices: Offers a library of ready-to-use voices for consistent output.
- Large Text Handling: Implements intelligent chunking to process long plain text inputs without overwhelming the TTS engine.
- Flexible API: Includes a custom `/tts` endpoint for full control and an OpenAI-compatible `/v1/audio/speech` endpoint for broader integration.
- Interactive Web UI: Provides a comprehensive interface for generation, configuration, and audio management.
- Configuration Management: Centralized settings via `config.yaml`, editable through the UI or directly.
- GPU Acceleration: Supports NVIDIA CUDA and Apple MPS for faster inference, with CPU fallback.
- Optional Audio Post-Processing: Features for silence trimming and audio cleanup.
- Docker Support: Facilitates easy deployment and scaling.
This documentation is intended for:
- End Users: Individuals wishing to use the Web UI for generating speech.
- Developers: Programmers looking to integrate TTS capabilities into their applications via the API.
- System Administrators: Personnel responsible for deploying and maintaining the server.
Ensure your system meets the following requirements before proceeding with installation.
- Windows: Windows 10 (64-bit) or Windows 11 (64-bit).
- Linux: Most modern distributions (Debian/Ubuntu and derivatives are well-tested).
- macOS: Supported. Apple Silicon Macs can use GPU acceleration via MPS (see below); Intel Macs are limited to CPU inference.
- Python Version: Python 3.10 or later is required.
- A modern multi-core CPU is recommended for reasonable performance, especially if GPU acceleration is unavailable.
- NVIDIA GPU: For optimal performance, an NVIDIA GPU supporting CUDA is highly recommended.
- Architecture: Maxwell architecture or newer.
- VRAM: Specific VRAM requirements depend on the `chatterbox-tts` model variant, but generally 6GB+ is advisable for smoother operation.
- See Section 4.5 GPU Acceleration Setup (NVIDIA) for driver and toolkit details.
- Apple Silicon: M1, M2, M3, or newer Apple Silicon chips with macOS 12.3+ provide excellent acceleration via Apple Metal Performance Shaders (MPS).
- RAM: Minimum 8 GB, 16 GB or more recommended.
- Storage: Sufficient disk space for Python environment, dependencies, downloaded models (Hugging Face cache can grow to several GBs), and generated audio files.
The server relies on several Python packages, managed via requirements.txt. Key dependencies include:
- `chatterbox-tts`: The core text-to-speech engine.
- `fastapi`: Web framework for building the API.
- `uvicorn`: ASGI server for running FastAPI.
- `torch` & `torchaudio`: Deep learning and audio operations.
- `numpy`: Numerical operations.
- `soundfile`: Reading and writing audio files.
- `huggingface_hub`: Interacting with the Hugging Face Hub for model downloads.
- `PyYAML`: Parsing `config.yaml` and `presets.yaml`.
- `pydantic`: Data validation for API requests.
- `librosa`: Advanced audio processing such as resampling and speed adjustment.
- `praat-parselmouth`: (Optional) For the unvoiced segment removal feature.
- `python-multipart`: For file uploads.
- `Jinja2`: For HTML templating (though the UI is primarily API-driven).
Refer to requirements.txt [1] for the complete list.
- `libsndfile1`: Required by the `soundfile` Python package for audio file I/O.
  - Installation (Debian/Ubuntu):

    ```bash
    sudo apt install libsndfile1
    ```

- `ffmpeg`: Recommended for robust audio operations by some underlying libraries (e.g., `librosa` or `torchaudio` for certain formats).
  - Installation (Debian/Ubuntu):

    ```bash
    sudo apt install ffmpeg
    ```
This section details the steps to install and configure the Chatterbox TTS Server on your system.
Before you begin, ensure you have:
- Met all System Requirements.
- Installed Python 3.10 or later.
- Installed Git.
- (If using GPU) Installed compatible NVIDIA drivers.
- Open a terminal or command prompt.
- Navigate to the directory where you want to install the server.
- Clone the project repository from GitHub:
git clone https://github.com/devnen/Chatterbox-TTS-Server.git
- Change into the project directory:
cd Chatterbox-TTS-Server
It is strongly recommended to use a Python virtual environment to isolate project dependencies.
Windows (PowerShell/CMD):

```bash
# Ensure you are in the Chatterbox-TTS-Server directory
python -m venv venv
.\venv\Scripts\activate
# Your command prompt should now be prefixed with (venv).
```

Linux/macOS:

```bash
# Ensure you are in the Chatterbox-TTS-Server directory
python3 -m venv venv
source venv/bin/activate
# Your command prompt should now be prefixed with (venv).
```

With the virtual environment activated:
- Upgrade `pip` to its latest version (recommended):

  ```bash
  pip install --upgrade pip
  ```

- Install all required Python packages from `requirements.txt` [1]:

  ```bash
  pip install -r requirements.txt
  ```

  Note: This step may take several minutes as it downloads and installs numerous packages, including large ones like `torch` and `chatterbox-tts`. By default, this may install a CPU-only version of PyTorch. If GPU support is desired, proceed to the next section.
Skip this section if you intend to run the server on CPU only.
- Ensure you have the latest NVIDIA drivers installed for your operating system and GPU. You can download them from the NVIDIA Driver Downloads page.
- After installation or update, reboot your system if prompted.
- Verify driver installation by running `nvidia-smi` in your terminal. This command should output information about your GPU and the highest CUDA version supported by the driver.
The chatterbox-tts engine and this server rely on PyTorch. To enable CUDA acceleration, you must install a version of PyTorch compiled with CUDA support.
- Visit the Official PyTorch Get Started page.
- Use the configuration tool on their website:
- PyTorch Build: Stable
- Your OS: Select your operating system (Linux or Windows).
- Package: Pip
- Language: Python
- Compute Platform: Select a CUDA version (e.g., CUDA 11.8, CUDA 12.1). Crucially, choose a CUDA version that is compatible with (less than or equal to) the CUDA version reported by your `nvidia-smi` command.
- The website will generate a `pip install` command. Copy this command.
- In your activated virtual environment, first uninstall any existing CPU-only PyTorch versions that might have been installed by `requirements.txt`:

  ```bash
  pip uninstall torch torchvision torchaudio -y
  ```
- Then, paste and run the command obtained from the PyTorch website. Example (for CUDA 12.1, replace with your specific command):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
To verify that PyTorch can utilize your GPU:
- In your activated virtual environment, start a Python interpreter:

  ```bash
  python
  ```

- Execute the following commands:

  ```python
  import torch
  print(f"PyTorch version: {torch.__version__}")
  cuda_available = torch.cuda.is_available()
  print(f"CUDA available: {cuda_available}")
  if cuda_available:
      print(f"Number of GPUs: {torch.cuda.device_count()}")
      print(f"Current GPU Name: {torch.cuda.get_device_name(torch.cuda.current_device())}")
  exit()
  ```

- If `CUDA available:` prints `True`, your setup is correct. If `False`, revisit the driver and PyTorch installation steps.
For Apple Silicon Macs (M1, M2, M3, etc.), follow this specific installation sequence:
- macOS 12.3 or later for MPS support
- An Apple Silicon Mac (M1, M2, M3, or newer)
- Install PyTorch with MPS support first:

  ```bash
  # With virtual environment activated
  pip install torch torchvision torchaudio
  ```

- Install chatterbox-tts without dependencies:

  ```bash
  pip install --no-deps git+https://github.com/resemble-ai/chatterbox.git
  ```

- Install core server dependencies:

  ```bash
  pip install fastapi 'uvicorn[standard]' librosa safetensors soundfile pydub audiotsm praat-parselmouth python-multipart requests aiofiles PyYAML watchdog unidecode inflect tqdm
  ```

- Install missing chatterbox dependencies:

  ```bash
  pip install conformer==0.3.2 diffusers==0.29.0 resemble-perth==1.0.1 transformers==4.46.3
  ```

- Install remaining dependencies (if not already installed):

  ```bash
  pip install --no-deps s3tokenizer
  pip install onnx==1.16.0
  ```

- Configure the MPS device in `config.yaml`:

  ```yaml
  tts_engine:
    device: mps  # Set to 'mps' instead of 'auto' or 'cuda'
  ```

- Verify the installation:

  ```bash
  python -c "import torch; print(f'PyTorch version: {torch.__version__}'); print(f'MPS available: {torch.backends.mps.is_available()}')"
  ```

Apple Silicon requires careful dependency management due to conflicts between the pinned PyTorch versions in the chatterbox-tts requirements and the newer PyTorch versions that support MPS acceleration. This step-by-step process avoids version conflicts while ensuring MPS support.
The server uses a config.yaml file for all its settings.
- On the first run, if `config.yaml` is not found in the project root, the server will automatically create it using default values defined internally (see `config.py` [1]).
- You can review and modify this `config.yaml` file after it is created, or before the first run, if you wish to customize settings such as port numbers, paths, or default generation parameters. See Section 5. Configuration (`config.yaml`) for details.
The Chatterbox TTS Server is configured primarily through a single YAML file, config.yaml, located in the root directory of the project.
config.yaml allows customization of various aspects of the server, including network settings, model parameters, file paths, TTS engine behavior, UI preferences, and default generation values. The server reads this file upon startup.
- Location: The `config.yaml` file must reside in the project's root directory.
- Creation: If `config.yaml` does not exist when the server starts, it will be automatically generated with a default set of configurations. These defaults are defined within `config.py` [1] (specifically in the `DEFAULT_CONFIG` dictionary).
The following table describes the main sections and some key parameters you might find in config.yaml. For a complete list of all possible parameters and their default values, refer to the DEFAULT_CONFIG structure in config.py [1].
| Section | Parameter | Type | Description | Default (Example) |
|---|---|---|---|---|
| `server` | `host` | string | IP address the server listens on. `0.0.0.0` for all available interfaces. | `0.0.0.0` |
| | `port` | integer | Port number for the server. | `8000` |
| | `log_file_path` | string | Path to the server log file (relative to project root or absolute). | `logs/tts_server.log` |
| | `log_file_max_size_mb` | integer | Maximum size of a single log file before rotation. | `10` |
| | `log_file_backup_count` | integer | Number of backup log files to keep. | `5` |
| `model` | `repo_id` | string | Hugging Face repository ID for the `chatterbox-tts` model. | `ResembleAI/chatterbox` |
| `tts_engine` | `device` | string | TTS processing device: `auto`, `cuda`, `mps`, or `cpu`. `auto` attempts CUDA, then MPS, then falls back to CPU. | `auto` |
| | `predefined_voices_path` | string | Directory for predefined voice audio files. | `voices` |
| | `reference_audio_path` | string | Directory for user-uploaded reference audio files for voice cloning. | `reference_audio` |
| | `default_voice_id` | string | Filename of the default predefined voice to use if none selected (primarily for UI). | `default_sample.wav` |
| `paths` | `model_cache` | string | Directory for caching models downloaded by `download_model.py`. Note: runtime uses the global HF cache. | `./model_cache` |
| | `output` | string | Default directory for audio files saved from the UI or API. | `./outputs` |
| `generation_defaults` | `temperature` | float | Controls randomness (0.0-1.5). Lower is more deterministic. | `0.8` |
| | `exaggeration` | float | Controls expressiveness (0.0-2.0). | `0.5` |
| | `cfg_weight` | float | Classifier-Free Guidance weight (0.0-2.0). Influences adherence to style. | `0.5` |
| | `seed` | integer | Random seed for generation. `0` often means random/engine default. | `0` |
| | `speed_factor` | float | Playback speed factor (0.25-4.0). `1.0` is normal. | `1.0` |
| | `language` | string | Default language code (e.g., `en`). Primarily for UI; engine may infer. | `en` |
| `audio_output` | `format` | string | Default output audio format (e.g., `wav`, `opus`). | `wav` |
| | `sample_rate` | integer | Target sample rate for output audio files (e.g., 24000, 48000). Resampling applied if needed. | `24000` |
| | `max_reference_duration_sec` | integer | Maximum duration for reference audio files for cloning. | `30` |
| `ui_state` | `last_text` | string | Last text entered in the UI. | `""` |
| | `last_voice_mode` | string | Last selected voice mode (`predefined` or `clone`). | `predefined` |
| | `last_predefined_voice` | string/null | Filename of the last used predefined voice. | `null` |
| | `last_reference_file` | string/null | Filename of the last used reference audio. | `null` |
| | `last_seed` | integer | Last used generation seed in UI. | `0` |
| | `last_chunk_size` | integer | Last used chunk size in UI. | `120` |
| | `last_split_text_enabled` | boolean | Whether text splitting was last enabled in UI. | `true` |
| | `hide_chunk_warning` | boolean | Flag to hide the chunking warning modal. | `false` |
| | `hide_generation_warning` | boolean | Flag to hide the general generation quality notice modal. | `false` |
| | `theme` | string | Default UI theme (`dark` or `light`). | `dark` |
| `ui` | `title` | string | Title displayed in the web UI. | `Chatterbox TTS Server` |
| | `show_language_select` | boolean | Whether to show language selection in the UI. | `true` |
| | `max_predefined_voices_in_dropdown` | integer | Max predefined voices to list in the UI dropdown before it becomes less usable. | `20` |
| `debug` | `save_intermediate_audio` | boolean | If true, save intermediate audio files during chunk processing for debugging. | `false` |
Note: Paths can be specified relative to the project root or as absolute paths.
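Pulling a few of these settings together, a minimal `config.yaml` might look like the fragment below. The values are illustrative defaults drawn from the table above, not a complete file; the server fills in any omitted keys from `DEFAULT_CONFIG`.

```yaml
server:
  host: 0.0.0.0
  port: 8000
model:
  repo_id: ResembleAI/chatterbox
tts_engine:
  device: auto               # auto -> CUDA, then MPS, then CPU
  predefined_voices_path: voices
  reference_audio_path: reference_audio
generation_defaults:
  temperature: 0.8
  exaggeration: 0.5
  cfg_weight: 0.5
  seed: 0
  speed_factor: 1.0
audio_output:
  format: wav
  sample_rate: 24000
```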
While not explicitly a top-level section in the provided `config.py`'s `DEFAULT_CONFIG`, flags for enabling audio post-processing features (such as silence trimming) are typically boolean values. They might live under `debug` or a dedicated `audio_processing` section if you choose to group them. Example:

```yaml
# audio_processing:  # Or under debug:
#   enable_silence_trimming: true
#   enable_internal_silence_fix: true
#   enable_unvoiced_removal: false  # Requires parselmouth
```

The server logic in `server.py` and `utils.py` would then check these flags from `config_manager`.
The Web UI provides sections to manage parts of config.yaml:
- Generation Parameters: Sliders and inputs for parameters like temperature, seed, etc., reflect values from `generation_defaults`. Clicking "Save Generation Parameters" updates this section in `config.yaml`.
- Server Configuration: Allows viewing and, for some fields, editing settings related to `server`, `tts_engine`, and `paths`. Clicking "Save Server Configuration" updates `config.yaml`. Remember that changes to server host/port, model settings, or fundamental paths require a server restart to take effect.
- UI State: Settings such as last entered text, selected voice mode, chosen files, chunking toggle/size, and theme preference are automatically saved to the `ui_state` section in `config.yaml` (typically with a debounce mechanism) as you interact with the UI.
- Ensure your Python virtual environment is activated (see Section 4.3 Python Virtual Environment Setup).
- Navigate to the root directory of the `Chatterbox-TTS-Server` project in your terminal.
- Execute the following command:

  ```bash
  python server.py
  ```

The server will start, and you will see log output in the terminal, including the address and port it is running on.
- Automatic Download (Runtime): The first time you run the server (or if the model is not found in the cache), the `engine.py` module, specifically `ChatterboxTTS.from_pretrained()`, will attempt to download the `chatterbox-tts` model from the Hugging Face Hub (specified by `model.repo_id` in `config.yaml`). This download goes to the standard Hugging Face cache directory (e.g., `~/.cache/huggingface/hub` on Linux/macOS, `%USERPROFILE%\.cache\huggingface\hub` on Windows, or as defined by the `HF_HOME` environment variable). It can take some time depending on your internet connection and model size. The server fully starts only after the model is successfully loaded.
- Optional Pre-download Script (`download_model.py` [1]): The project includes a `download_model.py` script. It downloads specific model files (listed in its `CHATTERBOX_MODEL_FILES` array) into the local directory specified by `paths.model_cache` in `config.yaml` (default: `./model_cache/`).
  - Important Distinction: At runtime, `engine.py` does not load models from this `paths.model_cache` directory; it uses the global Hugging Face cache. The `download_model.py` script is a utility for users who want a local, self-contained copy of model components, perhaps for offline use or custom model management, but it is not part of the default runtime model loading path.
- Web UI: Once the server is running, open your web browser and navigate to the address shown in the startup logs, typically `http://localhost:PORT` (e.g., `http://localhost:8000` if `server.port` is 8000). The server attempts to open this automatically.
- API Documentation (Swagger UI): Interactive API documentation is available at `http://localhost:PORT/docs`.
- To stop the server, press `CTRL+C` in the terminal window where it is running.
For containerized deployment, refer to the Dockerfile [1] and docker-compose.yml [1] files in the project root, and the Docker instructions in the README.md file. Docker provides an isolated environment and simplifies dependency management.
This section elaborates on key features of the Chatterbox TTS Server.
The Chatterbox TTS Server expects plain text as input for speech synthesis.
- Standard punctuation (periods, commas, question marks, exclamation marks) is generally recognized by the underlying TTS engine to influence prosody.
- The server and the `chatterbox-tts` engine do not support special tags for:
  - Speaker differentiation (e.g., `[S1]`, `[S2]`). All generated speech will be in a single voice per request, determined by the selected voice mode.
  - Explicit emotional control (e.g., `(emotion:sad)`).
  - Other complex control commands embedded in the text.
- Any text provided will be synthesized as is, including any characters or symbols that might resemble tags from other systems.
To handle long plain text inputs that might exceed the processing capacity of the TTS engine or lead to overly long audio files, the server implements an intelligent chunking mechanism.
- Process: Enabled by default (can be toggled in the UI/API). When active, `utils.py` [1] first splits the input text into sentences using `split_into_sentences()`. Then, `chunk_text_by_sentences()` groups these sentences into chunks, respecting a maximum character `chunk_size` (configurable).
- Benefits: Ensures stable generation for long documents, better resource management, and more manageable audio segments.
- Configuration:
  - UI: "Split text into chunks" checkbox and "Chunk Size" slider.
  - API (`/tts`): `split_text` (boolean) and `chunk_size` (integer) parameters.
- See Section 3. Large Text Processing & Chunking for a detailed explanation.
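The sentence-grouping idea can be sketched as follows. This is a simplified illustration, not the actual code in `utils.py`; the function names `naive_split_sentences` and `chunk_sentences` are hypothetical stand-ins for `split_into_sentences()` and `chunk_text_by_sentences()`.

```python
import re

def naive_split_sentences(text: str) -> list[str]:
    # Hypothetical stand-in for the real sentence splitter:
    # split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_sentences(sentences: list[str], chunk_size: int = 120) -> list[str]:
    # Group whole sentences into chunks of roughly chunk_size characters,
    # never splitting a sentence across two chunks.
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "First sentence here. Second sentence is a bit longer. Third one. A fourth sentence to finish."
chunks = chunk_sentences(naive_split_sentences(text), chunk_size=60)
```

Note that a single sentence longer than `chunk_size` is still kept whole, which is why the limit is a target rather than a hard cap.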
The server allows generating speech in a voice cloned from a reference audio sample.
- Mechanism: The user provides a reference audio file (`.wav` or `.mp3`). The path to this file is passed to the `chatterbox-tts` engine, which uses it as an `audio_prompt` to condition the synthesis.
- Reference Audio:
  - Files are uploaded to or placed in the directory specified by `tts_engine.reference_audio_path` (default: `./reference_audio/`) [1].
  - Quality of the reference audio (clear speech, minimal noise) significantly impacts clone quality.
  - Duration is also a factor; refer to `audio_output.max_reference_duration_sec` in `config.yaml`.
- Usage:
  - UI: Select "Voice Clone" mode, choose a reference file.
  - API (`/tts`): Set `voice_mode` to `clone` and provide `reference_audio_filename`.
- See Section 4. Voice Cloning: Replicating Voices with Reference Audio for more details.
For ease of use and consistent voice output, the server supports predefined voices.
- Mechanism: A collection of curated voice samples (audio files) is stored on the server. When a predefined voice is selected, its audio file is used as the `audio_prompt` for the `chatterbox-tts` engine.
- Voice Files:
  - Stored in the directory specified by `tts_engine.predefined_voices_path` (default: `./voices/`) [1].
  - Supported formats: `.wav`, `.mp3`.
- Usage:
  - UI: Select "Predefined Voices" mode, choose a voice from the dropdown.
  - API (`/tts`): Set `voice_mode` to `predefined` and provide `predefined_voice_id` (the filename).
- See Section 5. Predefined Voices: Consistent Synthetic Voices for more details.
To achieve reproducible audio output, particularly when experimenting or generating multiple parts of a longer text, a generation seed can be used.
- Mechanism: The `seed` parameter (an integer) initializes the random number generators within the TTS engine.
- Effect: Using the same seed, input text, voice, and other generation parameters will typically produce identical or very similar audio output. This is useful for maintaining voice consistency across chunks if not using a specific cloned or predefined voice.
- Usage:
  - UI: "Generation Seed" input field.
  - API (`/tts` and `/v1/audio/speech`): `seed` parameter.
- See Section 6. Consistent Generation (Seeding) for more details.
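The principle can be demonstrated with Python's standard `random` module; the TTS engine seeds its own (PyTorch) generators in the same spirit, so identical seeds lead to identical sampling decisions:

```python
import random

def sample_sequence(seed: int, n: int = 5) -> list[float]:
    # Re-seeding a generator makes its "random" draws reproducible.
    # The TTS engine applies the same idea to its sampling steps.
    rng = random.Random(seed)
    return [round(rng.random(), 6) for _ in range(n)]

run_a = sample_sequence(seed=42)
run_b = sample_sequence(seed=42)  # same seed -> same draws
run_c = sample_sequence(seed=7)   # different seed -> different draws
```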
The server includes optional audio post-processing steps handled by utils.py [1] to enhance the quality of the generated audio. These are applied if their respective flags are enabled in config.yaml (e.g., under a conceptual audio_processing section or individual debug flags).
- Silence Trimming (`trim_lead_trail_silence`): Removes excessive silence from the beginning and end of audio segments.
- Internal Silence Reduction (`fix_internal_silence`): Shortens unnaturally long pauses within the speech.
- Unvoiced Segment Removal (`remove_long_unvoiced_segments`): If `praat-parselmouth` is installed, this can remove long segments of audio that contain no voiced speech (e.g., long breaths).
- Speed Adjustment (`apply_speed_factor`): Modifies the playback speed of the audio. Uses `librosa` for pitch-preserving adjustment if available.
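To make the lead/trail trimming idea concrete, here is a minimal NumPy-only sketch of the concept. The real `utils.py` implementation and its thresholds may differ (e.g., it might work in dB and keep a small padding margin).

```python
import numpy as np

def trim_silence(audio: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Trim leading/trailing samples whose absolute amplitude is
    below `threshold`. A simplified sketch, not the server's code."""
    above = np.flatnonzero(np.abs(audio) >= threshold)
    if above.size == 0:
        return audio[:0]  # the whole signal is silence
    # Keep everything between the first and last loud-enough sample.
    return audio[above[0] : above[-1] + 1]

# 100 silent samples, 50 loud samples, 100 silent samples:
signal = np.concatenate([np.zeros(100), 0.5 * np.ones(50), np.zeros(100)])
trimmed = trim_silence(signal)
```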
- Runtime Model Loading: The `engine.py` [1] module loads the `chatterbox-tts` model using `ChatterboxTTS.from_pretrained(repo_id=..., device=...)`. This method downloads the model from the specified Hugging Face repository (defined in `config.yaml` via `model.repo_id`) into the standard Hugging Face local cache if it is not already present. This is the primary mechanism for model access during server operation.
- Hugging Face Cache: The default location for this cache is platform-dependent (e.g., `~/.cache/huggingface/hub`). It can be overridden by setting the `HF_HOME` environment variable.
- `download_model.py` Script: This utility script [1] allows users to download specific model files (listed in its internal `CHATTERBOX_MODEL_FILES` array) to a custom local directory defined by `paths.model_cache` in `config.yaml`. This is for users who want a separate, managed local copy of model assets, but it is not the directory the server's `engine.py` uses for runtime loading.
- Configuration: The `model.repo_id` in `config.yaml` specifies the default Hugging Face repository to load from.
This section explains how to use the Chatterbox TTS Server through its Web UI and API.
The Web UI provides an interactive way to generate speech and manage server settings. Access it by navigating to the server's root URL (e.g., http://localhost:8000).
- Text to synthesize: A large text area for inputting the plain text you want to convert to speech. Character count is displayed.
- Generate Speech Button: Initiates the TTS process using the current settings.
- Located below the main text input area.
- "Split text into chunks" Checkbox: Toggles the automatic text chunking feature (see Section 7.2 Large Text Processing (Chunking)). Enabled by default.
- "Chunk Size" Slider: Appears when splitting is enabled. Allows adjusting the target character length for chunks (default 120). The current value is displayed next to the slider.
Radio buttons allow choosing the voice generation method:
- Predefined Voices: Activates the dropdown to select from available predefined voices (see Section 7.4 Predefined Voices). Includes an "Import" button to upload new predefined voice files and a "Refresh" button to reload the list from the server.
- Voice Cloning: Activates the dropdown to select a reference audio file for voice cloning (see Section 7.3 Voice Cloning). Includes an "Import" button to upload new reference files and a "Refresh" button.
- A section displaying buttons for predefined text and parameter examples, loaded from `ui/presets.yaml` [1]. Clicking a preset button populates the text area and relevant generation parameters.
An expandable section allows fine-tuning of TTS generation:
- Temperature: Slider controlling output randomness.
- Exaggeration: Slider controlling speech expressiveness.
- CFG Weight: Slider for Classifier-Free Guidance weight.
- Speed Factor: Slider to adjust playback speed. A warning may appear if set to values other than 1.0, as it can affect quality.
- Generation Seed: Input field for an integer seed.
- Language: Dropdown for selecting language (primarily for UI state, engine may infer).
- "Save Generation Parameters" Button: Saves the current slider/input values as new defaults in the `generation_defaults` section of `config.yaml`.
An expandable section that displays current server configuration values loaded from config.yaml via an API call.
- Fields like server host/port, TTS device, model paths, audio output settings are shown.
- Some fields may be editable here, though changes to critical settings like paths or port numbers require a server restart to take effect.
- "Save Server Configuration" Button: Attempts to save changes made in editable fields back to `config.yaml`. A restart prompt may appear.
- "Restart Server" Button: (May appear after saving certain settings) Logs a request to restart the server.
- Appears below the main form after successful audio generation.
- Uses WaveSurfer.js to display an interactive waveform.
- Includes Play/Pause button, a Download link for the generated audio file (WAV or Opus), and information about the generation (voice mode, file used, generation time, audio duration).
- A button (usually in the navigation bar) switches between light and dark UI themes. The preference is saved in the browser's local storage and also synced to `ui_state.theme` in `config.yaml`.
- The UI attempts to save the last used text, voice mode, selected files, generation parameter values, chunking settings, and theme choice to the `ui_state` section in `config.yaml`. These settings are reloaded when the page is next visited.
The server exposes RESTful API endpoints for programmatic interaction. Interactive documentation (Swagger UI) is available at the /docs path.
- The API is served by FastAPI.
- Currently, the API endpoints do not implement authentication by default (this can be added if needed by modifying `server.py`).
This endpoint is designed to be compatible with the basic OpenAI TTS API structure, facilitating integration with tools expecting this format.
- Request Body: JSON, expected to follow a structure similar to `OpenAITTSRequest`.

| Field | Type | Required | Description | Default (Server-Side) |
|---|---|---|---|---|
| `model` | string | No | Model identifier. Often ignored by self-hosted servers as they use a fixed engine. Can be included for compatibility. | `chatterbox` (example) |
| `input` | string | Yes | The plain text to be synthesized. | |
| `voice` | string | No | Specifies the voice. Maps to either a predefined voice filename (e.g., `"default_sample.wav"`) or a reference audio filename for cloning (e.g., `"my_clone.mp3"`). | Engine default/config |
| `response_format` | string | No | Desired audio output format. Supported: `"wav"`, `"opus"`. | `"wav"` (from config) |
| `speed` | float | No | Playback speed factor (e.g., 0.5 to 2.0). Applied post-generation. | `1.0` |
| `seed` | integer | No | Generation seed for reproducibility. `0` or absent might use default engine randomness. | `0` (from config) |
- Processing Logic (Hypothetical for Chatterbox Server):
  - The server would parse the `voice` parameter. It would need to check whether the `voice` string matches a filename in `predefined_voices_path` or `reference_audio_path` to determine if it is a predefined voice or a clone request.
  - If `voice` corresponds to a predefined voice, `voice_mode="predefined"` and `predefined_voice_id` would be set internally.
  - If `voice` corresponds to a reference audio file, `voice_mode="clone"` and `reference_audio_filename` would be set internally.
  - The `input` text is processed. Chunking is typically applied with default server settings.
  - Generation parameters like temperature, exaggeration, and cfg_weight would use server defaults from `config.yaml`, as they are not standard OpenAI API fields.
  - The `speed` and `seed` parameters, if provided, would be used.
- Response:
  - Success (200 OK): `StreamingResponse` containing binary audio data (media type `audio/wav` or `audio/opus`).
  - Error: Standard FastAPI JSON error response (e.g., 400, 404, 500).
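To make the field mapping above concrete, here is a minimal client-side sketch that assembles a request body for the OpenAI-compatible endpoint. The field names follow the table above; the `build_openai_payload` helper and its defaults are illustrative, not part of the server code.

```python
import json


def build_openai_payload(text: str,
                         voice: str = "default_sample.wav",
                         response_format: str = "wav",
                         speed: float = 1.0,
                         seed: int = 0) -> dict:
    """Assemble a JSON body for POST /v1/audio/speech.

    The "voice" value is assumed to be a filename that the server resolves
    against its predefined-voice or reference-audio directories.
    """
    return {
        "model": "chatterbox",  # informational; typically ignored by the server
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "seed": seed,
    }


payload = build_openai_payload("Hello from Chatterbox.")
print(json.dumps(payload))
```

The resulting dictionary could then be sent with, for example, `requests.post(f"{base_url}/v1/audio/speech", json=payload)` (where `base_url` points at your server instance) and the binary `response.content` written to a `.wav` or `.opus` file.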
This is the primary and most flexible endpoint for TTS generation, offering full control over all available parameters.
- Request Body (`CustomTTSRequest` from `models.py` [1]):

  | Field | Type | Required | Description | Default (from `config.yaml` if not provided) |
  |---|---|---|---|---|
  | `text` | string | Yes | Plain text to be synthesized. | — |
  | `voice_mode` | `"predefined"` \| `"clone"` | No | Specifies the voice generation mode. | `"predefined"` |
  | `predefined_voice_id` | string \| null | Conditional | Filename of the voice from `voices/`. Required if `voice_mode` is `predefined`. | `tts_engine.default_voice_id` |
  | `reference_audio_filename` | string \| null | Conditional | Filename of the audio from `reference_audio/`. Required if `voice_mode` is `clone`. | `null` |
  | `output_format` | `"wav"` \| `"opus"` | No | Desired audio output format. | `audio_output.format` |
  | `split_text` | boolean \| null | No | Enable/disable automatic text chunking. | `true` |
  | `chunk_size` | integer \| null | No | Approximate target character length for chunks (50-500 recommended). | `120` |
  | `temperature` | float \| null | No | Overrides the default temperature. | `generation_defaults.temperature` |
  | `exaggeration` | float \| null | No | Overrides the default exaggeration. | `generation_defaults.exaggeration` |
  | `cfg_weight` | float \| null | No | Overrides the default CFG weight. | `generation_defaults.cfg_weight` |
  | `seed` | integer \| null | No | Overrides the default seed. | `generation_defaults.seed` |
  | `speed_factor` | float \| null | No | Overrides the default speed factor. | `generation_defaults.speed_factor` |
  | `language` | string \| null | No | Overrides the default language. | `generation_defaults.language` |
- Response:
  - Success (200 OK): `StreamingResponse` containing binary audio data (media type `audio/wav` or `audio/opus`) with appropriate `Content-Disposition` headers for download.
  - Error: Standard FastAPI JSON error response (e.g., 400 for bad input, 404 for a missing voice file, 500 for a server error, 503 if the model is not loaded).
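The two "Conditional" fields in the table depend on `voice_mode`. The following sketch captures that rule as a standalone helper; the field names mirror `CustomTTSRequest`, but the function itself is illustrative and not taken from the server code (the server enforces this via Pydantic validation).

```python
from typing import Optional


def check_voice_fields(voice_mode: str = "predefined",
                       predefined_voice_id: Optional[str] = None,
                       reference_audio_filename: Optional[str] = None) -> str:
    """Return the voice filename implied by voice_mode, or raise ValueError.

    Mirrors the conditional requirement in the /tts request schema:
    predefined mode needs predefined_voice_id, clone mode needs
    reference_audio_filename.
    """
    if voice_mode == "predefined":
        if not predefined_voice_id:
            raise ValueError("predefined_voice_id is required when voice_mode='predefined'")
        return predefined_voice_id
    if voice_mode == "clone":
        if not reference_audio_filename:
            raise ValueError("reference_audio_filename is required when voice_mode='clone'")
        return reference_audio_filename
    raise ValueError(f"unknown voice_mode: {voice_mode}")
```

A request omitting the matching field (e.g., `voice_mode="clone"` with no `reference_audio_filename`) would be rejected by the server with a 400-level validation error.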
These endpoints are primarily used by the Web UI to populate dynamic content and manage settings.
- `GET /api/ui/initial-data`:
  - Returns a JSON object containing the full server configuration (stringified paths), lists of available reference files and predefined voices, and UI presets. Crucial for UI initialization.
- `POST /save_settings`:
  - Accepts a partial JSON representation of the configuration. Merges these changes into the current `config.yaml` and saves it.
  - Response: `UpdateStatusResponse` [1] indicating success/failure and whether a restart is needed.
- `POST /reset_settings`:
  - Resets the `config.yaml` file to its hardcoded defaults (from `config.py` [1]).
  - Response: `UpdateStatusResponse` [1].
- `POST /restart_server`:
  - Logs a request to restart the server. The actual restart depends on the deployment environment (e.g., process manager, Docker).
  - Response: `UpdateStatusResponse` [1].
- `GET /get_reference_files`:
  - Returns a JSON list of filenames available in the `reference_audio` directory.
- `GET /get_predefined_voices`:
  - Returns a JSON list of dictionaries, each with `display_name` and `filename`, for voices in the `voices` directory.
- `POST /upload_reference`:
  - Endpoint for uploading reference audio files. Expects `multipart/form-data`.
  - Validates and saves files to `reference_audio_path`.
  - Response: JSON detailing uploaded files, any errors, and the updated list of all reference files.
- `POST /upload_predefined_voice`:
  - Endpoint for uploading predefined voice audio files. Expects `multipart/form-data`.
  - Validates and saves files to `predefined_voices_path`.
  - Response: JSON detailing uploaded files, any errors, and the updated list of all predefined voices.
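Before saving an upload, the server validates the file. The exact rules live in `utils.py`; the sketch below is a hedged illustration of the kind of filename check such an endpoint might perform, using the formats this guide mentions (`.wav`, `.mp3`) as the allowed set.

```python
from pathlib import Path

# Extensions accepted for reference/predefined voice uploads in this sketch;
# the real server's allowed set may differ.
ALLOWED_EXTENSIONS = {".wav", ".mp3"}


def is_allowed_upload(filename: str) -> bool:
    """Accept only bare audio filenames with a supported extension."""
    name = Path(filename).name  # strip any client-supplied directory part
    return (
        name == filename  # reject path-traversal attempts like "../x.wav"
        and Path(name).suffix.lower() in ALLOWED_EXTENSIONS
    )
```

Rejecting any filename that contains a directory component is a common defensive pattern for upload endpoints, since the client fully controls the submitted name.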
This section provides guidance on common issues encountered with the Chatterbox TTS Server.
| Issue | Possible Cause(s) | Suggested Solution(s) |
|---|---|---|
| Server Fails to Start | Port conflict; Python environment issues; missing critical dependencies; `config.yaml` corruption. | Check terminal logs for specific error messages. Ensure the selected port is free. Verify virtual environment activation and `pip install -r requirements.txt`. Delete `config.yaml` to regenerate it on next start. |
| Apple Silicon (MPS) Not Available | macOS version too old; non-Apple Silicon Mac; incorrect PyTorch version; device not configured properly. | Ensure macOS 12.3+, Apple Silicon Mac (M1/M2/M3+). Install PyTorch first: pip install torch torchvision torchaudio. Set device: mps in config.yaml. Verify: python -c "import torch; print(torch.backends.mps.is_available())" |
| Apple Silicon Installation Conflicts | Version conflicts between PyTorch and chatterbox-tts dependencies; ONNX build failures. | Follow exact Apple Silicon installation sequence in Section 4.5.1. Install PyTorch first, then use --no-deps for chatterbox-tts. Use pip install onnx==1.16.0 for compatible ONNX version. |
| "CUDA not available" or Slow Performance | NVIDIA drivers not installed/updated; incorrect PyTorch (CUDA) version; GPU not selected/available. | Follow Section 4.5 GPU Acceleration Setup (NVIDIA). Set tts_engine.device to cuda in config.yaml. Check nvidia-smi. |
| VRAM Out of Memory (OOM) Errors | GPU has insufficient VRAM for the model; other applications consuming GPU memory. | Ensure GPU meets minimum requirements. Close other GPU-heavy applications. If problem persists, consider a GPU with more VRAM. For very long texts, ensure chunking is active and chunk_size is reasonable. |
| Model Download Fails | Internet connectivity issues; Hugging Face Hub issues; incorrect `model.repo_id` in `config.yaml`; cache problems. | Check internet connection. Verify `model.repo_id`. Try clearing the Hugging Face cache (`HF_HOME` or default location). |
| Voice Cloning Poor Quality/Fails | Poor quality reference audio (noise, reverb); reference audio too short/long; incorrect file format. | Use clean, clear reference audio (5-20 seconds typical). Ensure .wav or .mp3 format. Check audio_output.max_reference_duration_sec. Experiment with generation parameters. |
| Predefined Voice Not Found | Voice file missing from `voices/` directory; incorrect filename in UI/API. | Verify the file exists in the path specified by `tts_engine.predefined_voices_path`. Ensure the correct filename is used. Use the "Refresh" button in the UI. |
| Audio Output Issues (No sound, distorted) | Incorrect audio processing settings; sample rate mismatch; TTS engine error. | Check audio_output.sample_rate and audio_output.format in config.yaml. Review server logs for synthesis errors. Try simpler text or different voice. Disable optional audio post-processing features to isolate. |
| UI Not Loading or Behaving Erratically | JavaScript errors; browser cache issues; API connectivity problems. | Clear browser cache and cookies. Check browser's developer console (F12) for errors. Ensure server is running and accessible. |
| Configuration Changes Not Taking Effect | Server not restarted after critical changes (host, port, paths, model settings). | Restart the server application after modifying these types of settings in config.yaml. |
| File Upload Failures | Incorrect file type; file too large (if server imposes limits); permissions issues on server. | Ensure uploading supported formats (.wav, .mp3). Check server logs for detailed error. Verify write permissions for reference_audio/ and voices/ directories. |
- The primary server log file is specified by `server.log_file_path` in `config.yaml` (default: `logs/tts_server.log` [1]).
- Logs are rotated based on `log_file_max_size_mb` and `log_file_backup_count`.
- Review these logs for detailed error messages and operational information. Standard output in the terminal also provides real-time logging.
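Size-based rotation of this kind maps directly onto the standard library's `RotatingFileHandler`. The sketch below shows the scheme; the setting names mirror `config.yaml`, but wiring them up this way is an assumption about the server's internals, not a copy of its code.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_rotating_handler(path: str,
                          max_size_mb: int = 10,
                          backup_count: int = 5) -> RotatingFileHandler:
    """Build a log handler that rotates like the server's config describes."""
    handler = RotatingFileHandler(
        path,
        maxBytes=max_size_mb * 1024 * 1024,  # rotate once the file exceeds this
        backupCount=backup_count,            # keep this many rotated backups
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    return handler
```

With `backupCount=5`, rotated files are kept as `tts_server.log.1` through `tts_server.log.5`, with the oldest discarded on each rollover.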
This section outlines the software architecture of the Chatterbox TTS Server.
- `server.py` [1]:
  - The main application entry point, built with FastAPI.
  - Defines all API endpoints (e.g., `/tts`, `/api/ui/initial-data`, configuration endpoints).
  - Handles incoming HTTP requests, validating them with Pydantic models (from `models.py` [1]).
  - Serves the static files for the Web UI (`ui/` directory [1]).
  - Orchestrates the TTS generation process by calling `engine.py` and `utils.py` functions.
  - Manages application lifecycle events (startup, shutdown), including model loading.
- `engine.py` [1]:
  - Responsible for loading and managing the `chatterbox-tts` model instance.
  - `load_model()`: Initializes `ChatterboxTTS.from_pretrained()`, handling device selection (CUDA/CPU).
  - `synthesize()`: Takes text and generation parameters, invokes the core `chatterbox_model.generate()` method, and returns the audio tensor.
- `config.py` [1]:
  - Implements the `YamlConfigManager` class for loading, saving, and accessing configuration from `config.yaml`.
  - Defines the default configuration structure (`DEFAULT_CONFIG`).
  - Provides convenient accessor functions (e.g., `get_port()`, `get_model_repo_id()`) for other modules to retrieve settings.
- `utils.py` [1]: Contains a collection of helper functions:
  - Text processing: `split_into_sentences()`, `chunk_text_by_sentences()` for preparing text for TTS.
  - Audio processing: `encode_audio()` (to WAV/Opus), `save_audio_to_file()`, `apply_speed_factor()`, optional silence trimming and unvoiced segment removal functions.
  - File system utilities: `get_valid_reference_files()`, `get_predefined_voices()`, `sanitize_filename()`, `validate_reference_audio()`.
  - `PerformanceMonitor` class.
- `models.py` [1]:
  - Defines Pydantic models used for API request body validation and for structuring API responses (e.g., `CustomTTSRequest`, `ErrorResponse`).
- `ui/` directory [1]:
  - `index.html`: The main HTML file for the single-page Web UI.
  - `script.js`: Client-side JavaScript that handles all UI logic, interacts with the server's API endpoints, manages audio playback with WaveSurfer.js, and updates the DOM dynamically.
  - `presets.yaml`: Contains example texts and parameters for the UI's preset feature.
- External libraries (e.g., `chatterbox-tts`, `fastapi`, `torch`, `librosa`, `soundfile`) provide core functionalities.
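To illustrate the text-processing helpers, here is a self-contained re-implementation in the spirit of `chunk_text_by_sentences()`: split on sentence boundaries, then greedily pack whole sentences into chunks of roughly `chunk_size` characters. The real function's splitting rules may differ; this is only a sketch of the approach.

```python
import re


def chunk_text(text: str, chunk_size: int = 120) -> list:
    """Greedily pack whole sentences into chunks of ~chunk_size characters.

    Sentences are never split, so a chunk may exceed chunk_size if a single
    sentence is longer than the target.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Keeping sentences intact matters for TTS: chunk boundaries that cut mid-sentence produce audible prosody breaks when the audio segments are concatenated.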
1. Client Request: The user (via the Web UI or an API client) sends a POST request to `/tts` with a JSON payload (`CustomTTSRequest`).
2. FastAPI (`server.py`):
   - Receives and validates the request against the `CustomTTSRequest` model.
   - Extracts text, voice mode, generation parameters, and chunking options.
3. Text Processing (`utils.py`):
   - If `split_text` is true, `chunk_text_by_sentences()` is called to divide the input text into manageable chunks.
4. TTS Engine (`engine.py`), for each text chunk:
   - `server.py` determines the `audio_prompt_path` based on `voice_mode` (predefined or clone).
   - `engine.synthesize()` is called with the chunk text, audio prompt path, and generation parameters.
   - `engine.synthesize()` invokes `chatterbox_model.generate()`.
   - The raw audio tensor is returned.
5. Audio Processing (`utils.py`, orchestrated in `server.py`):
   - The audio tensor from the engine is converted to a NumPy array.
   - The speed factor is applied via `apply_speed_factor()`.
   - Optional post-processing (silence trimming, etc.) is applied if configured.
   - Processed audio segments (if chunked) are concatenated.
6. Encoding (`utils.py`):
   - The final NumPy audio array is encoded into the desired `output_format` (WAV or Opus) by `encode_audio()`, which also handles resampling to the target output sample rate.
7. FastAPI Response (`server.py`):
   - The encoded audio bytes are streamed back to the client as a `StreamingResponse` with the appropriate media type and download headers.
8. Client (Web UI, `script.js`):
   - Receives the audio blob.
   - Creates an object URL for the blob.
   - Initializes WaveSurfer.js to play and visualize the audio.
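The speed-adjustment step of this flow can be sketched as simple time-stretching by resampling with linear interpolation. Note this naive method also shifts pitch; the real `apply_speed_factor()` may use a different, pitch-preserving technique, so treat this only as an illustration of the length/speed relationship.

```python
import numpy as np


def apply_speed(audio: np.ndarray, speed_factor: float) -> np.ndarray:
    """Return audio shortened (speed_factor > 1) or lengthened (< 1).

    Resamples the waveform via linear interpolation; a 2.0 factor halves
    the number of samples, so playback at the same rate sounds twice as fast.
    """
    if speed_factor == 1.0:
        return audio
    new_length = max(1, int(round(len(audio) / speed_factor)))
    # Map each new sample position back onto the original sample grid.
    old_idx = np.linspace(0, len(audio) - 1, num=new_length)
    return np.interp(old_idx, np.arange(len(audio)), audio)
```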
While this project does not include a formal automated test suite in the provided codebase, testing can be approached through several methods:
- Manual UI Testing:
- Thoroughly test all UI elements: text input, sliders, dropdowns, buttons, file uploads, audio player.
- Test with various text lengths, including very short and very long inputs (to verify chunking).
- Test different voice modes (predefined, clone) with valid and invalid selections.
- Verify session persistence of UI settings.
- Test theme switching.
- Check behavior across different browsers (e.g., Chrome, Firefox, Edge).
- API Endpoint Testing:
  - Use tools like Swagger UI (at `/docs`), Postman, or `curl` to send requests to all API endpoints.
  - Test `/tts` with various valid and invalid parameter combinations:
    - Different `output_format` values.
    - Chunking enabled and disabled.
    - Different generation parameters (seed, temperature, etc.).
    - Valid and invalid `predefined_voice_id` and `reference_audio_filename`.
  - Test `/v1/audio/speech` (OpenAI compatible) similarly, focusing on its specific parameter mapping.
  - Test configuration endpoints (`/save_settings`, `/reset_settings`) and verify the changes in `config.yaml`.
  - Test file upload endpoints (`/upload_reference`, `/upload_predefined_voice`) with valid and invalid file types/sizes.
  - Test helper endpoints (`/get_reference_files`, etc.).
- Output Audio Quality Assessment:
- Listen to generated audio for clarity, naturalness, artifacts, and correctness based on input text and parameters.
- Verify that speed factor, silence trimming, and other audio processing features work as expected.
- Configuration Testing:
  - Modify `config.yaml` with different valid and invalid values to ensure the server handles them gracefully (e.g., falls back to defaults, logs errors).
  - Test server startup with a missing `config.yaml` to verify default generation.
- Performance Testing (Basic):
  - Measure response times for API requests, especially for long text synthesis.
  - Monitor CPU, GPU, and RAM usage under load. The `PerformanceMonitor` class in `utils.py` [1] can be enabled for more detailed internal timings.
- Docker Deployment Testing:
  - Build the Docker image and run the container using `docker-compose.yml`.
  - Verify all functionalities within the containerized environment, including volume mounts and GPU access (if applicable).
For a more robust setup, unit tests (e.g., using `pytest`) could be added for functions in `utils.py` and `config.py`, and integration tests could be written for API endpoints using FastAPI's `TestClient`.
- License: This project is licensed under the MIT License.
- Disclaimer: This software is provided "as is," without warranty of any kind, express or implied. The developers and contributors are not liable for any claim, damages, or other liability arising from the use of this software. Users are responsible for ensuring that their use of this TTS server and any generated audio complies with all applicable laws, regulations, and ethical guidelines, including those related to copyright, privacy, and voice impersonation. It is strongly recommended to use this technology responsibly, primarily with synthetic voices or with explicit consent when using voice cloning features that might resemble real individuals. The project authors disclaim responsibility for any misuse of this technology.