A production-ready local deployment of Chatterbox TTS by Resemble AI with FastAPI backend and Streamlit frontend, featuring voice cloning capabilities and optimized for Apple Silicon Macs.
- 🎯 State-of-the-art TTS: Based on Resemble AI's Chatterbox model
- 🎭 Voice Cloning: Upload reference audio to clone any voice
- 🚀 Apple Silicon Optimized: Automatic MPS acceleration on M1/M2/M3/M4 Macs
- 🔄 FastAPI Backend: RESTful API for easy integration
- 🎨 Beautiful UI: Streamlit-based web interface
- ⚙️ Advanced Controls: Emotion exaggeration, temperature, CFG weight, and more
- 📦 Easy Setup: One-command installation with UV package manager
- 🔒 Secure: Isolated virtual environment with pinned dependencies
The Streamlit interface provides an intuitive way to generate speech with various parameters:
- Text Input: Support for up to 500 characters
- Voice Cloning: Optional reference audio upload
- Parameter Controls: Exaggeration, CFG/Pace, temperature, and advanced sampling options
- Real-time Preview: Instant audio playback and download
FastAPI automatically generates interactive API documentation, available at http://localhost:8000/docs.
- Python: 3.9 or higher
- macOS: Recommended (optimized for Apple Silicon)
- UV Package Manager: For fast, reliable dependency management
- Git: For cloning the repository
```bash
# Install the UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/aryateja2106/ChatterBox-TTS.git
cd ChatterBox-TTS

# Create virtual environment with UV
uv venv chatterbox-env

# Activate the environment
source chatterbox-env/bin/activate

# Install all dependencies using UV (much faster than pip)
uv pip install --python chatterbox-env/bin/python -r requirements.txt

# Install the Chatterbox package in development mode
uv pip install --python chatterbox-env/bin/python -e . --no-deps

# Make scripts executable
chmod +x run_fastapi.sh run_streamlit.sh
```
```bash
# Start FastAPI server (in background)
./run_fastapi.sh &

# Start Streamlit app (in foreground)
./run_streamlit.sh
```

Or start each service manually:

```bash
# Terminal 1: Start FastAPI server
source chatterbox-env/bin/activate
python fastapi_tts_server.py

# Terminal 2: Start Streamlit app
source chatterbox-env/bin/activate
streamlit run streamlit_app.py
```

- Streamlit UI: http://localhost:8501
- FastAPI Docs: http://localhost:8000/docs
- API Health Check: http://localhost:8000/health
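Before opening the UI, you can poll the health endpoint from Python; a minimal sketch using only the standard library (the `server_ready` helper name is ours, and it assumes `/health` answers HTTP 200 when the backend is ready):

```python
from urllib.request import urlopen
from urllib.error import URLError

def server_ready(url="http://localhost:8000/health", timeout=2):
    """Return True if the FastAPI backend answers its health endpoint."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Nothing listening yet, or the request timed out
        return False

print("backend up:", server_ready())
```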
- RAM: 8GB minimum, 16GB recommended
- Storage: 5GB free space for models
- Network: Internet connection for initial model download
On the first run, the system will:
- Download the Chatterbox TTS models (~3.2GB total)
- Initialize the voice encoder and speech tokenizer
- Load the models into memory
Note: Initial model download may take 5-10 minutes depending on your internet connection.
- Apple Silicon: Automatically uses MPS (Metal Performance Shaders) for GPU acceleration; typical generation time is 5-15 seconds for moderate text length
- Without MPS: Falls back to CPU processing; typical generation time is 15-45 seconds for moderate text length
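The fallback described above can be checked from your own Python session; a small sketch (the `pick_device` helper is ours, not part of the project) that mirrors the MPS-then-CPU order:

```python
def pick_device():
    """Mirror the server's fallback: MPS on Apple Silicon, else CPU."""
    try:
        import torch
        if torch.backends.mps.is_available():
            return "mps"
    except ImportError:
        pass  # torch not installed in this interpreter
    return "cpu"

print(f"Using device: {pick_device()}")
```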
```bash
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a test of Chatterbox TTS!",
    "exaggeration": 0.5,
    "cfg_weight": 0.5,
    "temperature": 0.8
  }'
```

```bash
curl -X POST "http://localhost:8000/synthesize_with_voice" \
  -F "text=Hello, this is my cloned voice!" \
  -F "voice_file=@reference_audio.wav" \
  -F "exaggeration=0.7" \
  -F "cfg_weight=0.3"
```

```python
import requests
import base64

# Basic TTS
response = requests.post(
    "http://localhost:8000/synthesize",
    json={
        "text": "Your text here",
        "exaggeration": 0.5,
        "cfg_weight": 0.5
    }
)

if response.status_code == 200:
    data = response.json()
    audio_bytes = base64.b64decode(data["audio_base64"])
    # Save audio file
    with open("output.wav", "wb") as f:
        f.write(audio_bytes)
```

| Parameter | Range | Default | Description |
|---|---|---|---|
| `exaggeration` | 0.25-2.0 | 0.5 | Controls emotional intensity and expression |
| `cfg_weight` | 0.0-1.0 | 0.5 | Classifier-free guidance weight (affects pacing) |
| `temperature` | 0.05-5.0 | 0.8 | Sampling temperature (creativity vs consistency) |
| Parameter | Range | Default | Description |
|---|---|---|---|
| `repetition_penalty` | 1.0-2.0 | 1.2 | Penalty for token repetition |
| `min_p` | 0.0-1.0 | 0.05 | Minimum probability threshold |
| `top_p` | 0.0-1.0 | 1.0 | Nucleus sampling parameter |
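Whether `fastapi_tts_server.py` accepts these advanced fields in the `/synthesize` body is an assumption on our part; this sketch (the `build_payload` helper is hypothetical) just assembles a request body while clamping each value to its documented range:

```python
def clamp(value, lo, hi):
    """Keep a sampling parameter inside its documented range."""
    return max(lo, min(hi, value))

def build_payload(text, repetition_penalty=1.2, min_p=0.05, top_p=1.0):
    """Request body for /synthesize including advanced sampling parameters."""
    return {
        "text": text[:500],  # the UI enforces a 500-character limit
        "repetition_penalty": clamp(repetition_penalty, 1.0, 2.0),
        "min_p": clamp(min_p, 0.0, 1.0),
        "top_p": clamp(top_p, 0.0, 1.0),
    }

print(build_payload("Testing advanced sampling.", repetition_penalty=2.5))
```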
- For natural speech: `exaggeration=0.5`, `cfg_weight=0.5`
- For expressive speech: `exaggeration=0.7-1.0`, `cfg_weight=0.3-0.4`
- For fast speakers: Lower `cfg_weight` to 0.3
- For dramatic content: Higher `exaggeration` (0.8+)
Best Practices:
- Duration: 3-30 seconds (optimal: 5-15 seconds)
- Quality: Clear, noise-free recording
- Content: Single speaker, natural speech
- Format: WAV preferred, MP3/FLAC/M4A supported
Supported Formats:
- WAV (recommended)
- MP3
- FLAC
- M4A
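For WAV files, the recommended clip duration can be checked with the standard library before uploading; a small sketch (the `reference_ok` helper name is ours):

```python
import wave

def reference_ok(path, min_s=3.0, max_s=30.0):
    """Return (ok, seconds) for a WAV clip against the 3-30 s guideline."""
    with wave.open(path, "rb") as w:
        seconds = w.getnframes() / w.getframerate()
    return min_s <= seconds <= max_s, seconds

# Example: ok, secs = reference_ok("reference_audio.wav")
```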
- Record/Upload Reference: Use a clear sample of the target voice
- Set Parameters: Adjust `exaggeration` and `cfg_weight` for best results
- Generate: Process your text with the cloned voice
- Fine-tune: Adjust parameters if needed for better quality
```bash
# Check if port 8000 is already in use
lsof -i :8000

# Kill existing process if needed
kill -9 <PID>
```

```bash
# Clear cache and retry
rm -rf ~/.cache/huggingface/
python test_tts.py
```

- Reduce batch size: Use shorter text inputs
- Close other applications: Free up RAM
- Check available memory:

```bash
# macOS
vm_stat
```
If you see "MPS not available" on Apple Silicon:
- Update to macOS 12.3+
- Update PyTorch: `pip install torch torchaudio --upgrade`
- Check device: Verify MPS is being used (check logs)
- Reduce text length: Break long texts into smaller chunks
- Adjust parameters: Lower `temperature` and `exaggeration`
- Check reference audio: Ensure it's clear and noise-free
- Adjust parameters: Try different `cfg_weight` values
- Experiment with settings: Test various parameter combinations
Enable debug logging by setting an environment variable:

```bash
export CHATTERBOX_DEBUG=1
python fastapi_tts_server.py
```

```
chatterbox-tts/
├── src/chatterbox/          # Core TTS package
├── fastapi_tts_server.py    # FastAPI backend server
├── streamlit_app.py         # Streamlit frontend
├── requirements.txt         # Python dependencies
├── test_tts.py              # Basic functionality test
├── run_fastapi.sh           # FastAPI startup script
├── run_streamlit.sh         # Streamlit startup script
├── chatterbox-env/          # Virtual environment
└── README.md                # This file
```
- Local Only: Servers bind to localhost by default
- File Upload: Reference audio files are processed locally and cleaned up
- No Data Persistence: Generated audio is not stored permanently
- Isolated Environment: Uses virtual environment for dependency isolation
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make your changes
- Test thoroughly
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Resemble AI for the original Chatterbox TTS model
- Hugging Face for model hosting and transformers library
- FastAPI and Streamlit communities for excellent frameworks
- Original Chatterbox Contributors:
Every audio file generated by Chatterbox includes Resemble AI's Perth (Perceptual Threshold) Watermarker - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
```python
import perth
import librosa

AUDIO_PATH = "YOUR_FILE.wav"

# Load the watermarked audio
watermarked_audio, sr = librosa.load(AUDIO_PATH, sr=None)

# Initialize watermarker (same as used for embedding)
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")
# Output: 0.0 (no watermark) or 1.0 (watermarked)
```

- Issues: Open a GitHub issue for bugs or feature requests
- Discussions: Use GitHub Discussions for questions and community support
- Original Discord: 👋 Join Resemble AI's Discord for model-specific questions
- Docker containerization
- Multiple voice presets
- Batch processing capabilities
- Real-time streaming
- Integration examples for popular frameworks
This tool is intended for legitimate and ethical use cases only. Please ensure you have proper consent before cloning someone's voice. The original training data comes from freely available sources on the internet.
Made with ❤️ for the open source community