LocalSotaTalk is a comprehensive Text-to-Speech (TTS) backend service supporting multiple frameworks. It provides a unified REST API for voice synthesis with support for voice cloning and voice design. The project focuses on low-cost state-of-the-art (SOTA) models that support voice cloning.
It currently supports these frameworks:
- OmniVoice
- LongCat-AudioDiT
- Multi-Framework Support: Seamlessly works with OmniVoice (600+ languages) and LongCat-AudioDiT models
- Voice ID System: Simple speaker management using `voice_id` instead of file paths
- Voice Design Support: OmniVoice supports voice design via textual descriptions
- RESTful API: Simple API, which is compatible with daswer123/xtts-api-server
- Cross-Origin Support: Built-in CORS middleware for web applications
- Automatic Speaker Detection: Scans the `samples/` directory for audio and design files
- Model Switching: Hot-swap between different TTS models at runtime
- Python 3.8+
- NVIDIA GPU with CUDA support (recommended)
- Git
- Clone the repository

```shell
git clone https://github.com/MoRanYue/LocalSotaTalk.git
cd LocalSotaTalk
```

- Initialize submodules

```shell
git submodule update --init --recursive
```

- Create a virtual environment and install dependencies
```shell
# Create virtual environment
python -m venv .venv
# Or use UV
# uv venv -p 3.10

# Activate on Windows
.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

- Prepare the samples directory
```shell
# Create samples directory
mkdir samples

# Add your speaker files (optional)
# Example: paimon.wav + paimon.txt for voice cloning
# Example: flba.design.txt for voice design
```

Start the server:

```shell
# Start with OmniVoice (default)
python main.py --model k2-fsa/OmniVoice --host 0.0.0.0 --port 8000

# Start with LongCat-AudioDiT
python main.py --model meituan-longcat/LongCat-AudioDiT-1B --host 0.0.0.0 --port 8000

# With custom samples directory
python main.py --model k2-fsa/OmniVoice --samples-dir ./my_speakers --port 8000
```

Once running, access:
- API Documentation: http://localhost:8000/docs
- OpenAPI Specification: http://localhost:8000/openapi.json
LocalSotaTalk automatically scans the `samples/` directory for speaker files. Two types of speakers are supported:
For traditional voice cloning, add these files:

- `{speaker_id}.wav` - Reference audio file (supports .wav, .mp3, .flac)
- `{speaker_id}.txt` - Transcript of the reference audio (optional but recommended)

Example: `samples/paimon.wav` + `samples/paimon.txt` creates a speaker with `voice_id="paimon"`.
For voice design (OmniVoice only), add:
- `{speaker_id}.design.txt` - Text description of the voice characteristics

Example: `samples/flba.design.txt` with content:

```
female, low pitch, british accent
```

The description above is an example for OmniVoice.
- Files are automatically associated by their base name
- `.wav` files take priority over `.design.txt` files
- The system supports mixed configurations (some speakers with audio, others with design)
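The association rules above can be sketched in Python. This is an illustrative reconstruction, not the project's actual scanner; the helper names and the `audio_only` type are assumptions, while the `audio_with_text` and `design_only` types mirror the `/speakers` response shown below.

```python
from pathlib import Path

# Extensions accepted as reference audio (per the README)
AUDIO_EXTS = {".wav", ".mp3", ".flac"}

def scan_samples(samples_dir):
    """Associate files by base name; a .wav entry overrides a .design.txt one."""
    speakers = {}
    for path in sorted(Path(samples_dir).iterdir()):
        if path.suffix.lower() in AUDIO_EXTS:
            voice_id = path.stem
            has_transcript = path.with_suffix(".txt").exists()
            # Unconditional assignment: audio takes priority over a design entry
            speakers[voice_id] = {
                "voice_id": voice_id,
                "type": "audio_with_text" if has_transcript else "audio_only",
                "file_path": str(path),
            }
        elif path.name.endswith(".design.txt"):
            voice_id = path.name[: -len(".design.txt")]
            # setdefault keeps any existing audio entry for the same base name
            speakers.setdefault(voice_id, {
                "voice_id": voice_id,
                "type": "design_only",
                "file_path": str(path),
            })
    return list(speakers.values())
```

Plain `{speaker_id}.txt` transcripts match neither branch, so they only influence the `type` of their audio counterpart.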
`GET /speakers` - Returns detailed information about all available speakers.
Response:
```json
[
  {
    "name": "paimon",
    "voice_id": "paimon",
    "type": "audio_with_text",
    "file_path": "samples/paimon.wav",
    "text_path": "samples/paimon.txt",
    "design_description": null
  },
  {
    "name": "flba",
    "voice_id": "flba",
    "type": "design_only",
    "file_path": "samples/flba.design.txt",
    "text_path": null,
    "design_description": "female, low pitch, british accent"
  }
]
```

`POST /tts_to_audio/` - Synthesize speech and return an audio stream.
Request:
```json
{
  "text": "Hello, this is a test",
  "speaker_wav": "paimon",
  "language": "en"
}
```

Parameters:

- `text`: Text to synthesize (required)
- `speaker_wav`: Speaker `voice_id` or audio file path (required)
- `language`: Language code (optional, default: `"en"`)

Response: `audio/wav` stream with headers:

- `Content-Type`: audio/wav
- `Duration`: Audio duration in seconds
- `Sample-Rate`: 24000 Hz
Synthesize speech and save the result to a file.
Request:
```json
{
  "text": "Hello, this is a test",
  "speaker_wav": "paimon",
  "language": "en",
  "file_name_or_path": "output.wav"
}
```

Response:

```json
{
  "file_path": "output/output.wav",
  "duration": 2.32,
  "sample_rate": 24000
}
```

Other endpoints:

- `GET /languages` - Get supported languages
- `GET /get_models_list` - Get available models
- `POST /switch_model` - Switch the TTS model
- `POST /set_tts_settings` - Update TTS parameters
- `GET /get_folders` - Get current folder paths
- `POST /set_output` - Set the output directory
- `POST /set_speaker_folder` - Set the samples directory
```python
import requests

# TTS synthesis
def tts_synthesis(text, voice_id="paimon", language="en"):
    url = "http://localhost:8000/tts_to_audio/"
    data = {
        "text": text,
        "speaker_wav": voice_id,
        "language": language,
    }
    response = requests.post(url, json=data)
    if response.status_code == 200:
        with open("output.wav", "wb") as f:
            f.write(response.content)
        print("Audio saved to output.wav")
    else:
        print(f"Error: {response.status_code}")
        print(response.json())

# Get available speakers
def get_speakers():
    response = requests.get("http://localhost:8000/speakers")
    speakers = response.json()
    for speaker in speakers:
        print(f"{speaker['voice_id']}: {speaker['type']}")
```

```shell
# Get speakers list
curl -X GET http://localhost:8000/speakers

# Synthesize with voice_id
curl -X POST http://localhost:8000/tts_to_audio/ \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello world","speaker_wav":"paimon","language":"en"}' \
  --output output.wav

# Synthesize with voice design
curl -X POST http://localhost:8000/tts_to_audio/ \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello world","speaker_wav":"flba","language":"en"}' \
  --output design_output.wav
```

```
LocalSotaTalk/
├── api/                      # FastAPI endpoints and schemas
├── models/                   # TTS adapter implementations
│   ├── base_adapter.py       # Abstract base class
│   ├── omnivoice_adapter.py  # OmniVoice implementation
│   ├── longcat_adapter.py    # LongCat-AudioDiT implementation
│   └── manager.py            # Model manager
├── utils/                    # Utility functions
├── systems/                  # Framework submodules
│   ├── OmniVoice/            # OmniVoice framework
│   └── LongCat-AudioDiT/     # LongCat framework
├── samples/                  # Speaker samples directory
├── output/                   # Generated audio files
└── config.py                 # Configuration management
```
| Feature | OmniVoice | LongCat-AudioDiT |
|---|---|---|
| Voice Cloning | ✅ | ✅ |
| Voice Design | ✅ | ❌ |
| Languages | 600+ | Chinese, English |
| Reference Audio | Optional | Required |
| Reference Text | Optional | Recommended |
| Inference Speed | Fast | Medium |
```shell
python main.py --help
```

```
--model MODEL        Model repository (default: k2-fsa/OmniVoice)
--samples-dir DIR    Samples directory (default: samples)
--output-dir DIR     Output directory (default: output)
--host HOST          Server host (default: 0.0.0.0)
--port PORT          Server port (default: 8000)
--log-level LEVEL    Log level (default: info)
```

Supported models:

- `k2-fsa/OmniVoice` - Multi-language voice cloning and design
- `meituan-longcat/LongCat-AudioDiT-1B` - Chinese voice cloning
- `meituan-longcat/LongCat-AudioDiT-3.5B` - Chinese voice cloning (larger)
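The command-line options above could be declared with `argparse`. This is a hypothetical sketch of how `main.py` might define them; the project's actual implementation may differ.

```python
import argparse

def build_parser():
    # Mirrors the options and defaults listed in the help text above
    p = argparse.ArgumentParser(description="LocalSotaTalk TTS server")
    p.add_argument("--model", default="k2-fsa/OmniVoice", help="Model repository")
    p.add_argument("--samples-dir", default="samples", help="Samples directory")
    p.add_argument("--output-dir", default="output", help="Output directory")
    p.add_argument("--host", default="0.0.0.0", help="Server host")
    p.add_argument("--port", type=int, default=8000, help="Server port")
    p.add_argument("--log-level", default="info", help="Log level")
    return p
```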
- "Speaker with voice_id 'xxx' not found"
  - Check that the files exist in the `samples/` directory
  - Verify file naming: `{voice_id}.wav` or `{voice_id}.design.txt`
  - Ensure the server has read permissions
- "Current model does not support voice design"
  - LongCat-AudioDiT does not support voice design
  - Switch to OmniVoice: `POST /switch_model` with `{"model_name": "k2-fsa/OmniVoice"}`
- Model loading failures
  - Ensure CUDA is properly installed
  - Check the internet connection for model downloads
  - Verify there is sufficient GPU memory
- Audio quality issues
  - For voice cloning, ensure the reference audio is clean
  - For voice design, use clear, specific descriptions
  - Adjust TTS settings via `POST /set_tts_settings`
- Check the server console output for detailed error messages
- Enable debug logging: `--log-level debug`
- Logs are written to the console and optionally to a file
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Commit your changes: `git commit -am 'Add feature'`
- Push to the branch: `git push origin feature-name`
- Submit a pull request
```shell
# Install development dependencies
pip install -e .

# Run tests
python -m pytest tests/

# Check code style
python -m black .
python -m flake8 .
```

This project is licensed under the MIT License; see the LICENSE file for details.
- OmniVoice by Xiaomi
- LongCat-AudioDiT by Meituan
- xtts-api-server by daswer123
- All contributors and the open-source community
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Note: This project is under development. API changes may occur between minor versions.