REST API for text-to-speech synthesis using the Moshi model from Kyutai Labs, with interactive Swagger documentation and Docker deployment.
## Features

- 🌐 Bilingual Support: French and English
- 🎤 44 Voice Presets: VCTK, CML-TTS French, Expresso emotions, EARS speakers
- 🎭 Emotional Speech: Happy, angry, calm, confused, whisper, and more
- 📖 Swagger Documentation: Interactive interface to test the API
- 🎵 High-Quality Audio: 24kHz in WAV or RAW format
- 🚀 GPU Support: Automatic CUDA acceleration
- 🔒 Secure: Non-root user, input validation
- 📦 Docker: Simple and reproducible deployment
- 🔄 RESTful API: Well-structured endpoints with OpenAPI
- 📊 Health Checks: Service status monitoring
## Prerequisites

- Docker installed
- NVIDIA Docker Runtime (optional, for GPU support)
- At least 8GB RAM
- ~10GB disk space for the model
## Quick Start

### Option 1: Pre-built Docker Image (Recommended)

The fastest way to get started! No need to clone or build.

```bash
docker run -d --name moshi-tts-api \
-p 8000:8000 \
-v moshi-models:/app/models \
--gpus all \
  mmaudet/moshi-tts-api:latest
```

### macOS (Apple Silicon): Native Installation with MLX

For Mac M1/M2/M3/M4/M5 - uses MLX with Metal GPU acceleration:

```bash
# Clone the repository
git clone https://github.com/mmaudet/moshi-tts-api.git
cd moshi-tts-api
# Run the installation script
./install-macos-mlx.sh
# Activate the virtual environment
source venv-moshi-mlx/bin/activate
# Start the API server
python3 -m uvicorn app:app --host 0.0.0.0 --port 8000
```

**Why native installation for Mac?**
- 🚀 Best performance - Direct Metal GPU access (not possible in Docker)
- ⚡ MLX framework - Apple's optimized ML framework for M-series chips
- 💪 No Docker overhead - Native macOS performance
**Note:** MLX requires macOS and cannot run in Docker containers (Metal framework limitation).
Access the API at: http://localhost:8000/docs
### Option 2: Build from Source

**Clone the project:**

```bash
git clone https://github.com/mmaudet/moshi-tts-api.git
cd moshi-tts-api
```

**Quick build and launch:**

```bash
chmod +x build-and-run.sh
./build-and-run.sh
```

Or manually:

```bash
# Build
docker build -t moshi-tts-api:latest .
# Run with GPU
docker run -d --name moshi-tts-api \
-p 8000:8000 \
-v $(pwd)/models:/app/models \
--gpus all \
  moshi-tts-api:latest
```
### Option 3: With Docker Compose
**Using pre-built image** (update `docker-compose.yml`):
```yaml
services:
  moshi-tts-api:
    image: mmaudet/moshi-tts-api:latest
    # Remove the 'build: .' line
```
## API Documentation

Once the API is started, access the interactive documentation:

- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI JSON: http://localhost:8000/openapi.json
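The OpenAPI JSON is also handy programmatically, e.g. for generating clients or sanity-checking a deployment. A minimal sketch with `requests` (nothing project-specific here beyond the documented URL; serving `/openapi.json` is standard FastAPI behavior):

```python
import requests

# Fetch the machine-readable API description served at /openapi.json
spec = requests.get("http://localhost:8000/openapi.json", timeout=10).json()

# Print every documented route and its HTTP methods
for path, methods in spec["paths"].items():
    print(path, "->", ", ".join(m.upper() for m in methods))
```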
## Usage Examples

**Run the test script:**

```bash
chmod +x test_api.sh
./test_api.sh
```

**French synthesis:**

```bash
curl -X POST http://localhost:8000/api/v1/tts \
-H "Content-Type: application/json" \
-d '{
"text": "Bonjour, je suis Moshi, votre assistant vocal.",
"language": "fr"
}' \
--output bonjour.wav
```

**English synthesis:**

```bash
curl -X POST http://localhost:8000/api/v1/tts \
-H "Content-Type: application/json" \
-d '{
"text": "Hello, I am Moshi, your voice assistant.",
"language": "en"
}' \
--output hello.wav
```

**Custom voice:**

```bash
curl -X POST http://localhost:8000/api/v1/tts \
-H "Content-Type: application/json" \
-d '{
"text": "Hello with a different voice.",
"language": "en",
"voice": "vctk_p226"
}' \
--output custom_voice.wav
```

**RAW format (with conversion to WAV):**

```bash
curl -X POST http://localhost:8000/api/v1/tts \
-H "Content-Type: application/json" \
-d '{
"text": "Test audio",
"language": "en",
"format": "raw"
}' \
--output test.raw
# Convert RAW to WAV
ffmpeg -f s16le -ar 24000 -ac 1 -i test.raw output.wav
```
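If ffmpeg is not available, the same container wrap can be done with Python's standard-library `wave` module; a minimal sketch assuming the documented output format (16-bit signed little-endian PCM, mono, 24 kHz):

```python
import wave

# Read the raw s16le PCM bytes produced by the API
with open("test.raw", "rb") as raw:
    pcm = raw.read()

# Wrap them in a WAV container with the API's documented parameters
with wave.open("output.wav", "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 16-bit = 2 bytes per sample
    wav.setframerate(24000)  # matches SAMPLE_RATE=24000
    wav.writeframes(pcm)
```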
## API Endpoints

### GET / - API information

```bash
curl http://localhost:8000/
```

### GET /api/v1/health - Health check

```bash
curl http://localhost:8000/api/v1/health
```

Response:

```json
{
  "status": "healthy",
  "model_loaded": true,
  "device": "cuda",
  "available_languages": ["fr", "en"],
  "api_version": "1.0.0",
  "timestamp": "2024-01-01T12:00:00Z"
}
```
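Because the first request triggers model loading, a client may want to wait for `model_loaded` before sending work. A minimal polling sketch against the response shape above (the helper name is illustrative, not part of the API):

```python
import time
import requests

def wait_until_healthy(base_url="http://localhost:8000", timeout=300):
    """Poll /api/v1/health until the model is loaded or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            health = requests.get(f"{base_url}/api/v1/health", timeout=5).json()
            if health.get("status") == "healthy" and health.get("model_loaded"):
                return health
        except requests.RequestException:
            pass  # API not reachable yet; keep polling
        time.sleep(2)
    raise TimeoutError("Moshi TTS API did not become healthy in time")

print(wait_until_healthy()["device"])  # e.g. "cuda"
```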
### GET /api/v1/languages - Supported languages

```bash
curl http://localhost:8000/api/v1/languages
```

Response:

```json
{
  "languages": [
    {"code": "fr", "name": "French (Français)"},
    {"code": "en", "name": "English"}
  ]
}
```
### GET /api/v1/voices - Available voices

```bash
curl http://localhost:8000/api/v1/voices
```

Response:

```json
{
  "voices": [
    {"id": "default", "name": "vctk_p225", "description": "Default voice"},
    {"id": "vctk_p225", "name": "vctk_p225", "description": "VCTK voice p225"},
    {"id": "vctk_p226", "name": "vctk_p226", "description": "VCTK voice p226"}
  ]
}
```

### POST /api/v1/tts - Synthesize speech from text

```bash
curl -X POST http://localhost:8000/api/v1/tts \
-H "Content-Type: application/json" \
-d '{
"text": "Your text here",
"language": "fr",
"format": "wav",
"voice": "default"
}' \
--output audio.wav
```

Parameters:
- `text` (required): Text to synthesize (1-5000 characters)
- `language` (optional, default: `"fr"`): Language code (`"fr"` or `"en"`)
- `format` (optional, default: `"wav"`): Output format (`"wav"` or `"raw"`)
- `voice` (optional, default: `"default"`): Voice preset to use
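The same request from Python with all four parameters and basic error handling; a sketch (returning errors as JSON with a `detail` field is typical FastAPI behavior, not verified against this API):

```python
import requests

payload = {
    "text": "Votre texte ici",  # required, 1-5000 characters
    "language": "fr",           # "fr" or "en"
    "format": "wav",            # "wav" or "raw"
    "voice": "default",         # any preset from /api/v1/voices
}

response = requests.post("http://localhost:8000/api/v1/tts", json=payload, timeout=120)
if response.ok:
    with open("audio.wav", "wb") as f:
        f.write(response.content)
else:
    # FastAPI validation errors usually arrive as JSON with a "detail" field
    print(response.status_code, response.text)
```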
### POST /api/v1/tts/file - Synthesize from a text file

```bash
curl -X POST http://localhost:8000/api/v1/tts/file \
-F "file=@my_text.txt" \
-F "language=fr" \
--output audio.wav
```
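The same upload from Python, mirroring the two `-F` form fields in the curl call above; a minimal sketch:

```python
import requests

# Multipart form upload mirroring: -F "file=@my_text.txt" -F "language=fr"
with open("my_text.txt", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/v1/tts/file",
        files={"file": ("my_text.txt", f, "text/plain")},
        data={"language": "fr"},
        timeout=300,
    )

response.raise_for_status()
with open("audio.wav", "wb") as out:
    out.write(response.content)
```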
## Configuration

The API uses pydantic-settings for type-safe configuration management. Configuration can be set via:

- `.env` file (local development)
- Environment variables (Docker/production)
- Default values (fallback)
```bash
# Copy the template
cp .env.example .env
# Edit .env with your settings
nano .env
```

**Important:** The `.env` file is gitignored and should never be committed!
See `.env.example` for all available settings:

```bash
# Server
HOST=0.0.0.0
PORT=8000
LOG_LEVEL=info
WORKERS=1
# Model Configuration
DEFAULT_TTS_REPO=kyutai/tts-1.6b-en_fr
DEFAULT_VOICE_REPO=kyutai/tts-voices
SAMPLE_RATE=24000
MODEL_DEVICE=cuda
MODEL_DTYPE=auto # auto, bfloat16, or float32
MODEL_N_Q=32 # Number of codebooks
MODEL_TEMP=0.6 # Temperature for generation
MODEL_CFG_COEF=2.0 # CFG coefficient
# CORS
CORS_ORIGINS=* # Change in production!
CORS_CREDENTIALS=true
# Environment
ENVIRONMENT=production
DEBUG=false
```
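For reference, this is roughly what the pydantic-settings pattern looks like; a minimal sketch of the mechanism, not the project's actual settings module (field list abbreviated):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Defaults are overridden by .env, which is overridden by environment variables."""
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    host: str = "0.0.0.0"
    port: int = 8000
    log_level: str = "info"
    default_tts_repo: str = "kyutai/tts-1.6b-en_fr"
    sample_rate: int = 24000

settings = Settings()  # e.g. PORT=9000 in the environment overrides the default
```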
## Performance

- GPU: Real-time or faster generation
- Memory: ~6GB for the model in bf16
- First Request: Slower (model loading)
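To see the model-loading cost yourself, time two identical requests; a rough sketch:

```python
import time
import requests

payload = {"text": "Warm-up timing test.", "language": "en"}

for label in ("first request (may include model load)", "second request (warm)"):
    start = time.perf_counter()
    requests.post("http://localhost:8000/api/v1/tts", json=payload, timeout=600)
    print(f"{label}: {time.perf_counter() - start:.1f}s")
```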
## Docker Management

```bash
# View logs
docker logs -f moshi-tts-api
# Stop container
docker stop moshi-tts-api
# Restart
docker restart moshi-tts-api
# Remove container
docker rm -f moshi-tts-api
# Clean image
docker rmi moshi-tts-api:latest
# Enter container
docker exec -it moshi-tts-api bash
```

## Troubleshooting

```bash
# Check logs
docker logs moshi-tts-api
# Check if port 8000 is free
lsof -i :8000

# Verify NVIDIA Docker
nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```

## Integration Examples

**Python:**

```python
import requests

response = requests.post(
    "http://localhost:8000/api/v1/tts",
    json={
        "text": "Hello world",
        "language": "en",
        "voice": "vctk_p225"
    }
)

with open("output.wav", "wb") as f:
    f.write(response.content)
```

**Node.js:**

```javascript
const axios = require('axios');
const fs = require('fs');

axios.post('http://localhost:8000/api/v1/tts', {
  text: 'Hello world',
  language: 'en',
  voice: 'default'
}, {
  responseType: 'arraybuffer'
}).then(response => {
  fs.writeFileSync('output.wav', response.data);
});
```

**n8n:**

Use the HTTP Request node with:
- Method: POST
- URL: http://localhost:8000/api/v1/tts
- Body: JSON with `{"text": "your text", "language": "en"}`
- Response Format: File
## Voice Presets

The API includes 44 voice presets from multiple datasets in the kyutai/tts-voices repository:
### VCTK

British English speakers from the Voice Cloning Toolkit:
- `vctk_p225` through `vctk_p234` - various speaker characteristics
- Example: `"voice": "vctk/p226_023.wav"`
### CML-TTS (French)

High-quality French speakers:
- `cml_fr_1406`, `cml_fr_1591`, `cml_fr_1770`, `cml_fr_2114`, `cml_fr_2154`, `cml_fr_2216`, `cml_fr_2223`, `cml_fr_2465`, `cml_fr_296`, `cml_fr_3267`
- Example: `"voice": "cml-tts/fr/1406_1028_000009-0003.wav"`
### Expresso (Emotions)

Emotional and stylistic variations:
- Speaking Styles: `default`, `enunciated`, `fast`, `projected`, `whisper`
- Emotions: `happy`, `angry`, `calm`, `confused`
- Example: `"voice": "expresso/ex03-ex01_happy_001_channel1_334s.wav"`
### EARS

Diverse English speakers (subset of 50 available):
- `ears_p001`, `ears_p002`, `ears_p003`, `ears_p004`, `ears_p005`, `ears_p010`, `ears_p015`, `ears_p020`, `ears_p025`, `ears_p030`, `ears_p035`, `ears_p040`, `ears_p045`, `ears_p050`
- Example: `"voice": "ears/p001/freeform_speech_01.wav"`
### Voice Examples

```bash
# English with emotional expression
curl -X POST http://localhost:8000/api/v1/tts \
-H "Content-Type: application/json" \
-d '{"text": "I am so happy today!", "language": "en", "voice": "expresso/ex03-ex01_happy_001_channel1_334s.wav"}' \
--output happy_voice.wav
# French voice
curl -X POST http://localhost:8000/api/v1/tts \
-H "Content-Type: application/json" \
-d '{"text": "Bonjour, comment allez-vous?", "language": "fr", "voice": "cml-tts/fr/1406_1028_000009-0003.wav"}' \
--output french_voice.wav
# Different English speaker
curl -X POST http://localhost:8000/api/v1/tts \
-H "Content-Type: application/json" \
-d '{"text": "Hello, this is a different voice.", "language": "en", "voice": "ears/p010/freeform_speech_01.wav"}' \
--output ears_voice.wav
```

You can list all available voices using the `/api/v1/voices` endpoint:

```bash
curl http://localhost:8000/api/v1/voices | jq
```
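The same endpoint works from Python, e.g. to audition a few presets in a loop; a sketch (the output file naming is illustrative):

```python
import requests

base = "http://localhost:8000/api/v1"
voices = requests.get(f"{base}/voices", timeout=10).json()["voices"]

# Synthesize one sample for each of the first three presets
for voice in voices[:3]:
    audio = requests.post(
        f"{base}/tts",
        json={"text": "Voice preset test.", "language": "en", "voice": voice["id"]},
        timeout=120,
    )
    audio.raise_for_status()
    with open(f"sample_{voice['id']}.wav", "wb") as f:
        f.write(audio.content)
```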
## License

This project uses Moshi from Kyutai Labs. See their license.

This API wrapper is licensed under the MIT License - see LICENSE for details.
## Contributing

Contributions are welcome! Feel free to:
- Fork the project
- Create a branch for your feature (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
## Acknowledgments

- Kyutai Labs for the Moshi model
- FastAPI for the web framework
- Docker for containerization
## Contact

For any questions or suggestions, feel free to open an issue on GitHub or email me at mmaudet@linagora.com.
⭐ If this project is useful to you, don't forget to give it a star on GitHub!