Universal text-to-speech adapter with pluggable engines.
# Copy config
cp .env.example .env
# Build image
make build
# Start server (model downloads on first run, ~4.3GB)
make up
# Wait for health (model warmup ~2min)
make health
# Test generation
make test

uv sync
make serve

Download the model while online, then run without network:
make download-model # Downloads to ~/.cache/tts-adapter/models/
# Add to .env:
# TTS_QWEN3_MODEL_PATH=~/.cache/tts-adapter/models/Qwen3-TTS-12Hz-1.7B-CustomVoice
# HF_HUB_OFFLINE=1

See Qwen3 Engine docs for details.
Open http://localhost:9880 — three modes (Simple, Voice Design, Voice Clone), RU/EN switch, advanced generation settings. See Web UI docs.
For API access, see API Reference. Swagger docs available at /docs.
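Because the model needs roughly two minutes to warm up, a client script may want to poll the health endpoint before sending requests. The following is a minimal sketch, not part of the project: the function names, the 180-second deadline, and the assumption that a healthy server answers `/health` with HTTP 200 are all illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503 during warmup).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_for_health(
    url: str = "http://localhost:9880/health",
    deadline_s: float = 180.0,  # assumed budget, slightly above the ~2 min warmup
    interval_s: float = 5.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 (assumed to mean healthy) or the deadline passes."""
    waited = 0.0
    while True:
        if fetch(url) == 200:
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

`fetch` and `sleep` are injectable so the loop can be exercised without a running server; in normal use `wait_for_health()` with no arguments is enough.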
Edit .env (copy from .env.example):
# Engine selection
TTS_ENGINE=qwen3
# Defaults
TTS_DEFAULT_SPEAKER=Serena
TTS_DEFAULT_LANGUAGE=Russian
# Qwen3 engine
TTS_QWEN3_MODEL_ID=Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
TTS_QWEN3_DEVICE=cuda:0
TTS_QWEN3_DTYPE=bfloat16

See .env.example for all options.
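The variable names above appear to follow a `TTS_<ENGINE>_<OPTION>` pattern. The sketch below illustrates that pattern with a tiny `.env` parser; it is not how the adapter loads its configuration, and `parse_env` and `engine_settings` are hypothetical helpers.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def engine_settings(settings: dict[str, str], engine: str) -> dict[str, str]:
    """Collect options for one engine, assuming the TTS_<ENGINE>_<OPTION> naming."""
    prefix = f"TTS_{engine.upper()}_"
    return {
        key[len(prefix):].lower(): value
        for key, value in settings.items()
        if key.startswith(prefix)
    }


# Values copied from the example configuration above.
EXAMPLE = """\
# Engine selection
TTS_ENGINE=qwen3
# Qwen3 engine
TTS_QWEN3_DEVICE=cuda:0
TTS_QWEN3_DTYPE=bfloat16
"""

if __name__ == "__main__":
    parsed = parse_env(EXAMPLE)
    print(engine_settings(parsed, parsed["TTS_ENGINE"]))
```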
make build # Build image
make up # Start container
make health # Check health
make logs # View logs
make down # Stop
make shell # Shell into container

See Makefile for all targets.
Model weights (~4.3GB) download automatically on the first make up. They are stored in the host's ~/.cache/huggingface directory and mounted into the container, so:
- The download happens once and persists across container restarts
- The same cache is shared between Docker and local development
- Custom path: set HF_CACHE_PATH in .env
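To check whether the weights are already on disk before going offline, one can look for the repository folder in the cache. This sketch assumes the cache root is the Hugging Face home directory (so repositories live under its `hub/` subfolder in the standard `models--<org>--<name>` layout); `hub_repo_dir` and `is_cached` are hypothetical helpers, not part of the project.

```python
from pathlib import Path


def hub_repo_dir(model_id: str, cache_root: Path) -> Path:
    """Folder where the Hugging Face hub cache keeps a model repo under `cache_root`."""
    return cache_root / "hub" / ("models--" + model_id.replace("/", "--"))


def is_cached(model_id: str, cache_root: Path) -> bool:
    """True if at least one snapshot of the model exists in the cache."""
    snapshots = hub_repo_dir(model_id, cache_root) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    root = Path.home() / ".cache" / "huggingface"  # default location named above
    print(is_cached("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", root))
```

This only checks that a snapshot folder exists; it does not verify that every weight file finished downloading.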
curl -X POST http://localhost:9880/tts \
-H 'content-type: application/json' \
-d '{"text":"Hello world","language":"English","speaker":"Ryan"}' \
--output out.wav

curl -X POST http://localhost:9880/tts/batch \
-H 'content-type: application/json' \
-d '{"items":[{"id":"001","text":"First phrase"},{"id":"002","text":"Second phrase"}]}' \
--output batch.zip

Clone any voice from a 3-10 second audio sample (requires the Base model):
# Switch to Base model in .env:
# TTS_QWEN3_MODEL_ID=Qwen/Qwen3-TTS-12Hz-1.7B-Base
curl -X POST http://localhost:9880/tts/clone \
-F 'text=Привет мир' \
-F 'language=Russian' \
-F 'reference_audio=@voice_sample.wav' \
-F 'reference_text=Текст из референсного аудио' \
--output cloned.wav

See Qwen3 Engine docs for details.
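The single and batch JSON requests shown above can also be issued from Python using only the standard library. This is a sketch that mirrors the curl payloads; the helper names and the 300-second timeout are assumptions, and the exact request schema should be checked against the API Reference.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9880"


def _json_post(url: str, payload: dict) -> urllib.request.Request:
    """Build a POST request with a JSON body (nothing is sent yet)."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_tts_request(text: str, language: str, speaker: str,
                      base_url: str = BASE_URL) -> urllib.request.Request:
    """Mirror of the /tts curl example."""
    return _json_post(f"{base_url}/tts",
                      {"text": text, "language": language, "speaker": speaker})


def build_batch_request(items: list[tuple[str, str]],
                        base_url: str = BASE_URL) -> urllib.request.Request:
    """Mirror of the /tts/batch curl example; `items` are (id, text) pairs."""
    payload = {"items": [{"id": item_id, "text": text} for item_id, text in items]}
    return _json_post(f"{base_url}/tts/batch", payload)


def save_response(request: urllib.request.Request, out_path: str,
                  timeout: float = 300.0) -> None:
    """Send the request and write the raw response body to `out_path`."""
    with urllib.request.urlopen(request, timeout=timeout) as resp, open(out_path, "wb") as fh:
        fh.write(resp.read())


if __name__ == "__main__":
    # Requires a running server.
    save_response(build_tts_request("Hello world", "English", "Ryan"), "out.wav")
```

Building the request separately from sending it keeps the payload easy to inspect and to test without a live server.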
curl http://localhost:9880/health

| Document | Description |
|---|---|
| Web UI | Language switch, advanced settings, multi-client, LAN |
| Architecture | Design decisions, engine protocol |
| API Reference | Endpoint specs, request/response formats |
| Qwen3 Engine | Model variants, speakers, setup |
| AGENTS.md | Project instructions for AI agents |

| Engine | Status | Description |
|---|---|---|
| Qwen3-TTS | ✅ Ready | 1.7B/0.6B with voice cloning, preset speakers, instructions |
- Create tts_adapter/engines/new_engine.py
- Implement the TTSEngine protocol
- Register in engines/__init__.py
- Document in docs/engines/
See Architecture for details.
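As a rough illustration of the steps above, the sketch below defines a structural protocol, a toy engine that satisfies it, and a name-based registry. Everything here is hypothetical: the real TTSEngine protocol, its method names and signatures, and the registration mechanism in engines/__init__.py are defined by the project and may look quite different.

```python
import io
import wave
from typing import Protocol, runtime_checkable


@runtime_checkable
class TTSEngine(Protocol):
    """Hypothetical protocol shape; the project's actual definition may differ."""

    name: str

    def synthesize(self, text: str, language: str, speaker: str) -> bytes:
        """Return WAV-encoded audio for `text`."""
        ...


# Hypothetical registry mapping engine names to classes.
ENGINES: dict[str, type] = {}


def register(cls: type) -> type:
    """Class decorator that records an engine under its `name`."""
    ENGINES[cls.name] = cls
    return cls


@register
class SilenceEngine:
    """Toy engine: ignores its inputs and returns 0.1 s of silence."""

    name = "silence"

    def synthesize(self, text: str, language: str, speaker: str) -> bytes:
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(1)       # mono
            wav.setsampwidth(2)       # 16-bit samples
            wav.setframerate(24000)   # arbitrary sample rate for the sketch
            wav.writeframes(b"\x00\x00" * 2400)
        return buf.getvalue()


def create_engine(name: str) -> TTSEngine:
    """Instantiate a registered engine, e.g. from a TTS_ENGINE-style setting."""
    try:
        return ENGINES[name]()
    except KeyError:
        raise ValueError(f"unknown engine: {name!r}") from None
```

Because the protocol is structural, `SilenceEngine` does not need to inherit from `TTSEngine`; having a matching `name` attribute and `synthesize` method is enough for `isinstance` to accept it.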
Apache-2.0