Releases: Kenpath/svara-tts-inference
v1.0.0 — Embedded vLLM Engine + OpenAI-Compatible API
What's New
Architecture
- Embedded vLLM engine — single-process architecture, no separate vLLM server. Eliminates HTTP hop, reduces latency and operational complexity.
- Single port (8080) — everything served from one FastAPI process managed by supervisord.
API
- OpenAI-compatible `/v1/audio/speech` — drop-in replacement for OpenAI's TTS API. Works with the OpenAI Python/Node SDKs out of the box.
- Consolidated endpoint — removed `/v1/text-to-speech`; all features (streaming, zero-shot cloning, generation params) are now unified under `/v1/audio/speech`.
- Voice accepts both ID format (`hi_male`) and display name format (`Hindi (Male)`).
- New optional params: `chunk_size`, `buffer_ms` (via `extra_body` in the OpenAI SDK).
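As a rough sketch of what a request to the consolidated endpoint might look like, the snippet below builds (but does not send) a POST to `/v1/audio/speech` using only the standard library. The `model`, `input`, `voice`, and `response_format` fields follow OpenAI's speech API. The model name `svara-tts`, the host `localhost`, the `wav` format, and the example values for `chunk_size` and `buffer_ms` are assumptions. The OpenAI SDK merges `extra_body` keys into the JSON body, so the two optional params appear here as top-level fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed host; port 8080 is from the release notes


def build_speech_request(text, voice="hi_male", chunk_size=None, buffer_ms=None):
    """Build (without sending) a POST request for /v1/audio/speech."""
    body = {
        "model": "svara-tts",      # hypothetical model name
        "input": text,
        "voice": voice,            # ID ("hi_male") or display name ("Hindi (Male)")
        "response_format": "wav",  # assumed; check which formats the server supports
    }
    # Optional params; with the OpenAI SDK these would be passed via extra_body.
    if chunk_size is not None:
        body["chunk_size"] = chunk_size
    if buffer_ms is not None:
        body["buffer_ms"] = buffer_ms
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative values only; units and sensible ranges are not given in the notes.
req = build_speech_request("नमस्ते", voice="Hindi (Male)", chunk_size=40, buffer_ms=100)
payload = json.loads(req.data)
```

Passing `req` to `urllib.request.urlopen` against a running server should return audio bytes, though authentication and response framing depend on the deployment.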
Performance
- FP8 weight quantization — ~50% GPU memory reduction with minimal quality loss.
- FP8 KV cache — doubles KV cache capacity (55K → 111K tokens, 27x concurrent request headroom). Requires FlashInfer backend.
- FlashInfer attention backend — optional alternative to FlashAttention v2.
- torch.compile on SNAC decoder — compiled at startup via `orchestrator.warmup()`.
- SNAC on CPU by default — frees GPU memory for vLLM; benchmarks show identical latency to GPU.
- Auto-detected worker count — scales SNAC decode workers with CPU cores.
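The KV cache doubling follows from simple arithmetic: halving the bytes per cached element doubles the tokens that fit in the same memory budget. The sketch below works through it. The model shape (28 layers, 8 KV heads, head dimension 128) and the 4096-token per-request ceiling are hypothetical values chosen for illustration, so the last line is only one possible reading of the 27x figure, not something the release notes confirm.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 28, 8, 128

per_token_16bit = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
per_token_fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)

# Fix a memory budget that holds 55K tokens at 16 bits per element.
budget_bytes = 55_000 * per_token_16bit
capacity_16bit = budget_bytes // per_token_16bit
capacity_fp8 = budget_bytes // per_token_fp8  # twice the 16-bit capacity

# Assumed per-request ceiling of 4096 tokens (not stated in the release notes).
MAX_TOKENS_PER_REQUEST = 4096
concurrent_requests = 111_000 // MAX_TOKENS_PER_REQUEST

print(capacity_16bit, capacity_fp8, concurrent_requests)  # 55000 110000 27
```

The computed 110,000 differs slightly from the reported 111K, presumably because the published figures are rounded.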
Streaming
- Fixed long-text streaming — audio now streams progressively within chunks (~1s TTFB) instead of waiting for all chunks to complete.
- Sentence-boundary chunking — long text split at 200 chars with 50ms crossfade stitching.
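The chunking and stitching described above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the sentence regex, the linear fade curve, and the 24 kHz sample rate are assumptions, and only the 200-character limit and the 50 ms crossfade come from the release notes.

```python
import re


def split_sentences(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept intact in this sketch.
    """
    # A sentence is text up to ., !, ? or the Devanagari danda, plus trailing spaces.
    sentences = re.findall(r"[^.!?।]+[.!?।]*\s*", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks


def crossfade(a, b, sample_rate=24000, fade_ms=50):
    """Join two sample lists, linearly blending the last fade_ms of a into b."""
    n = min(int(sample_rate * fade_ms / 1000), len(a), len(b))
    if n == 0:
        return list(a) + list(b)
    head, tail = a[:-n], a[-n:]
    mixed = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of b rises from near 0 to near 1
        mixed.append(tail[i] * (1 - w) + b[i] * w)
    return list(head) + mixed + list(b[n:])


text = " ".join(f"Sentence number {i} is here." for i in range(30))
chunks = split_sentences(text)

# Two 100 ms dummy segments at 24 kHz: constant 1.0 followed by silence.
stitched = crossfade([1.0] * 2400, [0.0] * 2400)
```

Because the overlap region is shared, the stitched audio is shorter than the sum of its parts by the fade length (1200 samples here).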
Configuration
- `SNAC_DEVICE` (renamed from `TTS_DEVICE`) — configurable SNAC decoder device (`cpu`/`cuda`/`mps`).
- `VLLM_ATTENTION_BACKEND` — choose FlashAttention or FlashInfer.
- `VLLM_KV_CACHE_DTYPE` — set to `fp8` for doubled KV cache.
- `SNAC_COMPILE` — toggle torch.compile for SNAC model.
- `SNAC_WINDOW_SIZE` — configurable SNAC mapper window size.
- `LOG_LEVEL` — properly wired through to Python logging.
- All env vars documented in `.env.example` and passed through `docker-compose.yml`.
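One way these variables could be read and validated at startup is sketched below. The `Settings` class and `load_settings` function are hypothetical names, and the defaults other than `cpu` for `SNAC_DEVICE`, along with the backend value `FLASHINFER`, are assumptions; `.env.example` remains the authoritative reference. The FP8 check reflects the note under Performance that the FP8 KV cache requires the FlashInfer backend.

```python
import logging
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Settings:
    snac_device: str
    attention_backend: Optional[str]
    kv_cache_dtype: str
    snac_compile: bool
    snac_window_size: Optional[int]
    log_level: int


def load_settings(env: Mapping[str, str]) -> Settings:
    """Parse the documented env vars from a mapping such as os.environ."""
    device = env.get("SNAC_DEVICE", "cpu").lower()  # CPU is the stated default
    if device not in {"cpu", "cuda", "mps"}:
        raise ValueError(f"unsupported SNAC_DEVICE: {device}")

    backend = env.get("VLLM_ATTENTION_BACKEND")        # accepted values are assumed
    kv_dtype = env.get("VLLM_KV_CACHE_DTYPE", "auto")  # "auto" default is assumed
    if kv_dtype == "fp8" and (backend or "").upper() != "FLASHINFER":
        raise ValueError("VLLM_KV_CACHE_DTYPE=fp8 requires the FlashInfer backend")

    # Default of "true" is an assumption.
    compile_flag = env.get("SNAC_COMPILE", "true").lower() in {"1", "true", "yes", "on"}
    window = env.get("SNAC_WINDOW_SIZE")  # no default assumed here

    level = getattr(logging, env.get("LOG_LEVEL", "INFO").upper(), None)
    if not isinstance(level, int):
        raise ValueError("LOG_LEVEL must be a standard logging level name")

    return Settings(
        snac_device=device,
        attention_backend=backend,
        kv_cache_dtype=kv_dtype,
        snac_compile=compile_flag,
        snac_window_size=int(window) if window else None,
        log_level=level,
    )


settings = load_settings({
    "VLLM_ATTENTION_BACKEND": "FLASHINFER",
    "VLLM_KV_CACHE_DTYPE": "fp8",
    "LOG_LEVEL": "debug",
})
```

In a real process, `load_settings(os.environ)` followed by `logging.basicConfig(level=settings.log_level)` would apply the parsed level.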
Cleanup
- Removed stale `scripts/` directory, `/debug/timing` endpoints, `timing.py`.
- Consistent logging (replaced all `print()` with `logger.*`).
- Updated README, DEPLOYMENT.md, ARCHITECTURE.md to reflect current API surface.
- Fixed `total_mem` → `total_memory` bug in GPU detection.
- Added `language_data` dependency (required by `langcodes`).
- Copies `assets/` directory into Docker image for voice configs.
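For context on the GPU detection fix, PyTorch's device properties object exposes the memory size in bytes as `total_memory`; there is no `total_mem` attribute. The sketch below shows the corrected access pattern together with `logger`-based output. The helper `gpu_memory_gib` and the logger name are hypothetical, and a stand-in object replaces `torch.cuda.get_device_properties(0)` so the example runs without a GPU.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("svara.gpu")  # hypothetical logger name


def gpu_memory_gib(props):
    """Return total device memory in GiB from a device-properties object."""
    # total_memory is reported in bytes.
    return props.total_memory / (1024 ** 3)


# Stand-in for torch.cuda.get_device_properties(0).
fake_props = SimpleNamespace(name="Example GPU", total_memory=24 * 1024 ** 3)
gib = gpu_memory_gib(fake_props)
logger.info("Detected %s with %.1f GiB", fake_props.name, gib)
```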
Supported Languages
Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Magahi, Chhattisgarhi, Maithili, Assamese, Bodo, Dogri, Gujarati, Malayalam, Punjabi, Tamil, English (Indian), Nepali, Sanskrit — 38 voices (19 languages x 2 genders).