
Releases: Kenpath/svara-tts-inference

v1.0.0 — Embedded vLLM Engine + OpenAI-Compatible API

21 Mar 03:11


What's New

Architecture

  • Embedded vLLM engine — single-process architecture, no separate vLLM server. Eliminates HTTP hop, reduces latency and operational complexity.
  • Single port (8080) — everything served from one FastAPI process managed by supervisord.
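
A minimal sketch of what the supervisord program entry for that single process might look like; the program name, module path (`app.main:app`), and options here are illustrative assumptions, not the image's actual config.

```
; Hypothetical supervisord entry for the single FastAPI process on port 8080.
; Actual module path and flags in the image may differ.
[program:svara-tts]
command=uvicorn app.main:app --host 0.0.0.0 --port 8080
autorestart=true
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
```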

API

  • OpenAI-compatible /v1/audio/speech — drop-in replacement for OpenAI's TTS API. Works with the OpenAI Python/Node SDKs out of the box.
  • Consolidated endpoint — removed /v1/text-to-speech, all features (streaming, zero-shot cloning, generation params) unified under /v1/audio/speech.
  • The voice parameter accepts both the ID format (hi_male) and the display-name format (Hindi (Male)).
  • New optional params: chunk_size, buffer_ms (via extra_body in OpenAI SDK).
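
A client-side sketch of a request to the OpenAI-compatible endpoint. The payload shape mirrors OpenAI's TTS API, with chunk_size and buffer_ms as the optional extras above (sent via extra_body when using the OpenAI SDK); the host, port, and model name here are assumptions for illustration.

```python
# Sketch: build and send a request to POST /v1/audio/speech.
# Requires a running server at base_url; payload fields beyond the
# OpenAI-standard model/input/voice are the release's optional params.
import json
import urllib.request


def build_speech_payload(text, voice="hi_male", chunk_size=None, buffer_ms=None):
    """Assemble the JSON body for /v1/audio/speech."""
    payload = {"model": "svara-tts", "input": text, "voice": voice}
    # Optional streaming knobs; only included when explicitly set.
    if chunk_size is not None:
        payload["chunk_size"] = chunk_size
    if buffer_ms is not None:
        payload["buffer_ms"] = buffer_ms
    return payload


def synthesize(text, base_url="http://localhost:8080", **kwargs):
    """POST the payload and return raw audio bytes (needs a live server)."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(build_speech_payload(text, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

With the OpenAI SDK, the same non-standard fields would go through `extra_body={"chunk_size": ..., "buffer_ms": ...}` on `client.audio.speech.create(...)`.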

Performance

  • FP8 weight quantization — ~50% GPU memory reduction with minimal quality loss.
  • FP8 KV cache — doubles KV cache capacity (55K → 111K tokens, 27x concurrent request headroom). Requires FlashInfer backend.
  • FlashInfer attention backend — optional alternative to FlashAttention v2.
  • torch.compile on SNAC decoder — compiled at startup via orchestrator.warmup().
  • SNAC on CPU by default — frees GPU memory for vLLM, benchmarks show identical latency to GPU.
  • Auto-detected worker count — scales SNAC decode workers with CPU cores.
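
The worker auto-detection above can be sketched roughly as follows; the cap, the reserved core, and the function name are illustrative assumptions, not the release's exact logic.

```python
# Hedged sketch: scale SNAC decode workers with available CPU cores.
import os


def snac_worker_count(max_workers=8):
    """Pick a decode-pool size from CPU core count, clamped to [1, max_workers]."""
    cores = os.cpu_count() or 1
    # Leave one core for the FastAPI event loop (assumption), floor at 1.
    return max(1, min(cores - 1, max_workers))
```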

Streaming

  • Fixed long-text streaming — audio now streams progressively within chunks (~1s TTFB) instead of waiting for all chunks to complete.
  • Sentence-boundary chunking — long text split at 200 chars with 50ms crossfade stitching.
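
The chunking step can be illustrated with a greedy sentence packer: split on sentence-final punctuation, then pack whole sentences into chunks of at most 200 characters. The exact boundary rules (and the 50ms crossfade) in the release may differ; this shows the idea only.

```python
# Illustrative sketch of sentence-boundary chunking for long-text streaming.
import re


def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized chunk.
    Splits after ., !, ?, or the Devanagari danda (।).
    """
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then synthesized independently and stitched with a short crossfade, which is what lets audio start flowing (~1s TTFB) before the full text is processed.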

Configuration

  • SNAC_DEVICE (renamed from TTS_DEVICE) — configurable SNAC decoder device (cpu/cuda/mps).
  • VLLM_ATTENTION_BACKEND — choose FlashAttention or FlashInfer.
  • VLLM_KV_CACHE_DTYPE — set to fp8 for doubled KV cache.
  • SNAC_COMPILE — toggle torch.compile for SNAC model.
  • SNAC_WINDOW_SIZE — configurable SNAC mapper window size.
  • LOG_LEVEL — properly wired through to Python logging.
  • All env vars documented in .env.example and passed through docker-compose.yml.
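
Putting the knobs above together, an illustrative `.env` fragment might look like this; the values shown are example settings, not required defaults (SNAC_WINDOW_SIZE is omitted since the release does not state a value).

```
# Example .env fragment (illustrative values)
SNAC_DEVICE=cpu
VLLM_ATTENTION_BACKEND=FLASHINFER
VLLM_KV_CACHE_DTYPE=fp8
SNAC_COMPILE=true
LOG_LEVEL=INFO
```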

Cleanup

  • Removed stale scripts/ directory, /debug/timing endpoints, timing.py.
  • Consistent logging (replaced all print() with logger.*).
  • Updated README, DEPLOYMENT.md, ARCHITECTURE.md to reflect current API surface.
  • Fixed a `total_mem`/`total_memory` variable-name bug in GPU detection.
  • Added language_data dependency (required by langcodes).
  • The assets/ directory is now copied into the Docker image for voice configs.

Supported Languages

Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Magahi, Chhattisgarhi, Maithili, Assamese, Bodo, Dogri, Gujarati, Malayalam, Punjabi, Tamil, English (Indian), Nepali, Sanskrit — 38 voices (19 languages × 2 genders).