
Releases: Kenpath/svara-tts-inference

v1.0.0 — Embedded vLLM Engine + OpenAI-Compatible API

21 Mar 03:11


What's New

Architecture

  • Embedded vLLM engine — single-process architecture, no separate vLLM server. Eliminates HTTP hop, reduces latency and operational complexity.
  • Single port (8080) — everything served from one FastAPI process managed by supervisord.
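
A minimal sketch of what the supervisord program entry for that single process might look like; the program name, module path (`app.main:app`), and options here are illustrative assumptions, not the image's actual config.

```
; Hypothetical supervisord entry for the single FastAPI process on port 8080.
; Actual module path and flags in the image may differ.
[program:svara-tts]
command=uvicorn app.main:app --host 0.0.0.0 --port 8080
autorestart=true
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
```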

API

  • OpenAI-compatible /v1/audio/speech — drop-in replacement for OpenAI's TTS API. Works with the OpenAI Python/Node SDKs out of the box.
  • Consolidated endpoint — removed /v1/text-to-speech, all features (streaming, zero-shot cloning, generation params) unified under /v1/audio/speech.
  • The voice parameter accepts both the ID format (hi_male) and the display-name format (Hindi (Male)).
  • New optional params: chunk_size, buffer_ms (via extra_body in OpenAI SDK).
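
A client-side sketch of a request to the OpenAI-compatible endpoint. The payload shape mirrors OpenAI's TTS API, with chunk_size and buffer_ms as the optional extras above (sent via extra_body when using the OpenAI SDK); the host, port, and model name here are assumptions for illustration.

```python
# Sketch: build and send a request to POST /v1/audio/speech.
# Requires a running server at base_url; payload fields beyond the
# OpenAI-standard model/input/voice are the release's optional params.
import json
import urllib.request


def build_speech_payload(text, voice="hi_male", chunk_size=None, buffer_ms=None):
    """Assemble the JSON body for /v1/audio/speech."""
    payload = {"model": "svara-tts", "input": text, "voice": voice}
    # Optional streaming knobs; only included when explicitly set.
    if chunk_size is not None:
        payload["chunk_size"] = chunk_size
    if buffer_ms is not None:
        payload["buffer_ms"] = buffer_ms
    return payload


def synthesize(text, base_url="http://localhost:8080", **kwargs):
    """POST the payload and return raw audio bytes (needs a live server)."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(build_speech_payload(text, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

With the OpenAI SDK, the same non-standard fields would go through `extra_body={"chunk_size": ..., "buffer_ms": ...}` on `client.audio.speech.create(...)`.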

Performance

  • FP8 weight quantization — ~50% GPU memory reduction with minimal quality loss.
  • FP8 KV cache — doubles KV cache capacity (55K → 111K tokens, 27x concurrent request headroom). Requires FlashInfer backend.
  • FlashInfer attention backend — optional alternative to FlashAttention v2.
  • torch.compile on SNAC decoder — compiled at startup via orchestrator.warmup().
  • SNAC on CPU by default — frees GPU memory for vLLM, benchmarks show identical latency to GPU.
  • Auto-detected worker count — scales SNAC decode workers with CPU cores.
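
The worker auto-detection above can be sketched roughly as follows; the cap, the reserved core, and the function name are illustrative assumptions, not the release's exact logic.

```python
# Hedged sketch: scale SNAC decode workers with available CPU cores.
import os


def snac_worker_count(max_workers=8):
    """Pick a decode-pool size from CPU core count, clamped to [1, max_workers]."""
    cores = os.cpu_count() or 1
    # Leave one core for the FastAPI event loop (assumption), floor at 1.
    return max(1, min(cores - 1, max_workers))
```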

Streaming

  • Fixed long-text streaming — audio now streams progressively within chunks (~1s TTFB) instead of waiting for all chunks to complete.
  • Sentence-boundary chunking — long text split at 200 chars with 50ms crossfade stitching.
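
The chunking step can be illustrated with a greedy sentence packer: split on sentence-final punctuation, then pack whole sentences into chunks of at most 200 characters. The exact boundary rules (and the 50ms crossfade) in the release may differ; this shows the idea only.

```python
# Illustrative sketch of sentence-boundary chunking for long-text streaming.
import re


def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized chunk.
    Splits after ., !, ?, or the Devanagari danda (।).
    """
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then synthesized independently and stitched with a short crossfade, which is what lets audio start flowing (~1s TTFB) before the full text is processed.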

Configuration

  • SNAC_DEVICE (renamed from TTS_DEVICE) — configurable SNAC decoder device (cpu/cuda/mps).
  • VLLM_ATTENTION_BACKEND — choose FlashAttention or FlashInfer.
  • VLLM_KV_CACHE_DTYPE — set to fp8 for doubled KV cache.
  • SNAC_COMPILE — toggle torch.compile for SNAC model.
  • SNAC_WINDOW_SIZE — configurable SNAC mapper window size.
  • LOG_LEVEL — properly wired through to Python logging.
  • All env vars documented in .env.example and passed through docker-compose.yml.
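
Putting the knobs above together, an illustrative `.env` fragment might look like this; the values shown are example settings, not required defaults (SNAC_WINDOW_SIZE is omitted since the release does not state a value).

```
# Example .env fragment (illustrative values)
SNAC_DEVICE=cpu
VLLM_ATTENTION_BACKEND=FLASHINFER
VLLM_KV_CACHE_DTYPE=fp8
SNAC_COMPILE=true
LOG_LEVEL=INFO
```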

Cleanup

  • Removed stale scripts/ directory, /debug/timing endpoints, timing.py.
  • Consistent logging (replaced all print() with logger.*).
  • Updated README, DEPLOYMENT.md, ARCHITECTURE.md to reflect current API surface.
  • Fixed a `total_mem`/`total_memory` variable-name bug in GPU detection.
  • Added language_data dependency (required by langcodes).
  • The assets/ directory is now copied into the Docker image for voice configs.

Supported Languages

Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Magahi, Chhattisgarhi, Maithili, Assamese, Bodo, Dogri, Gujarati, Malayalam, Punjabi, Tamil, English (Indian), Nepali, Sanskrit — 38 voices (19 languages × 2 genders).