svara-tts-inference

Inference and deployment toolkit for Svara-TTS, an open-source multilingual text-to-speech model for Indic languages. Includes examples for local GGUF inference, a Gradio demo, and production-ready API deployment.

🤗 Hugging Face - svara-tts-v1 Model · Open In Colab · 🤗 Hugging Face - Spaces

Features

  • 38 Voice Profiles: Support for 19 Indian languages with male and female voices
  • Streaming Audio: Real-time audio generation with low-latency streaming
  • OpenAI-Compatible API: Drop-in replacement for OpenAI's /v1/audio/speech endpoint
  • Production Ready: Docker deployment with embedded vLLM engine
  • GPU Accelerated: CUDA-optimized inference with configurable SNAC decoder device
  • Multiple Audio Formats: Output in MP3, Opus, AAC, WAV, or raw PCM via ffmpeg
  • Zero-Shot Voice Cloning: Clone any voice with a short audio reference
  • Long-Text Chunking: Automatic sentence-boundary splitting with crossfade stitching
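The long-text chunking feature can be illustrated with a minimal sketch. This is an illustration only, not the repository's actual implementation (which lives in tts_engine/utils.py): sentences are split on boundary punctuation and packed into chunks under a length budget.

```python
import re

def split_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries, packing sentences into
    chunks no longer than max_chars (a single overlong sentence is
    emitted as its own chunk)."""
    # Split after ., !, ?, or the Devanagari danda (।), dropping the
    # whitespace between sentences but keeping the punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?\u0964])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```

The real pipeline additionally crossfade-stitches the audio of adjacent chunks so the boundaries are inaudible; this sketch covers only the text-splitting half.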

Supported Languages

Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Magahi, Chhattisgarhi, Maithili, Assamese, Bodo, Dogri, Gujarati, Malayalam, Punjabi, Tamil, English (Indian), Nepali, Sanskrit

Quick Start - API Deployment

Deploy Svara TTS as a production API service with Docker:

# Clone repository
git clone <repository-url>
cd svara-tts-inference

# Configure (optional)
cp .env.example .env

# Build and start
docker-compose up -d

# Test the API
curl http://localhost:8080/health
curl http://localhost:8080/v1/voices

API Usage

Get Available Voices:

curl http://localhost:8080/v1/voices

OpenAI-Compatible Endpoint:

curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello from Svara TTS!",
    "voice": "en_male",
    "response_format": "mp3"
  }' \
  --output speech.mp3

Streaming:

curl -N -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "नमस्ते, मैं स्वरा टीटीएस हूं",
    "voice": "hi_male",
    "response_format": "wav",
    "stream": true
  }' \
  --output audio.wav
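The same streaming request can be issued from Python without any third-party dependency. A minimal sketch, mirroring the curl example above (the endpoint URL and payload fields come from that example; the helper function is illustrative):

```python
import json
import urllib.request

def save_audio_stream(chunks, path: str) -> int:
    """Write an iterable of byte chunks to disk as they arrive,
    returning the total number of bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total

def stream_speech(text: str, voice: str,
                  url: str = "http://localhost:8080/v1/audio/speech"):
    """Yield audio chunks from the streaming endpoint.
    Requires a running server; parameters mirror the curl example."""
    payload = json.dumps({
        "input": text,
        "voice": voice,
        "response_format": "wav",
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        while chunk := resp.read(4096):
            yield chunk

if __name__ == "__main__":
    n = save_audio_stream(stream_speech("Hello!", "en_male"), "audio.wav")
    print(f"wrote {n} bytes")
```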

Python Example (OpenAI SDK):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.audio.speech.create(
    model="svara-tts-v1",
    voice="hi_female",
    input="नमस्ते, मैं स्वरा हूं।",
    response_format="mp3",
)

response.stream_to_file("output.mp3")

See examples/api_client.py for more examples.

API Documentation

Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/v1/voices` | GET | List available voices |
| `/v1/audio/speech` | POST | OpenAI-compatible TTS (supports streaming, zero-shot cloning) |

Voice IDs

For svara-tts-v1, voice IDs follow the format {language_code}_{gender}, e.g. hi_male, hi_female, en_male.

The /v1/audio/speech endpoint also accepts display names like Hindi (Male), English (Female).
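Because the endpoint accepts both forms, a client may want to normalize display names to voice IDs. A minimal sketch, assuming a hand-written language-code table (the hi and en codes match the IDs in the examples above; the ta entry is an assumption, and the server's real table lives in its voice config under assets/):

```python
# Hypothetical mapping for illustration only; extend it from the
# server's /v1/voices response rather than hard-coding in practice.
LANGUAGE_CODES = {
    "hindi": "hi",
    "english": "en",
    "tamil": "ta",
}

def to_voice_id(display_name: str) -> str:
    """Convert a display name like 'Hindi (Male)' to the
    {language_code}_{gender} form, e.g. 'hi_male'."""
    name = display_name.strip().lower()
    lang, _, rest = name.partition("(")
    gender = rest.rstrip(")").strip()
    return f"{LANGUAGE_CODES[lang.strip()]}_{gender}"
```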

Audio Formats

All endpoints support multiple output formats via the response_format parameter:

| Format | MIME Type | Notes |
|--------|-----------|-------|
| `mp3` | `audio/mpeg` | Default |
| `opus` | `audio/ogg` | Great for streaming |
| `aac` | `audio/aac` | ADTS container |
| `wav` | `audio/wav` | Uncompressed, larger files |
| `pcm` | `audio/pcm` | Raw signed 16-bit LE, 24 kHz mono |
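Raw `pcm` output has no container, so most players cannot open it directly. Given the parameters in the table (signed 16-bit little-endian, 24 kHz, mono), the standard-library `wave` module can wrap it into a playable WAV file; the helper below is illustrative:

```python
import wave

def pcm_to_wav(pcm_bytes: bytes, wav_path: str,
               sample_rate: int = 24000) -> None:
    """Wrap raw signed 16-bit little-endian mono PCM (the `pcm`
    response format) in a WAV container."""
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(1)         # mono
        w.setsampwidth(2)         # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)
```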

Deployment Guide

For detailed deployment instructions, configuration options, and troubleshooting:

Read the Full Deployment Guide

Topics covered:

  • Prerequisites and hardware requirements
  • Docker configuration
  • Environment variables
  • Production deployment with nginx
  • Troubleshooting and monitoring
  • Multi-GPU setup

Architecture

The server runs as a single process with the vLLM engine embedded directly in the FastAPI application. This eliminates the HTTP hop between API server and LLM engine, reducing latency and operational complexity.

┌──────────────────────────────────┐
│         FastAPI Server           │  Port 8080
│                                  │
│  ┌────────────────────────────┐  │
│  │   Embedded vLLM Engine     │  │
│  │   (AsyncLLMEngine)         │  │
│  └──────────┬─────────────────┘  │
│             │                    │
│  ┌──────────▼─────────────────┐  │
│  │   SNAC Decoder             │  │
│  │   Token → PCM Audio        │  │
│  │   (SNAC_DEVICE: cpu/cuda)  │  │
│  └──────────┬─────────────────┘  │
│             │                    │
│  ┌──────────▼─────────────────┐  │
│  │   ffmpeg (format convert)  │  │
│  │   PCM → MP3/Opus/WAV/AAC   │  │
│  └────────────────────────────┘  │
└──────────────────────────────────┘

For detailed architecture documentation, see ARCHITECTURE.md.

Development

Project Structure

svara-tts-inference/
├── api/                    # FastAPI server
│   ├── server.py           # Main API endpoints + engine init
│   └── models.py           # Pydantic request/response models
├── tts_engine/             # Core TTS engine
│   ├── orchestrator.py     # TTS pipeline orchestration
│   ├── transports.py       # Embedded vLLM transport
│   ├── buffers.py          # Audio prebuffering + crossfade
│   ├── mapper.py           # Token-to-SNAC mapping
│   ├── codec.py            # SNAC encoder/decoder
│   ├── voice_config.py     # Voice profiles
│   ├── encoder.py          # Text-to-token encoding
│   ├── constants.py        # Token IDs and special tokens
│   └── utils.py            # Utilities (chunking, audio processing)
├── assets/                 # Voice config YAML files
├── examples/               # Example scripts
│   └── api_client.py       # API client examples
├── Dockerfile              # Docker image
├── docker-compose.yml      # Docker Compose config
├── supervisord.conf        # Process manager config
├── requirements.txt        # Python dependencies
└── .env.example            # Environment variable template

Local Development

# Install dependencies
pip install -r requirements.txt

# Configure environment (optional)
cp .env.example .env

# Start the server (vLLM engine starts embedded)
cd api && python server.py

The vLLM engine initializes in-process during FastAPI startup; no separate vLLM server is needed.

Requirements

Hardware

  • GPU: NVIDIA GPU with 16GB+ VRAM (recommended: 24GB+)
  • RAM: 16GB+ system RAM
  • Storage: 50GB+ free space

Software

  • Docker 20.10+
  • Docker Compose 2.0+
  • NVIDIA GPU Drivers
  • NVIDIA Container Toolkit

License

See LICENSE file for details.

Citation

If you use Svara TTS in your research, please cite:

@misc{svara-tts-v1,
  title={Svara TTS: Multilingual Text-to-Speech for Indic Languages},
  author={Kenpath},
  year={2024},
  url={https://huggingface.co/kenpath/svara-tts-v1}
}
