diff --git a/README.md b/README.md
index eb77204..6eafa41 100644
--- a/README.md
+++ b/README.md
@@ -98,6 +98,8 @@ See [docs/deployment.md](docs/deployment.md) for full deployment guide.
 | POST | `/api/v1/stt/transcribe` | Batch transcription |
 | POST | `/api/v1/stt/transcribe/stream` | SSE streaming transcription |
 | WS | `/api/v1/stt/transcribe/ws` | WebSocket real-time transcription |
+| POST | `/api/v1/tts/synthesize` | Batch text-to-speech |
+| POST | `/api/v1/tts/synthesize/stream` | Streaming text-to-speech |
 
 See [docs/api.md](docs/api.md) for full API documentation.
 
@@ -128,6 +130,20 @@ curl -N -X POST "http://localhost:8000/api/v1/stt/transcribe/stream?engine=whisp
   -F "file=@/path/to/audio.wav"
 ```
 
+### 🔊 Text-to-Speech (TTS)
+
+**Batch Synthesis**
+
+```bash
+curl -X POST "http://localhost:8000/api/v1/tts/synthesize?engine=coqui&text=Hello%20world&voice=en-US-1&speed=1.0"
+```
+
+**Streaming Synthesis**
+
+```bash
+curl -N -X POST "http://localhost:8000/api/v1/tts/synthesize/stream?engine=coqui&text=Hello%20world"
+```
+
 ---
 
 ## 📚 Documentation
@@ -137,6 +153,7 @@ Detailed documentation is available in the `docs/` directory:
 * **[📖 API Reference](docs/api.md)**: Full details on all REST, SSE, and WebSocket endpoints.
 * **[⚙️ Configuration Guide](docs/configuration.md)**: How to configure engines, environment variables, and the `engines.yaml` file.
 * **[🛠️ Custom Engines](docs/custom-engines.md)**: A step-by-step guide to building and integrating your own STT or TTS engines.
+* **[🚀 Deployment Guide](docs/deployment.md)**: Docker and production deployment.
 
 ---
 
diff --git a/docs/api.md b/docs/api.md
index ef81419..31cd72e 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -238,6 +238,107 @@ ws.onmessage = (event) => {
 
 ---
 
+### Text-to-Speech (TTS)
+
+#### POST /tts/synthesize
+
+Batch synthesis - convert text to speech audio.
+
+**Query Parameters:**
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `engine` | string | Yes | Engine name (e.g., "coqui") |
+| `text` | string | Yes | Text to synthesize |
+| `voice` | string | No | Voice name/ID to use |
+| `speed` | float | No | Speech speed multiplier (0 < speed <= 3.0, default: 1.0) |
+| `engine_params` | string | No | JSON string with engine-specific parameters |
+
+**Example:**
+
+```bash
+curl -X POST "http://localhost:8000/api/v1/tts/synthesize?engine=coqui&text=Hello%20world&voice=en-US-1&speed=1.0"
+```
+
+**Response:**
+
+```json
+{
+  "audio_data": "",
+  "sample_rate": 22050,
+  "duration_seconds": 1.5,
+  "format": "wav",
+  "performance_metrics": {
+    "latency_ms": 250.5,
+    "processing_time_ms": 200.0,
+    "audio_duration_ms": 1500.0,
+    "real_time_factor": 0.13,
+    "characters_processed": 11
+  }
+}
+```
+
+#### POST /tts/synthesize/stream
+
+SSE (Server-Sent Events) streaming synthesis - receive progressive audio chunks.
+
+**Query Parameters:**
+
+Same as `/tts/synthesize`.
+
+**Response:**
+
+Server-Sent Events stream with two event types:
+
+1. **chunk** - Partial audio data
+
+```
+event: chunk
+data: {"audio_data": "", "sequence_number": 0, "chunk_latency_ms": 25.5}
+```
+
+2. **complete** - Final complete response
+
+```
+event: complete
+data: {"audio_data": "", "sample_rate": 22050, "duration_seconds": 1.5, "format": "wav", "performance_metrics": {...}}
+```
+
+**Example:**
+
+```bash
+curl -N -X POST "http://localhost:8000/api/v1/tts/synthesize/stream?engine=coqui&text=Hello%20world"
+```
+
+**JavaScript Client Example:**
+
+```javascript
+const params = new URLSearchParams({
+  engine: 'coqui',
+  text: 'Hello, how are you today?',
+  voice: 'en-US-1',
+  speed: '1.0'
+});
+// Caveat: EventSource only issues GET requests; this works only if the route also accepts GET.
+const eventSource = new EventSource(
+  `http://localhost:8000/api/v1/tts/synthesize/stream?${params}`
+);
+
+eventSource.addEventListener('chunk', (event) => {
+  const chunk = JSON.parse(event.data);
+  // Process audio chunk
+  console.log('Chunk:', chunk.sequence_number);
+});
+
+eventSource.addEventListener('complete', (event) => {
+  const result = JSON.parse(event.data);
+  console.log('Complete, duration:', result.duration_seconds);
+  eventSource.close();
+});
+```
+
+---
+
 ## Data Models
 
 ### STTResponse
@@ -287,6 +388,42 @@ Performance metrics for transcription.
 | `total_stream_duration_ms` | float | Total streaming duration |
 | `total_chunks` | int | Number of chunks (streaming) |
 
+### TTSResponse
+
+Complete synthesis response.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `audio_data` | bytes | Complete generated audio |
+| `sample_rate` | int | Audio sample rate in Hz |
+| `duration_seconds` | float | Audio duration in seconds |
+| `format` | string | Audio format (wav, mp3, etc.) |
+| `performance_metrics` | TTSPerformanceMetrics | Performance metrics |
+
+### TTSChunk
+
+Streaming chunk for partial audio.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `audio_data` | bytes | Audio chunk bytes |
+| `sequence_number` | int | Chunk sequence for ordering |
+| `chunk_latency_ms` | float | Generation latency for this chunk |
+
+### TTSPerformanceMetrics
+
+Performance metrics for synthesis.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `latency_ms` | float | Total end-to-end latency |
+| `processing_time_ms` | float | Model processing time |
+| `real_time_factor` | float | Processing time / audio duration |
+| `characters_per_second` | float | Text-to-audio synthesis speed |
+| `time_to_first_byte_ms` | float | Time to first audio byte (streaming) |
+| `total_stream_duration_ms` | float | Total streaming duration |
+| `total_chunks` | int | Number of chunks (streaming) |
+
 ---
 
 ## Error Responses
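
Review note on the streaming client: `/tts/synthesize/stream` is documented as a POST route, while the browser `EventSource` API can only send GET requests. One possible alternative is to read the POST response with `fetch` and split the SSE text by hand. The sketch below is a minimal illustration, not part of this patch. `parseSseEvents` and `streamSynthesis` are hypothetical helper names, and it assumes every `data:` payload is a JSON object, as in the `chunk` and `complete` examples in docs/api.md.

```javascript
// Split buffered SSE text into complete events.
// Returns the parsed events plus any trailing, still-incomplete text.
function parseSseEvents(buffer) {
  const events = [];
  // Normalise CRLF so that a blank line is always "\n\n".
  const blocks = buffer.replace(/\r\n/g, '\n').split('\n\n');
  const rest = blocks.pop(); // last piece has no terminating blank line yet
  for (const block of blocks) {
    let type = 'message'; // SSE default when no "event:" field is present
    const data = [];
    for (const line of block.split('\n')) {
      if (line.startsWith('event:')) {
        type = line.slice(6).trim();
      } else if (line.startsWith('data:')) {
        data.push(line.slice(5).replace(/^ /, ''));
      }
    }
    // Assumption: payloads are JSON, as in the documented examples.
    if (data.length > 0) {
      events.push({ type, data: JSON.parse(data.join('\n')) });
    }
  }
  return { events, rest };
}

// Hypothetical helper: POST to the stream route and hand each event to onEvent.
async function streamSynthesis(url, onEvent) {
  const response = await fetch(url, { method: 'POST' });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const parsed = parseSseEvents(buffer);
    buffer = parsed.rest; // keep the partial event for the next read
    parsed.events.forEach(onEvent);
  }
}
```

A caller could pass the same URL built in the `EventSource` example and switch on `event.type` (`chunk` or `complete`) inside `onEvent`. Keeping the parser separate from the network code makes it straightforward to unit-test against the documented event format.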