High-performance, high-concurrency Text-to-Speech server using Orpheus TTS with TensorRT-LLM. Designed for low latency AND many concurrent users on a single RTX 4090.
Based on Baseten's production approach.
- FP8 LLM + FP16 KV Cache - Optimal precision strategy for 4090
- Token-level Micro-batching - Serve 16-32 concurrent sessions efficiently
- Pre-allocated KV Cache - No allocation during inference
- Streaming Audio - First audio byte in <150ms
- Fair Scheduling - TTFB prioritization for new sessions
- Production-ready - gRPC, WebSocket, and REST APIs
| Metric | Target | Notes |
|---|---|---|
| TTFB | < 150ms | Time to first audio byte |
| Concurrent Sessions | 16-32 | With micro-batching |
| Token Throughput | 3000+ tok/s | Across all sessions |
| VRAM Usage | ~14GB | With a 32-session KV cache |

Reference: an RTX 3090 achieves 4 CCU (concurrent users) with a basic, non-batched setup.
```
┌─────────────────────────────────────┐
│             API Gateway             │
│        HTTP / WebSocket / gRPC      │
└─────────────────┬───────────────────┘
                  │
                  ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                         Batch Inference Engine                           │
│                                                                          │
│  ┌─────────────────────┐   ┌──────────────────────────────────────────┐  │
│  │  Session Scheduler  │   │            KV Cache Manager              │  │
│  │                     │   │                                          │  │
│  │ • Fair scheduling   │   │ • Pre-allocated for N sessions           │  │
│  │ • TTFB priority     │   │ • Slot-based allocation                  │  │
│  │ • Starvation prev.  │   │ • Block reuse for efficiency             │  │
│  └──────────┬──────────┘   └──────────────────────────────────────────┘  │
│             │                                                            │
│             ▼                                                            │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                        Main Inference Loop                         │  │
│  │                                                                    │  │
│  │  while True:                                                       │  │
│  │    1. sessions = scheduler.pick_sessions(max_batch_size=16)        │  │
│  │    2. batch_inputs = build_batch(sessions)  # Tokens + KV slots    │  │
│  │    3. next_tokens = llm_engine.forward(batch_inputs)  # FP8 LLM    │  │
│  │    4. for sess, tok in zip(sessions, next_tokens):                 │  │
│  │         sess.append_token(tok)                                     │  │
│  │         if sess.should_flush():                                    │  │
│  │             audio = snac.decode(sess.tokens)  # FP16 Decoder       │  │
│  │             sess.stream.send(audio)                                │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
```
GPU: RTX 4090 (24GB VRAM)

```
┌──────────────────────────────────────┐
│ FP8 Orpheus LLM (TensorRT-LLM)       │  ~3GB
│ FP16 KV Cache (32 slots × 4K ctx)    │  ~8GB
│ FP16 SNAC Decoder                    │  ~1GB
│ Workspace + Overhead                 │  ~2GB
└──────────────────────────────────────┘
```
Instead of processing one request at a time:
- Maintain active session pool with KV cache slots
- Pick up to N sessions for each forward pass
- Run batched forward - GPU processes all sessions in parallel
- Stream results back to each client
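The steps above can be sketched as a token-level micro-batching loop. This is a minimal illustration, not the repo's actual implementation: `forward` stands in for the batched TensorRT-LLM call, and the class and function names are assumptions.

```python
from collections import deque

class Session:
    def __init__(self, sid):
        self.sid, self.tokens = sid, []

def run_steps(sessions, forward, max_batch_size=16, steps=3):
    # Each step: pick up to max_batch_size sessions, run ONE batched
    # forward pass, append one new token per session, then return the
    # sessions to the pool so every session keeps making progress.
    pool = deque(sessions)
    for _ in range(steps):
        batch = [pool.popleft() for _ in range(min(max_batch_size, len(pool)))]
        next_tokens = forward([s.tokens for s in batch])  # one GPU call for all
        for s, tok in zip(batch, next_tokens):
            s.tokens.append(tok)
            pool.append(s)
    return sessions

# Dummy "engine" that emits token 0 for every session in the batch
done = run_steps([Session(0), Session(1)], lambda batch: [0] * len(batch))
```

The key point is that the GPU cost of one forward pass is amortized over the whole batch, so per-session throughput degrades much more slowly than with one-request-at-a-time serving.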
First chunk: 7 tokens → ~75ms audio (low TTFB)
Subsequent: 14 tokens → ~150ms audio (efficient batching)
SNAC needs a minimum of 7 tokens per audio frame (~10.67ms of audio per token).
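As a sketch, the flush rule implied by these chunk sizes looks like the following (parameter names are borrowed from the config section, but the function itself is illustrative):

```python
def should_flush(tokens_since_flush: int, first_chunk_sent: bool,
                 flush_first_n_tokens: int = 7,
                 flush_every_n_tokens: int = 14) -> bool:
    # First chunk flushes at 7 tokens (SNAC's minimum frame) to minimize
    # TTFB; later chunks wait for 14 tokens so the decoder runs less often.
    if not first_chunk_sent:
        return tokens_since_flush >= flush_first_n_tokens
    return tokens_since_flush >= flush_every_n_tokens
```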
- New sessions get priority until first audio sent (TTFB optimization)
- Round-robin among active sessions prevents starvation
- Timeout handling for slow clients
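A minimal sketch of this scheduling policy (field names like `tokens_done` and `last_served_step` are assumptions for illustration, not the repo's actual API):

```python
def pick_sessions(sessions, max_batch_size, ttfb_priority_tokens=20):
    # New sessions (fewer than ttfb_priority_tokens tokens generated) go
    # first so they reach first audio quickly; the rest are ordered by
    # when they were last served, which yields round-robin fairness and
    # prevents any session from starving.
    new = [s for s in sessions if s["tokens_done"] < ttfb_priority_tokens]
    active = sorted(
        (s for s in sessions if s["tokens_done"] >= ttfb_priority_tokens),
        key=lambda s: s["last_served_step"],  # least recently served first
    )
    return (new + active)[:max_batch_size]
```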
- NVIDIA RTX 4090 (or Ada Lovelace GPU with FP8 support)
- CUDA 12.x + cuDNN 8.9+
- Python 3.10+
- ~50GB disk space
```bash
cd orpheus-tts-tensorrt

# Install dependencies
chmod +x setup.sh
./setup.sh

# Activate environment
source venv/bin/activate

# Build TensorRT-LLM engine
python build_engine.py --config config/model_config.yaml

# Start production server
python production_server.py --port 8000
```

```bash
# Single request test
python client.py --text "Hello world" --output test.wav

# Benchmark with scaling
python benchmark.py --mode scaling --concurrent 32

# Full benchmark suite
python benchmark.py --mode all --requests 100 --concurrent 16 --output results.json
```

```yaml
scheduler:
  max_concurrent_sessions: 32   # Total sessions in KV cache
  max_batch_size: 16            # Sessions per forward pass
  ttfb_priority_tokens: 20      # Prioritize new sessions
  flush_first_n_tokens: 7       # First audio chunk (7 = minimum)
  flush_every_n_tokens: 14      # Subsequent chunks

tuning:
  # Conversational AI (prioritize latency)
  conversational:
    max_text_chars: 250
    target_concurrent: 24
    target_ttfb_ms: 100

  # Long-form TTS (prioritize throughput)
  longform:
    max_text_chars: 5000
    target_concurrent: 16
    target_ttfb_ms: 300
```

- Start conservative: 16 concurrent sessions, batch size 8
- Run the benchmark: `python benchmark.py --mode tuning`
- Check metrics: TTFB p95, success rate, tokens/sec
- Increase gradually: push concurrency up until TTFB degrades
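The tuning steps above can be expressed as a simple search loop. In this sketch, `measure_ttfb_p95` is a placeholder for running `benchmark.py` at a given concurrency and reading the p95 TTFB from its output; the function name and defaults are assumptions.

```python
def find_max_concurrency(measure_ttfb_p95, start=16, limit=64,
                         ttfb_budget_ms=150, step=4):
    # Raise concurrency step by step and keep the last level whose
    # measured TTFB p95 still fits within the latency budget.
    best, c = start, start
    while c <= limit and measure_ttfb_p95(c) <= ttfb_budget_ms:
        best = c
        c += step
    return best
```

For example, with a (fake) workload where p95 TTFB grows linearly with concurrency, the loop stops at the last level under budget.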
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello!", "voice": "tara"}' \
  --output speech.wav
```

```python
import asyncio
import json

import websockets

async def stream_tts():
    async with websockets.connect("ws://localhost:8000/v1/audio/stream") as ws:
        await ws.send(json.dumps({"text": "Hello!", "voice": "tara"}))
        while True:
            msg = await ws.recv()
            if isinstance(msg, bytes):
                # Audio chunk
                play_audio(msg)
            else:
                data = json.loads(msg)
                if data.get("done"):
                    print(f"TTFB: {data['metrics']['ttfb_ms']:.0f}ms")
                    break

asyncio.run(stream_tts())
```

```json
{
  "scheduler": {
    "total_sessions": 1234,
    "active_sessions": 12,
    "pending_sessions": 3,
    "max_concurrent": 32
  },
  "kv_cache": {
    "used_slots": 15,
    "free_slots": 17
  },
  "total_tokens": 567890,
  "total_batches": 12345
}
```

| Component | Precision | Why |
|---|---|---|
| Orpheus LLM | FP8 | 4090 native support, 2x throughput |
| KV Cache | FP16 | Quality preservation |
| SNAC Decoder | FP16 | Avoid audio artifacts |
DO NOT use INT4/INT8 quantization here: it causes prosody flattening in TTS.
```
orpheus-tts-tensorrt/
├── production_server.py     # Production server with micro-batching
├── streaming_server.py      # Simpler streaming server
├── build_engine.py          # TensorRT-LLM engine builder
├── benchmark.py             # Comprehensive benchmarking
├── client.py                # Async client
├── setup.sh                 # Installation script
├── config/
│   └── model_config.yaml    # All tuning parameters
├── src/
│   ├── scheduler.py         # Session scheduler + KV cache manager
│   ├── batched_engine.py    # TensorRT-LLM batched inference
│   ├── llm_engine.py        # LLM engine wrapper
│   └── snac_decoder.py      # SNAC audio decoder
├── models/                  # Downloaded models
├── checkpoints/             # Converted checkpoints
└── engines/                 # Built TensorRT engines
```
- Reduce `max_batch_size` (smaller batches = faster per step)
- Increase `ttfb_priority_tokens` (prioritize new sessions longer)
- Check GPU utilization (should be 90%+)
- Check `session_timeout_s` in config
- Ensure the network isn't the bottleneck
- Monitor queue depth
```python
# Calculate KV cache size (FP16 = 2 bytes per value; the extra factor of 2 is K+V)
num_slots, num_layers, num_kv_heads, head_dim, max_seq_len = 32, 28, 8, 128, 4096
kv_cache_gb = (
    num_slots * num_layers * 2 * num_kv_heads * head_dim * max_seq_len * 2
) / 1e9
# 32 slots × 28 layers × 2 (K+V) × 8 heads × 128 dim × 4096 seq × 2 bytes ≈ 15GB
```

Reduce `num_slots` or `max_context_len` if needed; for example, halving the context to 2048 brings this to roughly 7.5GB, in line with the ~8GB KV cache budget above.
- Orpheus TTS by Canopy Labs
- TensorRT-LLM by NVIDIA
- SNAC Audio Codec
- Architecture inspired by Baseten's production deployment
MIT License