
Orpheus TTS with TensorRT-LLM for RTX 4090

High-performance, high-concurrency Text-to-Speech server using Orpheus TTS with TensorRT-LLM. Designed for both low latency and many concurrent users on a single RTX 4090.

Based on Baseten's production approach.

Key Features

  • FP8 LLM + FP16 KV Cache - Optimal precision strategy for 4090
  • Token-level Micro-batching - Serve 16-32 concurrent sessions efficiently
  • Pre-allocated KV Cache - No allocation during inference
  • Streaming Audio - First audio byte in <150ms
  • Fair Scheduling - TTFB prioritization for new sessions
  • Production-ready - gRPC, WebSocket, and REST APIs

Performance Targets (RTX 4090)

Metric               Target        Notes
TTFB                 < 150 ms      Time to first audio byte
Concurrent Sessions  16-32         With micro-batching
Token Throughput     3000+ tok/s   Across all sessions
VRAM Usage           ~14 GB        With 32-session KV cache

Reference point: an RTX 3090 achieves ~4 concurrent users (CCU) with a basic setup.

Architecture

                           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                           β”‚          API Gateway                β”‚
                           β”‚   HTTP / WebSocket / gRPC           β”‚
                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                           β”‚
                                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         Batch Inference Engine                            β”‚
β”‚                                                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚   Session Scheduler β”‚    β”‚          KV Cache Manager               β”‚  β”‚
β”‚  β”‚                     β”‚    β”‚                                         β”‚  β”‚
β”‚  β”‚  β€’ Fair scheduling  β”‚    β”‚  β€’ Pre-allocated for N sessions         β”‚  β”‚
β”‚  β”‚  β€’ TTFB priority    β”‚    β”‚  β€’ Slot-based allocation                β”‚  β”‚
β”‚  β”‚  β€’ Starvation prev  β”‚    β”‚  β€’ Block reuse for efficiency           β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚             β”‚                                                             β”‚
β”‚             β–Ό                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                     Main Inference Loop                            β”‚   β”‚
β”‚  β”‚                                                                    β”‚   β”‚
β”‚  β”‚  while True:                                                       β”‚   β”‚
β”‚  β”‚    1. sessions = scheduler.pick_sessions(max_batch_size=16)       β”‚   β”‚
β”‚  β”‚    2. batch_inputs = build_batch(sessions)  # Tokens + KV slots   β”‚   β”‚
β”‚  β”‚    3. next_tokens = llm_engine.forward(batch_inputs)  # FP8 LLM   β”‚   β”‚
β”‚  β”‚    4. for sess, tok in zip(sessions, next_tokens):                β”‚   β”‚
β”‚  β”‚         sess.append_token(tok)                                     β”‚   β”‚
β”‚  β”‚         if sess.should_flush():                                    β”‚   β”‚
β”‚  β”‚           audio = snac.decode(sess.tokens)  # FP16 Decoder        β”‚   β”‚
β”‚  β”‚           sess.stream.send(audio)                                  β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

                           GPU: RTX 4090 (24GB VRAM)
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   FP8 Orpheus LLM (TensorRT-LLM)     β”‚ ~3GB
                    β”‚   FP16 KV Cache (32 slots Γ— 4K ctx)  β”‚ ~8GB
                    β”‚   FP16 SNAC Decoder                  β”‚ ~1GB
                    β”‚   Workspace + Overhead               β”‚ ~2GB
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Concurrency Strategy

Token-level Micro-batching

Instead of processing one request at a time:

  1. Maintain active session pool with KV cache slots
  2. Pick up to N sessions for each forward pass
  3. Run batched forward - GPU processes all sessions in parallel
  4. Stream results back to each client
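The four steps above can be condensed into a runnable toy (the `Session` class and `step` function here are illustrative, not the repo's actual API; a real `step` would run one batched GPU forward pass instead of the Python loop):

```python
# Toy sketch of token-level micro-batching: each "step" advances up to
# max_batch_size sessions by one token, so many sessions share the GPU.
class Session:
    def __init__(self, sid, remaining):
        self.sid, self.tokens, self.remaining = sid, [], remaining

def step(active, max_batch_size=4):
    """One forward 'pass': generate one token for up to N sessions."""
    batch = [s for s in active if s.remaining > 0][:max_batch_size]
    for s in batch:                       # stands in for one batched GPU forward
        s.tokens.append(len(s.tokens))    # fake "next token"
        s.remaining -= 1
    return batch

sessions = [Session(i, 3) for i in range(6)]
while any(s.remaining for s in sessions):
    step(sessions)
```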

Audio Flushing Strategy

First chunk:  7 tokens  β†’ ~75ms audio  (low TTFB)
Subsequent:  14 tokens  β†’ ~150ms audio (efficient batching)

SNAC needs 7 tokens minimum per audio frame (~10.67ms audio).
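A minimal sketch of this flush decision, using the two thresholds from the strategy above (the function name is illustrative):

```python
def should_flush(tokens_buffered, chunks_sent, first_n=7, every_n=14):
    """Decide when to hand buffered SNAC tokens to the audio decoder."""
    if chunks_sent == 0:
        return tokens_buffered >= first_n   # small first chunk for low TTFB
    return tokens_buffered >= every_n       # larger chunks batch more efficiently
```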

Fair Scheduling

  • New sessions get priority until first audio sent (TTFB optimization)
  • Round-robin among active sessions prevents starvation
  • Timeout handling for slow clients
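The first two policies can be expressed as a single sort key, sketched below (the `Session` fields and `pick_sessions` signature are illustrative, not the actual `src/scheduler.py` interface):

```python
from dataclasses import dataclass

@dataclass
class Session:
    sid: int
    first_audio_sent: bool = False
    last_picked: int = 0

def pick_sessions(sessions, step, max_batch_size=16):
    # Sessions still awaiting their first audio sort first (TTFB priority);
    # ties break by least-recently-picked, giving round-robin without starvation.
    ranked = sorted(sessions, key=lambda s: (s.first_audio_sent, s.last_picked))
    batch = ranked[:max_batch_size]
    for s in batch:
        s.last_picked = step
    return batch
```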

Quick Start

Prerequisites

  • NVIDIA RTX 4090 (or Ada Lovelace GPU with FP8 support)
  • CUDA 12.x + cuDNN 8.9+
  • Python 3.10+
  • ~50GB disk space

Installation

cd orpheus-tts-tensorrt

# Install dependencies
chmod +x setup.sh
./setup.sh

# Activate environment
source venv/bin/activate

# Build TensorRT-LLM engine
python build_engine.py --config config/model_config.yaml

# Start production server
python production_server.py --port 8000

Test Concurrency

# Single request test
python client.py --text "Hello world" --output test.wav

# Benchmark with scaling
python benchmark.py --mode scaling --concurrent 32

# Full benchmark suite
python benchmark.py --mode all --requests 100 --concurrent 16 --output results.json

Tuning Guide

Key Parameters (in config/model_config.yaml)

scheduler:
  max_concurrent_sessions: 32   # Total sessions in KV cache
  max_batch_size: 16            # Sessions per forward pass
  ttfb_priority_tokens: 20      # Prioritize new sessions
  flush_first_n_tokens: 7       # First audio chunk (7 = minimum)
  flush_every_n_tokens: 14      # Subsequent chunks

tuning:
  # Conversational AI (prioritize latency)
  conversational:
    max_text_chars: 250
    target_concurrent: 24
    target_ttfb_ms: 100

  # Long-form TTS (prioritize throughput)
  longform:
    max_text_chars: 5000
    target_concurrent: 16
    target_ttfb_ms: 300

Finding Your Sweet Spot

  1. Start conservative: 16 concurrent, 8 batch size
  2. Run benchmark: python benchmark.py --mode tuning
  3. Check metrics: TTFB p95, success rate, tokens/sec
  4. Increase gradually: Push concurrent up until TTFB degrades
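Step 4 amounts to a simple feedback rule, sketched here with illustrative defaults (the 150 ms target and step size of 4 are assumptions, not repo constants):

```python
def next_concurrency(current, ttfb_p95_ms, target_ms=150, step=4, max_sessions=32):
    """Nudge concurrency up while p95 TTFB holds; back off when it degrades."""
    if ttfb_p95_ms <= target_ms:
        return min(current + step, max_sessions)
    return max(current - step, 1)
```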

API Reference

POST /v1/audio/speech (Streaming)

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello!", "voice": "tara"}' \
  --output speech.wav

WebSocket /v1/audio/stream (Real-time)

import asyncio
import websockets
import json

async def stream_tts():
    async with websockets.connect("ws://localhost:8000/v1/audio/stream") as ws:
        await ws.send(json.dumps({"text": "Hello!", "voice": "tara"}))
        
        while True:
            msg = await ws.recv()
            if isinstance(msg, bytes):
                play_audio(msg)  # app-specific: play or buffer the audio chunk
            else:
                data = json.loads(msg)
                if data.get("done"):
                    print(f"TTFB: {data['metrics']['ttfb_ms']:.0f}ms")
                    break

asyncio.run(stream_tts())

GET /v1/stats

{
  "scheduler": {
    "total_sessions": 1234,
    "active_sessions": 12,
    "pending_sessions": 3,
    "max_concurrent": 32
  },
  "kv_cache": {
    "used_slots": 15,
    "free_slots": 17
  },
  "total_tokens": 567890,
  "total_batches": 12345
}
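For monitoring, the endpoint can be polled with just the stdlib; a sketch, assuming the response shape shown above (`kv_utilization` is a hypothetical helper, not part of the server):

```python
import json
import urllib.request

def fetch_stats(base="http://localhost:8000"):
    """Fetch the /v1/stats JSON from a running server."""
    with urllib.request.urlopen(f"{base}/v1/stats") as resp:
        return json.load(resp)

def kv_utilization(stats):
    """Fraction of KV cache slots currently in use."""
    kv = stats["kv_cache"]
    return kv["used_slots"] / (kv["used_slots"] + kv["free_slots"])
```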

Precision Strategy

Component     Precision  Why
Orpheus LLM   FP8        Native 4090 support, 2x throughput
KV Cache      FP16       Quality preservation
SNAC Decoder  FP16       Avoids audio artifacts

DO NOT use INT4/INT8 quantization for the LLM - it causes prosody flattening in TTS.

Project Structure

orpheus-tts-tensorrt/
β”œβ”€β”€ production_server.py     # Production server with micro-batching
β”œβ”€β”€ streaming_server.py      # Simpler streaming server
β”œβ”€β”€ build_engine.py          # TensorRT-LLM engine builder
β”œβ”€β”€ benchmark.py             # Comprehensive benchmarking
β”œβ”€β”€ client.py                # Async client
β”œβ”€β”€ setup.sh                 # Installation script
β”œβ”€β”€ config/
β”‚   └── model_config.yaml    # All tuning parameters
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ scheduler.py         # Session scheduler + KV cache manager
β”‚   β”œβ”€β”€ batched_engine.py    # TensorRT-LLM batched inference
β”‚   β”œβ”€β”€ llm_engine.py        # LLM engine wrapper
β”‚   └── snac_decoder.py      # SNAC audio decoder
β”œβ”€β”€ models/                  # Downloaded models
β”œβ”€β”€ checkpoints/             # Converted checkpoints
└── engines/                 # Built TensorRT engines

Troubleshooting

High TTFB at High Concurrency

  1. Reduce max_batch_size (smaller batches = faster per-step)
  2. Increase ttfb_priority_tokens (prioritize new sessions longer)
  3. Check GPU utilization - should be 90%+

Sessions Timing Out

  1. Check session_timeout_s in config
  2. Ensure network isn't bottleneck
  3. Monitor queue depth

VRAM Issues

# Estimate FP16 KV cache size (2 bytes/element; K and V stored per layer)
num_slots, num_layers, num_kv_heads = 32, 28, 8
head_dim, max_seq_len, bytes_per_elem = 128, 4096, 2

kv_cache_gb = (
    num_slots * num_layers * 2          # K + V
    * num_kv_heads * head_dim * max_seq_len * bytes_per_elem
) / 1e9
# 32 slots × 28 layers × 2 × 8 heads × 128 dim × 4096 seq × 2 bytes ≈ 15.0 GB

At the full 4K context this exceeds the ~8 GB KV cache budget above - reduce num_slots or max_context_len to fit.

Credits

License

MIT License
