
Orpheus TTS with TensorRT-LLM for RTX 4090

High-performance, high-concurrency Text-to-Speech server using Orpheus TTS with TensorRT-LLM. Designed for both low latency and many concurrent users on a single RTX 4090.

Based on Baseten's production approach.

Key Features

  • FP8 LLM + FP16 KV Cache - Optimal precision strategy for 4090
  • Token-level Micro-batching - Serve 16-32 concurrent sessions efficiently
  • Pre-allocated KV Cache - No allocation during inference
  • Streaming Audio - First audio byte in <150ms
  • Fair Scheduling - TTFB prioritization for new sessions
  • Production-ready - gRPC, WebSocket, and REST APIs

Performance Targets (RTX 4090)

Metric               Target        Notes
TTFB                 < 150 ms      Time to first audio byte
Concurrent Sessions  16-32         With micro-batching
Token Throughput     3000+ tok/s   Across all sessions
VRAM Usage           ~14 GB        With 32-session KV cache

Reference point: an RTX 3090 achieves ~4 concurrent users (CCU) with a basic setup.

Architecture

                           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                           β”‚          API Gateway                β”‚
                           β”‚   HTTP / WebSocket / gRPC           β”‚
                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                           β”‚
                                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         Batch Inference Engine                            β”‚
β”‚                                                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚   Session Scheduler β”‚    β”‚          KV Cache Manager               β”‚  β”‚
β”‚  β”‚                     β”‚    β”‚                                         β”‚  β”‚
β”‚  β”‚  β€’ Fair scheduling  β”‚    β”‚  β€’ Pre-allocated for N sessions         β”‚  β”‚
β”‚  β”‚  β€’ TTFB priority    β”‚    β”‚  β€’ Slot-based allocation                β”‚  β”‚
β”‚  β”‚  β€’ Starvation prev  β”‚    β”‚  β€’ Block reuse for efficiency           β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚             β”‚                                                             β”‚
β”‚             β–Ό                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                     Main Inference Loop                            β”‚   β”‚
β”‚  β”‚                                                                    β”‚   β”‚
β”‚  β”‚  while True:                                                       β”‚   β”‚
β”‚  β”‚    1. sessions = scheduler.pick_sessions(max_batch_size=16)       β”‚   β”‚
β”‚  β”‚    2. batch_inputs = build_batch(sessions)  # Tokens + KV slots   β”‚   β”‚
β”‚  β”‚    3. next_tokens = llm_engine.forward(batch_inputs)  # FP8 LLM   β”‚   β”‚
β”‚  β”‚    4. for sess, tok in zip(sessions, next_tokens):                β”‚   β”‚
β”‚  β”‚         sess.append_token(tok)                                     β”‚   β”‚
β”‚  β”‚         if sess.should_flush():                                    β”‚   β”‚
β”‚  β”‚           audio = snac.decode(sess.tokens)  # FP16 Decoder        β”‚   β”‚
β”‚  β”‚           sess.stream.send(audio)                                  β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

                           GPU: RTX 4090 (24GB VRAM)
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   FP8 Orpheus LLM (TensorRT-LLM)     β”‚ ~3GB
                    β”‚   FP16 KV Cache (32 slots Γ— 4K ctx)  β”‚ ~8GB
                    β”‚   FP16 SNAC Decoder                  β”‚ ~1GB
                    β”‚   Workspace + Overhead               β”‚ ~2GB
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Concurrency Strategy

Token-level Micro-batching

Instead of processing one request at a time:

  1. Maintain active session pool with KV cache slots
  2. Pick up to N sessions for each forward pass
  3. Run batched forward - GPU processes all sessions in parallel
  4. Stream results back to each client
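The four steps above can be condensed into a runnable toy (the `Session` class and `step` function here are illustrative, not the repo's actual API; a real `step` would run one batched GPU forward pass instead of the Python loop):

```python
# Toy sketch of token-level micro-batching: each "step" advances up to
# max_batch_size sessions by one token, so many sessions share the GPU.
class Session:
    def __init__(self, sid, remaining):
        self.sid, self.tokens, self.remaining = sid, [], remaining

def step(active, max_batch_size=4):
    """One forward 'pass': generate one token for up to N sessions."""
    batch = [s for s in active if s.remaining > 0][:max_batch_size]
    for s in batch:                       # stands in for one batched GPU forward
        s.tokens.append(len(s.tokens))    # fake "next token"
        s.remaining -= 1
    return batch

sessions = [Session(i, 3) for i in range(6)]
while any(s.remaining for s in sessions):
    step(sessions)
```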

Audio Flushing Strategy

First chunk:  7 tokens  β†’ ~75ms audio  (low TTFB)
Subsequent:  14 tokens  β†’ ~150ms audio (efficient batching)

SNAC needs 7 tokens minimum per audio frame (~10.67ms audio).
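A minimal sketch of this flush decision, using the two thresholds from the strategy above (the function name is illustrative):

```python
def should_flush(tokens_buffered, chunks_sent, first_n=7, every_n=14):
    """Decide when to hand buffered SNAC tokens to the audio decoder."""
    if chunks_sent == 0:
        return tokens_buffered >= first_n   # small first chunk for low TTFB
    return tokens_buffered >= every_n       # larger chunks batch more efficiently
```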

Fair Scheduling

  • New sessions get priority until first audio sent (TTFB optimization)
  • Round-robin among active sessions prevents starvation
  • Timeout handling for slow clients
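The first two policies can be expressed as a single sort key, sketched below (the `Session` fields and `pick_sessions` signature are illustrative, not the actual `src/scheduler.py` interface):

```python
from dataclasses import dataclass

@dataclass
class Session:
    sid: int
    first_audio_sent: bool = False
    last_picked: int = 0

def pick_sessions(sessions, step, max_batch_size=16):
    # Sessions still awaiting their first audio sort first (TTFB priority);
    # ties break by least-recently-picked, giving round-robin without starvation.
    ranked = sorted(sessions, key=lambda s: (s.first_audio_sent, s.last_picked))
    batch = ranked[:max_batch_size]
    for s in batch:
        s.last_picked = step
    return batch
```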

Quick Start

Prerequisites

  • NVIDIA RTX 4090 (or Ada Lovelace GPU with FP8 support)
  • CUDA 12.x + cuDNN 8.9+
  • Python 3.10+
  • ~50GB disk space

Installation

cd orpheus-tts-tensorrt

# Install dependencies
chmod +x setup.sh
./setup.sh

# Activate environment
source venv/bin/activate

# Build TensorRT-LLM engine
python build_engine.py --config config/model_config.yaml

# Start production server
python production_server.py --port 8000

Test Concurrency

# Single request test
python client.py --text "Hello world" --output test.wav

# Benchmark with scaling
python benchmark.py --mode scaling --concurrent 32

# Full benchmark suite
python benchmark.py --mode all --requests 100 --concurrent 16 --output results.json

Tuning Guide

Key Parameters (in config/model_config.yaml)

scheduler:
  max_concurrent_sessions: 32   # Total sessions in KV cache
  max_batch_size: 16            # Sessions per forward pass
  ttfb_priority_tokens: 20      # Prioritize new sessions
  flush_first_n_tokens: 7       # First audio chunk (7 = minimum)
  flush_every_n_tokens: 14      # Subsequent chunks

tuning:
  # Conversational AI (prioritize latency)
  conversational:
    max_text_chars: 250
    target_concurrent: 24
    target_ttfb_ms: 100

  # Long-form TTS (prioritize throughput)
  longform:
    max_text_chars: 5000
    target_concurrent: 16
    target_ttfb_ms: 300

Finding Your Sweet Spot

  1. Start conservative: 16 concurrent, 8 batch size
  2. Run benchmark: python benchmark.py --mode tuning
  3. Check metrics: TTFB p95, success rate, tokens/sec
  4. Increase gradually: Push concurrent up until TTFB degrades
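Step 4 amounts to a simple feedback rule, sketched here with illustrative defaults (the 150 ms target and step size of 4 are assumptions, not repo constants):

```python
def next_concurrency(current, ttfb_p95_ms, target_ms=150, step=4, max_sessions=32):
    """Nudge concurrency up while p95 TTFB holds; back off when it degrades."""
    if ttfb_p95_ms <= target_ms:
        return min(current + step, max_sessions)
    return max(current - step, 1)
```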

API Reference

POST /v1/audio/speech (Streaming)

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello!", "voice": "tara"}' \
  --output speech.wav

WebSocket /v1/audio/stream (Real-time)

import asyncio
import websockets
import json

async def stream_tts():
    async with websockets.connect("ws://localhost:8000/v1/audio/stream") as ws:
        await ws.send(json.dumps({"text": "Hello!", "voice": "tara"}))
        
        while True:
            msg = await ws.recv()
            if isinstance(msg, bytes):
                play_audio(msg)  # app-specific: play or buffer the audio chunk
            else:
                data = json.loads(msg)
                if data.get("done"):
                    print(f"TTFB: {data['metrics']['ttfb_ms']:.0f}ms")
                    break

asyncio.run(stream_tts())

GET /v1/stats

{
  "scheduler": {
    "total_sessions": 1234,
    "active_sessions": 12,
    "pending_sessions": 3,
    "max_concurrent": 32
  },
  "kv_cache": {
    "used_slots": 15,
    "free_slots": 17
  },
  "total_tokens": 567890,
  "total_batches": 12345
}
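For monitoring, the endpoint can be polled with just the stdlib; a sketch, assuming the response shape shown above (`kv_utilization` is a hypothetical helper, not part of the server):

```python
import json
import urllib.request

def fetch_stats(base="http://localhost:8000"):
    """Fetch the /v1/stats JSON from a running server."""
    with urllib.request.urlopen(f"{base}/v1/stats") as resp:
        return json.load(resp)

def kv_utilization(stats):
    """Fraction of KV cache slots currently in use."""
    kv = stats["kv_cache"]
    return kv["used_slots"] / (kv["used_slots"] + kv["free_slots"])
```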

Precision Strategy

Component     Precision  Why
Orpheus LLM   FP8        Native 4090 support, 2x throughput
KV Cache      FP16       Quality preservation
SNAC Decoder  FP16       Avoids audio artifacts

DO NOT use INT4/INT8 quantization for the LLM - it causes prosody flattening in TTS.

Project Structure

orpheus-tts-tensorrt/
β”œβ”€β”€ production_server.py     # Production server with micro-batching
β”œβ”€β”€ streaming_server.py      # Simpler streaming server
β”œβ”€β”€ build_engine.py          # TensorRT-LLM engine builder
β”œβ”€β”€ benchmark.py             # Comprehensive benchmarking
β”œβ”€β”€ client.py                # Async client
β”œβ”€β”€ setup.sh                 # Installation script
β”œβ”€β”€ config/
β”‚   └── model_config.yaml    # All tuning parameters
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ scheduler.py         # Session scheduler + KV cache manager
β”‚   β”œβ”€β”€ batched_engine.py    # TensorRT-LLM batched inference
β”‚   β”œβ”€β”€ llm_engine.py        # LLM engine wrapper
β”‚   └── snac_decoder.py      # SNAC audio decoder
β”œβ”€β”€ models/                  # Downloaded models
β”œβ”€β”€ checkpoints/             # Converted checkpoints
└── engines/                 # Built TensorRT engines

Troubleshooting

High TTFB at High Concurrency

  1. Reduce max_batch_size (smaller batches = faster per-step)
  2. Increase ttfb_priority_tokens (prioritize new sessions longer)
  3. Check GPU utilization - should be 90%+

Sessions Timing Out

  1. Check session_timeout_s in config
  2. Ensure network isn't bottleneck
  3. Monitor queue depth

VRAM Issues

# Estimate FP16 KV cache size (2 bytes/element; K and V stored per layer)
num_slots, num_layers, num_kv_heads = 32, 28, 8
head_dim, max_seq_len, bytes_per_elem = 128, 4096, 2

kv_cache_gb = (
    num_slots * num_layers * 2          # K + V
    * num_kv_heads * head_dim * max_seq_len * bytes_per_elem
) / 1e9
# 32 slots × 28 layers × 2 × 8 heads × 128 dim × 4096 seq × 2 bytes ≈ 15.0 GB

At the full 4K context this exceeds the ~8 GB KV cache budget above - reduce num_slots or max_context_len to fit.

Credits

License

MIT License
