Lowering TTS latency #294
Hi there,

Can the method used in the RVC example (https://github.com/KoljaB/RealtimeTTS/tree/master/example_rvc) be faster than a direct stream feed and play? At the moment I get about 2 seconds between STT VAD stop and sound coming out of the speaker (which is great). I did try play_async, but it was slower, and I don't mind blocking while synthesis is running (I'll add a signal to stop it later if needed).

The idea, if my understanding is correct, is to keep the stream active and feed it text via push_text(text: str), so it can be synthesized as soon as it is generated instead of calling .play each time. I'm not sure it will improve performance, so if someone has tried it, I'm all ears.

The main goal is sub-2-second audio output including the LLM call, the vector DB lookup, a 12B model call, and some logic. I know my LLM call setup takes about 1 second, which leaves 1 second for STT transcription and TTS synthesis, which is a challenge when using Coqui to keep quality (I tried other models but found nothing this good for French).

HW :

code snippet actually used :
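To illustrate the "keep the stream alive and push text as it arrives" idea, here is a minimal toy sketch of the pattern using only the standard library. It does not use RealtimeTTS itself: the `StreamingSynthesizer` class, its `push_text` method (named after the method mentioned above), and `llm_token_generator` are all hypothetical stand-ins, and the "synthesis" step is a placeholder.

```python
import queue
import threading

class StreamingSynthesizer:
    """Toy stand-in for a TTS stream that accepts text incrementally,
    so synthesis overlaps with LLM generation instead of waiting for
    the full reply before the first .play call."""

    def __init__(self):
        self._chunks = queue.Queue()
        self.synthesized = []  # stands in for audio sent to the speaker
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def push_text(self, text):
        # Called as soon as each LLM chunk arrives.
        self._chunks.put(text)

    def _run(self):
        while True:
            chunk = self._chunks.get()
            if chunk is None:  # sentinel: no more text
                break
            self.synthesized.append(chunk.upper())  # placeholder "synthesis"

    def close(self):
        self._chunks.put(None)
        self._worker.join()

def llm_token_generator():
    # Stands in for the streamed LLM response.
    yield from ["Bonjour, ", "comment ", "ça va ?"]

tts = StreamingSynthesizer()
for piece in llm_token_generator():
    tts.push_text(piece)  # feed each piece the moment it is generated
tts.close()
print(tts.synthesized)
```

The point of the pattern is that the synthesis worker starts consuming text while the generator is still producing it, which is where the latency win would come from if the library behaves the same way.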
Replies: 1 comment
Fixed it:
The issue was a function storing memory in the vector DB that was blocking the generator streaming the response from the LLM to the TTS.
Solution: ran the save function in a background thread.
Result:
Measured from VAD stop to first audio output is now 2 seconds.
This includes transcription, building context by querying the vector DB, the LLM call, and starting synthesis.
Not sure I can squeeze more performance out of the current hardware.
Now time to talk to the bot and teach it things.
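A minimal sketch of the fix described above, with stand-in functions (`save_to_vector_db` and `llm_response_stream` are hypothetical names, and the sleep simulates the blocking DB write): the save runs in a background thread so the first streamed chunk reaches the TTS without waiting for the write.

```python
import threading
import time

def save_to_vector_db(turn):
    # Stands in for the blocking vector-DB write that was stalling the pipeline.
    time.sleep(0.5)

def llm_response_stream():
    # Stands in for the generator streaming the LLM reply toward the TTS.
    yield from ["chunk1", "chunk2", "chunk3"]

# Before the fix, the save ran inline and its full latency was paid before
# the first chunk could be synthesized. Running it in a thread lets
# streaming start immediately.
start = time.perf_counter()
saver = threading.Thread(target=save_to_vector_db, args=("user turn",))
saver.start()

first_chunk_latency = None
for chunk in llm_response_stream():
    if first_chunk_latency is None:
        first_chunk_latency = time.perf_counter() - start
    # ...feed chunk to the TTS stream here...

saver.join()  # ensure the write still completes before moving on
print(f"first chunk after {first_chunk_latency * 1000:.0f} ms")
```

If the saved data is not needed by the very next turn, the join could even be deferred further, but joining after the stream drains keeps the write from being lost.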