Lowering TTS latency #294
Hi there,

Can the method used in the RVC example (https://github.com/KoljaB/RealtimeTTS/tree/master/example_rvc) be faster than a direct stream feed and play? At the moment I get about 2 seconds between STT VAD stop and sound coming out of the speaker (which is great). I did try play_async, but it was slower, and I don't mind blocking while synthesis is running (I'll add a signal to stop it later if needed).

The idea, if my understanding is correct, is to keep the stream active and feed it text via push_text(text: str), so it can be synthesized as soon as it is generated instead of calling .play each time. I'm not sure it will improve performance, so if someone has tried it, I'm all ears.

The main goal is sub-2-second audio output including the LLM call, the vector DB lookup, a 12B model call, and some logic. I know my LLM call setup takes about 1 second, which leaves 1 second for STT transcription and TTS synthesis, which is a challenge when using Coqui to keep quality (I tried other models but found nothing this good for French).

HW :

code snippet actually used :
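To illustrate the "keep the stream alive and push text as it arrives" idea, here is a minimal toy sketch of the pattern using only the standard library. It does not use RealtimeTTS itself: the `StreamingSynthesizer` class, its `push_text` method (named after the method mentioned above), and `llm_token_generator` are all hypothetical stand-ins, and the "synthesis" step is a placeholder.

```python
import queue
import threading

class StreamingSynthesizer:
    """Toy stand-in for a TTS stream that accepts text incrementally,
    so synthesis overlaps with LLM generation instead of waiting for
    the full reply before the first .play call."""

    def __init__(self):
        self._chunks = queue.Queue()
        self.synthesized = []  # stands in for audio sent to the speaker
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def push_text(self, text):
        # Called as soon as each LLM chunk arrives.
        self._chunks.put(text)

    def _run(self):
        while True:
            chunk = self._chunks.get()
            if chunk is None:  # sentinel: no more text
                break
            self.synthesized.append(chunk.upper())  # placeholder "synthesis"

    def close(self):
        self._chunks.put(None)
        self._worker.join()

def llm_token_generator():
    # Stands in for the streamed LLM response.
    yield from ["Bonjour, ", "comment ", "ça va ?"]

tts = StreamingSynthesizer()
for piece in llm_token_generator():
    tts.push_text(piece)  # feed each piece the moment it is generated
tts.close()
print(tts.synthesized)
```

The point of the pattern is that the synthesis worker starts consuming text while the generator is still producing it, which is where the latency win would come from if the library behaves the same way.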
Replies: 1 comment
Fixed it:
The issue was a function storing memory in the vector DB that was blocking the generator streaming the response from the LLM to the TTS.
Solution: ran the save function in a background thread.
Result:
Measured from VAD stop to first audio output is now 2 seconds.
This includes transcription, building context by querying the vector DB, the LLM call, and starting synthesis.
Not sure I can squeeze more performance out of the current hardware.
Now time to talk to the bot and teach it things.
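A minimal sketch of the fix described above, with stand-in functions (`save_to_vector_db` and `llm_response_stream` are hypothetical names, and the sleep simulates the blocking DB write): the save runs in a background thread so the first streamed chunk reaches the TTS without waiting for the write.

```python
import threading
import time

def save_to_vector_db(turn):
    # Stands in for the blocking vector-DB write that was stalling the pipeline.
    time.sleep(0.5)

def llm_response_stream():
    # Stands in for the generator streaming the LLM reply toward the TTS.
    yield from ["chunk1", "chunk2", "chunk3"]

# Before the fix, the save ran inline and its full latency was paid before
# the first chunk could be synthesized. Running it in a thread lets
# streaming start immediately.
start = time.perf_counter()
saver = threading.Thread(target=save_to_vector_db, args=("user turn",))
saver.start()

first_chunk_latency = None
for chunk in llm_response_stream():
    if first_chunk_latency is None:
        first_chunk_latency = time.perf_counter() - start
    # ...feed chunk to the TTS stream here...

saver.join()  # ensure the write still completes before moving on
print(f"first chunk after {first_chunk_latency * 1000:.0f} ms")
```

If the saved data is not needed by the very next turn, the join could even be deferred further, but joining after the stream drains keeps the write from being lost.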