all speech output turned into "download dour voices", issue seemed to be mismatch between piper and ffmpeg #102
Description
I solved it over an evening working with ChatGPT. At the end I asked it for a summary of the fix:
This switches the TTS pipeline to a robust, headered stream:
- Piper now writes WAV to stdout (`--output_file -`) instead of raw PCM.
- ffmpeg now reads WAV (no guessing), then transcodes to the requested format.

It also fixes two subtle edge cases:

- allow `speaker: 0` (`if speaker is not None`)
- write a trailing newline + flush to ensure Piper actually synthesizes the final utterance.
```diff
diff --git a/app/speech.py b/app/speech.py
index 4c5b1a1..b7a8e54 100644
--- a/app/speech.py
+++ b/app/speech.py
@@ -223,16 +223,19 @@ async def generate_speech(request: GenerateSpeechRequest):
         except KeyError as e:
             raise ServiceUnavailableError(f"Configuration error: tts-1 voice '{voice}' is missing 'model:' setting. KeyError: {e}")
 
-        speaker = voice_map.get('speaker', None)
-
-        tts_args = ["piper", "--model", str(piper_model), "--data-dir", "voices", "--download-dir", "voices", "--output-raw"]
-        if speaker:
-            tts_args.extend(["--speaker", str(speaker)])
+        speaker = voice_map.get('speaker', None)
+
+        # Emit WAV to stdout for a stable, headered stream that ffmpeg can consume reliably.
+        tts_args = ["piper", "--model", str(piper_model),
+                    "--data-dir", "voices", "--download-dir", "voices",
+                    "--output_file", "-"]
+        # Accept speaker=0 as valid; only skip when truly None.
+        if speaker is not None:
+            tts_args.extend(["--speaker", str(speaker)])
         if speed != 1.0:
             tts_args.extend(["--length-scale", f"{1.0/speed}"])
         tts_proc = subprocess.Popen(tts_args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
-        tts_proc.stdin.write(bytearray(input_text.encode('utf-8')))
+        tts_proc.stdin.write(input_text.encode("utf-8") + b"\n")
+        tts_proc.stdin.flush()
         tts_proc.stdin.close()
 
         try:
@@ -242,10 +245,10 @@ async def generate_speech(request: GenerateSpeechRequest):
         except:
             sample_rate = '22050'
-        ffmpeg_args = build_ffmpeg_args(response_format, input_format="s16le", sample_rate=sample_rate)
+        # Read WAV from Piper (headered), then transcode to the requested output format.
+        ffmpeg_args = build_ffmpeg_args(response_format, input_format="WAV", sample_rate=sample_rate)
         # Pipe the output from piper/xtts to the input of ffmpeg
         ffmpeg_args.extend(["-"])
         ffmpeg_proc = subprocess.Popen(ffmpeg_args, stdin=tts_proc.stdout, stdout=subprocess.PIPE)
         return StreamingResponse(content=ffmpeg_proc.stdout, media_type=media_type)
```
You previously streamed raw s16le PCM from Piper and told ffmpeg to read `-f s16le`. In some environments (Docker-in-LXC, certain Piper/ffmpeg versions, CPU pinning constraints), that raw pipe path is brittle:

- Pipe timing/buffering: Piper may not emit any frames until it sees a newline/end-of-input. If stdin closes without a newline, some builds produce zero frames. (We added the newline + flush.)
- Raw format sensitivity: if anything about the raw PCM isn't exactly what ffmpeg expects (frame size, timing, sample-rate mismatch, partial frames due to buffering), ffmpeg can still produce a tiny, valid output that sounds like a short stock clip.
- Version drift: different Piper (or ffmpeg) builds changed edge behavior over time; what worked earlier stopped working after an image/dependency update.
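The newline + flush point from the first bullet can be exercised in isolation. This is a minimal sketch of the write-terminate-close pattern the diff uses, with `cat` standing in for the Piper binary (the helper name is hypothetical, not part of the project):

```python
import subprocess

def feed_stdin(text: str, cmd: list[str]) -> bytes:
    """Write text to a subprocess, appending a trailing newline and flushing
    so line-buffered readers (like some Piper builds) see the full utterance,
    then close stdin to signal end-of-input."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    proc.stdin.write(text.encode("utf-8") + b"\n")  # newline terminates the line
    proc.stdin.flush()                              # push buffered bytes out now
    proc.stdin.close()                              # EOF: no more input coming
    out = proc.stdout.read()
    proc.wait()
    return out

# `cat` simply echoes stdin, so the bytes round-trip with the added newline.
echoed = feed_stdin("hello world", ["cat"])
```

Without the newline and flush, a line-buffered consumer can sit on a partial line until EOF, or in the worst case emit nothing at all.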
Switching to WAV to stdout gives ffmpeg a headered stream with an explicit format. That removes ambiguity and timing sensitivity, and it matched what worked for you immediately.
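The difference in ambiguity is visible in the command lines themselves. A hypothetical helper (not the project's `build_ffmpeg_args`) contrasting the two input modes:

```python
def ffmpeg_input_args(input_format: str, sample_rate: str) -> list[str]:
    """Sketch of the input half of an ffmpeg invocation.

    Raw s16le PCM carries no metadata, so ffmpeg must be told the sample
    format, rate, and channel count explicitly; get any of them wrong and it
    will still happily decode garbage. A WAV stream carries all of that in
    its RIFF header, so the demuxer needs no external hints."""
    if input_format == "s16le":
        # Raw PCM: every parameter must be spelled out on the command line.
        return ["ffmpeg", "-f", "s16le", "-ar", sample_rate, "-ac", "1", "-i", "-"]
    # Headered WAV: format parameters come from the stream itself.
    return ["ffmpeg", "-f", "wav", "-i", "-"]

raw_args = ffmpeg_input_args("s16le", "22050")
wav_args = ffmpeg_input_args("wav", "22050")
```

The WAV path has strictly fewer external assumptions, which is why it tolerated the environment quirks that broke the raw pipe.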
Allowing `speaker: 0` is also a correctness fix: the old truthiness check (`if speaker:`) silently dropped zero, which is a valid speaker index in multi-speaker models.
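The truthiness bug comes down to two lines. A side-by-side sketch (hypothetical helper names, mirroring the diff's logic):

```python
def speaker_flags(speaker) -> list[str]:
    """Fixed: only skip --speaker when the voice config truly omits it."""
    if speaker is not None:          # keeps speaker 0
        return ["--speaker", str(speaker)]
    return []

def speaker_flags_buggy(speaker) -> list[str]:
    """Old behavior: 0 is falsy in Python, so speaker 0 was dropped."""
    if speaker:
        return ["--speaker", str(speaker)]
    return []
```

With a multi-speaker model configured for speaker 0, the buggy version omits the flag entirely and Piper falls back to its default voice.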