This repository was archived by the owner on Jan 4, 2026. It is now read-only.

all speech output turned into "download dour voices", issue seemed to be mismatch between piper and ffmpeg #102

Description

@grimwiz

I solved it over an evening working with ChatGPT. At the end, I asked it for a summary of the fix:

This switches the TTS pipeline to a robust, headered stream:

- Piper now writes WAV to stdout (--output_file -) instead of raw PCM.
- ffmpeg now reads WAV (no guessing), then transcodes to the requested format.

It also fixes two subtle edge cases:

- allow speaker: 0 (if speaker is not None)
- write a trailing newline + flush to ensure Piper actually synthesizes the final utterance.
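The "headered stream" point can be illustrated on its own: a WAV stream begins with a self-describing RIFF header that ffmpeg can parse, while raw s16le PCM carries no metadata at all. A minimal sketch using Python's wave module (the 22050 Hz mono/16-bit parameters are just illustrative defaults):

```python
import io
import wave

# Build one second of silence as a WAV stream -- what Piper now emits
# to stdout with "--output_file -".
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(22050)  # a typical Piper sample rate
    w.writeframes(b"\x00\x00" * 22050)

wav_bytes = buf.getvalue()
raw_pcm = b"\x00\x00" * 22050  # the old "--output-raw" style stream

# The WAV stream announces its own format; the raw stream is just bytes.
print(wav_bytes[:4])   # b'RIFF'
print(raw_pcm[:4])     # b'\x00\x00\x00\x00'
```

Because the header encodes channel count, sample width, and sample rate, a downstream ffmpeg never has to be told (or guess) what the bytes mean.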

diff --git a/app/speech.py b/app/speech.py
index 4c5b1a1..b7a8e54 100644
--- a/app/speech.py
+++ b/app/speech.py
@@ -223,16 +223,21 @@ async def generate_speech(request: GenerateSpeechRequest):
         except KeyError as e:
             raise ServiceUnavailableError(f"Configuration error: tts-1 voice '{voice}' is missing 'model:' setting. KeyError: {e}")
 
-        speaker = voice_map.get('speaker', None)
-
-        tts_args = ["piper", "--model", str(piper_model), "--data-dir", "voices", "--download-dir", "voices", "--output-raw"]
-        if speaker:
-            tts_args.extend(["--speaker", str(speaker)])
+        speaker = voice_map.get('speaker', None)
+
+        # Emit WAV to stdout for a stable, headered stream that ffmpeg can consume reliably.
+        tts_args = ["piper", "--model", str(piper_model),
+                    "--data-dir", "voices", "--download-dir", "voices",
+                    "--output_file", "-"]
+        # Accept speaker=0 as valid; only skip when truly None
+        if speaker is not None:
+            tts_args.extend(["--speaker", str(speaker)])
         if speed != 1.0:
             tts_args.extend(["--length-scale", f"{1.0/speed}"])
 
         tts_proc = subprocess.Popen(tts_args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
-        tts_proc.stdin.write(bytearray(input_text.encode('utf-8')))
+        tts_proc.stdin.write(input_text.encode("utf-8") + b"\n")
+        tts_proc.stdin.flush()
         tts_proc.stdin.close()
 
         try:
@@ -242,10 +247,11 @@ async def generate_speech(request: GenerateSpeechRequest):
         except:
             sample_rate = '22050'
 
-        ffmpeg_args = build_ffmpeg_args(response_format, input_format="s16le", sample_rate=sample_rate)
+        # Read WAV from Piper (headered), then transcode to requested output format
+        ffmpeg_args = build_ffmpeg_args(response_format, input_format="WAV", sample_rate=sample_rate)
 
         # Pipe the output from piper/xtts to the input of ffmpeg
         ffmpeg_args.extend(["-"])
         ffmpeg_proc = subprocess.Popen(ffmpeg_args, stdin=tts_proc.stdout, stdout=subprocess.PIPE)
 
         return StreamingResponse(content=ffmpeg_proc.stdout, media_type=media_type)

You previously streamed raw s16le PCM from Piper and told ffmpeg to read -f s16le. On some environments (Docker-in-LXC, certain Piper/ffmpeg versions, CPU pinning constraints), that raw pipe path is brittle:

Pipe timing/buffering: Piper may not emit any frames until it sees a newline/end-of-input. If stdin closes without a newline, some builds produce zero frames. (We added the newline + flush.)

Raw format sensitivity: If anything about the raw PCM isn’t exactly what ffmpeg expects (frame size, timing, sample rate mismatch, partial frames due to buffering), ffmpeg can still produce a tiny, valid output that sounds like a short stock clip.

Version drift: Different Piper builds (or ffmpeg) changed edge behavior over time; what worked earlier stopped working after an image/dep update.

Switching to WAV to stdout gives ffmpeg a headered stream with an explicit format. That removes ambiguity and timing sensitivity, and it matched what worked for you immediately.
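Under those assumptions, the patched argument construction can be condensed into a small helper (build_piper_args is a hypothetical name for illustration; the real code in app/speech.py builds the list inline):

```python
def build_piper_args(piper_model, speaker=None, speed=1.0):
    """Sketch of the patched Piper invocation: WAV on stdout, speaker 0 allowed."""
    args = ["piper", "--model", str(piper_model),
            "--data-dir", "voices", "--download-dir", "voices",
            "--output_file", "-"]          # "-" = write headered WAV to stdout
    if speaker is not None:                # speaker 0 is a valid index
        args.extend(["--speaker", str(speaker)])
    if speed != 1.0:                       # Piper scales duration, so invert speed
        args.extend(["--length-scale", f"{1.0 / speed}"])
    return args

print(build_piper_args("en_US-example.onnx", speaker=0, speed=1.25))
```

Note the inversion in length-scale: a requested speed of 1.25x becomes a length scale of 0.8, because Piper's knob stretches duration rather than rate.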

Allowing speaker: 0 is also a correctness fix: the old truthiness check (if speaker:) silently dropped zero, which is a valid speaker index in multi-speaker models.
