all speech output turned into "download dour voices", issue seemed to be mismatch between piper and ffmpeg #102
Description
I solved it over an evening working with ChatGPT. At the end I asked it for a summary of the fix:
This switches the TTS pipeline to a robust, headered stream:
- Piper now writes WAV to stdout (`--output_file -`) instead of raw PCM.
- ffmpeg now reads WAV (no guessing), then transcodes to the requested format.

It also fixes two subtle edge cases:

- allow `speaker: 0` (`if speaker is not None`)
- write a trailing newline + flush to ensure Piper actually synthesizes the final utterance.
```diff
diff --git a/app/speech.py b/app/speech.py
index 4c5b1a1..b7a8e54 100644
--- a/app/speech.py
+++ b/app/speech.py
@@ -223,16 +223,19 @@ async def generate_speech(request: GenerateSpeechRequest):
         except KeyError as e:
             raise ServiceUnavailableError(f"Configuration error: tts-1 voice '{voice}' is missing 'model:' setting. KeyError: {e}")
 
-        speaker = voice_map.get('speaker', None)
-
-        tts_args = ["piper", "--model", str(piper_model), "--data-dir", "voices", "--download-dir", "voices", "--output-raw"]
-        if speaker:
-            tts_args.extend(["--speaker", str(speaker)])
+        speaker = voice_map.get('speaker', None)
+
+        # Emit WAV to stdout for a stable, headered stream that ffmpeg can consume reliably.
+        tts_args = ["piper", "--model", str(piper_model),
+                    "--data-dir", "voices", "--download-dir", "voices",
+                    "--output_file", "-"]
+        # Accept speaker=0 as valid; only skip when truly None.
+        if speaker is not None:
+            tts_args.extend(["--speaker", str(speaker)])
         if speed != 1.0:
             tts_args.extend(["--length-scale", f"{1.0/speed}"])
         tts_proc = subprocess.Popen(tts_args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
-        tts_proc.stdin.write(bytearray(input_text.encode('utf-8')))
+        tts_proc.stdin.write(input_text.encode("utf-8") + b"\n")
+        tts_proc.stdin.flush()
         tts_proc.stdin.close()
 
         try:
@@ -242,10 +245,10 @@ async def generate_speech(request: GenerateSpeechRequest):
         except:
             sample_rate = '22050'
-        ffmpeg_args = build_ffmpeg_args(response_format, input_format="s16le", sample_rate=sample_rate)
+        # Read WAV from Piper (headered), then transcode to the requested output format.
+        ffmpeg_args = build_ffmpeg_args(response_format, input_format="WAV", sample_rate=sample_rate)
         # Pipe the output from piper/xtts to the input of ffmpeg
         ffmpeg_args.extend(["-"])
         ffmpeg_proc = subprocess.Popen(ffmpeg_args, stdin=tts_proc.stdout, stdout=subprocess.PIPE)
         return StreamingResponse(content=ffmpeg_proc.stdout, media_type=media_type)
```
You previously streamed raw s16le PCM from Piper and told ffmpeg to read `-f s16le`. In some environments (Docker-in-LXC, certain Piper/ffmpeg versions, CPU pinning constraints), that raw pipe path is brittle:

- Pipe timing/buffering: Piper may not emit any frames until it sees a newline/end-of-input. If stdin closes without a newline, some builds produce zero frames. (We added the newline + flush.)
- Raw format sensitivity: if anything about the raw PCM isn't exactly what ffmpeg expects (frame size, timing, sample-rate mismatch, partial frames due to buffering), ffmpeg can still produce a tiny, valid output that sounds like a short stock clip.
- Version drift: different Piper (or ffmpeg) builds changed edge behavior over time; what worked earlier stopped working after an image/dependency update.
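The newline + flush point from the first bullet can be exercised in isolation. This is a minimal sketch of the write-terminate-close pattern the diff uses, with `cat` standing in for the Piper binary (the helper name is hypothetical, not part of the project):

```python
import subprocess

def feed_stdin(text: str, cmd: list[str]) -> bytes:
    """Write text to a subprocess, appending a trailing newline and flushing
    so line-buffered readers (like some Piper builds) see the full utterance,
    then close stdin to signal end-of-input."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    proc.stdin.write(text.encode("utf-8") + b"\n")  # newline terminates the line
    proc.stdin.flush()                              # push buffered bytes out now
    proc.stdin.close()                              # EOF: no more input coming
    out = proc.stdout.read()
    proc.wait()
    return out

# `cat` simply echoes stdin, so the bytes round-trip with the added newline.
echoed = feed_stdin("hello world", ["cat"])
```

Without the newline and flush, a line-buffered consumer can sit on a partial line until EOF, or in the worst case emit nothing at all.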
Switching to WAV to stdout gives ffmpeg a headered stream with an explicit format. That removes ambiguity and timing sensitivity, and it matched what worked for you immediately.
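The difference in ambiguity is visible in the command lines themselves. A hypothetical helper (not the project's `build_ffmpeg_args`) contrasting the two input modes:

```python
def ffmpeg_input_args(input_format: str, sample_rate: str) -> list[str]:
    """Sketch of the input half of an ffmpeg invocation.

    Raw s16le PCM carries no metadata, so ffmpeg must be told the sample
    format, rate, and channel count explicitly; get any of them wrong and it
    will still happily decode garbage. A WAV stream carries all of that in
    its RIFF header, so the demuxer needs no external hints."""
    if input_format == "s16le":
        # Raw PCM: every parameter must be spelled out on the command line.
        return ["ffmpeg", "-f", "s16le", "-ar", sample_rate, "-ac", "1", "-i", "-"]
    # Headered WAV: format parameters come from the stream itself.
    return ["ffmpeg", "-f", "wav", "-i", "-"]

raw_args = ffmpeg_input_args("s16le", "22050")
wav_args = ffmpeg_input_args("wav", "22050")
```

The WAV path has strictly fewer external assumptions, which is why it tolerated the environment quirks that broke the raw pipe.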
Allowing `speaker: 0` is also a correctness fix: the old truthiness check (`if speaker:`) silently dropped zero, which is a valid speaker index in multi-speaker models.
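The truthiness bug comes down to two lines. A side-by-side sketch (hypothetical helper names, mirroring the diff's logic):

```python
def speaker_flags(speaker) -> list[str]:
    """Fixed: only skip --speaker when the voice config truly omits it."""
    if speaker is not None:          # keeps speaker 0
        return ["--speaker", str(speaker)]
    return []

def speaker_flags_buggy(speaker) -> list[str]:
    """Old behavior: 0 is falsy in Python, so speaker 0 was dropped."""
    if speaker:
        return ["--speaker", str(speaker)]
    return []
```

With a multi-speaker model configured for speaker 0, the buggy version omits the flag entirely and Piper falls back to its default voice.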