feat(tts): implement CosyVoice3 engine with HTTP client and dynamic p…#17
Pull request overview
Adds a new CosyVoice3 TTS engine implementation that talks to an external FastAPI server via an HTTP client, including configuration models and unit tests, plus dependency/config updates to wire it into the project.
Changes:
- Introduce `CosyVoice3Engine` (HTTP client, batch + streaming) and `CosyVoice3Config` / `VoiceConfig`.
- Add unit tests covering config defaults, lifecycle, voice resolution, batch synthesis, and streaming behavior with mocked HTTP.
- Update `pyproject.toml` optional dependency groups and `engines.yaml` engine definitions.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| app/engines/tts/cosyvoice3/engine.py | Implements CosyVoice3 HTTP client TTS engine (batch + streaming, voice resolution, metrics). |
| app/engines/tts/cosyvoice3/config.py | Adds Pydantic config models for CosyVoice3 and per-voice prompt settings. |
| app/engines/tts/cosyvoice3/__init__.py | Exposes CosyVoice3 config/engine symbols for import. |
| tests/unit/engines/tts/cosyvoice3/test_cosyvoice3_engine.py | Adds unit tests for the new engine using mocked httpx streaming responses. |
| tests/unit/engines/tts/cosyvoice3/__init__.py | Marks the new test package. |
| pyproject.toml | Adds cosyvoice3 optional dependency group (httpx) and also adds a qwen3-asr group. |
| engines.yaml | Adds CosyVoice3 TTS config block and also enables a qwen3-asr STT block. |
| # Qwen3-ASR: State-of-the-art multilingual ASR (52 languages) | ||
| # Install: uv sync --group qwen3-asr | ||
| qwen3-asr: | ||
| enabled: true |
qwen3-asr is enabled and points to app.engines.stt.qwen3asr.engine.Qwen3ASREngine, but there is no such module/class in the repository. With the default config, engine loading will fail at startup. Either add the missing engine implementation, or set enabled: false / remove this block until the engine exists.
| enabled: true | |
| enabled: false # Disabled until Qwen3ASREngine is implemented |
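A config loader could also catch this class of problem before startup fails. The following is a minimal sketch, assuming engines are referenced by a dotted `package.module.ClassName` string as in `engines.yaml`; the helper name `engine_class_available` is hypothetical and not part of this repository.

```python
import importlib


def engine_class_available(dotted_path: str) -> bool:
    """Return True if 'package.module.ClassName' can be imported and resolved.

    Hypothetical helper: a loader could call this for every enabled engine
    and report (or skip) entries whose module or class does not exist.
    """
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        # No dot in the string, so there is no module part to import.
        return False
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        # Covers ModuleNotFoundError for a missing package or module.
        return False
    return hasattr(module, class_name)
```

With the repository as reviewed, a check like `engine_class_available("app.engines.stt.qwen3asr.engine.Qwen3ASREngine")` would be expected to return `False`, which matches the failure described above.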
| default_voice: "fin_default" | ||
| voices: | ||
| default: | ||
| prompt_wav_path: "/path/to/voice_sample.wav" | ||
| prompt_text: "Transcript of the voice sample audio" |
default_voice is set to fin_default, but the configured voices map only defines default. As written, requests without an explicit voice will raise Voice 'fin_default' not found. Align default_voice with an existing key (or rename the voice key).
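A mismatch like this can be rejected when the config is loaded rather than at request time. The sketch below uses plain dataclasses so it stays self-contained; `VoiceSketch` and `ConfigSketch` are hypothetical stand-ins for the PR's Pydantic models, and allowing an empty `voices` map is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceSketch:
    # Hypothetical stand-in for the per-voice prompt settings.
    prompt_wav_path: str
    prompt_text: str


@dataclass
class ConfigSketch:
    # Hypothetical stand-in for the engine config.
    default_voice: str
    voices: dict[str, VoiceSketch] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumption: an empty voices map is allowed (e.g. dynamic prompts only),
        # so the check only runs when at least one voice is configured.
        if self.voices and self.default_voice not in self.voices:
            known = ", ".join(sorted(self.voices))
            raise ValueError(
                f"default_voice {self.default_voice!r} is not defined in voices "
                f"(known: {known})"
            )
```

In the real models the same check would more naturally live in a Pydantic validator; the dataclass form is used here only to avoid a third-party import.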
app/engines/tts/cosyvoice3/engine.py
Outdated
| speed: Speech speed multiplier | ||
| reference_audio: Reference audio bytes for voice cloning (overrides voice) | ||
| reference_text: Transcript of reference audio | ||
| **kwargs: Additional parameters (ignored) |
The synthesize docstring says **kwargs are ignored, but prompt_wav_path / prompt_text are actually consumed via _resolve_voice(...). Update the docstring to reflect the supported engine-specific params so API users know how to use them.
| **kwargs: Additional parameters (ignored) | |
| **kwargs: Engine-specific parameters, including: | |
| - ``prompt_wav_path``: Optional path to a prompt WAV file used for zero-shot | |
| voice cloning (overrides the configured voice prompt path). | |
| - ``prompt_text``: Optional text transcript corresponding to the prompt audio. | |
| Additional unexpected keyword arguments are ignored. |
app/engines/tts/cosyvoice3/engine.py
Outdated
| # Collect all PCM chunks | ||
| pcm_data = b"" | ||
| async for chunk in self._call_inference( | ||
| tts_text=text, | ||
| prompt_wav_path=wav_path, | ||
| prompt_text=prompt_text, | ||
| speed=effective_speed, | ||
| ): | ||
| pcm_data += chunk |
Building pcm_data via repeated pcm_data += chunk in a loop is quadratic-time due to repeated reallocations/copies, and can become expensive for longer syntheses. Accumulate chunks in a list[bytes] (or a bytearray) and b"".join(...) once at the end.
| # Collect all PCM chunks | |
| pcm_data = b"" | |
| async for chunk in self._call_inference( | |
| tts_text=text, | |
| prompt_wav_path=wav_path, | |
| prompt_text=prompt_text, | |
| speed=effective_speed, | |
| ): | |
| pcm_data += chunk | |
| # Collect all PCM chunks efficiently | |
| pcm_chunks = [] | |
| async for chunk in self._call_inference( | |
| tts_text=text, | |
| prompt_wav_path=wav_path, | |
| prompt_text=prompt_text, | |
| speed=effective_speed, | |
| ): | |
| pcm_chunks.append(chunk) | |
| pcm_data = b"".join(pcm_chunks) |
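The collect-then-join pattern can be tried in isolation. This is a minimal sketch: `fake_inference` is a hypothetical stand-in for `self._call_inference(...)` and simply yields fixed-size byte chunks.

```python
import asyncio
from collections.abc import AsyncIterator


async def fake_inference(n_chunks: int, chunk_size: int) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for the engine's streaming inference call."""
    for i in range(n_chunks):
        yield bytes([i % 256]) * chunk_size


async def collect_pcm(chunks: AsyncIterator[bytes]) -> bytes:
    """Append each chunk to a list and copy the data once at the end."""
    parts: list[bytes] = []
    async for chunk in chunks:
        parts.append(chunk)
    return b"".join(parts)


pcm = asyncio.run(collect_pcm(fake_inference(4, 3)))
```

Each `append` stores a reference to the chunk, so the audio bytes are copied only once by `join`, whereas `+=` on an immutable `bytes` object may copy everything accumulated so far on every iteration.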
| response = await self._client.get("/docs") | ||
| logger.info( | ||
| "CosyVoice3 server connected at %s (status: %d)", | ||
| self.cv3_config.service_url, | ||
| response.status_code, | ||
| ) | ||
| except Exception as e: | ||
| logger.warning( | ||
| "CosyVoice3 server at %s may not be reachable: %s", | ||
| self.cv3_config.service_url, | ||
| e, | ||
| ) |
Connection verification in _initialize() logs a warning on failure but still returns successfully, so the engine becomes is_ready() even when the server is unreachable. Either raise EngineNotReadyError when the health check fails (and/or when the status code is non-2xx), or adjust the docstring/comments so initialization semantics are accurate.
| response = await self._client.get("/docs") | |
| logger.info( | |
| "CosyVoice3 server connected at %s (status: %d)", | |
| self.cv3_config.service_url, | |
| response.status_code, | |
| ) | |
| except Exception as e: | |
| logger.warning( | |
| "CosyVoice3 server at %s may not be reachable: %s", | |
| self.cv3_config.service_url, | |
| e, | |
| ) | |
| response = await self._client.get("/docs") | |
| if response.is_error: | |
| # Non-success status means the engine is not ready. | |
| await self._client.aclose() | |
| self._client = None | |
| raise EngineNotReadyError( | |
| f"CosyVoice3 server at {self.cv3_config.service_url} " | |
| f"health check failed with status {response.status_code}" | |
| ) | |
| logger.info( | |
| "CosyVoice3 server connected at %s (status: %d)", | |
| self.cv3_config.service_url, | |
| response.status_code, | |
| ) | |
| except Exception as e: | |
| # Connection failure during health check: engine is not ready. | |
| if self._client is not None: | |
| await self._client.aclose() | |
| self._client = None | |
| logger.warning( | |
| "CosyVoice3 server at %s may not be reachable: %s", | |
| self.cv3_config.service_url, | |
| e, | |
| ) | |
| raise EngineNotReadyError( | |
| f"CosyVoice3 server at {self.cv3_config.service_url} is not reachable" | |
| ) from e |
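One detail worth checking in the suggestion above: the `EngineNotReadyError` raised for a non-success status sits inside the same `try`, so, assuming it derives from `Exception`, the broad `except Exception` would catch it and re-raise it as "not reachable". The sketch below separates the two failure modes. It is illustrative only: `FakeClient` and `FakeResponse` are hypothetical stand-ins that mimic the `get`, `aclose`, `status_code`, and `is_error` members used above, and `EngineNotReadyError` is redefined locally.

```python
import asyncio
from typing import Optional


class EngineNotReadyError(RuntimeError):
    """Local stand-in for the project's exception of the same name."""


class FakeResponse:
    def __init__(self, status_code: int) -> None:
        self.status_code = status_code

    @property
    def is_error(self) -> bool:
        # Same meaning as in the suggestion: True for 4xx and 5xx statuses.
        return self.status_code >= 400


class FakeClient:
    """Hypothetical async client; status_code=None simulates a refused connection."""

    def __init__(self, status_code: Optional[int]) -> None:
        self._status_code = status_code
        self.closed = False

    async def get(self, path: str) -> FakeResponse:
        if self._status_code is None:
            raise ConnectionError("connection refused")
        return FakeResponse(self._status_code)

    async def aclose(self) -> None:
        self.closed = True


async def check_server(client: FakeClient, url: str) -> int:
    # Only the network call is inside the try, so transport failures and
    # bad status codes produce distinct error messages.
    try:
        response = await client.get("/docs")
    except Exception as exc:
        await client.aclose()
        raise EngineNotReadyError(f"CosyVoice3 server at {url} is not reachable") from exc
    if response.is_error:
        await client.aclose()
        raise EngineNotReadyError(
            f"CosyVoice3 server at {url} health check failed "
            f"with status {response.status_code}"
        )
    return response.status_code
```

Whether initialization should fail hard or stay best-effort is a design decision for the engine; the sketch only shows the fail-hard variant.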
app/engines/tts/cosyvoice3/engine.py
Outdated
| voice, reference_audio, reference_text, **kwargs | ||
| ) | ||
| prompt_text = self._prepare_prompt_text(prompt_text) | ||
| effective_speed = speed if speed != 1.0 else self.cv3_config.speed |
Same speed-defaulting issue as in synthesize(): treating speed=1.0 as "use config" prevents callers from explicitly requesting normal speed when the config default is not 1.0. This should follow the base interface semantics where the request parameter overrides config.
| effective_speed = speed if speed != 1.0 else self.cv3_config.speed | |
| effective_speed = self.cv3_config.speed if speed is None else speed |
| all_pcm_data += pcm_chunk | ||
all_pcm_data += pcm_chunk inside the streaming loop has the same quadratic growth problem as batch mode, and will scale poorly for long streams. Consider accumulating into a bytearray or list[bytes] and joining once for the final response.
| return dyn_wav_path, dyn_prompt_text, None | ||
prompt_wav_path is accepted via **kwargs (from user-provided engine_params) and then opened as a local filesystem path. This allows an API caller to force the server to read arbitrary local files and exfiltrate them to the external CosyVoice3 service. Restrict this to a safe allowlist directory / known voice IDs, or require clients to provide reference_audio bytes instead of a server-side path.
| return dyn_wav_path, dyn_prompt_text, None | |
| # Validate that the dynamic path stays within an allowed directory derived from | |
| # configured voices, to avoid arbitrary local file access. | |
| allowed_base_dirs: list[Path] = [] | |
| for voice_cfg in self.cv3_config.voices.values(): | |
| if getattr(voice_cfg, "prompt_wav_path", None): | |
| try: | |
| base_dir = Path(voice_cfg.prompt_wav_path).resolve().parent | |
| if base_dir not in allowed_base_dirs: | |
| allowed_base_dirs.append(base_dir) | |
| except Exception: | |
| # Ignore malformed paths in config; they simply won't be used as bases. | |
| continue | |
| dyn_path = Path(dyn_wav_path) | |
| try: | |
| resolved_dyn_path = dyn_path.resolve() | |
| except Exception as exc: | |
| raise SynthesisError( | |
| "Invalid prompt_wav_path provided. Use a configured voice or " | |
| "provide reference_audio bytes instead." | |
| ) from exc | |
| is_allowed = False | |
| for base_dir in allowed_base_dirs: | |
| try: | |
| resolved_dyn_path.relative_to(base_dir) | |
| is_allowed = True | |
| break | |
| except ValueError: | |
| continue | |
| if not is_allowed: | |
| raise SynthesisError( | |
| "The provided prompt_wav_path is not in an allowed directory. " | |
| "Use a configured voice or provide reference_audio bytes instead." | |
| ) | |
| return str(resolved_dyn_path), dyn_prompt_text, None |
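The containment check in the suggestion can be reduced to a small standalone function. This is a sketch under assumptions: `resolve_allowed_prompt_path` is a hypothetical name, the allowlist is passed in explicitly rather than derived from configured voices, and a plain `ValueError` stands in for `SynthesisError`.

```python
from pathlib import Path


def resolve_allowed_prompt_path(candidate: str, allowed_dirs: list[Path]) -> Path:
    """Return the resolved path if it lies under an allowed directory, else raise.

    Both sides are resolved first, so '..' segments and symlinks are
    normalised before the containment test.
    """
    resolved = Path(candidate).resolve()
    for base in allowed_dirs:
        try:
            # relative_to raises ValueError when 'resolved' is not under 'base'.
            resolved.relative_to(base.resolve())
        except ValueError:
            continue
        return resolved
    raise ValueError("prompt_wav_path is outside the allowed voice directories")
```

For example, with `/srv/voices` as the only allowed directory (an illustrative path), `/srv/voices/a.wav` would be accepted and `/srv/voices/../secret.txt` rejected. Requiring `reference_audio` bytes instead of a server-side path, as the comment also proposes, avoids the filesystem question entirely.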
app/engines/tts/cosyvoice3/engine.py
Outdated
| voice, reference_audio, reference_text, **kwargs | ||
| ) | ||
| prompt_text = self._prepare_prompt_text(prompt_text) | ||
| effective_speed = speed if speed != 1.0 else self.cv3_config.speed |
effective_speed = speed if speed != 1.0 else self.cv3_config.speed makes it impossible for callers to override a non-1.0 configured default back to 1.0, even though BaseTTSEngine documents that the speed parameter overrides the config. Consider using an optional speed: float | None (None => config default), or always honor the explicit request value (including 1.0).
| effective_speed = speed if speed != 1.0 else self.cv3_config.speed | |
| effective_speed = speed |
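The difference between the sentinel-value approach and an optional parameter is easy to see side by side. A minimal sketch, with hypothetical function names and a config default of 1.2 chosen purely for illustration:

```python
from typing import Optional


def legacy_speed(requested: float, config_default: float) -> float:
    # Behaviour under review: 1.0 doubles as "not specified".
    return requested if requested != 1.0 else config_default


def resolve_speed(requested: Optional[float], config_default: float) -> float:
    # None means "not specified"; any explicit value, including 1.0, wins.
    return config_default if requested is None else requested


explicit_normal_legacy = legacy_speed(1.0, 1.2)
explicit_normal_fixed = resolve_speed(1.0, 1.2)
unspecified_fixed = resolve_speed(None, 1.2)
```

With the legacy logic an explicit request for 1.0 silently becomes the configured 1.2, while the `Optional[float]` form keeps both "use the default" and "use normal speed" expressible.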
…gnment
- Fix engines.yaml: add fin_default voice key to match default_voice
- Fix quadratic pcm_data concatenation → list[bytes] + b''.join()
- Fix speed defaulting: use float|None so callers can override to 1.0
- Add prompt_wav_path security validation via allowed_voice_dirs
- Update _initialize docstring to clarify best-effort health check
Description
Implemented the CosyVoice3 TTS engine integration using an HTTP client approach to communicate with an external CosyVoice3 FastAPI server. This implementation supports:
- … `engine_params` (`prompt_wav_path` and `prompt_text`).
- … `fin_demo` samples).

Type of Change
Checklist
- … (`make format`)
- … (`make lint`)
- … (`make test`)

Related Issues
Closes #
Testing & Verification
Automated Tests
Manual Verification (if applicable)
Verified functionality by running the CosyVoice3 server locally and making `curl` requests to the VoiceCore API:
- … `prompt_wav_path` provided in `engine_params`.

API Endpoints Tested (if applicable)
- … (`POST /api/v1/stt/transcribe` or `/tts/synthesize`)
- … (`POST .../stream`)
- … (`WS .../ws`)

Engine-Specific Tests (if applicable)
Security Impact