feat(tts): implement CosyVoice3 engine with HTTP client and dynamic p…#17

Open
phonk2682 wants to merge 3 commits into main from
feature/add_cosyvoice3_tts

Conversation

@phonk2682
Collaborator

Description

Implemented the CosyVoice3 TTS engine integration using an HTTP client approach to communicate with an external CosyVoice3 FastAPI server. This implementation supports:

  • Zero-shot voice cloning using reference audio.
  • Dynamic prompt configuration via API engine_params (prompt_wav_path and prompt_text).
  • Streaming synthesis support.
  • Configurable default voices (mapped to fin_demo samples).

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • New Engine (STT/TTS provider)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking code cleanup or optimization)
  • Documentation update
  • Performance improvement

Checklist

  • I have read the CONTRIBUTING guide
  • My code follows the project's code style (make format)
  • Linting passes (make lint)
  • Tests pass (make test)
  • Documentation updated (if needed)
  • No sensitive information (API keys, secrets) included

Related Issues

Closes #

Testing & Verification

Automated Tests

  • Unit tests added/updated
  • All existing tests pass

Manual Verification (if applicable)

Verified functionality by running the CosyVoice3 server locally and making curl requests to the VoiceCore API:

  1. Checked default voice synthesis.
  2. Verified zero-shot cloning with dynamic prompt_wav_path provided in engine_params.
  3. Validated streaming response.

API Endpoints Tested (if applicable)

  • Batch endpoint (POST /api/v1/stt/transcribe or /tts/synthesize)
  • SSE streaming (POST .../stream)
  • WebSocket (WS .../ws)

Engine-Specific Tests (if applicable)

  • Engine type: [TTS]
  • Provider: [CosyVoice3 (Self-hosted)]
  • Model: [Fun-CosyVoice3-0.5B]
  • Device tested: [cuda]

Security Impact

  • No security implications
  • Security impact (please describe below)

Copilot AI review requested due to automatic review settings February 13, 2026 10:07

Copilot AI left a comment

Pull request overview

Adds a new CosyVoice3 TTS engine implementation that talks to an external FastAPI server via an HTTP client, including configuration models and unit tests, plus dependency/config updates to wire it into the project.

Changes:

  • Introduce CosyVoice3Engine (HTTP client, batch + streaming) and CosyVoice3Config / VoiceConfig.
  • Add unit tests covering config defaults, lifecycle, voice resolution, batch synthesis, and streaming behavior with mocked HTTP.
  • Update pyproject.toml optional dependency groups and engines.yaml engine definitions.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| app/engines/tts/cosyvoice3/engine.py | Implements CosyVoice3 HTTP client TTS engine (batch + streaming, voice resolution, metrics). |
| app/engines/tts/cosyvoice3/config.py | Adds Pydantic config models for CosyVoice3 and per-voice prompt settings. |
| app/engines/tts/cosyvoice3/__init__.py | Exposes CosyVoice3 config/engine symbols for import. |
| tests/unit/engines/tts/cosyvoice3/test_cosyvoice3_engine.py | Adds unit tests for the new engine using mocked httpx streaming responses. |
| tests/unit/engines/tts/cosyvoice3/__init__.py | Marks the new test package. |
| pyproject.toml | Adds cosyvoice3 optional dependency group (httpx) and also adds a qwen3-asr group. |
| engines.yaml | Adds CosyVoice3 TTS config block and also enables a qwen3-asr STT block. |

# Qwen3-ASR: State-of-the-art multilingual ASR (52 languages)
# Install: uv sync --group qwen3-asr
qwen3-asr:
  enabled: true

Copilot AI Feb 13, 2026

qwen3-asr is enabled and points to app.engines.stt.qwen3asr.engine.Qwen3ASREngine, but there is no such module/class in the repository. With the default config, engine loading will fail at startup. Either add the missing engine implementation, or set enabled: false / remove this block until the engine exists.

Suggested change
- enabled: true
+ enabled: false # Disabled until Qwen3ASREngine is implemented
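
A startup check along these lines could catch an enabled engine whose class cannot be imported before the service starts handling requests. This is a minimal sketch, not the project's loader: the `class_path` key and both helper names are hypothetical, and the real engines.yaml schema may differ.

```python
import importlib


def engine_class_exists(dotted_path: str) -> bool:
    """Return True if 'package.module.ClassName' resolves to an attribute."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        return False
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, class_name)


def find_broken_engines(engines: dict[str, dict]) -> list[str]:
    """List enabled engines whose class path cannot be imported.

    The 'class_path' key is an assumption made for this sketch.
    """
    return [
        name
        for name, cfg in engines.items()
        if cfg.get("enabled") and not engine_class_exists(cfg.get("class_path", ""))
    ]
```

Running such a check in CI or at boot would turn a missing module into a clear configuration error rather than a failure during engine loading.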

Copilot uses AI. Check for mistakes.
Comment on lines +54 to +58
default_voice: "fin_default"
voices:
  default:
    prompt_wav_path: "/path/to/voice_sample.wav"
    prompt_text: "Transcript of the voice sample audio"

Copilot AI Feb 13, 2026

default_voice is set to fin_default, but the configured voices map only defines default. As written, requests without an explicit voice will raise Voice 'fin_default' not found. Align default_voice with an existing key (or rename the voice key).
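
One way to surface this mismatch at config load time, instead of on the first request, is to validate `default_voice` against the keys of `voices`. The sketch below uses plain dataclasses with hypothetical names (`VoiceSettings`, `TTSSettings`); the PR's actual models are Pydantic classes, where the same check would presumably live in a model validator.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceSettings:
    prompt_wav_path: str
    prompt_text: str = ""


@dataclass
class TTSSettings:
    default_voice: str
    voices: dict[str, VoiceSettings] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Fail fast when the default voice does not name a configured voice.
        if self.default_voice not in self.voices:
            known = ", ".join(sorted(self.voices)) or "<none>"
            raise ValueError(
                f"default_voice '{self.default_voice}' is not a configured voice "
                f"(known: {known})"
            )
```

With the configuration quoted above (`default_voice` of `fin_default` and a single `default` voice), constructing the settings object would raise `ValueError` immediately.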

speed: Speech speed multiplier
reference_audio: Reference audio bytes for voice cloning (overrides voice)
reference_text: Transcript of reference audio
**kwargs: Additional parameters (ignored)

Copilot AI Feb 13, 2026

The synthesize docstring says **kwargs are ignored, but prompt_wav_path / prompt_text are actually consumed via _resolve_voice(...). Update the docstring to reflect the supported engine-specific params so API users know how to use them.

Suggested change
- **kwargs: Additional parameters (ignored)
+ **kwargs: Engine-specific parameters, including:
+     - ``prompt_wav_path``: Optional path to a prompt WAV file used for zero-shot
+       voice cloning (overrides the configured voice prompt path).
+     - ``prompt_text``: Optional text transcript corresponding to the prompt audio.
+     Additional unexpected keyword arguments are ignored.

Comment on lines +250 to +258
# Collect all PCM chunks
pcm_data = b""
async for chunk in self._call_inference(
    tts_text=text,
    prompt_wav_path=wav_path,
    prompt_text=prompt_text,
    speed=effective_speed,
):
    pcm_data += chunk

Copilot AI Feb 13, 2026

Building pcm_data via repeated pcm_data += chunk in a loop is quadratic-time due to repeated reallocations/copies, and can become expensive for longer syntheses. Accumulate chunks in a list[bytes] (or a bytearray) and b"".join(...) once at the end.

Suggested change
- # Collect all PCM chunks
- pcm_data = b""
- async for chunk in self._call_inference(
-     tts_text=text,
-     prompt_wav_path=wav_path,
-     prompt_text=prompt_text,
-     speed=effective_speed,
- ):
-     pcm_data += chunk
+ # Collect all PCM chunks efficiently
+ pcm_chunks = []
+ async for chunk in self._call_inference(
+     tts_text=text,
+     prompt_wav_path=wav_path,
+     prompt_text=prompt_text,
+     speed=effective_speed,
+ ):
+     pcm_chunks.append(chunk)
+ pcm_data = b"".join(pcm_chunks)
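
The collect-then-join pattern can be tried in isolation with a stand-in async generator. This is a self-contained sketch: `fake_inference` and `collect_pcm` are hypothetical names and do not reproduce the engine's `_call_inference` signature.

```python
import asyncio
from collections.abc import AsyncIterator


async def fake_inference(n_chunks: int, chunk_size: int) -> AsyncIterator[bytes]:
    """Stand-in for a streaming inference call; yields fixed-size PCM chunks."""
    for i in range(n_chunks):
        yield bytes([i % 256]) * chunk_size


async def collect_pcm(stream: AsyncIterator[bytes]) -> bytes:
    """Gather all chunks, then copy them into one bytes object exactly once."""
    chunks: list[bytes] = []
    async for chunk in stream:
        chunks.append(chunk)
    return b"".join(chunks)


pcm = asyncio.run(collect_pcm(fake_inference(n_chunks=4, chunk_size=3)))
```

Because `bytes` is immutable, each `+=` may copy everything accumulated so far, whereas appending to a list only stores a reference and the single `join` copies each byte once.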

Comment on lines +73 to +84
    response = await self._client.get("/docs")
    logger.info(
        "CosyVoice3 server connected at %s (status: %d)",
        self.cv3_config.service_url,
        response.status_code,
    )
except Exception as e:
    logger.warning(
        "CosyVoice3 server at %s may not be reachable: %s",
        self.cv3_config.service_url,
        e,
    )

Copilot AI Feb 13, 2026

Connection verification in _initialize() logs a warning on failure but still returns successfully, so the engine becomes is_ready() even when the server is unreachable. Either raise EngineNotReadyError when the health check fails (and/or when the status code is non-2xx), or adjust the docstring/comments so initialization semantics are accurate.

Suggested change
-     response = await self._client.get("/docs")
-     logger.info(
-         "CosyVoice3 server connected at %s (status: %d)",
-         self.cv3_config.service_url,
-         response.status_code,
-     )
- except Exception as e:
-     logger.warning(
-         "CosyVoice3 server at %s may not be reachable: %s",
-         self.cv3_config.service_url,
-         e,
-     )
+     response = await self._client.get("/docs")
+     if response.is_error:
+         # Non-success status means the engine is not ready.
+         await self._client.aclose()
+         self._client = None
+         raise EngineNotReadyError(
+             f"CosyVoice3 server at {self.cv3_config.service_url} "
+             f"health check failed with status {response.status_code}"
+         )
+     logger.info(
+         "CosyVoice3 server connected at %s (status: %d)",
+         self.cv3_config.service_url,
+         response.status_code,
+     )
+ except Exception as e:
+     # Connection failure during health check: engine is not ready.
+     if self._client is not None:
+         await self._client.aclose()
+         self._client = None
+     logger.warning(
+         "CosyVoice3 server at %s may not be reachable: %s",
+         self.cv3_config.service_url,
+         e,
+     )
+     raise EngineNotReadyError(
+         f"CosyVoice3 server at {self.cv3_config.service_url} is not reachable"
+     ) from e
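
One detail to watch when adapting the suggestion above: an exception raised inside the `try` block is also caught by a broad `except Exception`, so a status-code failure raised there would be re-wrapped as a "not reachable" error. The sketch below separates the two cases by checking the status after the `try`. It is an illustration under stated assumptions: `EngineNotReadyError` is redefined locally as a stand-in, and `probe` is a hypothetical coroutine that returns an HTTP status code in place of a real client call.

```python
import asyncio
from collections.abc import Awaitable, Callable


class EngineNotReadyError(RuntimeError):
    """Local stand-in for the project's exception of the same name."""


async def check_server(probe: Callable[[], Awaitable[int]], url: str) -> int:
    """Run the probe and raise EngineNotReadyError unless it returns 2xx."""
    try:
        status = await probe()
    except Exception as exc:  # transport-level failure
        raise EngineNotReadyError(f"server at {url} is not reachable") from exc
    # Checked outside the try block so this error is not re-wrapped above.
    if not 200 <= status < 300:
        raise EngineNotReadyError(f"server at {url} returned status {status}")
    return status
```

Keeping the status check outside the `try` preserves the more specific message, which makes an unreachable server distinguishable from one that responds with an error.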

    voice, reference_audio, reference_text, **kwargs
)
prompt_text = self._prepare_prompt_text(prompt_text)
effective_speed = speed if speed != 1.0 else self.cv3_config.speed

Copilot AI Feb 13, 2026

Same speed-defaulting issue as in synthesize(): treating speed=1.0 as "use config" prevents callers from explicitly requesting normal speed when the config default is not 1.0. This should follow the base interface semantics where the request parameter overrides config.

Suggested change
- effective_speed = speed if speed != 1.0 else self.cv3_config.speed
+ effective_speed = self.cv3_config.speed if speed is None else speed
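
The difference between the two defaulting rules is easiest to see side by side. Both function names below are hypothetical and exist only for this sketch; `legacy_speed` mirrors the flagged expression and `resolve_speed` mirrors the `None`-sentinel approach.

```python
from typing import Optional


def legacy_speed(requested: float, configured: float) -> float:
    """Behavior flagged in the review: 1.0 is treated as 'not provided'."""
    return requested if requested != 1.0 else configured


def resolve_speed(requested: Optional[float], configured: float) -> float:
    """None means 'not provided'; any explicit value, including 1.0, wins."""
    return configured if requested is None else requested
```

With a configured default of 1.2, `legacy_speed(1.0, 1.2)` returns 1.2 and silently discards the caller's request, while `resolve_speed(1.0, 1.2)` returns 1.0 and `resolve_speed(None, 1.2)` falls back to 1.2.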

Comment on lines +362 to +364

all_pcm_data += pcm_chunk


Copilot AI Feb 13, 2026

all_pcm_data += pcm_chunk inside the streaming loop has the same quadratic growth problem as batch mode, and will scale poorly for long streams. Consider accumulating into a bytearray or list[bytes] and joining once for the final response.
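
For the streaming path, a `bytearray` lets each chunk be forwarded to the caller and appended to a running buffer without re-copying what has already been collected. A minimal synchronous sketch, with the hypothetical helper `stream_and_collect` standing in for the engine's async streaming loop:

```python
from collections.abc import Iterable, Iterator


def stream_and_collect(chunks: Iterable[bytes], sink: bytearray) -> Iterator[bytes]:
    """Yield each chunk to the caller while appending it to a shared buffer."""
    for chunk in chunks:
        sink.extend(chunk)  # amortized cost proportional to len(chunk)
        yield chunk


buffer = bytearray()
emitted = list(stream_and_collect([b"ab", b"cd", b"ef"], buffer))
final_audio = bytes(buffer)  # single copy for the final response
```

The same shape carries over to an `async for` loop: extend the buffer inside the loop and convert it to `bytes` once after the stream ends.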

Comment on lines +115 to +116
return dyn_wav_path, dyn_prompt_text, None


Copilot AI Feb 13, 2026

prompt_wav_path is accepted via **kwargs (from user-provided engine_params) and then opened as a local filesystem path. This allows an API caller to force the server to read arbitrary local files and exfiltrate them to the external CosyVoice3 service. Restrict this to a safe allowlist directory / known voice IDs, or require clients to provide reference_audio bytes instead of a server-side path.

Suggested change
- return dyn_wav_path, dyn_prompt_text, None
+ # Validate that the dynamic path stays within an allowed directory derived from
+ # configured voices, to avoid arbitrary local file access.
+ allowed_base_dirs: list[Path] = []
+ for voice_cfg in self.cv3_config.voices.values():
+     if getattr(voice_cfg, "prompt_wav_path", None):
+         try:
+             base_dir = Path(voice_cfg.prompt_wav_path).resolve().parent
+             if base_dir not in allowed_base_dirs:
+                 allowed_base_dirs.append(base_dir)
+         except Exception:
+             # Ignore malformed paths in config; they simply won't be used as bases.
+             continue
+ dyn_path = Path(dyn_wav_path)
+ try:
+     resolved_dyn_path = dyn_path.resolve()
+ except Exception as exc:
+     raise SynthesisError(
+         "Invalid prompt_wav_path provided. Use a configured voice or "
+         "provide reference_audio bytes instead."
+     ) from exc
+ is_allowed = False
+ for base_dir in allowed_base_dirs:
+     try:
+         resolved_dyn_path.relative_to(base_dir)
+         is_allowed = True
+         break
+     except ValueError:
+         continue
+ if not is_allowed:
+     raise SynthesisError(
+         "The provided prompt_wav_path is not in an allowed directory. "
+         "Use a configured voice or provide reference_audio bytes instead."
+     )
+ return str(resolved_dyn_path), dyn_prompt_text, None
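
The containment check at the core of this suggestion can be written more compactly on Python 3.9+, where `Path.is_relative_to` is available. A minimal sketch, assuming the allowed directories are already known; `is_within` is a hypothetical helper, not part of the PR:

```python
from pathlib import Path


def is_within(candidate: str, allowed_dirs: list[Path]) -> bool:
    """True if the resolved candidate lies inside any allowed directory."""
    # resolve() collapses '..' segments and follows symlinks before comparing.
    resolved = Path(candidate).resolve()
    return any(resolved.is_relative_to(base.resolve()) for base in allowed_dirs)
```

Resolving before comparing is the important step: a path such as `<allowed>/../secret.wav` starts with the allowed prefix as a string but resolves to a location outside it.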

    voice, reference_audio, reference_text, **kwargs
)
prompt_text = self._prepare_prompt_text(prompt_text)
effective_speed = speed if speed != 1.0 else self.cv3_config.speed

Copilot AI Feb 13, 2026

effective_speed = speed if speed != 1.0 else self.cv3_config.speed makes it impossible for callers to override a non-1.0 configured default back to 1.0, even though BaseTTSEngine documents that the speed parameter overrides the config. Consider using an optional speed: float | None (None => config default), or always honor the explicit request value (including 1.0).

Suggested change
- effective_speed = speed if speed != 1.0 else self.cv3_config.speed
+ effective_speed = speed

…gnment

- Fix engines.yaml: add fin_default voice key to match default_voice
- Fix quadratic pcm_data concatenation → list[bytes] + b''.join()
- Fix speed defaulting: use float|None so callers can override to 1.0
- Add prompt_wav_path security validation via allowed_voice_dirs
- Update _initialize docstring to clarify best-effort health check
@phonk2682 phonk2682 requested a review from minhsaco99 February 13, 2026 10:35

2 participants