- Add speaker_wav parameter to BaseTTSEngine.synthesize() and synthesize_stream()
- Add base64 field_serializer for audio_data in TTSChunk and TTSResponse
- Refactor validate_audio_upload to support both required and optional modes

- Add VoxCPMEngine with synthesize and synthesize_stream methods
- Add VoxCPMConfig for engine configuration
- Implement torchaudio.load monkey patch for soundfile fallback

- Change synthesize endpoints from TTSRequest body to Query/Form params
- Add audio file upload support for voice cloning
- Add validation: require either audio or voice parameter
- Add prompt_text parameter for VoxCPM

- Add voxcpm dependency group to pyproject.toml

- Add VoxCPM TTS engine configuration
- Use openbmb/VoxCPM0.5 as default model (HuggingFace)
Pull request overview
This PR adds VoxCPM TTS engine integration with zero-shot voice cloning capabilities. The implementation includes batch and streaming synthesis modes, audio file upload support for voice cloning, and updates to the base TTS interface to support reference audio.
Changes:
- Add VoxCPMEngine with support for batch and streaming synthesis
- Update TTS API endpoints to accept multipart form data with audio file uploads
- Add base64 serialization for audio data in API responses
- Extend base TTS engine interface with speaker_wav parameter
Reviewed changes
Copilot reviewed 8 out of 10 changed files in this pull request and generated 23 comments.
| File | Description |
|---|---|
| pyproject.toml | Add voxcpm dependency to optional dependencies |
| engines.yaml | Configure VoxCPM engine with model and device settings |
| app/models/engine.py | Add base64 serialization for audio data in TTSChunk and TTSResponse |
| app/engines/base.py | Add speaker_wav parameter to TTS engine interface methods |
| app/engines/tts/voxcpm/config.py | Define VoxCPM-specific configuration model |
| app/engines/tts/voxcpm/engine.py | Implement VoxCPM TTS engine with batch and streaming synthesis |
| app/api/deps.py | Refactor audio upload validation to support optional uploads |
| app/api/routers/tts.py | Implement TTS endpoints with multipart form support and voice cloning |
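The base64 serialization listed for app/models/engine.py can be illustrated with the standard library alone. This is a minimal sketch of the encoding round trip, not the PR's field_serializer: `encode_audio`, `decode_audio`, and the `sample_rate` field are hypothetical names chosen for the example; only `audio_data` comes from the PR.

```python
import base64
import json


def encode_audio(audio_data: bytes) -> str:
    """Encode raw audio bytes as ASCII base64 so they fit in a JSON string."""
    return base64.b64encode(audio_data).decode("ascii")


def decode_audio(encoded: str) -> bytes:
    """Client-side inverse; validate=True rejects non-base64 characters."""
    return base64.b64decode(encoded, validate=True)


# Hypothetical response body: raw bytes cannot be JSON-encoded directly,
# so the audio travels as a base64 string next to the other fields.
payload = json.dumps(
    {"audio_data": encode_audio(b"\x00\x01RIFF"), "sample_rate": 16000}
)
restored = decode_audio(json.loads(payload)["audio_data"])
```

Base64 grows the payload by roughly one third, which matters mostly for the streaming endpoint where many small chunks are sent.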
| # This acts as a fallback/fix for environments where torchaudio 2.10 tries to use broken torchcodec | ||
| def safe_load(filepath, **kwargs): | ||
| try: | ||
| data, sampler_rate = sf.read(filepath) |
Variable name "sampler_rate" should be "sample_rate" for consistency with the rest of the codebase and standard audio processing terminology.
app/engines/tts/voxcpm/engine.py
Outdated
| # Create temp file, close it so other processes can open it if needed | ||
| fd, temp_wav_path = tempfile.mkstemp(suffix=".wav") | ||
| os.write(fd, speaker_wav) | ||
| os.close(fd) |
There is a potential resource leak in the temporary file handling. If an exception occurs between creating the temp file and writing to it, or if os.write() fails, the file descriptor may remain open and the temporary file may not be cleaned up. Use a more robust pattern with try-finally or a context manager to ensure the file descriptor is always closed properly.
| # Create temp file, close it so other processes can open it if needed | |
| fd, temp_wav_path = tempfile.mkstemp(suffix=".wav") | |
| os.write(fd, speaker_wav) | |
| os.close(fd) | |
| # Create temp file, ensure file descriptor is always closed | |
| fd = None | |
| try: | |
| fd, temp_wav_path = tempfile.mkstemp(suffix=".wav") | |
| os.write(fd, speaker_wav) | |
| finally: | |
| if fd is not None: | |
| try: | |
| os.close(fd) | |
| except OSError: | |
| # Best-effort close; log if needed by surrounding infrastructure | |
| pass |
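The suggestion above closes the descriptor but never deletes the temporary file. A minimal sketch that covers both, assuming the reference audio arrives as bytes, might look like the following; `write_temp_wav` and `temp_wav` are hypothetical helpers, not names from this PR.

```python
import contextlib
import os
import tempfile
from collections.abc import Iterator


def write_temp_wav(speaker_wav: bytes) -> str:
    """Write reference audio to a temp file, close it, and return its path."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        # os.fdopen takes ownership of fd; the with-block closes it even if
        # the write raises.
        with os.fdopen(fd, "wb") as handle:
            handle.write(speaker_wav)
    except BaseException:
        # Do not leave a partial file behind when the write fails.
        with contextlib.suppress(OSError):
            os.remove(path)
        raise
    return path


@contextlib.contextmanager
def temp_wav(speaker_wav: bytes) -> Iterator[str]:
    """Yield the path of a closed temp file and delete it on exit."""
    path = write_temp_wav(speaker_wav)
    try:
        yield path
    finally:
        with contextlib.suppress(OSError):
            os.remove(path)
```

A caller would wrap the synthesis call in `with temp_wav(speaker_wav) as path:` so the file is removed whether synthesis succeeds or raises. Writing through a file object also avoids relying on a single `os.write` call, which is allowed to write fewer bytes than requested.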
app/engines/tts/voxcpm/engine.py
Outdated
|
|
||
|
|
Unnecessary blank lines. Remove the extra whitespace for better code readability and consistency.
| async def synthesize_text( | ||
| text: str = Query(..., description="Text to synthesize"), | ||
| prompt_text: str | None = Query(None, description="Text content of the reference audio (if provided)"), | ||
| voice: str | None = Form(None, description="Voice ID to use"), | ||
| audio: bytes | None = Depends(get_optional_audio_upload), | ||
| speed: float = Query(1.0, description="Speed multiplier"), | ||
| engine_params: str | None = Query(None, description="JSON engine parameters"), | ||
| tts_engine: BaseTTSEngine = Depends(get_tts_engine), |
Mixing Form and Query parameters in a single endpoint can cause issues with FastAPI. The 'voice' parameter is declared as Form but 'text' and other parameters are Query. This inconsistency may lead to unexpected behavior. Consider using all Query parameters or restructuring to use a proper multipart/form-data request body with all form fields.
app/api/routers/tts.py
Outdated
| # Validation: Must have either audio (speaker reference) or voice/default | ||
| # Actually, some engines might have defaults, but user requested explicit "one of them must be present" logic? | ||
| # User said: "Khi user gửi request thì bắt buộc 1 trong 2 phải có" (When user sends request, must have 1 of 2) | ||
|
|
||
| if not audio and not voice: | ||
| # Check if engine has a default voice? | ||
| # But user rule is strict: "Must have 1 of 2" | ||
| raise HTTPException( | ||
| 400, "Either 'audio' file or 'voice' ID must be provided" | ||
| ) |
The validation logic requires either audio or voice to be provided, but this may be too restrictive. Some TTS engines might have default voices configured in their settings that don't require explicit voice parameters. Consider checking if the engine has a default voice before raising this error, or document this requirement more clearly in the API specification.
| # Validation: Must have either audio (speaker reference) or voice/default | |
| # Actually, some engines might have defaults, but user requested explicit "one of them must be present" logic? | |
| # User said: "Khi user gửi request thì bắt buộc 1 trong 2 phải có" (When user sends request, must have 1 of 2) | |
| if not audio and not voice: | |
| # Check if engine has a default voice? | |
| # But user rule is strict: "Must have 1 of 2" | |
| raise HTTPException( | |
| 400, "Either 'audio' file or 'voice' ID must be provided" | |
| ) | |
| # Validation: Prefer explicit audio (speaker reference) or voice ID when provided. | |
| # However, some engines may have a configured default voice and do not require either. | |
| # If the engine does not expose a default voice, enforce that one of the two must be present. | |
| if not audio and not voice: | |
| # Allow engines that define a default voice attribute to proceed without explicit voice/audio. | |
| default_voice = getattr(tts_engine, "default_voice", None) | |
| if default_voice is None: | |
| raise HTTPException( | |
| 400, "Either 'audio' file or 'voice' ID must be provided" | |
| ) |
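The rule in the suggestion can also be pulled out of the route handler into a small pure function that is easy to unit test. This is only a sketch: `resolve_voice_source`, `VoiceSource`, and `MissingVoiceSourceError` are hypothetical names, and giving uploaded audio precedence over a voice ID is an assumption, since the PR does not say which wins when both are sent.

```python
from __future__ import annotations

from dataclasses import dataclass


class MissingVoiceSourceError(ValueError):
    """Raised when the request has no audio, no voice ID, and no default."""


@dataclass(frozen=True)
class VoiceSource:
    kind: str  # "audio", "voice", or "default"
    value: bytes | str


def resolve_voice_source(
    audio: bytes | None,
    voice: str | None,
    default_voice: str | None = None,
) -> VoiceSource:
    """Pick the speaker source for one request.

    Assumed precedence: uploaded audio, then explicit voice ID, then the
    engine's default voice (if it defines one).
    """
    if audio:
        return VoiceSource("audio", audio)
    if voice:
        return VoiceSource("voice", voice)
    if default_voice is not None:
        return VoiceSource("default", default_voice)
    raise MissingVoiceSourceError(
        "Either 'audio' file or 'voice' ID must be provided"
    )
```

The endpoint would then catch `MissingVoiceSourceError` and translate it into the 400 response, passing `getattr(tts_engine, "default_voice", None)` as the third argument.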
| duration_seconds = len(wav) / self._model.tts_model.sample_rate | ||
|
|
||
| except Exception as e: | ||
| raise SynthesisError(f"Synthesis failed: {e}") from e |
The error message exposes internal exception details which could leak sensitive information about the system implementation. Consider providing a more generic error message to the user while logging the detailed exception internally.
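One way to follow this advice is to log the full exception on the server and raise an error whose message carries no internal detail. The sketch below assumes a stand-in `SynthesisError` class and a hypothetical logger name and `run_synthesis` wrapper; none of these are taken from the PR's code.

```python
import logging
from collections.abc import Callable

logger = logging.getLogger("tts.voxcpm")  # hypothetical logger name


class SynthesisError(Exception):
    """Stand-in for the project's SynthesisError."""


def run_synthesis(synthesize: Callable[[str], bytes], text: str) -> bytes:
    """Call the engine and hide internal failure details from the caller."""
    try:
        return synthesize(text)
    except Exception as exc:
        # The traceback, including paths and model internals, goes to the
        # server log only.
        logger.exception("VoxCPM synthesis failed")
        # The message that may reach the client is generic; the original
        # exception stays chained as __cause__ for server-side debugging.
        raise SynthesisError("Synthesis failed") from exc
```

Keeping `from exc` preserves the chain for anyone reading the logs or a debugger, while `str(error)` no longer repeats the underlying message.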
|
|
||
| # TTS Models | ||
| voxcpm = [ | ||
| "voxcpm>=0.0.1", # Use latest available |
The version constraint "voxcpm>=0.0.1" is too permissive and may pull in incompatible future versions. Consider specifying a more restrictive version range to ensure compatibility and prevent breaking changes from being automatically installed.
| "voxcpm>=0.0.1", # Use latest available | |
| "voxcpm>=0.0.1,<0.1.0", # Constrain to tested minor range |
|
|
||
|
|
Unnecessary blank lines. Remove for better code consistency.
| self._model = None | ||
|
|
The cleanup method only sets the model to None without explicitly releasing GPU memory or other resources. If the model uses CUDA tensors, consider explicitly calling torch.cuda.empty_cache() or moving the model to CPU before deletion to ensure proper resource cleanup.
| self._model = None | |
| # Explicitly release model resources, especially GPU memory if used | |
| if self._model is not None: | |
| model = self._model | |
| # Move model to CPU if possible to free GPU memory before dropping reference | |
| if hasattr(model, "to"): | |
| try: | |
| model.to("cpu") | |
| except Exception: | |
| # Best-effort; ignore failures during cleanup | |
| pass | |
| # Drop local reference | |
| del model | |
| self._model = None | |
| # Clear CUDA cache if CUDA is available to ensure GPU memory is released | |
| if torch.cuda.is_available(): | |
| try: | |
| torch.cuda.empty_cache() | |
| except Exception: | |
| # Ignore CUDA cache cleanup errors during shutdown | |
| pass |
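A shorter variant of the same idea is sketched below. It assumes that dropping the last reference and running the garbage collector is enough to free the model's tensors, after which `torch.cuda.empty_cache()` returns cached blocks to the driver. `EngineCleanupSketch` is a hypothetical stand-in for the engine class, and torch is imported optionally only so the sketch also runs where PyTorch is not installed.

```python
import gc

try:
    import torch  # optional here so the sketch runs without PyTorch
except ImportError:
    torch = None


class EngineCleanupSketch:
    """Hypothetical stand-in showing only the cleanup path."""

    def __init__(self, model=None):
        self._model = model

    def cleanup(self) -> None:
        if self._model is None:
            return  # already cleaned up; safe to call twice
        # Drop the only reference held by the engine, then collect so any
        # reference cycles holding tensors are broken promptly.
        self._model = None
        gc.collect()
        # empty_cache releases cached, unused GPU memory; it cannot free
        # tensors that are still referenced elsewhere.
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()
```

If other objects still hold references to the model, its GPU memory stays allocated regardless of this method, so the cleanup is best-effort.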
|
|
||
| await websocket.close() | ||
|
|
||
| except WebSocketDisconnect: |
'except' clause does nothing but pass and there is no explanatory comment.
…event loop on Python 3.12
Description
Add VoxCPM TTS engine integration with voice cloning support. This PR introduces a new TTS provider (VoxCPM) and updates the TTS API to support audio file uploads for zero-shot voice cloning.
Key changes:
- speaker_wav parameter to base TTS engine interface
Type of Change
Checklist
- make format
- make lint
- make test
Related Issues
Closes #
Testing & Verification
Automated Tests
Manual Verification (if applicable)
Tested TTS synthesis via API with VoxCPM engine using reference audio for voice cloning. Verified both batch and streaming endpoints return valid audio output.
API Endpoints Tested (if applicable)
- POST /api/v1/tts/synthesize
- POST /api/v1/tts/synthesize/stream
- WS /api/v1/tts/synthesize/ws
Engine-Specific Tests (if applicable)
Security Impact