Feature/add voxcpm engine #9

Closed
phonk2682 wants to merge 11 commits into main from feature/add_voxcpm_engine

Conversation

@phonk2682
Collaborator

Description

Add VoxCPM TTS engine integration with voice cloning support. This PR introduces a new TTS provider (VoxCPM) and updates the TTS API to support audio file uploads for zero-shot voice cloning.

Key changes:

  • Add VoxCPMEngine implementation with batch and streaming synthesis
  • Update TTS router to accept audio file uploads via multipart form
  • Add speaker_wav parameter to base TTS engine interface
  • Add base64 serialization for audio data in API responses
  • Refactor audio upload validation in deps.py

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • New Engine (STT/TTS provider)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking code cleanup or optimization)
  • Documentation update
  • Performance improvement

Checklist

  • I have read the CONTRIBUTING guide
  • My code follows the project's code style (make format)
  • Linting passes (make lint)
  • Tests pass (make test)
  • Documentation updated (if needed)
  • No sensitive information (API keys, secrets) included

Related Issues

Closes #

Testing & Verification

Automated Tests

  • Unit tests added/updated
  • All existing tests pass

Manual Verification (if applicable)

Tested TTS synthesis via API with VoxCPM engine using reference audio for voice cloning. Verified both batch and streaming endpoints return valid audio output.

API Endpoints Tested (if applicable)

  • Batch endpoint (POST /api/v1/tts/synthesize)
  • SSE streaming (POST /api/v1/tts/synthesize/stream)
  • WebSocket (WS /api/v1/tts/synthesize/ws)

Engine-Specific Tests (if applicable)

  • Engine type: TTS
  • Provider: VoxCPM (OpenBMB)
  • Model: openbmb/VoxCPM0.5
  • Device tested: cuda

Security Impact

  • No security implications
  • Security impact (please describe below)

- Add speaker_wav parameter to BaseTTSEngine.synthesize() and synthesize_stream()
- Add base64 field_serializer for audio_data in TTSChunk and TTSResponse
- Refactor validate_audio_upload to support both required and optional modes
- Add VoxCPMEngine with synthesize and synthesize_stream methods
- Add VoxCPMConfig for engine configuration
- Implement torchaudio.load monkey patch for soundfile fallback
- Change synthesize endpoints from TTSRequest body to Query/Form params
- Add audio file upload support for voice cloning
- Add validation: require either audio or voice parameter
- Add prompt_text parameter for VoxCPM
- Add voxcpm dependency group to pyproject.toml
- Add VoxCPM TTS engine configuration
- Use openbmb/VoxCPM0.5 as default model (HuggingFace)
Copilot AI review requested due to automatic review settings on January 22, 2026 07:57

Copilot AI left a comment


Pull request overview

This PR adds VoxCPM TTS engine integration with zero-shot voice cloning capabilities. The implementation includes batch and streaming synthesis modes, audio file upload support for voice cloning, and updates to the base TTS interface to support reference audio.

Changes:

  • Add VoxCPMEngine with support for batch and streaming synthesis
  • Update TTS API endpoints to accept multipart form data with audio file uploads
  • Add base64 serialization for audio data in API responses
  • Extend base TTS engine interface with speaker_wav parameter

Reviewed changes

Copilot reviewed 8 out of 10 changed files in this pull request and generated 23 comments.

Summary per file:

  • pyproject.toml: Add voxcpm dependency to optional dependencies
  • engines.yaml: Configure VoxCPM engine with model and device settings
  • app/models/engine.py: Add base64 serialization for audio data in TTSChunk and TTSResponse
  • app/engines/base.py: Add speaker_wav parameter to TTS engine interface methods
  • app/engines/tts/voxcpm/config.py: Define VoxCPM-specific configuration model
  • app/engines/tts/voxcpm/engine.py: Implement VoxCPM TTS engine with batch and streaming synthesis
  • app/api/deps.py: Refactor audio upload validation to support optional uploads
  • app/api/routers/tts.py: Implement TTS endpoints with multipart form support and voice cloning
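
The base64 serialization listed for app/models/engine.py could look like the following Pydantic v2 sketch; the field name comes from the PR summary, while the rest of the model shape is an assumption.

```python
# Hedged sketch of base64-serializing audio bytes in API responses (Pydantic v2).
# Field and class names follow the PR summary; other fields are assumptions.
import base64

from pydantic import BaseModel, field_serializer


class TTSResponse(BaseModel):
    audio_data: bytes
    sample_rate: int = 16000

    @field_serializer("audio_data")
    def serialize_audio(self, value: bytes) -> str:
        # Raw bytes are not JSON-safe; emit base64 text instead
        return base64.b64encode(value).decode("ascii")
```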


# This acts as a fallback/fix for environments where torchaudio 2.10 tries to use broken torchcodec
def safe_load(filepath, **kwargs):
    try:
        data, sampler_rate = sf.read(filepath)

Copilot AI Jan 22, 2026


Variable name "sampler_rate" should be "sample_rate" for consistency with the rest of the codebase and standard audio processing terminology.

Comment on lines +92 to +95
# Create temp file, close it so other processes can open it if needed
fd, temp_wav_path = tempfile.mkstemp(suffix=".wav")
os.write(fd, speaker_wav)
os.close(fd)

Copilot AI Jan 22, 2026


There is a potential resource leak in the temporary file handling. If an exception occurs between creating the temp file and writing to it, or if os.write() fails, the file descriptor may remain open and the temporary file may not be cleaned up. Use a more robust pattern with try-finally or a context manager to ensure the file descriptor is always closed properly.

Suggested change:

Before:
# Create temp file, close it so other processes can open it if needed
fd, temp_wav_path = tempfile.mkstemp(suffix=".wav")
os.write(fd, speaker_wav)
os.close(fd)

After:
# Create temp file, ensure file descriptor is always closed
fd = None
try:
    fd, temp_wav_path = tempfile.mkstemp(suffix=".wav")
    os.write(fd, speaker_wav)
finally:
    if fd is not None:
        try:
            os.close(fd)
        except OSError:
            # Best-effort close; log if needed by surrounding infrastructure
            pass
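
An alternative to the try/finally pattern, sketched with a context manager; `write_speaker_wav` is a hypothetical helper name, not code from the diff.

```python
# Hypothetical helper illustrating a context-manager alternative: the file
# descriptor is closed even if the write raises. delete=False keeps the file
# on disk so other processes can open it by path afterwards.
import tempfile


def write_speaker_wav(speaker_wav: bytes) -> str:
    """Write reference audio bytes to a temp .wav file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(speaker_wav)
        return f.name  # closed automatically on leaving the with-block
```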

Comment on lines +218 to +219



Copilot AI Jan 22, 2026


Unnecessary blank lines. Remove the extra whitespace for better code readability and consistency.

Suggested change: remove the blank lines.
Comment on lines +24 to +31
async def synthesize_text(
    text: str = Query(..., description="Text to synthesize"),
    prompt_text: str | None = Query(None, description="Text content of the reference audio (if provided)"),
    voice: str | None = Form(None, description="Voice ID to use"),
    audio: bytes | None = Depends(get_optional_audio_upload),
    speed: float = Query(1.0, description="Speed multiplier"),
    engine_params: str | None = Query(None, description="JSON engine parameters"),
    tts_engine: BaseTTSEngine = Depends(get_tts_engine),

Copilot AI Jan 22, 2026


Mixing Form and Query parameters in a single endpoint can cause issues with FastAPI. The 'voice' parameter is declared as Form but 'text' and other parameters are Query. This inconsistency may lead to unexpected behavior. Consider using all Query parameters or restructuring to use a proper multipart/form-data request body with all form fields.

Comment on lines +46 to +55
# Validation: Must have either audio (speaker reference) or voice/default
# Actually, some engines might have defaults, but user requested explicit "one of them must be present" logic?
# User said: "Khi user gửi request thì bắt buộc 1 trong 2 phải có" (When user sends request, must have 1 of 2)

if not audio and not voice:
    # Check if engine has a default voice?
    # But user rule is strict: "Must have 1 of 2"
    raise HTTPException(
        400, "Either 'audio' file or 'voice' ID must be provided"
    )

Copilot AI Jan 22, 2026


The validation logic requires either audio or voice to be provided, but this may be too restrictive. Some TTS engines might have default voices configured in their settings that don't require explicit voice parameters. Consider checking if the engine has a default voice before raising this error, or document this requirement more clearly in the API specification.

Suggested change:

Before:
# Validation: Must have either audio (speaker reference) or voice/default
# Actually, some engines might have defaults, but user requested explicit "one of them must be present" logic?
# User said: "Khi user gửi request thì bắt buộc 1 trong 2 phải có" (When user sends request, must have 1 of 2)
if not audio and not voice:
    # Check if engine has a default voice?
    # But user rule is strict: "Must have 1 of 2"
    raise HTTPException(
        400, "Either 'audio' file or 'voice' ID must be provided"
    )

After:
# Validation: Prefer explicit audio (speaker reference) or voice ID when provided.
# However, some engines may have a configured default voice and do not require either.
# If the engine does not expose a default voice, enforce that one of the two must be present.
if not audio and not voice:
    # Allow engines that define a default voice attribute to proceed without explicit voice/audio.
    default_voice = getattr(tts_engine, "default_voice", None)
    if default_voice is None:
        raise HTTPException(
            400, "Either 'audio' file or 'voice' ID must be provided"
        )

    duration_seconds = len(wav) / self._model.tts_model.sample_rate

except Exception as e:
    raise SynthesisError(f"Synthesis failed: {e}") from e

Copilot AI Jan 22, 2026


The error message exposes internal exception details which could leak sensitive information about the system implementation. Consider providing a more generic error message to the user while logging the detailed exception internally.
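
The pattern the comment recommends could be sketched as follows; SynthesisError appears in the diff above, but the wrapper and logger names are illustrative, not the PR's code.

```python
# Hedged sketch: keep the detailed traceback in server logs, return a generic
# message to callers. The helper name and logger name are illustrative.
import logging

logger = logging.getLogger("app.engines.tts.voxcpm")


class SynthesisError(Exception):
    """Raised when TTS synthesis fails; message stays generic for clients."""


def run_synthesis(synthesize_fn):
    try:
        return synthesize_fn()
    except Exception as e:
        # Full details (type, message, traceback) are logged server-side only
        logger.exception("VoxCPM synthesis failed")
        raise SynthesisError("Synthesis failed") from e
```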


# TTS Models
voxcpm = [
    "voxcpm>=0.0.1",  # Use latest available

Copilot AI Jan 22, 2026


The version constraint "voxcpm>=0.0.1" is too permissive and may pull in incompatible future versions. Consider specifying a more restrictive version range to ensure compatibility and prevent breaking changes from being automatically installed.

Suggested change:

Before:
    "voxcpm>=0.0.1",  # Use latest available

After:
    "voxcpm>=0.0.1,<0.1.0",  # Constrain to tested minor range

Comment on lines +23 to +24



Copilot AI Jan 22, 2026


Unnecessary blank lines. Remove for better code consistency.

Suggested change: remove the blank lines.
Comment on lines +67 to +68
self._model = None


Copilot AI Jan 22, 2026


The cleanup method only sets the model to None without explicitly releasing GPU memory or other resources. If the model uses CUDA tensors, consider explicitly calling torch.cuda.empty_cache() or moving the model to CPU before deletion to ensure proper resource cleanup.

Suggested change:

Before:
self._model = None

After:
# Explicitly release model resources, especially GPU memory if used
if self._model is not None:
    model = self._model
    # Move model to CPU if possible to free GPU memory before dropping reference
    if hasattr(model, "to"):
        try:
            model.to("cpu")
        except Exception:
            # Best-effort; ignore failures during cleanup
            pass
    # Drop local reference
    del model
    self._model = None
    # Clear CUDA cache if CUDA is available to ensure GPU memory is released
    if torch.cuda.is_available():
        try:
            torch.cuda.empty_cache()
        except Exception:
            # Ignore CUDA cache cleanup errors during shutdown
            pass


    await websocket.close()

except WebSocketDisconnect:

Copilot AI Jan 22, 2026


'except' clause does nothing but pass and there is no explanatory comment.

@phonk2682 phonk2682 closed this Jan 23, 2026
@phonk2682 phonk2682 deleted the feature/add_voxcpm_engine branch January 23, 2026 17:35