Summary
HybridClaw has inbound audio transcription and can send generated audio files back through some channels, but it does not ship a first-party TTS runtime config or built-in speech synthesis provider.
Why
Voice is already partially present in the product surface. A native TTS path would make voice interactions feel complete instead of requiring custom scripts or external MCP wrappers.
Proposed scope
- Add first-party tts.* runtime configuration.
- Add at least one built-in speech synthesis provider abstraction with pluggable backends.
- Support generating outbound audio replies directly from agent text.
- Add per-channel delivery rules for supported platforms.
- Allow voice preferences per agent/session where practical.
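As a rough illustration of the config and provider abstraction, here is a minimal sketch in Python. Python is used for illustration only and is not a statement about HybridClaw's codebase. Every name here (TtsConfig, SpeechProvider, SilenceProvider, PROVIDERS, select_provider, and the individual config fields) is hypothetical. The placeholder backend emits silent WAV audio so the example runs without any external service.

```python
import io
import wave
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


@dataclass(frozen=True)
class TtsConfig:
    """Hypothetical shape for the tts.* settings; field names are assumptions."""
    enabled: bool = False
    provider: str = "silence"
    voice: str = "default"
    audio_format: str = "wav"
    max_duration_seconds: float = 60.0


class SpeechProvider(Protocol):
    """Interface each pluggable backend would implement."""

    def synthesize(self, text: str, config: TtsConfig) -> bytes: ...


class SilenceProvider:
    """Placeholder backend: emits silent WAV audio sized to the text length."""

    SAMPLE_RATE = 16_000

    def synthesize(self, text: str, config: TtsConfig) -> bytes:
        # Stand-in for speech length: 60 ms per character, capped by config.
        cap_ms = int(config.max_duration_seconds * 1000)
        millis = min(len(text) * 60, cap_ms)
        frames = millis * self.SAMPLE_RATE // 1000
        buffer = io.BytesIO()
        with wave.open(buffer, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 16-bit samples
            wav.setframerate(self.SAMPLE_RATE)
            wav.writeframes(b"\x00\x00" * frames)
        return buffer.getvalue()


# Registry of backend factories; a real backend would be added here.
PROVIDERS: Dict[str, Callable[[], SpeechProvider]] = {"silence": SilenceProvider}


def select_provider(config: TtsConfig) -> SpeechProvider:
    """Look up the configured backend, failing loudly on unknown names."""
    try:
        factory = PROVIDERS[config.provider]
    except KeyError:
        raise ValueError(f"unknown TTS provider: {config.provider!r}") from None
    return factory()
```

Keeping the registry separate from the interface means a second backend only needs a new PROVIDERS entry, which matches the "one backend now, more later" intent.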
Candidate UX
- /tts on|off
- /tts voice <name>
- gateway config for provider, voice, format, and max duration
- optional "reply in voice" channel/session setting
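The two slash commands above could be parsed along these lines. This is a sketch only: TtsCommand and parse_tts_command are hypothetical names, and treating everything after "voice" as the voice name (so names may contain spaces) is an assumption, not a decided behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TtsCommand:
    action: str                  # "on", "off", or "voice"
    voice: Optional[str] = None  # set only when action == "voice"


def parse_tts_command(message: str) -> Optional[TtsCommand]:
    """Return a TtsCommand for a well-formed /tts message, else None."""
    parts = message.strip().split()
    if not parts or parts[0] != "/tts":
        return None
    args = parts[1:]
    if len(args) == 1 and args[0] in ("on", "off"):
        return TtsCommand(action=args[0])
    if len(args) >= 2 and args[0] == "voice":
        # Assumption: voice names may contain spaces, so rejoin the rest.
        return TtsCommand(action="voice", voice=" ".join(args[1:]))
    return None
```

Returning None for malformed input leaves it to the caller to decide whether to show usage help or pass the message through as ordinary text.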
Implementation notes
- Start with one reliable provider/backend and a clean abstraction for future backends.
- Reuse the existing media delivery path instead of inventing a separate outbound transport.
- Ensure generated files are treated as sensitive transient artifacts and cleaned up correctly.
- Consider Discord file delivery and WhatsApp voice-note/PTT support separately so the first version can ship incrementally.
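One way to treat generated audio as a sensitive transient artifact is to scope the file to a context manager, so it is removed even when delivery fails. The sketch below uses only the Python standard library. The helper name transient_audio_file is hypothetical, and how the path is handed to the existing media delivery path is deliberately left out.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def transient_audio_file(audio: bytes, suffix: str = ".wav") -> Iterator[str]:
    """Write audio to a private temp file and delete it on exit."""
    # mkstemp creates the file readable and writable only by the owner.
    fd, path = tempfile.mkstemp(prefix="tts-", suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(audio)
        yield path
    finally:
        # Remove the artifact even if the caller raised during delivery.
        try:
            os.remove(path)
        except FileNotFoundError:
            pass
```

A caller would wrap the delivery step in `with transient_audio_file(data) as path:` so that no code path can leave the file behind.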
Acceptance criteria
- A user can enable TTS and receive spoken replies without custom tooling.
- At least one built-in provider/backend is documented and tested.
- Generated audio is delivered through supported channels using the existing media pipeline.
- Config and transcripts make it clear when a text reply was synthesized to audio.
- Tests cover provider selection, generation failure paths, and channel delivery behavior.
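The transcript and failure-path criteria could be exercised against a small delivery function like the following sketch. All names (TranscriptEntry, reply, voice_channels) are hypothetical, and falling back to a plain text reply when synthesis fails is an assumption about the desired behavior rather than something this issue specifies.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class TranscriptEntry:
    text: str
    synthesized: bool  # True when the text was also delivered as audio
    note: str = ""     # why audio was skipped, if it was


def reply(
    text: str,
    channel: str,
    synthesize: Callable[[str], bytes],
    voice_channels: Set[str],
    transcript: List[TranscriptEntry],
) -> str:
    """Record the reply and return the delivery mode: "audio" or "text"."""
    if channel not in voice_channels:
        transcript.append(TranscriptEntry(text, False, "channel has no voice delivery"))
        return "text"
    try:
        synthesize(text)
    except Exception as error:
        # Assumed policy: a generation failure degrades to a text reply.
        transcript.append(TranscriptEntry(text, False, f"tts failed: {error}"))
        return "text"
    transcript.append(TranscriptEntry(text, True))
    return "audio"
```

Passing the synthesizer in as a callable keeps the function easy to test: a test can inject one that raises to cover the failure path without any real backend.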