Local TTS plugin for OpenClaw, powered by mlx-audio on Apple Silicon.
MLX is Apple's machine learning framework, optimized for the unified memory architecture of M-series chips. This plugin depends on MLX and therefore only runs on Apple Silicon Macs (M1 and later).
Intel Macs, Windows, and Linux are not supported. Alternatives for those platforms:
- openedai-speech (self-hosted, requires NVIDIA GPU)
- Chatterbox-TTS-Server (self-hosted, also requires an NVIDIA GPU)
- OpenClaw's built-in Edge TTS (cloud-based, no GPU required)
- macOS, Apple Silicon (M1 and later)
- Default `pythonEnvMode: managed`: requires no preinstalled Python or Homebrew; the plugin bootstraps `uv` and a lockfile-managed local Python runtime
- Optional `pythonEnvMode: external`: uses your existing Python environment via `pythonExecutable`
- OpenClaw

Tell your OpenClaw:
Install the @cosformula/openclaw-mlx-audio plugin, configure local TTS, and restart.
OpenClaw will handle plugin installation, config changes, and restart automatically.
For Chinese TTS with Qwen3-TTS:
Install the @cosformula/openclaw-mlx-audio plugin, configure local TTS with Qwen3-TTS-0.6B, and restart.
```
openclaw plugin install @cosformula/openclaw-mlx-audio
```

Or load from a local path in `openclaw.json`:

```json
{
  "plugins": {
    "load": { "paths": ["/path/to/openclaw-mlx-audio"] }
  }
}
```

Set options in `plugins.entries.openclaw-mlx-audio.config` within `openclaw.json`:
```json
{
  "plugins": {
    "entries": {
      "openclaw-mlx-audio": {
        "enabled": true,
        "config": {}
      }
    }
  }
}
```

The default configuration uses Kokoro-82M with `langCode: auto` (Kokoro language auto-detection). For Chinese with Qwen3-TTS, set `model`:
```json
{
  "config": {
    "model": "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16",
    "workers": 1
  }
}
```

Point OpenClaw's TTS at the local endpoint in `openclaw.json`:

```json
{
  "env": {
    "vars": {
      "OPENAI_TTS_BASE_URL": "http://127.0.0.1:19280/v1"
    }
  },
  "messages": {
    "tts": {
      "provider": "openai",
      "openai": { "apiKey": "local" },
      "timeoutMs": 120000
    }
  }
}
```

On startup, the plugin will:
- Start a proxy on the configured `port` (default `19280`)
- Launch `mlx_audio.server` on an internal derived port (default `19281`)
- If `autoStart: true`, warm up the mlx-audio server in the background
- If `autoStart: false`, start the server on the first `/v1/audio/speech` request, `GET /v1/models`, tool `generate`, or `/mlx-tts test`
- Require upstream `/v1/models` health to pass within about 10 seconds during startup; otherwise the request returns unavailable and startup is retried on the next request
- If `pythonEnvMode: managed`, bootstrap `uv` into `~/.openclaw/mlx-audio/bin/uv`, sync `~/.openclaw/mlx-audio/runtime/` from the bundled `pyproject.toml` and `uv.lock`, then launch the server via `uv run --project ...`
- If `pythonEnvMode: external`, validate `pythonExecutable` (Python 3.11-3.13, required modules importable) and use it directly
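The port derivation above can be sketched as follows. This is illustrative: `resolve_ports` is a made-up name, and the `+1` rule and the legacy `proxyPort` behavior are inferred from the documented defaults and the `proxyPort` field description, not taken from the plugin source.

```python
def resolve_ports(config: dict) -> tuple[int, int]:
    """Return (public_endpoint_port, internal_server_port).

    Sketch of the documented behavior: in single-port mode the internal
    mlx_audio.server port is the public port + 1; the legacy `proxyPort`
    field flips the meaning of `port` to "server port".
    """
    if "proxyPort" in config:
        # Legacy mode: `port` is the server port, `proxyPort` is public.
        return config["proxyPort"], config.get("port", 19281)
    public = config.get("port", 19280)
    return public, public + 1

# Defaults: public proxy on 19280, internal server on 19281.
print(resolve_ports({}))               # (19280, 19281)
print(resolve_ports({"port": 19300}))  # (19300, 19301)
```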
Plugin config is refreshed in the background while the service is running (roughly every 2 seconds). You can also run `/mlx-tts reload` (or the tool action `reload`) to apply changes immediately without restarting the OpenClaw gateway.
On first launch, the model will be downloaded (Kokoro-82M is ~345 MB, Qwen3-TTS-0.6B-Base is ~2.3 GB). During startup, `/mlx-tts status` and the tool action `status` report the startup phase and approximate model cache progress (text bar + percentage). If startup times out, the 503 detail returned to OpenClaw includes the same status snapshot. No network connection is needed after the initial download.
The default model is Kokoro-82M. The following models are selected for distinct use cases:
| Model | Description | Languages | Repo |
|---|---|---|---|
| Kokoro | Fast, multilingual TTS with 54 voice presets | EN, JA, ZH, FR, ES, IT, PT, HI | Kokoro-82M-bf16 |
| Qwen3-TTS Base | Alibaba's multilingual TTS with 3-second voice cloning | ZH, EN, JA, KO, and more | 0.6B-Base-bf16 |
| Qwen3-TTS VoiceDesign | Generates voices from natural language descriptions | ZH, EN, JA, KO, and more | 1.7B-VoiceDesign-bf16 |
| Chatterbox | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, and 10 more | chatterbox-fp16 |
mlx-audio supports additional models (Soprano, Spark-TTS, OuteTTS, CSM, Dia, etc.). See the mlx-audio README for the full list.
| Variant | Description |
|---|---|
| Base | Foundation model. Supports voice cloning from 3-second reference audio. Can be fine-tuned. |
| VoiceDesign | Generates voices from natural language descriptions (e.g. "a deep male voice with a British accent"). Does not accept reference audio. |
| CustomVoice | Provides 9 preset voices with instruction-based style control. |
Currently, mlx-community offers MLX-converted versions of 0.6B-Base and 1.7B-VoiceDesign.
Memory usage reference:
| Model | Disk | RAM (1 worker) |
|---|---|---|
| Kokoro-82M | 345 MB | ~400 MB |
| Qwen3-TTS-0.6B-Base | 2.3 GB | ~1.4 GB |
| Qwen3-TTS-1.7B-VoiceDesign | 4.2 GB | ~3.8 GB |
| Chatterbox | ~3 GB | ~3.5 GB |
For Chatterbox, plan for about 3.5 GB RAM at runtime (1 worker).
- 8 GB Mac: Kokoro-82M or Qwen3-TTS-0.6B-Base with `workers: 1`. Models at 1.7B and above will be terminated by the OS due to insufficient memory.
- 16 GB and above: all models listed above are viable.
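The plugin checks available memory before launching the server (see the lifecycle section below); a toy pre-flight check against the RAM column above might look like this. The function name, the 20% headroom, and the 1.7B model ID (extrapolated from the 0.6B naming pattern) are illustrative assumptions, not the plugin's actual logic.

```python
# Approximate runtime RAM per model at workers: 1, from the table above (MB).
MODEL_RAM_MB = {
    "mlx-community/Kokoro-82M-bf16": 400,
    "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16": 1400,
    "mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16": 3800,  # assumed ID
}

def enough_memory(model: str, available_mb: int, headroom: float = 1.2) -> bool:
    """Rough check: require the model's RAM estimate plus 20% headroom."""
    need = MODEL_RAM_MB.get(model, 4000)  # unknown models: assume the worst
    return available_mb >= need * headroom

print(enough_memory("mlx-community/Kokoro-82M-bf16", 2048))                      # True
print(enough_memory("mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16", 2048)) # False
```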
- Chinese: Qwen3-TTS series. Kokoro supports Chinese but produces lower quality output compared to Qwen3-TTS.
- English: Kokoro-82M has the smallest footprint and lowest latency.
- Multilingual: Chatterbox covers 16 languages.
`langCode` is Kokoro-specific. Qwen3-TTS auto-detects language from input text. Other models ignore this field.
When `langCode: auto`, detection currently maps only to `a`, `z`, or `j`.
| Code | Language |
|---|---|
| `a` | American English |
| `b` | British English |
| `z` | Chinese |
| `j` | Japanese |
| `e` | Spanish |
| `f` | French |
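Since `auto` detection maps only to `a`, `z`, or `j`, a minimal Unicode-range heuristic along those lines could look like this. This is a sketch for intuition; the plugin's real detector may use a different method.

```python
def auto_lang_code(text: str) -> str:
    """Map input text to a Kokoro lang code: 'j' (Japanese), 'z' (Chinese),
    or 'a' (American English, the fallback). Illustrative heuristic only."""
    for ch in text:
        if 0x3040 <= ord(ch) <= 0x30FF:   # hiragana / katakana -> Japanese
            return "j"
    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:   # CJK ideographs, no kana seen -> Chinese
            return "z"
    return "a"

print(auto_lang_code("こんにちは"))   # j
print(auto_lang_code("你好世界"))     # z
print(auto_lang_code("Hello there"))  # a
```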
Kokoro includes 50+ preset voices:
| Category | Examples |
|---|---|
| American female | af_heart, af_bella, af_nova, af_sky |
| American male | am_adam, am_echo |
| Chinese female | zf_xiaobei |
| Chinese male | zm_yunxi |
| Japanese | jf_alpha, jm_kumo |
Qwen3-TTS Base clones voices from reference audio (`refAudio`). VoiceDesign generates voices from natural language descriptions (`instruct`).
When not specified, models use their default voice.
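For example, a Base-model voice-cloning config might look like this (the audio path and transcript are placeholders):

```json
{
  "config": {
    "model": "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16",
    "refAudio": "/Users/me/voices/reference.wav",
    "refText": "Transcript of the three-second reference clip."
  }
}
```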
All fields are optional:
| Field | Default | Description |
|---|---|---|
| `model` | `mlx-community/Kokoro-82M-bf16` | HuggingFace model ID |
| `port` | `19280` | Public OpenAI-compatible TTS endpoint port (`OPENAI_TTS_BASE_URL`) |
| `proxyPort` | | Legacy compatibility field. When set, `port` is treated as the server port and `proxyPort` as the public endpoint port |
| `workers` | `1` | Uvicorn worker count |
| `speed` | `1.0` | Speech speed multiplier |
| `langCode` | `auto` | Kokoro-specific language code. Qwen3-TTS auto-detects from text. Other models ignore this field |
| `refAudio` | | Reference audio path (voice cloning, Base models only) |
| `refText` | | Transcript of the reference audio |
| `instruct` | | Voice description text (VoiceDesign models only) |
| `temperature` | `0.7` | Generation temperature |
| `topP` | `0.95` | Nucleus sampling parameter (`top_p`) |
| `topK` | `40` | Top-k sampling parameter (`top_k`) |
| `repetitionPenalty` | `1.0` | Repetition penalty (`repetition_penalty`) |
| `autoStart` | `true` | Start with OpenClaw |
| `healthCheckIntervalMs` | `30000` | Health check interval in ms |
| `restartOnCrash` | `true` | Auto-restart on crash |
| `maxRestarts` | `3` | Max consecutive restart attempts |
```
OpenClaw tts() -> proxy (:port, default 19280) -> mlx_audio.server (:internal, default 19281) -> Apple Silicon GPU
                    ^ injects model, lang_code, speed, temperature, top_p, top_k,
                      repetition_penalty, response_format=mp3
```
OpenClaw's TTS client uses the OpenAI /v1/audio/speech API. The additional parameters required by mlx-audio (full model ID, language code, etc.) are not part of the OpenAI API specification.
The proxy intercepts requests, injects the configured parameters (model, lang_code, speed, temperature, top_p, top_k, repetition_penalty), forces response_format: "mp3", and forwards them to the mlx-audio server. No changes to OpenClaw are required; the proxy presents itself as a standard OpenAI TTS endpoint.
For POST /v1/audio/speech, request bodies larger than 1 MB are rejected with HTTP 413.
If the downstream client disconnects before completion, the proxy cancels the upstream request immediately.
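Put together, the proxy's handling of a `POST /v1/audio/speech` body can be sketched like this. The function name and return convention are illustrative; only the injected parameter names, the config field names, and the 1 MB / 413 behavior come from the description above.

```python
import json

MAX_BODY_BYTES = 1 * 1024 * 1024  # bodies over 1 MB are rejected with HTTP 413

def prepare_upstream_body(raw: bytes, cfg: dict):
    """Validate an incoming OpenAI-style /v1/audio/speech body and inject
    the mlx-audio parameters from plugin config. Returns (status, body)."""
    if len(raw) > MAX_BODY_BYTES:
        return 413, None  # request too large
    body = json.loads(raw)
    body.update({
        "model": cfg.get("model", "mlx-community/Kokoro-82M-bf16"),
        "lang_code": cfg.get("langCode", "auto"),
        "speed": cfg.get("speed", 1.0),
        "temperature": cfg.get("temperature", 0.7),
        "top_p": cfg.get("topP", 0.95),
        "top_k": cfg.get("topK", 40),
        "repetition_penalty": cfg.get("repetitionPenalty", 1.0),
        "response_format": "mp3",  # the proxy always forces mp3
    })
    return 200, body

status, body = prepare_upstream_body(b'{"input": "hello", "voice": "af_heart"}', {})
print(status, body["response_format"])  # 200 mp3
```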
The plugin also manages the server lifecycle:
- In `managed` mode, bootstraps a local `uv` toolchain, syncs dependencies from the bundled `pyproject.toml` and `uv.lock`, and runs from `~/.openclaw/mlx-audio/runtime/.venv/`
- In `external` mode, validates the configured `pythonExecutable` and uses that environment without modifying it
- Starts the mlx-audio server as a child process
- Auto-restarts on crash (the counter resets after 30s of healthy uptime)
- Cleans up stale processes on the target port before starting
- Checks available memory before starting; detects OOM kills
- Tracks the startup phase and approximate model cache progress for `/mlx-tts status`, tool `status`, and startup timeout errors
- Restricts tool output paths to `/tmp` or `~/.openclaw/mlx-audio/outputs`, verifies real paths with async filesystem checks, and rejects symbolic-link segments
- Streams generated audio directly to disk and rejects payloads larger than 64 MB to prevent memory spikes
Server crashes 3 times then stops restarting
Check the OpenClaw logs for `[mlx-audio] Last errors:`. Common causes: a missing Python dependency, an incorrect model name, or a port conflict. After fixing the cause, modify any config field to reset the crash counter.
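The restart policy described above (up to `maxRestarts` consecutive restarts, counter reset after 30s of healthy uptime) behaves roughly like this simulation; `RestartPolicy` is a made-up name for illustration.

```python
class RestartPolicy:
    """Sketch of the documented policy: allow up to `max_restarts` consecutive
    restarts; a stretch of >= 30s healthy uptime resets the counter."""
    def __init__(self, max_restarts: int = 3, reset_after_s: float = 30.0):
        self.max_restarts = max_restarts
        self.reset_after_s = reset_after_s
        self.consecutive = 0

    def on_crash(self, healthy_uptime_s: float) -> bool:
        """Return True if the server should be restarted after this crash."""
        if healthy_uptime_s >= self.reset_after_s:
            self.consecutive = 0  # stable run: forgive earlier crashes
        self.consecutive += 1
        return self.consecutive <= self.max_restarts

policy = RestartPolicy()
print([policy.on_crash(1.0) for _ in range(4)])  # [True, True, True, False]
print(policy.on_crash(60.0))                     # True (counter was reset)
```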
SIGKILL
Logs will show `⚠️ Server was killed by SIGKILL (likely out-of-memory)`. The system terminated the process because it ran out of memory. Use a smaller model or set `workers` to `1`.
Port conflict
The plugin only cleans up stale `mlx_audio.server` processes on the internal server port. If another app is using the configured port, stop it manually or change `port`:

```
# 1) Inspect who owns the public port first (internal server port is +1 in single-port mode)
/usr/sbin/lsof -nP -iTCP:19280 -sTCP:LISTEN
# 2) Only if the command is mlx_audio.server, terminate it gracefully
kill -TERM <mlx_audio_server_pid>
```

Startup health timeout
If the logs show `Server did not pass health check within 10000ms`, the server did not become healthy in time. The error detail includes the startup phase and approximate model cache progress. Common causes are first-run dependency/model warmup, a wrong model name, or a dependency mismatch in `external` mode. Retry after fixing the root cause.
Slow first startup
The model is being downloaded. Kokoro-82M is ~345 MB, Qwen3-TTS-0.6B-Base is ~2.3 GB.
MIT