
openclaw-mlx-audio

Chinese documentation

Local TTS plugin for OpenClaw, powered by mlx-audio on Apple Silicon. No API key, no cloud dependency.

MLX and Platform Compatibility

MLX is Apple's machine learning framework, optimized for the unified memory architecture of M-series chips. This plugin depends on MLX and therefore only runs on Apple Silicon Macs (M1 and later).

Intel Macs, Windows, and Linux are not supported.

Requirements

  • macOS, Apple Silicon (M1 and later)
  • Default pythonEnvMode: managed requires no preinstalled Python or Homebrew; the plugin bootstraps uv and a lockfile-managed local Python runtime
  • Optional pythonEnvMode: external uses your existing Python environment via pythonExecutable
  • OpenClaw

Quick Start

Tell your OpenClaw:

Install the @cosformula/openclaw-mlx-audio plugin, configure local TTS, and restart.

OpenClaw will handle plugin installation, config changes, and restart automatically.

For Chinese TTS with Qwen3-TTS:

Install the @cosformula/openclaw-mlx-audio plugin, configure local TTS with Qwen3-TTS-0.6B, and restart.

Manual Installation

1. Install the Plugin

openclaw plugin install @cosformula/openclaw-mlx-audio

Or load from a local path in openclaw.json:

{
  "plugins": {
    "load": { "paths": ["/path/to/openclaw-mlx-audio"] }
  }
}

2. Configure the Plugin

Set options in plugins.entries.openclaw-mlx-audio.config within openclaw.json:

{
  "plugins": {
    "entries": {
      "openclaw-mlx-audio": {
        "enabled": true,
        "config": {}
      }
    }
  }
}

The default configuration uses Kokoro-82M with langCode: auto (Kokoro language auto-detection). For Chinese with Qwen3-TTS, set model:

{
  "config": {
    "model": "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16",
    "workers": 1
  }
}

3. Point OpenClaw TTS to the Local Endpoint

{
  "env": {
    "vars": {
      "OPENAI_TTS_BASE_URL": "http://127.0.0.1:19280/v1"
    }
  },
  "messages": {
    "tts": {
      "provider": "openai",
      "openai": { "apiKey": "local" },
      "timeoutMs": 120000
    }
  }
}
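
OpenClaw issues this request itself, but for manual verification the call the proxy expects can be sketched as below. The model and voice values are placeholders (the proxy overrides the model anyway), and the "local" bearer token matches the apiKey configured above:

```python
import json

# Build the OpenAI-style /v1/audio/speech request that the local proxy
# accepts. Any non-empty API key works, since the proxy runs locally.
def build_speech_request(base_url: str, text: str) -> tuple[str, bytes, dict]:
    url = base_url.rstrip("/") + "/audio/speech"
    body = json.dumps({"model": "tts-1", "input": text, "voice": "alloy"}).encode()
    headers = {"Authorization": "Bearer local", "Content-Type": "application/json"}
    return url, body, headers

url, body, headers = build_speech_request("http://127.0.0.1:19280/v1", "Hello")
# url -> "http://127.0.0.1:19280/v1/audio/speech"
```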

4. Restart OpenClaw

On startup, the plugin will:

  • Start a proxy on the configured port (default 19280)
  • Launch mlx_audio.server on an internal derived port (default 19281)
  • If autoStart: true, warm up the mlx-audio server in the background
  • If autoStart: false, start the server on first /v1/audio/speech, GET /v1/models, tool generate, or /mlx-tts test
  • Require the upstream /v1/models health check to pass within about 10 seconds of startup; otherwise the request returns unavailable and startup is retried on the next request
  • If pythonEnvMode: managed, bootstrap uv into ~/.openclaw/mlx-audio/bin/uv, sync ~/.openclaw/mlx-audio/runtime/ from bundled pyproject.toml and uv.lock, then launch the server via uv run --project ...
  • If pythonEnvMode: external, validate pythonExecutable (Python 3.11-3.13, required modules importable) and use it directly
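
The ~10-second health gate above can be sketched as a polling loop against GET /v1/models; the function and parameter names here are illustrative, not the plugin's actual implementation:

```python
import time

# Poll `probe` (e.g. an HTTP GET against /v1/models returning True on 200)
# until it succeeds or the startup window elapses. On timeout the caller
# reports "unavailable" and startup is retried on the next request.
def wait_healthy(probe, timeout_s=10.0, interval_s=0.5,
                 clock=time.monotonic, sleep=time.sleep) -> bool:
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False
```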

Plugin config is refreshed in the background while the service is running (every ~2 seconds). You can also run /mlx-tts reload (or tool action reload) to apply changes immediately without restarting the OpenClaw gateway.

On first launch, the model will be downloaded (Kokoro-82M is ~345 MB, Qwen3-TTS-0.6B-Base is ~2.3 GB). During startup, /mlx-tts status and tool action status report startup phase and approximate model cache progress (text bar + percentage). If startup times out, the 503 detail returned to OpenClaw includes the same status snapshot. No network connection is needed after the initial download.

Models

The default model is Kokoro-82M. The following models are selected for distinct use cases:

| Model | Description | Languages | Repo |
| --- | --- | --- | --- |
| Kokoro | Fast, multilingual TTS with 54 voice presets | EN, JA, ZH, FR, ES, IT, PT, HI | Kokoro-82M-bf16 |
| Qwen3-TTS Base | Alibaba's multilingual TTS with 3-second voice cloning | ZH, EN, JA, KO, and more | 0.6B-Base-bf16 |
| Qwen3-TTS VoiceDesign | Generates voices from natural language descriptions | ZH, EN, JA, KO, and more | 1.7B-VoiceDesign-bf16 |
| Chatterbox | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, and 10 more | chatterbox-fp16 |

mlx-audio supports additional models (Soprano, Spark-TTS, OuteTTS, CSM, Dia, etc.). See the mlx-audio README for the full list.

Qwen3-TTS Model Variants

| Variant | Description |
| --- | --- |
| Base | Foundation model. Supports voice cloning from 3-second reference audio. Can be fine-tuned. |
| VoiceDesign | Generates voices from natural language descriptions (e.g. "a deep male voice with a British accent"). Does not accept reference audio. |
| CustomVoice | Provides 9 preset voices with instruction-based style control. |

Currently, mlx-community offers MLX-converted versions of 0.6B-Base and 1.7B-VoiceDesign.

Selection Guide

Memory usage reference:

| Model | Disk | RAM (1 worker) |
| --- | --- | --- |
| Kokoro-82M | 345 MB | ~400 MB |
| Qwen3-TTS-0.6B-Base | 2.3 GB | ~1.4 GB |
| Qwen3-TTS-1.7B-VoiceDesign | 4.2 GB | ~3.8 GB |
| Chatterbox | ~3 GB | ~3.5 GB |

For Chatterbox, plan for about 3.5 GB RAM at runtime (1 worker).

  • 8 GB Mac: Kokoro-82M or Qwen3-TTS-0.6B-Base with workers: 1. Models at 1.7B and above will be terminated by the OS due to insufficient memory.
  • 16 GB and above: All models listed above are viable.
  • Chinese: Qwen3-TTS series. Kokoro supports Chinese but produces lower quality output compared to Qwen3-TTS.
  • English: Kokoro-82M has the smallest footprint and lowest latency.
  • Multilingual: Chatterbox covers 16 languages.

Language Codes (Kokoro)

langCode is Kokoro-specific. Qwen3-TTS auto-detects language from input text. Other models ignore this field.

When langCode: auto, detection currently maps only to a, z, or j.

| Code | Language |
| --- | --- |
| a | American English |
| b | British English |
| z | Chinese |
| j | Japanese |
| e | Spanish |
| f | French |
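
As an illustration of the a/z/j mapping, auto-detection can be approximated by a character-class scan. This mirrors the behavior described above, not the plugin's exact heuristic:

```python
# Illustrative sketch of langCode: auto detection (maps only to a, z, or j):
# kana implies Japanese, Han ideographs imply Chinese, and everything else
# falls back to American English.
def detect_lang_code(text: str) -> str:
    for ch in text:
        if 0x3040 <= ord(ch) <= 0x30FF:  # hiragana / katakana
            return "j"
    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:  # CJK unified ideographs
            return "z"
    return "a"
```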

Voices

Kokoro includes 50+ preset voices:

| Category | Examples |
| --- | --- |
| American female | af_heart, af_bella, af_nova, af_sky |
| American male | am_adam, am_echo |
| Chinese female | zf_xiaobei |
| Chinese male | zm_yunxi |
| Japanese | jf_alpha, jm_kumo |

Qwen3-TTS Base clones voices from reference audio (refAudio). VoiceDesign generates voices from natural language descriptions (instruct).

When not specified, models use their default voice.

Configuration Reference

All fields are optional:

| Field | Default | Description |
| --- | --- | --- |
| model | mlx-community/Kokoro-82M-bf16 | HuggingFace model ID |
| port | 19280 | Public OpenAI-compatible TTS endpoint port (OPENAI_TTS_BASE_URL) |
| proxyPort | (unset) | Legacy compatibility field. When set, port is treated as the server port and proxyPort as the public endpoint port |
| workers | 1 | Uvicorn worker count |
| speed | 1.0 | Speech speed multiplier |
| langCode | auto | Kokoro-specific language code. Qwen3-TTS auto-detects from text; other models ignore this field |
| refAudio | (unset) | Reference audio path (voice cloning, Base models only) |
| refText | (unset) | Transcript of the reference audio |
| instruct | (unset) | Voice description text (VoiceDesign models only) |
| temperature | 0.7 | Generation temperature |
| topP | 0.95 | Nucleus sampling parameter (top_p) |
| topK | 40 | Top-k sampling parameter (top_k) |
| repetitionPenalty | 1.0 | Repetition penalty (repetition_penalty) |
| autoStart | true | Start the server with OpenClaw |
| healthCheckIntervalMs | 30000 | Health check interval in ms |
| restartOnCrash | true | Auto-restart on crash |
| maxRestarts | 3 | Max consecutive restart attempts |

Architecture

OpenClaw tts() -> proxy (:port, default 19280) -> mlx_audio.server (:internal, default 19281) -> Apple Silicon GPU
                 ^ injects model, lang_code, speed, temperature, top_p, top_k, repetition_penalty, response_format=mp3

OpenClaw's TTS client uses the OpenAI /v1/audio/speech API. The additional parameters required by mlx-audio (full model ID, language code, etc.) are not part of the OpenAI API specification.

The proxy intercepts requests, injects configured parameters (model, lang_code, speed, temperature, top_p, top_k, repetition_penalty), forces response_format: "mp3", and forwards them to the mlx-audio server. No changes to OpenClaw are required; the proxy presents itself as a standard OpenAI TTS endpoint. For POST /v1/audio/speech, request bodies larger than 1 MB are rejected with HTTP 413. If the downstream client disconnects before completion, the proxy cancels the upstream request immediately.
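
The rewrite step can be sketched roughly as follows. The injected field names match the mlx-audio parameters listed above, but the config dict shape and function name are illustrative, not the plugin's internal code:

```python
import json

MAX_BODY_BYTES = 1 * 1024 * 1024  # bodies over 1 MB get HTTP 413

# Sketch of the proxy's rewrite for POST /v1/audio/speech: merge configured
# parameters into the client body and force mp3 output.
def rewrite_request(raw_body: bytes, config: dict) -> bytes:
    if len(raw_body) > MAX_BODY_BYTES:
        raise ValueError("413: request body larger than 1 MB")
    body = json.loads(raw_body)
    body.update({
        "model": config["model"],
        "lang_code": config.get("langCode", "auto"),
        "speed": config.get("speed", 1.0),
        "temperature": config.get("temperature", 0.7),
        "top_p": config.get("topP", 0.95),
        "top_k": config.get("topK", 40),
        "repetition_penalty": config.get("repetitionPenalty", 1.0),
        "response_format": "mp3",  # always forced, regardless of client value
    })
    return json.dumps(body).encode()
```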

The plugin also manages the server lifecycle:

  • In managed mode, bootstraps a local uv toolchain, syncs dependencies from bundled pyproject.toml and uv.lock, and runs from ~/.openclaw/mlx-audio/runtime/.venv/
  • In external mode, validates the configured pythonExecutable and uses that environment without modifying it
  • Starts the mlx-audio server as a child process
  • Auto-restarts on crash (counter resets after 30s of healthy uptime)
  • Cleans up stale processes on the target port before starting
  • Checks available memory before starting; detects OOM kills
  • Tracks startup phase and approximate model cache progress for /mlx-tts status, tool status, and startup timeout errors
  • Restricts tool output paths to /tmp or ~/.openclaw/mlx-audio/outputs, verifies real paths with async filesystem checks, and rejects symbolic-link segments
  • Streams generated audio directly to disk and rejects payloads larger than 64 MB to prevent memory spikes
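
The crash-handling bullets above (maxRestarts, counter reset after 30 s of healthy uptime) can be sketched as a small policy object; names and structure are illustrative:

```python
import time

# Sketch of the restart policy: allow up to max_restarts consecutive
# restarts, resetting the counter when a run survived 30 s before crashing.
class RestartPolicy:
    def __init__(self, max_restarts: int = 3, healthy_reset_s: float = 30.0,
                 clock=time.monotonic):
        self.max_restarts = max_restarts
        self.healthy_reset_s = healthy_reset_s
        self.clock = clock
        self.restarts = 0
        self.started_at: float | None = None

    def on_start(self) -> None:
        self.started_at = self.clock()

    def on_crash(self) -> bool:
        """Return True if the server should be restarted."""
        if (self.started_at is not None
                and self.clock() - self.started_at >= self.healthy_reset_s):
            self.restarts = 0  # ran long enough to be considered healthy
        self.restarts += 1
        return self.restarts <= self.max_restarts
```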

Troubleshooting

Server crashes 3 times then stops restarting

Check OpenClaw logs for [mlx-audio] Last errors:. Common causes: missing Python dependency, incorrect model name, port conflict. After fixing, modify any config field to reset the crash counter.

SIGKILL

Logs will show ⚠️ Server was killed by SIGKILL (likely out-of-memory). The system terminated the process due to insufficient memory. Use a smaller model or set workers to 1.

Port conflict

The plugin only cleans up stale mlx_audio.server processes on the internal server port. If another app is using the configured port, stop it manually or change port:

# 1) Inspect who owns the public port first (internal server port is +1 in single-port mode)
/usr/sbin/lsof -nP -iTCP:19280 -sTCP:LISTEN

# 2) Only if the command is mlx_audio.server, terminate it gracefully
kill -TERM <mlx_audio_server_pid>

Startup health timeout

If logs show Server did not pass health check within 10000ms, startup did not become healthy in time. The error detail includes the startup phase and approximate model cache progress. Common causes are first-run dependency/model warmup, a wrong model name, or a dependency mismatch in external mode. Retry after fixing the root cause.

Slow first startup

The model is being downloaded. Kokoro-82M is ~345 MB, Qwen3-TTS-0.6B-Base is ~2.3 GB.

License

MIT
