
openclaw-mlx-audio

Chinese documentation

Local TTS plugin for OpenClaw, powered by mlx-audio on Apple Silicon. No API key, no cloud dependency.

MLX and Platform Compatibility

MLX is Apple's machine learning framework, optimized for the unified memory architecture of M-series chips. This plugin depends on MLX and therefore only runs on Apple Silicon Macs (M1 and later).

Intel Macs, Windows, and Linux are not supported.

Requirements

  • macOS, Apple Silicon (M1 and later)
  • Default pythonEnvMode: managed requires no preinstalled Python or Homebrew; the plugin bootstraps uv and a lockfile-managed local Python runtime
  • Optional pythonEnvMode: external uses your existing Python environment via pythonExecutable
  • OpenClaw

Quick Start

Tell your OpenClaw:

Install the @cosformula/openclaw-mlx-audio plugin, configure local TTS, and restart.

OpenClaw will handle plugin installation, config changes, and restart automatically.

For Chinese TTS with Qwen3-TTS:

Install the @cosformula/openclaw-mlx-audio plugin, configure local TTS with Qwen3-TTS-0.6B, and restart.

Manual Installation

1. Install the Plugin

openclaw plugin install @cosformula/openclaw-mlx-audio

Or load from a local path in openclaw.json:

{
  "plugins": {
    "load": { "paths": ["/path/to/openclaw-mlx-audio"] }
  }
}

2. Configure the Plugin

Set options in plugins.entries.openclaw-mlx-audio.config within openclaw.json:

{
  "plugins": {
    "entries": {
      "openclaw-mlx-audio": {
        "enabled": true,
        "config": {}
      }
    }
  }
}

The default configuration uses Kokoro-82M with langCode: auto (Kokoro language auto-detection). For Chinese with Qwen3-TTS, set model:

{
  "config": {
    "model": "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16",
    "workers": 1
  }
}

3. Point OpenClaw TTS to the Local Endpoint

{
  "env": {
    "vars": {
      "OPENAI_TTS_BASE_URL": "http://127.0.0.1:19280/v1"
    }
  },
  "messages": {
    "tts": {
      "provider": "openai",
      "openai": { "apiKey": "local" },
      "timeoutMs": 120000
    }
  }
}
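
OpenClaw issues this request itself, but for manual verification the call the proxy expects can be sketched as below. The model and voice values are placeholders (the proxy overrides the model anyway), and the "local" bearer token matches the apiKey configured above:

```python
import json

# Build the OpenAI-style /v1/audio/speech request that the local proxy
# accepts. Any non-empty API key works, since the proxy runs locally.
def build_speech_request(base_url: str, text: str) -> tuple[str, bytes, dict]:
    url = base_url.rstrip("/") + "/audio/speech"
    body = json.dumps({"model": "tts-1", "input": text, "voice": "alloy"}).encode()
    headers = {"Authorization": "Bearer local", "Content-Type": "application/json"}
    return url, body, headers

url, body, headers = build_speech_request("http://127.0.0.1:19280/v1", "Hello")
# url -> "http://127.0.0.1:19280/v1/audio/speech"
```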

4. Restart OpenClaw

On startup, the plugin will:

  • Start a proxy on the configured port (default 19280)
  • Launch mlx_audio.server on an internal derived port (default 19281)
  • If autoStart: true, warm up the mlx-audio server in the background
  • If autoStart: false, start the server on first /v1/audio/speech, GET /v1/models, tool generate, or /mlx-tts test
  • Require the upstream /v1/models health check to pass within about 10 seconds of startup; otherwise the request returns unavailable and startup is retried on the next request
  • If pythonEnvMode: managed, bootstrap uv into ~/.openclaw/mlx-audio/bin/uv, sync ~/.openclaw/mlx-audio/runtime/ from bundled pyproject.toml and uv.lock, then launch the server via uv run --project ...
  • If pythonEnvMode: external, validate pythonExecutable (Python 3.11-3.13, required modules importable) and use it directly
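
The ~10-second health gate above can be sketched as a polling loop against GET /v1/models; the function and parameter names here are illustrative, not the plugin's actual implementation:

```python
import time

# Poll `probe` (e.g. an HTTP GET against /v1/models returning True on 200)
# until it succeeds or the startup window elapses. On timeout the caller
# reports "unavailable" and startup is retried on the next request.
def wait_healthy(probe, timeout_s=10.0, interval_s=0.5,
                 clock=time.monotonic, sleep=time.sleep) -> bool:
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False
```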

Plugin config is refreshed in the background while the service is running (every ~2 seconds). You can also run /mlx-tts reload (or tool action reload) to apply changes immediately without restarting the OpenClaw gateway.

On first launch, the model will be downloaded (Kokoro-82M is ~345 MB, Qwen3-TTS-0.6B-Base is ~2.3 GB). During startup, /mlx-tts status and tool action status report startup phase and approximate model cache progress (text bar + percentage). If startup times out, the 503 detail returned to OpenClaw includes the same status snapshot. No network connection is needed after the initial download.

Models

The default model is Kokoro-82M. The following models are selected for distinct use cases:

| Model | Description | Languages | Repo |
| --- | --- | --- | --- |
| Kokoro | Fast, multilingual TTS with 54 voice presets | EN, JA, ZH, FR, ES, IT, PT, HI | Kokoro-82M-bf16 |
| Qwen3-TTS Base | Alibaba's multilingual TTS with 3-second voice cloning | ZH, EN, JA, KO, and more | 0.6B-Base-bf16 |
| Qwen3-TTS VoiceDesign | Generates voices from natural language descriptions | ZH, EN, JA, KO, and more | 1.7B-VoiceDesign-bf16 |
| Chatterbox | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, and 10 more | chatterbox-fp16 |

mlx-audio supports additional models (Soprano, Spark-TTS, OuteTTS, CSM, Dia, etc.). See the mlx-audio README for the full list.

Qwen3-TTS Model Variants

| Variant | Description |
| --- | --- |
| Base | Foundation model. Supports voice cloning from 3-second reference audio. Can be fine-tuned. |
| VoiceDesign | Generates voices from natural language descriptions (e.g. "a deep male voice with a British accent"). Does not accept reference audio. |
| CustomVoice | Provides 9 preset voices with instruction-based style control. |

Currently, mlx-community offers MLX-converted versions of 0.6B-Base and 1.7B-VoiceDesign.

Selection Guide

Memory usage reference:

| Model | Disk | RAM (1 worker) |
| --- | --- | --- |
| Kokoro-82M | 345 MB | ~400 MB |
| Qwen3-TTS-0.6B-Base | 2.3 GB | ~1.4 GB |
| Qwen3-TTS-1.7B-VoiceDesign | 4.2 GB | ~3.8 GB |
| Chatterbox | ~3 GB | ~3.5 GB |

For Chatterbox, plan for about 3.5 GB RAM at runtime (1 worker).

  • 8 GB Mac: Kokoro-82M or Qwen3-TTS-0.6B-Base with workers: 1. Models at 1.7B and above will be terminated by the OS due to insufficient memory.
  • 16 GB and above: All models listed above are viable.
  • Chinese: Qwen3-TTS series. Kokoro supports Chinese but produces lower quality output compared to Qwen3-TTS.
  • English: Kokoro-82M has the smallest footprint and lowest latency.
  • Multilingual: Chatterbox covers 16 languages.

Language Codes (Kokoro)

langCode is Kokoro-specific. Qwen3-TTS auto-detects language from input text. Other models ignore this field.

When langCode: auto, detection currently maps only to a, z, or j.

| Code | Language |
| --- | --- |
| a | American English |
| b | British English |
| z | Chinese |
| j | Japanese |
| e | Spanish |
| f | French |
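
As an illustration of the a/z/j mapping, auto-detection can be approximated by a character-class scan. This mirrors the behavior described above, not the plugin's exact heuristic:

```python
# Illustrative sketch of langCode: auto detection (maps only to a, z, or j):
# kana implies Japanese, Han ideographs imply Chinese, and everything else
# falls back to American English.
def detect_lang_code(text: str) -> str:
    for ch in text:
        if 0x3040 <= ord(ch) <= 0x30FF:  # hiragana / katakana
            return "j"
    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:  # CJK unified ideographs
            return "z"
    return "a"
```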

Voices

Kokoro includes 50+ preset voices:

| Category | Examples |
| --- | --- |
| American female | af_heart, af_bella, af_nova, af_sky |
| American male | am_adam, am_echo |
| Chinese female | zf_xiaobei |
| Chinese male | zm_yunxi |
| Japanese | jf_alpha, jm_kumo |

Qwen3-TTS Base clones voices from reference audio (refAudio). VoiceDesign generates voices from natural language descriptions (instruct).

When not specified, models use their default voice.

Configuration Reference

All fields are optional:

| Field | Default | Description |
| --- | --- | --- |
| model | mlx-community/Kokoro-82M-bf16 | HuggingFace model ID |
| port | 19280 | Public OpenAI-compatible TTS endpoint port (OPENAI_TTS_BASE_URL) |
| proxyPort | (unset) | Legacy compatibility field. When set, port is treated as the server port and proxyPort as the public endpoint port |
| workers | 1 | Uvicorn worker count |
| speed | 1.0 | Speech speed multiplier |
| langCode | auto | Kokoro-specific language code. Qwen3-TTS auto-detects from text; other models ignore this field |
| refAudio | (unset) | Reference audio path (voice cloning, Base models only) |
| refText | (unset) | Transcript of the reference audio |
| instruct | (unset) | Voice description text (VoiceDesign models only) |
| temperature | 0.7 | Generation temperature |
| topP | 0.95 | Nucleus sampling parameter (top_p) |
| topK | 40 | Top-k sampling parameter (top_k) |
| repetitionPenalty | 1.0 | Repetition penalty (repetition_penalty) |
| autoStart | true | Start the server with OpenClaw |
| healthCheckIntervalMs | 30000 | Health check interval in ms |
| restartOnCrash | true | Auto-restart on crash |
| maxRestarts | 3 | Max consecutive restart attempts |

Architecture

OpenClaw tts() -> proxy (:port, default 19280) -> mlx_audio.server (:internal, default 19281) -> Apple Silicon GPU
                 ^ injects model, lang_code, speed, temperature, top_p, top_k, repetition_penalty, response_format=mp3

OpenClaw's TTS client uses the OpenAI /v1/audio/speech API. The additional parameters required by mlx-audio (full model ID, language code, etc.) are not part of the OpenAI API specification.

The proxy intercepts requests, injects configured parameters (model, lang_code, speed, temperature, top_p, top_k, repetition_penalty), forces response_format: "mp3", and forwards them to the mlx-audio server. No changes to OpenClaw are required; the proxy presents itself as a standard OpenAI TTS endpoint. For POST /v1/audio/speech, request bodies larger than 1 MB are rejected with HTTP 413. If the downstream client disconnects before completion, the proxy cancels the upstream request immediately.
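
The rewrite step can be sketched roughly as follows. The injected field names match the mlx-audio parameters listed above, but the config dict shape and function name are illustrative, not the plugin's internal code:

```python
import json

MAX_BODY_BYTES = 1 * 1024 * 1024  # bodies over 1 MB get HTTP 413

# Sketch of the proxy's rewrite for POST /v1/audio/speech: merge configured
# parameters into the client body and force mp3 output.
def rewrite_request(raw_body: bytes, config: dict) -> bytes:
    if len(raw_body) > MAX_BODY_BYTES:
        raise ValueError("413: request body larger than 1 MB")
    body = json.loads(raw_body)
    body.update({
        "model": config["model"],
        "lang_code": config.get("langCode", "auto"),
        "speed": config.get("speed", 1.0),
        "temperature": config.get("temperature", 0.7),
        "top_p": config.get("topP", 0.95),
        "top_k": config.get("topK", 40),
        "repetition_penalty": config.get("repetitionPenalty", 1.0),
        "response_format": "mp3",  # always forced, regardless of client value
    })
    return json.dumps(body).encode()
```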

The plugin also manages the server lifecycle:

  • In managed mode, bootstraps a local uv toolchain, syncs dependencies from bundled pyproject.toml and uv.lock, and runs from ~/.openclaw/mlx-audio/runtime/.venv/
  • In external mode, validates the configured pythonExecutable and uses that environment without modifying it
  • Starts the mlx-audio server as a child process
  • Auto-restarts on crash (counter resets after 30s of healthy uptime)
  • Cleans up stale processes on the target port before starting
  • Checks available memory before starting; detects OOM kills
  • Tracks startup phase and approximate model cache progress for /mlx-tts status, tool status, and startup timeout errors
  • Restricts tool output paths to /tmp or ~/.openclaw/mlx-audio/outputs, verifies real paths with async filesystem checks, and rejects symbolic-link segments
  • Streams generated audio directly to disk and rejects payloads larger than 64 MB to prevent memory spikes
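
The crash-handling bullets above (maxRestarts, counter reset after 30 s of healthy uptime) can be sketched as a small policy object; names and structure are illustrative:

```python
import time

# Sketch of the restart policy: allow up to max_restarts consecutive
# restarts, resetting the counter when a run survived 30 s before crashing.
class RestartPolicy:
    def __init__(self, max_restarts: int = 3, healthy_reset_s: float = 30.0,
                 clock=time.monotonic):
        self.max_restarts = max_restarts
        self.healthy_reset_s = healthy_reset_s
        self.clock = clock
        self.restarts = 0
        self.started_at: float | None = None

    def on_start(self) -> None:
        self.started_at = self.clock()

    def on_crash(self) -> bool:
        """Return True if the server should be restarted."""
        if (self.started_at is not None
                and self.clock() - self.started_at >= self.healthy_reset_s):
            self.restarts = 0  # ran long enough to be considered healthy
        self.restarts += 1
        return self.restarts <= self.max_restarts
```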

Troubleshooting

Server crashes 3 times then stops restarting

Check OpenClaw logs for [mlx-audio] Last errors:. Common causes: missing Python dependency, incorrect model name, port conflict. After fixing, modify any config field to reset the crash counter.

SIGKILL

Logs will show ⚠️ Server was killed by SIGKILL (likely out-of-memory). The system terminated the process due to insufficient memory. Use a smaller model or set workers to 1.

Port conflict

The plugin only cleans up stale mlx_audio.server processes on the internal server port. If another app is using the configured port, stop it manually or change port:

# 1) Inspect who owns the public port first (internal server port is +1 in single-port mode)
/usr/sbin/lsof -nP -iTCP:19280 -sTCP:LISTEN

# 2) Only if the command is mlx_audio.server, terminate it gracefully
kill -TERM <mlx_audio_server_pid>

Startup health timeout

If logs show Server did not pass health check within 10000ms, startup did not become healthy in time. The error detail includes the startup phase and approximate model cache progress. Common causes are first-run dependency/model warmup, a wrong model name, or a dependency mismatch in external mode. Retry after fixing the root cause.

Slow first startup

The model is being downloaded. Kokoro-82M is ~345 MB, Qwen3-TTS-0.6B-Base is ~2.3 GB.

License

MIT
