A production-ready PyTorch → CoreML conversion pipeline for Kokoro-82M, enabling on-device text-to-speech on Apple Silicon with Apple Neural Engine acceleration.
Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
This repository contains both the Python exporters and the CoreML models used by a macOS app (TalkToMe). We converged on a simple, robust runtime architecture designed to avoid CoreML dynamic-shape pitfalls:
- Two-stage pipeline (Duration → Decoder-only Synth) with fixed shapes and Swift-side alignment.
- Duration model: fixed 128-token input/attention; outputs `d`, `t_en`, `s`, and `ref_s_out`.
- Decoder-only synthesizer (3s bucket): fixed inputs `asr [1,512,72]`, `F0_pred [1,144]`, `N_pred [1,144]`, `ref_s [1,256]`; output `waveform [1,43200]`.
- Swift builds the alignment `pred_aln_trg [tokens, frames]`, computes `asr = t_en @ pred_aln_trg`, derives simple `F0`/`N` curves, normalizes channels, and calls CoreML.
- Full synthesizer models are intentionally excluded from the app bundle to prevent E5RT dynamic-shape warnings; we prefer the decoder-only variant.
Why this works: avoiding dynamic shapes and doing pre-decoder math in Swift prevents Espresso/E5RT errors like “Invalid blob shape … non_zero_0_classic_cpu - [?, 3]”.
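The Swift-side pre-decoder math is small enough to sketch in NumPy: a hard alignment built from per-token durations, followed by the `asr = t_en @ pred_aln_trg` matmul. Function and variable values below are illustrative, not taken from the repo:

```python
import numpy as np

def build_alignment(pred_dur, total_frames=72):
    """Hard alignment: each token gets a contiguous span of frames,
    cropped so the spans never exceed the bucket's frame budget."""
    aln = np.zeros((len(pred_dur), total_frames), dtype=np.float32)
    t = 0
    for i, d in enumerate(pred_dur):
        span = min(int(d), total_frames - t)
        aln[i, t:t + span] = 1.0
        t += span
    return aln

pred_dur = [10, 20, 30, 40]                 # e.g. rounded duration-model output
aln = build_alignment(pred_dur)             # [tokens, 72]
t_en = np.random.rand(512, len(pred_dur)).astype(np.float32)
asr = (t_en @ aln)[None]                    # [1, 512, 72] decoder input
print(aln.sum(axis=1))  # [10. 20. 30. 12.] -- last span cropped to fit
print(asr.shape)        # (1, 512, 72)
```

Keeping this loop on the client means the CoreML graph never sees a data-dependent shape.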
- Duration model exported with fixed shapes (`input_ids`/`attention_mask`: `[1,128]`).
- Decoder-only 3s model exported with strictly static shapes; no runtime `.expand()`/data-dependent ops.
- Swift-side shape enforcement (pad/crop) and channel normalization for decoder inputs.
- Persistent CoreML compilation cache in Application Support; automatic recompilation on source change.
- Developer flags to dump waveforms/spectrograms, select compute units, and prefer decoder-only.
- Optional Python tokenizer bridge to feed real Kokoro phoneme IDs during bring-up.
- Tokenization
  - Preferred: Python bridge (`kokoro-coreml/dev_tokenize.py`) returns true Kokoro phoneme IDs via `KPipeline`. The script now prints pure JSON on stdout (warnings redirected to stderr) so the app can parse it reliably.
  - Fallback: a lightweight Swift normalizer maps characters to IDs using the `checkpoints/config.json` vocab; usable, but can produce noisy audio.
- Duration (CoreML)
  - Inputs: `input_ids [1,128]` (int32), `attention_mask [1,128]` (int32), `ref_s [1,256]` (float32), `speed [1]` (float32).
  - Outputs: `pred_dur`, `d`, `t_en`, `s`, `ref_s_out` (all float32). We use `pred_dur` to build the alignment matrix in Swift.
- Alignment (Swift)
  - Build `pred_aln_trg [tokens, frames]` with contiguous per-token spans whose total equals the requested bucket (e.g., 3s → 72 frames at 24 kHz with 40 fps text features; we use 72 here for the decoder's ASR time).
- Decoder‑only Synth (CoreML)
  - Shapes enforced before predict:
    - `asr [1,512,72]` (from `t_en @ pred_aln_trg`, channel-padded/cropped, per-channel min-max normalized)
    - `F0_pred [1,144]`, `N_pred [1,144]` (2× ASR time)
    - `ref_s [1,256]` (baseline voice embedding; the decoder slices the first 128 dims internally)
  - Output: `waveform [1,43200]` for 3s at 24 kHz. We convert to PCM and play.
- Audio I/O
- Playback is 24 kHz mono. A developer flag can save the synthesized PCM to WAV for analysis.
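The decoder-input preparation (channel pad/crop plus per-channel min-max normalization) can be sketched in NumPy. The function name and the `eps` guard are illustrative; the app performs the equivalent in Swift:

```python
import numpy as np

def enforce_asr(feats, channels=512, frames=72, eps=1e-6):
    """Pad/crop features to [1, 512, 72] and min-max normalize each channel.
    feats: [C, T] array, e.g. the result of t_en @ pred_aln_trg."""
    out = np.zeros((channels, frames), dtype=np.float32)
    c = min(feats.shape[0], channels)
    t = min(feats.shape[1], frames)
    out[:c, :t] = feats[:c, :t]               # crop or zero-pad to target shape
    lo = out.min(axis=1, keepdims=True)
    hi = out.max(axis=1, keepdims=True)
    out = (out - lo) / np.maximum(hi - lo, eps)  # per-channel min-max, safe on flat rows
    return out[None]                           # add batch dim: [1, 512, 72]

asr = enforce_asr(np.random.rand(300, 60).astype(np.float32))
print(asr.shape)  # (1, 512, 72)
```

Zero-padded channels normalize to all zeros rather than dividing by zero, which keeps the decoder input well-behaved.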
Core model/runtime:
- `com.talktome.coreml.preferDecoderOnly` (Bool, default true): prefer decoder-only models.
- `com.talktome.coreml.no5sDecoder` (Bool, default true): avoid the 5s decoder-only model during bring-up.
- `com.talktome.coreml.computeUnits` (String): one of `all|cpuAndNeuralEngine|cpuAndGPU|cpuOnly`. Decoder-only defaults to `cpuAndGPU` unless explicitly overridden.
- `com.talktome.coreml.dumpSpectrograms` (Bool): dump CSV spectrograms for ASR features.
- `com.talktome.coreml.dumpWaveforms` (Bool): write synthesized WAVs to the system temp directory.
Tokenizer bridge:
- `com.talktome.dev.usePythonTokenizer` (Bool): prefer the Python tokenizer when enabled.
- `com.talktome.dev.tokenizerScript` (String): absolute path to `dev_tokenize.py`.
- `com.talktome.dev.configPath` (String): absolute path to `checkpoints/config.json`.
- `com.talktome.dev.tokenizerPython` (String): absolute path to the Python interpreter (venv) used to run the script.
Other dev toggles in the app (for audio fallbacks/UI) exist but are not needed for CoreML bring-up.
# Prefer decoder-only; avoid 5s; target CPU+GPU for decoder-only
defaults write com.transcendence.talktome com.talktome.coreml.preferDecoderOnly -bool YES
defaults write com.transcendence.talktome com.talktome.coreml.no5sDecoder -bool YES
defaults write com.transcendence.talktome com.talktome.coreml.computeUnits -string cpuAndGPU
# Enable WAV dumps for analysis
defaults write com.transcendence.talktome com.talktome.coreml.dumpWaveforms -bool YES
# Enable Python tokenizer bridge
defaults write com.transcendence.talktome com.talktome.dev.tokenizerPython -string "$HOME/Documents/GitHub/talktome/.venv-coreml/bin/python"
defaults write com.transcendence.talktome com.talktome.dev.tokenizerScript -string "$HOME/Documents/GitHub/talktome/kokoro-coreml/dev_tokenize.py"
defaults write com.transcendence.talktome com.talktome.dev.configPath -string "$HOME/Documents/GitHub/talktome/kokoro-coreml/checkpoints/config.json"
defaults write com.transcendence.talktome com.talktome.dev.usePythonTokenizer -bool YES

Note: In Debug, some apps use an ephemeral bundle id; mirror the same defaults to `com.Transcendence.talktome` if needed.
- Duration: fixed token length 128; attention mask 128; `ref_s` 256; `speed` scalar.
- Decoder-only 3s: `asr [1,512,72]`, `F0_pred [1,144]`, `N_pred [1,144]`, `ref_s [1,256]`; `waveform [1,43200]`.
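The fixed 128-token contract can be met with a small pad/crop helper. A NumPy sketch — the helper name `pad_tokens` is illustrative; the app does the equivalent in Swift:

```python
import numpy as np

MAX_TOKENS = 128  # fixed duration-model sequence length

def pad_tokens(token_ids, max_len=MAX_TOKENS):
    """Pad (or crop) phoneme IDs to the fixed length and build the mask."""
    ids = np.zeros(max_len, dtype=np.int32)
    mask = np.zeros(max_len, dtype=np.int32)
    n = min(len(token_ids), max_len)
    ids[:n] = token_ids[:n]   # real tokens up front, zeros after
    mask[:n] = 1              # attention covers only the real tokens
    return ids[None, :], mask[None, :]  # [1,128] each, as the model expects

ids, mask = pad_tokens([5, 12, 7, 3])
print(ids.shape, mask.shape, int(mask.sum()))  # (1, 128) (1, 128) 4
```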
Symptoms and fixes:
- E5RT warnings about dynamic shapes (`non_zero_0_classic_cpu …`): remove full synthesizer models from the bundle and use the decoder-only fixed-shape model; clear `~/Library/Application Support/TalkToMe/CoreMLCompiled` and Xcode DerivedData.
- Beep-only output: "Reliable Mode" or "simpleMode" may be enabled; disable those and ensure the CoreML path is taken.
- Noise output: typically caused by character IDs instead of phoneme IDs, or an all-zero `ref_s`. Enable the Python tokenizer bridge and provide a real `config.json`. Consider wiring real voice embeddings into `ref_s`.
- No audio: verify `waveform [1,43200]` is returned and `makePCMBuffer` is fed the correct sample rate. A fallback infers the rate as `waveform.count / seconds` when converting to PCM.
- Enable dumps: `defaults write com.transcendence.talktome com.talktome.coreml.dumpWaveforms -bool YES`
- Run a short TTS. The app prints: `💾 Wrote waveform dump: /var/folders/.../tts_3s_YYYYMMDD_HHMMSS_SSS.wav`
- Open or copy the path to the Desktop for inspection (Audacity, Python, etc.).
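A dumped WAV can be sanity-checked with the Python standard library. The file below is synthetic (silence at 24 kHz); in practice you would pass whatever path the app prints:

```python
import os, tempfile, wave

def wav_info(path):
    """Report sample rate, sample count, and duration of a WAV file."""
    with wave.open(path, "rb") as w:
        rate, frames = w.getframerate(), w.getnframes()
        return rate, frames, frames / rate

# Demo on a synthetic 24 kHz mono file standing in for the app's dump
path = os.path.join(tempfile.gettempdir(), "talktome_demo.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit PCM
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 43200)

print(wav_info(path))  # (24000, 43200, 1.8)
```

A mismatch between the reported rate and 24000, or a duration that doesn't match the bucket, points at the PCM conversion step.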
- `coreml/kokoro_duration.mlpackage` – fixed-input duration model
- `coreml/kokoro_decoder_only_3s.mlpackage` – fixed-shape decoder-only synthesizer
- `dev_tokenize.py` – Python tokenizer bridge (now prints pure JSON ids)
- `docs/learnings.md` – running log of issues and mitigations (E5RT, BNNS, shapes)
CoreML’s E5 runtime is sensitive to dynamic/data-dependent operations. By doing alignment, simple F0/N derivation, and shape enforcement in Swift—and compiling a small, static decoder—we:
- Avoid dynamic‑shape MIL graphs.
- Keep CoreML predict calls fast and robust.
- Retain quality headroom via better tokenization and voice embeddings.
This is the “simplest thing that works”: small, understandable components with explicit shapes. It trades a little flexibility for a lot of stability on Apple’s stack.
- ✅ Successfully exported Kokoro-82M to CoreML with a novel two-stage architecture
- ⚡ 30-50% speedup through Apple Neural Engine (ANE) optimization for the vocoder
- 🏗️ Production-ready pipeline with bucketing strategy for variable-length synthesis
- 📱 iOS/macOS compatible models ready for deployment
# 1. Create fresh virtual environment
python3 -m venv .venv-coreml
source .venv-coreml/bin/activate
# 2. Install dependencies (exact versions for stability)
pip install --upgrade pip
pip install torch==2.6.0 coremltools==8.3.0 safetensors numpy==1.26.4 soundfile
# 3. Clone repository and navigate
git clone https://github.com/yourusername/kokoro-coreml.git
cd kokoro-coreml
# 4. Export to CoreML with HAR decoder buckets
python examples/export_coreml.py --output_dir coreml
# Expected output:
# ✅ kokoro_duration.mlpackage (dynamic text length)
# ✅ KokoroDecoder_HAR_3s.mlpackage (3-second audio synthesis)
# ✅ KokoroDecoder_HAR_10s.mlpackage (10-second audio synthesis)
# ✅ KokoroDecoder_HAR_45s.mlpackage (45-second audio synthesis)

# Quick test of exported models
python -c "
import coremltools as ct
duration_model = ct.models.MLModel('coreml/kokoro_duration.mlpackage')
decoder_model = ct.models.MLModel('coreml/KokoroDecoder_HAR_3s.mlpackage')
print(f'✅ Duration model: {duration_model}')
print(f'✅ Decoder model: {decoder_model}')
print('Models loaded successfully!')
"

The CoreML conversion uses a two-stage pipeline to handle Kokoro's complex dynamic operations:
- Input: Variable-length text sequences with `ct.RangeDim`
- Process: Transformer and LSTM layers for duration prediction
- Output: Phoneme durations and intermediate features
- Compute: CPU/GPU (LSTM layers don't support ANE)
- Input: Fixed-size features from duration model + alignment matrix
- Process: Vocoder synthesis using iSTFTNet architecture
- Output: High-quality audio waveforms at 24kHz
- Compute: Apple Neural Engine for maximum performance
Available HAR decoder models for different audio lengths:
- 3s model: Fast synthesis for immediate playback (TTFB optimization)
- 10s model: Balanced performance for medium-length content
- 45s model: Long-form synthesis for full paragraph processing
- Adaptive selection: Client chooses optimal bucket based on predicted duration
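Adaptive selection can be as simple as picking the smallest bucket that fits the predicted duration. A sketch of that client-side logic (the models themselves don't prescribe it):

```python
def pick_bucket(predicted_seconds, buckets=(3, 10, 45)):
    """Choose the smallest HAR decoder bucket that fits the predicted audio."""
    for b in buckets:
        if predicted_seconds <= b:
            return b
    return buckets[-1]  # longest content falls back to the largest bucket

print(pick_bucket(2.4))   # 3  -> fast TTFB model
print(pick_bucket(7.0))   # 10 -> medium bucket
print(pick_bucket(60.0))  # 45 -> largest bucket, content must be chunked
```

Content longer than the largest bucket is typically split into multiple synthesis calls rather than stretched into one.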
Text → [Duration Model] → Duration Predictions + Features
↓
[Client builds alignment matrix]
↓
[HAR Decoder Model] → 24kHz Audio Waveform
- HAR Processing: Harmonic-phase separation for better ANE utilization
- Fixed-size buckets: Pre-compiled models avoid dynamic shape issues
- Client-side alignment: Swift/Python builds alignment matrix from durations
- Memory optimization: Load models on-demand, unload when idle
- ✅ HAR vocoder architecture: Harmonic-phase separation optimized for ANE
- ✅ Conv1d layers: Efficient 1D convolutions for audio synthesis
- ✅ ConvTranspose1d: Upsampling layers for waveform generation
- ✅ Element-wise operations: LeakyReLU, addition, multiplication
- ✅ Result: 17x faster than real-time synthesis
- 📱 LSTM layers: Sequential processing for duration prediction
- 📱 Transformer attention: Complex attention mechanisms
- 📱 AdaLayerNorm: Adaptive normalization layers
- 📱 Dynamic shape handling: Variable-length text processing
- Model Caching: Load models on-demand, unload during idle periods
- Bucket Selection: Automatically choose optimal model size for content
- Memory Management: ~200MB per loaded model, efficient cleanup
- Warm-up Strategy: First inference slower, subsequent calls <1.5s
- Error Handling: Graceful fallback between bucket sizes
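The caching strategy above can be sketched as a lazy map from bucket to loaded model. `loader` is a hypothetical callable; in the real app it would compile and load an `MLModel`:

```python
class ModelCache:
    """Load models on demand, reuse them, and unload during idle periods."""
    def __init__(self, loader):
        self.loader = loader
        self._models = {}

    def get(self, bucket):
        # Load lazily on first use, then reuse the compiled model
        if bucket not in self._models:
            self._models[bucket] = self.loader(bucket)
        return self._models[bucket]

    def unload_all(self):
        # Drop references when idle (~200MB reclaimed per loaded model)
        self._models.clear()

loads = []
cache = ModelCache(loader=lambda b: loads.append(b) or f"model-{b}")
cache.get("3s")
cache.get("3s")
print(loads)  # ['3s'] -- loaded once despite two gets
cache.unload_all()
```

The warm-up cost described above is paid once per `get` after an unload, which is why the app keeps the active bucket resident between utterances.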
- HAR Path: Separate harmonic and phase processing for ANE compatibility
- Fixed-size Compilation: Pre-compiled models avoid CoreML dynamic shape limitations
- MIL Graph Patching: Runtime modifications for CoreML compatibility
- Input Validation: Comprehensive shape and type checking during export
- Detailed Conversion Guide - Step-by-step instructions
- Learnings & Challenges - Technical deep-dive into solutions
The original Kokoro Python library is still available:
!pip install -q "kokoro>=0.9.4" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch
pipeline = KPipeline(lang_code='a')
text = '''
[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
generator = pipeline(text, voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000)

Under the hood, kokoro uses misaki, a G2P library at https://github.com/hexgrad/misaki
You can run this advanced cell on Google Colab.
# 1️⃣ Install kokoro
!pip install -q "kokoro>=0.9.4" soundfile
# 2️⃣ Install espeak, used for English OOD fallback and some non-English languages
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# 3️⃣ Initialize a pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch
# 🇺🇸 'a' => American English, 🇬🇧 'b' => British English
# 🇪🇸 'e' => Spanish es
# 🇫🇷 'f' => French fr-fr
# 🇮🇳 'h' => Hindi hi
# 🇮🇹 'i' => Italian it
# 🇯🇵 'j' => Japanese: pip install misaki[ja]
# 🇧🇷 'p' => Brazilian Portuguese pt-br
# 🇨🇳 'z' => Mandarin Chinese: pip install misaki[zh]
pipeline = KPipeline(lang_code='a') # <= make sure lang_code matches voice, reference above.
# This text is for demonstration purposes only, unseen during training
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.
These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.
[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
# text = '「もしおれがただ偶然、そしてこうしようというつもりでなくここに立っているのなら、ちょっとばかり絶望するところだな」と、そんなことが彼の頭に思い浮かんだ。'
# text = '中國人民不信邪也不怕邪,不惹事也不怕事,任何外國不要指望我們會拿自己的核心利益做交易,不要指望我們會吞下損害我國主權、安全、發展利益的苦果!'
# text = 'Los partidos políticos tradicionales compiten con los populismos y los movimientos asamblearios.'
# text = 'Le dromadaire resplendissant déambulait tranquillement dans les méandres en mastiquant de petites feuilles vernissées.'
# text = 'ट्रांसपोर्टरों की हड़ताल लगातार पांचवें दिन जारी, दिसंबर से इलेक्ट्रॉनिक टोल कलेक्शनल सिस्टम'
# text = "Allora cominciava l'insonnia, o un dormiveglia peggiore dell'insonnia, che talvolta assumeva i caratteri dell'incubo."
# text = 'Elabora relatórios de acompanhamento cronológico para as diferentes unidades do Departamento que propõem contratos.'
# 4️⃣ Generate, display, and save audio files in a loop.
generator = pipeline(
    text, voice='af_heart', # <= change voice here
    speed=1, split_pattern=r'\n+'
)
# Alternatively, load voice tensor directly:
# voice_tensor = torch.load('path/to/voice.pt', weights_only=True)
# generator = pipeline(
# text, voice=voice_tensor,
# speed=1, split_pattern=r'\n+'
# )
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # save each audio file

To install espeak-ng on Windows:
- Go to espeak-ng releases
- Click on Latest release
- Download the appropriate `*.msi` file (e.g. espeak-ng-20191129-b702b03-x64.msi)
- Run the downloaded installer
For advanced configuration and usage on Windows, see the official espeak-ng Windows guide
On Mac M1/M2/M3/M4 devices, you can explicitly specify the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to enable GPU acceleration.
PYTORCH_ENABLE_MPS_FALLBACK=1 python run-your-kokoro-script.py

Use the following conda environment.yml if you're facing any dependency issues.
name: kokoro
channels:
  - defaults
dependencies:
  - python==3.9
  - libstdcxx~=12.4.0 # Needed to load espeak correctly. Try removing this if you're facing issues with Espeak fallback.
  - pip:
    - kokoro>=0.3.1
    - soundfile
    - misaki[en]

- 5s bucket: ~1.35s total (RTF ≈ 0.057) - 17x faster than real-time
- 15s bucket: ~1.41s total (RTF ≈ 0.060)
- 30s bucket: ~1.38s total (RTF ≈ 0.058)
- ANE (CoreML predict): 0.25–0.31s (dominant computation)
- CPU preprocessing: 0.15–0.17s (hn-nsf + STFT)
- Inverse STFT: 0.02–0.03s
- Orchestration/IO: 0.55–0.60s
- Model Size: ~330MB per HAR decoder model (FP16 precision)
- Export Time: ~2-5 minutes per model
- Inference Speed: 17x faster than real-time (warmed)
- Cold Start: First synthesis takes ~2-3s, subsequent synthesis <1.5s
- Memory Usage: ~200MB per loaded model
- Minimum iOS: 16.0 (for optimal CoreML support)
- Recommended: Apple Silicon (M1/M2/M3) or A15+ for ANE acceleration
- Supported Devices:
- All Apple Silicon Macs (M1/M2/M3/M4)
- iPhone 13+ (A15 Bionic+)
- iPad Air 5+ / iPad Pro M1+
- Performance scales with Neural Engine capabilities
Modify the buckets in export_coreml.py:
buckets = {
    "3s": 3 * 24000,   # 72,000 samples
    "5s": 5 * 24000,   # 120,000 samples
    "10s": 10 * 24000, # 240,000 samples
    "30s": 30 * 24000  # 720,000 samples
}

// Load models
let durationModel = try MLModel(contentsOf: durationURL)
let synthesizerModel = try MLModel(contentsOf: synthesizer3sURL)
// Run inference
let durationOutput = try durationModel.prediction(from: inputs)
let audioOutput = try synthesizerModel.prediction(from: features)

- 🛠️ @yl4579 for architecting StyleTTS 2
- 🏆 @Pendrokar for TTS Spaces Arena
- 📊 Synthetic training data contributors
- ❤️ Compute sponsors
- 🍎 Apple's coremltools team for excellent documentation
- 🧠 The scrappy approach inspired by startup engineering principles
- 📖 Detailed learnings documented through iterative debugging
- 👾 Discord server: https://discord.gg/QuGxSWBfQy
- 🪽 Kokoro is Japanese for "heart" or "spirit"
