Moshi is an open-source speech-text foundation model for real-time full-duplex voice dialogue. It powers natural voice conversations with low latency using the Mimi neural audio codec. Try the live demo or use Hugging Face models.
Maintainer: KuchikiRenji
| Contact | Details |
|---|---|
| Email | KuchikiRenji@outlook.com |
| GitHub | github.com/KuchikiRenji |
| Discord | kuchiki_renji |
For questions, contributions, or collaboration, reach out via the channels above.
- What is Moshi?
- Key Features
- Repository Structure
- Models
- Requirements
- Quick Start
- Clients
- Development
- FAQ
- License
- Citation
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It enables real-time voice AI conversations with:
- Mimi – A state-of-the-art streaming neural audio codec that compresses 24 kHz audio down to a 12.5 Hz representation at 1.1 kbps, with 80 ms frame latency, in a fully streaming way. It outperforms non-streaming codecs such as SpeechTokenizer (50 Hz, 4 kbps) and SemantiCodec (50 Hz, 1.3 kbps).
- Dual audio streams – Moshi models two streams: one for the user (from the microphone) and one for Moshi (from the model). It also predicts text tokens for its own speech (its inner monologue), which improves generation quality.
- Low latency – A small Depth Transformer handles inter-codebook dependencies at each time step, while a 7B-parameter Temporal Transformer models dependencies across time. Theoretical latency is 160 ms (80 ms Mimi frame + 80 ms acoustic delay); practical end-to-end latency can be as low as ~200 ms on an L4 GPU.
Talk to Moshi on the live demo.
Mimi builds on SoundStream and EnCodec, adding Transformers in both encoder and decoder and using strides for a 12.5 Hz frame rate—closer to text token rates (~3–4 Hz)—reducing autoregressive steps in Moshi. Like SpeechTokenizer, Mimi uses a distillation loss so the first codebook tokens match WavLM representations, modeling both semantic and acoustic content. Mimi is causal and streaming yet matches non-causal WavLM well. Like EBEN, it uses only an adversarial training loss with feature matching for strong subjective quality at low bitrate.
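To experiment with Mimi directly from Python, the sketch below condenses the usage shown in moshi/README.md (`loaders.get_mimi` plus `encode`/`decode`); treat it as illustrative, since exact names can shift between versions:

```python
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

# Download the Mimi weights from the default Hugging Face repo.
mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cuda")
mimi.set_num_codebooks(8)  # Mimi supports up to 32 codebooks; Moshi uses 8.

wav = torch.randn(1, 1, mimi.sample_rate * 10)  # [B, C=1, T]: 10 s of audio
with torch.no_grad():
    codes = mimi.encode(wav)      # [B, K=8, T'] tokens at 12.5 Hz
    decoded = mimi.decode(codes)  # back to a 24 kHz waveform

    # Streaming mode: feed one 80 ms frame (1920 samples) at a time.
    frame_size = int(mimi.sample_rate / mimi.frame_rate)
    with mimi.streaming(batch_size=1):
        for offset in range(0, wav.shape[-1], frame_size):
            frame = wav[:, :, offset : offset + frame_size]
            codes = mimi.encode(frame)  # one 12.5 Hz step per frame
```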
- Real-time full-duplex voice dialogue – Speak and hear responses with low latency.
- Streaming neural codec (Mimi) – 24 kHz → 12.5 Hz, 1.1 kbps, 80 ms frames.
- Multiple backends – PyTorch, MLX (Apple Silicon), and Rust/Candle.
- Voice variants – Moshika (female) and Moshiko (male); multiple quantizations (bf16, int8, int4 for MLX).
- Web UI and CLI – Easy local or remote use.
| Directory | Description |
|---|---|
| `moshi/` | Python (PyTorch) – Moshi and Mimi inference. |
| `moshi_mlx/` | Python (MLX) – Moshi on Apple M-series Macs. |
| `rust/` | Rust – Production backend; includes Mimi in Rust and the `rustymimi` Python bindings. |
| `client/` | Web UI – Frontend for the live demo. |
Released models:
- Mimi – Speech codec (included in each Moshi repo).
- Moshika – Moshi with female voice.
- Moshiko – Moshi with male voice.
Formats and quantization depend on the backend. All are on Hugging Face (CC-BY 4.0):
| Model | Backend | Variants |
|---|---|---|
| Moshika | PyTorch | kyutai/moshika-pytorch-bf16 (bf16) |
| Moshiko | PyTorch | kyutai/moshiko-pytorch-bf16 (bf16) |
| Moshika | MLX | q4 / q8 / bf16 |
| Moshiko | MLX | q4 / q8 / bf16 |
| Moshika | Rust/Candle | q8 / bf16 |
| Moshiko | Rust/Candle | q8 / bf16 |
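The commands below fetch weights on demand, but if you prefer to pre-download a variant, a plain `huggingface_hub` call is enough. The repo name here is just the PyTorch bf16 Moshika from the table, as an example:

```python
from huggingface_hub import snapshot_download

# Pre-fetch one model repo from the table into the local Hugging Face cache.
local_dir = snapshot_download("kyutai/moshika-pytorch-bf16")
print(local_dir)  # path to the cached weights
```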
- Python – 3.10 minimum; 3.12 recommended. See each backend's directory for details.
- PyTorch / MLX – Install via PyPI (see below). MLX and `rustymimi` may need Python 3.12 or a Rust toolchain.
- Rust backend – Rust toolchain; for GPU: CUDA with `nvcc`.
- GPU – PyTorch: ~24 GB VRAM (no quantization); a quick check is sketched below. MLX tested on MacBook Pro M3. Windows is not officially supported.
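As a rough sanity check against the ~24 GB figure, this small snippet (plain PyTorch, nothing Moshi-specific) reports what your GPU offers:

```python
import torch

# Compare available VRAM against the ~24 GB needed for unquantized PyTorch inference.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB VRAM")
else:
    print("No CUDA GPU found; consider the MLX or quantized Rust backends.")
```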
```bash
pip install moshi       # PyTorch
pip install moshi_mlx   # MLX (Python 3.12 recommended)
pip install rustymimi   # Mimi in Rust (Python bindings)

# Bleeding edge from this repo
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi&subdirectory=moshi"
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi_mlx&subdirectory=moshi_mlx"
```

Start the server, then open the web UI at http://localhost:8998:
```bash
python -m moshi.server [--gradio-tunnel] [--hf-repo kyutai/moshika-pytorch-bf16]
```

- Use `--gradio-tunnel` for a public URL (e.g. remote GPU). Latency may increase (e.g. +500 ms from Europe).
- Use `--gradio-tunnel-token` for a fixed secret token and a stable URL.
- Use `--hf-repo` to pick another Hugging Face model.
CLI client (no echo cancellation):

```bash
python -m moshi.client [--url URL_TO_GRADIO]
```

More details and API: moshi/README.md.
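If you want generation from Python rather than through the client/server pair, moshi/README.md also documents an `LMGen` wrapper around the language model. The sketch below condenses that flow (stream Mimi codes in, decode Moshi's audio stream out); the zero waveform stands in for real microphone input, and details may need adjusting to the installed version:

```python
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders, LMGen

device = "cuda"
mimi = loaders.get_mimi(hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME), device=device)
mimi.set_num_codebooks(8)
moshi_lm = loaders.get_moshi_lm(hf_hub_download(loaders.DEFAULT_REPO, loaders.MOSHI_NAME), device=device)
lm_gen = LMGen(moshi_lm, temp=0.8, temp_text=0.7)  # sampling settings

frame_size = int(mimi.sample_rate / mimi.frame_rate)  # 80 ms = 1920 samples
wav = torch.zeros(1, 1, frame_size * 125, device=device)  # stand-in for 10 s of mic input
out_chunks = []
with torch.no_grad(), lm_gen.streaming(1), mimi.streaming(1):
    for offset in range(0, wav.shape[-1], frame_size):
        codes = mimi.encode(wav[:, :, offset : offset + frame_size])
        tokens = lm_gen.step(codes)  # returns None during the warmup steps
        if tokens is not None:
            # tokens is [B, 1 + 8, 1]; index 0 is Moshi's text token,
            # the remaining 8 are audio codebooks for Moshi's own stream.
            out_chunks.append(mimi.decode(tokens[:, 1:]))

out_wav = torch.cat(out_chunks, dim=-1)  # Moshi's reply at 24 kHz
```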
```bash
python -m moshi_mlx.local -q 4   # 4-bit quantization
python -m moshi_mlx.local -q 8   # 8-bit
python -m moshi_mlx.local -q 4 --hf-repo kyutai/moshika-mlx-q4
```

Web UI:

```bash
python -m moshi_mlx.local_web
# → http://localhost:8998
```

Match `-q` and `--hf-repo` (e.g. `-q 4` with `*-mlx-q4`).
From the `rust/` directory:

```bash
cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config.json standalone
```

On macOS use `--features metal` instead of `--features cuda`. For int8 use `config-q8.json`. Set `"hf_repo"` in the config for Moshika/Moshiko.

When you see "standalone worker listening", open the web UI at https://localhost:8998 (the browser may show a warning; you can proceed to localhost).
- Web UI (recommended) – Echo cancellation and best experience; usually served automatically at the URLs above.
- Rust CLI – From `rust/`: `cargo run --bin moshi-cli -r -- tui --host localhost`
- Python CLI – `python -m moshi.client`
```bash
cd client
npm install
npm run build
```

Output is in `client/dist`.
From the repo root:
```bash
pip install -e 'moshi[dev]'
pip install -e 'moshi_mlx[dev]'
pre-commit install
```

Build `rustymimi` locally (with Rust installed):

```bash
pip install maturin
maturin dev -r -m rust/mimi-pyo3/Cargo.toml
```

See FAQ.md before opening an issue. Common topics: training code, dataset, multilingual support, voice/personality, M1/small GPU, PyTorch quantization.
- Code – MIT (Python, client); Apache (Rust backend). Some code is based on AudioCraft (MIT).
- Model weights – CC-BY 4.0.
If you use Mimi or Moshi, please cite:
```bibtex
@techreport{kyutai2024moshi,
  author = {Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and
            Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
  title = {Moshi: a speech-text foundation model for real-time dialogue},
  institution = {Kyutai},
  year = {2024},
  month = {September},
  url = {http://kyutai.org/Moshi.pdf},
}
```

Paper: Moshi: a speech-text foundation model for real-time dialogue

