Moshi – Speech-Text Foundation Model for Real-Time Voice Dialogue | AI Voice Assistant


Moshi is an open-source speech-text foundation model for real-time full-duplex voice dialogue. It powers natural voice conversations with low latency using the Mimi neural audio codec. Try the live demo or use Hugging Face models.


Author & Contact

Maintainer: KuchikiRenji

Contact details:

  • Email – KuchikiRenji@outlook.com
  • GitHub – github.com/KuchikiRenji
  • Discord – kuchiki_renji

For questions, contributions, or collaboration, reach out via the channels above.


Table of Contents

  • What is Moshi?
  • Key Features
  • Organisation of the repository
  • Models
  • Requirements
  • Install from PyPI
  • Quick Start
  • Clients
  • Building the Web UI
  • Development
  • FAQ
  • License
  • Citation

What is Moshi?

Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It enables real-time voice AI conversations with:

  • Mimi – A state-of-the-art streaming neural audio codec that processes 24 kHz audio down to a 12.5 Hz representation at 1.1 kbps, with 80 ms frame latency, in a fully streaming way (a quick arithmetic check of these figures follows this list). It outperforms non-streaming codecs such as SpeechTokenizer (50 Hz, 4 kbps) and SemantiCodec (50 Hz, 1.3 kbps).

  • Dual audio streams – Moshi models two streams: one for the user (from the microphone) and one for Moshi (from the model). It also predicts text tokens for its own speech (inner monologue), which improves generation quality.

  • Low latency – A small Depth Transformer handles inter-codebook dependencies per time step; a 7B-parameter Temporal Transformer handles time. Theoretical latency is 160 ms (80 ms Mimi frame + 80 ms acoustic); practical end-to-end latency can be as low as ~200 ms on an L4 GPU.
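
The bitrate and latency figures above can be verified with a quick back-of-the-envelope calculation. The codebook configuration used below (8 codebooks of 2048 entries each, i.e. 11 bits per code) is an assumption consistent with the stated 1.1 kbps, not something this README specifies:

frame_rate_hz = 12.5                 # Mimi frames per second
num_codebooks = 8                    # assumed: 8 codebooks per frame
bits_per_code = 11                   # assumed: log2(2048) bits per codebook entry
print(frame_rate_hz * num_codebooks * bits_per_code)  # 1100.0 bits/s ≈ 1.1 kbps

frame_ms = 1000 / frame_rate_hz      # 80.0 ms per frame
print(frame_ms + 80)                 # 160.0 ms = frame latency + acoustic delay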

Talk to Moshi on the live demo.

Moshi architecture: two audio streams (user and Moshi), text tokens, Depth and Temporal Transformers for real-time voice dialogue.

About Mimi (Neural Codec)

Mimi builds on SoundStream and EnCodec, adding Transformers in both the encoder and decoder and adjusting strides to reach a 12.5 Hz frame rate, much closer to text token rates (~3–4 Hz), which reduces the number of autoregressive steps Moshi must take. Like SpeechTokenizer, Mimi uses a distillation loss so that the first codebook's tokens match WavLM representations, letting it model both semantic and acoustic content. Although Mimi is causal and streaming, its semantic tokens still match the non-causal WavLM representations well. Like EBEN, it is trained with only an adversarial loss plus feature matching, which yields strong subjective quality at a low bitrate.

Mimi neural codec: Transformer encoder and decoder, 12.5 Hz frame rate, streaming audio compression for Moshi.
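
For orientation, here is a minimal sketch of round-tripping audio through Mimi with the moshi Python package. The loader helpers (loaders.DEFAULT_REPO, loaders.MIMI_NAME, loaders.get_mimi) follow the package's model-loading utilities; treat the exact names as assumptions and see moshi/README.md for the authoritative API:

# Sketch: encode and decode 24 kHz audio with Mimi (PyTorch backend).
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)               # 8 codebooks, as used by Moshi

wav = torch.randn(1, 1, 24000)          # one second of dummy 24 kHz mono audio
with torch.no_grad():
    codes = mimi.encode(wav)            # [batch, codebooks, frames] at 12.5 Hz
    reconstruction = mimi.decode(codes) # back to 24 kHz audio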


Key Features

  • Real-time full-duplex voice dialogue – Speak and hear responses with low latency.
  • Streaming neural codec (Mimi) – 24 kHz → 12.5 Hz, 1.1 kbps, 80 ms frames.
  • Multiple backends – PyTorch, MLX (Apple Silicon), and Rust/Candle.
  • Voice variants – Moshika (female) and Moshiko (male); multiple quantizations (bf16, int8, int4 for MLX).
  • Web UI and CLI – Easy local or remote use.

Organisation of the repository

Directory    Description
moshi/       Python (PyTorch) – Moshi and Mimi inference.
moshi_mlx/   Python (MLX) – Moshi on Apple M-series Macs.
rust/        Rust – Production backend; includes Mimi in Rust and the rustymimi Python bindings.
client/      Web UI – Frontend for the live demo.

Models

Released models:

  • Mimi – Speech codec (included in each Moshi repo).
  • Moshika – Moshi with female voice.
  • Moshiko – Moshi with male voice.

Formats and quantization depend on the backend. All are on Hugging Face (CC-BY 4.0):

Model     Backend       Variants
Moshika   PyTorch       kyutai/moshika-pytorch-bf16 (bf16)
Moshiko   PyTorch       kyutai/moshiko-pytorch-bf16 (bf16)
Moshika   MLX           q4 / q8 / bf16
Moshiko   MLX           q4 / q8 / bf16
Moshika   Rust/Candle   q8 / bf16
Moshiko   Rust/Candle   q8 / bf16

Requirements

  • Python – 3.10 minimum; 3.12 recommended. See each backend’s directory for details.
  • PyTorch / MLX – Install via PyPI (see below). MLX and rustymimi may need Python 3.12 or a Rust toolchain.
  • Rust backend – Rust toolchain; for GPU support, CUDA with nvcc.
  • GPU – PyTorch: ~24 GB VRAM (no quantization). MLX tested on MacBook Pro M3. Windows is not officially supported.

Install from PyPI

pip install moshi        # PyTorch
pip install moshi_mlx    # MLX (Python 3.12 recommended)
pip install rustymimi    # Mimi in Rust (Python bindings)

# Bleeding edge from this repo
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi&subdirectory=moshi"
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi_mlx&subdirectory=moshi_mlx"

Quick Start

Python (PyTorch)

Start the server, then open the web UI at http://localhost:8998:

python -m moshi.server [--gradio-tunnel] [--hf-repo kyutai/moshika-pytorch-bf16]
  • Use --gradio-tunnel for a public URL (e.g. remote GPU). Latency may increase (e.g. +500 ms from Europe).
  • Use --gradio-tunnel-token for a fixed secret token and stable URL.
  • Use --hf-repo to pick another Hugging Face model.

CLI client (no echo cancellation):

python -m moshi.client [--url URL_TO_GRADIO]

More details and API: moshi/README.md.
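
As a rough illustration of that API, the sketch below feeds audio to Moshi frame by frame and collects its reply. The names (LMGen, get_moshi_lm, the streaming context managers, step) follow the streaming API documented in moshi/README.md; verify the exact signatures there before relying on them:

# Sketch: streaming generation with Mimi + Moshi (assumes a CUDA GPU).
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders, LMGen

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cuda")
mimi.set_num_codebooks(8)
moshi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MOSHI_NAME)
moshi = loaders.get_moshi_lm(moshi_weight, device="cuda")
lm_gen = LMGen(moshi, temp=0.8, temp_text=0.7)

frame_size = int(mimi.sample_rate / mimi.frame_rate)     # 1920 samples = 80 ms
wav = torch.zeros(1, 1, frame_size * 25, device="cuda")  # 2 s of silence as user input
out_chunks = []
with torch.no_grad(), lm_gen.streaming(1), mimi.streaming(1):
    for offset in range(0, wav.shape[-1], frame_size):
        codes = mimi.encode(wav[:, :, offset:offset + frame_size])
        tokens = lm_gen.step(codes)      # None until Moshi starts speaking
        if tokens is not None:
            out_chunks.append(mimi.decode(tokens[:, 1:]))  # drop the text stream
out_wav = torch.cat(out_chunks, dim=-1) if out_chunks else None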

Python (MLX) for local inference on macOS

python -m moshi_mlx.local -q 4   # 4-bit quantization
python -m moshi_mlx.local -q 8   # 8-bit
python -m moshi_mlx.local -q 4 --hf-repo kyutai/moshika-mlx-q4

Web UI:

python -m moshi_mlx.local_web
# → http://localhost:8998

Match -q and --hf-repo (e.g. q4 with *-mlx-q4).

Rust

From the rust directory:

cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config.json standalone

On macOS use --features metal instead of --features cuda. For int8 use config-q8.json. Set "hf_repo" in the config for Moshika/Moshiko.

When you see "standalone worker listening", open the web UI at https://localhost:8998 (browser may show a warning; you can proceed to localhost).


Clients

  • Web UI (recommended) – Echo cancellation and best experience; usually served automatically at the URLs above.
  • Rust CLI – From rust/: cargo run --bin moshi-cli -r -- tui --host localhost
  • Python CLI – python -m moshi.client

Building the Web UI

cd client
npm install
npm run build

Output is in client/dist.


Development

From the repo root:

pip install -e 'moshi[dev]'
pip install -e 'moshi_mlx[dev]'
pre-commit install

Build rustymimi locally (with Rust installed):

pip install maturin
maturin dev -r -m rust/mimi-pyo3/Cargo.toml

FAQ

See FAQ.md before opening an issue. Common topics: training code, dataset, multilingual support, voice/personality, M1/small GPU, PyTorch quantization.


License

  • Code – MIT (Python, client); Apache (Rust backend). Some code is based on AudioCraft (MIT).
  • Model weights – CC-BY 4.0.

Citation

If you use Mimi or Moshi, please cite:

@techreport{kyutai2024moshi,
    author = {Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and
              Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
    title = {Moshi: a speech-text foundation model for real-time dialogue},
    institution = {Kyutai},
    year = {2024},
    month = {September},
    url = {http://kyutai.org/Moshi.pdf},
}

Paper: Moshi: a speech-text foundation model for real-time dialogue
