Hi team,
With the recent release of Nvidia PersonaPlex-7B-v1, which is based on Kyutai's Moshi architecture, I am interested in porting it to MLX to enable full-duplex speech-to-speech on Apple Silicon.
Since this is a complex architecture, I wanted to open this issue to track progress and gauge community interest.
The Roadmap:
- Port the Mimi Codec: This is the neural audio codec used by Moshi/PersonaPlex. It requires converting the streaming convolutional/LSTM encoder-decoder to mlx.nn (see the streaming-conv sketch below the list).
- Port the LM Architecture: PersonaPlex uses a specific architecture with interleaved text/audio token streams, potentially based on Helium or Llama, adapted for streaming generation (see the embedding sketch below the list).
- Streaming Loop: Implement the real-time input/output loop that feeds encoded microphone frames to the model and plays its generated frames back in lockstep (see the loop sketch below the list).
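
For the Mimi port, the core difficulty is making the convolutions stateful so audio can be processed chunk-by-chunk. Here is a minimal sketch of the idea in MLX, assuming MLX's channels-last (batch, frames, channels) layout; `StreamingConv1d` is a hypothetical helper I made up, not part of MLX or Mimi:

```python
# A sketch of a streaming causal Conv1d in MLX. Assumes kernel_size > 1.
import mlx.core as mx
import mlx.nn as nn


class StreamingConv1d(nn.Module):
    """Causal 1D conv that carries left context across calls for streaming."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)
        self.context = kernel_size - 1  # frames of history the kernel needs

    def __call__(self, x: mx.array, cache: mx.array | None = None):
        # x: (batch, frames, channels); cache holds the previous context frames.
        if cache is None:
            cache = mx.zeros((x.shape[0], self.context, x.shape[2]))
        x = mx.concatenate([cache, x], axis=1)
        y = self.conv(x)                  # causal: no future frames are used
        new_cache = x[:, -self.context:]  # keep the tail for the next chunk
        return y, new_cache


# Feeding a long signal chunk-by-chunk should match feeding it all at once.
layer = StreamingConv1d(1, 4, kernel_size=3)
signal = mx.random.normal((1, 8, 1))
full, _ = layer(signal)
cache = None
chunks = []
for t in range(0, 8, 2):
    out, cache = layer(signal[:, t:t + 2], cache)
    chunks.append(out)
assert mx.allclose(full, mx.concatenate(chunks, axis=1), atol=1e-5)
```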
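For the LM side, Moshi-style models fuse the streams by summing one text-token embedding with one embedding per audio codebook at every frame. A rough sketch of that embedding step; the codebook count, vocab sizes, and width below are illustrative, not PersonaPlex's actual hyperparameters:

```python
# A sketch of per-frame text/audio embedding fusion in MLX.
import mlx.core as mx
import mlx.nn as nn

N_CODEBOOKS = 8      # Mimi residual codebooks per frame (assumption)
AUDIO_VOCAB = 2048   # entries per codebook (assumption)
TEXT_VOCAB = 32000   # text vocabulary size (assumption)
D_MODEL = 512        # illustrative model width


class FrameEmbedder(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)
        # One embedding table per audio codebook.
        self.audio_embs = [nn.Embedding(AUDIO_VOCAB, D_MODEL)
                           for _ in range(N_CODEBOOKS)]

    def __call__(self, text_tokens: mx.array, audio_tokens: mx.array) -> mx.array:
        # text_tokens: (batch, frames); audio_tokens: (batch, frames, codebooks)
        h = self.text_emb(text_tokens)
        for k, emb in enumerate(self.audio_embs):
            h = h + emb(audio_tokens[..., k])
        return h  # (batch, frames, D_MODEL): one fused vector per frame


embedder = FrameEmbedder()
text = mx.zeros((1, 4), dtype=mx.int32)
audio = mx.zeros((1, 4, N_CODEBOOKS), dtype=mx.int32)
print(embedder(text, audio).shape)  # (1, 4, 512)
```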
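And for the streaming loop, the structure I have in mind is one iteration per codec frame: encode the user's frame, step the LM, decode the model's frame. The sketch below uses stub functions (`mimi_encode`, `lm_step`, `mimi_decode` are placeholders I made up, not real APIs) just to show the shape of the loop; 1920 samples corresponds to one 80 ms frame at Mimi's 24 kHz / 12.5 Hz rate:

```python
# A sketch of the full-duplex frame loop, with stand-in stubs for codec and LM.
import mlx.core as mx

FRAME_SAMPLES = 1920  # 80 ms at 24 kHz, Mimi's frame size


def mimi_encode(frame: mx.array) -> mx.array:
    return mx.zeros((8,), dtype=mx.int32)     # stub: 8 codebook tokens


def lm_step(user_tokens: mx.array, state: dict) -> mx.array:
    return mx.zeros((8,), dtype=mx.int32)     # stub: model's next audio tokens


def mimi_decode(tokens: mx.array) -> mx.array:
    return mx.zeros((FRAME_SAMPLES,))         # stub: one frame of PCM


def run_duplex(mic_frames, state=None):
    state = state or {}
    for frame in mic_frames:                     # one iteration per 80 ms frame
        user_tokens = mimi_encode(frame)         # user speech -> audio tokens
        model_tokens = lm_step(user_tokens, state)  # LM listens and speaks in lockstep
        yield mimi_decode(model_tokens)          # model tokens -> output PCM


silence = [mx.zeros((FRAME_SAMPLES,)) for _ in range(5)]
for out in run_duplex(silence):
    pass  # in practice: write `out` to the audio output device
```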
I plan to start working on this in a separate repository. If anyone has already started looking into Moshi or Mimi specifically, please let me know so we can avoid duplicating work!