Hi team,
With the recent release of Nvidia PersonaPlex-7B-v1, which is based on Kyutai's Moshi architecture, I am interested in porting it to MLX to enable full-duplex speech-to-speech on Apple Silicon.
Since this is a complex architecture, I wanted to open this issue to track progress and gauge community interest.
The Roadmap:
- Port the Mimi Codec: This is the neural audio codec used by Moshi/PersonaPlex. It requires converting the streaming convolutional/LSTM encoder-decoder to mlx.nn (see the streaming-conv sketch below the list).
- Port the LM Architecture: PersonaPlex uses a specific architecture with interleaved text/audio token streams, potentially based on Helium or Llama, adapted for streaming generation (see the embedding sketch below the list).
- Streaming Loop: Implement the real-time input/output loop that feeds encoded microphone frames to the model and plays its generated frames back in lockstep (see the loop sketch below the list).
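
For the Mimi port, the core difficulty is making the convolutions stateful so audio can be processed chunk-by-chunk. Here is a minimal sketch of the idea in MLX, assuming MLX's channels-last (batch, frames, channels) layout; `StreamingConv1d` is a hypothetical helper I made up, not part of MLX or Mimi:

```python
# A sketch of a streaming causal Conv1d in MLX. Assumes kernel_size > 1.
import mlx.core as mx
import mlx.nn as nn


class StreamingConv1d(nn.Module):
    """Causal 1D conv that carries left context across calls for streaming."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)
        self.context = kernel_size - 1  # frames of history the kernel needs

    def __call__(self, x: mx.array, cache: mx.array | None = None):
        # x: (batch, frames, channels); cache holds the previous context frames.
        if cache is None:
            cache = mx.zeros((x.shape[0], self.context, x.shape[2]))
        x = mx.concatenate([cache, x], axis=1)
        y = self.conv(x)                  # causal: no future frames are used
        new_cache = x[:, -self.context:]  # keep the tail for the next chunk
        return y, new_cache


# Feeding a long signal chunk-by-chunk should match feeding it all at once.
layer = StreamingConv1d(1, 4, kernel_size=3)
signal = mx.random.normal((1, 8, 1))
full, _ = layer(signal)
cache = None
chunks = []
for t in range(0, 8, 2):
    out, cache = layer(signal[:, t:t + 2], cache)
    chunks.append(out)
assert mx.allclose(full, mx.concatenate(chunks, axis=1), atol=1e-5)
```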
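For the LM side, Moshi-style models fuse the streams by summing one text-token embedding with one embedding per audio codebook at every frame. A rough sketch of that embedding step; the codebook count, vocab sizes, and width below are illustrative, not PersonaPlex's actual hyperparameters:

```python
# A sketch of per-frame text/audio embedding fusion in MLX.
import mlx.core as mx
import mlx.nn as nn

N_CODEBOOKS = 8      # Mimi residual codebooks per frame (assumption)
AUDIO_VOCAB = 2048   # entries per codebook (assumption)
TEXT_VOCAB = 32000   # text vocabulary size (assumption)
D_MODEL = 512        # illustrative model width


class FrameEmbedder(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)
        # One embedding table per audio codebook.
        self.audio_embs = [nn.Embedding(AUDIO_VOCAB, D_MODEL)
                           for _ in range(N_CODEBOOKS)]

    def __call__(self, text_tokens: mx.array, audio_tokens: mx.array) -> mx.array:
        # text_tokens: (batch, frames); audio_tokens: (batch, frames, codebooks)
        h = self.text_emb(text_tokens)
        for k, emb in enumerate(self.audio_embs):
            h = h + emb(audio_tokens[..., k])
        return h  # (batch, frames, D_MODEL): one fused vector per frame


embedder = FrameEmbedder()
text = mx.zeros((1, 4), dtype=mx.int32)
audio = mx.zeros((1, 4, N_CODEBOOKS), dtype=mx.int32)
print(embedder(text, audio).shape)  # (1, 4, 512)
```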
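And for the streaming loop, the structure I have in mind is one iteration per codec frame: encode the user's frame, step the LM, decode the model's frame. The sketch below uses stub functions (`mimi_encode`, `lm_step`, `mimi_decode` are placeholders I made up, not real APIs) just to show the shape of the loop; 1920 samples corresponds to one 80 ms frame at Mimi's 24 kHz / 12.5 Hz rate:

```python
# A sketch of the full-duplex frame loop, with stand-in stubs for codec and LM.
import mlx.core as mx

FRAME_SAMPLES = 1920  # 80 ms at 24 kHz, Mimi's frame size


def mimi_encode(frame: mx.array) -> mx.array:
    return mx.zeros((8,), dtype=mx.int32)     # stub: 8 codebook tokens


def lm_step(user_tokens: mx.array, state: dict) -> mx.array:
    return mx.zeros((8,), dtype=mx.int32)     # stub: model's next audio tokens


def mimi_decode(tokens: mx.array) -> mx.array:
    return mx.zeros((FRAME_SAMPLES,))         # stub: one frame of PCM


def run_duplex(mic_frames, state=None):
    state = state or {}
    for frame in mic_frames:                     # one iteration per 80 ms frame
        user_tokens = mimi_encode(frame)         # user speech -> audio tokens
        model_tokens = lm_step(user_tokens, state)  # LM listens and speaks in lockstep
        yield mimi_decode(model_tokens)          # model tokens -> output PCM


silence = [mx.zeros((FRAME_SAMPLES,)) for _ in range(5)]
for out in run_duplex(silence):
    pass  # in practice: write `out` to the audio output device
```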
I plan to start working on this in a separate repository. If anyone has already started looking into Moshi or Mimi specifically, please let me know so we can avoid duplicating work!