[Model Request] Porting Nvidia PersonaPlex-7B (Moshi architecture) & Mimi Codec #1405

Description

@Vlor999

Hi team,

With the recent release of Nvidia PersonaPlex-7B-v1, which is based on Kyutai's Moshi architecture, I am interested in working on a port to MLX to enable full-duplex speech-to-speech on Apple Silicon.

Since this is a complex architecture, I wanted to open this issue to track progress and gauge community interest.

The Roadmap:

  • Port the Mimi Codec: This is the neural audio codec used by Moshi/PersonaPlex. It requires converting the streaming convolutional/LSTM encoder-decoder to mlx.nn, preserving the per-chunk state caching that makes streaming inference match offline inference.
  • Port the LM Architecture: PersonaPlex uses a specific architecture (interleaved text/audio tokens), potentially based on Helium or Llama, adapted for streaming generation.
  • Streaming Loop: Implement the real-time input/output loop that feeds microphone frames in and plays decoded audio out with low latency.
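To illustrate the first roadmap item: the core trick for porting a streaming convolutional encoder is caching the last `kernel_size - 1` input samples between chunks, so chunked causal convolution produces exactly the same output as running over the full signal offline. Below is a minimal framework-agnostic sketch in numpy (the class name and single-channel/stride-1 simplification are my own; Mimi's real layers are multi-channel, strided, and dilated):

```python
import numpy as np

class StreamingCausalConv1d:
    """Sketch of a causal 1-D convolution that processes input in chunks.

    It caches the last (kernel_size - 1) input samples between calls, so
    concatenated per-chunk outputs match the offline result exactly.
    Single channel, stride 1, for illustration only.
    """

    def __init__(self, weights):
        self.w = np.asarray(weights, dtype=np.float64)
        self.pad = len(self.w) - 1
        # Left context starts as zeros (causal zero-padding).
        self.cache = np.zeros(self.pad)

    def __call__(self, chunk):
        x = np.concatenate([self.cache, np.asarray(chunk, dtype=np.float64)])
        # y[t] = sum_k w[k] * x[t - k]  ==  'valid' correlation with reversed w
        y = np.correlate(x, self.w[::-1], mode="valid")
        # Keep the trailing samples as left context for the next chunk.
        self.cache = x[len(x) - self.pad:]
        return y

# Chunked output equals one offline pass over the zero-padded signal.
w = [0.5, 0.3, 0.2]
x = np.arange(10, dtype=np.float64)
conv = StreamingCausalConv1d(w)
streamed = np.concatenate([conv(x[:4]), conv(x[4:7]), conv(x[7:])])
offline = np.correlate(np.concatenate([np.zeros(2), x]),
                       np.asarray(w)[::-1], mode="valid")
assert np.allclose(streamed, offline)
```

The same state-caching pattern applies to the transposed convolutions in the decoder and to the LSTM hidden state; in MLX this would become explicit cache arrays carried across calls.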
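For the second item, a rough sketch of what "interleaved text/audio tokens" means: per audio frame the model emits one text token plus one token per Mimi codebook, and these streams are flattened into a single generation sequence. The codebook count (8) and the exact stream ordering below are assumptions for illustration, not the confirmed PersonaPlex layout:

```python
# Hypothetical interleaving scheme: one text token followed by one token
# per Mimi codebook for each audio frame. NUM_CODEBOOKS = 8 is an
# assumption; the real model's stream count and order may differ.
NUM_CODEBOOKS = 8

def interleave_frames(text_tokens, audio_frames):
    """Flatten per-frame text/audio tokens into one generation stream.

    text_tokens:  one token id per frame
    audio_frames: per frame, a list of NUM_CODEBOOKS codebook token ids
    """
    stream = []
    for text_tok, frame in zip(text_tokens, audio_frames):
        assert len(frame) == NUM_CODEBOOKS
        stream.append(("text", text_tok))
        stream.extend(("audio", cb, tok) for cb, tok in enumerate(frame))
    return stream

# Two frames -> 2 * (1 text + 8 audio) = 18 interleaved positions.
stream = interleave_frames([1, 2], [[10] * 8, [20] * 8])
assert len(stream) == 18
```

In the real model the interleaving is handled inside the transformer's input embedding and output heads rather than as a flat Python list, but the bookkeeping is the same.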

I plan to start working on this in a separate repository. If anyone has already started looking into Moshi or Mimi specifically, please let me know to avoid duplicate work!
