How to finetune audio duplex ability? #1084

@edward15241

Description

This work and demo is very impressive. 😎 👍 👍

I'm interested in fine-tuning the model, particularly to improve full-duplex speech interaction (simultaneous listening and speaking) and interruption handling. At the same time, it would be important to avoid catastrophic forgetting that might degrade these capabilities.

I would appreciate some clarification about the training design:

  1. Interruption training
    How is barge-in handling trained in this model? Is it implemented in a way similar to Moshi-style streaming speech interaction, or to FLM-Audio-style duplex conversational modeling?

  2. Duplex interaction (listen while speaking)
    How is the model trained to listen while speaking? Does the training data contain overlapping speech segments or a special interaction format that enables duplex behavior and monologue generation?

  3. Fine-tuning details
    If we want to fine-tune while preserving the duplex capabilities, how should the data be fed to the model? (I assume the training data might follow a format similar to the Hugging Face chat template, but I'm not sure how barge-in events or interruption labels are encoded in the dataset.)
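To make question 3 concrete, here is a minimal sketch of what I imagine a duplex training sample could look like. This is entirely hypothetical: the field names (`overlaps_with`, `event`, the `<audio_chunk_*>` placeholders) are my own guesses and are not taken from this repo's actual data format.

```python
# Hypothetical sketch of a chat-template-like duplex training sample.
# All field names are my own guesses, NOT the repo's actual format.
hypothetical_sample = {
    "messages": [
        {"role": "user", "content": "<audio_chunk_0>"},
        # Assistant begins replying while the user is still speaking,
        # so the two audio streams overlap in time (duplex behavior).
        {"role": "assistant", "content": "<audio_chunk_1>",
         "overlaps_with": "<audio_chunk_0>"},
        # A barge-in event: the user interrupts mid-response, which
        # could be marked with an explicit event label like this.
        {"role": "user", "content": "<audio_chunk_2>",
         "event": "barge_in"},
        # The assistant's turn is truncated at the interruption point.
        {"role": "assistant", "content": "<audio_chunk_3_truncated>"},
    ]
}

roles = [m["role"] for m in hypothetical_sample["messages"]]
print(roles)  # ['user', 'assistant', 'user', 'assistant']
```

Is the real encoding anything like this (interleaved turns with explicit overlap/interruption markers), or does it use parallel token streams per speaker instead?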

Thank you very much 😍
