This work and demo are very impressive. 😎 👍 👍
I'm interested in fine-tuning the model, particularly to improve duplex speech interaction (simultaneous listening and speaking) and interruption handling. However, it would be important to avoid catastrophic forgetting that might degrade these capabilities.
I would appreciate some clarification about the training design:
- **Interruption training**
  How is barge-in handling trained in this model? Is it implemented in a way similar to Moshi-style streaming speech interaction, or FLM-Audio-style duplex conversational modeling?
- **Duplex interaction (listening while speaking)**
  How is the model trained to listen while speaking? Does the training data contain overlapping speech segments, or a special interaction format that enables duplex behavior and monologue generation?
- **Fine-tuning details**
  If we want to fine-tune on our own data while keeping the duplex capabilities, how should the data be formatted and fed to the model? I assume the training data might follow a format similar to the Hugging Face chat template, but I'm not sure how barge-in events or interruption labels are encoded in the dataset.
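To make the question concrete, here is a minimal sketch of what I *imagine* such a sample might look like, assuming a chat-template-like turn list where interruptions are marked with a special token. The role names, the `[BARGE_IN]` marker, and the `render` helper are all my own guesses, not the model's actual conventions:

```python
# Hypothetical encoding of a barge-in event in a chat-template-like sample.
# The "[BARGE_IN]" marker and "<|role|>" delimiters are guessed, not the
# model's actual format.

sample = [
    {"role": "user", "content": "Tell me about the weather today."},
    {"role": "assistant", "content": "Sure, today's forecast shows"},
    # Overlapping user speech interrupts the assistant mid-utterance.
    {"role": "user", "content": "[BARGE_IN] Actually, just tell me if it will rain."},
    {"role": "assistant", "content": "Yes, light rain is expected this afternoon."},
]

def render(turns):
    """Flatten the turn list into a single training string (guessed format)."""
    return "\n".join(f"<|{t['role']}|> {t['content']}" for t in turns)

print(render(sample))
```

Is something like this close to the real format, or are interruptions represented at the audio-stream level instead of as text labels?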
Thank you very much 😍