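Thank you very much for your team's work! Which model is the scoring model used to construct the large-margin data in Section 3.1.2, Emotion and Speaking Style Editing? Are there any plans to open-source it?
Original text, 3.1.2 Emotion and Speaking Style Editing: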
Zero-shot Cloning. A triplet ⟨text_prompt, audio_neutral, audio_emotion/style⟩ is constructed for each
emotion and speaking style by selecting corresponding emotional and neutral audio clips from the same
speaker as the prompt audio and processing them with the StepTTS voice cloning interface, using a
text instruction that describes the target attribute.
Margin Scoring. To evaluate the triplet generated, we developed a scoring model using a small,
human-annotated dataset. The model evaluates audio pairs on a 1-10 scale, with higher margin
scores corresponding to more desirable outcomes.
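
To make the question concrete, here is a minimal sketch of how I understand margin scores being used to keep only large-margin triplets. The excerpt does not name the scoring model or any threshold, so everything here is an assumption on my part: the `ScoredTriplet` fields, the `select_large_margin` helper, and the default `min_margin=6` are hypothetical, and the margin values are taken as already produced by some external scorer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredTriplet:
    """One triplet plus its margin score (field names are hypothetical)."""
    text_prompt: str    # text instruction describing the target attribute
    audio_neutral: str  # path to the neutral clip
    audio_target: str   # path to the emotion/style clip
    margin: int         # 1-10 score from the (unspecified) scoring model


def select_large_margin(triplets: List[ScoredTriplet],
                        min_margin: int = 6) -> List[ScoredTriplet]:
    """Keep triplets whose margin is at least min_margin.

    The threshold of 6 is an illustrative assumption, not a value
    stated in the quoted excerpt.
    """
    for t in triplets:
        if not 1 <= t.margin <= 10:
            raise ValueError(f"margin out of the 1-10 range: {t.margin}")
    return [t for t in triplets if t.margin >= min_margin]


if __name__ == "__main__":
    data = [
        ScoredTriplet("speak happily", "n1.wav", "e1.wav", 8),
        ScoredTriplet("speak sadly", "n2.wav", "e2.wav", 3),
    ]
    for t in select_large_margin(data):
        print(t.text_prompt, t.margin)
```

If the actual pipeline differs from this (for example, a different threshold or a pairwise rather than per-triplet score), a pointer to the real setup would be much appreciated.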