Description
We're currently using smaller, high-quality human-annotated datasets to fine-tune existing models that were pre-trained on much larger datasets whose IPA labels are approximated via G2P from the text transcripts rather than from the audio (in the case of xlsr-53-espeak, the G2P models come from eSpeak and Phonetisaurus). This makes it easy to obtain multilingual training data in large quantities; however, the intermediary text transcription step in going from Speech -> Text -> IPA loses pronunciation nuance that is important for accents, dialects, speech impediments, and disfluencies. That is why we have to fine-tune on higher-quality human-annotated data afterwards.
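As a toy illustration of where the nuance is lost (using a tiny hypothetical lexicon, not the actual eSpeak/Phonetisaurus G2P): a text-based G2P maps each word to one canonical IPA string, so any accent or dialect variation that is present in the audio never reaches the labels.

```python
# Hypothetical one-entry-per-word lexicon; real G2P systems are far richer,
# but the limitation sketched here is the same: text in, canonical IPA out.
G2P_LEXICON = {
    "water": "ˈwɔːtə",
    "better": "ˈbetə",
}

def text_to_ipa(transcript: str) -> list[str]:
    """Approximate IPA labels from the text alone (the audio is never consulted)."""
    return [G2P_LEXICON.get(word, "<unk>") for word in transcript.lower().split()]

# A speaker who flaps the /t/ (closer to "ˈwɔɾɚ") and one who doesn't both
# receive the same canonical label, so the distinction a human annotator
# would capture from the audio is erased.
print(text_to_ipa("Water better"))  # ['ˈwɔːtə', 'ˈbetə']
```

Fine-tuning on human-annotated data is what reintroduces the audio-grounded detail that this pipeline discards.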
It would be interesting to explore using or training G2P models that account for accents and dialects to improve this step (and possibly to combine them with audio-based articulatory features). It might also be possible to take the fine-tuned model, map its vocabulary onto unseen languages' and dialects' phonetic inventories using articulatory features, and use that mapping to generate improved labels for pre-training on more diverse dialects and vocabularies.
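The vocabulary-mapping idea could be sketched as a nearest-neighbour lookup in articulatory feature space. Everything below is assumed for illustration: the feature coordinates are a crude made-up (voicing, place, manner) encoding, and `MODEL_VOCAB` is a stand-in for the fine-tuned model's phoneme inventory.

```python
# Crude articulatory feature vectors: (voiced, place, manner).
# These coordinates are invented for the sketch; a real system would use a
# proper feature inventory (e.g. distinctive features per IPA segment).
ART_FEATURES = {
    "p": (0, 0.0, 0.0),
    "b": (1, 0.0, 0.0),
    "t": (0, 0.5, 0.0),
    "d": (1, 0.5, 0.0),
    "s": (0, 0.5, 1.0),
    "z": (1, 0.5, 1.0),
}

# Hypothetical inventory of phonemes the fine-tuned model was trained on.
MODEL_VOCAB = {"p", "b", "t", "s"}

def map_to_vocab(phoneme: str) -> str:
    """Map an unseen phoneme to the nearest seen one by feature distance."""
    target = ART_FEATURES[phoneme]
    return min(
        MODEL_VOCAB,
        key=lambda v: sum((a - b) ** 2 for a, b in zip(ART_FEATURES[v], target)),
    )

# An unseen voiced fricative falls back to its closest in-vocab neighbour.
print(map_to_vocab("z"))  # s
```

With such a mapping, labels generated for a new dialect could be expressed in the model's existing vocabulary, giving usable (if approximate) pre-training targets before any human annotation exists for that dialect.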