
Improved G2P Pre-training for Speech2IPA #5

@SanderGi

Description

We're currently using smaller, high-quality, human-annotated datasets to fine-tune existing models that were pre-trained on larger datasets whose IPA labels are approximated from the text transcripts rather than from the audio, using G2P (in the case of xlsr-53-espeak, the G2P models come from eSpeak and Phonetisaurus). This makes it easy to obtain multilingual training data in large quantities; however, the intermediate text-transcription step in going from Speech -> Text -> IPA loses pronunciation nuance that is important for accents, dialects, speech impediments, and disfluencies. Hence the need to fine-tune on higher-quality human-annotated data afterwards.
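The information loss can be sketched with a toy example (the lexicon and pronunciations below are purely illustrative, not output from eSpeak or Phonetisaurus): any G2P that works from text alone assigns one canonical pronunciation per word, so speakers who actually say a word differently get identical pre-training labels.

```python
# Toy sketch of the Speech -> Text -> IPA labeling step described above.
# Lexicon entries are illustrative, not from a real G2P model.
CANONICAL_G2P = {
    "either": "ˈiːðə",      # one canonical form per word
    "tomato": "təˈmeɪtoʊ",
}

def g2p_label(transcript: str) -> str:
    """Approximate IPA from text alone; unknown words pass through unchanged."""
    return " ".join(CANONICAL_G2P.get(w, w) for w in transcript.lower().split())

# Two speakers who pronounce "either" differently (/ˈiːðə/ vs. /ˈaɪðə/)
# share the same transcript, so both receive the same label and the
# accent distinction never reaches the pre-training data.
print(g2p_label("either tomato"))  # → "ˈiːðə təˈmeɪtoʊ" for both speakers
```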

It would be interesting to explore using/training G2P models that account for accents and dialects to improve this step (and possibly to combine them with audio-based articulatory features). It might also be possible to take the fine-tuned model, map its vocabulary onto the phonetic inventories of unseen languages and dialects using articulatory features, and use that mapping to generate improved labels for pre-training on more diverse dialects and vocabularies.
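One way the inventory-mapping idea could work is nearest-neighbor matching in articulatory feature space. The sketch below uses a hand-made 3-dimensional encoding (voicing, place, manner) purely for illustration; real work would presumably use a richer feature set such as panphon-style vectors.

```python
# Hypothetical sketch: map a phone from the fine-tuned model's vocab onto the
# closest phone in an unseen language's inventory via articulatory features.
# Feature values are simplified toys: (voicing, place, manner), with voicing
# weighted more heavily (0 vs. 2) so it dominates small place differences.
FEATURES = {
    "p": (0, 0, 0),    # voiceless bilabial plosive
    "b": (2, 0, 0),    # voiced bilabial plosive
    "t": (0, 1, 0),    # voiceless alveolar plosive
    "d": (2, 1, 0),    # voiced alveolar plosive
    "f": (0, 0.5, 1),  # voiceless labiodental fricative
}

def map_phone(phone: str, target_inventory: list[str]) -> str:
    """Return the target phone with the smallest squared feature distance."""
    src = FEATURES[phone]
    return min(
        target_inventory,
        key=lambda t: sum((a - b) ** 2 for a, b in zip(src, FEATURES[t])),
    )

# An inventory without /b/ maps it to its closest voiced plosive, /d/.
print(map_phone("b", ["p", "t", "d"]))  # → "d"
```

The same lookup could run over the whole fine-tuned vocabulary to relabel pre-training data for a dialect whose inventory differs from the training languages'.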


Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
