Improved Explanations: Interlingual Homophones Dataset #7

@SanderGi

Interlingual homophones are words in different languages that sound the same or nearly the same.

When users struggle to produce a sound in a non-native language, we want to show them words from their native language that contain the same or a similar sound. This requires a fast way to look up similar sounds, i.e., interlingual homophones.

  1. First, we need to evaluate existing datasets and how well various LLMs (GPT-4o, Llama 4, Mistral, Gemini) handle this task: given a word with a specific sound or syllable highlighted, find words with similar sounds in other languages. There is also some existing literature on searching for cognates and/or interlingual homophones.
  2. Next, we need to explore taking regular dictionaries or lists of common words in various languages, using grapheme-to-phoneme (G2P) conversion to get their pronunciations in IPA, and then searching for shared sounds by matching IPA. Existing G2P models such as ByT5 are worth exploring, as are pronunciation dictionaries such as CMUdict, which already include phonetic information and so avoid approximating with a G2P model.
  3. Ideally, the final dataset will also account for the dialects of each language, so that interlingual homophones can be provided in the user's native dialect. For this, it may be useful to explore datasets that cover many dialects (e.g., Speech Accent Archive and Common Voice) and to use a Speech2IPA model to obtain phonetic transcriptions rather than approximating with G2P.
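
As a rough illustration of the IPA-matching idea in step 2, here is a minimal sketch of an inverted index from IPA segment n-grams to words, which would make lookups fast. Everything in it is an assumption for illustration: the toy `LEXICON`, the hand-written transcriptions (not produced by any G2P model or dictionary), the function names, and the choice to treat long and short vowels as similar by dropping the length mark. A real pipeline would likely need a more careful notion of phonetic similarity.

```python
from collections import defaultdict

# Hypothetical toy lexicon: (language, word) -> list of IPA segments.
# Transcriptions are hand-written for illustration only.
LEXICON = {
    ("en", "see"): ["s", "iː"],
    ("en", "shoe"): ["ʃ", "uː"],
    ("es", "sí"): ["s", "i"],
    ("fr", "chou"): ["ʃ", "u"],
    ("de", "Schuh"): ["ʃ", "uː"],
}


def normalize(segment):
    """Drop the IPA length mark so long and short vowels compare as similar."""
    return segment.replace("ː", "")


def ngrams(segments, n):
    """Return all contiguous runs of n segments as tuples."""
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]


def build_index(lexicon, n=2):
    """Map each normalized segment n-gram to the (language, word) pairs containing it."""
    index = defaultdict(set)
    for (lang, word), segments in lexicon.items():
        normalized = [normalize(s) for s in segments]
        for gram in ngrams(normalized, n):
            index[gram].add((lang, word))
    return index


def find_candidates(index, segments, target_lang, n=2):
    """Find words in target_lang that share at least one n-gram with the query sound."""
    normalized = [normalize(s) for s in segments]
    hits = set()
    for gram in ngrams(normalized, n):
        for lang, word in index.get(gram, ()):
            if lang == target_lang:
                hits.add((lang, word))
    return sorted(hits)


index = build_index(LEXICON)
# A French speaker struggling with the sound in English "shoe":
print(find_candidates(index, ["ʃ", "uː"], "fr"))
```

Because the index is built once, each query only touches the n-grams of the highlighted sound, so lookup cost does not grow with the size of the lexicon. Swapping the toy lexicon for G2P output or dictionary entries would leave the lookup logic unchanged.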

Metadata

    Labels

    enhancement, help wanted