Improved Explanations: Interlingual Homophones Dataset

**Interlingual homophones** are words that sound the same or similar across languages.

When users struggle to produce a sound in a non-native language, we want to show them words from their native language that contain the same or a similar sound. This requires a fast way to look up similar sounds across languages, i.e., interlingual homophones.

1. We first need to evaluate existing datasets and how well various LLMs (GPT-4o, Llama 4, Mistral, Gemini) handle taking a word with a specific sound/syllable highlighted and finding similar sounds across languages. There is also some existing literature on searching for cognates and/or interlingual homophones that should be reviewed.
2. We then need to explore taking regular dictionaries or lists of common words in various languages, using [g2p](https://deepgram.com/ai-glossary/grapheme-to-phoneme-conversion-g2p) to get their pronunciation in IPA, and then searching for common sounds by looking for matching IPA. Existing g2p models to explore include [ByT5](https://www.isca-archive.org/interspeech_2022/zhu22_interspeech.pdf). There are also pronunciation dictionaries that already include phonetic information, so no approximation with a g2p model is needed, e.g., [CMU Dict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict).
3. Ideally, the final dataset will also take into account the various dialects of each language, so that interlingual homophones can be provided in the user's native dialect. For this, it might be useful to explore datasets that cover many dialects (e.g., Speech Accent Archive and Common Voice) and to use [a Speech2IPA model](https://huggingface.co/KoelLabs/xlsr-english-01) to obtain the phonetic transcription rather than approximating with g2p.
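
The IPA-matching idea in step 2 could be prototyped roughly as follows. This is only a minimal sketch under stated assumptions, not a proposed implementation: the toy `LEXICON` is hand-written with approximate transcriptions (in practice it would come from a g2p model or a pronunciation dictionary), and `normalize`, `build_index`, `edit_distance`, and `find_homophones` are hypothetical helper names. Exact matches use a dictionary lookup, and near matches fall back to an edit distance over phoneme segments.

```python
from collections import defaultdict

# Hypothetical toy lexicon: (language, word, IPA segments).
# Hand-written, approximate transcriptions for illustration only.
LEXICON = [
    ("en", "see", ["s", "iː"]),
    ("es", "sí", ["s", "i"]),
    ("fr", "si", ["s", "i"]),
    ("en", "two", ["t", "uː"]),
    ("fr", "tout", ["t", "u"]),
    ("es", "tú", ["t", "u"]),
    ("de", "du", ["d", "uː"]),
]


def normalize(segments):
    """Drop the IPA length mark so 'iː' and 'i' compare equal (a coarse assumption)."""
    return tuple(seg.replace("ː", "") for seg in segments)


def build_index(lexicon):
    """Map each normalized phoneme sequence to the (language, word) pairs that have it."""
    index = defaultdict(list)
    for lang, word, segments in lexicon:
        index[normalize(segments)].append((lang, word))
    return index


def edit_distance(a, b):
    """Levenshtein distance computed over phoneme segments instead of characters."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def find_homophones(index, segments, target_lang, max_distance=0):
    """Return sorted (distance, word) pairs in target_lang within max_distance of the query."""
    query = normalize(segments)
    if max_distance == 0:
        # Fast path: exact match is a single dictionary lookup.
        return sorted((0, w) for lang, w in index.get(query, []) if lang == target_lang)
    hits = []
    for key, entries in index.items():
        d = edit_distance(query, key)
        if d <= max_distance:
            hits.extend((d, w) for lang, w in entries if lang == target_lang)
    return sorted(hits)


INDEX = build_index(LEXICON)
print(find_homophones(INDEX, ["s", "iː"], "fr"))                  # exact match
print(find_homophones(INDEX, ["t", "uː"], "de", max_distance=1))  # near match
```

The linear scan in the near-match branch would not scale to full dictionaries. A real dataset would likely need a phoneme n-gram index or a feature-based phonetic distance, and matching a highlighted syllable inside a longer word would require substring rather than whole-word matching.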

