The goal of metaphonebr is to simplify Brazilian names phonetically using a custom metaphoneBR algorithm that preserves ending vowels. It was created to aid dataset pairing when no unambiguous keys are available.
The stable version of the package can be installed with:
install.packages("metaphonebr")

You can install the development version of metaphonebr from GitHub with:
# install.packages("remotes")
remotes::install_github("ipeadata-lab/metaphonebr")

This is a basic example which shows how to use the main function:
example_names <- c("João da Silva", "Maria", "Marya",
"Helena", "Elena", "Philippe", "Filipe", "Xavier", "Chavier")
phonetic_codes <- metaphonebr::metaphonebr(example_names)
print(data.frame(original = example_names, metaphonebr = phonetic_codes))

- Initial Cleanup & Preparation:
- Remove all diacritics (e.g., “João” becomes “Joao”).
- Convert the entire string to uppercase (e.g., “Joao” becomes “JOAO”).
- Remove all characters that are not uppercase letters (A-Z) or spaces.
- Ensure single spaces between words and trim leading/trailing whitespace.
- Silent Letter Removal:
- Remove a silent ‘H’ if it appears at the beginning of any word (e.g., “Helena” becomes “Elena”).
- Digraph Simplification (Sound Grouping):
- LH is replaced by 1 (representing a palatal lateral approximant, like in “Filha” -> “FI1A”).
- NH is replaced by 3 (representing a palatal nasal, like in “Manhã” -> “MA3A”).
- CH is replaced by X (representing the /ʃ/ sound, like in “Chico” -> “XICO”).
- SH is replaced by X (for foreign names with /ʃ/ sound, like in “Shirley” -> “XIRLEY”).
- SCH is replaced by X (approximating /ʃ/ or /sk/, like in “Schmidt” -> “XMIT”).
- PH is replaced by F (like in “Philip” -> “FILIP”).
- SC followed by E or I becomes S (like in “SCENA” -> “SENA”).
- SC followed by A, O, or U becomes SK (like in “ESCOVA” -> “ESKOVA”).
- QU or QÜ followed by E or I becomes K (e.g., “QUEIJO” -> “KEIJO”).
- GU or GÜ followed by E or I becomes G (the U is silent, e.g., “GUERRA” -> “GERRA”).
- Any remaining QU becomes K (e.g., “QUANTO” -> “KANTO”).
- Similar Consonant Simplification:
- Ç is replaced by S.
- C followed by E or I is replaced by S (like in “CELSO” -> “SELSO”).
- Any other C (not part of an already transformed digraph like CH or SC) is replaced by K (like in “CARLOS” -> “KARLOS”).
- G followed by E or I is replaced by J (like in “GELO” -> “JELO”; GUE/GUI already handled).
- Any remaining Q (that wasn’t part of QU) is replaced by K.
- W is replaced by V (common Brazilian Portuguese pronunciation, e.g., “WALTER” -> “VALTER”).
- Y is replaced by I (e.g., “YARA” -> “IARA”).
- Z is replaced by S (e.g., “ZEBRA” -> “SEBRA”).
- X preceded by S has the X removed (e.g., “EXCELENTE” -> “ESELENTE”, to avoid a double /s/ representation from SKS).
- Terminal Nasal Sound Simplification:
- A word-final N is replaced by M (e.g., “JOAQUIN” -> “JOAQUIM”).
- A word-final AO is replaced by OM (e.g., “JOÃO” -> “JOOM”).
- A word-final ÃES is replaced by AES (e.g., “MÃES” -> “MAES”).
- Duplicate Vowel Removal:
- Sequences of identical adjacent vowels are reduced to a single vowel (e.g., “AARAO” -> “ARAO”).
- Final Cleanup (Duplicate Letters & Spaces):
- Sequences of identical adjacent letters (except if they are part of the special codes 1 for LH or 3 for NH) are reduced to a single letter (e.g., “CARRO” might become “CARO”, “LESSA” becomes “LESA”). Note: this rule simplifies sounds like ‘RR’ and ‘SS’ to their single counterparts, which is a common Metaphone-style simplification.
- Ensure single spaces between any remaining words and trim leading/trailing whitespace again.
The resulting code is an attempt to represent the phonetic signature of the name in a simplified, standardized way for a Brazilian Portuguese context. In particular, by construction it preserves ending vowels, since they generally carry gender information in Brazilian names (e.g., ADRIANO and ADRIANA).
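
To make the rules above more concrete, here is a minimal base-R sketch of a subset of them. It is not the package's implementation: `sketch_code()` is a hypothetical helper written for illustration, it assumes the input is already plain ASCII (diacritic removal is omitted), it skips several rules (SC, GU, G before E/I, the X rule), and the order of the substitutions is a choice made for this sketch (for instance, SCH is handled before CH so the longer digraph wins).

```r
# Hypothetical helper, NOT part of the metaphonebr API: a partial sketch
# of some of the rules described above, for plain-ASCII input only.
sketch_code <- function(x) {
  # Cleanup: uppercase, keep only A-Z and spaces, normalise whitespace
  x <- toupper(x)
  x <- gsub("[^A-Z ]", "", x)
  x <- trimws(gsub(" +", " ", x))

  # Silent letter: drop H at the start of any word
  x <- gsub("\\bH", "", x, perl = TRUE)

  # Digraphs (SCH first, so that CH does not consume part of it)
  x <- gsub("SCH", "X", x)
  x <- gsub("LH", "1", x)
  x <- gsub("NH", "3", x)
  x <- gsub("CH|SH", "X", x)
  x <- gsub("PH", "F", x)
  x <- gsub("QU", "K", x)  # QU becomes K both before E/I and elsewhere

  # Similar consonants
  x <- gsub("C(?=[EI])", "S", x, perl = TRUE)  # soft C
  x <- gsub("C", "K", x)                       # any remaining C
  x <- gsub("W", "V", x)
  x <- gsub("Y", "I", x)
  x <- gsub("Z", "S", x)

  # Terminal nasal sounds
  x <- gsub("N\\b", "M", x, perl = TRUE)
  x <- gsub("AO\\b", "OM", x, perl = TRUE)

  # Collapse repeated letters; the digit codes 1 and 3 are not matched
  gsub("([A-Z])\\1+", "\\1", x, perl = TRUE)
}

sketch_code(c("Philippe", "Filipe"))  # both give "FILIPE"
sketch_code("Joao da Silva")          # "JOM DA SILVA"

# Using the code as an approximate join key between two name lists
a <- data.frame(name = c("Philippe", "Marya", "Adriano"), stringsAsFactors = FALSE)
b <- data.frame(name = c("Filipe", "Maria", "Adriana"), stringsAsFactors = FALSE)
a$key <- sketch_code(a$name)
b$key <- sketch_code(b$name)
matched <- merge(a, b, by = "key", suffixes = c("_a", "_b"))
print(matched)
```

In this sketch the spelling variants pair up, while ADRIANO and ADRIANA stay apart because the final vowel is kept. For real data, use `metaphonebr::metaphonebr()`, which covers the full rule set including diacritics.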
metaphonebr is developed by a team of researchers at Instituto de Pesquisa Econômica Aplicada (Ipea).
