forked from jhasegaw/phonecodes
Closed
Description
The arpabet2ipa() function incorrectly places stress marks on the first vowel in a word instead of the vowel that has the stress digit (1 or 2) in the ARPABET input.
Root Cause
The bug occurs in the attach_tones_to_vowels() function when combined with how ARPABET stress markers are converted.
The problem:
- When translate_string() converts ARPABET like ER1 (stressed "er" sound), it produces tokens ['ɝ', 'ˈ'], with the stress marker AFTER the vowel in the list.
- The attach_tones_to_vowels() function searches backward (searchstep=-1) to find a vowel to attach each stress marker to.
- When it finds a stress marker (e.g., at position 8), it searches backward for a vowel.
- It encounters the vowel that the stress CAME FROM (e.g., ɝ at position 7), but since the stress marker is after it in the list, the backward search continues past it to find the previous vowel (e.g., æ at position 0).
- The stress gets attached to the wrong vowel.
Problematic Code
From phonecodes.py lines 48-62:

    def attach_tones_to_vowels(il, tones, vowels, searchstep, catdir):
        """Return a copy of il, with each tone attached to nearest vowel if any.
        searchstep=1 means search for next vowel, searchstep=-1 means prev vowel.
        catdir>=0 means concatenate after vowel, catdir<0 means cat before vowel.
        Tones are not combined, except those also included in the vowels set.
        """
        ol = il.copy()
        v = 0 if searchstep > 0 else len(ol) - 1
        t = -1
        while 0 <= v and v < len(ol):
            if (ol[v] in vowels or (len(ol[v]) > 1 and ol[v][0] in vowels)) and t >= 0:
                ol[v] = ol[v] + ol[t] if catdir >= 0 else ol[t] + ol[v]
                ol = ol[0:t] + ol[(t + 1) :]  # Remove the tone
                t = -1  # Done with that tone
            if v < len(ol) and ol[v] in tones:
                t = v
            v += searchstep
        return ol

When searching backward (searchstep=-1):
- The algorithm finds a stress marker at some position t.
- It continues decrementing v to find a vowel.
- Bug: When the stress marker appears RIGHT AFTER its vowel in the token list, the vowel at position v = t - 1 is skipped, and the stress attaches to an earlier vowel.
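One way to check this trace is to run the quoted function in isolation on a hand-built token list. The sketch below is only that: the tone and vowel sets are small stand-ins invented for illustration (not the library's actual tables), the token list assumes translate_string() emits ['ɝ', 'ˈ'] for ER1 as described, and searchstep=-1, catdir=-1 are assumed to be the arguments arpabet2ipa() uses. Under those assumptions, the backward search attaches the marker to ɝ when ɝ is in the vowel set, and falls through to æ only when ɝ is missing from it, so it may be worth confirming whether ɝ is present in the vowel table that is actually passed in.

```python
def attach_tones_to_vowels(il, tones, vowels, searchstep, catdir):
    """Copy of the function quoted above, so this sketch is self-contained."""
    ol = il.copy()
    v = 0 if searchstep > 0 else len(ol) - 1
    t = -1
    while 0 <= v and v < len(ol):
        if (ol[v] in vowels or (len(ol[v]) > 1 and ol[v][0] in vowels)) and t >= 0:
            ol[v] = ol[v] + ol[t] if catdir >= 0 else ol[t] + ol[v]
            ol = ol[0:t] + ol[(t + 1):]  # Remove the tone
            t = -1  # Done with that tone
        if v < len(ol) and ol[v] in tones:
            t = v
        v += searchstep
    return ol


# Stand-in sets for illustration only (assumptions, not the library's tables).
TONES = {"ˈ", "ˌ"}
VOWELS_WITH_ER = {"æ", "ɝ", "ə"}
VOWELS_WITHOUT_ER = {"æ", "ə"}

# Hand-built tokens for "AE0 D V ER1 T", assuming ER1 becomes ['ɝ', 'ˈ'].
tokens = ["æ", "d", "v", "ɝ", "ˈ", "t"]

# searchstep=-1 (search backward), catdir=-1 (put the marker before the vowel).
with_er = attach_tones_to_vowels(tokens, TONES, VOWELS_WITH_ER, -1, -1)
without_er = attach_tones_to_vowels(tokens, TONES, VOWELS_WITHOUT_ER, -1, -1)

print(" ".join(with_er))     # æ d v ˈɝ t
print(" ".join(without_er))  # ˈæ d v ɝ t
```

The second result matches the misplaced stress reported in this issue, which suggests checking the vowel set alongside the token order.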
Minimal Reproducible Example

    from phonecodes import phonecodes as pc

    # Test case: ARPABET with stress on second vowel (ER1)
    arpabet = "AE0 D V ER1 T AH0 Z M AH0 N T"
    #          ^^0     ^^1   ^^0     ^^0
    #          (no) (PRIMARY!) (no)   (no)
    ipa = pc.arpabet2ipa(arpabet)
    print(f"Input: {arpabet}")
    print(f"Output: {ipa}")
    print()

    # Expected: stress on ɝ (from ER1)
    # Actual: stress on æ (from AE0) ❌
    tokens = ipa.split()
    print("Token breakdown:")
    for i, token in enumerate(tokens):
        stress_marker = " ← STRESS" if 'ˈ' in token or 'ˌ' in token else ""
        print(f" {i}: {token}{stress_marker}")

Output:

    Input: AE0 D V ER1 T AH0 Z M AH0 N T
    Output: ˈæ d v ɝ t ə z m ə n t

    Token breakdown:
     0: ˈæ ← STRESS
     1: d
     2: v
     3: ɝ
     4: t
     5: ə
     ...

Expected output: æ d v ˈɝ t ə z m ə n t (stress on ɝ, the vowel from ER1)
Actual output: ˈæ d v ɝ t ə z m ə n t (stress on æ, the vowel from AE0)
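The mismatch between expected and actual output can also be checked mechanically, which may be useful as a regression test. The sketch below is a minimal example under stated assumptions: the helper names arpabet_stress() and ipa_stress() are hypothetical (they are not part of phonecodes), and it assumes the IPA output has one space-separated token per ARPABET symbol, as in the output shown above.

```python
# ARPABET stress digit -> IPA stress mark (0 means unstressed, so it is omitted).
ARPABET_TO_MARK = {"1": "ˈ", "2": "ˌ"}


def arpabet_stress(arpabet):
    """Return (index, mark) for each ARPABET symbol ending in stress digit 1 or 2."""
    return [
        (i, ARPABET_TO_MARK[sym[-1]])
        for i, sym in enumerate(arpabet.split())
        if sym[-1] in ARPABET_TO_MARK
    ]


def ipa_stress(ipa):
    """Return (index, mark) for each IPA token that contains a stress mark."""
    found = []
    for i, tok in enumerate(ipa.split()):
        for mark in ("ˈ", "ˌ"):
            if mark in tok:
                found.append((i, mark))
    return found


arpabet = "AE0 D V ER1 T AH0 Z M AH0 N T"
expected = "æ d v ˈɝ t ə z m ə n t"
actual = "ˈæ d v ɝ t ə z m ə n t"  # output reported in this issue

print(arpabet_stress(arpabet) == ipa_stress(expected))  # True
print(arpabet_stress(arpabet) == ipa_stress(actual))    # False
```

With the library installed, the hard-coded actual string could be replaced by the result of pc.arpabet2ipa(arpabet).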
Additional Test Cases

    # Works correctly with AE1 (stress on first vowel)
    print(pc.arpabet2ipa("AE1 D V ER0 T"))
    # Output: ˈæ d v ɚ t ✅ Correct!

    # Works correctly with two stresses
    print(pc.arpabet2ipa("AE1 D V ER1 T"))
    # Output: ˈæ d v ɝˈ t
    # Note: Second stress appears AFTER ɝ, showing the order issue

    # Fails with single stress on later vowel
    print(pc.arpabet2ipa("AE0 ER1 T"))
    # Output: ˈæ ɝ t ❌ Wrong! Should be: æ ˈɝ t

Impact
This bug affects any ARPABET conversion where:
- The word has exactly one stress marker (most common case)
- The stress is NOT on the first vowel
As a result, users relying on this library for CMUDict→IPA conversion will get incorrect stress placement for every word that meets both conditions.
Expected Behavior
Stress markers should be placed on the vowel that has the stress digit (1 or 2) in the original ARPABET, not redistributed to other vowels.
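As an illustration of this behavior, the sketch below attaches the stress mark while each ARPABET symbol is being converted, so no later search for a vowel is needed. It is not the library's implementation or a proposed patch: the function name arpabet_to_ipa_per_token() and the tiny TOY_TABLE and REDUCED mappings are hypothetical, covering only the symbols used in this issue.

```python
# Toy mappings for illustration only (assumptions, not the phonecodes tables).
STRESS_MARK = {"0": "", "1": "ˈ", "2": "ˌ"}
TOY_TABLE = {
    "AE": "æ", "ER": "ɝ", "AH": "ʌ",
    "D": "d", "V": "v", "T": "t", "Z": "z", "M": "m", "N": "n",
}
REDUCED = {"ER": "ɚ", "AH": "ə"}  # variants used when the stress digit is 0


def arpabet_to_ipa_per_token(arpabet):
    """Convert space-separated ARPABET, placing each stress mark on its own vowel."""
    out = []
    for sym in arpabet.split():
        # Split a trailing stress digit (0, 1, 2) from the phone name, if present.
        if sym[-1].isdigit():
            base, digit = sym[:-1], sym[-1]
        else:
            base, digit = sym, ""
        # Unstressed ER and AH use their reduced IPA forms.
        if digit == "0" and base in REDUCED:
            ipa = REDUCED[base]
        else:
            ipa = TOY_TABLE[base]
        # The mark is prefixed to the vowel it came from, never moved elsewhere.
        out.append(STRESS_MARK.get(digit, "") + ipa)
    return " ".join(out)


print(arpabet_to_ipa_per_token("AE0 D V ER1 T AH0 Z M AH0 N T"))  # æ d v ˈɝ t ə z m ə n t
print(arpabet_to_ipa_per_token("AE0 ER1 T"))                      # æ ˈɝ t
```

Because the mark is attached at conversion time, its position depends only on the symbol that carried the stress digit.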