ARPABET to IPA conversion places stress marks on wrong vowels when stress marker appears after vowel in token list #15

@vassiliphilippov

Description

The arpabet2ipa() function incorrectly places stress marks on the first vowel in a word instead of the vowel that has the stress digit (1 or 2) in the ARPABET input.

Root Cause

The bug occurs in the attach_tones_to_vowels() function when combined with how ARPABET stress markers are converted.

The problem:

  1. When translate_string() converts ARPABET like ER1 (stressed "er" sound), it produces tokens ['ɝ', 'ˈ'] - with the stress marker AFTER the vowel in the list
  2. The attach_tones_to_vowels() function searches backward (searchstep=-1) to find a vowel to attach each stress marker to
  3. When it finds a stress marker (e.g., at position 8), it searches backward for a vowel
  4. It encounters the vowel that the stress CAME FROM (e.g., ɝ at position 7), but since the stress marker is after it in the list, the backward search continues past it to find the previous vowel (e.g., æ at position 0)
  5. The stress gets attached to the wrong vowel

Problematic Code

From phonecodes.py lines 48-62:

def attach_tones_to_vowels(il, tones, vowels, searchstep, catdir):
    """Return a copy of il, with each tone attached to nearest vowel if any.
    searchstep=1 means search for next vowel, searchstep=-1 means prev vowel.
    catdir>=0 means concatenate after vowel, catdir<0 means cat before vowel.
    Tones are not combined, except those also included in the vowels set.
    """
    ol = il.copy()
    v = 0 if searchstep > 0 else len(ol) - 1
    t = -1
    while 0 <= v and v < len(ol):
        if (ol[v] in vowels or (len(ol[v]) > 1 and ol[v][0] in vowels)) and t >= 0:
            ol[v] = ol[v] + ol[t] if catdir >= 0 else ol[t] + ol[v]
            ol = ol[0:t] + ol[(t + 1) :]  # Remove the tone
            t = -1  # Done with that tone
        if v < len(ol) and ol[v] in tones:
            t = v
        v += searchstep
    return ol

When searching backward (searchstep=-1):

  • The algorithm finds a stress marker at some position t
  • It continues decrementing v to find a vowel
  • Bug: When the stress marker appears RIGHT AFTER its vowel in the token list, the vowel at position v = t - 1 is skipped, and the stress attaches to an earlier vowel
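The backward search can be traced by running the quoted function on a small hand-built token list. This is only a sketch: the token list, the tone set, both vowel sets, and the `searchstep=-1, catdir=-1` arguments are assumptions for illustration, not values read from the library's tables or from `arpabet2ipa()` itself.

```python
def attach_tones_to_vowels(il, tones, vowels, searchstep, catdir):
    # Body copied unchanged from the snippet quoted in this issue.
    ol = il.copy()
    v = 0 if searchstep > 0 else len(ol) - 1
    t = -1
    while 0 <= v and v < len(ol):
        if (ol[v] in vowels or (len(ol[v]) > 1 and ol[v][0] in vowels)) and t >= 0:
            ol[v] = ol[v] + ol[t] if catdir >= 0 else ol[t] + ol[v]
            ol = ol[0:t] + ol[(t + 1) :]  # Remove the tone
            t = -1  # Done with that tone
        if v < len(ol) and ol[v] in tones:
            t = v
        v += searchstep
    return ol


# Assumed inputs (hypothetical, not the library's real tables).
TONES = {"ˈ", "ˌ"}
tokens = ["æ", "ɝ", "ˈ", "t"]  # "AE0 ER1 T", stress marker AFTER its vowel

# Backward search, marker prefixed to the vowel it lands on.
with_er = attach_tones_to_vowels(tokens, TONES, {"æ", "ɝ", "ə"}, -1, -1)
without_er = attach_tones_to_vowels(tokens, TONES, {"æ", "ə"}, -1, -1)

print(with_er)     # ['æ', 'ˈɝ', 't']
print(without_er)  # ['ˈæ', 'ɝ', 't']
```

In this sketch the marker drifts to æ only when ɝ is missing from the vowel set passed in. When ɝ is present, the backward search attaches the marker to it. It may therefore be worth checking whether the vowel table used by `arpabet2ipa()` actually contains ɝ, since that alone would reproduce the reported output.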

Minimal Reproducible Example

from phonecodes import phonecodes as pc

# Test case: ARPABET with stress on second vowel (ER1)
arpabet = "AE0 D V ER1 T AH0 Z M AH0 N T"
#          ^^0     ^^1   ^^0     ^^0
# Only ER1 carries primary stress; AE0 and both AH0 are unstressed.

ipa = pc.arpabet2ipa(arpabet)
print(f"Input:  {arpabet}")
print(f"Output: {ipa}")
print()

# Expected: stress on ɝ (from ER1)
# Actual:   stress on æ (from AE0) ❌

tokens = ipa.split()
print("Token breakdown:")
for i, token in enumerate(tokens):
    stress_marker = " ← STRESS" if 'ˈ' in token or 'ˌ' in token else ""
    print(f"  {i}: {token}{stress_marker}")

Output:

Input:  AE0 D V ER1 T AH0 Z M AH0 N T
Output: ˈæ d v ɝ t ə z m ə n t

Token breakdown:
  0: ˈæ ← STRESS
  1: d
  2: v
  3: ɝ
  4: t
  5: ə
  ...

Expected output: æ d v ˈɝ t ə z m ə n t (stress on ɝ, the vowel from ER1)

Actual output: ˈæ d v ɝ t ə z m ə n t (stress on æ, the vowel from AE0)

Additional Test Cases

# Works correctly with AE1 (stress on first vowel)
print(pc.arpabet2ipa("AE1 D V ER0 T"))  
# Output: ˈæ d v ɚ t ✅ Correct!

# Two stresses: the first is placed correctly, the second is not
print(pc.arpabet2ipa("AE1 D V ER1 T"))
# Output: ˈæ d v ɝˈ t
# Note: the second stress is left AFTER ɝ instead of being prefixed to it, showing the order issue

# Fails with single stress on later vowel
print(pc.arpabet2ipa("AE0 ER1 T"))
# Output: ˈæ ɝ t ❌ Wrong! Should be: æ ˈɝ t
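The cases above can be checked mechanically without reading the output by eye. The helper below is a hypothetical sketch, not part of phonecodes. It assumes the output is space-separated with one IPA token per ARPABET phone, as in the outputs shown in this issue, and it only compares positions (it does not distinguish primary from secondary stress).

```python
PRIMARY, SECONDARY = "ˈ", "ˌ"


def arpabet_stress_positions(arpabet):
    """Indices of phones whose ARPABET code ends in stress digit 1 or 2."""
    return [i for i, phone in enumerate(arpabet.split()) if phone[-1] in "12"]


def ipa_stress_positions(ipa):
    """Indices of IPA tokens that contain a stress mark."""
    return [
        i for i, token in enumerate(ipa.split())
        if PRIMARY in token or SECONDARY in token
    ]


def stress_preserved(arpabet, ipa):
    """True if stress marks sit on the same phone positions in both strings."""
    return arpabet_stress_positions(arpabet) == ipa_stress_positions(ipa)


# Input/output pairs copied from the cases reported above.
print(stress_preserved("AE1 D V ER0 T", "ˈæ d v ɚ t"))  # True
print(stress_preserved("AE0 ER1 T", "ˈæ ɝ t"))          # False (reported output)
print(stress_preserved("AE0 ER1 T", "æ ˈɝ t"))          # True (expected output)
```

A check like this could serve as a regression test once the conversion is fixed, with `pc.arpabet2ipa(...)` supplying the second argument.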

Impact

This bug affects any ARPABET conversion where:

  • The word has exactly one stress marker (most common case)
  • The stress is NOT on the first vowel

Users relying on this library for CMUDict→IPA conversion will get incorrect stress placement for every word that meets both conditions, which includes many common English words.

Expected Behavior

Stress markers should be placed on the vowel that has the stress digit (1 or 2) in the original ARPABET, not redistributed to other vowels.
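One way to get this behavior is to decide the stress mark while each phone is being translated, so no later search is needed. The sketch below is an illustration under stated assumptions, not a proposed patch. The phone table is hypothetical and deliberately tiny, and it ignores stress-dependent vowel quality (for example, the library output above shows ER0 as ɚ, which this sketch does not reproduce).

```python
# Hypothetical mini table; the library's real tables are much larger.
ARPABET_TO_IPA = {
    "AE": "æ", "ER": "ɝ", "AH": "ə",
    "D": "d", "V": "v", "T": "t", "Z": "z", "M": "m", "N": "n",
}
STRESS_MARK = {"0": "", "1": "ˈ", "2": "ˌ"}


def arpabet_to_ipa_sketch(arpabet):
    """Convert space-separated ARPABET, prefixing stress to its own vowel."""
    out = []
    for phone in arpabet.split():
        if phone[-1].isdigit():
            base, digit = phone[:-1], phone[-1]  # vowel with a stress digit
        else:
            base, digit = phone, "0"  # consonant, no stress
        out.append(STRESS_MARK[digit] + ARPABET_TO_IPA[base])
    return " ".join(out)


print(arpabet_to_ipa_sketch("AE0 D V ER1 T AH0 Z M AH0 N T"))
# æ d v ˈɝ t ə z m ə n t
```

Because the mark is attached to the phone that carried the digit, it cannot move to another vowel.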
