Support for IPA remappings and better maintainability #19


Merged

ginic merged 12 commits into master from ipa_remappings

Nov 25, 2025

CHANGELOG.md

@@ Expand Up @@
     # [Unreleased]
+    # [2.0.0] - 11/25/2025
+    ### Added
+    - A Phonecodes enum class to the phonecodes module, to enforce valid conversion and language pairs more explicitly.
+    - Support for post-processing after conversion to/from IPA is performed, to allow for reduction to a shared symbol set. This is useful, for example, to apply the standard TIMIT symbol reduction or to map Buckeye and TIMIT to a shared symbol set.
+    ### Changed
+    - All codeA2codeB conversion functions in phonecodes now rely on the convert function, which should increase maintainability and reusability of the code.
+    ### Fixed
+    - Added missing ARPABET IPA vowels (diphthongs and r-colored vowels) to the set of IPA vowels in phonecode_tables, so that stress markers would be added correctly.  Fixes https://github.com/ginic/phonecodes/issues/15.
     # [1.2.3] - 10/23/2025
     ### Changed
     - Added python 3.14 to package and pytest GitHub actions
@@ Expand Down @@
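
The changelog entry about the enum can be pictured with a minimal sketch. This is not the class added in the PR: `PhonecodeSketch`, its member names, and `validate_pair` are hypothetical, the values are taken from the `phonecodes.CODES` set shown in the README, and the rule that one side must be IPA is an assumption based on the README's "to or from IPA" wording.

```python
from enum import Enum


class PhonecodeSketch(str, Enum):
    """Hypothetical stand-in for the Phonecodes enum; member names are assumed."""

    ARPABET = "arpabet"
    BUCKEYE = "buckeye"
    CALLHOME = "callhome"
    DISC = "disc"
    IPA = "ipa"
    TIMIT = "timit"
    XSAMPA = "xsampa"


def validate_pair(source, target):
    """Coerce both codes to enum members and require IPA on one side (assumed rule)."""
    # Enum lookup by value raises ValueError for an unknown code name.
    src, tgt = PhonecodeSketch(source), PhonecodeSketch(target)
    if PhonecodeSketch.IPA not in (src, tgt):
        raise ValueError(f"one of {src.value!r} and {tgt.value!r} must be 'ipa'")
    return src, tgt
```

Failing early on an unknown or unsupported pair is the kind of explicit check the entry describes; a plain string argument would only fail later, at table lookup.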

README.md

@@ Expand Up @@
     # Basic Usage
     ## Converting between Phonetic Alphabets
     If you want to convert between IPA and some other phonetic code, use `phonecodes.phonecodes` as follows:
-    ```
+    ```python
     >>> from phonecodes import phonecodes
     >>> print(phonecodes.CODES) # available phonetic alphabets
     {'arpabet', 'buckeye', 'ipa', 'timit', 'callhome', 'xsampa', 'disc'}
@@ Expand All @@
     - DISC/CELEX: Dutch `'nld'`, English `'eng'`, German `'deu'`. Uses German if unspecified.
     - Callhome: Spanish `'spa'`, Egyptian Arabic `'arz'`, Mandarin Chinese `'cmn'`. You MUST specify an appropriate language code or you'll get a KeyError.
+    ## Additional post-processing
+    An additional use case when converting between phonecodes is to normalize the final mapping to a subset of IPA symbols. This is useful if you are collapsing similar sounds together to a reduced symbol inventory or if you are standardizing two corpora with different IPA inventories/conventions to a shared subset.
+    We support this use case through the `post_conversion_mapping` keyword argument, an optional dictionary remapping accepted by all phonecodes conversion functions. You can provide a custom mapping. Be aware that the remapping algorithm is greedy and proceeds in the order that keys appear in the dictionary, and that a diacritic is only remapped when it appears together with its base symbol in a mapping key.
+    Additionally, we provide IPA-to-IPA post-processing dictionary mappings in `phonecodes.phonecode_tables`:
+    - `phonecodes.phonecode_tables.STANDARD_TIMIT_IPA_REDUCTION`: The 'standard' TIMIT label reduction used in Lee and Hon (1989) that reduces the original 61 TIMIT phonetic labels to 39 categories. This reduction is widely used in the speech recognition community.
+    - `phonecodes.phonecode_tables.BUCKEYE_IPA_TO_TIMIT_BUCKEYE_SHARED` and `phonecodes.phonecode_tables.TIMIT_IPA_TO_TIMIT_BUCKEYE_SHARED`: A conservative reduction from the Buckeye and TIMIT IPA inventories, respectively, to a shared symbol set. This maps nasalized vowels to their non-nasalized versions, nasalized flaps ('ɾ̃') to 'n', r-colored vowels ('ɚ', 'ɝ') to syllabic r ('ɹ̩'), and normalizes variants of 'ʌ' and schwa to schwa ('ə').
+    ```python
+    >>> from phonecodes import phonecodes
+    # Conversion from Buckeye to IPA using the original published Buckeye mapping
+    >>> phonecodes.convert("B AHN NX AAN NX AH", "buckeye", "ipa")
+    'b ʌ̃ ɾ̃ ɑ̃ ɾ̃ ʌ'
+    # Conversion from Buckeye to IPA with postprocessing to an IPA inventory shared with TIMIT
+    >>> phonecodes.convert("B AHN NX AAN NX AH", "buckeye", "ipa", post_conversion_mapping = phonecodes.phonecode_tables.BUCKEYE_IPA_TO_TIMIT_BUCKEYE_SHARED)
+    'b ə n ɑ n ə'
+    # Custom mapping example - note that the nasalized diacritics are not affected by the remapping
+    >>> phonecodes.convert("B AHN NX AAN NX AH", "buckeye", "ipa", post_conversion_mapping = {'ʌ':'ə'})
+    'b ə̃ ɾ̃ ɑ̃ ɾ̃ ə'
+    ```
     ## Reading Corpus Files
     If you are working with specific corpora, you can also convert between certain corpus formats as follows:
@@ Expand Down @@
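
The README's remark that the remapping is greedy and order-dependent can be illustrated without the library. The sketch below assumes the remapping behaves like successive substring replacement in dictionary insertion order; `apply_post_mapping` is a hypothetical helper, not a function from `phonecodes`.

```python
def apply_post_mapping(ipa_text, mapping):
    """Apply each key -> value replacement in dictionary insertion order."""
    for old, new in mapping.items():
        ipa_text = ipa_text.replace(old, new)
    return ipa_text


TILDE = "\u0303"  # combining tilde (nasalization)
TURNED_V = "\u028c"  # ʌ
SCHWA = "\u0259"  # ə
SCRIPT_A = "\u0251"  # ɑ

# "b", nasalized ʌ, nasalized ɑ, plain ʌ
text = f"b {TURNED_V}{TILDE} {SCRIPT_A}{TILDE} {TURNED_V}"

# A bare base-symbol key also rewrites the base of the nasalized symbol,
# and the combining tilde is left in place.
base_only = apply_post_mapping(text, {TURNED_V: SCHWA})

# Listing the key that includes the diacritic first removes the nasalization too.
with_diacritic = apply_post_mapping(
    text, {f"{TURNED_V}{TILDE}": SCHWA, TURNED_V: SCHWA}
)
```

Under this assumption `base_only` keeps the tilde on the rewritten vowel, which matches the custom-mapping output shown in the README example, while `with_diacritic` drops it.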

pyproject.toml

            
                  
    @@ -4,17 +4,15 @@ build-backend = "setuptools.build_meta"
  
    [project]

    name = "phonecodes"

-   version = "1.2.3"

+   version = "2.0.0"

    description = "Tools for loading dictionaries with various phonecodes (IPA, Callhome, X-SAMPA, ARPABET, DISC=CELEX, Buckeye), for converting among those phonecodes, and for searching those dictionaries for word sequences matching a target."

    readme = "README.md"

    license = {file = "LICENSE.txt"}

-   requires-python = ">=3.7"

+   requires-python = ">=3.9"

    classifiers = [

        "Programming Language :: Python :: 3",

-       "Programming Language :: Python :: 3.7",

-       "Programming Language :: Python :: 3.8",

        "Programming Language :: Python :: 3.9",

        "Programming Language :: Python :: 3.10",

        "Programming Language :: Python :: 3.11",

src/phonecodes/phonecode_tables.py

            
                  
    @@ -1,5 +1,8 @@
  
    """

-   Tables mapping other phonecodes to/from IPA

+   Tables mapping other phonecodes to/from IPA.

+   Note that working with unicode symbols can be tricky.

+   Refer to the Unicode standards at https://www.unicode.org/charts/ and

+   check symbols against a Unicode character inspector like https://apps.timwhitlock.info/unicode/inspect.

    """

    import re
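
As a small illustration of why the docstring calls Unicode symbols tricky: the same visible IPA symbol can be one precomposed code point or a base letter plus a combining mark, and the two strings do not compare equal. The sketch uses only the standard library `unicodedata` module; `describe` is a hypothetical helper, and which form the tables in this file use is not something this example asserts.

```python
import unicodedata

precomposed = "\u0129"  # one code point: LATIN SMALL LETTER I WITH TILDE
decomposed = "i\u0303"  # "i" followed by COMBINING TILDE


def describe(symbol):
    """List the code points that make up a symbol, with their Unicode names."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in symbol]


# The two spellings render alike but are different strings (and dictionary keys).
same_string = precomposed == decomposed

# Normalizing both sides to one form makes them comparable.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

Running `describe` on a table key is a quick local alternative to pasting it into an online character inspector.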

    @@ -541,6 +544,10 @@
  
    # TIMIT is written in a variant of ARPABET that includes a couple

    # of non-standard allophones, and most significantly, includes

    # separate symbols for the closure and release portions of each stop and affricate.

+   # This is the official mapping published with the TIMIT corpus, but you

+   # likely want to post-process to one of the standard shared IPA inventories

+   # defined below.

+   #

    # Because the TIMIT corpus has separate symbols for closure and release,

    # but IPA only has one corresponding symbol, we need to map all

    # possibilities for inputs with and without spaces.
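
The idea of covering inputs "with and without spaces" can be sketched as generating both spellings of each closure + release pair. `closure_release_keys` and `expand_pairs` are hypothetical helpers for illustration only; the table in this file may well be written out by hand rather than generated.

```python
def closure_release_keys(closure, release):
    """Return the spaced and unspaced spellings of a closure + release pair."""
    return [f"{closure} {release}", f"{closure}{release}"]


def expand_pairs(pairs):
    """Map every spelling of each (closure, release) pair to one IPA symbol."""
    table = {}
    for (closure, release), ipa in pairs.items():
        for key in closure_release_keys(closure, release):
            table[key] = ipa
    return table


# \u0261 is the IPA voiced velar stop, d + \u0292 is the voiced affricate.
table = expand_pairs({("GCL", "G"): "\u0261", ("DCL", "JH"): "d\u0292"})
```

Here `table` holds four keys, two per pair, so a lookup succeeds whether or not the annotation separates closure and release with a space.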

    @@ -566,7 +573,7 @@

            "DCL JH": "dʒ",

            "DX": "ɾ",

            "ENG": "ŋ̩",

-           "ER": "ɹ̩",

+           "ER": "ɝ",

            "EPI": "",

            "GCL": "ɡ",

            "GCLG": "ɡ",

    @@ -608,7 +615,7 @@

            "AXN": "ə̃",

            "IYN": "ĩ",

            "EYN": "ẽɪ̃",

-           "OWN": "õʊ̃",

+           "OWN": "õʊ̃",

            "DX": "ɾ",

            "AYN": "ãɪ̃",

            "AAN": "ɑ̃",

    @@ -633,10 +640,9 @@
  
    _ipa2buckeye = {v: k for k, v in _buckeye2ipa.items()}

    #######################################################################

    # IPA

-   _ipa_vowels = set("aeiouyɑɒɛɪɔʘʊʌʏəɘæʉɨøɜɞɐɤɵœɶ") | set(("ɪ̈", "ʊ̈"))

+   _ipa_vowels = set("aeiouyɑɒɛɪɔʘʊʌʏəɘæʉɨøɜɞɐɤɵœɶɝɚ") | set(("ɪ̈", "ʊ̈", "oʊ", "aʊ", "eɪ", "ɔɪ", "aɪ"))

    _ipa_consonants = set("bɓcdɖɗfɡɠhɦjʝklɭɺmnɳpɸqrɽɹɻsʂɕtʈvʋwxɧzʐʑβʙçðɱɣɢʛɥʜɲɟʄɬɮʎʟɯɰŋɴʋɒʁʀʃθʍχħʒɾɫʔʕʢʡꜛꜜǃ|ǀ‖ǁǂ")

    _ipa_diacritics = set(re.sub(r"◌", "", "◌̈◌̟◌̠◌̌◌̥◌̩◌◌◌̂◌̯◌̚◌◌̃◌̘◌̺◌̏◌◌̜◌̪◌̴◌̂◌◌́◌◌◌◌̰◌̀◌◌̄◌̻◌̼◌◌̹◌̞◌̙◌̌◌◌̝◌̋◌̤◌̬◌◌̆◌̽ːʰˀʷʱʼʲˤ"))

    _ipa_stressmarkers = set("ˈˌ")
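
The vowel-set change relies on a Python detail worth spelling out: `set()` over a string yields one-character members, while `set()` over a tuple keeps each multi-character string whole. The sketch below shows only that membership behavior with a shortened, illustrative inventory; how the library uses the set to place stress markers is not shown here.

```python
# Iterating a str yields single characters: a, e, and the r-colored vowel U+025D.
single_chars = set("ae\u025d")

# Iterating a tuple keeps each diphthong as one two-character member.
diphthongs = set(("o\u028a", "a\u026a"))

vowels = single_chars | diphthongs


def is_vowel(symbol):
    """Membership test against the combined (illustrative) vowel set."""
    return symbol in vowels


# Passing a diphthong as a bare string would split it into its two characters.
split_by_mistake = set("o\u028a")
```

With this set, `is_vowel("o\u028a")` is true while `is_vowel("o")` is false, which is why the diphthongs have to be added as whole strings rather than appended to the character string.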

    @@ -648,3 +654,82 @@
  
    _ipa_tones |= set(x + y for x in _ipa_tones for y in _ipa_tones)

    _ipa_symbols = _ipa_vowels | _ipa_consonants | _ipa_diacritics

    #######################################################################

    # Shared IPA inventories

    # Many projects will actually use a subset of the full IPA inventory and it's useful to have an

    # explicitly defined mapping to transform and validate.

    # These are some standard mappings from an original IPA inventory to a subset of IPA symbols.

    # These mappings are expected to be many-to-one reductions,

    # but can support multi-character symbols.

    # Since replacements are done by iterating over the mapping, cascading

    # replacements are supported, but they are not recommended.

    # Use the functions in the phonecodes module to check for cascading replacements.

    # N.B. This assumes mapping takes place in dictionary insertion order (this is guaranteed since python 3.7).

    #

    # This is the standard TIMIT label reduction described by Lee and Hon (1989);

    # see https://drive.google.com/file/d/1QI4_omp8E9EvO71jZQBGdH2GV6Pn7FPh/view?usp=sharing.

    # Closure symbols are also removed using the standard reduction, but

    # this is already handled by _timit2ipa

    STANDARD_TIMIT_IPA_REDUCTION = {

        "ɔ": "ɑ",

        "ɚ": "ɝ",

        "ʒ": "ʃ",

        "ɦ": "h",

        "ɨ": "ɪ",

        "ʉ": "u",

        # Syllabic markers are dropped

        "l̩": "l",

        "m̩": "m",

        "n̩": "n",

        "ŋ̩": "ŋ",

        "ʔ": "",

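    # "ə" and "ʌ" are collapsed; the key with the diacritic comes first so that

    # the bare "ə" rule cannot rewrite its base symbol before it is matched

    "ə̥": "ʌ",

    "ə": "ʌ",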

    }

    # In many cases it is important that the mapped subsets match,

    # especially when working with models that are trained on one corpus and evaluated on another.

    # These dictionaries map TIMIT and Buckeye IPA inventories to the same IPA subset in post-processing.

    BUCKEYE_IPA_TO_TIMIT_BUCKEYE_SHARED = {

        # Reduced nasalized vowels and diphthongs to non-nasal versions

        "ãʊ̃": "aʊ",

        "ẽɪ̃": "eɪ",

        "õʊ̃": "oʊ",

        "ãɪ̃": "aɪ",

        "ɔ̃ɪ̃": "ɔɪ",

        "æ̃": "æ",

        "ɔ̃": "ɔ",

        "ə̃": "ə",

        "ĩ": "i",

        "ɑ̃": "ɑ",

        "ũ": "u",

        "ɛ̃": "ɛ",

        "ʊ̃": "ʊ",

        "ɪ̃": "ɪ",

        "ɹ̩̃": "ɹ̩",

        # β doesn't appear in TIMIT annotations, so must be reduced

        "β": "f",

        # Nasalized flap is too inconsistently annotated, so reduce to 'n'

        "ɾ̃": "n",

        # Use schwa in final vocabulary

        "ʌ̃": "ə",

        "ʌ": "ə",

    }

    TIMIT_IPA_TO_TIMIT_BUCKEYE_SHARED = {

        # These symbols are not present in Buckeye IPA, so must be reduced

        "ɦ": "h",

        "ɨ": "ɪ",

        "ʉ": "u",

        # Nasalized flap is too inconsistently annotated, so reduce to 'n'

        "ɾ̃": "n",

        # Vocalic r all map to ɹ̩

        "ɝ": "ɹ̩",

        "ɚ": "ɹ̩",

        # Use schwa in final vocabulary

        "ʌ": "ə",

    }
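
Because these mappings are applied in insertion order, two kinds of ordering problem are possible: an earlier rule's output being rewritten by a later rule (a cascade), and an earlier key rewriting part of a later, longer key so that the later rule never fires. The check below is a simplified sketch under the assumption of successive substring replacement; `find_order_conflicts` is a hypothetical helper and not one of the functions in the `phonecodes` module.

```python
def find_order_conflicts(mapping):
    """Report key pairs where an earlier rule can interfere with a later one."""
    conflicts = []
    items = list(mapping.items())
    for i, (early_key, early_value) in enumerate(items):
        for late_key, _ in items[i + 1:]:
            if early_key in late_key:
                # The earlier key rewrites part of the later key first,
                # so the later rule can never match.
                conflicts.append(("shadowed", early_key, late_key))
            elif early_value and late_key in early_value:
                # Output of the earlier rule is rewritten again by the later rule.
                conflicts.append(("cascade", early_key, late_key))
    return conflicts


RING_BELOW = "\u0325"  # combining ring below
SCHWA, TURNED_V = "\u0259", "\u028c"

# Output of the first rule (U+025D) is the key of the second rule.
cascade_case = find_order_conflicts({"\u025a": "\u025d", "\u025d": "r"})

# The bare schwa key precedes the schwa + diacritic key.
shadow_case = find_order_conflicts({SCHWA: TURNED_V, SCHWA + RING_BELOW: TURNED_V})

# Longest key first: no conflict is reported.
clean_case = find_order_conflicts({SCHWA + RING_BELOW: TURNED_V, SCHWA: TURNED_V})
```

The check only compares keys and values pairwise, so it will miss conflicts that arise from a replacement combining with neighboring text; it is meant as a quick sanity pass over a custom mapping, not a proof of correctness.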
