11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,17 @@ You should also add project tags for each release in Github, see [Managing relea

# [Unreleased]

# [2.0.0] - 11/25/2024
### Added
- A Phonecodes enum class to the phonecodes module, to enforce valid conversion and language pairs more explicitly.
- Support for post-processing after conversion to/from IPA is performed, to allow for reduction to a shared symbol set. This is useful, for example, to apply the standard TIMIT symbol reduction or to map Buckeye and TIMIT to a shared symbol set.

### Changed
- All codeA2codeB conversion functions in phonecodes now rely on the convert function, which should increase maintainability and reusability of the code.

### Fixed
- Added missing ARPABET IPA vowels (diphthongs and r-colored vowels) to the set of IPA vowels in phonecode_tables, so that stress markers would be added correctly. Fixes https://github.com/ginic/phonecodes/issues/15.

# [1.2.3] - 10/23/2025
### Changed
- Added Python 3.14 to package and pytest GitHub Actions
23 changes: 22 additions & 1 deletion README.md
@@ -11,7 +11,7 @@ Developers may refer to the CONTRIBUTIONS.md for information on the development
# Basic Usage
## Converting between Phonetic Alphabets
If you want to convert to or from IPA to some other phonetic code, use `phonecodes.phonecodes` as follows:
```
```python
>>> from phonecodes import phonecodes
>>> print(phonecodes.CODES) # available phonetic alphabets
{'arpabet', 'buckeye', 'ipa', 'timit', 'callhome', 'xsampa', 'disc'}
@@ -35,6 +35,27 @@ For 'callhome' and 'disc' you should also specify a language code from the follo
- DISC/CELEX: Dutch `'nld'`, English `'eng'`, German `'deu'`. Uses German if unspecified.
- Callhome: Spanish `'spa'`, Egyptian Arabic `'arz'`, Mandarin Chinese `'cmn'`. You MUST specify an appropriate language code or you'll get a KeyError.

## Additional post-processing
An additional use case when converting between phonecodes is to normalize the final mapping to a subset of IPA symbols. This is useful if you are collapsing similar sounds together to a reduced symbol inventory or if you are standardizing two corpora with different IPA inventories/conventions to a shared subset.

We support this use case through the `post_conversion_mapping` keyword argument, an optional dictionary of replacements accepted by all phonecodes conversion functions. You can provide a custom mapping. Be aware that the remapping algorithm is greedy and proceeds in the order that keys appear in the dictionary, and that diacritics must appear together with a base symbol in the mapping keys to be remapped.

Additionally, we provide IPA-to-IPA post-processing dictionary mappings in `phonecodes.phonecode_tables`:
- `phonecodes.phonecode_tables.STANDARD_TIMIT_IPA_REDUCTION`: The 'standard' TIMIT label reduction used in Lee and Hon (1989) that reduces the original 64 TIMIT phonetic labels to 39 categories. This reduction is widely used in the speech recognition community.
- `phonecodes.phonecode_tables.BUCKEYE_IPA_TO_TIMIT_BUCKEYE_SHARED` and `phonecodes.phonecode_tables.TIMIT_IPA_TO_TIMIT_BUCKEYE_SHARED`: A conservative reduction from the Buckeye and TIMIT IPA inventories, respectively, to a shared symbol set. This maps nasalized vowels to their non-nasalized versions, nasalized flaps ('ɾ̃') to 'n', r-colored vowels ('ɚ', 'ɝ') to syllabic r ('ɹ̩'), and normalizes variants of 'ʌ' and schwa to schwa ('ə').

```python
>>> from phonecodes import phonecodes
# Conversion from Buckeye to IPA using the original published Buckeye mapping
>>> phonecodes.convert("B AHN NX AAN NX AH", "buckeye", "ipa")
'b ʌ̃ ɾ̃ ɑ̃ ɾ̃ ʌ'
# Conversion from Buckeye to IPA with postprocessing to an IPA inventory shared with TIMIT
>>> phonecodes.convert("B AHN NX AAN NX AH", "buckeye", "ipa", post_conversion_mapping = phonecodes.phonecode_tables.BUCKEYE_IPA_TO_TIMIT_BUCKEYE_SHARED)
'b ə n ɑ n ə'
# Custom mapping example - note that the nasalized diacritics are not affected by the remapping
>>> phonecodes.convert("B AHN NX AAN NX AH", "buckeye", "ipa", post_conversion_mapping = {'ʌ':'ə'})
'b ə̃ ɾ̃ ɑ̃ ɾ̃ ə'
```
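
The order sensitivity described above can be illustrated with a minimal sketch. It assumes the remapping behaves like successive string replacements in dictionary insertion order; `remap_in_order` is a hypothetical helper written for this illustration and is not part of the phonecodes API.

```python
def remap_in_order(text, mapping):
    """Apply each replacement in dictionary insertion order (hypothetical helper)."""
    for source, target in mapping.items():
        text = text.replace(source, target)
    return text


# Escapes keep the combining tilde (U+0303) explicit.
NASAL_TURNED_V = "\u028c\u0303"  # ʌ followed by a combining tilde
TURNED_V = "\u028c"  # ʌ
SCHWA = "\u0259"  # ə

ipa = f"b {NASAL_TURNED_V} {TURNED_V}"

# Nasalized key first: the vowel is consumed together with its tilde.
nasal_first = remap_in_order(ipa, {NASAL_TURNED_V: SCHWA, TURNED_V: SCHWA})

# Base symbol first: the bare vowel is replaced, the tilde stays behind on
# the schwa, and the nasalized key no longer matches anything.
base_first = remap_in_order(ipa, {TURNED_V: SCHWA, NASAL_TURNED_V: SCHWA})

print(nasal_first)  # b ə ə
print(base_first)   # b ə̃ ə
```

Under this assumption, keys that carry diacritics should be listed before their bare base symbols, which is the ordering used for 'ʌ̃' and 'ʌ' in `BUCKEYE_IPA_TO_TIMIT_BUCKEYE_SHARED`.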

## Reading Corpus Files
If you are working with specific corpora, you can also convert between certain corpus formats as follows:
6 changes: 2 additions & 4 deletions pyproject.toml
@@ -4,17 +4,15 @@ build-backend = "setuptools.build_meta"

[project]
name = "phonecodes"
version = "1.2.3"
version = "2.0.0"
description = "Tools for loading dictionaries with various phonecodes (IPA, Callhome, X-SAMPA, ARPABET, DISC=CELEX, Buckeye), for converting among those phonecodes, and for searching those dictionaries for word sequences matching a target."
readme = "README.md"
license = {file = "LICENSE.txt"}

requires-python = ">=3.7"
requires-python = ">=3.9"

classifiers = [
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
95 changes: 90 additions & 5 deletions src/phonecodes/phonecode_tables.py
@@ -1,5 +1,8 @@
"""
Tables mapping other phonecodes to/from IPA
Tables mapping other phonecodes to/from IPA.
Note that working with unicode symbols can be tricky.
Refer to the Unicode standards at https://www.unicode.org/charts/ and
check symbols against a Unicode character inspector like https://apps.timwhitlock.info/unicode/inspect.
"""

import re
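
The docstring's warning about Unicode can be made concrete with the standard library. This is a small sketch, separate from the module itself, that lists the code points behind a symbol and shows why precomposed and decomposed forms of the same glyph do not match as dictionary keys; `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(symbol):
    """List the code point and Unicode name of each character in a symbol."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in symbol]


# Syllabic r is two code points: a base letter plus a combining mark.
syllabic_r = "\u0279\u0329"
print(len(syllabic_r))  # 2
print(describe(syllabic_r))

# The same visible glyph can be encoded in two ways that compare unequal.
decomposed = "a\u0303"  # 'a' followed by a combining tilde
precomposed = "\u00e3"  # single code point for a with tilde
print(decomposed == precomposed)  # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

A mapping keyed on one encoding will silently miss input in the other, so it may be worth normalizing both the tables and the input to a single form before comparing symbols.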
@@ -541,6 +544,10 @@
# TIMIT is written in a variant of ARPABET that includes a couple
# of non-standard allophones, and most significantly, includes
# separate symbols for the closure and release portions of each stop and affricate.
# This is the official mapping published with the TIMIT corpus, but you
# likely want to post-process to one of the standard shared IPA inventories
# defined below.
#
# Because the TIMIT corpus has separate symbols for closure and release,
# but IPA only has one corresponding symbol, we need to map all
# possibilities for inputs with and without spaces.
@@ -566,7 +573,7 @@
"DCL JH": "dʒ",
"DX": "ɾ",
"ENG": "ŋ̩",
"ER": "ɹ̩",
"ER": "ɝ",
"EPI": "",
"GCL": "ɡ",
"GCLG": "ɡ",
@@ -608,7 +615,7 @@
"AXN": "ə̃",
"IYN": "ĩ",
"EYN": "ẽɪ̃",
"OWN": "õʊ̃",
"OWN": "õʊ̃",
"DX": "ɾ",
"AYN": "ãɪ̃",
"AAN": "ɑ̃",
@@ -633,10 +640,9 @@

_ipa2buckeye = {v: k for k, v in _buckeye2ipa.items()}


#######################################################################
# IPA
_ipa_vowels = set("aeiouyɑɒɛɪɔʘʊʌʏəɘæʉɨøɜɞɐɤɵœɶ") | set(("ɪ̈", "ʊ̈"))
_ipa_vowels = set("aeiouyɑɒɛɪɔʘʊʌʏəɘæʉɨøɜɞɐɤɵœɶɝɚ") | set(("ɪ̈", "ʊ̈", "oʊ", "aʊ", "eɪ", "ɔɪ", "aɪ"))
_ipa_consonants = set("bɓcdɖɗfɡɠhɦjʝklɭɺmnɳpɸqrɽɹɻsʂɕtʈvʋwxɧzʐʑβʙçðɱɣɢʛɥʜɲɟʄɬɮʎʟɯɰŋɴʋɒʁʀʃθʍχħʒɾɫʔʕʢʡꜛꜜǃ|ǀ‖ǁǂ")
_ipa_diacritics = set(re.sub(r"◌", "", "◌̈◌̟◌̠◌̌◌̥◌̩◌◌◌̂◌̯◌̚◌◌̃◌̘◌̺◌̏◌◌̜◌̪◌̴◌̂◌◌́◌◌◌◌̰◌̀◌◌̄◌̻◌̼◌◌̹◌̞◌̙◌̌◌◌̝◌̋◌̤◌̬◌◌̆◌̽ːʰˀʷʱʼʲˤ"))
_ipa_stressmarkers = set("ˈˌ")
@@ -648,3 +654,82 @@
_ipa_tones |= set(x + y for x in _ipa_tones for y in _ipa_tones)

_ipa_symbols = _ipa_vowels | _ipa_consonants | _ipa_diacritics

#######################################################################
# Shared IPA inventories
# Many projects will actually use a subset of the full IPA inventory and it's useful to have an
# explicitly defined mapping to transform and validate.
# These are some standard mappings from an original IPA inventory to a subset of IPA symbols.
# These mappings are expected to be many-to-one reductions, but can support multi-character symbols.
# Since replacements are done by iterating over the mapping, cascading
# replacements are supported, but they are not recommended.
# Use the functions in the phonecodes module to check for cascading replacements.
#
# N.B. This assumes mapping takes place in dictionary insertion order (this is guaranteed since Python 3.7).

# This is the standard TIMIT label reduction from Lee and Hon (1989),
# described in https://drive.google.com/file/d/1QI4_omp8E9EvO71jZQBGdH2GV6Pn7FPh/view?usp=sharing.
# Closure symbols are also removed using the standard reduction, but
# this is already handled by _timit2ipa
STANDARD_TIMIT_IPA_REDUCTION = {
"ɔ": "ɑ",
"ɚ": "ɝ",
"ʒ": "ʃ",
"ɦ": "h",
"ɨ": "ɪ",
"ʉ": "u",
# Syllabic markers are dropped
"l̩": "l",
"m̩": "m",
"n̩": "n",
"ŋ̩": "ŋ",
"ʔ": "",
# "ə" and "ʌ" are collapsed
"ə": "ʌ",
"ə̥": "ʌ",
}

# In many cases it is important that the mapped subsets match,
# especially working with models that are trained on one corpus and evaluated on another.
# These dictionaries map TIMIT and Buckeye IPA inventories to the same IPA subset in post-processing.
BUCKEYE_IPA_TO_TIMIT_BUCKEYE_SHARED = {
# Reduce nasalized vowels and diphthongs to non-nasal versions
"ãʊ̃": "aʊ",
"ẽɪ̃": "eɪ",
"õʊ̃": "oʊ",
"ãɪ̃": "aɪ",
"ɔ̃ɪ̃": "ɔɪ",
"æ̃": "æ",
"ɔ̃": "ɔ",
"ə̃": "ə",
"ĩ": "i",
"ɑ̃": "ɑ",
"ũ": "u",
"ɛ̃": "ɛ",
"ʊ̃": "ʊ",
"ɪ̃": "ɪ",
"ɹ̩̃": "ɹ̩",
# β doesn't appear in TIMIT annotations, so must be reduced
"β": "f",
# Nasalized flap is too inconsistently annotated, so reduce to 'n'
"ɾ̃": "n",
# Use schwa in final vocabulary
"ʌ̃": "ə",
"ʌ": "ə",
}
TIMIT_IPA_TO_TIMIT_BUCKEYE_SHARED = {
# These symbols are not present in Buckeye IPA, so must be reduced
"ɦ": "h",
"ɨ": "ɪ",
"ʉ": "u",
# Nasalized flap is too inconsistently annotated, so reduce to 'n'
"ɾ̃": "n",
# All vocalic r symbols map to ɹ̩
"ɝ": "ɹ̩",
"ɚ": "ɹ̩",
# Use schwa in final vocabulary
"ʌ": "ə",
}
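
The comments on the shared inventories advise against cascading replacements. A minimal sketch of how such a check might look is given here, assuming replacements run in dictionary insertion order; `find_cascading_replacements` is a hypothetical helper for illustration and is not one of the functions in the phonecodes module.

```python
def find_cascading_replacements(mapping):
    """Return (earlier_key, later_key) pairs where the output of an earlier
    replacement contains a later key and would therefore be rewritten again.

    Hypothetical helper: it only inspects replacement values and does not
    detect matches formed across symbol boundaries in real input.
    """
    keys = list(mapping)
    cascades = []
    for index, earlier in enumerate(keys):
        for later in keys[index + 1:]:
            if later and later in mapping[earlier]:
                cascades.append((earlier, later))
    return cascades


# Illustrative mapping, not one of the tables above: ɚ -> ɝ, then ɝ -> ɹ̩.
# The first output is rewritten by the second rule, so ɚ ends up as ɹ̩.
cascading = {"\u025a": "\u025d", "\u025d": "\u0279\u0329"}
print(find_cascading_replacements(cascading))

# ʌ̃ -> ə followed by ʌ -> ə does not cascade: ə contains no later key.
flat = {"\u028c\u0303": "\u0259", "\u028c": "\u0259"}
print(find_cascading_replacements(flat))  # []
```

Running a check like this over a custom `post_conversion_mapping` before use would flag orderings whose final output differs from what the individual entries suggest.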