Stable char-to-phoneme mapping with alignment across phonetic transformations #2

hossam-96 · 2025-09-25T13:12:19Z

Summary

Introduce and propagate a char-to-phoneme origin mapping through all text normalization and phonetic operations. This enables precise back-references from phonetic output to original Uthmani indices for highlighting, sync, and annotations.

What changed

Mapping utilities in src/quran_transcript/phonetics/phonetizer.py:
- _distribute_mapping(segment, new_len): spreads original indices across replaced segments.
- _align_mapping(before_text, after_text, before_mapping): aligns origin indices using difflib.SequenceMatcher opcodes.
- _substitute_with_mapping(text, mapping, pattern, replacement): regex substitution that preserves mapping via alignment.
quran_phonetizer:
- Initialize a mapping array with original character indices.
- Normalize whitespace to alph.uthmani.space and trim edges via mapping-aware substitution.
- After each op in OPERATION_ORDER, realign mapping with the transformed text only when the text changes.
- Respect remove_spaces by filtering both phonemes and mapping.
- Return QuranPhoneticScriptOutput with a new char_map field populated.
Typing/docstrings and light refactors for clarity.
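
The quran_phonetizer flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: run_with_mapping, operations, and align are hypothetical stand-ins for the real function body, OPERATION_ORDER, and _align_mapping.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Mapping = List[Optional[int]]


def run_with_mapping(
    text: str,
    operations: Sequence[Callable[[str], str]],
    align: Callable[[str, str, Mapping], Mapping],
) -> Tuple[str, Mapping]:
    """Apply operations in order, realigning the mapping only on change."""
    # Start from the identity mapping: output char k came from input char k.
    mapping: Mapping = list(range(len(text)))
    for op in operations:
        new_text = op(text)
        if new_text != text:
            # Only pay for alignment when the operation changed the text.
            mapping = align(text, new_text, mapping)
            text = new_text
    return text, mapping
```

Passing the aligner in as a parameter is only a convenience for the sketch; it keeps the loop testable without the real phonetic operations.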

Why

Required for UI features such as per-phoneme highlighting, cursor sync, and error localization.
Ensures transformations that insert/delete/replace characters maintain a stable origin link to original Uthmani indices.

Behavior

Phoneme string remains functionally unchanged.
QuranPhoneticScriptOutput.char_map exposes per-phoneme origin indices; an entry is None when the phoneme has no direct source (e.g., an inserted character).
Whitespace is canonicalized to alph.uthmani.space; remove_spaces=True removes those spaces and their indices consistently.
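
As a rough illustration of how remove_spaces=True can keep phonemes and char_map in sync, the sketch below filters both in one pass. The helper name drop_spaces and the plain " " space are assumptions for the example; the real code uses alph.uthmani.space.

```python
from typing import List, Optional, Sequence, Tuple


def drop_spaces(
    phonemes: str,
    char_map: Sequence[Optional[int]],
    space: str = " ",  # assumption: stand-in for alph.uthmani.space
) -> Tuple[str, List[Optional[int]]]:
    """Remove space characters and their mapping entries together."""
    # Pair each phoneme with its origin index so both are dropped as a unit.
    kept = [(ch, idx) for ch, idx in zip(phonemes, char_map) if ch != space]
    return "".join(ch for ch, _ in kept), [idx for _, idx in kept]
```

Filtering pairs rather than filtering the string and the list separately makes it hard for the two to drift out of alignment.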

Implementation notes

Alignment via SequenceMatcher.get_opcodes():
- equal: carry over the original slice of the mapping.
- replace: distribute the source indices across the new span using _distribute_mapping.
- insert: fill with None for newly inserted characters.
- delete: dropped characters are omitted from the mapping.
_substitute_with_mapping provides a mapping-safe wrapper around regex substitutions used for normalization.
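
The opcode handling above could look something like the sketch below. The function names mirror the PR's helpers without the leading underscore, but the bodies are assumptions, in particular the proportional spreading rule in distribute_mapping; the actual implementation may differ.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Mapping = List[Optional[int]]


def distribute_mapping(segment: Mapping, new_len: int) -> Mapping:
    """Spread the origin indices in `segment` over `new_len` characters."""
    if new_len == 0:
        return []
    if not segment:
        return [None] * new_len
    n = len(segment)
    # Assumed rule: output position k takes the proportional source index.
    return [segment[k * n // new_len] for k in range(new_len)]


def align_mapping(before: str, after: str, before_mapping: Mapping) -> Mapping:
    """Carry origin indices from `before` over to `after` via diff opcodes."""
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    result: Mapping = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(before_mapping[i1:i2])
        elif tag == "replace":
            result.extend(distribute_mapping(before_mapping[i1:i2], j2 - j1))
        elif tag == "insert":
            result.extend([None] * (j2 - j1))
        # "delete": the removed characters contribute nothing to the output.
    return result


def substitute_with_mapping(
    text: str, mapping: Mapping, pattern: str, replacement: str
) -> Tuple[str, Mapping]:
    """Regex substitution that realigns the mapping when the text changes."""
    new_text = re.sub(pattern, replacement, text)
    if new_text == text:
        return text, mapping
    return new_text, align_mapping(text, new_text, mapping)
```

For example, align_mapping("ac", "abc", [0, 1]) marks the inserted "b" with None, and deleting a character simply drops its entry.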

Performance

Alignment is only performed when text actually changes (fast no-op path otherwise).
SequenceMatcher is typically fast for the targeted inputs; worst-case quadratic behavior is acceptable for current use and input sizes.

Testing suggestions

Cover opcode cases: equal, replace, insert, delete should yield expected char_map segments.
Validate whitespace normalization and remove_spaces=True keep phonemes and char_map in sync.
Exercise sequences of OPERATION_ORDER that expand/contract text; verify indices remain correctly aligned to the original Uthmani string.
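Round-trip sanity: for every output position j where output.char_map[j] is not None, verify that i = output.char_map[j] is a valid index into uthmani_text and that uthmani_text[i] is the character the phoneme derives from.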
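
A test helper along these lines could back the structural part of these checks. It is a sketch only: find_char_map_problems is a hypothetical name, and it assumes char_map has exactly one entry per phoneme character, which the PR description implies but does not state outright.

```python
from typing import List, Optional, Sequence


def find_char_map_problems(
    uthmani_text: str,
    phonemes: str,
    char_map: Sequence[Optional[int]],
) -> List[str]:
    """Return human-readable descriptions of structural char_map errors."""
    problems: List[str] = []
    # Assumed contract: one mapping entry per phoneme character.
    if len(char_map) != len(phonemes):
        problems.append(
            f"length mismatch: {len(char_map)} entries for {len(phonemes)} phonemes"
        )
    for j, i in enumerate(char_map):
        if i is None:
            continue  # inserted phoneme with no direct source
        if not 0 <= i < len(uthmani_text):
            problems.append(f"char_map[{j}] = {i} is out of range")
    return problems
```

A unit test can then assert that the returned list is empty for each fixture; checking that the index points at the linguistically correct character still needs hand-written expectations.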

Backwards compatibility

No breaking API removals. Downstream consumers can begin using output.char_map.
If external code previously stripped spaces manually, prefer remove_spaces=True to keep mapping aligned.

Affected files

src/quran_transcript/phonetics/phonetizer.py

Checklist

Unit tests for mapping alignment and space handling
Documentation for the char_map contract and examples
Changelog entry

obadx · 2025-10-23T08:26:06Z

Thank you so much. This is a really useful feature. I'm very sorry for taking so long to respond. I will look at it tomorrow, In Shaa ALLAH.

feat: add phonemes to uthmani mapping output to the quran_phonetizer

305a8a0

hossam-96 changed the title ~~feat: add phonemes to uthmani mapping output to the quran_phonetizer~~ Stable char-to-phoneme mapping with alignment across phonetic transformations Sep 25, 2025
