hossam-96 commented Sep 25, 2025

Summary

Introduce and propagate a char-to-phoneme origin mapping through all text normalization and phonetic operations. This enables precise back-references from phonetic output to original Uthmani indices for highlighting, sync, and annotations.

What changed

  • Mapping utilities in src/quran_transcript/phonetics/phonetizer.py
    • _distribute_mapping(segment, new_len): spreads original indices across replaced segments.
    • _align_mapping(before_text, after_text, before_mapping): aligns origin indices using difflib.SequenceMatcher opcodes.
    • _substitute_with_mapping(text, mapping, pattern, replacement): regex substitution that preserves mapping via alignment.
  • quran_phonetizer
    • Initialize a mapping array with original character indices.
    • Normalize whitespace to alph.uthmani.space and trim edges via mapping-aware substitution.
    • After each op in OPERATION_ORDER, realign mapping with the transformed text only when the text changes.
    • Respect remove_spaces by filtering both phonemes and mapping.
    • Return QuranPhoneticScriptOutput with a new char_map field populated.
  • Typing/docstrings and light refactors for clarity.
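
To make the "spread original indices across a replaced segment" idea concrete, here is a minimal sketch. It is not the code from this PR: the name `distribute_mapping_sketch` and the proportional spreading rule are assumptions chosen for illustration, and the real `_distribute_mapping` may distribute indices differently.

```python
from typing import List, Optional

Mapping = List[Optional[int]]


def distribute_mapping_sketch(segment: Mapping, new_len: int) -> Mapping:
    """Spread the origin indices in `segment` over `new_len` output positions.

    Hypothetical helper: proportional spreading is an assumption, not
    necessarily what `_distribute_mapping` does in the PR.
    """
    if new_len <= 0:
        return []
    if not segment:
        # Nothing to inherit from, so the new characters have no origin.
        return [None] * new_len
    # Output position k inherits from the proportionally matching source slot.
    return [segment[k * len(segment) // new_len] for k in range(new_len)]


# Two source characters expanded to four output characters.
print(distribute_mapping_sketch([3, 4], 4))  # [3, 3, 4, 4]
```

With this rule every output position gets some origin index whenever the replaced segment is non-empty, which keeps highlighting contiguous when one character expands into several phonemes.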

Why

  • Required for UI features such as per-phoneme highlighting, cursor sync, and error localization.
  • Ensures transformations that insert/delete/replace characters maintain a stable origin link to original Uthmani indices.

Behavior

  • Phoneme string remains functionally unchanged.
  • QuranPhoneticScriptOutput.char_map exposes per-phoneme origin indices; an entry is None when the phoneme has no direct source (e.g., inserted characters).
  • Whitespace is canonicalized to alph.uthmani.space; remove_spaces=True removes those spaces and their indices consistently.
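
As an illustration of removing spaces while keeping both sequences in sync, the sketch below filters the phoneme string and its map together. The function name `drop_spaces_in_sync` is hypothetical, and the plain `" "` default stands in for `alph.uthmani.space`; neither is taken from the PR.

```python
from typing import List, Optional, Tuple


def drop_spaces_in_sync(
    phonemes: str,
    char_map: List[Optional[int]],
    space: str = " ",  # assumption: stands in for alph.uthmani.space
) -> Tuple[str, List[Optional[int]]]:
    """Remove space characters and their map entries in one pass."""
    if len(phonemes) != len(char_map):
        raise ValueError("char_map must have one entry per phoneme character")
    # Pair each character with its origin so they are kept or dropped together.
    kept = [(ch, idx) for ch, idx in zip(phonemes, char_map) if ch != space]
    return "".join(ch for ch, _ in kept), [idx for _, idx in kept]


text, mapping = drop_spaces_in_sync("ab cd", [0, 1, 2, 3, 4])
print(text, mapping)  # abcd [0, 1, 3, 4]
```

Filtering pairs rather than the two sequences separately is what guarantees that the lengths cannot drift apart.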

Implementation notes

  • Alignment via SequenceMatcher.get_opcodes():
    • equal: carry over the original slice of the mapping.
    • replace: distribute the source indices across the new span using _distribute_mapping.
    • insert: fill with None for newly inserted characters.
    • delete: dropped characters are omitted from the mapping.
  • _substitute_with_mapping provides a mapping-safe wrapper around regex substitutions used for normalization.
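
The opcode handling described above can be sketched as follows. This is an illustrative reimplementation, not the PR's `_align_mapping`: the name `align_mapping_sketch`, the `autojunk=False` setting, and the proportional rule used for `replace` are assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Optional

Mapping = List[Optional[int]]


def align_mapping_sketch(before: str, after: str, before_map: Mapping) -> Mapping:
    """Carry origin indices from `before` to `after` using diff opcodes."""
    if len(before) != len(before_map):
        raise ValueError("mapping must have one entry per character")
    out: Mapping = []
    # autojunk=False is an assumption; it disables the popularity heuristic.
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged span: carry over the original slice of the mapping.
            out.extend(before_map[i1:i2])
        elif tag == "replace":
            # Changed span: spread the source indices over the new length.
            src, n = before_map[i1:i2], j2 - j1
            out.extend(src[k * len(src) // n] for k in range(n))
        elif tag == "insert":
            # New characters have no origin.
            out.extend([None] * (j2 - j1))
        # "delete": the dropped characters contribute nothing.
    return out


# Deleting "c" from "abcd" drops its index from the mapping.
print(align_mapping_sketch("abcd", "abd", [0, 1, 2, 3]))  # [0, 1, 3]
```

Because the opcodes cover all of `after` exactly once, the returned list always has one entry per character of the transformed text.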

Performance

  • Alignment is only performed when text actually changes (fast no-op path otherwise).
  • SequenceMatcher is typically fast for the targeted inputs; worst-case quadratic behavior is acceptable for current use and input sizes.
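
The no-op fast path can be pictured with the loop below. It is a sketch under assumptions: operations are modeled as plain `str -> str` callables and the aligner is passed in as a parameter, which may not match how `OPERATION_ORDER` entries or `_align_mapping` are actually invoked in the PR.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Mapping = List[Optional[int]]
Op = Callable[[str], str]  # assumption: an operation maps text to text
Aligner = Callable[[str, str, Mapping], Mapping]


def apply_ops_with_mapping(
    text: str, ops: Iterable[Op], align: Aligner
) -> Tuple[str, Mapping]:
    """Run each operation and realign the mapping only when the text changed."""
    # Start with the identity mapping: character k comes from original index k.
    mapping: Mapping = list(range(len(text)))
    for op in ops:
        new_text = op(text)
        if new_text != text:
            # Only pay for alignment when the operation had an effect.
            mapping = align(text, new_text, mapping)
            text = new_text
    return text, mapping
```

The string comparison is linear in the text length, so skipping the diff for operations that leave the text untouched avoids the potentially quadratic alignment step entirely.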

Testing suggestions

  • Cover opcode cases: equal, replace, insert, delete should yield expected char_map segments.
  • Validate whitespace normalization and remove_spaces=True keep phonemes and char_map in sync.
  • Exercise sequences of OPERATION_ORDER that expand/contract text; verify indices remain correctly aligned to the original Uthmani string.
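  • Round-trip sanity: for every output position j where output.char_map[j] is not None, check that i = output.char_map[j] is a valid index into uhtmani_text and that uhtmani_text[i] is the character the phoneme came from.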
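
A structural check along these lines could back several of the suggestions above. The helper name `char_map_problems` is hypothetical, and the invariants it checks (one entry per phoneme character, each entry either None or an in-range index) are my reading of the char_map contract described in this PR, not a documented guarantee.

```python
from typing import List, Optional, Sequence


def char_map_problems(
    original: str, phonemes: str, char_map: Sequence[Optional[int]]
) -> List[str]:
    """Return human-readable violations of the assumed char_map invariants."""
    problems: List[str] = []
    if len(char_map) != len(phonemes):
        problems.append(
            f"length mismatch: {len(char_map)} map entries "
            f"for {len(phonemes)} phoneme characters"
        )
    for j, i in enumerate(char_map):
        if i is None:
            continue  # inserted phoneme with no direct source
        if not isinstance(i, int) or not 0 <= i < len(original):
            problems.append(f"char_map[{j}] = {i!r} is outside the original text")
    return problems


# In a test, an empty list means the output passed the structural checks.
assert char_map_problems("abc", "xy", [0, None]) == []
```

This only validates structure; whether each index points at the right source character still needs hand-written expectations per operation.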

Backwards compatibility

  • No breaking API removals. Downstream consumers can begin using output.char_map.
  • If external code previously stripped spaces manually, prefer remove_spaces=True to keep mapping aligned.

Affected files

  • src/quran_transcript/phonetics/phonetizer.py

Checklist

  • Unit tests for mapping alignment and space handling
  • Documentation for the char_map contract and examples
  • Changelog entry

hossam-96 changed the title from "feat: add phonemes to uthmani mapping output to the quran_phonetizer" to "Stable char-to-phoneme mapping with alignment across phonetic transformations" Sep 25, 2025
obadx (Owner) commented Oct 23, 2025

Thank you so much. This is a really useful feature. I'm very sorry for taking so long to respond. I will look at it tomorrow, In Shaa Allah.
