Stable char-to-phoneme mapping with alignment across phonetic transformations #2
Summary
Introduce and propagate a char-to-phoneme origin mapping through all text normalization and phonetic operations. This enables precise back-references from phonetic output to original Uthmani indices for highlighting, sync, and annotations.
What changed
`src/quran_transcript/phonetics/phonetizer.py`
- `_distribute_mapping(segment, new_len)`: spreads original indices across replaced segments.
- `_align_mapping(before_text, after_text, before_mapping)`: aligns origin indices using `difflib.SequenceMatcher` opcodes.
- `_substitute_with_mapping(text, mapping, pattern, replacement)`: regex substitution that preserves mapping via alignment.
- `quran_phonetizer`:
  - Initialize a `mapping` array with original character indices.
  - Normalize spaces to `alph.uthmani.space` and trim edges via mapping-aware substitution.
  - For each operation in `OPERATION_ORDER`, realign mapping with the transformed text only when the text changes.
  - Handle `remove_spaces` by filtering both phonemes and mapping.
  - Return `QuranPhoneticScriptOutput` with a new `char_map` field populated.
Why
Behavior
- `QuranPhoneticScriptOutput.char_map` exposes per-phoneme origin indices (`None` when a phoneme has no direct source, e.g., inserted characters).
- Spaces are normalized to `alph.uthmani.space`; `remove_spaces=True` removes those spaces and their indices consistently.
Implementation notes
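A minimal sketch of the opcode-based alignment covered in these notes, for illustration only. The function names mirror the PR's helpers without the leading underscore and are hypothetical stand-ins; the proportional spreading rule in `distribute_mapping` and the `autojunk=False` setting are assumptions, not necessarily what `phonetizer.py` does.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Mapping = List[Optional[int]]


def distribute_mapping(segment: Mapping, new_len: int) -> Mapping:
    """Spread the origin indices of a replaced segment over new_len positions."""
    if new_len <= 0:
        return []
    if not segment:
        return [None] * new_len
    # Assumed rule: output position j takes the source index at the same
    # relative offset inside the replaced segment.
    return [segment[j * len(segment) // new_len] for j in range(new_len)]


def align_mapping(before_text: str, after_text: str, before_mapping: Mapping) -> Mapping:
    """Carry origin indices from before_text over to after_text."""
    matcher = SequenceMatcher(None, before_text, after_text, autojunk=False)
    after_mapping: Mapping = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: reuse the original slice of the mapping.
            after_mapping.extend(before_mapping[i1:i2])
        elif tag == "replace":
            # Rewritten run: spread the old indices over the new span.
            after_mapping.extend(distribute_mapping(before_mapping[i1:i2], j2 - j1))
        elif tag == "insert":
            # New characters have no source index.
            after_mapping.extend([None] * (j2 - j1))
        # "delete": dropped characters contribute nothing.
    return after_mapping


def substitute_with_mapping(
    text: str, mapping: Mapping, pattern: str, replacement: str
) -> Tuple[str, Mapping]:
    """Apply re.sub and realign the mapping only if the text changed."""
    new_text = re.sub(pattern, replacement, text)
    if new_text == text:
        return text, mapping
    return new_text, align_mapping(text, new_text, mapping)


# "b" is replaced by "xy", so both new characters point back to index 1.
print(align_mapping("abc", "axyc", [0, 1, 2]))  # [0, 1, 1, 2]
```

After every step the mapping has exactly one entry per character of the transformed text, which is the invariant the rest of the pipeline relies on.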
- Alignment is driven by `SequenceMatcher.get_opcodes()`:
  - `equal`: carry over the original slice of the mapping.
  - `replace`: distribute the source indices across the new span using `_distribute_mapping`.
  - `insert`: fill with `None` for newly inserted characters.
  - `delete`: dropped characters are omitted from the mapping.
- `_substitute_with_mapping` provides a mapping-safe wrapper around regex substitutions used for normalization.
Performance
- `SequenceMatcher` is typically fast for the targeted inputs; worst-case quadratic behavior is acceptable for current use and input sizes.
Testing suggestions
- `equal`, `replace`, `insert`, `delete` should yield expected `char_map` segments.
- Runs with `remove_spaces=True` keep `phonemes` and `char_map` in sync.
- Operations in `OPERATION_ORDER` that expand/contract text: verify indices remain correctly aligned to the original Uthmani string.
- For each mapped index `i`, ensure `output.char_map[j] == i` points back to `uhtmani_text[i]`.
Backwards compatibility
- New field: `output.char_map`.
- Spaces and their mapping entries are filtered together under `remove_spaces=True` to keep mapping aligned.
Affected files
- `src/quran_transcript/phonetics/phonetizer.py`
Checklist
- `char_map` contract and examples
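One possible example of the `char_map` contract, sketched under assumptions: `phonemes` is treated as a string with exactly one `char_map` entry per character, a plain ASCII space stands in for `alph.uthmani.space`, and the helper names `drop_spaces` and `source_chars` are hypothetical and not part of the library.

```python
from typing import List, Optional, Tuple

SPACE = " "  # assumption: stand-in for alph.uthmani.space


def drop_spaces(
    phonemes: str, char_map: List[Optional[int]]
) -> Tuple[str, List[Optional[int]]]:
    """Remove spaces from phonemes together with their char_map entries."""
    if len(phonemes) != len(char_map):
        raise ValueError("phonemes and char_map must have the same length")
    kept = [(p, i) for p, i in zip(phonemes, char_map) if p != SPACE]
    return "".join(p for p, _ in kept), [i for _, i in kept]


def source_chars(
    original_text: str, char_map: List[Optional[int]]
) -> List[Optional[str]]:
    """Resolve each char_map entry to its original character, or None."""
    return [None if i is None else original_text[i] for i in char_map]


# Toy Latin data instead of real Uthmani text: the second "b" was inserted
# by a phonetic rule, so it has no source index.
original = "ab c"
phonemes = "abb c"
char_map = [0, 1, None, 2, 3]

compact, compact_map = drop_spaces(phonemes, char_map)
print(compact, compact_map)                 # abbc [0, 1, None, 3]
print(source_chars(original, compact_map))  # ['a', 'b', None, 'c']
```

Filtering both sequences in a single pass keeps them the same length, so a consumer can index `char_map` with any phoneme position whether or not spaces were removed.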