KoEx uses a multi-stage language analysis pipeline to convert raw Korean input into semantically meaningful components.
The pipeline powers both the standalone language analysis and the subtitle analysis integrated into the YouTube player.
- Preprocess
- Strip all non-Hangul characters
- Idioms (사자성어)
- Replace detected idioms with placeholders (`@@{idx}`, using U+FF20 full-width @)
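The placeholder step can be sketched as a round trip: mask idioms before tokenization and restore them afterwards. This is a minimal sketch under assumptions. `IDIOMS`, `mask_idioms`, and `unmask_idioms` are hypothetical names, the idiom list is illustrative, and the placeholder is assumed to be two U+FF20 characters followed by a numeric index.

```python
import re

FW_AT = "\uFF20"  # full-width commercial at (U+FF20)

# Hypothetical idiom list, for illustration only.
IDIOMS = ["일석이조", "동문서답"]

# Assumed placeholder shape: two full-width @ followed by the idiom index.
PLACEHOLDER_RE = re.compile(FW_AT * 2 + r"(\d+)")


def mask_idioms(text, idioms=IDIOMS):
    """Replace each idiom occurrence with a placeholder and record the mapping."""
    if not idioms:
        return text, []
    found = []  # found[idx] is the idiom hidden behind placeholder idx

    def _swap(match):
        found.append(match.group(0))
        return f"{FW_AT}{FW_AT}{len(found) - 1}"

    # Longest idioms first, so a longer idiom wins over one it contains.
    ordered = sorted(idioms, key=len, reverse=True)
    pattern = "|".join(re.escape(idiom) for idiom in ordered)
    return re.sub(pattern, _swap, text), found


def unmask_idioms(text, found):
    """Substitute placeholders back with the idioms they stand for."""
    return PLACEHOLDER_RE.sub(lambda m: found[int(m.group(1))], text)
```

Keeping the mapping in a list indexed by placeholder number means the restore step needs no knowledge of the idiom list itself.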
- Tokenize
- Use KoNLPy Komoran to tokenize input with placeholders
- Grammar Chunking
- Merge auxiliary grammar using rules from `auxiliary_grammar_rules.yaml`
  - At each position, greedily apply the longest matching rule
  - Each rule specifies a sequence of tags (and optionally tokens) that represent a meaningful grammar chunk
  - Example (`aux_negation_adverb_past`)
    - pattern:
      - token: 안
      - tag: MAG
      - tag: VV
      - tag: EP
      - tag: EF (use the `TAG_EQUIVALENTS` map to match `EF` or `EC` here)
    - Input: `[('나', 'NP'), ('는', 'JX'), ('안', 'MAG'), ('가', 'VV'), ('았', 'EP'), ('어요', 'EC')]`
    - Output: `[('나', 'NP'), ('는', 'JX'), ('안갔어요', 'VV')]`
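The greedy longest-match idea can be sketched as follows. This is not the project's actual implementation. The in-memory rule layout, the `result_tag` field, and the helper names are assumptions, and the first pattern element is assumed to constrain both the token `안` and its tag `MAG`.

```python
# Assumed stand-in for the TAG_EQUIVALENTS map: a rule tag maps to the set
# of token tags it accepts.
TAG_EQUIVALENTS = {"EF": {"EF", "EC"}}

# Hypothetical in-memory form of one rule from auxiliary_grammar_rules.yaml.
RULES = [
    {
        "name": "aux_negation_adverb_past",
        "pattern": [
            {"token": "안", "tag": "MAG"},
            {"tag": "VV"},
            {"tag": "EP"},
            {"tag": "EF"},
        ],
        "result_tag": "VV",  # assumed field: tag given to the merged chunk
    },
]


def _matches(element, token, tag):
    """Check one pattern element against one (token, tag) pair."""
    if "token" in element and element["token"] != token:
        return False
    if "tag" in element:
        allowed = TAG_EQUIVALENTS.get(element["tag"], {element["tag"]})
        return tag in allowed
    return True


def chunk(tokens, rules=RULES):
    """Merge token runs that match a rule, preferring the longest pattern."""
    ordered = sorted(rules, key=lambda r: len(r["pattern"]), reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for rule in ordered:
            pattern = rule["pattern"]
            window = tokens[i : i + len(pattern)]
            if len(window) == len(pattern) and all(
                _matches(el, tok, tag) for el, (tok, tag) in zip(pattern, window)
            ):
                # Concatenate surface forms only; no contraction happens here.
                out.append(("".join(tok for tok, _ in window), rule["result_tag"]))
                i += len(window)
                break
        else:
            out.append(tokens[i])  # no rule matched at this position
            i += 1
    return out
```

Because the sketch only concatenates surface forms, the example input yields `안가았어요`; producing `안갔어요` would additionally require the contraction step.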
- Morphological Grouping (meaning)
- Greedily group linguistically meaningful units (typically noun/verb + ending)
  - e.g. `[("평화", "NNG"), ("롭", "XSA"), ("다", "EF")]` → `[("평화롭다", "VA")]`
- Morphophonological Contraction (sound)
- Contract verb stems (e.g. `하 + 였 → 했`)
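One way to approach contraction is Unicode Hangul syllable arithmetic plus a small table of irregular pairs. The sketch below is illustrative only. The function names are hypothetical, the irregular table holds just the `하 + 였 → 했` pair named above, and the same-vowel rule (for example `가 + 았 → 갔`) is an assumption about what the step covers.

```python
HANGUL_BASE = 0xAC00
NUM_MEDIALS, NUM_FINALS = 21, 28
NUM_SYLLABLES = 19 * NUM_MEDIALS * NUM_FINALS  # 11172 precomposed syllables
IEUNG = 11  # index of the silent initial ㅇ

# Irregular contractions listed explicitly.
IRREGULAR = {("하", "였"): "했"}


def decompose(ch):
    """Return (initial, medial, final) indices, or None if not a Hangul syllable."""
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        return None
    initial, rest = divmod(offset, NUM_MEDIALS * NUM_FINALS)
    medial, final = divmod(rest, NUM_FINALS)
    return initial, medial, final


def compose(initial, medial, final):
    """Build a precomposed Hangul syllable from its three indices."""
    return chr(HANGUL_BASE + (initial * NUM_MEDIALS + medial) * NUM_FINALS + final)


def contract(stem, ending):
    """Contract a verb stem with the ending that follows it, where possible."""
    if not stem or not ending:
        return stem + ending
    pair = (stem[-1], ending[0])
    if pair in IRREGULAR:
        return stem[:-1] + IRREGULAR[pair] + ending[1:]
    last, first = decompose(stem[-1]), decompose(ending[0])
    if last is None or first is None:
        return stem + ending
    s_init, s_med, s_fin = last
    e_init, e_med, e_fin = first
    # Assumed rule: an open stem syllable absorbs an ㅇ-initial ending
    # that carries the same vowel, keeping the ending's final consonant.
    if s_fin == 0 and e_init == IEUNG and s_med == e_med:
        return stem[:-1] + compose(s_init, s_med, e_fin) + ending[1:]
    return stem + ending
```

Stems that end in a final consonant, such as `먹`, are left uncontracted by this sketch.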
- Finalize tokens
- Substitute idiom placeholders with idioms
- Filter out excluded tokens (stopwords, particles, etc.)
- Extract and tag candidate Korean words
- Extract nouns and verbs from the final tokens
- Return both base and derived forms if applicable (e.g. `평화` (base) vs. `평화롭다` (derived))
- Final GPT integration
- `hanja_batcher`
  - Input all base words --> returns the Hanja form of a word (if applicable), Korean gloss, Pinyin, and English gloss
- `korean_analyzer`
  - Input the original query --> returns the English gloss
  - Input all words (use derived only if applicable) --> returns the word, part-of-speech, English gloss, and example sentence in Korean
- Romanization
- Decompose Hangul syllables into initial (초성), medial (중성), and final (종성)
- Apply phonological rules:
  - Handle `ㅎ` transformations
  - Apply consonant assimilation (e.g. 받침 + next consonant)
  - Split double final consonants
- Handle incomplete or partial Hangul/Jamo input (e.g. `ㅋㅋㅋ`, `ㅏㅏㅏ`)
- Output syllable-level romanization with hyphen separation (e.g. `an-nyeong-ha-se-yo`)
- Supports Hangul, English, and unknown letters
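The decomposition and hyphen-separated output can be sketched with Unicode Hangul arithmetic. This is a simplified sketch, not the project's romanizer. It assumes Revised Romanization-style letter tables, omits all phonological rules (`ㅎ` handling, assimilation, double-final splitting), and passes non-syllable characters through unchanged. `romanize` and the table names are hypothetical.

```python
# Letter tables indexed by Unicode jamo order (assumed Revised Romanization values).
INITIALS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
            "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
MEDIALS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
           "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
FINALS = ["", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
          "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"]

HANGUL_BASE = 0xAC00
NUM_SYLLABLES = 19 * 21 * 28


def romanize(text):
    """Romanize syllable by syllable, joining syllables in a word with hyphens."""
    words = []
    for word in text.split():
        parts = []
        prev_was_hangul = True
        for ch in word:
            offset = ord(ch) - HANGUL_BASE
            if 0 <= offset < NUM_SYLLABLES:
                # Split into initial (초성), medial (중성), and final (종성).
                initial, rest = divmod(offset, 21 * 28)
                medial, final = divmod(rest, 28)
                parts.append(INITIALS[initial] + MEDIALS[medial] + FINALS[final])
                prev_was_hangul = True
            elif parts and not prev_was_hangul:
                parts[-1] += ch  # extend the current run of non-Hangul characters
            else:
                parts.append(ch)  # start a new pass-through run
                prev_was_hangul = False
        words.append("-".join(parts))
    return " ".join(words)
```

For example, `romanize("안녕하세요")` returns `an-nyeong-ha-se-yo`. Because no phonological rules are applied, a word such as `좋아요` would come out with its `ㅎ` final romanized literally.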
- Romanization
- Sentence gloss
- Korean word information
- Hanja annotations
- Perform idiom detection using a hardcoded list to remove dependence on GPT
- Extend the grammar rules to support more patterns (non-auxiliary)
  - conjunctions (`일하고 나서`)
  - conditionals (`좋으면`)
  - supposition (`갈 것 같아`)
  - reported speech / quotatives (`공부하자고 했다`)
  - noun-modifying clauses (`보던 영화`)
  - etc.