KoEx/docs/language-analysis.md at main · andknam/KoEx

Language Analysis Overview

KoEx uses a multi-stage language analysis pipeline to convert raw Korean input into semantically meaningful components.

The pipeline powers both the standalone language analysis and the subtitle analysis integrated into the YouTube player.

Processing Flow

Preprocess
- Strip all non-Hangul characters
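A minimal sketch of the preprocessing step, assuming that "Hangul" covers precomposed syllables and compatibility Jamo and that whitespace is kept to preserve word boundaries. The function name `strip_non_hangul` is hypothetical and not taken from the KoEx codebase.

```python
import re

# Assumption: keep Hangul syllables (U+AC00-U+D7A3), compatibility Jamo
# (U+3131-U+318E), and whitespace; drop everything else.
_NON_HANGUL = re.compile(r"[^\uAC00-\uD7A3\u3131-\u318E\s]")


def strip_non_hangul(text: str) -> str:
    """Remove non-Hangul characters and collapse runs of whitespace."""
    cleaned = _NON_HANGUL.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_non_hangul("Hello, 안녕하세요! 123 ㅋㅋ"))
```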
Idioms (사자성어)
- Replace detected idioms with placeholders (＠＠{idx} using U+FF20 full-width ＠)
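The placeholder substitution could look roughly like the sketch below. It assumes the idioms have already been detected and are passed in as a list; `mask_idioms` is a hypothetical name, and only the ＠＠{idx} placeholder format comes from the description above.

```python
FULLWIDTH_AT = "\uFF20"  # full-width ＠, as used in the ＠＠{idx} placeholder


def mask_idioms(text: str, idioms: list[str]) -> tuple[str, list[str]]:
    """Replace each idiom present in `text` with ＠＠{idx}.

    Returns the masked text and the idioms in index order, so that the
    placeholders can be swapped back after tokenization.
    """
    found: list[str] = []
    # Try longer idioms first so a shorter one cannot split a longer one.
    for idiom in sorted(idioms, key=len, reverse=True):
        if idiom in text:
            placeholder = f"{FULLWIDTH_AT * 2}{len(found)}"
            text = text.replace(idiom, placeholder)
            found.append(idiom)
    return text, found


masked, found = mask_idioms("그는 일석이조 효과를 봤다", ["일석이조"])
print(masked, found)
```

Masking before tokenization keeps the tokenizer from splitting a four-character idiom into unrelated morphemes.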
Tokenize
- Use KoNLPy Komoran to tokenize input with placeholders
Grammar Chunking
- Merge auxiliary grammar using rules from auxiliary_grammar_rules.yaml
  - At each position, greedily apply the longest matching rule
  - Each rule specifies a sequence of tags (and optionally tokens) that represent a meaningful grammar chunk
  - Example (aux_negation_adverb_past)
    - pattern:
      - token: 안, tag: MAG
      - tag: VV
      - tag: EP
      - tag: EF (the TAG_EQUIVALENTS map lets this step match EF or EC)
    - Input: [('나', 'NP'), ('는', 'JX'), ('안', 'MAG'), ('가', 'VV'), ('았', 'EP'), ('어요', 'EC')]
    - Output: [('나', 'NP'), ('는', 'JX'), ('안갔어요', 'VV')]
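The greedy longest-match step can be sketched as follows. The in-memory `RULES` dict stands in for whatever auxiliary_grammar_rules.yaml actually contains, and the shape of `TAG_EQUIVALENTS` (rule tag mapped to the set of tags it may match) is an assumption. The sketch only finds the token groups; merging a group into a single surface form such as 안갔어요 is left out.

```python
# Hypothetical stand-in for rules loaded from auxiliary_grammar_rules.yaml.
RULES = {
    "aux_negation_adverb_past": [
        {"token": "안", "tag": "MAG"},
        {"tag": "VV"},
        {"tag": "EP"},
        {"tag": "EF"},
    ],
}

# Assumed shape: a rule tag maps to every tagger tag it is allowed to match.
TAG_EQUIVALENTS = {"EF": {"EF", "EC"}}


def step_matches(step, token, tag):
    """Check one rule step against one (token, tag) pair."""
    if "token" in step and step["token"] != token:
        return False
    return tag in TAG_EQUIVALENTS.get(step["tag"], {step["tag"]})


def matches_at(pattern, tokens, start):
    """True if `pattern` matches the tokens beginning at index `start`."""
    window = tokens[start:start + len(pattern)]
    if len(window) < len(pattern):
        return False
    return all(step_matches(s, tok, tag) for s, (tok, tag) in zip(pattern, window))


def chunk(tokens, rules=RULES):
    """Group tokens, trying the longest rule first at each position."""
    patterns = sorted((p for p in rules.values() if p), key=len, reverse=True)
    groups, i = [], 0
    while i < len(tokens):
        for pattern in patterns:
            if matches_at(pattern, tokens, i):
                groups.append(tokens[i:i + len(pattern)])
                i += len(pattern)
                break
        else:
            # No rule matched here: the token forms a group of its own.
            groups.append(tokens[i:i + 1])
            i += 1
    return groups


tokens = [("나", "NP"), ("는", "JX"), ("안", "MAG"),
          ("가", "VV"), ("았", "EP"), ("어요", "EC")]
for group in chunk(tokens):
    print(group)
```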
Morphological Grouping (meaning)
- Greedily group linguistically meaningful units (typically noun/verb + ending)
  - e.g. [("평화", "NNG"), ("롭", "XSA"), ("다", "EF")] → [("평화롭다", "VA")]
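One way to picture the grouping is a loop that starts at a stem and absorbs the suffixes and endings that follow it. The tag sets and the `RESULT_TAG` mapping below are illustrative assumptions, not the rules KoEx actually uses.

```python
# Illustrative tag sets (Sejong-style tags as emitted by Komoran).
STEM_TAGS = {"NNG", "VV", "VA"}
ABSORBED_TAGS = {"XSA", "XSV", "EP", "EF", "EC"}
# A derivational suffix changes the category of the whole unit.
RESULT_TAG = {"XSA": "VA", "XSV": "VV"}


def group_morphemes(tokens):
    """Greedily merge a stem with the suffixes and endings that follow it."""
    out, i = [], 0
    while i < len(tokens):
        surface, tag = tokens[i]
        j = i + 1
        if tag in STEM_TAGS:
            while j < len(tokens) and tokens[j][1] in ABSORBED_TAGS:
                suffix, suffix_tag = tokens[j]
                surface += suffix
                tag = RESULT_TAG.get(suffix_tag, tag)
                j += 1
        out.append((surface, tag))
        i = j
    return out


print(group_morphemes([("평화", "NNG"), ("롭", "XSA"), ("다", "EF")]))
```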
Morphophonological Contraction (sound)
- Contract verb stems (e.g. 하 + 였 → 했)
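As a rough illustration, contraction can combine a small table of irregular pairs with Unicode syllable arithmetic for the regular case where an open-syllable stem meets 았/었 with the same vowel (가 + 았 → 갔). Only 하 + 였 → 했 comes from the description above; the function name and the regular rule are assumptions, and both arguments are assumed to be non-empty.

```python
HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3
FINAL_SS = 20  # index of ㅆ in the 28-slot final-consonant table

# Irregular stem + ending pairs (only the documented example is listed).
IRREGULAR = {("하", "였"): "했"}


def _is_syllable(ch):
    return HANGUL_BASE <= ord(ch) <= HANGUL_LAST


def _medial_final(ch):
    """Return (medial index, final index) of a precomposed syllable."""
    idx = ord(ch) - HANGUL_BASE
    return (idx % 588) // 28, idx % 28


def contract(stem, ending):
    """Contract a verb stem with a following past-tense ending."""
    last, first = stem[-1], ending[0]
    if (last, first) in IRREGULAR:
        return stem[:-1] + IRREGULAR[(last, first)] + ending[1:]
    if first in ("았", "었") and _is_syllable(last):
        stem_medial, stem_final = _medial_final(last)
        ending_medial, _ = _medial_final(first)
        # Open syllable with the same vowel: attach ㅆ as the final consonant.
        if stem_final == 0 and stem_medial == ending_medial:
            return stem[:-1] + chr(ord(last) + FINAL_SS) + ending[1:]
    return stem + ending


print(contract("하", "였"), contract("가", "았어요"), contract("먹", "었"))
```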
Finalize tokens
- Substitute idiom placeholders with idioms
- Filter out excluded tokens (stopwords, particles, etc.)
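The finalization step might be sketched like this, assuming each placeholder survives tokenization as a single token whose surface is exactly ＠＠{idx}. The `EXCLUDED_TAGS` set and the `IDIOM` tag are hypothetical names chosen for illustration.

```python
import re

# Matches a whole-token placeholder such as ＠＠0 (U+FF20 twice, then the index).
PLACEHOLDER = re.compile("\uFF20\uFF20([0-9]+)")

# Hypothetical exclusion list: a few particle tags.
EXCLUDED_TAGS = {"JX", "JKS", "JKO", "JKB", "JKG"}


def finalize(tokens, idioms, stopwords=frozenset()):
    """Restore idioms from placeholders, then drop excluded tokens."""
    out = []
    for surface, tag in tokens:
        match = PLACEHOLDER.fullmatch(surface)
        if match:
            # "IDIOM" is an illustrative tag, not one from the tagger.
            out.append((idioms[int(match.group(1))], "IDIOM"))
        elif tag not in EXCLUDED_TAGS and surface not in stopwords:
            out.append((surface, tag))
    return out


tokens = [("나", "NP"), ("는", "JX"), ("\uFF20\uFF200", "NA"), ("좋아하다", "VV")]
print(finalize(tokens, ["일석이조"]))
```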
Extract and tag candidate Korean words
- Extract nouns and verbs from the final tokens
- Return both base and derived forms if applicable (e.g. 평화 (base) vs 평화롭다 (derived))
Final GPT integration
- hanja_batcher
  - Takes all base words and returns, for each one, the Hanja form (if applicable), Korean gloss, Pinyin, and English gloss
- korean_analyzer
  - Takes the original query and returns its English gloss
  - Takes all words (the derived form where one exists) and returns each word with its part of speech, English gloss, and an example sentence in Korean
Romanization
- Decompose Hangul syllables into initial (초성), medial (중성), and final (종성)
- Apply phonological rules:
  - Handle ㅎ transformations
  - Apply consonant assimilation (e.g. 받침 + next consonant)
  - Split double final consonants
- Handle incomplete or partial Hangul/Jamo input (e.g. ㅋㅋㅋ, ㅏㅏㅏ)
- Output syllable-level romanization with hyphen separation (e.g. an-nyeong-ha-se-yo)
- Supports Hangul, English, and unknown letters
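The decomposition and hyphenated output can be sketched with plain Unicode arithmetic. This simplified version romanizes each syllable in isolation using Revised Romanization letter values, so it deliberately omits the ㅎ transformations and consonant assimilation listed above, reduces double finals to a single approximate letter, and passes non-syllable characters through unchanged. All names here are hypothetical.

```python
HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3

# Revised Romanization values, indexed by Unicode jamo order.
INITIALS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
            "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
MEDIALS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
           "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
# Finals use their syllable-final sound; double finals are approximated.
FINALS = ["", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
          "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"]


def decompose(ch):
    """Split a precomposed syllable into (initial, medial, final) indices."""
    idx = ord(ch) - HANGUL_BASE
    return idx // 588, (idx % 588) // 28, idx % 28


def romanize_syllable(ch):
    initial, medial, final = decompose(ch)
    return INITIALS[initial] + MEDIALS[medial] + FINALS[final]


def romanize(text):
    """Romanize syllable by syllable, joining units in a word with hyphens."""
    words = []
    for word in text.split():
        units, passthrough = [], ""
        for ch in word:
            if HANGUL_BASE <= ord(ch) <= HANGUL_LAST:
                if passthrough:
                    units.append(passthrough)
                    passthrough = ""
                units.append(romanize_syllable(ch))
            else:
                # Non-syllable characters are kept as-is in this sketch.
                passthrough += ch
        if passthrough:
            units.append(passthrough)
        words.append("-".join(units))
    return " ".join(words)


print(romanize("안녕하세요"))  # an-nyeong-ha-se-yo
```

Because each syllable is handled independently, words whose pronunciation depends on neighbouring syllables will come out differently from a version that applies the phonological rules.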

Output Layers

Romanization
Sentence gloss
Korean word information
Hanja annotations

Future Improvements

Perform idiom detection using a hardcoded list to remove dependence on GPT
Extend the grammar rules to support more patterns (non-auxiliary)
- conjunctions (일하고 나서)
- conditionals (좋으면)
- supposition (갈 것 같아)
- reported speech / quotatives (공부하자고 했다)
- noun-modifying clauses (보던 영화)
- etc.
