Phase 2 — Translation, Tokenization, Readings, and Word Model #2
Open
Purpose
Implement translation, tokenization, reading generation, and a local word database that stores word metadata for learning features.
Tasks
- 2.1 Translator class (Sugoi/CTranslate2 wrapper)
  - File: `src/translation/translator.py`
  - Work: class `Translator` with `load_model(device='cpu', quantized=True)`, `translate(texts: List[str]) -> List[TranslationResult]`. Implement batching and error handling.
  - Tests: `tests/test_translator.py` with mocked model responses and batching tests.
  - DoD: translator API stable and documented; unit tests pass.
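A minimal sketch of how the batching and error handling in 2.1 might look. It does not call Sugoi or CTranslate2: the model is replaced by an injected `backend` callable, and the `TranslationResult` fields, the `batch_size` parameter, and the echo placeholder in `load_model` are all assumptions made for illustration, not part of the issue.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TranslationResult:
    # Hypothetical result shape; the issue does not define the fields.
    source: str
    text: Optional[str]          # None when translation failed
    error: Optional[str] = None  # error message for failed items


# Hypothetical backend type: takes a batch of strings, returns one output each.
Backend = Callable[[List[str]], List[str]]


class Translator:
    def __init__(self, backend: Optional[Backend] = None, batch_size: int = 8):
        if batch_size < 1:
            raise ValueError("batch_size must be >= 1")
        self._backend = backend
        self._batch_size = batch_size

    def load_model(self, device: str = "cpu", quantized: bool = True) -> None:
        # A real implementation would build the Sugoi/CTranslate2 model here.
        # This placeholder only installs an echo backend if none was injected.
        if self._backend is None:
            self._backend = lambda batch: list(batch)

    def translate(self, texts: List[str]) -> List[TranslationResult]:
        if self._backend is None:
            raise RuntimeError("call load_model() before translate()")
        results: List[TranslationResult] = []
        for start in range(0, len(texts), self._batch_size):
            batch = texts[start:start + self._batch_size]
            try:
                outputs = self._backend(batch)
                if len(outputs) != len(batch):
                    raise ValueError("backend returned a wrong number of outputs")
                results.extend(TranslationResult(s, o) for s, o in zip(batch, outputs))
            except Exception as exc:
                # A failing batch is reported per item instead of aborting the call.
                results.extend(TranslationResult(s, None, str(exc)) for s in batch)
        return results
```

Injecting the backend keeps the unit tests independent of any model files: a test can pass a fake callable and assert on the batches it receives.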
- 2.2 Tokenization service using fugashi
  - File: `src/translation/tokenizer.py`
  - Work: `tokenize(text: str) -> List[Token]` where `Token` contains surface, normalized form, POS. Wrap fugashi; provide fallback in test mode.
  - Tests: `tests/test_tokenizer.py` for sample Japanese strings.
  - DoD: tokens include surface and POS suitable for readings.
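One possible shape for the tokenizer in 2.2, as a sketch. The `test_mode` parameter and the regex-based fallback are assumptions; the fallback only splits on script boundaries and is not a morphological analysis. The fugashi branch assumes a UniDic dictionary (for example `unidic-lite`) is installed, since the `lemma` feature name comes from UniDic.

```python
import re
from dataclasses import dataclass
from functools import lru_cache
from typing import List


@dataclass(frozen=True)
class Token:
    surface: str
    normalized: str  # lemma when the dictionary provides one, else the surface
    pos: str


# Fallback only: runs of kanji, hiragana, katakana, or other non-space text.
_SCRIPT_RUNS = re.compile(
    r"[\u4e00-\u9fff]+|[\u3040-\u309f]+|[\u30a0-\u30ff]+"
    r"|[^\s\u3040-\u30ff\u4e00-\u9fff]+"
)


@lru_cache(maxsize=1)
def _tagger():
    # Imported lazily so this module still loads where fugashi is not installed.
    from fugashi import Tagger
    return Tagger()


def tokenize(text: str, test_mode: bool = False) -> List[Token]:
    if test_mode:
        # Lightweight stand-in for CI: no dictionary, POS is unknown.
        return [Token(m.group(), m.group(), "UNK") for m in _SCRIPT_RUNS.finditer(text)]
    return [
        Token(w.surface, w.feature.lemma or w.surface, w.pos)
        for w in _tagger()(text)
    ]
```

Caching the tagger avoids reloading the dictionary on every call, which matters if `tokenize` runs once per translated line.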
- 2.3 Reading generation (kana + romaji)
  - File: `src/translation/readings.py`
  - Work: `generate_readings(surface: str) -> (kana: str, romaji: str)` using pykakasi or fallback library.
  - Tests: `tests/test_readings.py` for several kanji/katakana samples.
  - DoD: readings are produced deterministically.
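A sketch of `generate_readings` from 2.3 built on pykakasi's `kakasi().convert()`, which returns a list of dicts with keys such as `hira` and `hepburn`. Returning hiragana for the kana reading and Hepburn for romaji is an assumption, as is the optional `convert` parameter, which exists here only so tests can substitute a fake converter.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical converter type matching the shape of pykakasi's convert() output.
Converter = Callable[[str], List[Dict[str, str]]]


def generate_readings(surface: str, convert: Optional[Converter] = None) -> Tuple[str, str]:
    if convert is None:
        # Imported lazily so tests with a fake converter do not need pykakasi.
        import pykakasi
        convert = pykakasi.kakasi().convert
    parts = convert(surface)
    kana = "".join(p["hira"] for p in parts)       # hiragana reading
    romaji = "".join(p["hepburn"] for p in parts)  # Hepburn romanization
    return kana, romaji
```

The function keeps no state between calls, so a given converter yields the same result for the same input, which is what the DoD's determinism requirement asks for.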
- 2.4 Word DB schema and basic CRUD
  - Files: `src/words/models.py`, `src/words/database.py`
  - Work: Create SQLite schema: `words(id, surface, kana, romaji, first_seen, last_seen, known BOOLEAN)` and migration/initialization helper.
  - Tests: `tests/test_database.py` verifying uniqueness constraints and upsert behavior.
  - DoD: DB init works and basic upsert is tested.
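The schema and upsert in 2.4 could be sketched as follows with the standard `sqlite3` module. The column types, the `UNIQUE` constraint on `surface`, ISO 8601 text timestamps, and the helper names `init_db` and `upsert_word` are assumptions; the issue only lists the column names. The `ON CONFLICT ... DO UPDATE` clause requires SQLite 3.24 or newer.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS words (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    surface    TEXT NOT NULL UNIQUE,   -- assumed uniqueness key
    kana       TEXT,
    romaji     TEXT,
    first_seen TEXT NOT NULL,          -- ISO 8601 timestamp
    last_seen  TEXT NOT NULL,
    known      BOOLEAN NOT NULL DEFAULT 0
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    # Idempotent: safe to call on an existing database file.
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def upsert_word(conn: sqlite3.Connection, surface: str, kana: str, romaji: str,
                now: Optional[str] = None) -> int:
    now = now or datetime.now(timezone.utc).isoformat()
    # Insert a new row, or refresh readings and last_seen on an existing one.
    # first_seen and known are left untouched on conflict.
    conn.execute(
        """
        INSERT INTO words (surface, kana, romaji, first_seen, last_seen)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(surface) DO UPDATE SET
            kana = excluded.kana,
            romaji = excluded.romaji,
            last_seen = excluded.last_seen
        """,
        (surface, kana, romaji, now, now),
    )
    conn.commit()
    return conn.execute("SELECT id FROM words WHERE surface = ?", (surface,)).fetchone()[0]
```

Passing `now` explicitly lets a test pin timestamps and check that a second upsert updates `last_seen` without creating a duplicate row.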
- 2.5 WordManager API
  - File: `src/words/word_manager.py`
  - Work: `add_word(surface, kana, romaji, context)`, `mark_known(word_id, known=True)`, `list_recent(limit=50)`.
  - Tests: `tests/test_word_manager.py` asserting persistence and queries.
  - DoD: WordManager used by translator/pipeline to persist words.
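A rough sketch of the three `WordManager` methods from 2.5 over an SQLite connection. Several details are assumptions: the `words` schema in 2.4 has no column for `context`, so this sketch stores it in a hypothetical `word_contexts` table; the injectable `clock` exists only to make tests deterministic; and "recent" is taken to mean ordered by `last_seen`.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Callable, List, Tuple


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


class WordManager:
    def __init__(self, conn: sqlite3.Connection, clock: Callable[[], str] = _utc_now):
        self._conn = conn
        self._clock = clock
        conn.executescript("""
            CREATE TABLE IF NOT EXISTS words (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                surface TEXT NOT NULL UNIQUE,
                kana TEXT, romaji TEXT,
                first_seen TEXT NOT NULL, last_seen TEXT NOT NULL,
                known BOOLEAN NOT NULL DEFAULT 0
            );
            -- Hypothetical table: the issue does not say where context is stored.
            CREATE TABLE IF NOT EXISTS word_contexts (
                word_id INTEGER NOT NULL REFERENCES words(id),
                context TEXT NOT NULL,
                seen_at TEXT NOT NULL
            );
        """)

    def add_word(self, surface: str, kana: str, romaji: str, context: str) -> int:
        now = self._clock()
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                """INSERT INTO words (surface, kana, romaji, first_seen, last_seen)
                   VALUES (?, ?, ?, ?, ?)
                   ON CONFLICT(surface) DO UPDATE SET last_seen = excluded.last_seen""",
                (surface, kana, romaji, now, now),
            )
            word_id = self._conn.execute(
                "SELECT id FROM words WHERE surface = ?", (surface,)
            ).fetchone()[0]
            if context:
                self._conn.execute(
                    "INSERT INTO word_contexts (word_id, context, seen_at) VALUES (?, ?, ?)",
                    (word_id, context, now),
                )
        return word_id

    def mark_known(self, word_id: int, known: bool = True) -> None:
        with self._conn:
            cur = self._conn.execute(
                "UPDATE words SET known = ? WHERE id = ?", (int(known), word_id)
            )
        if cur.rowcount == 0:
            raise KeyError(word_id)

    def list_recent(self, limit: int = 50) -> List[Tuple]:
        # Rows are (id, surface, kana, romaji, known), most recently seen first.
        return self._conn.execute(
            """SELECT id, surface, kana, romaji, known FROM words
               ORDER BY last_seen DESC, id DESC LIMIT ?""",
            (limit,),
        ).fetchall()
```

In a pipeline, the tokenizer and reading generator would feed `add_word` once per token, with the source sentence passed as `context`.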
Notes
- Keep model-specific code optional (allow `mode=test` to avoid GPU or heavy model loads during CI).