Phase 2 — Translation, Tokenization, Readings, and Word Model #2

@Swiftburn

Purpose

Implement translation, tokenization, reading generation, and a local word database that stores word metadata for learning features.

Tasks

  • 2.1 Translator class (Sugoi/CTranslate2 wrapper)

    • File: src/translation/translator.py
    • Work: Implement class Translator with load_model(device='cpu', quantized=True) and translate(texts: List[str]) -> List[TranslationResult], including batching and error handling.
    • Tests: tests/test_translator.py with mocked model responses and batching tests.
    • DoD: translator API stable and documented; unit tests pass.
  • 2.2 Tokenization service using fugashi

    • File: src/translation/tokenizer.py
    • Work: Implement tokenize(text: str) -> List[Token], where Token contains the surface form, normalized form, and part of speech (POS). Wrap fugashi and provide a fallback for test mode.
    • Tests: tests/test_tokenizer.py for sample Japanese strings.
    • DoD: tokens include surface and POS suitable for readings.
  • 2.3 Reading generation (kana + romaji)

    • File: src/translation/readings.py
    • Work: Implement generate_readings(surface: str) -> Tuple[str, str], returning (kana, romaji), using pykakasi or a fallback library.
    • Tests: tests/test_readings.py for several kanji/katakana samples.
    • DoD: readings are produced deterministically.
  • 2.4 Word DB schema and basic CRUD

    • Files: src/words/models.py, src/words/database.py
    • Work: Create the SQLite schema words(id, surface, kana, romaji, first_seen, last_seen, known BOOLEAN) and a migration/initialization helper.
    • Tests: tests/test_database.py verifying uniqueness constraints and upsert behavior.
    • DoD: DB init works and basic upsert is tested.
  • 2.5 WordManager API

    • File: src/words/word_manager.py
    • Work: Implement add_word(surface, kana, romaji, context), mark_known(word_id, known=True), and list_recent(limit=50).
    • Tests: tests/test_word_manager.py asserting persistence and queries.
    • DoD: WordManager is used by the translator/pipeline to persist words.
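
As a rough illustration of tasks 2.4 and 2.5, the schema and the WordManager calls could be sketched with the standard-library sqlite3 module. Several details here are assumptions, not decisions from this issue: the UNIQUE constraint sits on surface, timestamps are stored as ISO-8601 text, the upsert uses ON CONFLICT (which needs SQLite 3.24 or newer), and the context argument is accepted but not stored because the listed schema has no column for it.

```python
import sqlite3
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Columns follow the schema listed in task 2.4.
# Assumption: uniqueness is enforced on `surface`.
SCHEMA = """
CREATE TABLE IF NOT EXISTS words (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    surface    TEXT NOT NULL UNIQUE,
    kana       TEXT NOT NULL,
    romaji     TEXT NOT NULL,
    first_seen TEXT NOT NULL,
    last_seen  TEXT NOT NULL,
    known      BOOLEAN NOT NULL DEFAULT 0
);
"""

WordRow = Tuple[int, str, str, str, int]  # (id, surface, kana, romaji, known)


class WordManager:
    def __init__(self, path: str = ":memory:") -> None:
        # ":memory:" keeps the sketch self-contained; a real path would persist.
        self.conn = sqlite3.connect(path)
        self.conn.execute(SCHEMA)
        self.conn.commit()

    def add_word(self, surface: str, kana: str, romaji: str,
                 context: Optional[str] = None) -> int:
        # Assumption: `context` is ignored because the schema has no column for it.
        now = datetime.now(timezone.utc).isoformat(timespec="microseconds")
        # Upsert: insert a new row, or refresh readings and last_seen on repeat.
        self.conn.execute(
            """
            INSERT INTO words (surface, kana, romaji, first_seen, last_seen)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT(surface) DO UPDATE SET
                kana = excluded.kana,
                romaji = excluded.romaji,
                last_seen = excluded.last_seen
            """,
            (surface, kana, romaji, now, now),
        )
        self.conn.commit()
        row = self.conn.execute(
            "SELECT id FROM words WHERE surface = ?", (surface,)
        ).fetchone()
        return row[0]

    def mark_known(self, word_id: int, known: bool = True) -> None:
        self.conn.execute(
            "UPDATE words SET known = ? WHERE id = ?", (int(known), word_id)
        )
        self.conn.commit()

    def list_recent(self, limit: int = 50) -> List[WordRow]:
        # Most recently seen first; id breaks ties between equal timestamps.
        cursor = self.conn.execute(
            "SELECT id, surface, kana, romaji, known FROM words "
            "ORDER BY last_seen DESC, id DESC LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
```

Adding the same surface twice returns the same id and only updates last_seen and the readings, which is the upsert behavior tests/test_database.py is meant to verify. If homographs with different readings need separate rows, the uniqueness constraint would have to cover (surface, kana) instead.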

Notes

  • Keep model-specific code optional: support mode=test so that CI avoids GPU use and heavy model loads.
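
One way the test mode could look for the Translator from task 2.1 is sketched below. It is only an outline under stated assumptions: the fields of TranslationResult, the mode and batch_size constructor arguments, and the "[test]" echo backend are all hypothetical, and the real Sugoi/CTranslate2 loading is deliberately left unimplemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TranslationResult:
    # Hypothetical fields; the issue does not define this type.
    source: str
    text: str
    error: Optional[str] = None


class Translator:
    def __init__(self, mode: str = "real", batch_size: int = 8) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.mode = mode
        self.batch_size = batch_size
        self._backend: Optional[Callable[[List[str]], List[str]]] = None

    def load_model(self, device: str = "cpu", quantized: bool = True) -> None:
        if self.mode == "test":
            # Test mode: a cheap echo backend, so CI never loads a model.
            self._backend = lambda batch: [f"[test] {t}" for t in batch]
            return
        # The real backend (Sugoi/CTranslate2) is outside the scope of this sketch.
        raise NotImplementedError("real model loading is not sketched here")

    def translate(self, texts: List[str]) -> List[TranslationResult]:
        if self._backend is None:
            raise RuntimeError("call load_model() before translate()")
        results: List[TranslationResult] = []
        # Process the input in fixed-size batches, preserving order.
        for start in range(0, len(texts), self.batch_size):
            batch = texts[start:start + self.batch_size]
            try:
                outputs = self._backend(batch)
                results.extend(
                    TranslationResult(src, out) for src, out in zip(batch, outputs)
                )
            except Exception as exc:
                # A failing batch yields error results instead of aborting the call.
                results.extend(
                    TranslationResult(src, "", error=str(exc)) for src in batch
                )
        return results
```

With this shape, tests/test_translator.py could construct Translator(mode="test", batch_size=2), call load_model(), and assert on output order and per-batch error handling without any model files present.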
