
Remove language limits: replace spaCy with language-independent tokenization#5

Open
azaripov1-oss wants to merge 15 commits into main from feature/multilingualism_support

Conversation

@azaripov1-oss
Collaborator

Closes #1

  • Removed the spaCy dependency: tokenization is now handled by a Unicode-aware regex tokenizer that splits CJK characters and keeps combining marks attached, with no language-specific configuration required.
  • LLM prompt fallback: languages without dedicated prompts (only en/ru have them) fall back to the English prompts, with an added instruction to generate text in the same language as the examples.
  • AncSetFit config: an explicit template is now required for non-en/ru languages; a warning is emitted when default templates are used.
  • NER data provider: _parse_record() supports three input formats (tokens+labels, text+spans, bracket format).
  • NER evaluation fix: added _predict_from_tokens so that predictions are aligned with the gold token boundaries during evaluation.
  • Tokenizer tests: offset consistency, no-overlap, and full-coverage tests across six languages (en, ru, zh, ar, hi, fr).
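A minimal sketch of the kind of Unicode-aware regex tokenizer described in the first bullet (illustrative, not the PR's actual code): each CJK/kana character becomes its own token, other letter runs stay whole with a basic combining-diacritics range kept attached to the base letter, and character offsets come directly from the regex match.

```python
import re

# CJK/kana ranges split one character per token; a real tokenizer would
# cover more blocks (Hangul, etc.) -- this is a trimmed-down sketch.
CJK = "\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff"
MARKS = "\u0300-\u036f"  # basic combining diacritics; full \p{M} coverage omitted

TOKEN_RE = re.compile(
    f"[{CJK}]"                          # one token per CJK character
    f"|(?:[^\\W\\d_{CJK}][{MARKS}]*)+"  # non-CJK letter run, marks attached
    "|\\d+"                             # digit run
)

def tokenize(text):
    """Return (token, start, end) triples with non-overlapping offsets."""
    return [(m.group(0), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]
```

For example, `tokenize("中文 test")` yields single-character tokens for the CJK part and one token for `test`, which is what the offset-consistency and no-overlap tests in the last bullet would exercise.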
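The prompt-fallback behaviour can be sketched as below; the prompt strings and the `get_prompt` helper are assumptions for illustration, not the repository's actual prompt texts.

```python
# Only "en" and "ru" have dedicated prompts; any other language code reuses
# the English prompt with an added same-language instruction (hypothetical
# wording, mirroring the fallback described in the PR).
PROMPTS = {
    "en": "Generate a short text similar to the examples below.",
    "ru": "Сгенерируй короткий текст, похожий на примеры ниже.",
}

# Appended only when falling back to the English prompt.
FALLBACK_SUFFIX = (
    " The examples may be in any language; generate the text in the "
    "same language as the examples."
)

def get_prompt(lang):
    if lang in PROMPTS:
        return PROMPTS[lang]
    return PROMPTS["en"] + FALLBACK_SUFFIX
```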
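A sketch of the three input formats `_parse_record()` is said to accept; the field names (`tokens`, `labels`, `text`, `spans`) and the bracket syntax `[LABEL entity text]` are assumptions, since the PR description does not spell them out.

```python
import re

BRACKET_RE = re.compile(r"\[(\w+) ([^\]]+)\]")

def _parse_record(record):
    """Normalize a record to {"text": str, "spans": [(start, end, label)]}."""
    # Format 1: parallel token/label lists (BIO-style); tokens joined with
    # single spaces. Adjacent B-/I- tokens stay separate spans in this sketch.
    if isinstance(record, dict) and "tokens" in record and "labels" in record:
        text, spans = "", []
        for tok, lab in zip(record["tokens"], record["labels"]):
            if text:
                text += " "
            start = len(text)
            text += tok
            if lab != "O":
                spans.append((start, len(text), lab.split("-")[-1]))
        return {"text": text, "spans": spans}
    # Format 2: raw text with explicit character spans.
    if isinstance(record, dict) and "spans" in record:
        return {"text": record["text"],
                "spans": [tuple(s) for s in record["spans"]]}
    # Format 3: inline bracket markup, e.g. "[PER Ada] wrote programs".
    raw = record["text"] if isinstance(record, dict) else record
    text, spans, last = "", [], 0
    for m in BRACKET_RE.finditer(raw):
        text += raw[last:m.start()]
        start = len(text)
        text += m.group(2)
        spans.append((start, len(text), m.group(1)))
        last = m.end()
    return {"text": text + raw[last:], "spans": spans}
```

All three formats normalize to the same text-plus-character-spans shape, which is what lets one data provider feed the same downstream pipeline.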
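The evaluation alignment fix can be sketched like this: instead of letting the model re-tokenize raw text, `_predict_from_tokens` rebuilds the text from the gold tokens and maps predicted character spans back onto those exact token boundaries. The `predict_spans` callable stands in for the model's span predictor and is a name assumed here, not taken from the PR.

```python
def _predict_from_tokens(tokens, predict_spans):
    """Return one label per gold token; predict_spans(text) -> [(start, end, label)]."""
    # Rebuild the text from the gold tokens, recording each token's offsets.
    text, offsets = "", []
    for tok in tokens:
        if text:
            text += " "
        start = len(text)
        text += tok
        offsets.append((start, len(text)))
    # Project predicted character spans onto the gold token boundaries.
    labels = ["O"] * len(tokens)
    for start, end, label in predict_spans(text):
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start < end and start < tok_end:  # token overlaps the span
                labels[i] = label
    return labels
```

Because both prediction and gold labels are now expressed over the same token sequence, the evaluation compares like with like, regardless of how the model would have segmented the raw text.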
