@ssvb ssvb commented Nov 13, 2025

Issues

#2155

Description

This is not intended to be merged. It's just the simplest test to see whether the Harper engine can handle one more language in addition to English without degrading performance or increasing the memory footprint too much.

The Belarusian language uses the Cyrillic alphabet, which is different from the Latin alphabet used for English, so it doesn't clash with the existing English language support. The Belarusian dictionary has been taken from https://github.com/375gnu/spell-be-tarask
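To illustrate why the two scripts don't clash, here is a minimal sketch (not Harper's actual tokenizer) that splits text into words and classifies each word by script. The names `Script`, `is_cyrillic`, `classify` and `words` are hypothetical, and treating the basic Cyrillic block U+0400..U+04FF as "Cyrillic" is a simplifying assumption made for this example.

```rust
/// Coarse script classification of a single word (illustrative only).
#[derive(Debug, PartialEq)]
enum Script {
    Latin,
    Cyrillic,
    Other,
}

/// Assumption: only the basic Cyrillic block (U+0400..=U+04FF) is considered.
fn is_cyrillic(c: char) -> bool {
    ('\u{0400}'..='\u{04FF}').contains(&c)
}

/// Classify a word by the script of its letters; mixed or letterless input is `Other`.
fn classify(word: &str) -> Script {
    let letters: Vec<char> = word.chars().filter(|c| c.is_alphabetic()).collect();
    if letters.is_empty() {
        Script::Other
    } else if letters.iter().all(|c| c.is_ascii_alphabetic()) {
        Script::Latin
    } else if letters.iter().all(|&c| is_cyrillic(c)) {
        Script::Cyrillic
    } else {
        Script::Other
    }
}

/// Split text into runs of alphabetic characters (a deliberately naive tokenizer).
fn words(text: &str) -> Vec<&str> {
    text.split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
        .collect()
}

fn main() {
    // The sample sentence is the one used for testing in this PR.
    for w in words("Ths is an test. Гта тэст.") {
        println!("{w}: {:?}", classify(w));
    }
}
```

Because no word is both Latin and Cyrillic under this classification, a per-word check like this could route each word to the matching dictionary without the English rules ever seeing Belarusian words.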

I could have added a simple Belarusian linter rule as well, but I would prefer to keep it simple until the seemingly unreasonable memory footprint issue is resolved. The Belarusian wordlist is ~20MB in plain text format. Using the fst-bin command line tool (which accompanies the fst crate used by Harper), the Belarusian spellchecker dictionary can be compressed to a mere ~1.1MB binary. At least that's the theory.

Demo

How Has This Been Tested?

I used the just lint command to check short text files, such as "Ths is an test. Гта тэст."
I also used the Firefox plugin to check the same short texts in https://textarea.online

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes

@ssvb ssvb force-pushed the belarusian_language_preview branch 2 times, most recently from accc3ea to 22406f4 Compare November 13, 2025 23:26
ssvb added 2 commits November 14, 2025 01:43
So that the Belarusian words written using the Cyrillic
script are now tokenized.

This is not a proper patch, because the Cyrillic alphabet
is added to the English language support code. But it's
very simple and small, so it's useful for preview/testing
purposes.
Taken from the https://github.com/375gnu/spell-be-tarask repository.
Creative Commons Attribution-ShareAlike 3.0 Unported License

wget -O - https://github.com/375gnu/spell-be-tarask/raw/refs/tags/v0.62/wordlist \
     | ruby -e "while l = gets ; l.split(', ').each {|w| puts w } ; end" \
     | LC_ALL=C sort > wordlist-be-tarask-sorted.txt
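For readers who prefer Rust to the shell pipeline, the splitting and sorting steps above can be sketched as follows. This is an illustration under assumptions, not part of the PR: the function name `flatten_and_sort` is hypothetical, the input format (several words per line separated by ", ") is inferred from the Ruby one-liner, the sample words are made up, and unlike the one-liner this sketch drops empty entries.

```rust
/// Split lines of the form "a, b, c" into one word per entry and sort them bytewise.
fn flatten_and_sort(raw: &str) -> Vec<String> {
    let mut words: Vec<String> = raw
        .lines()
        .flat_map(|line| line.split(", "))
        .filter(|w| !w.is_empty()) // deviation from the one-liner: skip empty entries
        .map(str::to_owned)
        .collect();
    // `String` ordering compares raw bytes, which matches `LC_ALL=C sort`.
    words.sort();
    words
}

fn main() {
    // Made-up sample input in the assumed wordlist format.
    let raw = "яблык, абак\nвада, zebra";
    for w in flatten_and_sort(raw) {
        println!("{w}");
    }
}
```

The bytewise order matters because the fst builder expects its keys in lexicographic byte order, and a locale-aware sort could order Cyrillic words differently.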

Now at least spellchecking works for the Belarusian words. English
linters are expected to still keep working properly, because none
of the linters is going to match Cyrillic words.
@ssvb ssvb force-pushed the belarusian_language_preview branch from 01399d9 to d573e5d Compare November 13, 2025 23:44
This is done only partially, because Harper's condition field
regexp pattern matching code appears to be buggy for multibyte UTF-8
@ssvb ssvb force-pushed the belarusian_language_preview branch from d573e5d to 2d1c7b5 Compare November 13, 2025 23:57
@elijah-potter
Collaborator

This is super cool. I'd love to help with any modularity changes that would need to be made to make this production-ready. How would you characterize the result of your experiment?
