The Belarusian language support (at least just spellchecking for it) #2195

ssvb · 2025-11-13T19:50:47Z

Issues

Description

This is not intended to be merged. It's just the simplest test of whether the Harper engine can shoulder the load of supporting one more language in addition to English without degrading performance or increasing the memory footprint too much.

The Belarusian language is written in the Cyrillic alphabet rather than the Latin one, so it doesn't clash with the existing English language support. The Belarusian dictionary has been taken from https://github.com/375gnu/spell-be-tarask

I could have added a simple Belarusian linter rule as well, but would prefer to keep it simple until the seemingly unreasonable memory footprint issue is resolved. The Belarusian wordlist is ~20MB in plain text format. Using the fst-bin command line tool (which accompanies the fst crate used by Harper), the Belarusian spellchecker dictionary can be compressed to a binary of merely ~1.1MB. At least that's the theory.

Demo

How Has This Been Tested?

Used the just lint command to check short text files, such as "Ths is an test. Гта тэст."
Also used the Firefox plugin to check the same short texts at https://textarea.online

Checklist

I have performed a self-review of my own code
I have added tests to cover my changes

Taken from the https://github.com/375gnu/spell-be-tarask repository (Creative Commons Attribution-ShareAlike 3.0 Unported License):

    wget -O - https://github.com/375gnu/spell-be-tarask/raw/refs/tags/v0.62/wordlist \
      | ruby -e "while l = gets ; l.split(', ').each {|w| puts w } ; end" \
      | LC_ALL=C sort > wordlist-be-tarask-sorted.txt

Now at least spellchecking works for the Belarusian words. English linters are expected to keep working properly, because none of the linters is going to match Cyrillic words.
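The flattening and sorting steps of that pipeline can be sketched in Rust using only the standard library. This is an illustration, not code from the PR: the function name `flatten_and_sort` is hypothetical, and unlike the Ruby one-liner it drops empty entries. Rust compares `str` values byte by byte, which gives the same order as `LC_ALL=C sort` on UTF-8 input. That byte order matters because the fst crate expects keys to be inserted in lexicographic byte order.

```rust
/// Flatten lines of comma-separated word forms into one word per entry
/// and sort the result by raw UTF-8 bytes.
/// Hypothetical helper for illustration; not part of Harper or fst.
fn flatten_and_sort(input: &str) -> Vec<&str> {
    let mut words: Vec<&str> = input
        .lines()
        .flat_map(|line| line.split(", "))
        // Assumption: empty entries are not wanted in the dictionary.
        .filter(|w| !w.is_empty())
        .collect();
    // `str` ordering in Rust is bytewise, like the C locale.
    words.sort_unstable();
    words
}

fn main() {
    // Two input lines, each holding comma-separated forms.
    let input = "б, а\nя, z\n";
    for word in flatten_and_sort(input) {
        println!("{word}");
    }
}
```

Latin letters sort before Cyrillic ones here because their UTF-8 encodings start with smaller byte values.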

elijah-potter · 2025-12-04T15:29:55Z

This is super cool. I'd love to help with any modularity changes that would need to be made to make this production-ready. How would you characterize the result of your experiment?

ssvb force-pushed the belarusian_language_preview branch 2 times, most recently from accc3ea to 22406f4 · November 13, 2025 23:26

ssvb added 2 commits November 14, 2025 01:43

Add Cyrillic support

fb40720

Belarusian words written in the Cyrillic script are now tokenized. This is not a proper patch, because the Cyrillic alphabet is added to the English language support code. But it's very simple and small, so it's useful for preview/testing purposes.
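As a rough illustration of what tokenizing Cyrillic alongside Latin text involves, here is a minimal standard-library sketch. It is not Harper's lexer: the names `words` and `is_cyrillic` are hypothetical, and the check covers only the basic Cyrillic Unicode block (U+0400 to U+04FF), which is an assumption about what a real implementation would need.

```rust
/// Split text into word tokens, treating any Unicode alphabetic
/// character (Latin or Cyrillic alike) as part of a word.
/// Hypothetical helper for illustration; not Harper's tokenizer.
fn words(text: &str) -> Vec<&str> {
    text.split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
        .collect()
}

/// True if every character lies in the basic Cyrillic block.
/// Assumption: U+0400..=U+04FF is enough for this sketch.
fn is_cyrillic(word: &str) -> bool {
    word.chars().all(|c| ('\u{0400}'..='\u{04FF}').contains(&c))
}

fn main() {
    // The same sample text used for manual testing in this PR.
    let text = "Ths is an test. Гта тэст.";
    for w in words(text) {
        let script = if is_cyrillic(w) { "Cyrillic" } else { "other" };
        println!("{w}: {script}");
    }
}
```

Because the two scripts occupy disjoint character ranges, a per-word script check like this is one way a dictionary lookup could be routed to the right language.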

ssvb force-pushed the belarusian_language_preview branch from 01399d9 to d573e5d · November 13, 2025 23:44

Affix compress the Belarusian dictionary

2d1c7b5

This is done only partially, because Harper's regexp pattern matching code for the condition field appears to be buggy for multibyte UTF-8.
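The PR does not describe the bug itself, so the following is not a diagnosis. It only illustrates the general pitfall that multibyte text exposes: byte offsets and character offsets diverge for Cyrillic, where each letter takes two bytes in UTF-8. The helper `last_chars` is hypothetical and uses only the standard library.

```rust
/// Return the last `n` characters of `word` (not the last `n` bytes).
/// Hypothetical helper for illustration; not taken from Harper.
fn last_chars(word: &str, n: usize) -> &str {
    if n == 0 {
        return "";
    }
    // Walk backwards over (byte_offset, char) pairs to find the
    // byte offset at which the n-th character from the end starts.
    match word.char_indices().rev().nth(n - 1) {
        Some((start, _)) => &word[start..],
        // The word has fewer than `n` characters: return all of it.
        None => word,
    }
}

fn main() {
    let word = "тэст";
    println!("bytes: {}, chars: {}", word.len(), word.chars().count());
    println!("last 2 chars: {}", last_chars(word, 2));
    // Byte offset 7 falls inside the final letter, so `get` returns None
    // instead of producing an invalid slice.
    println!("byte slice 7..: {:?}", word.get(7..));
}
```

Code that assumes one byte per character works for English and fails only once Cyrillic input arrives, which is why suffix and condition matching are worth testing with multibyte words.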

ssvb force-pushed the belarusian_language_preview branch from d573e5d to 2d1c7b5 · November 13, 2025 23:57
