Initially, grammar_ru was envisioned as a set of tools to correct grammar and style errors in Russian texts.
Since the beginning, a lot has changed.
Grammar_ru is now a module that governs a representation of language-independent texts in tabular format.
- The texts are processed with tokenization and sentenization, placed in pandas DataFrames and stored in zip-files along with Table-Of-Contents, or toc-files that contains metadata about each dataframe.
- These zip-files we call corpora.
- Each word, sentence and paragraph receives its unique ID in the corpus
- Relations can be placed in corpus, establishing the relations between fragments of texts (e.g. that the chapters from translation and original texts are in fact the same chapter).
Grammar_rualso allows you to apply featurizers, such as pymorphy, snowball, slovnet, etc.Grammar_rucontains useful components to further convert such datasets in torch tensors (based on Training Grounds Framework)
Aside from grammar_ru, the repository contains a not-yet-working app_grammar_ru which is a docker app that actually
checks errors in Russian texts. This app is to utilize existing python solutions (pyenchant), as well as ML models,
trained in grammar_ru paradygm.
Finally, a creative articulator (ca) project is also temporarily hosted in this repo.