Persian text preprocessing package with multiple independent pipelines:
- Normalization (`normalize`)
- Spell correction (`spell`)
- Informal-to-formal conversion (`formal`)
- Stopword removal (`stopword`)
- Lemmatization (`lemma`)
- Stemming (`stem`)
This repository is designed so each processing task has its own pipeline and can be used through both Python API and CLI.
- Python: `>=3.8,<3.9`
- Core dependencies: `hazm==0.9.4`, `parsivar==0.2.3.1`, `nltk==3.9.1`
- Optional formalizer dependencies: `transformers==4.41.2`, `torch==2.2.2+cpu`
For reproducible installs, use the pinned file at constraints/py38-cpu.txt.
```bash
python -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
```

Install with the optional formalizer extra:

```bash
python -m pip install "preproc-pkg[formalizer]" \
  -c constraints/py38-cpu.txt \
  --extra-index-url https://download.pytorch.org/whl/cpu
```

Editable (development) install:

```bash
python -m pip install -e .
```

All pipelines take `str` input and return `str` output (except the normalizer when `return_report=True`).
```python
from preproc_pkg import (
    create_normalizer_pipeline,
    create_spell_pipeline,
    create_formal_pipeline,
    create_stopword_pipeline,
    create_lemma_pipeline,
    create_stem_pipeline,
)

text = "میخوام آدرس www.example.com رو بفرستم؛ شمارهاش 09123456789 ــ اوکیه؟"

norm = create_normalizer_pipeline(enable_metrics=True)
normalized, report = norm(text, return_report=True)

spell = create_spell_pipeline()
formal = create_formal_pipeline(model_name="PardisSzah/PersianTextFormalizer")
stopword = create_stopword_pipeline()
lemma = create_lemma_pipeline(prefer_past=False)
stem = create_stem_pipeline(prefer_past=False)
```

Entry point:
```bash
preproc-cli --help
preproc-cli --version
```

For all subcommands, input is resolved in this order: `--text`, then `--input-file`, then stdin.
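The resolution order above can be sketched as a small helper. This is an illustrative sketch only, not the package's actual CLI code; `resolve_input` is a hypothetical name:

```python
import sys


def resolve_input(text=None, input_file=None):
    """Sketch of the documented order: --text wins over --input-file,
    which wins over stdin. Hypothetical helper, not the real implementation."""
    if text is not None:
        return text
    if input_file is not None:
        with open(input_file, encoding="utf-8") as fh:
            return fh.read()
    return sys.stdin.read()
```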
```bash
preproc-cli normalize --text "سلام دنیا"
preproc-cli normalize --input-file input.txt
echo "سلام دنیا" | preproc-cli normalize
preproc-cli spell --text "این ی متن ناسحیح است"
preproc-cli spell --text "متن" --use-transformer --model-name "your-seq2seq-model"
preproc-cli formal --text "میخوام برم بیرون"
preproc-cli stopword --text "این یک متن نمونه است"
preproc-cli lemma --text "آنها کتابهایشان را آوردند و میخواندند."
preproc-cli stem --text "آنها کتابهایشان را آوردند و میخواندند."
```

create_normalizer_pipeline(...)
Main stages:
- Initial cleanup (HTML/URL/Email/Mention/Hashtag/Phone/Non-BMP)
- Pinglish conversion (optional)
- Parsivar (optional)
- Hazm (optional)
- Final punctuation and spacing cleanup
Key parameters: `enable_parsivar`, `enable_hazm`, `enable_pinglish`, `enable_metrics`, `collapse_keep_newlines`
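As a rough illustration of the initial-cleanup stage, the sketch below strips URLs, mentions, and phone numbers with invented patterns. The real normalizer covers more cases (HTML, email, hashtags, non-BMP characters) and is configurable via the parameters above:

```python
import re

# Illustrative patterns only -- not the package's actual regexes.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")
PHONE_RE = re.compile(r"\b09\d{9}\b")  # Iranian mobile numbers


def initial_cleanup(text: str) -> str:
    """Remove URLs, @mentions, and phone numbers, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, PHONE_RE):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```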
create_spell_pipeline(use_parsivar=True, use_transformer=False, **kwargs)
- If `use_transformer=True`, you must provide a valid Seq2Seq model via `model_name`.
create_formal_pipeline(use_rules=True, **kwargs)
- Includes rule-based + transformer steps
- Default model: `PardisSzah/PersianTextFormalizer`
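A toy sketch of what the rule-based step does: map known informal tokens to their formal forms. The rule table below is invented for illustration, and the transformer step (which loads the default model above) is not shown:

```python
# Invented rules for illustration -- not the package's actual rule table.
RULES = {
    "میخوام": "می‌خواهم",  # informal "I want" -> formal (with ZWNJ)
    "برم": "بروم",          # informal "I go" -> formal subjunctive
}


def apply_rules(text: str) -> str:
    """Replace each whitespace-separated token found in RULES."""
    return " ".join(RULES.get(tok, tok) for tok in text.split())
```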
create_stopword_pipeline(extra_stopwords=None)
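The `extra_stopwords` parameter presumably merges user-supplied words with a base list. A self-contained sketch of that behavior, where `BASE_STOPWORDS` is a tiny stand-in (the package's list likely comes from hazm):

```python
# Stand-in base list for illustration -- not the package's real stopword set.
BASE_STOPWORDS = {"این", "یک", "است", "را", "و"}


def make_stopword_remover(extra_stopwords=None):
    """Return a remover whose stopword set is the base list plus extras."""
    stopwords = BASE_STOPWORDS | set(extra_stopwords or ())

    def remove(text: str) -> str:
        return " ".join(tok for tok in text.split() if tok not in stopwords)

    return remove
```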
create_lemma_pipeline(use_hazm=True, use_parsivar=True, prefer_past=False)
create_stem_pipeline(use_hazm=True, use_parsivar=True, prefer_past=False)
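The `prefer_past` flag deserves a note: hazm-style verb lemmas come as past#present pairs (e.g. `آورد#آور`), so the flag plausibly selects which verb root is returned. A hedged sketch of that interpretation (`pick_verb_root` is hypothetical, not the package's code):

```python
def pick_verb_root(lemma: str, prefer_past: bool = False) -> str:
    """For a hazm-style "past#present" verb lemma, return one root;
    non-verb lemmas pass through unchanged. Illustrative only."""
    if "#" in lemma:
        past, present = lemma.split("#", 1)
        return past if prefer_past else present
    return lemma
```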
Main paths:

- `preproc_pkg/cli.py`
- `preproc_pkg/normalizer/`
- `preproc_pkg/spell/`
- `preproc_pkg/formal/`
- `preproc_pkg/stopword/`
- `preproc_pkg/lemma/`
- `preproc_pkg/stem/`
Runnable examples are available in usage_examples/:
- `quickstart_example.py`
- `normalization_example.py`
- `spell_example.py`
- `formal_t5_example.py`
- `stopword_example.py`
- `lemma_example.py`
- `stem_example.py`
- `cli_examples.sh`
- `cli_examples.ps1`
```bash
python -m pip install -e .
python -m pip install -r requirements.txt
```

Project file: `pyproject.toml`

CLI entrypoint: `preproc-cli = preproc_pkg.cli:main`
- Public API (`create_*_pipeline`) is preserved.
- Core pipeline logic has not been changed.
- Package folders were standardized to conventional names without the `_pkg` suffix.
- Legacy `_pkg` module paths have been removed.
See LICENSE.