preproc-pkg

Persian text preprocessing package with multiple independent pipelines:

- Normalization (normalize)
- Spell correction (spell)
- Informal-to-formal conversion (formal)
- Stopword removal (stopword)
- Lemmatization (lemma)
- Stemming (stem)

Each processing task has its own pipeline, and every pipeline can be used through both the Python API and the CLI.

Python Version and Dependencies

Python: >=3.8,<3.9
Core dependencies:
- hazm==0.9.4
- parsivar==0.2.3.1
- nltk==3.9.1
Optional formalizer dependencies:
- transformers==4.41.2
- torch==2.2.2+cpu

For reproducible installs, use the pinned file at constraints/py38-cpu.txt.
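
Because the supported interpreter range is narrow, it can help to check it before installing. The following is a minimal sketch; `is_supported` is a hypothetical helper and not part of the package, and the bounds are copied from the version range stated above.

```python
import sys

# Supported range as stated in this README: >=3.8,<3.9
MIN_VERSION = (3, 8)
MAX_VERSION_EXCLUSIVE = (3, 9)


def is_supported(version=None):
    """Return True if the (major, minor) pair falls inside the supported range.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    info = sys.version_info if version is None else version
    major_minor = tuple(info[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE
```

Calling `is_supported()` with no argument checks the interpreter that runs the script.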

Installation

1) Base install

python -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

2) Install with formalizer support (Transformer)

python -m pip install "preproc-pkg[formalizer]" \
  -c constraints/py38-cpu.txt \
  --extra-index-url https://download.pytorch.org/whl/cpu

3) Local project install

python -m pip install -e .

Python API

Every pipeline takes a str and returns a str. The one exception is the normalizer, which returns a (text, report) pair when called with return_report=True.

from preproc_pkg import (
    create_normalizer_pipeline,
    create_spell_pipeline,
    create_formal_pipeline,
    create_stopword_pipeline,
    create_lemma_pipeline,
    create_stem_pipeline,
)

text = "میخوام    آدرس www.example.com رو بفرستم؛ شماره‌اش 09123456789 ــ اوکیه؟"

norm = create_normalizer_pipeline(enable_metrics=True)
# With return_report=True the normalizer returns (text, report) instead of a plain str.
normalized, report = norm(text, return_report=True)

spell = create_spell_pipeline()
formal = create_formal_pipeline(model_name="PardisSzah/PersianTextFormalizer")
stopword = create_stopword_pipeline()
lemma = create_lemma_pipeline(prefer_past=False)
stem = create_stem_pipeline(prefer_past=False)
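
Since each pipeline maps str to str, several of them can be chained. The sketch below assumes nothing about the package internals: `chain` is a hypothetical helper, and the two stand-in functions replace real pipelines so the example runs on its own. In real use you would pass the callables returned by the create_*_pipeline functions, calling the normalizer without return_report=True so that it returns a plain str.

```python
from functools import reduce
from typing import Callable

Step = Callable[[str], str]


def chain(*steps: Step) -> Step:
    """Compose str -> str callables left to right (hypothetical helper)."""
    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


# Stand-ins for real pipelines, used only to keep the sketch self-contained.
def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


def strip_question_mark(text: str) -> str:
    return text.rstrip("؟?")


process = chain(collapse_spaces, strip_question_mark)
result = process("سلام    دنیا؟")
```

The order in which real pipelines should be chained is not specified in this README, so treat any particular ordering as your own choice.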

CLI

Entry point:

preproc-cli --help
preproc-cli --version

Text input precedence

For all subcommands, input is resolved in this order, and the first source that is present wins:

- --text
- --input-file
- stdin
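
The precedence rule above can be sketched with argparse as follows. This is an illustration under assumptions, not the package's actual code: `resolve_input` is a hypothetical function, and reading the input file as UTF-8 is assumed.

```python
import argparse
import io
import sys
from pathlib import Path


def resolve_input(args, stdin=None):
    """Return the input text, trying --text, then --input-file, then stdin."""
    if args.text is not None:
        return args.text
    if args.input_file is not None:
        # UTF-8 is an assumption; the README does not state the file encoding.
        return Path(args.input_file).read_text(encoding="utf-8")
    stream = sys.stdin if stdin is None else stdin
    return stream.read()


parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--text")
parser.add_argument("--input-file")

# --text is given, so the stdin content is ignored.
args = parser.parse_args(["--text", "سلام"])
from_flag = resolve_input(args, stdin=io.StringIO("ignored"))

# Neither flag is given, so the text comes from stdin.
args = parser.parse_args([])
from_stdin = resolve_input(args, stdin=io.StringIO("سلام دنیا"))
```

Passing the stream as a parameter keeps the function easy to test without touching the real stdin.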

Examples

preproc-cli normalize --text "سلام    دنیا"
preproc-cli normalize --input-file input.txt
echo "سلام دنیا" | preproc-cli normalize

preproc-cli spell --text "این ی متن ناسحیح است"
preproc-cli spell --text "متن" --use-transformer --model-name "your-seq2seq-model"

preproc-cli formal --text "میخوام برم بیرون"
preproc-cli stopword --text "این یک متن نمونه است"
preproc-cli lemma --text "آن‌ها کتاب‌هایشان را آوردند و می‌خواندند."
preproc-cli stem --text "آن‌ها کتاب‌هایشان را آوردند و می‌خواندند."

Pipeline Details

1) Normalizer

create_normalizer_pipeline(...)

Main stages:

Initial cleanup (HTML/URL/Email/Mention/Hashtag/Phone/Non-BMP)
Pinglish conversion (optional)
Parsivar (optional)
Hazm (optional)
Final punctuation and spacing cleanup
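
To make the first and last stages concrete, here is a rough sketch of a regex-based cleanup followed by whitespace collapsing. The patterns, the choice to delete matches rather than replace them with placeholders, and the `initial_cleanup` name are all assumptions for illustration; the package's real rules are not shown in this README.

```python
import re

# Illustrative patterns only. Email is listed before mentions so that
# "user@example.com" is removed as a whole.
PATTERNS = [
    re.compile(r"<[^>]+>"),                  # HTML tags
    re.compile(r"(?:https?://|www\.)\S+"),   # URLs
    re.compile(r"\S+@\S+\.\S+"),             # email addresses
    re.compile(r"[@#]\w+"),                  # mentions and hashtags
    re.compile(r"\b09\d{9}\b"),              # 11-digit mobile numbers (assumed format)
    re.compile(r"[\U00010000-\U0010FFFF]"),  # non-BMP characters such as emoji
]


def initial_cleanup(text: str) -> str:
    """Drop matches of every pattern, then collapse runs of whitespace."""
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = initial_cleanup("آدرس www.example.com و شماره 09123456789 😀 <b>سلام</b>")
```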

Key parameters:

enable_parsivar
enable_hazm
enable_pinglish
enable_metrics
collapse_keep_newlines

2) Spell

create_spell_pipeline(use_parsivar=True, use_transformer=False, **kwargs)

If use_transformer=True, you must also pass the name of a valid Seq2Seq model via model_name.
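
One way to catch a missing model name early in your own code is a small guard before building the pipeline. This is a sketch only: `check_spell_options` is a hypothetical helper, and how the package itself reacts to a missing model_name is not documented here.

```python
from typing import Optional


def check_spell_options(use_transformer: bool = False,
                        model_name: Optional[str] = None) -> None:
    """Raise early if the transformer step is requested without a model name."""
    if use_transformer and not model_name:
        raise ValueError("use_transformer=True requires model_name")
```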

3) Formal

create_formal_pipeline(use_rules=True, **kwargs)

Combines a rule-based step with a transformer step.
Default model: PardisSzah/PersianTextFormalizer

4) Stopword

create_stopword_pipeline(extra_stopwords=None)
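
The effect of extra_stopwords can be illustrated with a toy version of stopword removal. Everything here is an assumption made for the example: the five-word base list, the whitespace tokenization, and the `remove_stopwords` name. The package ships its own list and logic, which are not reproduced here.

```python
from typing import Iterable, Optional

# Tiny illustrative list, not the package's real stopword list.
BASE_STOPWORDS = {"این", "یک", "است", "را", "و"}


def remove_stopwords(text: str,
                     extra_stopwords: Optional[Iterable[str]] = None) -> str:
    """Drop tokens found in the base list or in the caller-supplied extras."""
    stopwords = BASE_STOPWORDS | set(extra_stopwords or ())
    # Whitespace splitting is a simplification of real tokenization.
    return " ".join(tok for tok in text.split() if tok not in stopwords)


default_result = remove_stopwords("این یک متن نمونه است")
extended_result = remove_stopwords("این یک متن نمونه است",
                                   extra_stopwords=["نمونه"])
```

With the toy list, the first call keeps two tokens and the second keeps one, because the extra word is merged into the set before filtering.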

5) Lemma

create_lemma_pipeline(use_hazm=True, use_parsivar=True, prefer_past=False)

6) Stem

create_stem_pipeline(use_hazm=True, use_parsivar=True, prefer_past=False)

Package Structure

Main paths:

preproc_pkg/cli.py
preproc_pkg/normalizer/
preproc_pkg/spell/
preproc_pkg/formal/
preproc_pkg/stopword/
preproc_pkg/lemma/
preproc_pkg/stem/

Usage Examples

Runnable examples are available in usage_examples/:

quickstart_example.py
normalization_example.py
spell_example.py
formal_t5_example.py
stopword_example.py
lemma_example.py
stem_example.py
cli_examples.sh
cli_examples.ps1

Development and Build

python -m pip install -e .
python -m pip install -r requirements.txt

Project file:

pyproject.toml

CLI entrypoint:

preproc-cli = preproc_pkg.cli:main
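
An entry point of this form makes the installed preproc-cli command call the main function in preproc_pkg/cli.py. The sketch below shows one plausible shape for such a function using argparse subcommands. It is a hypothetical layout, not the package's code: only the subcommand names and the --text and --input-file flags are taken from this README.

```python
import argparse
from typing import List, Optional

SUBCOMMANDS = ("normalize", "spell", "formal", "stopword", "lemma", "stem")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="preproc-cli")
    subparsers = parser.add_subparsers(dest="command")
    for name in SUBCOMMANDS:
        sub = subparsers.add_parser(name)
        sub.add_argument("--text")
        sub.add_argument("--input-file")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Parse arguments and return an exit code (0 on success, 2 on misuse)."""
    args = build_parser().parse_args(argv)
    if args.command is None:
        return 2
    # A real implementation would dispatch to the matching pipeline here.
    print(f"would run {args.command!r}")
    return 0
```

Accepting argv as a parameter lets the function be exercised in tests without spawning a process.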

Compatibility Notes

The public API (the create_*_pipeline functions) is preserved.
Core pipeline logic has not been changed.
Package folders were renamed to conventional names without the _pkg suffix.
Legacy _pkg module paths have been removed.

License

See LICENSE.
