Python Package: `impresso-pipelines`

Overview

This repository contains a Python package designed for modular and efficient text processing workflows. Currently, it includes the following subpackages:

Language Identification Pipeline: Identifies the language of input text and returns a probability score.
OCR QA Pipeline: Assesses the quality of OCR text by estimating the proportion of recognized vocabulary items (0–1), using efficient language-specific Bloom filters.
LDA Topic Modeling Pipeline: Performs soft clustering of input texts using LDA-based topic modeling.
News Agencies Pipeline: Extracts and ranks news agency entities from text, providing relevance scores and optional links to Wikidata.
Advertisement Classifier: Identifies advertisements in historical newspaper content using a fine-tuned XLM-RoBERTa model with rule-based features.
Lucene/Solr Normalization Pipeline: Replicates Solr's language-specific text normalization to show how input text is tokenized and indexed in Impresso.
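
To illustrate the idea behind the OCR QA score described above, here is a minimal sketch of a Bloom-filter vocabulary lookup. It is not the package's implementation: the `BloomFilter` class, the `vocabulary_score` function, the filter size, the hashing scheme, and the toy word list are all assumptions made for illustration.

```python
import hashlib
import re


class BloomFilter:
    """Tiny Bloom filter: k hash positions per item in an m-bit array."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May report a false positive, but never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def vocabulary_score(text: str, vocabulary: BloomFilter) -> float:
    """Return the proportion (0-1) of tokens found in the vocabulary filter."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


# Hypothetical toy vocabulary; a real filter would hold a full lexicon.
german = BloomFilter()
for word in ["die", "zeitung", "erscheint", "jeden", "tag"]:
    german.add(word)

clean_score = vocabulary_score("Die Zeitung erscheint jeden Tag", german)
noisy_score = vocabulary_score("Dle Zeitnng erscheint jcden Tag", german)
```

Because a Bloom filter never produces false negatives, the clean sentence scores 1.0. The OCR-damaged variant should score lower, although a Bloom filter can occasionally accept an unknown token as a false positive.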

Installation

Quick Install (with uv - recommended)

uv is an extremely fast Python package installer (10-100x faster than pip):

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the package with all dependencies
uv pip install "impresso-pipelines[all]"

Standard Install (with pip)

To install the full package with all submodules:

pip install "impresso-pipelines[all]"

The [all] extra installs all dependencies required for each component.

Install Individual Modules

To install individual modules without unnecessary dependencies, use:

pip install "impresso-pipelines[langident]"         # Language Identification
pip install "impresso-pipelines[ocrqa]"             # OCR QA
pip install "impresso-pipelines[ldatopics]"         # LDA Topics
pip install "impresso-pipelines[newsagencies]"      # News Agencies
pip install "impresso-pipelines[adclassifier]"      # Advertisement Classifier
pip install "impresso-pipelines[solrnormalization]" # Solr text normalization

Development Setup

For contributors, we support both uv (faster) and Poetry:

# Clone the repository
git clone https://github.com/impresso/impresso-pipelines.git
cd impresso-pipelines

# Option 1: Using uv (recommended - 3-6x faster)
uv sync --extra all --extra dev

# Option 2: Using Poetry
poetry install --all-extras --with dev

# Or use Make (auto-detects uv or Poetry)
make install-dev

See UV_MIGRATION.md for more details on using uv.

Usage

Each pipeline is instantiated from a corresponding class.

from impresso_pipelines.langident import LangIdentPipeline
from impresso_pipelines.ocrqa import OCRQAPipeline
from impresso_pipelines.ldatopics import LDATopicsPipeline
from impresso_pipelines.newsagencies import NewsAgenciesPipeline
from impresso_pipelines.adclassifier import AdClassifierPipeline
from impresso_pipelines.solrnormalization import SolrNormalizationPipeline
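
Since each pipeline can be installed separately, it can help to check which submodules are importable in the current environment before using them. The sketch below assumes that a submodule raises `ImportError` when its extra is not installed; the helper name `available_pipelines` is hypothetical and not part of the package.

```python
import importlib

# Submodule names taken from the import statements above.
PIPELINE_MODULES = [
    "langident",
    "ocrqa",
    "ldatopics",
    "newsagencies",
    "adclassifier",
    "solrnormalization",
]


def available_pipelines(package: str = "impresso_pipelines") -> dict[str, bool]:
    """Map each submodule name to True if it can be imported, else False."""
    status = {}
    for name in PIPELINE_MODULES:
        try:
            importlib.import_module(f"{package}.{name}")
            status[name] = True
        except ImportError:
            # Assumption: a missing extra surfaces as an ImportError.
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in available_pipelines().items():
        print(f"{name:<20} {'installed' if ok else 'missing'}")
```

Any submodule reported as missing can be added with the matching extra from the installation section.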

Pipeline Examples

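For usage examples, refer to the individual README files:

README_langident.md
README_ocrqa.md
README_ldatopics.md
README_newsagencies.md
README_adclassifier.md
README_solrnormalization.md

See also the interactive notebooks for further examples.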

Future Plans

Additional functionality will be added to extend use cases and support further processing tasks.

Local Development

For contributors and developers who want to test locally before pushing to GitHub:

Quick Start

# Clone and install
git clone https://github.com/impresso/impresso-pipelines.git
cd impresso-pipelines

# Option 1: Full development install (auto-detects uv or Poetry)
make install-dev

# Option 2: Pip editable mode (faster for testing changes)
make install-editable-dev

# Run tests
make test

# Run all QA checks (mimics CI)
make qa

Available Commands

make help             # Show all available commands
make install          # Install package with all extras
make install-dev      # Install with dev dependencies
make test             # Run tests (skipping JVM tests)
make test-all         # Run all tests including JVM tests
make test-ocrqa       # Run only OCRQA tests
make test-cov         # Run tests with coverage report
make lint             # Run linting checks
make format           # Format code with black
make type-check       # Run type checking
make qa               # Run all QA checks
make clean            # Remove build artifacts

For detailed development instructions, see CONTRIBUTING.md.

About Impresso

Impresso project

Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.

Copyright

License

This program is provided as open source under the GNU Affero General Public License v3 or later.
