A modular, high-performance toolkit for building document extraction pipelines. The library provides clear interfaces for every pipeline stage, plus orchestrators that wire the stages together with async I/O and CPU-bound parallelism.
This repo is intentionally implementation-light: you plug in your own components (readers, converters, extractors, exporters, evaluators) for each specific document type or data source.
Install from PyPI:
```bash
pip install document-extraction-tools
```

Or with uv:

```bash
uv add document-extraction-tools
```
Repository structure:

```
├── src
│   └── document_extraction_tools
│       ├── base                  # abstract base classes you implement
│       │   ├── converter         # conversion interface definitions
│       │   ├── evaluator         # evaluation interface definitions
│       │   ├── exporter          # export interface definitions
│       │   ├── extractor         # extraction interface definitions
│       │   ├── file_lister       # file discovery interface definitions
│       │   ├── reader            # document read interface definitions
│       │   └── test_data_loader  # evaluation dataset loader interfaces
│       ├── config                # Pydantic configs + YAML loader helpers
│       ├── runners               # orchestrators that run pipelines
│       │   ├── evaluation        # evaluation pipeline orchestration
│       │   └── extraction        # extraction pipeline orchestration
│       ├── types                 # shared models/types used across modules
│       └── py.typed
├── tests
├── pull_request_template.md
├── pyproject.toml
├── README.md
└── uv.lock
```

At a glance, the library provides:

- A consistent set of interfaces for the entire document-extraction lifecycle.
- A typed data model for documents, pages, and extraction results.
- Orchestrators that run extraction and evaluation pipelines concurrently and safely.
- A configuration system (Pydantic + YAML) for repeatable pipelines.
Core types:

- `PathIdentifier`: A uniform handle for file locations plus optional context.
- `DocumentBytes`: Raw bytes + MIME type + path identifier.
- `Document`: Parsed content (pages, text/image data, metadata).
- `ExtractionSchema`: Your Pydantic model (the target output).
- `EvaluationExample`: (path, ground truth) pair for evaluation runs.
- `EvaluationResult`: Name + result + description for evaluation metrics.
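As a small illustration of how these types relate, a `PathIdentifier` is constructed with a `path` (as in the evaluation usage example further down); the other constructor signatures live in the `types` module, so the comments below only sketch the progression:

```python
from document_extraction_tools.types import PathIdentifier

# A handle to one input document. The extraction pipeline turns this into
# DocumentBytes (raw bytes + MIME type + path identifier), then a Document
# (parsed pages and metadata), and finally an instance of your ExtractionSchema.
invoice_path = PathIdentifier(path="/data/invoices/inv-001.pdf")
```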
Pipeline components:

- **FileLister** (`BaseFileLister`)
  - Discovers input files and returns a list of `PathIdentifier` objects.
- **Reader** (`BaseReader`)
  - Reads raw bytes from the source and returns `DocumentBytes`.
- **Converter** (`BaseConverter`)
  - Converts raw bytes into a structured `Document` (pages, metadata, content type).
- **Extractor** (`BaseExtractor`)
  - Asynchronously extracts structured data into a Pydantic schema (`ExtractionSchema`).
- **ExtractionExporter** (`BaseExtractionExporter`)
  - Asynchronously persists extracted data to your desired destination (DB, files, API, etc.).
- **ExtractionOrchestrator**
  - Runs the pipeline with a thread pool for CPU-bound steps (read/convert) and async concurrency for I/O-bound steps (extract/export); see the data-flow sketch after this list.
- **TestDataLoader** (`BaseTestDataLoader`)
  - Loads evaluation examples (ground truth + file path) as `EvaluationExample`.
- **Evaluator** (`BaseEvaluator`)
  - Computes a metric by comparing `true` vs. `pred` schemas.
- **EvaluationExporter** (`BaseEvaluationExporter`)
  - Persists evaluation results.
- **EvaluationOrchestrator**
  - Runs extraction + evaluation across examples with the same concurrency model (thread pool + async I/O).
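Conceptually, the extraction stages chain together as follows. This loop is only an illustration of the data flow between the interfaces above; the real `ExtractionOrchestrator` runs read/convert in a thread pool and extract/export concurrently on the event loop:

```python
# Illustration only: sequential data flow through the extraction interfaces.
async def run_sequentially(lister, reader, converter, extractor, exporter, schema):
    for path in lister.list_files():                        # PathIdentifier
        document_bytes = reader.read(path)                  # DocumentBytes
        document = converter.convert(document_bytes)        # Document
        extracted = await extractor.extract(document, schema)  # your ExtractionSchema
        await exporter.export(document, extracted)
```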
Each component has a matching base config class (Pydantic model) that defines a default YAML filename and acts as the parent for your own config fields. You'll subclass these to add settings specific to your implementation.
Extraction config base classes:
- `BaseFileListerConfig`
- `BaseReaderConfig`
- `BaseConverterConfig`
- `BaseExtractorConfig`
- `BaseExtractionExporterConfig`
- `ExtractionOrchestratorConfig` (you can use as-is; no need to subclass)
Evaluation-specific config base classes:

- `BaseTestDataLoaderConfig`
- `BaseEvaluatorConfig`
- `BaseEvaluationExporterConfig`
- `EvaluationOrchestratorConfig` (you can use as-is; no need to subclass)
For a full worked example including evaluation, please see the document-extraction-examples repository. Below we outline the steps for a successful implementation.
Create a Pydantic model that represents the structured data you want out of each document.
Example implementation:
```python
from pydantic import BaseModel, Field


class InvoiceSchema(BaseModel):
    invoice_id: str = Field(..., description="Unique invoice identifier.")
    vendor: str = Field(..., description="Vendor or issuer name.")
    total: float = Field(..., description="Total invoice amount.")
```

Subclass the base interfaces and implement the required methods.
Example implementations:
```python
from document_extraction_tools.base import (
    BaseFileLister,
    BaseReader,
    BaseConverter,
    BaseExtractor,
    BaseExtractionExporter,
)
from document_extraction_tools.config import (
    BaseFileListerConfig,
    BaseReaderConfig,
    BaseConverterConfig,
    BaseExtractorConfig,
    BaseExtractionExporterConfig,
)
from document_extraction_tools.types import Document, DocumentBytes, PathIdentifier


class MyFileLister(BaseFileLister):
    def __init__(self, config: BaseFileListerConfig) -> None:
        super().__init__(config)

    def list_files(self) -> list[PathIdentifier]:
        # Discover and return file identifiers
        ...


class MyReader(BaseReader):
    def __init__(self, config: BaseReaderConfig) -> None:
        super().__init__(config)

    def read(self, path_identifier: PathIdentifier) -> DocumentBytes:
        # Read file bytes from disk, object storage, etc.
        ...


class MyConverter(BaseConverter):
    def __init__(self, config: BaseConverterConfig) -> None:
        super().__init__(config)

    def convert(self, document_bytes: DocumentBytes) -> Document:
        # Parse PDF, OCR, etc. and return a Document
        ...


class MyExtractor(BaseExtractor):
    def __init__(self, config: BaseExtractorConfig) -> None:
        super().__init__(config)

    async def extract(self, document: Document, schema: type[InvoiceSchema]) -> InvoiceSchema:
        # Call LLM or rules-based system
        ...


class MyExtractionExporter(BaseExtractionExporter):
    def __init__(self, config: BaseExtractionExporterConfig) -> None:
        super().__init__(config)

    async def export(self, document: Document, data: InvoiceSchema) -> None:
        # Persist data to DB, filesystem, etc.
        ...
```

Each component has a base config class with a default filename (e.g. `extractor.yaml`). Subclass the config models to add your own fields, then provide YAML files in the directory you pass as `config_dir` to `load_config` (the default is `config/yaml/`).
Default filenames:
- `extraction_orchestrator.yaml`
- `file_lister.yaml`
- `reader.yaml`
- `converter.yaml`
- `extractor.yaml`
- `extraction_exporter.yaml`
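With the default filenames and the default `config_dir`, a config directory might look like this (which files you actually need depends on the components you configure):

```
config/yaml/
├── extraction_orchestrator.yaml
├── file_lister.yaml
├── reader.yaml
├── converter.yaml
├── extractor.yaml
└── extraction_exporter.yaml
```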
Example config model:
```python
from document_extraction_tools.config import BaseExtractorConfig


class MyExtractorConfig(BaseExtractorConfig):
    model_name: str
```

Example YAML (`config/yaml/extractor.yaml`):

```yaml
# add fields your Extractor config defines
model_name: "gemini-3-flash-preview"
```

Example usage:
```python
import asyncio
from pathlib import Path

from document_extraction_tools.config import ExtractionOrchestratorConfig, load_config
from document_extraction_tools.runners import ExtractionOrchestrator

config = load_config(
    lister_config_cls=MyFileListerConfig,
    reader_config_cls=MyReaderConfig,
    converter_config_cls=MyConverterConfig,
    extractor_config_cls=MyExtractorConfig,
    exporter_config_cls=MyExtractionExporterConfig,
    orchestrator_config_cls=ExtractionOrchestratorConfig,
    config_dir=Path("config/yaml"),
)

orchestrator = ExtractionOrchestrator.from_config(
    config=config,
    schema=InvoiceSchema,
    reader_cls=MyReader,
    converter_cls=MyConverter,
    extractor_cls=MyExtractor,
    exporter_cls=MyExtractionExporter,
)

file_lister = MyFileLister(config.file_lister)
file_paths = file_lister.list_files()

asyncio.run(orchestrator.run(file_paths))
```

The evaluation pipeline reuses your reader/converter/extractor and adds three pieces:
- TestDataLoader: loads evaluation examples (file + ground truth)
- Evaluator(s): compute metrics for each example
- EvaluationExporter: persist results
Example implementations:
```python
from document_extraction_tools.base import (
    BaseTestDataLoader,
    BaseEvaluator,
    BaseEvaluationExporter,
)
from document_extraction_tools.config import (
    BaseTestDataLoaderConfig,
    BaseEvaluatorConfig,
    BaseEvaluationExporterConfig,
)
from document_extraction_tools.types import (
    Document,
    EvaluationExample,
    EvaluationResult,
    PathIdentifier,
)


class MyTestDataLoader(BaseTestDataLoader[InvoiceSchema]):
    def __init__(self, config: BaseTestDataLoaderConfig) -> None:
        super().__init__(config)

    def load_test_data(
        self, path_identifier: PathIdentifier
    ) -> list[EvaluationExample[InvoiceSchema]]:
        # Load ground-truth + path pairs from disk/DB/etc.
        ...


class MyEvaluator(BaseEvaluator[InvoiceSchema]):
    def __init__(self, config: BaseEvaluatorConfig) -> None:
        super().__init__(config)

    def evaluate(
        self, true: InvoiceSchema, pred: InvoiceSchema
    ) -> EvaluationResult:
        # Compare true vs pred and return a metric
        ...


class MyEvaluationExporter(BaseEvaluationExporter):
    def __init__(self, config: BaseEvaluationExporterConfig) -> None:
        super().__init__(config)

    async def export(
        self, results: list[tuple[Document, list[EvaluationResult]]]
    ) -> None:
        # Persist evaluation results
        ...
```

Implement your own config models by subclassing the base evaluation configs and adding any fields your components need.
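For example, a minimal sketch (the `data_dir` and `output_path` fields are hypothetical; `threshold` matches the `evaluator.yaml` example below):

```python
from document_extraction_tools.config import (
    BaseTestDataLoaderConfig,
    BaseEvaluatorConfig,
    BaseEvaluationExporterConfig,
)


class MyTestDataLoaderConfig(BaseTestDataLoaderConfig):
    data_dir: str  # hypothetical field: where the ground-truth examples live


class MyEvaluatorConfig(BaseEvaluatorConfig):
    threshold: float  # consumed by MyEvaluator; set in evaluator.yaml below


class MyEvaluationExporterConfig(BaseEvaluationExporterConfig):
    output_path: str  # hypothetical field: where evaluation results are written
```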
Default YAML filenames for evaluation:
- `evaluation_orchestrator.yaml`
- `test_data_loader.yaml`
- `evaluator.yaml` (one top-level key per evaluator config class name)
- `reader.yaml`
- `converter.yaml`
- `extractor.yaml`
- `evaluation_exporter.yaml`
Warning: The top-level key in the YAML MUST match the evaluator configuration class name, and the evaluator configuration class name MUST be the name of the evaluator class with the suffix Config. For example:
```python
class MyEvaluator(BaseEvaluator):
    ...


class MyEvaluatorConfig(BaseEvaluatorConfig):
    ...
```

Example YAML (`config/yaml/evaluator.yaml`):

```yaml
MyEvaluatorConfig:
  # add fields your Evaluator config defines
  threshold: 0.8
```

Example usage:
```python
import asyncio
from pathlib import Path

from document_extraction_tools.config import EvaluationOrchestratorConfig, load_evaluation_config
from document_extraction_tools.runners import EvaluationOrchestrator
from document_extraction_tools.types import PathIdentifier

config = load_evaluation_config(
    test_data_loader_config_cls=MyTestDataLoaderConfig,
    evaluator_config_classes=[MyEvaluatorConfig],
    reader_config_cls=MyReaderConfig,
    converter_config_cls=MyConverterConfig,
    extractor_config_cls=MyExtractorConfig,
    evaluation_exporter_config_cls=MyEvaluationExporterConfig,
    orchestrator_config_cls=EvaluationOrchestratorConfig,
    config_dir=Path("config/yaml"),
)

orchestrator = EvaluationOrchestrator.from_config(
    config=config,
    schema=InvoiceSchema,
    reader_cls=MyReader,
    converter_cls=MyConverter,
    extractor_cls=MyExtractor,
    test_data_loader_cls=MyTestDataLoader,
    evaluator_classes=[MyEvaluator],
    evaluation_exporter_cls=MyEvaluationExporter,
)

examples = MyTestDataLoader(config.test_data_loader).load_test_data(
    PathIdentifier(path="/path/to/eval-set")
)

asyncio.run(orchestrator.run(examples))
```

- Reader + Converter run in a thread pool (CPU-bound work).
- Extractor + Exporter run concurrently in the event loop (I/O-bound work).
- Tuning options live in `extraction_orchestrator.yaml` and `evaluation_orchestrator.yaml`:
  - `max_workers` (thread pool size)
  - `max_concurrency` (async I/O semaphore limit)
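For example, an `extraction_orchestrator.yaml` might look like this (the values below are illustrative, not library defaults):

```yaml
# Illustrative tuning values; adjust for your workload and rate limits.
max_workers: 8        # threads for CPU-bound read/convert steps
max_concurrency: 16   # concurrent async extract/export tasks
```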
Development:

```bash
# Install dependencies
uv sync

# Run pre-commit
uv run pre-commit run --all-files

# Run tests
uv run pytest

# Build and preview docs locally
uv run mkdocs serve
```
Release candidate (TestPyPI):

- Create a release branch and bump version:

  ```bash
  git checkout -b release/v0.2.0-rc1
  uv version --bump rc  # Or manually: uv version 0.2.0-rc1
  ```

- Commit and push the branch:

  ```bash
  VERSION=$(uv version --short)
  git add pyproject.toml
  git commit -m "Bump version to $VERSION"
  git push -u origin release/v$VERSION
  ```

- Create and merge a PR to main.

- Tag the merge commit and push:

  ```bash
  git checkout main && git pull
  VERSION=$(uv version --short)
  git tag "v$VERSION"
  git push --tags
  ```

- The `publish-test.yaml` workflow automatically publishes to TestPyPI.

- Verify installation:

  ```bash
  uv pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ document-extraction-tools
  ```
Full release (PyPI):

- Create a release branch and bump version:

  ```bash
  git checkout -b release/v0.2.0
  uv version --bump minor  # or: major, minor, patch
  ```

- Commit and push the branch:

  ```bash
  VERSION=$(uv version --short)
  git add pyproject.toml
  git commit -m "Bump version to $VERSION"
  git push -u origin release/v$VERSION
  ```

- Create and merge a PR to main.

- Tag the merge commit and create the release:

  ```bash
  git checkout main && git pull
  VERSION=$(uv version --short)
  git tag "v$VERSION"
  git push --tags
  gh release create "v$VERSION" --title "v$VERSION" --generate-notes
  ```

- The `publish.yaml` workflow automatically builds, publishes to PyPI, and runs smoke tests.
Contributions are welcome! Please see our Contributing Guide for details.
- Report bugs or feature requests by opening an issue.
- Create a new branch using the following naming conventions: `feat/short-description`, `fix/short-description`, etc.
- Describe the change clearly in the PR description.
- Add or update tests in `tests/`.
- Run linting and tests before pushing: `uv run pre-commit run --all-files` and `uv run pytest`.
- If you open a PR, please notify the maintainers (Ollie Kemp or Nikolas Moatsos).