Practical, end-to-end examples that implement the interfaces and orchestrators from the
document-extraction-tools package.
This repository is for data scientists/engineers who want to see real, working pipelines and use them as a starting point for their own document extraction systems.
- A set of concrete implementations for the interfaces defined in document-extraction-tools.
- A reference for how to wire components with the provided orchestrators.
- A runnable baseline for extraction + evaluation workflows.
The document-extraction-tools library is intentionally implementation-light. It
defines the interfaces and orchestration logic, and you implement the pieces.
This repo provides those pieces:
- Interfaces implemented here:
  `BaseFileLister`, `BaseReader`, `BaseConverter`, `BaseExtractor`, `BaseExtractionExporter`, `BaseTestDataLoader`, `BaseEvaluator`, `BaseEvaluationExporter`
- Orchestrators used from the library:
  `ExtractionOrchestrator`, `EvaluationOrchestrator`
- Config system used from the library:
  `load_extraction_config` / `load_evaluation_config`, plus the base config classes (subclassed in this repo)
The orchestrators handle concurrency (thread pool for CPU-bound steps and async concurrency for I/O-bound steps); this repo focuses on the actual logic for each pipeline stage.
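To make the division of labor concrete, here is a small stand-alone sketch of that concurrency pattern (the function names are illustrative, not the library's API): CPU-bound work runs in a thread pool, while I/O-bound calls use plain async concurrency.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def render_pages(doc: str) -> str:
    # Stand-in for a CPU-bound step such as PDF-to-image conversion.
    return f"{doc}:rendered"

async def call_model(doc: str) -> str:
    # Stand-in for an I/O-bound step such as an LLM API call.
    await asyncio.sleep(0)
    return f"{doc}:extracted"

async def run_pipeline(docs: list[str]) -> list[str]:
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as pool:
        # CPU-bound steps are dispatched to the thread pool...
        rendered = await asyncio.gather(
            *(loop.run_in_executor(pool, render_pages, d) for d in docs)
        )
    # ...while I/O-bound steps overlap via async concurrency.
    return await asyncio.gather(*(call_model(r) for r in rendered))

results = asyncio.run(run_pipeline(["a.pdf", "b.pdf"]))
```

The actual orchestrators encapsulate this wiring, which is why the example components only need to implement the per-stage logic.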
```text
.
├── src
│   └── document_extraction_examples
│       └── simple_lease_extraction
│           ├── components            # Implementations of the base interfaces
│           ├── config                # Pydantic config classes + YAML
│           ├── data                  # Example inputs/outputs/eval data
│           ├── prompts               # Prompt references (see MLflow prompt usage)
│           ├── schemas               # Extraction schema (Pydantic)
│           ├── utils                 # MLflow + LLM-as-a-judge utilities
│           ├── extraction_main.py    # Extraction entrypoint
│           └── evaluation_main.py    # Evaluation entrypoint
├── tests
├── Makefile
├── Dockerfile
├── docker-compose.yaml               # MLflow server
├── docs                              # Documentation assets (e.g., MLflow UI screenshot)
├── pull_request_template.md
├── pyproject.toml
├── uv.lock
└── README.md
```

Location: `src/document_extraction_examples/simple_lease_extraction`
This example extracts structured lease details from PDFs using Gemini with image inputs, and evaluates results against a small labeled dataset.
The example is instrumented with MLflow: traces and evaluation metrics are logged so you can inspect runs and results in the MLflow UI.
Schema
`SimpleLeaseDetails` is a Pydantic model that defines the target output fields:
- landlord/tenant names and addresses
- property address + postcode
- lease start/end dates
- rent and deposit amounts
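A minimal sketch of what such a model might look like (the field names and types here are assumptions; the authoritative definition lives in `schemas/`):

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel

class SimpleLeaseDetails(BaseModel):
    # Illustrative field set; every field is optional so the extractor
    # can return partial results for incomplete documents.
    landlord_name: Optional[str] = None
    landlord_address: Optional[str] = None
    tenant_name: Optional[str] = None
    tenant_address: Optional[str] = None
    property_address: Optional[str] = None
    property_postcode: Optional[str] = None
    lease_start_date: Optional[date] = None
    lease_end_date: Optional[date] = None
    rent_amount: Optional[float] = None
    deposit_amount: Optional[float] = None

# Pydantic coerces compatible input types (e.g. a numeric string to float).
details = SimpleLeaseDetails(tenant_name="A. Tenant", rent_amount="1250.00")
```

Using a Pydantic model means the extractor's structured response can be validated against the schema, and missing fields surface as `None` rather than silent key errors.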
Extraction pipeline components
- File lister: `LocalFileLister` lists PDFs under `data/input`.
- Reader: `LocalFileReader` reads PDF bytes from disk.
- Converter: `PDFToImageConverter` uses `pdf2image` to convert each PDF into image pages.
- Extractor: `GeminiImageExtractor` calls the Gemini API with the prompt + image pages and parses a structured response against the schema.
- Exporter: `LocalFileExtractionExporter` writes JSON results to `data/output`.
Evaluation pipeline components
- Test data loader: `LocalJSONTestDataLoader` loads labeled examples from `data/evaluation/test_data.json`.
- Evaluators: `AccuracyEvaluator` and `F1Evaluator` compute field-level accuracy and F1; both can optionally use an LLM-as-a-judge for fuzzy equality.
- Exporter: `LocalFileEvaluationExporter` writes per-metric JSON files and logs average metrics to MLflow.
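As a rough illustration of what "field-level accuracy" means here (this helper is hypothetical, not the evaluator's actual implementation, and it uses exact equality rather than the optional LLM-as-a-judge comparison):

```python
def field_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of labeled fields whose predicted value matches exactly."""
    if not expected:
        return 0.0
    hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return hits / len(expected)

score = field_accuracy(
    {"tenant_name": "A. Tenant", "rent_amount": 1250.0},
    {"tenant_name": "A. Tenant", "rent_amount": 1300.0},
)
# One of the two labeled fields matches, so the score is 0.5.
```

The LLM-as-a-judge option exists for cases where exact equality is too strict, such as addresses written with different formatting.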
Config files
Configuration lives under:
`src/document_extraction_examples/simple_lease_extraction/config/yaml`

Each file maps to a component's config model in `config/` and is loaded by
`load_extraction_config` / `load_evaluation_config`.
This project uses uv and a pinned Python version (see `pyproject.toml`).
First-time setup checklist:
- Ensure Homebrew is installed (macOS):

  ```shell
  /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  ```

- Ensure `make` is installed:
  - macOS: `brew install make`
- To run the MLflow server, install Docker:
  - macOS: `brew install docker docker-buildx colima`
  - Link the Docker BuildX plugin:

    ```shell
    mkdir -p ~/.docker/cli-plugins
    ln -sfn $(brew --prefix)/opt/docker-buildx/bin/docker-buildx ~/.docker/cli-plugins/docker-buildx
    ```

Then install dependencies:

```shell
make install
```

Notes:
- `pdf2image` typically requires Poppler to be installed on your system.
  - macOS: `brew install poppler`
Create a `.env` file (or export variables) for API keys and MLflow:

```shell
cp .env.example .env
```

Required:
- `GEMINI_API_KEY` for the extractor and optional LLM-as-a-judge.

For the MLflow server via Docker Compose:
- `PG_USER`
- `PG_PASSWORD`
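A minimal `.env` sketch (all values below are placeholders; see `.env.example` in the repo for the authoritative variable list):

```shell
GEMINI_API_KEY=your-gemini-api-key
PG_USER=mlflow
PG_PASSWORD=change-me
```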
The default example configuration is under:
`src/document_extraction_examples/simple_lease_extraction/config/yaml`

Key settings to pay attention to:
- `extractor.yaml`: `mlflow_prompt_name` + `mlflow_prompt_version` must exist in your MLflow prompt registry.
- `file_lister.yaml`: input directory and file extensions.
- `test_data_loader.yaml`: evaluation dataset location.
- `extraction_exporter.yaml` / `evaluation_exporter.yaml`: output directories for results.
- `evaluator.yaml`: enables LLM-as-a-judge comparisons if desired.
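For illustration, a `file_lister.yaml` might look like this (the key names are assumptions; the authoritative fields are defined by the corresponding config model in `config/`):

```yaml
# file_lister.yaml (illustrative sketch, not the shipped file)
input_dir: src/document_extraction_examples/simple_lease_extraction/data/input
file_extensions:
  - .pdf
```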
The extractor retrieves a prompt stored in the MLflow prompt registry. Configure the
prompt name and version in `config/yaml/extractor.yaml`.
Example:

```yaml
mlflow_prompt_name: "lease_extraction_prompt"
mlflow_prompt_version: 1
```

Create the prompt in MLflow before running the pipeline. The prompt text in
`src/document_extraction_examples/simple_lease_extraction/prompts/system_prompt.md`
can serve as a starting point. If you change the prompt in MLflow, bump the version and
update `mlflow_prompt_version` accordingly.
How to add your prompt in MLflow:
(See the MLflow documentation for more information about MLflow setup.)
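One way to do this from Python is sketched below. This is a hedged example: `prompt_uri` and `register_lease_prompt` are illustrative helpers, and the registration API's namespace varies by MLflow version (`mlflow.register_prompt` in recent 2.x releases, `mlflow.genai.register_prompt` in 3.x), so check your installed version's docs.

```python
from pathlib import Path

PROMPT_FILE = (
    "src/document_extraction_examples/simple_lease_extraction/"
    "prompts/system_prompt.md"
)

def prompt_uri(name: str, version: int) -> str:
    # MLflow prompt URIs use the "prompts:/<name>/<version>" scheme.
    return f"prompts:/{name}/{version}"

def register_lease_prompt(name: str = "lease_extraction_prompt") -> None:
    # Run this once, with MLFLOW_TRACKING_URI pointing at your server,
    # before starting the extraction pipeline.
    import mlflow  # imported here so the rest of the sketch stands alone

    template = Path(PROMPT_FILE).read_text()
    # On MLflow 3.x this call may live under mlflow.genai instead.
    mlflow.register_prompt(name=name, template=template)
```

Each registration creates a new version, which is why `mlflow_prompt_version` must be bumped whenever the prompt changes.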
Run extraction:

```shell
make run
```

Outputs are written to:
`src/document_extraction_examples/simple_lease_extraction/data/output`

Run evaluation:

```shell
make evaluate
```

Outputs are written to:
`src/document_extraction_examples/simple_lease_extraction/data/evaluation`
This example is instrumented with MLflow tracing. You can run a local MLflow server via Docker Compose:
```shell
make start-mlflow
```

The extraction/evaluation entrypoints default to:
- Tracking URI: `http://localhost:8080`
- Experiments: `simple_lease_extraction` / `simple_lease_evaluation`
If you want to create your own pipeline:
- Define a schema: add a Pydantic model under `schemas/`.
- Implement components: subclass the base interfaces from document-extraction-tools.
- Add config models + YAML: create config classes in `config/` and YAML in `config/yaml/`.
- Wire an entrypoint: follow the pattern in `extraction_main.py` / `evaluation_main.py`.
- Update the Makefile (optional): point `APP_ENTRYPOINT` / `EVAL_ENTRYPOINT` to your new module.
You can also extend the example:
- Swap the extractor to another model/provider.
- Add post-processing or validation inside the extractor or exporter.
- Add more evaluators or a different evaluation dataset.
```shell
make lint
make test
```

These run pre-commit and pytest using the locked uv environment.

