Epstein Pipeline

A document processing pipeline for the Epstein case files. It downloads, OCRs, extracts entities from, deduplicates, embeds, and exports 140,000+ documents to Neon Postgres with pgvector semantic search.

This is the data processing engine behind epsteinexposed.com — the most comprehensive searchable database of the Epstein files.

Architecture

Raw DOJ PDFs
    │
    ▼
┌──────────────────────────────────────────────────────────┐
│  OCR (multi-backend)                                     │
│  PyMuPDF → Surya → olmOCR 2 → Docling                   │
│  Per-page confidence scoring, fallback chain             │
└──────────────────────┬───────────────────────────────────┘
                       │
    ┌──────────────────┼──────────────────┐
    ▼                  ▼                  ▼
┌────────────┐  ┌────────────┐  ┌──────────────────┐
│ NER        │  │ Dedup      │  │ Classifier       │
│ spaCy trf  │  │ Hash →     │  │ Zero-shot BART   │
│ + GLiNER   │  │ MinHash →  │  │ 12 doc categories│
│ + regex    │  │ Semantic   │  │                  │
└─────┬──────┘  └─────┬──────┘  └────────┬─────────┘
      │               │                  │
      └───────────────┼──────────────────┘
                      ▼
┌──────────────────────────────────────────────────────────┐
│  Semantic Chunker → Embeddings (nomic-embed-text-v2-moe) │
│  Paragraph-aware splitting, 768-dim / 256-dim Matryoshka │
└──────────────────────┬───────────────────────────────────┘
                       │
    ┌──────────────────┼──────────────────┐
    ▼                  ▼                  ▼
┌────────────┐  ┌────────────┐  ┌──────────────────┐
│ Neon PG    │  │ JSON/CSV   │  │ Knowledge Graph  │
│ + pgvector │  │ SQLite     │  │ GEXF + JSON      │
│ cosine ANN │  │ NDJSON     │  │ LLM extraction   │
└────────────┘  └────────────┘  └──────────────────┘
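
The OCR stage in the diagram describes a fallback chain with per-page confidence scoring. The sketch below shows one way such a chain could work: try each backend in order and stop at the first page result that clears a threshold. The `PageResult` type, the `ocr_page` function, the `(text, confidence)` backend signature, and the `0.85` threshold are all assumptions made for illustration, not the pipeline's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class PageResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the backend
    backend: str


# Hypothetical backend shape: (name, function taking page bytes and
# returning (text, confidence)). Stand-ins for the real OCR engines.
Backend = Tuple[str, Callable[[bytes], Tuple[str, float]]]


def ocr_page(
    page: bytes,
    backends: Sequence[Backend],
    min_confidence: float = 0.85,  # assumed threshold, not from the source
) -> PageResult:
    """Try backends in order; return the first result at or above the threshold.

    If no backend clears the threshold, return the highest-confidence attempt.
    """
    if not backends:
        raise ValueError("at least one OCR backend is required")
    best: Optional[PageResult] = None
    for name, run in backends:
        text, confidence = run(page)
        result = PageResult(text=text, confidence=confidence, backend=name)
        if confidence >= min_confidence:
            return result
        if best is None or confidence > best.confidence:
            best = result
    assert best is not None  # guaranteed because backends is non-empty
    return best
```

Ordering the backends from cheapest to most expensive means a page with a usable text layer never reaches the slower models, which is presumably the reason for the PyMuPDF-first order shown above.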

Quickstart

# Install with all features
pip install "epstein-pipeline[all]"
python -m spacy download en_core_web_sm

# Download a dataset
epstein-pipeline download kaggle

# OCR with automatic backend selection
epstein-pipeline ocr ./raw-pdfs/ --output ./processed/

# Extract entities (spaCy + GLiNER)
epstein-pipeline extract-entities ./processed/ --output ./entities/

# Generate embeddings and push to Neon
epstein-pipeline embed ./processed/ --output ./embeddings/ --format neon

# Export everything to Neon Postgres
epstein-pipeline export neon --input-dir ./processed/

Neon Postgres Setup

# Set your Neon connection string
export EPSTEIN_NEON_DATABASE_URL="postgresql://user:pass@ep-xxx.us-east-2.aws.neon.tech/epstein"

# Run schema migration (idempotent, safe to re-run)
epstein-pipeline migrate

# Semantic search from the command line
epstein-pipeline search "financial transactions offshore accounts"
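
For readers who want to query the database directly, here is a rough sketch of what a cosine-similarity search against pgvector can look like. It only builds the SQL and parameters; it does not connect to a database or compute the query embedding. The table name `chunks` and the columns `document_id`, `content`, and `embedding` are hypothetical, since the real schema is whatever `epstein-pipeline migrate` creates. The `<=>` operator is pgvector's cosine distance.

```python
from typing import List, Sequence, Tuple

# Hypothetical table and column names; check the migrated schema for the real ones.
SEARCH_SQL = """
SELECT document_id, content, 1 - (embedding <=> %s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search(embedding: Sequence[float], limit: int = 10) -> Tuple[str, List[object]]:
    """Return (sql, params) suitable for a DB-API style cursor.execute call."""
    if limit < 1:
        raise ValueError("limit must be positive")
    literal = to_vector_literal(embedding)
    # The literal appears twice: once for the similarity column, once for ORDER BY.
    return SEARCH_SQL, [literal, literal, limit]
```

With a driver such as psycopg, the result can be passed along as `cursor.execute(sql, params)`. The query vector must have the same dimension as the stored embeddings.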

Processing Backends

| Component | Backend | Speed | Accuracy | GPU Required |
|---|---|---|---|---|
| OCR | PyMuPDF | Instant | Text layers only | No |
| OCR | Surya | Fast | High (90+ langs) | Optional |
| OCR | olmOCR 2 | Slow | Highest (VLM) | Yes (8GB+) |
| OCR | Docling (IBM) | Medium | High | No |
| NER | spaCy `en_core_web_trf` | Fast | High | Optional |
| NER | GLiNER | Medium | High (zero-shot) | Optional |
| Dedup | Content hash + fuzzy | Instant | Exact only | No |
| Dedup | MinHash/LSH | O(n) | Near-duplicate | No |
| Dedup | Semantic embeddings | Slow | OCR-variant | Optional |
| Embeddings | nomic-embed-text-v2-moe | Fast | SOTA | Optional |
| Classifier | BART-large-mnli | Medium | Good | Optional |
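
To give a feel for the MinHash dedup pass listed above, the following is a minimal pure-Python sketch of MinHash signatures over character shingles. It is illustrative only: the function names, the shingle size of 5, and the 64 hash functions are assumptions, and it leaves out the LSH banding step that avoids comparing every pair of documents.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Lowercased, whitespace-normalised character k-grams."""
    normalised = " ".join(text.lower().split())
    if len(normalised) <= k:
        return {normalised}
    return {normalised[i:i + k] for i in range(len(normalised) - k + 1)}


def _seeded_hash(item: str, seed: int) -> int:
    """64-bit hash of item; the seed selects one member of the hash family."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    if len(a) != len(b) or not a:
        raise ValueError("signatures must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two documents whose estimated similarity exceeds some chosen cutoff would then be treated as near-duplicate candidates. A production setup would more likely rely on an established MinHash/LSH library than on hand-rolled hashing.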

Installation

# Core only (no ML models)
pip install epstein-pipeline

# With OCR (CPU — Surya)
pip install "epstein-pipeline[ocr-surya]"

# With OCR (GPU — olmOCR 2, requires CUDA)
pip install "epstein-pipeline[ocr-gpu]"

# With NLP (spaCy + GLiNER)
pip install "epstein-pipeline[nlp,nlp-gliner]"

# With embeddings (sentence-transformers + torch)
pip install "epstein-pipeline[embeddings]"

# With Neon Postgres export (psycopg + pgvector)
pip install "epstein-pipeline[neon]"

# Everything (except GPU-only olmOCR)
pip install "epstein-pipeline[all]"

Docker

docker compose run pipeline --help
docker compose run pipeline ocr ./raw-pdfs/ --output ./output/
docker compose run pipeline migrate

CLI Commands

# ── Data Ingestion ──────────────────────────────────────────────
epstein-pipeline download doj --dataset 9       # Download DOJ dataset
epstein-pipeline download kaggle                # Download Kaggle dataset
epstein-pipeline download huggingface           # Download HuggingFace datasets

# ── Processing ──────────────────────────────────────────────────
epstein-pipeline ocr ./pdfs/ -o ./out/          # OCR (auto backend)
epstein-pipeline ocr ./pdfs/ --backend surya    # OCR with specific backend
epstein-pipeline extract-entities ./out/ -o ./e/ # NER extraction
epstein-pipeline classify --input-dir ./out/    # Document classification
epstein-pipeline dedup ./out/ --mode all        # 3-pass deduplication
epstein-pipeline embed ./out/ -o ./emb/         # Generate embeddings

# ── Export ──────────────────────────────────────────────────────
epstein-pipeline export json ./out/ -o ./site/  # JSON for website
epstein-pipeline export csv ./out/ -o docs.csv  # CSV for research
epstein-pipeline export sqlite ./out/ -o ep.db  # SQLite database
epstein-pipeline export neon --input-dir ./out/ # Push to Neon Postgres

# ── Database ────────────────────────────────────────────────────
epstein-pipeline migrate                        # Run Neon schema migration
epstein-pipeline search "query text here"       # Semantic search (pgvector)

# ── Utilities ───────────────────────────────────────────────────
epstein-pipeline validate ./out/                # Data quality checks
epstein-pipeline stats ./out/                   # Show statistics

Key Features

Multi-backend OCR with automatic fallback chain and per-page confidence scoring
Three-pass deduplication: exact hash → MinHash/LSH (O(n)) → semantic similarity
GLiNER zero-shot NER for custom legal entity types (case numbers, flight IDs, financial amounts)
Semantic chunking that respects paragraph and sentence boundaries
pgvector embeddings with cosine similarity search via Neon Postgres
Document classification using zero-shot BART into 12 legal categories
Knowledge graph with co-occurrence edges and opt-in LLM relationship extraction
Idempotent Neon schema migration with pgvector, pg_trgm, and IVFFlat indexes
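
As an illustration of the chunking feature in the list above, here is a small sketch of splitting that respects paragraph and sentence boundaries. It packs whole paragraphs into chunks up to a size limit and falls back to sentence boundaries only for paragraphs that are too long. The `chunk_text` name, the character-based limit, and the naive sentence regex are assumptions for this example, not the pipeline's implementation.

```python
import re
from typing import List, Tuple


def chunk_text(text: str, max_chars: int = 1000) -> List[str]:
    """Split text into chunks of at most max_chars without cutting sentences.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be positive")

    # Each unit is (joiner placed before it, text). Paragraphs are joined by a
    # blank line, sentences from the same paragraph by a single space.
    units: List[Tuple[str, str]] = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            units.append(("\n\n", paragraph))
            continue
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        for i, sentence in enumerate(sentences):
            units.append(("\n\n" if i == 0 else " ", sentence))

    chunks: List[str] = []
    current = ""
    for joiner, unit in units:
        if not current:
            current = unit
        elif len(current) + len(joiner) + len(unit) <= max_chars:
            current += joiner + unit
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks
```

Measuring the limit in tokens rather than characters would match an embedding model's context window more closely; characters are used here only to keep the example dependency-free.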

Data Sources

See DATA_SOURCES.md for all known public data sources.

Contributing

We welcome contributions! See CONTRIBUTING.md for details.

No coding required: Report data quality issues, suggest sources, review processed data.

Code contributions: Add downloaders, improve extraction, add export formats, fix bugs.

Related Projects

epsteinexposed.com — The live website powered by this pipeline
Epstein-Files — DOJ file mirrors
Epstein-doc-explorer — Email graph explorer
Epstein-research-data — Community research dataset

License

MIT License. See LICENSE.
