Thank you for your interest in contributing! This project helps make the Epstein case files accessible and searchable for public interest research.
- Report data quality issues — Found a wrong person match or duplicate? Open an issue.
- Suggest new data sources — Know of a public dataset we're missing? Let us know.
- Review processed data — Check known-duplicates.json and verify dedup decisions.
- Improve documentation — Fix typos, clarify instructions, add examples.
- Add a new downloader — Support a new data source in src/epstein_pipeline/downloaders/.
- Improve entity extraction — Better person matching, new entity types.
- Add export formats — New output formats for researchers.
- Fix bugs — Check the issue tracker for open bugs.
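For the downloader item above, a new module might look roughly like the sketch below. The class name `ExampleDownloader`, its `download` method, and the injected `fetch` callable are all hypothetical and chosen for illustration; the existing modules in src/epstein_pipeline/downloaders/ define the real interface and should be followed instead.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def _fetch_bytes(url: str) -> bytes:
    """Default fetcher: a plain HTTP GET using only the standard library."""
    with urlopen(url) as response:
        return response.read()


@dataclass
class ExampleDownloader:
    """Hypothetical downloader; the project's real base class may differ."""

    source_url: str
    output_dir: Path
    # Injected so tests can replace the network call with a stub.
    fetch: Callable[[str], bytes] = _fetch_bytes

    def download(self, filename: str) -> Path:
        """Fetch `source_url` and write the payload to `output_dir/filename`."""
        self.output_dir.mkdir(parents=True, exist_ok=True)
        target = self.output_dir / filename
        target.write_bytes(self.fetch(self.source_url))
        return target
```

Passing the fetcher in as a parameter keeps the accompanying test offline: a stub that returns fixed bytes is enough to exercise the file-writing path.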
- Python 3.10+
- Git
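If it helps, the two prerequisites above can be verified with a few lines of Python before setting anything up. This is a minimal sketch, not part of the project; the function name `check_prerequisites` and its injectable parameters are assumptions made so the check is easy to test.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version from the prerequisites list


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        found = ".".join(str(part) for part in version_info[:2])
        problems.append(f"Python 3.10+ required, found {found}")
    if which("git") is None:
        problems.append("git not found on PATH")
    return problems
```

Calling `check_prerequisites()` with no arguments inspects the running interpreter and the current PATH.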
# Clone the repo
git clone https://github.com/stonesalltheway1/Epstein-Pipeline.git
cd Epstein-Pipeline
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# .venv\Scripts\activate # Windows
# Install in development mode
pip install -e ".[dev]"
# Run tests to verify setup
pytest -v
# Install OCR dependencies
pip install -e ".[ocr]"
# Install NLP dependencies
pip install -e ".[nlp]"
python -m spacy download en_core_web_sm
- Fork the repo and create a branch:
git checkout -b my-feature
- Make changes and add tests
- Run checks:
ruff check src/ tests/   # Lint
ruff format src/ tests/  # Format
pytest -v                # Test
- Commit with a clear message
- Push and open a Pull Request
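The three check commands from the workflow above can be chained so that the first failure stops the run. The sketch below is one possible wrapper, not a script shipped with the project; `run_checks` and its `runner` parameter are hypothetical names.

```python
import subprocess

# The commands listed under "Run checks", in the same order.
CHECKS = [
    ["ruff", "check", "src/", "tests/"],   # Lint
    ["ruff", "format", "src/", "tests/"],  # Format
    ["pytest", "-v"],                      # Test
]


def run_checks(commands=CHECKS, runner=subprocess.run):
    """Run each command in order; return the first failing command, or None."""
    for command in commands:
        result = runner(command)
        if result.returncode != 0:
            return command
    return None
```

A return value of `None` means every command exited with status 0, which is a reasonable point at which to commit.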
- We use Ruff for linting and formatting
- Line length: 100 characters
- Type hints are encouraged but not required
- Docstrings for public functions
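As an illustration of these guidelines, a public function in this style might look like the following. The helper `normalize_person_name` is hypothetical and exists only to show type hints, a docstring, and lines under 100 characters; it is not taken from the codebase.

```python
def normalize_person_name(raw: str) -> str:
    """Collapse repeated whitespace and apply title case to a person name.

    Hypothetical helper used only to illustrate the style guidelines; it is
    not part of the pipeline's API.
    """
    return " ".join(raw.split()).title()
```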
If you've run the pipeline on new documents and want to contribute the results:
- Export as JSON:
epstein-pipeline export json ./processed/ --output ./contribution/
- Validate:
epstein-pipeline validate ./contribution/
- Open a PR with the JSON files in a contributions/ directory
- Our CI will automatically validate the data
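Before opening the PR, a quick local check that every exported file at least parses as JSON can catch truncated or partially written output. The sketch below is only a sanity check under that assumption and does not replace `epstein-pipeline validate`, which presumably checks the project's actual schema; the function name `find_unparseable_json` is illustrative.

```python
import json
from pathlib import Path


def find_unparseable_json(directory: str) -> list[str]:
    """Return the names of *.json files in `directory` that fail to parse."""
    bad = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            bad.append(path.name)
    return bad
```

For example, `find_unparseable_json("./contribution/")` returns an empty list when every file in the export directory is well-formed JSON.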
This is a public interest research project. We expect all contributors to:
- Be respectful and constructive
- Focus on facts and verifiable information
- Not engage in harassment or doxxing
- Respect the privacy of uninvolved individuals