The Epstein Files Dataset

Dataset Description

Dataset Summary

This dataset comprises a curated collection of publicly available documents and related materials concerning Jeffrey Epstein. It includes unsealed court filings, FBI reports, DOJ publications, and other official investigative records. These files have been aggregated from reputable public sources, such as the U.S. Department of Justice Epstein Library, House Oversight Committee releases, and unsealed federal court documents.

The dataset is presented in processed formats, including extracted text from PDFs and binary representations of any associated audio, images, or videos, to facilitate research, analysis, and archival purposes. All content is derived from public domain materials, with no addition of new copyrighted elements. Any applied processing is released under the MIT License.

Dataset Creation

Curation Rationale

The primary goal of this dataset is to enhance accessibility to public domain Epstein-related documents for researchers, journalists, and the general public, consolidating dispersed resources into a single, user-friendly repository on Hugging Face.

Source Data

Initial Data Collection or Creation

The data originates from a public torrent containing official releases. For transparency and reproducibility, the raw sources correspond to material in government archives such as the DOJ Epstein Library and the FBI Vault, so individual files can be cross-checked against those archives.

Structured Dataset (99.99% Complete) (206.18 GB)

Name: Epstein Files — Structured Dataset (Mostly Full) (1-12) 2026-02-04
Torrent Magnet: Magnet Link
SHA256: 29acc987cd7fadfbbf94444ed165750b84d82c85af3703bab74308ea9e91e910
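A download can be checked against the published digest before processing. The following is a minimal sketch, assuming the SHA256 value above refers to the downloaded archive file (the README does not say exactly which file it covers). The function names `sha256_of_file` and `matches_expected` are illustrative and are not part of `script_torrent.py`.

```python
import hashlib
from pathlib import Path

# Digest copied from the "SHA256" line above.
EXPECTED_SHA256 = "29acc987cd7fadfbbf94444ed165750b84d82c85af3703bab74308ea9e91e910"


def sha256_of_file(path, block_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it block by block
    so that very large archives never have to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare a file's digest with the expected value (case-insensitive)."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected("path/to/epstein-files.tar.zst")` would return `True` only if the file's digest equals the published one, under the assumption stated above.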

Source: yung-megafone/Epstein-Files

Processing Details

Text Extraction: Automated extraction of text from PDF documents.
Multimedia Handling: Audio, images, and videos extracted and stored as binary files optimized for Hugging Face datasets.

Recreate the Dataset

To recreate this dataset exactly as provided, follow the instructions below. The processing script converts the source .tar.zst archive into chunked Parquet files (~500 MB each) compatible with Hugging Face datasets.
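The size-based chunking mentioned above can be illustrated with a small generator. This is a sketch of the general technique only, not the logic of `script_torrent.py`, which is not shown here. The name `chunk_by_size`, the `(name, payload)` record shape, and the use of raw payload length as the size measure are all assumptions.

```python
from typing import Iterable, Iterator, List, Tuple

# Roughly 500 MB per chunk, matching the figure quoted above.
TARGET_CHUNK_BYTES = 500 * 1024 * 1024

Record = Tuple[str, bytes]  # hypothetical shape: (file name, raw content)


def chunk_by_size(
    records: Iterable[Record],
    target_bytes: int = TARGET_CHUNK_BYTES,
) -> Iterator[List[Record]]:
    """Group records into lists whose combined payload size stays at or
    below target_bytes. A single record larger than the target is still
    emitted, alone, in its own chunk."""
    chunk: List[Record] = []
    chunk_bytes = 0
    for name, payload in records:
        # Close the current chunk if adding this record would overflow it.
        if chunk and chunk_bytes + len(payload) > target_bytes:
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append((name, payload))
        chunk_bytes += len(payload)
    if chunk:
        yield chunk
```

Each yielded list could then be written out as one Parquet file. The on-disk size will usually differ from the in-memory payload size because of Parquet encoding and compression, so the 500 MB figure is approximate.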

Installation

pip install pandas pyarrow pypdfium2 zstandard pillow tqdm

Running the Pipeline

python script_torrent.py path/to/epstein-files.tar.zst --output-dir ./data --workers 8

Additional Information

Licensing Information

Underlying documents are in the public domain per U.S. law. Processing contributions are licensed under MIT.

Ethical Considerations

This dataset involves sensitive topics related to investigations of abuse and exploitation. Users are encouraged to handle the data responsibly, respecting privacy and legal standards.
