This dataset comprises a curated collection of publicly available documents and related materials concerning Jeffrey Epstein. It includes unsealed court filings, FBI reports, DOJ publications, and other official investigative records. These files have been aggregated from reputable public sources, such as the U.S. Department of Justice Epstein Library, House Oversight Committee releases, and unsealed federal court documents.
The dataset is presented in processed formats, including extracted text from PDFs and binary representations of any associated audio, images, or videos, to facilitate research, analysis, and archival purposes. All content is derived from public domain materials, with no addition of new copyrighted elements. Any applied processing is released under the MIT License.
The primary goal of this dataset is to enhance accessibility to public domain Epstein-related documents for researchers, journalists, and the general public, consolidating dispersed resources into a single, user-friendly repository on Hugging Face.
The data originates from a public torrent containing official releases. For transparency and reproducibility, the raw sources correspond to material in government archives such as the DOJ Epstein Library and the FBI Vault.
Structured Dataset (99.99% Complete) (206.18 GB)
- Name: Epstein Files — Structured Dataset (Mostly Full) (1-12) 2026-02-04
- Torrent Magnet: Magnet Link
- SHA256: 29acc987cd7fadfbbf94444ed165750b84d82c85af3703bab74308ea9e91e910
Source: yung-megafone/Epstein-Files
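The published SHA256 digest can be used to check a download before processing it. The following is a minimal sketch using only the Python standard library. It assumes the digest refers to the downloaded archive file, which this card does not state explicitly, and the function names `sha256_of_file` and `matches_expected` are illustrative rather than part of any provided tooling.

```python
import hashlib
from pathlib import Path

# Digest published in this card for the structured dataset.
EXPECTED_SHA256 = "29acc987cd7fadfbbf94444ed165750b84d82c85af3703bab74308ea9e91e910"


def sha256_of_file(path, block_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Stream the file so a multi-gigabyte archive never has to fit in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare a file's digest with the expected value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected.strip().lower()
```

For example, `matches_expected("path/to/epstein-files.tar.zst")` would return `True` only if the file's digest equals the published value. A mismatch may simply mean the digest covers a different artifact than the one assumed here.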
- Text Extraction: Automated extraction of text from PDF documents.
- Multimedia Handling: Audio, images, and videos extracted and stored as binary files optimized for Hugging Face datasets.
To recreate this dataset exactly as provided, follow the instructions below. The processing script converts the source .tar.zst archive into chunked Parquet files (~500 MB each) compatible with Hugging Face.
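As a rough illustration of the chunking idea described above, the sketch below greedily groups files into batches that stay under a size budget. It is not the logic of `script_torrent.py`, whose internals are not shown in this card. The function name `plan_chunks`, the `(name, size)` input shape, and the 500 MiB default are all assumptions made for the example.

```python
def plan_chunks(items, max_bytes=500 * 1024 * 1024):
    """Greedily group (name, size_in_bytes) pairs into chunks of at most max_bytes.

    Input order is preserved. An item larger than max_bytes is placed
    in a chunk of its own rather than being split.
    """
    chunks = []
    current = []
    current_size = 0
    for name, size in items:
        # Close the current chunk if adding this item would exceed the budget.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current = []
            current_size = 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each returned group would then be written out as one Parquet file. That writing step is omitted here because the actual schema is not documented in this card.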
    pip install pandas pyarrow pypdfium2 zstandard pillow tqdm
    python script_torrent.py path/to/epstein-files.tar.zst --output-dir ./data --workers 8

Underlying documents are in the public domain per U.S. law. Processing contributions are licensed under MIT.
This dataset involves sensitive topics related to investigations of abuse and exploitation. Users are encouraged to handle the data responsibly, respecting privacy and legal standards.
