
EastridgeAnalytics/Epstein-Files-Huggingface

 
 


The Epstein Files Dataset


Dataset Description

Dataset Summary

This dataset comprises a curated collection of publicly available documents and related materials concerning Jeffrey Epstein. It includes unsealed court filings, FBI reports, DOJ publications, and other official investigative records. These files have been aggregated from reputable public sources, such as the U.S. Department of Justice Epstein Library, House Oversight Committee releases, and unsealed federal court documents.

The dataset is provided in processed form: text extracted from the PDFs, plus binary copies of any associated audio, images, or videos, to support research, analysis, and archiving. All content is derived from public domain materials, and no new copyrighted material has been added. The processing contributions are released under the MIT License.

Dataset Creation

Curation Rationale

The primary goal of this dataset is to make public domain Epstein-related documents more accessible to researchers, journalists, and the general public by consolidating dispersed resources into a single, user-friendly repository on Hugging Face.

Source Data

Initial Data Collection or Creation

The data originates from a public torrent that aggregates official releases. For transparency and reproducibility, the raw sources can be checked against government archives such as the DOJ Epstein Library and the FBI Vault.

Structured Dataset (99.99% Complete) (206.18 GB)

  • Name: Epstein Files — Structured Dataset (Mostly Full) (1-12) 2026-02-04
  • Torrent Magnet: Magnet Link
  • SHA256: 29acc987cd7fadfbbf94444ed165750b84d82c85af3703bab74308ea9e91e910
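
A downloaded file can be checked against the checksum above before processing. The following is a minimal sketch using only the Python standard library. It assumes the published SHA256 covers the single downloaded archive, which this README does not state explicitly, and the function names are illustrative rather than part of the repository.

```python
import hashlib

# Checksum published in this README for the structured dataset.
EXPECTED_SHA256 = "29acc987cd7fadfbbf94444ed165750b84d82c85af3703bab74308ea9e91e910"


def sha256_of_file(path, block_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in blocks so a very large archive never has to fit in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest equals the expected checksum."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected("path/to/epstein-files.tar.zst")` returns `True` only when the digests agree. A mismatch may mean a corrupted download, or simply that the checksum refers to a different artifact than the one assumed here.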

Source: yung-megafone/Epstein-Files

Processing Details

  • Text Extraction: Automated extraction of text from PDF documents.
  • Multimedia Handling: Audio, images, and videos extracted and stored as binary files optimized for Hugging Face datasets.
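
As an illustration of the multimedia handling described above, the sketch below turns a media file into a single row holding its raw bytes. The field names (`file_name`, `mime_type`, `content`) are a hypothetical schema chosen for this example; the actual columns produced by the processing script are not documented here.

```python
import mimetypes
from pathlib import Path

# MIME type prefixes treated as multimedia in this sketch.
MEDIA_PREFIXES = ("audio/", "image/", "video/")


def is_media(path):
    """Return True when the file extension maps to an audio, image, or video type."""
    mime_type, _ = mimetypes.guess_type(Path(path).name)
    return bool(mime_type) and mime_type.startswith(MEDIA_PREFIXES)


def media_record(path):
    """Build one row for a media file (hypothetical schema, not the script's)."""
    path = Path(path)
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "file_name": path.name,
        # Fall back to a generic type when the extension is unknown.
        "mime_type": mime_type or "application/octet-stream",
        # The file is stored verbatim as bytes, without re-encoding.
        "content": path.read_bytes(),
    }
```

Keeping the bytes unmodified means a consumer can write `content` back to disk and recover the original file.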

Recreate the Dataset

To recreate this dataset as provided, follow the instructions below. The processing script converts the source .tar.zst archive into chunked Parquet files (~500 MB each) compatible with Hugging Face.
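
The chunking step can be pictured as grouping rows until an estimated size budget is reached. This is a rough sketch under stated assumptions, not the logic of `script_torrent.py`: each row is assumed to be a dict with a `content` bytes field, the budget is measured on raw bytes before Parquet compression, and the file naming pattern is invented for illustration.

```python
def chunk_by_size(records, target_bytes=500 * 1024 * 1024, size_of=None):
    """Yield lists of records whose estimated total size stays within target_bytes.

    A single record larger than the target is emitted in a chunk of its own.
    """
    if size_of is None:
        # Assumption: the payload lives in a bytes field named "content".
        size_of = lambda record: len(record["content"])
    chunk, chunk_bytes = [], 0
    for record in records:
        record_bytes = size_of(record)
        # Close the current chunk before it would exceed the budget.
        if chunk and chunk_bytes + record_bytes > target_bytes:
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(record)
        chunk_bytes += record_bytes
    if chunk:
        yield chunk


def chunk_file_name(index):
    """Return a zero-padded Parquet file name for a chunk (hypothetical pattern)."""
    return f"chunk-{index:05d}.parquet"
```

Each yielded list could then be written out as one file, for example with `pyarrow.parquet.write_table`. Because Parquet compresses its columns, the files on disk will usually be smaller than the raw byte estimate used here.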

Installation

pip install pandas pyarrow pypdfium2 zstandard pillow tqdm

Running the Pipeline

python script_torrent.py path/to/epstein-files.tar.zst --output-dir ./data --workers 8

Additional Information

Licensing Information

Underlying documents are in the public domain per U.S. law. Processing contributions are licensed under MIT.

Ethical Considerations

This dataset involves sensitive topics related to investigations of abuse and exploitation. Users are encouraged to handle the data responsibly, respecting privacy and legal standards.
