This project downloads PDF files from the U.S. Department of Justice (DOJ) Epstein-related disclosure pages (data sets 9, 10, and 11) and saves them to a local folder.
- Visits each of three DOJ dataset listing pages in a headless browser.
- Handles the age gate (“Are you 18 years or older?”) by clicking “Yes” when it appears.
- Walks pagination by following the “Next” link on each page until there are no more pages.
- Collects all unique PDF links from every page it visits.
- Downloads each PDF using the same browser session (cookies/headers) and saves it to a local directory, skipping files that already exist.
- Python 3 (tested with 3.13)
- Playwright and its Chromium browser
```
pip install playwright
python -m playwright install
```

The second command downloads the Chromium binary used for headless browsing.
From the project directory:
```
python load_files_locally_20260130.py
```

The script prints progress (which URLs it visits, how many PDF links it finds, and which files it downloads or skips). PDFs are written to:

```
doj_epstein_datasets_9_10_11_pdfs/
```
The script is configured for three base URLs:
- https://www.justice.gov/epstein/doj-disclosures/data-set-9-files
- https://www.justice.gov/epstein/doj-disclosures/data-set-10-files
- https://www.justice.gov/epstein/doj-disclosures/data-set-11-files
You can change these in the `DATASET_PAGES` list at the top of the script.
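The exact shape of `DATASET_PAGES` is an assumption, but based on the description it is presumably a plain list of the first listing page for each dataset, something like:

```python
# Hypothetical shape of the DATASET_PAGES list at the top of the script;
# each entry is the first listing page of one dataset. Edit this list to
# target different datasets.
DATASET_PAGES = [
    "https://www.justice.gov/epstein/doj-disclosures/data-set-9-files",
    "https://www.justice.gov/epstein/doj-disclosures/data-set-10-files",
    "https://www.justice.gov/epstein/doj-disclosures/data-set-11-files",
]
```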
- Playwright starts Chromium in headless mode and opens a new page.
- For each dataset (each base URL):
- Navigate to the base URL.
- If the response is not OK (e.g. 401, 403), skip that dataset and continue.
- If the “Are you 18 years of age or older?” prompt appears, click “Yes” and wait for the page to settle.
- Pagination loop:
- Record the current page URL (to avoid infinite loops).
- Find all links whose `href` ends in `.pdf` or `.ppdf` (including `.PDF`/`.PPDF`) and resolve them to absolute URLs. Add them to a set of unique PDF URLs.
- Look for a “Next” link (e.g. `li.pager__item--next a` or `a[rel='next']`). If found, click it and repeat; if not, stop for that dataset.
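The link-collection step can be sketched in plain Python. The function name and the idea of passing in pre-scraped `href` strings are assumptions for illustration; in the real script this filtering happens against anchors found on the live Playwright page.

```python
from urllib.parse import urljoin

def collect_pdf_urls(page_url, hrefs):
    """Keep only PDF-like links (case-insensitive, including the .ppdf typo)
    and resolve each against the current page URL. Returns a set, so the
    same PDF linked from multiple pages is only counted once."""
    pdf_urls = set()
    for href in hrefs:
        if href and href.lower().endswith((".pdf", ".ppdf")):
            pdf_urls.add(urljoin(page_url, href))
    return pdf_urls
```

Using a set (rather than a list) is what makes revisiting a page harmless: duplicate links collapse automatically.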
- Download phase: For each unique PDF URL (sorted for stable order):
- If a file with the same name already exists and has size > 0, skip it (`[OK] Exists`).
- Otherwise, request the URL with the same browser context (so cookies/headers are sent). If the response is not OK or the body doesn’t start with `%PDF`, skip and log it.
- Write the response body to a temporary `.part` file, then rename it to the final filename so partial downloads are not left as “finished” files.
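The skip/verify/rename logic above can be sketched as a small helper. The function name `save_pdf` is hypothetical, and the real script gets the body from a Playwright response rather than taking bytes as a parameter, but the checks and the `.part`-then-rename dance are the same idea:

```python
from pathlib import Path

def save_pdf(body: bytes, dest: Path) -> bool:
    """Write a downloaded PDF body to dest atomically. Returns True if a new
    file was written, False if the download was skipped."""
    if dest.exists() and dest.stat().st_size > 0:
        return False  # already downloaded; caller logs "[OK] Exists"
    if not body.startswith(b"%PDF"):
        return False  # server sent something else; caller logs "[!] Not a PDF"
    tmp = dest.parent / (dest.name + ".part")
    tmp.write_bytes(body)   # partial/interrupted writes only ever touch .part
    tmp.rename(dest)        # only complete files ever get the final name
    return True
```

Because only a successful, fully written `.part` file is renamed, an interrupted run never leaves a truncated file under the final name, which is what makes re-running safe.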
- Console: lines like `[*] Dataset: ...`, `- <url> -> N pdf links`, `[OK] Exists: ...`, `[DL] Downloaded: ...`, `[!] HTTP ...`, or `[!] Not a PDF ...`, and a final summary with the output directory path.
- Files: each PDF is saved under `doj_epstein_datasets_9_10_11_pdfs/` using the filename from the URL (e.g. `EFTA01262782.pdf`). The script normalizes `.ppdf` typos in links to `.pdf`.
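The filename derivation can be sketched as follows (the helper name `filename_for` is hypothetical): take the last path segment of the URL and normalize the `.ppdf` typo.

```python
import posixpath
from urllib.parse import urlparse

def filename_for(url):
    """Derive a local filename from a PDF URL: last path segment,
    with the .ppdf typo normalized to .pdf."""
    name = posixpath.basename(urlparse(url).path)  # ignores any ?query part
    if name.lower().endswith(".ppdf"):
        name = name[:-5] + ".pdf"
    return name
```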
The DOJ site returns 401/403 when paginated URLs (e.g. ?page=2) are requested directly. By using Playwright to load the first page and then click “Next,” the script follows the same path a user would, so the server accepts the requests and all listed PDFs can be collected and downloaded within the same session.
- Data set 9 may return 401 (Unauthorized) in some environments; the script skips it and continues with data sets 10 and 11.
- Re-running is safe: existing PDFs are skipped. Only missing or empty files are downloaded.
- The script uses ASCII-only print messages so it runs cleanly on Windows consoles (e.g. cp1252) without Unicode errors.