This repository contains a pipeline for collecting and downloading scientific figures from Nature and Nature Communications articles.
The pipeline consists of three main steps:
Step 1 fetches paper metadata for Nature and Nature Communications from the OpenAlex API. It collects papers published from 1900 to 2025 and stores them one file per year in papers_nature_and_natcomms/{year}.parquet.
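A minimal sketch of how one year of metadata could be pulled from OpenAlex with cursor pagination. It is not the repository's actual code: the function names (`build_url`, `iter_works`, `save_year`), the choice to filter by ISSN, and the selected columns are all assumptions made for illustration. The parquet step assumes pandas and pyarrow are installed.

```python
import json
import os
import urllib.parse
import urllib.request

OPENALEX_WORKS = "https://api.openalex.org/works"
# ISSNs for Nature (print, online) and Nature Communications.
# Filtering by ISSN is an assumption; the pipeline may select journals differently.
JOURNAL_ISSNS = ["0028-0836", "1476-4687", "2041-1723"]


def build_url(year, cursor="*", per_page=200):
    """Build one cursor-paginated OpenAlex request URL for a given year."""
    filters = [
        "primary_location.source.issn:" + "|".join(JOURNAL_ISSNS),
        f"publication_year:{year}",
    ]
    query = {"filter": ",".join(filters), "per-page": per_page, "cursor": cursor}
    return OPENALEX_WORKS + "?" + urllib.parse.urlencode(query)


def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_works(year, get_json=http_get_json):
    """Yield every work for `year`, following meta.next_cursor until it is empty."""
    cursor = "*"
    while cursor:
        page = get_json(build_url(year, cursor))
        yield from page["results"]
        cursor = page["meta"].get("next_cursor")


def save_year(year, out_dir="papers_nature_and_natcomms"):
    """Write one year of metadata to {out_dir}/{year}.parquet."""
    import pandas as pd  # imported lazily; requires pandas and pyarrow

    # Hypothetical column selection; the real schema is not documented here.
    rows = [
        {"id": w.get("id"), "doi": w.get("doi"), "title": w.get("title"),
         "publication_year": w.get("publication_year")}
        for w in iter_works(year)
    ]
    os.makedirs(out_dir, exist_ok=True)
    pd.DataFrame(rows).to_parquet(os.path.join(out_dir, f"{year}.parquet"), index=False)
```

Calling `save_year(2020)` would then produce papers_nature_and_natcomms/2020.parquet. Passing a custom `get_json` to `iter_works` makes the pagination logic testable without network access.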
Step 2 scrapes each paper's webpage to extract figure metadata, including image URLs, captions, and descriptions, and saves it to a per-paper CSV file at figures/{DOI}/figures.csv.
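The extraction step might look roughly like the sketch below, which uses only the standard library. It assumes figures appear as `<figure>` elements containing an `<img>` and a `<figcaption>`; the real page markup, the CSV column names (`image_url`, `caption`), and the helper names are assumptions, not the repository's implementation.

```python
import csv
import os
from html.parser import HTMLParser


class FigureParser(HTMLParser):
    """Collect one record per <figure>: first <img> src plus <figcaption> text."""

    def __init__(self):
        super().__init__()
        self.figures = []
        self._current = None       # record for the <figure> being parsed
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "figure":
            self._current = {"image_url": "", "caption": ""}
        elif self._current is None:
            return
        elif tag == "img" and not self._current["image_url"]:
            self._current["image_url"] = attrs.get("src") or ""
        elif tag == "figcaption":
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._current is not None:
            # Collapse runs of whitespace in the collected caption text.
            self._current["caption"] = " ".join(self._current["caption"].split())
            self.figures.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._in_caption and self._current is not None:
            self._current["caption"] += data


def touch(path):
    """Create an empty marker file."""
    with open(path, "w", encoding="utf-8"):
        pass


def save_figures(doi, html, root="figures"):
    """Parse `html` and write figures.csv plus a status file under {root}/{doi}/."""
    parser = FigureParser()
    parser.feed(html)
    out_dir = os.path.join(root, doi)
    os.makedirs(out_dir, exist_ok=True)
    if not parser.figures:
        touch(os.path.join(out_dir, "status_no_figures.txt"))
        return []
    with open(os.path.join(out_dir, "figures.csv"), "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["image_url", "caption"])
        writer.writeheader()
        writer.writerows(parser.figures)
    touch(os.path.join(out_dir, "status_success.txt"))
    return parser.figures
```

Because DOIs contain a slash (for example `10.1038/...`), `os.path.join(root, doi)` creates a nested directory per paper. An HTTP 404 when fetching the page would be recorded separately with status_404.txt; that fetch is left out of this sketch.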
Step 3 downloads the figure images from the extracted URLs to local storage, organized by DOI under the figures/ directory.
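As a rough illustration of the download step, the sketch below saves each image under figures/{DOI}/ and skips files that already exist, so an interrupted run can be resumed. The helper names, the file-naming rule (last URL path segment), and the handling of protocol-relative URLs are assumptions rather than the pipeline's documented behaviour.

```python
import os
import urllib.parse
import urllib.request


def local_name(url):
    """Derive a local file name from the last path segment of the image URL."""
    path = urllib.parse.urlparse(url).path
    return os.path.basename(path) or "figure"


def normalize_url(url):
    """Give protocol-relative URLs (//host/path) an explicit https scheme."""
    return "https:" + url if url.startswith("//") else url


def fetch_bytes(url):
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def download_figures(doi, urls, root="figures", fetch=fetch_bytes):
    """Download each URL into {root}/{doi}/, skipping files already present."""
    out_dir = os.path.join(root, doi)
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in urls:
        target = os.path.join(out_dir, local_name(url))
        if not os.path.exists(target):
            data = fetch(normalize_url(url))  # fetch first so a failure leaves no empty file
            with open(target, "wb") as f:
                f.write(data)
        saved.append(target)
    return saved
```

The `urls` argument would typically come from the URL column of the paper's figures.csv.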
Output files:
- papers_nature_and_natcomms/{year}.parquet - Paper metadata stored by year
- figures/{DOI}/figures.csv - Figure metadata for each paper
- figures/{DOI}/*.jpg|png - Downloaded figure images
- figures/{DOI}/status_*.txt - Collection status indicators:
  - status_success.txt - Figures successfully extracted
  - status_no_figures.txt - No figures found in the paper
  - status_404.txt - HTTP 404 error (paper not found)
Current Status: Statistics based on 2020 data only
- Papers collected: ~10,075 (2020)
- Total figures: ~41,820
- Total data size: 8.15 GB
Out of 10,075 papers collected:
- 9,204 (91.4%) - Successfully extracted figures
- 868 (8.6%) - No figures found
- 3 (0.03%) - HTTP 404 errors
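A breakdown like the one above could be recomputed from the status files on disk. The following is a sketch under the assumption that each paper directory holds exactly one status_*.txt marker; `tally_status` and `shares` are illustrative names, not part of the repository.

```python
import os
from collections import Counter


def tally_status(root="figures"):
    """Count status_*.txt markers under `root`, keyed by the status name."""
    counts = Counter()
    # os.walk is used because DOIs contain "/", so paper directories are nested.
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith("status_") and name.endswith(".txt"):
                counts[name[len("status_"):-len(".txt")]] += 1
    return counts


def shares(counts):
    """Convert raw counts to percentages of the total, rounded to 2 decimals."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: round(100 * n / total, 2) for status, n in counts.items()}
```

For example, `shares(tally_status("figures"))` returns a dictionary keyed by `success`, `no_figures`, and `404`.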
The histogram below shows the distribution of the number of figures per paper:
Most papers contain between 1 and 9 figures.
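The figures-per-paper distribution could be derived from the per-paper CSV files, for instance as sketched below. This assumes each figures.csv has a header row followed by one row per figure, which is not confirmed by the text above; the function names are hypothetical.

```python
import csv
import os
from collections import Counter


def figures_per_paper(root="figures"):
    """Map each paper directory (relative to `root`) to its figure count."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if "figures.csv" in filenames:
            path = os.path.join(dirpath, "figures.csv")
            with open(path, newline="", encoding="utf-8") as f:
                # DictReader consumes the header, so this counts data rows only.
                counts[os.path.relpath(dirpath, root)] = sum(1 for _ in csv.DictReader(f))
    return counts


def histogram(counts):
    """Return {figure_count: number_of_papers}, sorted by figure count."""
    return dict(sorted(Counter(counts.values()).items()))
```

The result of `histogram(figures_per_paper("figures"))` can be passed directly to a plotting library to draw the bar chart.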
