Scientific Figures Pipeline

This repository contains a pipeline for collecting and downloading scientific figures from Nature and Nature Communications articles.

Pipeline Overview

The pipeline consists of three main steps, each implemented in its own notebook:

Step 1: Collect Papers (`00-collect-paper.ipynb`)

Fetches paper metadata for the journals Nature and Nature Communications from the OpenAlex API. Covers papers from 1900 to 2025 and stores them one file per year in `papers_nature_and_natcomms/{year}.parquet`.
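The collection step can be sketched as a cursor-paginated query against the OpenAlex `works` endpoint. This is a minimal illustration, not the notebook's actual code: selecting the two journals by ISSN, the helper names (`build_params`, `fetch_page`, `iter_works`), and the page size are all assumptions.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_WORKS = "https://api.openalex.org/works"
# Assumption: journals are selected by ISSN
# (Nature: 0028-0836, Nature Communications: 2041-1723).
JOURNAL_ISSNS = ["0028-0836", "2041-1723"]


def build_params(year, cursor="*", per_page=200):
    """Query parameters for one page of works from both journals in one year."""
    issn_filter = "|".join(JOURNAL_ISSNS)  # "|" means OR inside an OpenAlex filter
    return {
        "filter": f"primary_location.source.issn:{issn_filter},publication_year:{year}",
        "per-page": per_page,
        "cursor": cursor,
    }


def fetch_page(params):
    """Fetch one page of results as a dict (performs a network call)."""
    url = f"{OPENALEX_WORKS}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_works(year, fetch=fetch_page):
    """Yield every work for a year, following cursors until none is returned."""
    cursor = "*"
    while cursor:
        page = fetch(build_params(year, cursor))
        yield from page["results"]
        cursor = page["meta"].get("next_cursor")
```

The records yielded by `iter_works(year)` could then be loaded into a pandas `DataFrame` and written with `DataFrame.to_parquet` to the per-year path. Passing `fetch` as a parameter keeps the pagination logic testable without network access.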

Step 2: Extract Figure Information (`01-collect-paper-figure-info.ipynb`)

Scrapes each paper's webpage to extract figure metadata, including image URLs, captions, and descriptions. Saves the figure information for each paper to its own CSV file at `figures/{DOI}/figures.csv`.
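As a rough sketch of the extraction step, the parser below pulls one record per HTML `<figure>` element using only the standard library. It assumes that figures appear as `<figure>` elements containing an `<img>` and a `<figcaption>`; the selectors the notebook really uses, and the CSV column names `image_url` and `caption`, are not documented here and are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class FigureParser(HTMLParser):
    """Collect one record per <figure>: first <img> src and <figcaption> text."""

    def __init__(self):
        super().__init__()
        self.figures = []
        self._current = None       # record for the <figure> being parsed
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "figure":
            self._current = {"image_url": "", "caption": ""}
        elif self._current is not None:
            if tag == "img" and not self._current["image_url"]:
                self._current["image_url"] = attrs.get("src") or ""
            elif tag == "figcaption":
                self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._current is not None:
            # Collapse runs of whitespace left over from nested markup.
            self._current["caption"] = " ".join(self._current["caption"].split())
            self.figures.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._in_caption and self._current is not None:
            self._current["caption"] += data


def extract_figures(html):
    """Return a list of figure records found in an HTML string."""
    parser = FigureParser()
    parser.feed(html)
    return parser.figures


def figures_to_csv(figures):
    """Serialize figure records to CSV text (hypothetical column names)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["image_url", "caption"])
    writer.writeheader()
    writer.writerows(figures)
    return buf.getvalue()
```

Images outside a `<figure>` element, such as logos, are ignored, which is the main reason for keying on `<figure>` rather than on every `<img>`.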

Step 3: Download Figures (`02-collect-figures.ipynb`)

Downloads the figure images from the extracted URLs to local storage, organized by DOI under the `figures/` directory.
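One way to make the download step safe to rerun is to skip files that already exist and to rename a temporary file only after a complete write. The sketch below illustrates that idea; the function names, the `.part` suffix, and the handling of protocol-relative URLs are assumptions rather than the notebook's documented behavior.

```python
import os
import urllib.parse
import urllib.request


def normalize_url(url):
    """Protocol-relative URLs ("//host/path") need an explicit scheme."""
    return "https:" + url if url.startswith("//") else url


def target_path(doi_dir, url):
    """Local path for a figure: keep the file name from the URL path."""
    name = os.path.basename(urllib.parse.urlparse(url).path)
    return os.path.join(doi_dir, name)


def download_figure(url, doi_dir, opener=urllib.request.urlopen):
    """Download one figure unless it already exists; return the local path."""
    url = normalize_url(url)
    dest = target_path(doi_dir, url)
    if os.path.exists(dest):
        return dest  # already downloaded, so reruns skip the request
    os.makedirs(doi_dir, exist_ok=True)
    with opener(url) as resp, open(dest + ".part", "wb") as out:
        out.write(resp.read())
    os.replace(dest + ".part", dest)  # rename only after a complete write
    return dest
```

Here `doi_dir` stands for the paper's directory under `figures/`; how a DOI is mapped to that directory name is left to the caller.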

Data Structure

- `papers_nature_and_natcomms/{year}.parquet` - Paper metadata stored by year
- `figures/{DOI}/figures.csv` - Figure metadata for each paper
- `figures/{DOI}/*.jpg|png` - Downloaded figure images
- `figures/{DOI}/status_*.txt` - Collection status indicators:
  - `status_success.txt` - Figures successfully extracted
  - `status_no_figures.txt` - No figures found in the paper
  - `status_404.txt` - HTTP 404 error (paper not found)
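The status files make it possible to tell which papers have already been processed. A minimal sketch of reading and writing them follows; it assumes the files are empty markers whose names alone carry the status, and the helper names are hypothetical.

```python
from pathlib import Path

# Status names taken from the file names listed above.
STATUSES = ("success", "no_figures", "404")


def read_status(doi_dir):
    """Return the status recorded in a paper directory, or None if unprocessed."""
    for status in STATUSES:
        if (Path(doi_dir) / f"status_{status}.txt").exists():
            return status
    return None


def mark_status(doi_dir, status):
    """Record a status by creating the matching (empty) marker file."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    directory = Path(doi_dir)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / f"status_{status}.txt").touch()
```

A scraping loop could call `read_status` first and skip any directory where it returns a value, so that an interrupted run resumes where it stopped.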

Statistics

Current Status: Statistics based on 2020 data only

Dataset Overview

Papers collected: 10,075 (2020)
Total figures: ~41,820 figures
Total data size: 8.15 GB

Collection Status

Out of 10,075 papers collected:

9,204 (91.4%) - Successfully extracted figures
868 (8.6%) - No figures found
3 (0.03%) - HTTP 404 errors
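The percentages above follow directly from the three counts, which sum to the 10,075 papers collected. A short check, using a hypothetical helper `status_shares`:

```python
def status_shares(counts):
    """Percentage of papers per status, rounded to two decimals."""
    total = sum(counts.values())
    return {status: round(100 * n / total, 2) for status, n in counts.items()}


# Counts reported for the 2020 data.
counts_2020 = {"success": 9204, "no_figures": 868, "404": 3}
shares_2020 = status_shares(counts_2020)
# shares_2020 == {"success": 91.35, "no_figures": 8.62, "404": 0.03}
```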

Distribution of Figure Counts

The histogram below shows the distribution of the number of figures per paper:

Most papers contain between 1 and 9 figures. Each bar of the histogram gives the number of papers with that figure count.
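The counts behind such a histogram could be derived from the per-paper CSV files. The sketch below assumes each `figures.csv` has a header row followed by one row per figure, which is not confirmed by the repository description; the function names are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def count_figures(csv_path):
    """Number of data rows (figures) in one figures.csv, excluding the header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.DictReader(f))


def figure_count_distribution(root):
    """Map 'number of figures' to 'number of papers with that many figures'."""
    # rglob also finds files in nested directories, e.g. when a DOI contains "/".
    return Counter(count_figures(p) for p in Path(root).rglob("figures.csv"))
```

The resulting `Counter` can be passed to any plotting library as bar heights keyed by figure count.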

Northwestern-CSSI/proj-figures
