Northwestern-CSSI/proj-figures

Scientific Figures Pipeline

This repository contains a pipeline for collecting and downloading scientific figures from Nature and Nature Communications articles.

Pipeline Overview

The pipeline consists of three main steps, each implemented as a Jupyter notebook:

Step 1: Collect Papers (00-collect-paper.ipynb)

Fetches paper metadata from Nature and Nature Communications journals using the OpenAlex API. Collects papers from 1900-2025 and stores them year by year in papers_nature_and_natcomms/{year}.parquet.
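As a rough illustration of this step, the sketch below builds an OpenAlex works query for one publication year and follows cursor pagination. It is not the notebook's actual code. Selecting the journals by ISSN (0028-0836 for Nature, 2041-1723 for Nature Communications) and the helper names `build_url` and `fetch_year` are assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_WORKS = "https://api.openalex.org/works"
# Assumption: journals are selected by ISSN
# (Nature: 0028-0836, Nature Communications: 2041-1723).
JOURNAL_ISSNS = ["0028-0836", "2041-1723"]


def build_url(year, cursor="*", per_page=200):
    """Build the works query URL for one publication year (hypothetical helper)."""
    filters = ",".join([
        "primary_location.source.issn:" + "|".join(JOURNAL_ISSNS),
        f"publication_year:{year}",
    ])
    query = urllib.parse.urlencode(
        {"filter": filters, "per-page": per_page, "cursor": cursor}
    )
    return f"{OPENALEX_WORKS}?{query}"


def fetch_year(year):
    """Yield work records for one year, following cursor pagination."""
    cursor = "*"
    while cursor:
        with urllib.request.urlopen(build_url(year, cursor)) as resp:
            page = json.load(resp)
        yield from page["results"]
        # OpenAlex returns a null next_cursor once the last page is reached.
        cursor = page["meta"].get("next_cursor")
```

The records for each year could then be loaded into a table and written to papers_nature_and_natcomms/{year}.parquet, for example with pandas.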

Step 2: Extract Figure Information (01-collect-paper-figure-info.ipynb)

Scrapes each paper's webpage to extract figure metadata including images, captions, and descriptions. Saves the figure information to individual CSV files in figures/{DOI}/figures.csv.
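The extraction can be sketched with the standard library alone. The example assumes each figure appears as a `<figure>` element containing an `<img>` and a `<figcaption>`. The real article markup may differ, and the class and function names here are hypothetical, not taken from the notebook.

```python
import csv
import io
from html.parser import HTMLParser


class FigureParser(HTMLParser):
    """Collect the first image URL and the caption text of each <figure>."""

    def __init__(self):
        super().__init__()
        self.figures = []
        self._current = None
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "figure":
            self._current = {"image_url": "", "caption": ""}
        elif self._current is None:
            return
        elif tag == "img" and not self._current["image_url"]:
            self._current["image_url"] = attrs.get("src") or ""
        elif tag == "figcaption":
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._current is not None:
            # Collapse runs of whitespace in the caption before storing it.
            self._current["caption"] = " ".join(self._current["caption"].split())
            self.figures.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._in_caption and self._current is not None:
            self._current["caption"] += data


def extract_figures(html):
    """Return a list of {"image_url", "caption"} dicts found in the page."""
    parser = FigureParser()
    parser.feed(html)
    return parser.figures


def figures_to_csv(figures):
    """Serialize figure records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["image_url", "caption"])
    writer.writeheader()
    writer.writerows(figures)
    return buffer.getvalue()
```

The CSV text returned by `figures_to_csv` is what would be written to figures/{DOI}/figures.csv. The column names are illustrative only.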

Step 3: Download Figures (02-collect-figures.ipynb)

Downloads the actual figure images from the extracted URLs to local storage, organizing them by DOI in the figures/ directory.
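A minimal sketch of the download step follows, assuming the local file name is taken from the last path segment of the image URL and that files already on disk are skipped. Both choices, and the function names, are assumptions for illustration. Because a DOI contains a slash, figures/{DOI}/ becomes a nested directory in this sketch.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def figure_path(root, doi, image_url):
    """Map an image URL to its local path under root/DOI/ (hypothetical layout)."""
    filename = Path(urlparse(image_url).path).name
    return Path(root) / doi / filename


def download_figure(root, doi, image_url):
    """Download one figure unless it already exists locally; return its path."""
    target = figure_path(root, doi, image_url)
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(image_url, str(target))
    return target
```

Skipping existing files lets an interrupted run be restarted without downloading everything again.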

Data Structure

  • papers_nature_and_natcomms/{year}.parquet - Paper metadata stored by year
  • figures/{DOI}/figures.csv - Figure metadata for each paper
  • figures/{DOI}/*.jpg|png - Downloaded figure images
  • figures/{DOI}/status_*.txt - Collection status indicators:
    • status_success.txt - Figures successfully extracted
    • status_no_figures.txt - No figures found in the paper
    • status_404.txt - HTTP 404 error (paper not found)
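The status files make it possible to summarize a collection run by scanning the figures/ directory. The sketch below assumes one status file per paper directory. The helper names `status_of` and `tally_statuses` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def status_of(filename):
    """Return the status label encoded in a name such as 'status_no_figures.txt'."""
    stem = Path(filename).stem  # e.g. 'status_no_figures'
    return stem.split("_", 1)[1]  # e.g. 'no_figures'


def tally_statuses(root="figures"):
    """Count status files of each kind anywhere below the root directory."""
    return Counter(status_of(p.name) for p in Path(root).rglob("status_*.txt"))
```

Calling `tally_statuses("figures")` returns a Counter keyed by `success`, `no_figures`, and `404`.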

Statistics

Current Status: Statistics based on 2020 data only

Dataset Overview

  • Papers collected: 10,075 (2020)
  • Total figures: ~41,820
  • Total data size: 8.15 GB

Collection Status

Out of 10,075 papers collected:

  • 9,204 (91.4%) - Successfully extracted figures
  • 868 (8.6%) - No figures found
  • 3 (0.03%) - HTTP 404 errors
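The percentages above follow directly from the three counts, as this short check shows. The dictionary keys are illustrative labels, not identifiers from the notebooks.

```python
# Counts reported above for the 2020 papers.
counts = {"success": 9204, "no_figures": 868, "404": 3}

total = sum(counts.values())

# Share of each outcome as a percentage of all collected papers.
shares = {status: 100 * n / total for status, n in counts.items()}

for status, share in shares.items():
    print(f"{status}: {counts[status]} of {total} ({share:.2f}%)")
```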

Distribution of Figure Counts

The histogram below shows the distribution of the number of figures per paper:

[Figure: Number of Figures Distribution]

Most papers contain between 1 and 9 figures.
