
# OpenAlex System Documentation

Technical reference for developers and AI coding agents working on the OpenAlex data pipeline.

This repository documents the complete OpenAlex backend system, including the core Databricks processing pipeline and all external modules that feed into it.


## System Architecture

```mermaid
flowchart LR
    subgraph sources ["External Data Sources"]
        OAI["OAI-PMH Repos<br/>(openalex-ingest → S3)"]
        CR["Crossref API"]
        PM["PubMed/PMC"]
        DC["DataCite"]
        PDF["PDFs & HTML<br/>(taxicab → R2)"]
    end

    subgraph processing ["Content Processing"]
        GROBID["GROBID<br/>(AWS ECS)"]
        PL["parseland-lib<br/>(landing pages)"]
    end

    subgraph databricks ["Databricks (Walden)"]
        DLT["DLT Pipelines<br/>ML Models<br/>Entity Resolution<br/>PDF Parsing<br/>Landing Pages"]
    end

    subgraph openalex_api ["OpenAlex API"]
        ES["Elasticsearch"]
        ELASTIC_API["openalex-elastic-api<br/>(Heroku)"]
        PROXY["openalex-api-proxy<br/>(Cloudflare Worker)"]
        ENDPOINT["api.openalex.org"]
    end

    subgraph unpaywall ["Unpaywall"]
        UPW_PG["Unpaywall Postgres"]
        UPW_API["api.unpaywall.org<br/>(oadoi)"]
    end

    subgraph webui ["OpenAlex Web UI"]
        GUI["OpenAlex GUI"]
    end

    subgraph users ["User Management"]
        USERS_PG["Users Postgres"]
        USERS_API["openalex-users-api"]
        D1["D1 (API keys)"]
        DO["Durable Objects<br/>(rate limiting)"]
    end

    %% Data ingestion
    OAI --> DLT
    CR --> DLT
    PM --> DLT
    DC --> DLT
    PDF --> DLT

    DLT <--> GROBID
    DLT <--> PL

    %% OpenAlex API branch
    DLT --> ES
    ES --> ELASTIC_API
    ELASTIC_API --> PROXY
    PROXY --> ENDPOINT
    ENDPOINT --> API_USER["User"]

    %% Unpaywall branch
    DLT --> UPW_PG
    UPW_PG --> UPW_API
    UPW_API --> UPW_USER["User"]

    %% Web UI branch
    ENDPOINT --> GUI
    GUI --> UI_USER["User"]

    %% User management flow
    GUI <--> USERS_API
    USERS_API <--> USERS_PG
    USERS_API --> D1
    D1 --> DO
    DO --> PROXY
```

## Documentation Index

### Core System

| Document | Description |
| --- | --- |
| Databricks Overview | Complete guide to the Walden system on Databricks: pipelines, schemas, workflows, and data flow |

### External Modules

| Document | Description | Status |
| --- | --- | --- |
| openalex-ingest | OAI-PMH repository harvesting and data ingestion to S3 | ✅ Done |
| openalex-taxicab | PDF and landing page download system (ECS + Cloudflare R2) | ✅ Done |
| GROBID | PDF text extraction (AWS ECS) | ✅ Done |
| parseland-lib | Landing page parsing for metadata extraction | ✅ Done |
| openalex-api-proxy | Cloudflare Worker for API auth/rate limiting | ✅ Done |
| openalex-users-api | User management API (Heroku) | ✅ Done |
| openalex-elastic-api | Elasticsearch-backed API for OpenAlex queries | 🔄 Planned |
| Unpaywall | Open access detection (branch of OpenAlex pipeline) | 🔄 Planned |

### Databricks-Native Components

These are handled directly within Databricks, not as separate external repos:

| Component | Location | Notes |
| --- | --- | --- |
| Topic Classification | `openalex-walden/notebooks/topics/` | ML-based topic assignment |
| RecordThresher | `/Shared/recordthresher/` | Crossref data processing |

### Red Herrings (Not Used / Legacy)

These repositories may appear relevant but are either deprecated or not used in the current system:

| Repository | Status | Notes |
| --- | --- | --- |
| openalex-api-proxy (Python/Heroku) | Legacy | Old proxy; production uses the Cloudflare Worker |
| openalex-grobid | Legacy | Old REST API wrapper; GROBID is now called directly from Databricks |
| parseland | Not used | Use parseland-lib instead |
| openalex-guts | Legacy | Old system, not used in Walden |
| openalex-topic-classification | Legacy | Topics are now handled natively in Databricks notebooks |
| sickle | — | Low-level OAI-PMH library; higher-level harvesting is done by openalex-ingest |

## Quick Facts

| Aspect | Details |
| --- | --- |
| Core Platform | Databricks on AWS (Unity Catalog) |
| Processing | Apache Spark, Delta Live Tables (DLT) |
| Storage | Delta Lake (ACID transactions), S3, Cloudflare R2 |
| ML | PySpark ML, BERT models for topics |
| Search | Elasticsearch |
| API Proxy | Cloudflare Workers (Durable Objects + D1) |
| APIs | openalex-elastic-api, openalex-users-api (Heroku) |

## Data Flow Summary

For the detailed pipeline DAG with all task dependencies, see: Databricks Overview - Walden End-to-End Pipeline DAG

### High-Level Flow

1. **External Sources** feed data into the system:
   - OAI-PMH repositories (via openalex-ingest to S3)
   - Direct API ingestion (Crossref, PubMed, DataCite)
   - PDFs and landing pages (via taxicab to Cloudflare R2)
2. **Content Processing**:
   - PDFs → GROBID → text extraction
   - Landing pages → parseland-lib → metadata extraction
3. **Databricks (Walden)** is the processing nexus (the `walden_end2end` workflow; see the sketch after this list):
   - Ingestion: 7 parallel DLT pipelines (Crossref, PubMed, DataCite, MAG, PDF, Repos, Landing_Page)
   - Union: consolidates 6 sources into unified Works tables
   - Enrichment: Super Authorships, Locations, Entity Resolution
   - Works Creation: final OpenAlex Works records
   - Runs nightly at 10:30 PM UTC
4. **Production Output**:
   - Syncs to Elasticsearch
   - Served via openalex-elastic-api
   - Proxied through the Cloudflare Worker (rate limiting, API keys)
   - User management via openalex-users-api
5. **Unpaywall** is a branch of the OpenAlex pipeline (prod only):
   - Takes OpenAlex Works data from the `walden_end2end` workflow
   - The `Wunpaywall` task transforms it to the Unpaywall schema
   - Conditional export (only if `env == prod`):
     - `Wunpaywall_Data_Feed` → S3 for data feed subscribers
     - `Wunpaywall_to_OpenAlex_DB` → PostgreSQL for the unpaywall.org API
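
To make the Union and conditional-export steps concrete, here is a minimal DLT-style sketch in PySpark. It is illustrative only: the table names, the `provenance` column, and the `WALDEN_ENV` flag are assumptions, not the real openalex-walden code.

```python
import os

import dlt  # Delta Live Tables decorators, available inside a DLT pipeline
from pyspark.sql import functions as F

# The six sources consolidated by the Union step (names are illustrative).
SOURCES = ["crossref", "pubmed", "datacite", "mag", "repos", "landing_page"]

@dlt.table(name="works_unified", comment="Union of the per-source works tables")
def works_unified():
    # Stack the normalized per-source tables into one Works staging table,
    # tagging each row with its source for downstream entity resolution.
    dfs = [dlt.read(f"works_{s}").withColumn("provenance", F.lit(s)) for s in SOURCES]
    unioned = dfs[0]
    for df in dfs[1:]:
        unioned = unioned.unionByName(df, allowMissingColumns=True)
    return unioned

# The Unpaywall branch only materializes in prod, mirroring the
# "only if env == prod" condition above (the flag name is hypothetical).
if os.environ.get("WALDEN_ENV") == "prod":
    @dlt.table(name="unpaywall_works")
    def unpaywall_works():
        # Project the unified Works records into an Unpaywall-shaped table.
        return dlt.read("works_unified").select("doi", "is_oa", "oa_locations")
```

The real pipeline also runs the enrichment and Works-creation stages between these two steps; see the Databricks Overview for the full DAG.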

## Key Repositories

### Active in Production

| Repository | Purpose | Platform |
| --- | --- | --- |
| openalex-walden | Core Databricks pipeline code | Databricks |
| openalex-ingest | OAI-PMH harvesting to S3 | AWS Lambda |
| openalex-taxicab | PDF/landing page harvester | AWS ECS |
| parseland-lib | Landing page parsing library | Python (in Databricks) |
| openalex-api-proxy | API rate limiting/auth | Cloudflare Workers |
| openalex-elastic-api | Elasticsearch API | Heroku |
| openalex-users-api | User management API | Heroku |
| oadoi | Unpaywall backend | Heroku |

### Databricks Components (in-repo)

These are found within the Databricks workspace, not as separate GitHub repos:

- **recordthresher** - Crossref data processing (notebooks in `/Shared/recordthresher/`)
- **topics notebooks** - Topic classification ML (in `openalex-walden/notebooks/topics/`)
- **openalex_dlt_utils** - Custom DLT library for normalization (see the sketch below)
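
`openalex_dlt_utils` is not browsable outside the workspace, so the following only gestures at the kind of normalization helper it provides; `normalize_doi` and its behavior are hypothetical stand-ins, not the library's actual API.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def normalize_doi(col):
    # Hypothetical helper: lowercase the DOI and strip the resolver prefix
    # so records from different sources join on the same key.
    return F.lower(F.regexp_replace(col, r"^https?://(dx\.)?doi\.org/", ""))

df = spark.createDataFrame([("https://doi.org/10.1234/ABC",)], ["doi"])
df.select(normalize_doi(F.col("doi")).alias("doi_normalized")).show(truncate=False)
# +--------------+
# |doi_normalized|
# +--------------+
# |10.1234/abc   |
# +--------------+
```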

## API Architecture

```mermaid
flowchart TB
    USER["User Request<br/>(with API key)"] --> ENDPOINT["api.openalex.org"]
    ENDPOINT --> PROXY

    subgraph PROXY ["Cloudflare Worker (openalex-api-proxy)"]
        DO["Durable Objects<br/>(rate limiting)"]
        D1["D1 (API keys)"]
    end

    PROXY --> ELASTIC["openalex-elastic-api<br/>(Heroku)"]
    ELASTIC --> ES["Elasticsearch<br/>(synced from Databricks)"]

    subgraph users ["User Management"]
        GUI["OpenAlex GUI"]
        USERS_API["openalex-users-api<br/>(Heroku)"]
        USERS_PG["Users Postgres"]
    end

    GUI_USER["User"] <--> GUI
    GUI <--> USERS_API
    USERS_API <--> USERS_PG
    USERS_API --> D1
    GUI --> ENDPOINT
```

**How it works:**

- Users create accounts and API keys via the OpenAlex GUI
- The GUI talks to openalex-users-api, which stores user data in Users Postgres
- API keys are synced to D1, which feeds Durable Objects for rate limiting
- When users make API requests (directly or via the GUI), the Cloudflare Worker validates their API key and enforces rate limits

**Note:** The "polite pool" mentioned in public documentation is not currently implemented. Rate limiting is based on IP address and API key only.
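
From a client's perspective, the stack above reduces to sending a key and respecting 429s. A minimal sketch follows; the `api_key` query parameter name is an assumption here, so adjust to however your key is provisioned.

```python
import time

import requests

def get_works(filter_expr: str, api_key: str) -> dict:
    """Query api.openalex.org, backing off when the proxy rate-limits us."""
    params = {"filter": filter_expr, "api_key": api_key}  # param name assumed
    for attempt in range(5):
        resp = requests.get("https://api.openalex.org/works", params=params, timeout=30)
        if resp.status_code == 429:
            # The Durable Object rate limiter exhausted this key's (or IP's)
            # budget; wait and retry with exponential backoff.
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("rate limited on every attempt")

works = get_works("from_publication_date:2024-01-01", api_key="YOUR_KEY")
print(works["meta"]["count"])
```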


## Getting Started

### For Developers

  1. Start with Databricks Overview to understand the core system
  2. Review specific module documentation as needed
  3. Check the "Gotchas" section in Databricks Overview for common pitfalls

### For AI Coding Agents

  1. This documentation provides system context for code generation
  2. Key patterns are documented in each module's overview
  3. When modifying code, check the "Related Projects" sections to understand dependencies
  4. Pay attention to the Red Herrings section to avoid working on deprecated code

## How This Documentation Was Created

This documentation was automatically generated by Claude Code with human steering. No manual writing—just an AI agent exploring the OpenAlex ecosystem with guidance on where to look and what to prioritize.

### Tools Used

| Tool | Purpose |
| --- | --- |
| Claude Chrome Extension | Browser automation for exploring Databricks UI, Heroku dashboards, Cloudflare dashboards, and other web interfaces. Essential when CLIs or MCPs lacked the needed functionality. |
| Databricks MCP | Unity Catalog exploration: listing catalogs, schemas, tables, and querying data. More capable than the official Databricks MCP tools. |
| GitHub CLI (gh) | Searching repos, listing organization repositories, cloning repos for local analysis. |
| Heroku CLI | Inspecting Heroku apps, configs, and add-ons for the API services. |
| Wrangler CLI | Exploring Cloudflare Workers configuration for the API proxy. |
| Bash tools (grep, etc.) | Reading and searching through locally-cloned repositories. |

### Limitations Encountered

- **Chrome extension for code reading:** While powerful for UI exploration, scrolling through Databricks notebooks in the browser is slow. For actual code/notebook content, cloning repos locally or using the Databricks MCP is faster.
- **MCP installation:** The Databricks MCP required some setup effort but was worth it for catalog exploration.

## Auditing & Updating This Documentation

This section is for future developers or AI agents who need to verify or update this documentation.

### General Approach

  1. Be interactive: Work with a human in the loop. Ask questions when something is unclear or when you need to make judgment calls about what's important.

  2. Be persistent: Mapping this system takes time. Don't give up when you hit dead ends—follow threads across multiple tools and sources.

  3. Use all available tools: Different parts of the system are best explored with different tools. Web UIs, CLIs, MCPs, and local code search all have their place.

### Start with Databricks

Databricks (Walden) is the center of everything. The most reliable way to understand which peripheral repositories are actually used—and for what—is to:

  1. Read the Databricks notebooks and workflows in the openalex-walden repo
  2. Trace the dependencies: When a notebook imports something or calls an external service, follow that thread
  3. Check the DLT pipelines: These define the actual data flow and show which sources feed into the system
  4. Look at job configurations: The walden_end2end workflow and its tasks reveal the production architecture

Many repositories exist that look relevant but aren't actually used in production. Starting from Databricks and following the threads outward is how you distinguish active components from legacy/deprecated ones.

### Recommended Audit Process

  1. Verify Databricks first: Use the Databricks MCP to explore Unity Catalog schemas and tables, and check whether the documented tables/pipelines still exist and match the descriptions (a scriptable alternative is sketched after this list).

  2. Check external services: Use the Chrome extension to browse Heroku, Cloudflare, and AWS dashboards. Verify apps are still running and configs match documentation.

  3. Clone and search repos: Use gh to clone relevant repos, then use grep/search to verify code patterns and integrations described in the docs.

  4. Cross-reference: When documentation claims "X calls Y", verify by finding the actual code that does this.

  5. Update the Red Herrings section: As repos are deprecated or new ones created, keep this section current to save future auditors time.
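
If you'd rather script step 1 than drive the MCP interactively, the Databricks Python SDK can run the same Unity Catalog checks; the catalog and schema names below are placeholders, not the real ones.

```python
from databricks.sdk import WorkspaceClient

# Credentials come from the standard env vars (DATABRICKS_HOST /
# DATABRICKS_TOKEN) or a configured profile.
w = WorkspaceClient()

# List the tables in a schema and compare against the documented ones
# ("openalex" / "works" are placeholders).
for table in w.tables.list(catalog_name="openalex", schema_name="works"):
    print(table.full_name, table.table_type)
```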

### Tools Setup

To audit this documentation, you'll want:

```bash
# GitHub CLI (for repo exploration)
brew install gh
gh auth login

# Heroku CLI (for Heroku app inspection)
brew tap heroku/brew && brew install heroku
heroku login

# Wrangler (for Cloudflare Workers)
npm install -g wrangler
wrangler login

# Databricks MCP - see MCP server documentation for setup
# Claude Chrome Extension - install from Chrome Web Store
```

## Contributing

Documentation lives at: github.com/ourresearch/openalex-overview

When updating:

  1. Keep module documentation in the `/modules/` directory
  2. Update this index when adding new modules
  3. Mark deprecated modules in the "Red Herrings" section
  4. If you discover the documented behavior no longer matches reality, update the docs and note the date

## Contact

- **Primary:** OurResearch team
- **Databricks Owners:** Casey Meyer, Artem Kazmerchuk

Last updated: January 2026
