Technical reference for developers and AI coding agents working on the OpenAlex data pipeline.
This repository documents the complete OpenAlex backend system, including the core Databricks processing pipeline and all external modules that feed into it.
```mermaid
flowchart LR
subgraph sources ["External Data Sources"]
OAI["OAI-PMH Repos<br/>(openalex-ingest → S3)"]
CR["Crossref API"]
PM["PubMed/PMC"]
DC["DataCite"]
PDF["PDFs & HTML<br/>(taxicab → R2)"]
end
subgraph processing ["Content Processing"]
GROBID["GROBID<br/>(AWS ECS)"]
PL["parseland-lib<br/>(landing pages)"]
end
subgraph databricks ["Databricks (Walden)"]
DLT["DLT Pipelines<br/>ML Models<br/>Entity Resolution<br/>PDF Parsing<br/>Landing Pages"]
end
subgraph openalex_api ["OpenAlex API"]
ES["Elasticsearch"]
ELASTIC_API["openalex-elastic-api<br/>(Heroku)"]
PROXY["openalex-api-proxy<br/>(Cloudflare Worker)"]
ENDPOINT["api.openalex.org"]
end
subgraph unpaywall ["Unpaywall"]
UPW_PG["Unpaywall Postgres"]
UPW_API["api.unpaywall.org<br/>(oadoi)"]
end
subgraph webui ["OpenAlex Web UI"]
GUI["OpenAlex GUI"]
end
subgraph users ["User Management"]
USERS_PG["Users Postgres"]
USERS_API["openalex-users-api"]
D1["D1 (API keys)"]
DO["Durable Objects<br/>(rate limiting)"]
end
%% Data ingestion
OAI --> DLT
CR --> DLT
PM --> DLT
DC --> DLT
PDF --> DLT
DLT <--> GROBID
DLT <--> PL
%% OpenAlex API branch
DLT --> ES
ES --> ELASTIC_API
ELASTIC_API --> PROXY
PROXY --> ENDPOINT
ENDPOINT --> API_USER["User"]
%% Unpaywall branch
DLT --> UPW_PG
UPW_PG --> UPW_API
UPW_API --> UPW_USER["User"]
%% Web UI branch
ENDPOINT --> GUI
GUI --> UI_USER["User"]
%% User management flow
GUI <--> USERS_API
USERS_API <--> USERS_PG
USERS_API --> D1
D1 --> DO
DO --> PROXY
```
| Document | Description |
|---|---|
| Databricks Overview | Complete guide to the Walden system on Databricks - pipelines, schemas, workflows, and data flow |
| Document | Description | Status |
|---|---|---|
| openalex-ingest | OAI-PMH repository harvesting and data ingestion to S3 | ✅ Done |
| openalex-taxicab | PDF and landing page download system (ECS + Cloudflare R2) | ✅ Done |
| GROBID | PDF text extraction (AWS ECS) | ✅ Done |
| parseland-lib | Landing page parsing for metadata extraction | ✅ Done |
| openalex-api-proxy | Cloudflare Worker for API auth/rate limiting | ✅ Done |
| openalex-users-api | User management API (Heroku) | ✅ Done |
| openalex-elastic-api | Elasticsearch-backed API for OpenAlex queries | 🔄 Planned |
| Unpaywall | Open access detection (branch of OpenAlex pipeline) | 🔄 Planned |
These are handled directly within Databricks, not as separate external repos:
| Component | Location | Notes |
|---|---|---|
| Topic Classification | `openalex-walden/notebooks/topics/` | ML-based topic assignment |
| RecordThresher | `/Shared/recordthresher/` | Crossref data processing |
These repositories may appear relevant but are either deprecated or not used in the current system:
| Repository | Status | Notes |
|---|---|---|
| `openalex-api-proxy` (Python/Heroku) | Legacy | Old proxy; production uses Cloudflare Worker |
| `openalex-grobid` | Legacy | Old REST API wrapper; GROBID now called directly from Databricks |
| `parseland` | Not used | Use `parseland-lib` instead |
| `openalex-guts` | Legacy | Old system, not used in Walden |
| `openalex-topic-classification` | Legacy | Topics now handled natively in Databricks notebooks |
| `sickle` | Low-level | OAI-PMH library; higher-level harvesting is done by `openalex-ingest` |
| Aspect | Details |
|---|---|
| Core Platform | Databricks on AWS (Unity Catalog) |
| Processing | Apache Spark, Delta Live Tables (DLT) |
| Storage | Delta Lake (ACID transactions), S3, Cloudflare R2 |
| ML | PySpark ML, BERT models for topics |
| Search | Elasticsearch |
| API Proxy | Cloudflare Workers (Durable Objects + D1) |
| APIs | openalex-elastic-api, openalex-users-api (Heroku) |
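To make the Processing row concrete, here is a minimal Delta Live Tables sketch in Python; the table names, S3 path, and columns are illustrative assumptions, not the actual Walden pipeline code.

```python
# Minimal DLT sketch (illustrative only): land raw JSON from S3 with
# Auto Loader, then expose a cleaned table with normalized DOIs.
# `spark` is provided by the Databricks runtime in DLT notebooks.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw Crossref records landed from S3 (placeholder path)")
def crossref_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://example-bucket/crossref/landing/")
    )

@dlt.table(comment="Crossref records with lowercased, trimmed DOIs")
def crossref_clean():
    return (
        dlt.read_stream("crossref_raw")
        .withColumn("doi", F.lower(F.trim(F.col("DOI"))))
        .filter(F.col("doi").isNotNull())
    )
```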
For the detailed pipeline DAG with all task dependencies, see: Databricks Overview - Walden End-to-End Pipeline DAG
- External Sources feed data into the system:
  - OAI-PMH repositories (via `openalex-ingest` to S3)
  - Direct API ingestion (Crossref, PubMed, DataCite)
  - PDFs and landing pages (via `taxicab` to Cloudflare R2)
- Content Processing:
  - PDFs → GROBID → text extraction
  - Landing pages → parseland-lib → metadata extraction
- Databricks (Walden) is the processing nexus (`walden_end2end` workflow):
  - Ingestion: 7 parallel DLT pipelines (Crossref, PubMed, DataCite, MAG, PDF, Repos, Landing_Page)
  - Union: Consolidates 6 sources into unified Works tables
  - Enrichment: Super Authorships, Locations, Entity Resolution
  - Works Creation: Final OpenAlex Works records
  - Runs nightly at 10:30 PM UTC
- Production Output:
  - Syncs to Elasticsearch
  - Served via openalex-elastic-api
  - Proxied through Cloudflare Worker (rate limiting, API keys)
  - User management via openalex-users-api
- Unpaywall is a branch of the OpenAlex pipeline (prod only):
  - Takes OpenAlex Works data from the `walden_end2end` workflow
  - `Wunpaywall` task transforms it to the Unpaywall schema
  - Conditional export (only if `env == prod`):
    - `Wunpaywall_Data_Feed` → S3 for data feed subscribers
    - `Wunpaywall_to_OpenAlex_DB` → PostgreSQL for the unpaywall.org API
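As a rough, non-authoritative sketch of that conditional export, the task might look something like the following in PySpark; the `env` widget, table name, bucket, and JDBC details are all assumptions, not the real task code.

```python
# Illustrative sketch of an env-gated Unpaywall export (not the real task).
# `dbutils` and `spark` are Databricks runtime globals.
env = dbutils.widgets.get("env")  # assumed job parameter: "prod" or "dev"

if env == "prod":
    unpaywall = spark.table("openalex.unpaywall.works")  # hypothetical table

    # Data feed for subscribers: write a snapshot to S3 as JSON
    unpaywall.write.mode("overwrite").json("s3://example-unpaywall-feed/snapshot/")

    # unpaywall.org API: load into Postgres via JDBC
    (unpaywall.write.mode("overwrite")
        .format("jdbc")
        .option("url", "jdbc:postgresql://example-host:5432/unpaywall")
        .option("dbtable", "pub")  # hypothetical target table
        .option("user", dbutils.secrets.get("example-scope", "pg-user"))
        .option("password", dbutils.secrets.get("example-scope", "pg-password"))
        .save())
```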
| Repository | Purpose | Platform |
|---|---|---|
| openalex-walden | Core Databricks pipeline code | Databricks |
| openalex-ingest | OAI-PMH harvesting to S3 | AWS Lambda |
| openalex-taxicab | PDF/landing page harvester | AWS ECS |
| parseland-lib | Landing page parsing library | Python (in Databricks) |
| openalex-api-proxy | API rate limiting/auth | Cloudflare Workers |
| openalex-elastic-api | Elasticsearch API | Heroku |
| openalex-users-api | User management API | Heroku |
| oadoi | Unpaywall backend | Heroku |
These are found within the Databricks workspace, not as separate GitHub repos:
- recordthresher - Crossref data processing (notebooks in `/Shared/recordthresher/`)
- topics notebooks - Topic classification ML (in `openalex-walden/notebooks/topics/`)
- openalex_dlt_utils - Custom DLT library for normalization
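The actual `openalex_dlt_utils` code is not reproduced here; purely as an illustration of the kind of normalization helper such a library might provide (the function name and regex are assumptions):

```python
# Hypothetical normalization helper, shown only to illustrate the idea;
# this is not the real openalex_dlt_utils API.
from pyspark.sql import Column, functions as F

def normalize_doi(col: Column) -> Column:
    """Lowercase a DOI and strip common resolver prefixes."""
    doi = F.lower(F.trim(col))
    return F.regexp_replace(doi, r"^https?://(dx\.)?doi\.org/", "")
```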
```mermaid
flowchart TB
USER["User Request<br/>(with API key)"] --> ENDPOINT["api.openalex.org"]
ENDPOINT --> PROXY
subgraph PROXY ["Cloudflare Worker (openalex-api-proxy)"]
DO["Durable Objects<br/>(rate limiting)"]
D1["D1 (API keys)"]
end
PROXY --> ELASTIC["openalex-elastic-api<br/>(Heroku)"]
ELASTIC --> ES["Elasticsearch<br/>(synced from Databricks)"]
subgraph users ["User Management"]
GUI["OpenAlex GUI"]
USERS_API["openalex-users-api<br/>(Heroku)"]
USERS_PG["Users Postgres"]
end
GUI_USER["User"] <--> GUI
GUI <--> USERS_API
USERS_API <--> USERS_PG
USERS_API --> D1
GUI --> ENDPOINT
```
How it works:
- Users create accounts and API keys via the OpenAlex GUI
- The GUI talks to openalex-users-api, which stores user data in Users Postgres
- API keys are synced to D1, which feeds Durable Objects for rate limiting
- When users make API requests (directly or via the GUI), the Cloudflare Worker validates their API key and enforces rate limits
Note: The "polite pool" mentioned in public documentation is not currently implemented. Rate limiting is based on IP address and API key only.
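For reference, a minimal client call through this stack might look like the following; passing the key as an `api_key` query parameter is an assumption here, so confirm the parameter name against the current API documentation.

```python
# Minimal example of calling api.openalex.org through the proxy.
# The "api_key" query parameter is an assumption; verify against the docs.
import requests

resp = requests.get(
    "https://api.openalex.org/works",
    params={"filter": "publication_year:2024", "api_key": "YOUR_API_KEY"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["meta"]["count"])
```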
- Start with Databricks Overview to understand the core system
- Review specific module documentation as needed
- Check the "Gotchas" section in Databricks Overview for common pitfalls
- This documentation provides system context for code generation
- Key patterns are documented in each module's overview
- When modifying code, check the "Related Projects" sections to understand dependencies
- Pay attention to the Red Herrings section to avoid working on deprecated code
This documentation was automatically generated by Claude Code with human steering. No manual writing—just an AI agent exploring the OpenAlex ecosystem with guidance on where to look and what to prioritize.
| Tool | Purpose |
|---|---|
| Claude Chrome Extension | Browser automation for exploring Databricks UI, Heroku dashboards, Cloudflare dashboards, and other web interfaces. Essential when CLIs or MCPs lacked the needed functionality. |
| Databricks MCP | Unity Catalog exploration—listing catalogs, schemas, tables, and querying data. More capable than the official Databricks MCP tools. |
| GitHub CLI (`gh`) | Searching repos, listing organization repositories, cloning repos for local analysis. |
| Heroku CLI | Inspecting Heroku apps, configs, and add-ons for the API services. |
| Wrangler CLI | Exploring Cloudflare Workers configuration for the API proxy. |
| Bash tools (grep, etc.) | Reading and searching through locally-cloned repositories. |
- Chrome extension for code reading: While powerful for UI exploration, scrolling through Databricks notebooks in the browser is slow. For actual code/notebook content, cloning repos locally or using the Databricks MCP is faster.
- MCP installation: The Databricks MCP required some setup effort but was worth it for catalog exploration.
This section is for future developers or AI agents who need to verify or update this documentation.
- Be interactive: Work with a human in the loop. Ask questions when something is unclear or when you need to make judgment calls about what's important.
- Be persistent: Mapping this system takes time. Don't give up when you hit dead ends; follow threads across multiple tools and sources.
- Use all available tools: Different parts of the system are best explored with different tools. Web UIs, CLIs, MCPs, and local code search all have their place.
Databricks (Walden) is the center of everything. The most reliable way to understand which peripheral repositories are actually used—and for what—is to:
- Read the Databricks notebooks and workflows in the `openalex-walden` repo
- Trace the dependencies: When a notebook imports something or calls an external service, follow that thread
- Check the DLT pipelines: These define the actual data flow and show which sources feed into the system
- Look at job configurations: The `walden_end2end` workflow and its tasks reveal the production architecture
Many repositories exist that look relevant but aren't actually used in production. Starting from Databricks and following the threads outward is how you distinguish active components from legacy/deprecated ones.
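One concrete way to start pulling those threads, assuming the `databricks-sdk` Python package is installed and workspace credentials are configured, is to locate the `walden_end2end` job and print its task graph (a sketch, not a verified script):

```python
# Sketch: find the walden_end2end job and print task dependencies.
# Assumes DATABRICKS_HOST / DATABRICKS_TOKEN are set in the environment.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
for job in w.jobs.list():
    name = job.settings.name if job.settings else ""
    if name and "walden_end2end" in name:
        full = w.jobs.get(job_id=job.job_id)
        for task in full.settings.tasks or []:
            deps = [d.task_key for d in (task.depends_on or [])]
            print(f"{task.task_key} <- {deps}")
```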
- Verify Databricks first: Use the Databricks MCP to explore Unity Catalog schemas and tables. Check if the documented tables/pipelines still exist and match the descriptions.
- Check external services: Use the Chrome extension to browse Heroku, Cloudflare, and AWS dashboards. Verify apps are still running and configs match documentation.
- Clone and search repos: Use `gh` to clone relevant repos, then use grep/search to verify code patterns and integrations described in the docs.
- Cross-reference: When documentation claims "X calls Y", verify by finding the actual code that does this.
- Update the Red Herrings section: As repos are deprecated or new ones created, keep this section current to save future auditors time.
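As part of the first step above (verifying Databricks), a quick existence check of documented Unity Catalog tables could look like this; the catalog, schema, and table names are placeholders:

```python
# Sketch: check whether documented Unity Catalog tables still exist.
# Catalog/schema/table names below are placeholders, not the real ones.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
existing = {t.name for t in w.tables.list(catalog_name="openalex", schema_name="works")}
for expected in ("works", "locations", "authorships"):
    print(f"{expected}: {'ok' if expected in existing else 'MISSING'}")
```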
To audit this documentation, you'll want:
```bash
# GitHub CLI (for repo exploration)
brew install gh
gh auth login

# Heroku CLI (for Heroku app inspection)
brew tap heroku/brew && brew install heroku
heroku login

# Wrangler (for Cloudflare Workers)
npm install -g wrangler
wrangler login

# Databricks MCP - see MCP server documentation for setup
# Claude Chrome Extension - install from Chrome Web Store
```

Documentation lives at: github.com/ourresearch/openalex-overview
When updating:
- Keep module documentation in the `/modules/` directory
- Update this index when adding new modules
- Mark deprecated modules in the "Red Herrings" section
- If you discover the documented behavior no longer matches reality, update the docs and note the date
- Primary: OurResearch team
- Databricks Owners: Casey Meyer, Artem Kazmerchuk
Last updated: January 2026