Technical reference for developers and AI coding agents working on the OpenAlex data pipeline.
This repository documents the complete OpenAlex backend system, including the core Databricks processing pipeline and all external modules that feed into it.
```mermaid
flowchart LR
subgraph sources ["External Data Sources"]
OAI["OAI-PMH Repos<br/>(openalex-ingest → S3)"]
CR["Crossref API"]
PM["PubMed/PMC"]
DC["DataCite"]
PDF["PDFs & HTML<br/>(taxicab → R2)"]
end
subgraph processing ["Content Processing"]
GROBID["GROBID<br/>(AWS ECS)"]
PL["parseland-lib<br/>(landing pages)"]
end
subgraph databricks ["Databricks (Walden)"]
DLT["DLT Pipelines<br/>ML Models<br/>Entity Resolution<br/>PDF Parsing<br/>Landing Pages"]
end
subgraph openalex_api ["OpenAlex API"]
ES["Elasticsearch"]
ELASTIC_API["openalex-elastic-api<br/>(Heroku)"]
PROXY["openalex-api-proxy<br/>(Cloudflare Worker)"]
ENDPOINT["api.openalex.org"]
end
subgraph unpaywall ["Unpaywall"]
UPW_PG["Unpaywall Postgres"]
UPW_API["api.unpaywall.org<br/>(oadoi)"]
end
subgraph webui ["OpenAlex Web UI"]
GUI["OpenAlex GUI"]
end
subgraph users ["User Management"]
USERS_PG["Users Postgres"]
USERS_API["openalex-users-api"]
D1["D1 (API keys)"]
DO["Durable Objects<br/>(rate limiting)"]
end
%% Data ingestion
OAI --> DLT
CR --> DLT
PM --> DLT
DC --> DLT
PDF --> DLT
DLT <--> GROBID
DLT <--> PL
%% OpenAlex API branch
DLT --> ES
ES --> ELASTIC_API
ELASTIC_API --> PROXY
PROXY --> ENDPOINT
ENDPOINT --> API_USER["User"]
%% Unpaywall branch
DLT --> UPW_PG
UPW_PG --> UPW_API
UPW_API --> UPW_USER["User"]
%% Web UI branch
ENDPOINT --> GUI
GUI --> UI_USER["User"]
%% User management flow
GUI <--> USERS_API
USERS_API <--> USERS_PG
USERS_API --> D1
D1 --> DO
DO --> PROXY
```
| Document | Description |
|---|---|
| Databricks Overview | Complete guide to the Walden system on Databricks - pipelines, schemas, workflows, and data flow |
| Document | Description | Status |
|---|---|---|
| openalex-ingest | OAI-PMH repository harvesting and data ingestion to S3 | ✅ Done |
| openalex-taxicab | PDF and landing page download system (ECS + Cloudflare R2) | ✅ Done |
| GROBID | PDF text extraction (AWS ECS) | ✅ Done |
| parseland-lib | Landing page parsing for metadata extraction | ✅ Done |
| openalex-api-proxy | Cloudflare Worker for API auth/rate limiting | ✅ Done |
| openalex-users-api | User management API (Heroku) | ✅ Done |
| openalex-elastic-api | Elasticsearch-backed API for OpenAlex queries | 🔄 Planned |
| Unpaywall | Open access detection (branch of OpenAlex pipeline) | 🔄 Planned |
These are handled directly within Databricks, not as separate external repos:
| Component | Location | Notes |
|---|---|---|
| Topic Classification | `openalex-walden/notebooks/topics/` | ML-based topic assignment |
| RecordThresher | `/Shared/recordthresher/` | Crossref data processing |
These repositories may appear relevant but are either deprecated or not used in the current system:
| Repository | Status | Notes |
|---|---|---|
| `openalex-api-proxy` (Python/Heroku) | Legacy | Old proxy; production uses Cloudflare Worker |
| `openalex-grobid` | Legacy | Old REST API wrapper; GROBID now called directly from Databricks |
| `parseland` | Not used | Use `parseland-lib` instead |
| `openalex-guts` | Legacy | Old system, not used in Walden |
| `openalex-topic-classification` | Legacy | Topics now handled natively in Databricks notebooks |
| `sickle` | Low-level | OAI-PMH library; higher-level harvesting is done by `openalex-ingest` |
| Aspect | Details |
|---|---|
| Core Platform | Databricks on AWS (Unity Catalog) |
| Processing | Apache Spark, Delta Live Tables (DLT) |
| Storage | Delta Lake (ACID transactions), S3, Cloudflare R2 |
| ML | PySpark ML, BERT models for topics |
| Search | Elasticsearch |
| API Proxy | Cloudflare Workers (Durable Objects + D1) |
| APIs | openalex-elastic-api, openalex-users-api (Heroku) |
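To make the Processing row concrete, here is a minimal Delta Live Tables sketch in Python; the table names, S3 path, and columns are illustrative assumptions, not the actual Walden pipeline code.

```python
# Minimal DLT sketch (illustrative only): land raw JSON from S3 with
# Auto Loader, then expose a cleaned table with normalized DOIs.
# `spark` is provided by the Databricks runtime in DLT notebooks.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw Crossref records landed from S3 (placeholder path)")
def crossref_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://example-bucket/crossref/landing/")
    )

@dlt.table(comment="Crossref records with lowercased, trimmed DOIs")
def crossref_clean():
    return (
        dlt.read_stream("crossref_raw")
        .withColumn("doi", F.lower(F.trim(F.col("DOI"))))
        .filter(F.col("doi").isNotNull())
    )
```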
For the detailed pipeline DAG with all task dependencies, see: Databricks Overview - Walden End-to-End Pipeline DAG
- External Sources feed data into the system:
  - OAI-PMH repositories (via `openalex-ingest` to S3)
  - Direct API ingestion (Crossref, PubMed, DataCite)
  - PDFs and landing pages (via `taxicab` to Cloudflare R2)
- Content Processing:
  - PDFs → GROBID → text extraction
  - Landing pages → parseland-lib → metadata extraction
- Databricks (Walden) is the processing nexus (`walden_end2end` workflow):
  - Ingestion: 7 parallel DLT pipelines (Crossref, PubMed, DataCite, MAG, PDF, Repos, Landing_Page)
  - Union: Consolidates 6 sources into unified Works tables
  - Enrichment: Super Authorships, Locations, Entity Resolution
  - Works Creation: Final OpenAlex Works records
  - Runs nightly at 10:30 PM UTC
- Production Output:
  - Syncs to Elasticsearch
  - Served via openalex-elastic-api
  - Proxied through Cloudflare Worker (rate limiting, API keys)
  - User management via openalex-users-api
- Unpaywall is a branch of the OpenAlex pipeline (prod only):
  - Takes OpenAlex Works data from the `walden_end2end` workflow
  - `Wunpaywall` task transforms it to the Unpaywall schema
  - Conditional export (only if `env == prod`):
    - `Wunpaywall_Data_Feed` → S3 for data feed subscribers
    - `Wunpaywall_to_OpenAlex_DB` → PostgreSQL for the unpaywall.org API
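As a rough, non-authoritative sketch of that conditional export, the task might look something like the following in PySpark; the `env` widget, table name, bucket, and JDBC details are all assumptions, not the real task code.

```python
# Illustrative sketch of an env-gated Unpaywall export (not the real task).
# `dbutils` and `spark` are Databricks runtime globals.
env = dbutils.widgets.get("env")  # assumed job parameter: "prod" or "dev"

if env == "prod":
    unpaywall = spark.table("openalex.unpaywall.works")  # hypothetical table

    # Data feed for subscribers: write a snapshot to S3 as JSON
    unpaywall.write.mode("overwrite").json("s3://example-unpaywall-feed/snapshot/")

    # unpaywall.org API: load into Postgres via JDBC
    (unpaywall.write.mode("overwrite")
        .format("jdbc")
        .option("url", "jdbc:postgresql://example-host:5432/unpaywall")
        .option("dbtable", "pub")  # hypothetical target table
        .option("user", dbutils.secrets.get("example-scope", "pg-user"))
        .option("password", dbutils.secrets.get("example-scope", "pg-password"))
        .save())
```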
| Repository | Purpose | Platform |
|---|---|---|
| openalex-walden | Core Databricks pipeline code | Databricks |
| openalex-ingest | OAI-PMH harvesting to S3 | AWS Lambda |
| openalex-taxicab | PDF/landing page harvester | AWS ECS |
| parseland-lib | Landing page parsing library | Python (in Databricks) |
| openalex-api-proxy | API rate limiting/auth | Cloudflare Workers |
| openalex-elastic-api | Elasticsearch API | Heroku |
| openalex-users-api | User management API | Heroku |
| oadoi | Unpaywall backend | Heroku |
These are found within the Databricks workspace, not as separate GitHub repos:
- recordthresher - Crossref data processing (notebooks in `/Shared/recordthresher/`)
- topics notebooks - Topic classification ML (in `openalex-walden/notebooks/topics/`)
- openalex_dlt_utils - Custom DLT library for normalization
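The actual `openalex_dlt_utils` code is not reproduced here; purely as an illustration of the kind of normalization helper such a library might provide (the function name and regex are assumptions):

```python
# Hypothetical normalization helper, shown only to illustrate the idea;
# this is not the real openalex_dlt_utils API.
from pyspark.sql import Column, functions as F

def normalize_doi(col: Column) -> Column:
    """Lowercase a DOI and strip common resolver prefixes."""
    doi = F.lower(F.trim(col))
    return F.regexp_replace(doi, r"^https?://(dx\.)?doi\.org/", "")
```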
```mermaid
flowchart TB
USER["User Request<br/>(with API key)"] --> ENDPOINT["api.openalex.org"]
ENDPOINT --> PROXY
subgraph PROXY ["Cloudflare Worker (openalex-api-proxy)"]
DO["Durable Objects<br/>(rate limiting)"]
D1["D1 (API keys)"]
end
PROXY --> ELASTIC["openalex-elastic-api<br/>(Heroku)"]
ELASTIC --> ES["Elasticsearch<br/>(synced from Databricks)"]
subgraph users ["User Management"]
GUI["OpenAlex GUI"]
USERS_API["openalex-users-api<br/>(Heroku)"]
USERS_PG["Users Postgres"]
end
GUI_USER["User"] <--> GUI
GUI <--> USERS_API
USERS_API <--> USERS_PG
USERS_API --> D1
GUI --> ENDPOINT
```
How it works:
- Users create accounts and API keys via the OpenAlex GUI
- The GUI talks to openalex-users-api, which stores user data in Users Postgres
- API keys are synced to D1, which feeds Durable Objects for rate limiting
- When users make API requests (directly or via the GUI), the Cloudflare Worker validates their API key and enforces rate limits
Note: The "polite pool" mentioned in public documentation is not currently implemented. Rate limiting is based on IP address and API key only.
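For reference, a minimal client call through this stack might look like the following; passing the key as an `api_key` query parameter is an assumption here, so confirm the parameter name against the current API documentation.

```python
# Minimal example of calling api.openalex.org through the proxy.
# The "api_key" query parameter is an assumption; verify against the docs.
import requests

resp = requests.get(
    "https://api.openalex.org/works",
    params={"filter": "publication_year:2024", "api_key": "YOUR_API_KEY"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["meta"]["count"])
```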
- Start with Databricks Overview to understand the core system
- Review specific module documentation as needed
- Check the "Gotchas" section in Databricks Overview for common pitfalls
- This documentation provides system context for code generation
- Key patterns are documented in each module's overview
- When modifying code, check the "Related Projects" sections to understand dependencies
- Pay attention to the Red Herrings section to avoid working on deprecated code
This documentation was automatically generated by Claude Code with human steering. No manual writing—just an AI agent exploring the OpenAlex ecosystem with guidance on where to look and what to prioritize.
| Tool | Purpose |
|---|---|
| Claude Chrome Extension | Browser automation for exploring Databricks UI, Heroku dashboards, Cloudflare dashboards, and other web interfaces. Essential when CLIs or MCPs lacked the needed functionality. |
| Databricks MCP | Unity Catalog exploration—listing catalogs, schemas, tables, and querying data. More capable than the official Databricks MCP tools. |
| GitHub CLI (`gh`) | Searching repos, listing organization repositories, cloning repos for local analysis. |
| Heroku CLI | Inspecting Heroku apps, configs, and add-ons for the API services. |
| Wrangler CLI | Exploring Cloudflare Workers configuration for the API proxy. |
| Bash tools (grep, etc.) | Reading and searching through locally-cloned repositories. |
- Chrome extension for code reading: While powerful for UI exploration, scrolling through Databricks notebooks in the browser is slow. For actual code/notebook content, cloning repos locally or using the Databricks MCP is faster.
- MCP installation: The Databricks MCP required some setup effort but was worth it for catalog exploration.
This section is for future developers or AI agents who need to verify or update this documentation.
- Be interactive: Work with a human in the loop. Ask questions when something is unclear or when you need to make judgment calls about what's important.
- Be persistent: Mapping this system takes time. Don't give up when you hit dead ends; follow threads across multiple tools and sources.
- Use all available tools: Different parts of the system are best explored with different tools. Web UIs, CLIs, MCPs, and local code search all have their place.
Databricks (Walden) is the center of everything. The most reliable way to understand which peripheral repositories are actually used—and for what—is to:
- Read the Databricks notebooks and workflows in the `openalex-walden` repo
- Trace the dependencies: When a notebook imports something or calls an external service, follow that thread
- Check the DLT pipelines: These define the actual data flow and show which sources feed into the system
- Look at job configurations: The `walden_end2end` workflow and its tasks reveal the production architecture
Many repositories exist that look relevant but aren't actually used in production. Starting from Databricks and following the threads outward is how you distinguish active components from legacy/deprecated ones.
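One concrete way to start pulling those threads, assuming the `databricks-sdk` Python package is installed and workspace credentials are configured, is to locate the `walden_end2end` job and print its task graph (a sketch, not a verified script):

```python
# Sketch: find the walden_end2end job and print task dependencies.
# Assumes DATABRICKS_HOST / DATABRICKS_TOKEN are set in the environment.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
for job in w.jobs.list():
    name = job.settings.name if job.settings else ""
    if name and "walden_end2end" in name:
        full = w.jobs.get(job_id=job.job_id)
        for task in full.settings.tasks or []:
            deps = [d.task_key for d in (task.depends_on or [])]
            print(f"{task.task_key} <- {deps}")
```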
- Verify Databricks first: Use the Databricks MCP to explore Unity Catalog schemas and tables. Check if the documented tables/pipelines still exist and match the descriptions.
- Check external services: Use the Chrome extension to browse Heroku, Cloudflare, and AWS dashboards. Verify apps are still running and configs match documentation.
- Clone and search repos: Use `gh` to clone relevant repos, then use grep/search to verify code patterns and integrations described in the docs.
- Cross-reference: When documentation claims "X calls Y", verify by finding the actual code that does this.
- Update the Red Herrings section: As repos are deprecated or new ones created, keep this section current to save future auditors time.
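As part of the first step above (verifying Databricks), a quick existence check of documented Unity Catalog tables could look like this; the catalog, schema, and table names are placeholders:

```python
# Sketch: check whether documented Unity Catalog tables still exist.
# Catalog/schema/table names below are placeholders, not the real ones.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
existing = {t.name for t in w.tables.list(catalog_name="openalex", schema_name="works")}
for expected in ("works", "locations", "authorships"):
    print(f"{expected}: {'ok' if expected in existing else 'MISSING'}")
```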
To audit this documentation, you'll want:
```bash
# GitHub CLI (for repo exploration)
brew install gh
gh auth login

# Heroku CLI (for Heroku app inspection)
brew tap heroku/brew && brew install heroku
heroku login

# Wrangler (for Cloudflare Workers)
npm install -g wrangler
wrangler login

# Databricks MCP - see MCP server documentation for setup
# Claude Chrome Extension - install from Chrome Web Store
```

Documentation lives at: github.com/ourresearch/openalex-overview
When updating:
- Keep module documentation in the `/modules/` directory
- Update this index when adding new modules
- Mark deprecated modules in the "Red Herrings" section
- If you discover the documented behavior no longer matches reality, update the docs and note the date
- Primary: OurResearch team
- Databricks Owners: Casey Meyer, Artem Kazmerchuk
Last updated: January 2026