Open Library Data Processing

This project processes Open Library data dumps to create optimized Parquet files for analysis. It uses DuckDB for efficient JSON processing and Snakemake for workflow management.

Prerequisites

  • pixi for environment management

Setup

  1. Install pixi if you haven't already:

    curl -fsSL https://pixi.sh/install.sh | bash
  2. Set up the pixi environment:

    pixi install

Getting the Data

  1. Download the Open Library works dump from the Open Library Data Dumps page (https://openlibrary.org/developers/dumps)

    • Look for the "works dump" (~2.9 GB)
    • The file will be named something like ol_dump_works_YYYY-MM-DD.txt.gz
    • You may also use the torrents provided by the Internet Archive, or script the download as sketched after this list.
  2. Place the dump in the expected location, e.g.:

    mkdir -p data/ol_dump_2025-01-08
    mv ol_dump_works_2025-01-08.txt.gz data/ol_dump_2025-01-08/
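
If you prefer to script the download, here is a minimal sketch using only the Python standard library. The "latest" URL is an assumption based on Open Library's published naming scheme; check the dumps page for the current file and rename the date-stamped destination to match the dump you actually fetch.

    # Sketch: stream the works dump to disk. The file is ~2.9 GB, so
    # avoid reading it into memory. URL and paths are assumptions.
    import urllib.request
    from pathlib import Path

    url = "https://openlibrary.org/data/ol_dump_works_latest.txt.gz"
    dest = Path("data/ol_dump_2025-01-08/ol_dump_works_2025-01-08.txt.gz")
    dest.parent.mkdir(parents=True, exist_ok=True)

    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        while chunk := resp.read(1 << 20):  # 1 MiB at a time
            out.write(chunk)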

Running the Pipeline

pixi run snakemake -c CORES

where CORES is the number of cores to use (e.g., pixi run snakemake -c 4).

This will:

  • Process the works dump using DuckDB
  • Extract relevant fields from the JSON data
  • Create an optimized Parquet file at data/ol_works.parquet
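
For illustration, here is a minimal sketch of the kind of extraction the DuckDB step performs, using the duckdb Python package. The column layout is an assumption based on the standard Open Library dump format (tab-separated lines with the record JSON in the fifth column), and only a few of the output fields are shown; the pipeline's actual SQL lives in the Snakemake rules.

    # Sketch only, not the pipeline's actual SQL. Assumes the standard
    # dump layout: tab-separated lines, record JSON in the fifth column.
    import duckdb

    duckdb.sql("""
        COPY (
            SELECT
                json->>'$.key'      AS key,
                json->>'$.title'    AS title,
                json->>'$.subjects' AS subjects
            FROM read_csv(
                'data/ol_dump_2025-01-08/ol_dump_works_2025-01-08.txt.gz',
                delim = '\t',
                header = false,
                quote = '',
                columns = {
                    'type': 'VARCHAR', 'key': 'VARCHAR', 'revision': 'VARCHAR',
                    'last_modified': 'VARCHAR', 'json': 'JSON'
                }
            )
        ) TO 'data/ol_works.parquet' (FORMAT parquet)
    """)

DuckDB reads the gzipped file directly and writes the Parquet output in one pass, which is what makes it practical for a dump of this size.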

Output

The pipeline generates a Parquet file (data/ol_works.parquet) containing the following fields extracted from each work record:

  • key
  • description
  • title
  • subtitle
  • authors
  • location
  • first_publish_date
  • first_sentence
  • subjects
  • subject_places
  • subject_people
  • subject_times
  • lc_classifications
  • dewey_number

NB: The format of the fields may not be uniform across records, despite what the Open Library schema claims.
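
Because of that, it can help to inspect the output before building analyses on it. A quick way is the duckdb Python package (a sketch, assuming the default output path):

    # Sketch: inspect the generated Parquet file with DuckDB.
    import duckdb

    # Column names and the types DuckDB inferred for them.
    duckdb.sql("DESCRIBE SELECT * FROM 'data/ol_works.parquet'").show()

    # Spot-check a field whose shape may vary between records.
    duckdb.sql("""
        SELECT key, title, first_publish_date
        FROM 'data/ol_works.parquet'
        LIMIT 5
    """).show()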
