This project processes Open Library data dumps to create optimized Parquet files for analysis. It uses DuckDB for efficient JSON processing and Snakemake for workflow management.
- pixi for environment management
- Install pixi if you haven't already:

      curl -fsSL https://pixi.sh/install.sh | bash

- Set up the pixi environment:

      pixi install
- Download the Open Library works dump from [Open Library Data Dumps](https://openlibrary.org/developers/dumps):
  - Look for the "works dump" (~2.9GB)
  - The file will be named something like `ol_dump_works_YYYY-MM-DD.txt.gz`
  - You may also consider using the torrents provided by the Internet Archive
- Make sure the data is in the correct location, e.g.:

      mkdir -p data/ol_dump_2025-01-08
      mv ol_dump_works_2025-01-08.txt.gz data/ol_dump_2025-01-08/
Run the pipeline:

    pixi run snakemake -c CORES

where `CORES` is the number of cores you want to use.
This will:
- Process the works dump using DuckDB
- Extract relevant fields from the JSON data
- Create an optimized Parquet file at `data/ol_works.parquet`
The pipeline generates a Parquet file (`data/ol_works.parquet`) containing the following fields:
- key
- description
- title
- subtitle
- authors
- location
- first_publish_date
- first_sentence
- subjects
- subject_places
- subject_people
- subject_times
- lc_classifications
- dewey_number
NB: The format of the fields may not be uniform, despite what the Open Library schema claims.
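For example, text fields such as `description` and `first_sentence` sometimes appear as bare strings and sometimes as typed objects like `{"type": "/type/text", "value": "..."}`. A small helper for downstream code might normalize both shapes (the function name and exact behavior are illustrative, not part of the pipeline):

```python
def normalize_text_field(raw):
    """Collapse an Open Library text field to a plain string.

    The dumps mix bare strings with {"type": "/type/text", "value": ...}
    objects for fields like description and first_sentence.
    """
    if isinstance(raw, str):
        return raw
    if isinstance(raw, dict):
        return raw.get("value")
    return None  # missing or unrecognized shape

print(normalize_text_field("a plain description"))  # a plain description
print(normalize_text_field({"type": "/type/text", "value": "a typed one"}))  # a typed one
```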