GhCristea/rdf-parser

Parse Gutenberg RDF

A streaming parser for Project Gutenberg RDF metadata files. Extracts book and author information from compressed RDF archives and outputs structured NDJSON data.

Features

  • Streaming pipeline that keeps memory usage low while processing the archive
  • Extracts book metadata (title, formats, subjects, downloads, etc.)
  • Extracts author information with deduplication
  • Outputs NDJSON (newline-delimited JSON) for easy line-by-line processing
  • Configurable limit on the number of books processed (MAX_BOOKS)

Installation

npm install

Usage

npm start

The parser will:

  1. Download the RDF archive (if FORCE_DOWNLOAD is enabled or the archive doesn't exist yet)
  2. Extract and parse RDF files from the compressed archive
  3. Write book data to output/books.ndjson
  4. Write author data to output/authors.ndjson
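
The steps above can be sketched roughly as follows. This is a minimal illustration, not the parser's actual code: the Book and Author shapes, the field names, and both function names are assumptions made for the example. It shows the download decision from step 1, and how books could be capped at a limit while each author is emitted only once.

```typescript
// Hypothetical record shapes; the real fields are defined by the parser.
interface Author {
  id: string;
  name: string;
}

interface Book {
  id: number;
  title: string;
  authors: Author[];
}

// Step 1: download only when forced or when no archive is present.
function shouldDownload(forceDownload: boolean, archiveExists: boolean): boolean {
  return forceDownload || !archiveExists;
}

// Steps 3-4: turn parsed books into NDJSON lines, stopping at maxBooks and
// emitting each author once (keyed by the assumed `id` field).
function toNdjsonLines(
  books: Book[],
  maxBooks: number
): { bookLines: string[]; authorLines: string[] } {
  const seenAuthors = new Set<string>();
  const bookLines: string[] = [];
  const authorLines: string[] = [];

  for (const book of books) {
    if (bookLines.length >= maxBooks) break;

    bookLines.push(
      JSON.stringify({
        id: book.id,
        title: book.title,
        authorIds: book.authors.map((a) => a.id),
      })
    );

    for (const author of book.authors) {
      if (seenAuthors.has(author.id)) continue; // already written
      seenAuthors.add(author.id);
      authorLines.push(JSON.stringify(author));
    }
  }

  return { bookLines, authorLines };
}
```

A streaming implementation would write each line to its output file as soon as it is produced rather than collecting arrays; the arrays here only keep the sketch short and easy to test.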

Output

  • output/books.ndjson - One book per line in NDJSON format
  • output/authors.ndjson - One author per line in NDJSON format
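
Because each line is a standalone JSON document, the output files can be consumed with very little code. The helper below is a minimal sketch, and parseNdjson is a hypothetical name, not something the project exports. It makes no assumption about which fields a record contains.

```typescript
// Parse NDJSON text: one JSON document per non-empty line.
function parseNdjson(text: string): unknown[] {
  const records: unknown[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate a trailing newline or blank lines
    records.push(JSON.parse(trimmed));
  }
  return records;
}
```

For large files, reading line by line (for example with Node's readline module) avoids loading the whole file into memory before parsing.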

Configuration

Edit src/config.ts to customize:

  • MAX_BOOKS - Maximum number of books to process (default: 1200)
  • FORCE_DOWNLOAD - Force re-download of archive (default: false)
  • TO_DIR - Output directory (default: ./output)
  • DW_DIR - Download directory (default: ./download)
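
As a rough illustration, the options above could be expressed as shown below. Only the four option names and their defaults come from this README. The ParserConfig interface, the defaultConfig object, and the withOverrides helper are assumptions made for the example, and the actual src/config.ts may be laid out differently.

```typescript
// Hypothetical layout; only the option names and defaults are from the README.
interface ParserConfig {
  MAX_BOOKS: number; // maximum number of books to process
  FORCE_DOWNLOAD: boolean; // re-download the archive even if it exists
  TO_DIR: string; // output directory
  DW_DIR: string; // download directory
}

const defaultConfig: ParserConfig = {
  MAX_BOOKS: 1200,
  FORCE_DOWNLOAD: false,
  TO_DIR: "./output",
  DW_DIR: "./download",
};

// Illustrative helper: apply a few overrides without mutating the defaults.
function withOverrides(overrides: Partial<ParserConfig>): ParserConfig {
  return { ...defaultConfig, ...overrides };
}
```

For instance, withOverrides({ MAX_BOOKS: 50 }) would give a configuration for a quick test run that leaves the directories at their defaults.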
