Parse Gutenberg RDF

A streaming parser for Project Gutenberg RDF metadata files. Extracts book and author information from compressed RDF archives and outputs structured NDJSON data.

Features

Streaming processing pipeline for efficient memory usage
Extracts book metadata (title, formats, subjects, downloads, etc.)
Extracts author information with deduplication
Outputs NDJSON format for easy processing
Configurable processing limits
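
The deduplication and NDJSON features can be illustrated with a minimal sketch. The `Author` shape, the `AuthorDeduper` class, and the helper names below are hypothetical and are not taken from the repository; the sketch assumes each author carries a stable identifier.

```typescript
// Hypothetical record shape; the real parser's fields may differ.
interface Author {
  id: string; // assumed stable identifier for the author
  name: string;
}

// Remembers which author ids have been seen so each author is emitted once.
// Only the ids are retained, not the full records.
class AuthorDeduper {
  private readonly seen = new Set<string>();

  // Returns true the first time an id is offered, false on repeats.
  accept(author: Author): boolean {
    if (this.seen.has(author.id)) return false;
    this.seen.add(author.id);
    return true;
  }
}

// Serializes one record as a single NDJSON line (JSON plus a trailing newline).
function toNdjsonLine(record: unknown): string {
  return JSON.stringify(record) + "\n";
}

// Emits one NDJSON line per distinct author, in first-seen order.
function dedupeToNdjson(authors: Author[]): string {
  const deduper = new AuthorDeduper();
  let out = "";
  for (const author of authors) {
    if (deduper.accept(author)) out += toNdjsonLine(author);
  }
  return out;
}
```

Keeping only a set of ids fits a streaming pipeline, because an author record can be written out as soon as it is first seen.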

Installation

npm install

Usage

npm start

The parser will:

Download the RDF archive (if FORCE_DOWNLOAD is enabled or the archive does not exist yet)
Extract and parse RDF files from the compressed archive
Write book data to output/books.ndjson
Write author data to output/authors.ndjson
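
The download condition in the first step can be sketched as a small predicate. The function name `shouldDownload` is hypothetical; in practice the second argument would presumably come from a file-existence check (for example Node's `fs.existsSync`) on the archive path, which is not shown in this README.

```typescript
// Decides whether the archive needs to be fetched.
// forceDownload: the FORCE_DOWNLOAD setting.
// archiveExists: whether a previously downloaded archive is already on disk.
function shouldDownload(forceDownload: boolean, archiveExists: boolean): boolean {
  // Download when forced, or when there is nothing to reuse.
  return forceDownload || !archiveExists;
}
```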

Output

output/books.ndjson - One book per line in NDJSON format
output/authors.ndjson - One author per line in NDJSON format
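
Because each line is an independent JSON document, the output files can be consumed line by line. The following is a minimal sketch of a reader; the `parseNdjson` helper and the `Book` fields (`id`, `title`) are assumptions for illustration, not the parser's documented schema.

```typescript
// Hypothetical subset of a book record; the real output has more fields.
interface Book {
  id: string;
  title: string;
}

// Parses NDJSON text into an array of records, skipping blank lines.
function parseNdjson<T>(text: string): T[] {
  const records: T[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate a trailing newline or empty lines
    records.push(JSON.parse(trimmed) as T);
  }
  return records;
}

// Example input in the shape of two lines from output/books.ndjson.
const sample = '{"id":"1","title":"First Book"}\n{"id":"2","title":"Second Book"}\n';
const books = parseNdjson<Book>(sample);
```

This version loads the whole text into memory, which is fine for small samples. For a full output file, reading it as a stream one line at a time would be the more memory-friendly choice.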

Configuration

Edit src/config.ts to customize:

MAX_BOOKS - Maximum number of books to process (default: 1200)
FORCE_DOWNLOAD - Force re-download of archive (default: false)
TO_DIR - Output directory (default: ./output)
DW_DIR - Download directory (default: ./download)
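
As a rough sketch, a config file with the defaults listed above might look like the following. The option names and default values come from this README; the use of individual `export const` declarations is an assumption, and the actual src/config.ts may be organized differently.

```typescript
// Sketch of src/config.ts using the documented defaults.
// The export style is assumed, not confirmed by the repository.

// Maximum number of books to process.
export const MAX_BOOKS = 1200;

// Set to true to re-download the archive even if one already exists.
export const FORCE_DOWNLOAD = false;

// Directory where books.ndjson and authors.ndjson are written.
export const TO_DIR = "./output";

// Directory where the downloaded archive is stored.
export const DW_DIR = "./download";
```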

GhCristea/rdf-parser

Folders and files

Latest commit

History

Repository files navigation

Parse Gutenberg RDF

Features

Installation

Usage

Output

Configuration

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages