A streaming parser for Project Gutenberg RDF metadata files. Extracts book and author information from compressed RDF archives and outputs structured NDJSON data.
- Streaming processing pipeline for efficient memory usage
- Extracts book metadata (title, formats, subjects, downloads, etc.)
- Extracts author information with deduplication
- Outputs NDJSON format for easy processing
- Configurable processing limits
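As an illustration of the deduplication and NDJSON output features, here is a minimal sketch. The `Author` shape, the `id` key, and the helper names are assumptions made for the example, not the parser's actual types.

```typescript
// Hypothetical record shape; the real parser's author fields may differ.
interface Author {
  id: string; // assumed unique key for an author
  name: string;
}

// Remembers which author ids were already emitted so each author is written once.
class AuthorDeduplicator {
  private readonly seen = new Set<string>();

  // Returns true the first time an id is offered, false on every repeat.
  accept(author: Author): boolean {
    if (this.seen.has(author.id)) return false;
    this.seen.add(author.id);
    return true;
  }
}

// NDJSON: one JSON object per line, each terminated by a newline.
function toNdjsonLine(record: unknown): string {
  return JSON.stringify(record) + "\n";
}

// Serializes authors to NDJSON, skipping ids that were already seen.
function dedupeToNdjson(authors: Author[]): string {
  const dedup = new AuthorDeduplicator();
  let out = "";
  for (const author of authors) {
    if (dedup.accept(author)) out += toNdjsonLine(author);
  }
  return out;
}
```

In a streaming pipeline, each line would be written to the output stream as it is produced rather than accumulated in a string; the string is used here only to keep the sketch short.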
Install dependencies with `npm install`, then run the parser with `npm start`. The parser will:
- Download the RDF archive (if `FORCE_DOWNLOAD` is enabled or the archive doesn't exist)
- Extract and parse RDF files from the compressed archive
- Write book data to `output/books.ndjson`
- Write author data to `output/authors.ndjson`
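The download condition in the first step can be sketched as a small pure function. The function name and its parameters are hypothetical; the actual implementation may make this decision differently.

```typescript
// Hypothetical helper: download when forced, or when no local archive exists yet.
function shouldDownload(forceDownload: boolean, archiveExists: boolean): boolean {
  return forceDownload || !archiveExists;
}
```

A caller could pass the result of a file-existence check (for example Node's `fs.existsSync` on the archive path) as `archiveExists`.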
Output files:
- `output/books.ndjson` - One book per line in NDJSON format
- `output/authors.ndjson` - One author per line in NDJSON format
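Because each line is an independent JSON document, the output can be consumed with a few lines of code. The sketch below parses NDJSON text that is already in memory; `parseNdjson` is a hypothetical helper, and the record fields depend on what the parser actually writes.

```typescript
// Parses NDJSON text into an array of records, skipping blank lines.
function parseNdjson<T = unknown>(text: string): T[] {
  const records: T[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate a trailing newline or empty lines
    records.push(JSON.parse(trimmed) as T);
  }
  return records;
}
```

For large output files, reading line by line (for example with Node's `readline` module) avoids loading the whole file at once.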
Edit `src/config.ts` to customize:
- `MAX_BOOKS` - Maximum number of books to process (default: 1200)
- `FORCE_DOWNLOAD` - Force re-download of archive (default: false)
- `TO_DIR` - Output directory (default: `./output`)
- `DW_DIR` - Download directory (default: `./download`)
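The options above could be modeled as follows. This is only a sketch: the option names and defaults come from the list above, but the `ParserConfig` interface, the `defaultConfig` object, and the `takeLimit` helper are assumptions, and the real `src/config.ts` may be structured differently.

```typescript
// Hypothetical typed view of the documented options.
interface ParserConfig {
  MAX_BOOKS: number;
  FORCE_DOWNLOAD: boolean;
  TO_DIR: string;
  DW_DIR: string;
}

// Defaults as documented in this README.
const defaultConfig: ParserConfig = {
  MAX_BOOKS: 1200,
  FORCE_DOWNLOAD: false,
  TO_DIR: "./output",
  DW_DIR: "./download",
};

// One way a limit such as MAX_BOOKS might be applied lazily:
// stop pulling from the source as soon as `max` items have been yielded.
function* takeLimit<T>(items: Iterable<T>, max: number): Generator<T> {
  if (max <= 0) return;
  let count = 0;
  for (const item of items) {
    yield item;
    count += 1;
    if (count >= max) return;
  }
}
```

Stopping at the source, rather than filtering afterwards, means entries beyond the limit are never parsed.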