Document processing with hierarchical structure. Currently supports PDF documents with text extraction, automatic document management system detection (WordPress Download Manager, CivicWeb, DocuShare), and file caching. Uses `@happyvertical/spider` for web page analysis and `@happyvertical/pdf` for PDF text extraction.
```bash
npm install @happyvertical/documents
# or
pnpm add @happyvertical/documents
```

Published to GitHub Packages (`npm.pkg.github.com`). Requires `@happyvertical/files`, `@happyvertical/pdf`, `@happyvertical/spider`, and `@happyvertical/utils` as workspace dependencies.
```ts
import { fetchDocument } from '@happyvertical/documents';

// Process a local PDF
const doc = await fetchDocument('file:///path/to/report.pdf');
for (const part of doc.parts) {
  console.log(part.title);
  console.log(part.content);
}

// Fetch a remote PDF (auto-detected from URL extension)
const remote = await fetchDocument('https://example.com/report.pdf');
console.log(remote.parts[0].content);
```

When fetching web URLs, the package uses `@happyvertical/spider` to detect document management systems and extract direct PDF links:
```ts
// WordPress Download Manager URL — spider detects the PDF link automatically
const doc = await fetchDocument(
  'https://example.com/download/meeting-minutes/',
  { scraper: 'basic', spider: 'dom' }
);
```

Override MIME type detection when the URL gives no hint:

```ts
const doc = await fetchDocument('https://example.com/download?id=123', {
  type: 'application/pdf',
});
```

Configure caching of downloaded files:

```ts
const doc = await fetchDocument('https://example.com/report.pdf', {
  cacheDir: './my-cache',
  cache: true,
  cacheExpiry: 600_000, // 10 minutes
});
```

`fetchDocument(url, options?)` is the main factory function. It detects the document format, selects the appropriate processor, and returns structured content.
- `url`: `string` — Document URL or file path (`file://`, `http://`, `https://`)
- `options`: `FetchDocumentOptions` — see below
- Returns `Promise<Document>`
- Throws if no processor is available for the detected MIME type
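The returned `Document` holds a recursive tree of parts (see the interfaces below). As an illustration, here is a minimal sketch of flattening that tree; the `DocumentPart` shape is restated locally so the snippet stands alone, and `collectTitles` is a hypothetical helper, not part of the package:

```typescript
// Minimal local restatement of the package's DocumentPart shape.
interface DocumentPart {
  id: string;
  title: string;
  content: string;
  parts?: DocumentPart[];
}

// Hypothetical helper: depth-first flatten of a nested parts tree
// into a flat list of section titles.
function collectTitles(parts: DocumentPart[]): string[] {
  const titles: string[] = [];
  for (const part of parts) {
    titles.push(part.title);
    if (part.parts) titles.push(...collectTitles(part.parts));
  }
  return titles;
}
```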
| Option | Type | Default | Description |
|---|---|---|---|
| `type` | `string` | auto-detected | Override MIME type detection |
| `extractImages` | `boolean` | `true` | Extract images from document (stub — currently returns `[]`) |
| `runOcr` | `boolean` | `true` for PDFs | Run OCR on extracted images (stub) |
| `cacheDir` | `string` | OS temp dir | Directory for caching downloaded files |
| `cache` | `boolean` | `true` | Enable/disable spider fetch caching |
| `cacheExpiry` | `number` | `300000` | Cache expiry in milliseconds |
| `scraper` | `'basic' \| 'crawlee'` | `'basic'` | Scraper type for content extraction |
| `spider` | `'simple' \| 'dom' \| 'crawlee'` | `'dom'` | Spider adapter for fetching web pages |
| `headers` | `Record<string, string>` | — | Custom HTTP headers for spider requests |
| `timeout` | `number` | `30000` | Request timeout in milliseconds |
| `maxDuration` | `number` | — | Max scraping time in milliseconds |
| `maxInteractions` | `number` | — | Max interactions for advanced scrapers |
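The cache options interact simply: a cached fetch is reused while it is younger than `cacheExpiry`. A sketch of that policy, using the defaults from the table above (this is an illustration, not the package's actual code):

```typescript
// Illustrative only: whether a URL must be fetched again, given the
// cache/cacheExpiry options. Reuses the cached copy while it is
// younger than cacheExpiryMs.
function shouldRefetch(
  cacheEnabled: boolean,
  cachedAtMs: number | undefined,
  nowMs: number,
  cacheExpiryMs = 300_000, // default from the table above
): boolean {
  if (!cacheEnabled || cachedAtMs === undefined) return true;
  return nowMs - cachedAtMs >= cacheExpiryMs;
}
```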
Base document handler. Manages downloading, caching, and local file path resolution. Used internally by processors; can also be used directly via `Document.create(url, options)`.
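For illustration, a downloaded URL might map to a deterministic local cache path like this (a sketch assuming hash-based naming; `cachePathFor` is hypothetical and the real `Document` class may resolve paths differently):

```typescript
import { createHash } from 'node:crypto';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical: derive a stable cache file path from a URL by hashing
// the URL and preserving its file extension.
function cachePathFor(url: string, cacheDir: string = tmpdir()): string {
  const hash = createHash('sha256').update(url).digest('hex').slice(0, 16);
  const ext = new URL(url).pathname.match(/\.[a-z0-9]+$/i)?.[0] ?? '';
  return join(cacheDir, `${hash}${ext}`);
}
```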
Implements `DocumentProcessor`. Extracts text from PDF files, validates PDF headers (detects HTML cache poisoning), and caches processed results.
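Header validation can be as simple as checking the `%PDF-` magic bytes: a cached HTML error page starts with `<` or whitespace and fails the check. A hedged sketch (`looksLikePdf` is illustrative, not the package's API):

```typescript
// A real PDF begins with the bytes "%PDF-"; poisoned cache entries
// containing HTML do not, so this check catches them.
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = '%PDF-';
  if (bytes.length < magic.length) return false;
  for (let i = 0; i < magic.length; i++) {
    if (bytes[i] !== magic.charCodeAt(i)) return false;
  }
  return true;
}
```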
Extracts a human-readable title from a URL by parsing the filename, removing extensions, and decoding URL-encoded characters.
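One plausible shape for such a helper (a sketch only; the name `titleFromUrl` and the exact normalization rules are assumptions, not the package's implementation):

```typescript
// Hypothetical: turn a URL's filename into a readable title by
// decoding percent-escapes, dropping the extension, and replacing
// dashes/underscores with spaces.
function titleFromUrl(url: string): string {
  const segments = new URL(url).pathname.split('/').filter(Boolean);
  const filename = segments.pop() ?? '';
  const decoded = decodeURIComponent(filename);
  const noExt = decoded.replace(/\.[a-z0-9]+$/i, '');
  return noExt.replace(/[-_]+/g, ' ').trim();
}
```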
```ts
interface Document {
  url: string;
  type: string;
  parts: DocumentPart[];
  metadata?: Record<string, any>;
}

interface DocumentPart {
  id: string;
  title: string;
  content: string;
  type: 'text' | 'html' | 'markdown';
  images?: DocumentImage[];
  metadata?: Record<string, any>;
  parts?: DocumentPart[];
}

interface DocumentImage {
  id: string;
  url: string;
  localPath?: string;
  altText?: string;
  ocrText?: string;
  position?: number;
  metadata?: { width?: number; height?: number; format?: string };
}

interface DocumentProcessor {
  process(url: string, options?: FetchDocumentOptions): Promise<Document>;
  supports(type: string): boolean;
}
```

License: MIT