The @happyvertical/pdf package provides comprehensive tools for working with PDF documents in Node.js environments, combining direct text extraction with intelligent OCR fallback. It focuses on:
- Text Extraction: Direct extraction from text-based PDFs using unpdf
- OCR Integration: Automatic fallback to OCR for image-based/scanned PDFs via @happyvertical/ocr
- Smart Analysis: Document analysis with processing strategy recommendations (text/ocr/hybrid)
- Metadata Extraction: Comprehensive PDF metadata (title, author, dates, encryption status)
- Image Extraction: Extract images for OCR processing or display
- Error Resilience: Graceful handling of corrupted, malformed, or encrypted PDFs
This package is particularly useful for AI agents that need to analyze document content, extract information from diverse PDF formats, and intelligently handle both text-based and image-based documents.
Expert Agent Expertise: When working with this package, always proactively check the latest documentation for foundational libraries (unpdf and @happyvertical/ocr) as they frequently add new features, performance improvements, and language support that can enhance PDF processing solutions.
- Node.js: ✅ Fully implemented with unpdf + OCR integration
- Browser: ⚠️ Planned for future releases (PDF.js provider stubbed but not implemented)
- Environment Detection: ✅ Automatic provider selection based on runtime
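The runtime detection listed above can be pictured with a small sketch. It assumes, as noted later in this document, that detection looks at process.versions.node and the window/document globals. The names RuntimeGlobals, detectEnvironment, and selectProvider are hypothetical and are not part of the package API.

```typescript
// Hypothetical sketch of runtime detection; not the package's actual code.
type RuntimeEnvironment = 'node' | 'browser' | 'unknown';

interface RuntimeGlobals {
  process?: { versions?: { node?: string } };
  window?: unknown;
  document?: unknown;
}

// Accepts the globals as a parameter so the logic can be exercised in isolation.
function detectEnvironment(
  globals: RuntimeGlobals = globalThis as unknown as RuntimeGlobals,
): RuntimeEnvironment {
  if (typeof globals.process?.versions?.node === 'string') return 'node';
  if (globals.window !== undefined && globals.document !== undefined) return 'browser';
  return 'unknown';
}

// Only the Node.js provider is implemented today, so everything else is rejected.
// The error message is an illustrative assumption.
function selectProvider(env: RuntimeEnvironment): 'unpdf' {
  if (env === 'node') return 'unpdf';
  throw new Error(`No PDF provider available for environment: ${env}`);
}
```

Passing the globals in explicitly keeps the check testable and makes it obvious why edge runtimes that expose neither set of globals end up as 'unknown'.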
import { getPDFReader } from '@happyvertical/pdf';
// Get a PDF reader with automatic provider selection
const reader = await getPDFReader();
// Get reader with specific configuration
const configuredReader = await getPDFReader({
provider: 'auto', // 'auto', 'unpdf', 'pdfjs'
enableOCR: true, // Enable OCR fallback
timeout: 30000, // Processing timeout
maxFileSize: 50 * 1024 * 1024 // 50MB limit
});
// NEW: Analyze PDF before processing for optimal strategy
const info = await reader.getInfo('/path/to/document.pdf');
console.log('PDF Analysis:', {
pageCount: info.pageCount,
hasEmbeddedText: info.hasEmbeddedText,
recommendedStrategy: info.recommendedStrategy,
estimatedProcessingTime: info.estimatedProcessingTime
});
// Process based on analysis recommendations
if (info.recommendedStrategy === 'text') {
// Text-based PDF - fast extraction without OCR
const text = await reader.extractText('/path/to/document.pdf', { skipOCRFallback: true });
} else if (info.recommendedStrategy === 'ocr') {
// Image-based PDF - OCR required
const text = await reader.extractText('/path/to/document.pdf'); // Will use OCR
} else {
// Hybrid approach - try text first, OCR fallback
const text = await reader.extractText('/path/to/document.pdf');
}
// Traditional approach (still works, but less efficient)
const text = await reader.extractText('/path/to/document.pdf', {
pages: [1, 2, 3], // Specific pages
mergePages: true, // Merge into single string
preserveFormatting: true // Preserve formatting
});

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Quick document analysis without expensive processing
const info = await reader.getInfo('/path/to/document.pdf');
console.log('Document Analysis:', {
pageCount: info.pageCount,
hasEmbeddedText: info.hasEmbeddedText, // Can extract text directly
hasImages: info.hasImages, // Contains images
recommendedStrategy: info.recommendedStrategy,// 'text', 'ocr', or 'hybrid'
ocrRequired: info.ocrRequired, // Definitely needs OCR
estimatedTextLength: info.estimatedTextLength,// Rough text content size
estimatedProcessingTime: info.estimatedProcessingTime,
title: info.title, // Basic metadata
author: info.author
});
// Make processing decisions based on analysis
if (info.recommendedStrategy === 'text') {
console.log('✅ Text-based PDF - fast extraction available');
const text = await reader.extractText('/path/to/document.pdf', { skipOCRFallback: true });
} else if (info.recommendedStrategy === 'ocr') {
console.log('🔍 Image-based PDF - OCR processing required');
console.log(`Estimated time: ${info.estimatedProcessingTime?.ocrProcessing}`);
const text = await reader.extractText('/path/to/document.pdf');
} else {
console.log('🔄 Hybrid PDF - contains both text and images');
const text = await reader.extractText('/path/to/document.pdf');
}
// Batch processing with smart routing
const pdfFiles = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'];
for (const pdfFile of pdfFiles) {
const info = await reader.getInfo(pdfFile);
console.log(`${pdfFile}: ${info.recommendedStrategy} (${info.pageCount} pages)`);
if (info.recommendedStrategy === 'text') {
// Fast path for text-based PDFs
const text = await reader.extractText(pdfFile, { skipOCRFallback: true });
} else {
// Slower path for image-based PDFs
const text = await reader.extractText(pdfFile);
}
}

The @happyvertical/pdf package supports configuration via environment variables using the HAVE_PDF_{FIELD} naming pattern. Environment variables are automatically loaded and merged with user-provided options, with user options always taking precedence.
| Environment Variable | Type | Description | Example |
|---|---|---|---|
| HAVE_PDF_ENABLE_OCR | boolean | Enable OCR fallback for image-based PDFs | true, false, 1, 0, yes, no |
| HAVE_PDF_TIMEOUT | number | Processing timeout in milliseconds | 30000, 60000 |
| HAVE_PDF_PROVIDER | string | PDF provider to use | unpdf, pdfjs, auto |
| HAVE_PDF_MAX_FILE_SIZE | number | Maximum file size in bytes | 52428800 (50MB) |
import { getPDFReader } from '@happyvertical/pdf';
// Set environment variables (in .env file or shell)
// HAVE_PDF_ENABLE_OCR=true
// HAVE_PDF_TIMEOUT=30000
// HAVE_PDF_PROVIDER=unpdf
// Create reader using environment variables
const reader = await getPDFReader();
// Uses: enableOCR=true, timeout=30000, provider='unpdf'
// Override environment variables with user options
const customReader = await getPDFReader({
timeout: 60000, // Overrides HAVE_PDF_TIMEOUT
provider: 'auto', // Overrides HAVE_PDF_PROVIDER
});
// Uses: enableOCR=true (from env), timeout=60000 (user), provider='auto' (user)
// Boolean environment variables support multiple formats
// HAVE_PDF_ENABLE_OCR=true → true
// HAVE_PDF_ENABLE_OCR=1 → true
// HAVE_PDF_ENABLE_OCR=yes → true
// HAVE_PDF_ENABLE_OCR=false → false
// HAVE_PDF_ENABLE_OCR=0 → false

Configuration is resolved in the following order (highest to lowest priority):
- User-provided options (passed directly to getPDFReader())
- Environment variables (HAVE_PDF_{FIELD})
- Default values (defined in the package)
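The priority order above amounts to a three-layer merge. The sketch below is one way to express it, not the package's implementation: ReaderConfig, ASSUMED_DEFAULTS, and resolveConfig are hypothetical names, and the default values are placeholders borrowed from the examples in this document.

```typescript
// Hypothetical config shape mirroring the options shown in this document.
interface ReaderConfig {
  provider: 'auto' | 'unpdf' | 'pdfjs';
  enableOCR: boolean;
  timeout: number;
  maxFileSize: number;
}

// Assumed defaults for illustration only; the package's real defaults may differ.
const ASSUMED_DEFAULTS: ReaderConfig = {
  provider: 'auto',
  enableOCR: true,
  timeout: 30000,
  maxFileSize: 50 * 1024 * 1024,
};

// User options win over environment values, which win over defaults.
// `??` is used instead of `||` so that an explicit `false` or `0` is respected.
function resolveConfig(
  user: Partial<ReaderConfig>,
  env: Partial<ReaderConfig>,
): ReaderConfig {
  return {
    provider: user.provider ?? env.provider ?? ASSUMED_DEFAULTS.provider,
    enableOCR: user.enableOCR ?? env.enableOCR ?? ASSUMED_DEFAULTS.enableOCR,
    timeout: user.timeout ?? env.timeout ?? ASSUMED_DEFAULTS.timeout,
    maxFileSize: user.maxFileSize ?? env.maxFileSize ?? ASSUMED_DEFAULTS.maxFileSize,
  };
}
```

The nullish-coalescing chain matters for booleans: with `||`, a user passing enableOCR: false would silently fall through to the environment value.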
// Example with priority demonstration
process.env.HAVE_PDF_TIMEOUT = '30000';
process.env.HAVE_PDF_ENABLE_OCR = 'true';
// User option overrides env var
const reader = await getPDFReader({ timeout: 60000 });
// Result: timeout=60000 (user), enableOCR=true (env)

Environment variables are automatically converted to the correct type based on the schema:
- boolean: Accepts true, false, 1, 0, yes, no (case-insensitive)
- number: Converts string to number, logs warning if conversion fails
- string: Used as-is without conversion
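A rough sketch of the conversion rules above follows. The helper names parseEnvBoolean and parseEnvNumber are hypothetical, and returning undefined for an unusable value (so the caller falls back to the default) is an assumption based on the behavior described here.

```typescript
// Hypothetical helpers illustrating the documented conversion rules.

// Accepts true/false/1/0/yes/no in any letter case; anything else is treated as unset.
function parseEnvBoolean(raw: string): boolean | undefined {
  const value = raw.trim().toLowerCase();
  if (value === 'true' || value === '1' || value === 'yes') return true;
  if (value === 'false' || value === '0' || value === 'no') return false;
  return undefined;
}

// Converts a string to a number; warns and returns undefined when that fails,
// so the caller can skip the value and keep its default.
function parseEnvNumber(name: string, raw: string): number | undefined {
  const value = Number(raw);
  if (raw.trim() === '' || !Number.isFinite(value)) {
    console.warn(`Ignoring ${name}: "${raw}" is not a valid number`);
    return undefined;
  }
  return value;
}
```

Returning undefined rather than throwing matches the "skips invalid value, uses default" behavior described for invalid input.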
// Invalid type conversions are handled gracefully
process.env.HAVE_PDF_TIMEOUT = 'not-a-number';
const reader = await getPDFReader();
// Logs warning, skips invalid value, uses default

- Use environment variables for deployment configuration: Set default values per environment (dev, staging, prod)
- Use user options for runtime configuration: Override defaults based on specific use cases
- Validate critical settings: Check capabilities and dependencies after creating reader
- Document environment variables: Add comments in .env.example files
// Good: Deployment-level config via env vars, runtime overrides via options
// .env file:
// HAVE_PDF_ENABLE_OCR=true
// HAVE_PDF_TIMEOUT=30000
// Runtime usage:
const reader = await getPDFReader();
const deps = await reader.checkDependencies();
if (!deps.available) {
console.error('PDF processing unavailable:', deps.error);
}
// Special case: disable OCR for fast processing
const fastReader = await getPDFReader({ enableOCR: false });

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Extract metadata
const metadata = await reader.extractMetadata('/path/to/document.pdf');
console.log(metadata.title, metadata.author, metadata.pageCount);
// Extract images for further processing
const images = await reader.extractImages('/path/to/document.pdf');
console.log(`Found ${images.length} images`);
// Check reader capabilities
const capabilities = await reader.checkCapabilities();
console.log('OCR available:', capabilities.canPerformOCR);
console.log('Supported languages:', capabilities.ocrLanguages);

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Extract images first
const images = await reader.extractImages('/path/to/scanned.pdf');
// Perform OCR on extracted images
const ocrResult = await reader.performOCR(images, {
language: 'eng', // Language code
confidenceThreshold: 60, // Minimum confidence (0-100)
outputFormat: 'text', // 'text', 'json', 'hocr'
improveResolution: true // Enhance image quality
});
console.log('Extracted text:', ocrResult.text);
console.log('Average confidence:', ocrResult.confidence);
console.log('Detailed detections:', ocrResult.detections);

import {
getPDFReader,
getAvailableProviders,
isProviderAvailable,
getProviderInfo
} from '@happyvertical/pdf';
// Check available providers in current environment
const providers = getAvailableProviders();
console.log('Available providers:', providers); // ['unpdf'] in Node.js, [] in other environments
// Check specific provider availability
const isUnpdfAvailable = isProviderAvailable('unpdf');
// Get detailed provider information
const providerInfo = await getProviderInfo('unpdf');
console.log('Provider capabilities:', providerInfo.capabilities);
console.log('Dependencies status:', providerInfo.dependencies);
// Create reader (auto-selects unpdf in Node.js)
const reader = await getPDFReader({ provider: 'auto' }); // Recommended approach
// Force specific provider (Node.js only currently)
try {
const unpdfReader = await getPDFReader({ provider: 'unpdf' });
} catch (error) {
console.error('Provider not available:', error.message);
}

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Extract images first
const images = await reader.extractImages('/path/to/multilingual.pdf');
if (images.length > 0) {
// Process multilingual documents (language support depends on OCR provider)
const result = await reader.performOCR(images, {
language: 'eng', // English - most widely supported
confidenceThreshold: 70
});
console.log('OCR Text:', result.text);
console.log('Confidence:', result.confidence);
// Check available languages through capabilities
const capabilities = await reader.checkCapabilities();
if (capabilities.ocrLanguages) {
console.log('Available OCR languages:', capabilities.ocrLanguages);
}
}

import { getPDFReader } from '@happyvertical/pdf';
try {
const reader = await getPDFReader();
// Check dependencies before processing
const deps = await reader.checkDependencies();
if (!deps.available) {
console.warn('Some dependencies missing:', deps.error);
console.log('Details:', deps.details);
}
// Process with error handling
const text = await reader.extractText('/path/to/document.pdf');
if (!text) {
console.log('No text extracted - may be image-based PDF');
}
} catch (error) {
if (error.name === 'PDFDependencyError') {
console.error('Dependency issue:', error.message);
} else if (error.name === 'PDFUnsupportedError') {
console.error('Unsupported operation:', error.message);
} else {
console.error('General PDF error:', error);
}
}

// These functions are deprecated but maintained for backward compatibility
import {
getPDFReader, // Preferred factory-based entry point (used below)
extractTextFromPDF, // Use reader.extractText() instead
extractImagesFromPDF, // Use reader.extractImages() instead
performOCROnImages, // Use reader.performOCR() instead
checkOCRDependencies // Use reader.checkDependencies() instead
} from '@happyvertical/pdf';
// IMPORTANT: Migrate to new factory-based approach for better features and performance
const reader = await getPDFReader();
const text = await reader.extractText('/path/to/document.pdf');

The package uses a provider pattern with environment-aware selection:
src/
├── index.ts # Main entry point with legacy compatibility
├── shared/
│ ├── types.ts # Comprehensive TypeScript interfaces and error classes
│ ├── base.ts # BasePDFReader abstract class (ENOTSUP pattern)
│ └── factory.ts # getPDFReader() factory with auto-detection
├── node/
│ ├── unpdf.ts # UnpdfProvider - direct PDF processing
│ └── combined.ts # CombinedNodeProvider - unpdf + OCR integration
└── browser/
├── pdfjs.ts # PDF.js provider (planned, not implemented)
├── combined.ts # Browser combined provider (planned)
└── factory.ts # Browser factory (planned)
- BasePDFReader (src/shared/base.ts)
  - Abstract base class for all providers
  - Default implementations throw PDFUnsupportedError
  - Provides helper methods: normalizeSource(), validatePDFData(), normalizePages(), etc.
  - Pattern: Only override methods the provider supports
- UnpdfProvider (src/node/unpdf.ts)
  - Direct PDF processing using unpdf library
  - Handles: text extraction, metadata, image extraction
  - Does NOT support OCR (throws PDFUnsupportedError)
  - Lazy-loads unpdf to minimize bundle size
- CombinedNodeProvider (src/node/combined.ts)
  - Primary Node.js provider - delegates to UnpdfProvider + @happyvertical/ocr
  - Intelligent fallback: tries text extraction first, then OCR if needed
  - extractText() with automatic OCR fallback (unless skipOCRFallback: true)
  - Combines capabilities from both unpdf and OCR providers
- Factory Functions (src/shared/factory.ts)
  - getPDFReader() - Main entry point, auto-detects environment
  - getAvailableProviders() - Returns available providers for current environment
  - isProviderAvailable() - Check specific provider availability
  - getProviderInfo() - Get detailed provider capabilities and status
  - initializeProviders() - Warm up providers (called on module load)
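The fallback behavior described for CombinedNodeProvider can be sketched as follows. This is an illustration under stated assumptions, not the provider's code: the extractors are injected as plain functions, the names needsOCRFallback and extractTextWithFallback are hypothetical, and treating whitespace-only output as "no text" is an assumption.

```typescript
// Hypothetical sketch of "direct extraction first, OCR only when needed".
interface FallbackOptions {
  skipOCRFallback?: boolean;
}

type Extractor = (source: string) => Promise<string | null>;

// Decides whether OCR should run, given the result of direct extraction.
function needsOCRFallback(
  directText: string | null,
  options: FallbackOptions = {},
): boolean {
  const hasText = directText !== null && directText.trim().length > 0;
  return !hasText && options.skipOCRFallback !== true;
}

async function extractTextWithFallback(
  source: string,
  extractDirect: Extractor,
  extractViaOCR: Extractor,
  options: FallbackOptions = {},
): Promise<string | null> {
  const directText = await extractDirect(source);
  if (directText !== null && directText.trim().length > 0) {
    return directText; // Fast path: embedded text was found.
  }
  if (!needsOCRFallback(directText, options)) {
    return null; // Caller opted out of OCR; report "nothing found".
  }
  console.log('No direct text found, attempting OCR fallback...');
  return extractViaOCR(source);
}
```

Injecting the two extractors keeps the decision logic independent of unpdf and the OCR package, which also makes the slow path easy to stub in tests.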
All types are defined in src/shared/types.ts:
- Interfaces: PDFReader, PDFMetadata, PDFImage, PDFInfo, PDFCapabilities, ExtractTextOptions
- Error Classes: PDFError, PDFUnsupportedError, PDFDependencyError
- Re-exports from @happyvertical/ocr: OCROptions, OCRResult, DependencyCheckResult
- unpdf (external, npm)
  - Version: ^1.0.6
  - Purpose: PDF parsing, text extraction, metadata, image extraction
  - Node.js only, lazy-loaded for performance
  - Used by: UnpdfProvider
- @happyvertical/ocr (internal workspace)
  - Version: workspace:*
  - Purpose: OCR processing with multiple provider support
  - Provides: OCR factory, language support, image preprocessing
  - Used by: CombinedNodeProvider
- Node.js 18+ (Node.js 24+ recommended)
- Memory: 2GB+ recommended for OCR processing
- OCR Dependencies: Managed by @happyvertical/ocr (see @happyvertical/ocr documentation)
import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Check all dependencies
const deps = await reader.checkDependencies();
console.log('Available:', deps.available);
console.log('Details:', deps.details); // { unpdf: true, ocr: true, ocrProviders: 1 }
// Check specific capabilities
const capabilities = await reader.checkCapabilities();
console.log('Can extract text:', capabilities.canExtractText);
console.log('Can perform OCR:', capabilities.canPerformOCR);
console.log('OCR languages:', capabilities.ocrLanguages);
// Graceful degradation
if (!capabilities.canPerformOCR) {
console.warn('OCR not available - text-based PDFs only');
}

- Factory Pattern for Provider Selection
  - Always use getPDFReader() - never instantiate providers directly
  - Factory handles environment detection and provider initialization
  - Supports explicit provider selection for testing: { provider: 'unpdf' }
- ENOTSUP Error Pattern (borrowed from @happyvertical/files)
  - Base class methods throw PDFUnsupportedError by default
  - Providers only override methods they support
  - Consistent error handling across all unsupported operations
- Lazy Loading for Performance
  - unpdf is lazy-loaded via dynamic import in loadUnpdf()
  - OCR factory is created but not initialized until first use
  - Minimizes startup time and bundle size
- Intelligent Fallback Strategy
  - CombinedNodeProvider.extractText() tries direct extraction first
  - Falls back to OCR only if no text found and skipOCRFallback !== true
  - Logs fallback attempts for debugging: "No direct text found, attempting OCR fallback..."
- Source Normalization
  - normalizeSource() converts file paths, ArrayBuffer, Uint8Array to Buffer (Node.js)
  - File reading handled in provider overrides (not base class)
  - Validates PDF magic bytes: %PDF-
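The magic-byte validation mentioned under Source Normalization could look roughly like this. The function name hasPdfMagicBytes is hypothetical, and checking only at offset 0 is an assumption; the package's validatePDFData() may be more lenient or perform additional checks.

```typescript
// "%PDF-" as raw bytes: 0x25 '%', 0x50 'P', 0x44 'D', 0x46 'F', 0x2d '-'.
const PDF_MAGIC_BYTES: readonly number[] = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical check: returns true when the data starts with the PDF header.
function hasPdfMagicBytes(data: Uint8Array): boolean {
  if (data.length < PDF_MAGIC_BYTES.length) return false;
  return PDF_MAGIC_BYTES.every((byte, index) => data[index] === byte);
}
```

A check like this is cheap enough to run before handing data to a parser, so obviously wrong inputs (an empty buffer, a ZIP archive) can be rejected with a clear error.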
// Always use typed error classes
import { PDFError, PDFUnsupportedError, PDFDependencyError } from '@happyvertical/pdf';
// Throwing errors
throw new PDFDependencyError('unpdf', 'Failed to load library');
throw new PDFUnsupportedError('extractImages');
throw new PDFError('Invalid PDF data', 'EINVAL');
// Catching errors
try {
const text = await reader.extractText(source);
} catch (error) {
if (error instanceof PDFDependencyError) {
// Handle missing dependencies
} else if (error instanceof PDFUnsupportedError) {
// Handle unsupported operations
} else if (error instanceof PDFError) {
// Handle general PDF errors
}
}
// Graceful degradation - return null/empty instead of throwing
return null; // extractText() when no text found
return []; // extractImages() when no images found

- 1-based indexing (following PDF conventions)
- normalizePages(pages, totalPages) handles validation and filtering
- isValidPageNumber(pageNumber, totalPages) checks validity
- If pages option is undefined, extract all pages
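The page-handling rules above can be sketched in a few lines. These are standalone stand-ins that reuse the helper names from the list for readability; the actual BasePDFReader helpers may differ, for example in whether they de-duplicate or sort the result (this sketch does neither).

```typescript
// Standalone sketch; not the package's implementation.

// A page number is valid when it is an integer in the range 1..totalPages.
function isValidPageNumber(pageNumber: number, totalPages: number): boolean {
  return Number.isInteger(pageNumber) && pageNumber >= 1 && pageNumber <= totalPages;
}

// Returns every page when `pages` is undefined; otherwise drops invalid entries.
function normalizePages(pages: number[] | undefined, totalPages: number): number[] {
  if (pages === undefined) {
    return Array.from({ length: Math.max(0, totalPages) }, (_, index) => index + 1);
  }
  return pages.filter((page) => isValidPageNumber(page, totalPages));
}
```

With this shape, a request such as pages: [0, 1, 2] on a three-page document quietly becomes [1, 2], which is why an off-by-one page list produces less text rather than an error.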
// Controlled by mergePages option in ExtractTextOptions
mergePageTexts(pageTexts, mergePages?: boolean): string
// mergePages: true → pages.join(' ') (continuous reading)
// mergePages: false → pages.join('\n\n') (preserve page boundaries, default)

When adding a new provider (e.g., PDF.js for browsers):
- Extend BasePDFReader in appropriate directory (src/browser/)
- Set protected name = 'provider-name' for error messages
- Override only supported methods (let others throw PDFUnsupportedError)
- Implement normalizeSource() if environment has specific file handling
- Implement checkCapabilities() and checkDependencies()
- Update factory logic in src/shared/factory.ts
- Add environment detection logic
- Update exports in src/index.ts
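The "override only what you support" step in the checklist above is easier to see in miniature. The sketch below uses local stand-ins (SketchUnsupportedError, SketchBaseReader, TextOnlyProvider) rather than the package's real BasePDFReader and PDFUnsupportedError, whose constructors and method signatures may differ; the toy provider simply treats its input as already-decoded text.

```typescript
// Local stand-in for the package's unsupported-operation error.
class SketchUnsupportedError extends Error {
  constructor(operation: string, provider: string) {
    super(`${operation} is not supported by provider "${provider}"`);
    this.name = 'SketchUnsupportedError';
  }
}

// Local stand-in base class: every operation rejects unless a subclass overrides it.
abstract class SketchBaseReader {
  protected abstract name: string;

  async extractText(_source: string): Promise<string | null> {
    throw new SketchUnsupportedError('extractText', this.name);
  }

  async extractImages(_source: string): Promise<string[]> {
    throw new SketchUnsupportedError('extractImages', this.name);
  }
}

// A toy provider that supports text only; extractImages keeps the base behavior.
class TextOnlyProvider extends SketchBaseReader {
  protected name = 'text-only';

  async extractText(source: string): Promise<string | null> {
    const text = source.trim();
    return text.length > 0 ? text : null; // null means "nothing found", not an error
  }
}
```

Because unsupported operations fail in the base class, a new provider never needs to remember to reject them, and callers see one consistent error type for every gap.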
- Lazy Loading
  private unpdf: any = null;
  private async loadUnpdf() {
    if (this.unpdf) return this.unpdf;
    this.unpdf = await import('unpdf');
    return this.unpdf;
  }
- Page Sampling for Analysis
  // getInfo() samples first 3 pages to determine strategy
  const pagesToSample = Math.min(3, pageCount);
  // Scales estimates to full document
- Parallel Processing
  // Process pages sequentially for predictable memory usage
  for (const pageNum of pagesToExtract) {
    const text = await extractPageText(pageNum);
    pageTexts.push(text);
  }
- Direct RGB Data Path
  // UnpdfProvider passes raw RGB data to OCR (no conversion overhead)
  if (image.channels === 3 && image.width && image.height) {
    format = 'rgb'; // OCR recognizes optimal path
  }
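To make the page-sampling item concrete, here is one way the "sample three pages, scale to the full document" estimate might be computed. The function names and the simple average-times-page-count formula are assumptions for illustration; getInfo() may weight or round its estimate differently.

```typescript
const MAX_SAMPLED_PAGES = 3;

// Returns the 1-based page numbers that would be sampled for analysis.
function pagesToSample(pageCount: number): number[] {
  const count = Math.max(0, Math.min(MAX_SAMPLED_PAGES, pageCount));
  return Array.from({ length: count }, (_, index) => index + 1);
}

// Scales the average text length of the sampled pages to the whole document.
function estimateTextLength(sampledPageLengths: number[], pageCount: number): number {
  if (sampledPageLengths.length === 0 || pageCount <= 0) return 0;
  const total = sampledPageLengths.reduce((sum, length) => sum + length, 0);
  const averagePerPage = total / sampledPageLengths.length;
  return Math.round(averagePerPage * pageCount);
}
```

An estimate built this way is only as representative as the first pages: a document with a text-heavy introduction followed by scanned appendices will be overestimated.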
import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
const text = await reader.extractText('/path/to/document.pdf');
if (text) {
console.log(`Extracted ${text.length} characters`);
} else {
console.log('No text found - may be image-based PDF');
}

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Analyze first to determine optimal strategy
const info = await reader.getInfo('/path/to/document.pdf');
console.log(`Strategy: ${info.recommendedStrategy}`); // 'text', 'ocr', or 'hybrid'
console.log(`Pages: ${info.pageCount}`);
console.log(`Has embedded text: ${info.hasEmbeddedText}`);
console.log(`OCR required: ${info.ocrRequired}`);
// Process based on recommendation
if (info.recommendedStrategy === 'text') {
// Fast path - no OCR needed
const text = await reader.extractText('/path/to/document.pdf', { skipOCRFallback: true });
} else if (info.recommendedStrategy === 'ocr') {
// OCR required path
console.log(`Expected time: ${info.estimatedProcessingTime?.ocrProcessing}`);
const text = await reader.extractText('/path/to/document.pdf'); // Will use OCR
} else {
// Hybrid approach
const text = await reader.extractText('/path/to/document.pdf');
}

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Extract images first
const images = await reader.extractImages('/path/to/scanned.pdf');
console.log(`Found ${images.length} images`);
if (images.length > 0) {
// Perform OCR
const ocrResult = await reader.performOCR(images, {
language: 'eng',
confidenceThreshold: 70,
improveResolution: true
});
console.log('OCR Text:', ocrResult.text);
console.log('Confidence:', ocrResult.confidence);
if (ocrResult.confidence < 60) {
console.warn('Low confidence OCR result - review manually');
}
}

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
const pdfFiles = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'];
for (const pdfFile of pdfFiles) {
try {
// Analyze first for routing
const info = await reader.getInfo(pdfFile);
console.log(`${pdfFile}: ${info.recommendedStrategy} (${info.pageCount} pages)`);
// Process with appropriate strategy
if (info.recommendedStrategy === 'text') {
const text = await reader.extractText(pdfFile, { skipOCRFallback: true });
console.log(`✅ ${pdfFile}: ${text?.length || 0} chars (fast path)`);
} else {
const text = await reader.extractText(pdfFile);
console.log(`✅ ${pdfFile}: ${text?.length || 0} chars (OCR path)`);
}
} catch (error) {
console.error(`❌ ${pdfFile}:`, error.message);
continue; // Skip to next document
}
}

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
const metadata = await reader.extractMetadata('/path/to/document.pdf');
console.log('Document Information:');
console.log(` Title: ${metadata.title || 'Unknown'}`);
console.log(` Author: ${metadata.author || 'Unknown'}`);
console.log(` Pages: ${metadata.pageCount}`);
console.log(` Created: ${metadata.creationDate?.toLocaleDateString()}`);
console.log(` Modified: ${metadata.modificationDate?.toLocaleDateString()}`);
console.log(` Encrypted: ${metadata.encrypted ? 'Yes' : 'No'}`);
console.log(` Producer: ${metadata.producer || 'Unknown'}`);

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Check available OCR languages first
const capabilities = await reader.checkCapabilities();
console.log('Available OCR languages:', capabilities.ocrLanguages);
// Process with multiple languages
const images = await reader.extractImages('/path/to/multilingual.pdf');
const result = await reader.performOCR(images, {
language: 'eng+chi_sim+deu', // English + Chinese Simplified + German
confidenceThreshold: 60 // Lower threshold for multi-language
});
console.log('Multilingual text:', result.text);

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Check dependencies before processing
const deps = await reader.checkDependencies();
const caps = await reader.checkCapabilities();
if (!deps.available) {
console.error('PDF processing not available:', deps.error);
process.exit(1);
}
console.log('Available features:');
console.log(` Text extraction: ${caps.canExtractText ? '✅' : '❌'}`);
console.log(` Metadata: ${caps.canExtractMetadata ? '✅' : '❌'}`);
console.log(` Images: ${caps.canExtractImages ? '✅' : '❌'}`);
console.log(` OCR: ${caps.canPerformOCR ? '✅' : '❌'}`);
if (caps.canPerformOCR) {
console.log(` OCR languages: ${caps.ocrLanguages?.join(', ')}`);
} else {
console.warn('⚠️ OCR not available - text-based PDFs only');
}

src/
├── factory.test.ts # Factory function tests
├── extraction.test.ts # Text/image extraction tests
├── metadata.test.ts # Metadata extraction tests
├── ocr-integration.test.ts # OCR integration tests
├── capabilities.test.ts # Capability checking tests
├── error-handling.test.ts # Error scenario tests
└── legacy.test.ts # Backward compatibility tests

npm test # Run all tests
npm run test:watch # Watch mode for development
# Specific test suites
npx vitest run -t "factory" # Factory tests
npx vitest run -t "extraction" # Extraction tests
npx vitest run -t "ocr" # OCR tests
# Extended timeout for OCR tests
npx vitest run --testTimeout 60000

npm run build # Build Node.js bundle
npm run build:watch # Watch mode for development
npm run clean # Clean dist/ and docs/
npm run clean:all # Clean everything including node_modules
npm run dev # Run build:watch + test:watch in parallel

npm run docs # Generate markdown docs to docs/
npm run docs:watch # Watch mode for docs generation

Issue: Factory auto-detection relies on process.versions.node and window/document globals.
// May fail in edge environments (Cloudflare Workers, Deno, etc.)
const reader = await getPDFReader({ provider: 'auto' });
// Better: Explicitly specify provider when environment is known
const nodeReader = await getPDFReader({ provider: 'unpdf' }); // Node.js

Default Behavior: extractText() automatically falls back to OCR if no text found.
// This may trigger expensive OCR unexpectedly
const text = await reader.extractText(imagePDF); // Could take 30+ seconds
// Better: Check with getInfo() first for predictable performance
const info = await reader.getInfo(imagePDF);
if (info.recommendedStrategy === 'ocr') {
console.log('Warning: OCR required, may take time');
}
// Or: Disable OCR fallback explicitly
const directText = await reader.extractText(imagePDF, { skipOCRFallback: true });

PDF Convention: Pages use 1-based indexing.
// WRONG - page 0 doesn't exist
const text = await reader.extractText(pdf, { pages: [0, 1, 2] }); // Page 0 ignored
// CORRECT - pages 1, 2, 3
const text = await reader.extractText(pdf, { pages: [1, 2, 3] });

Pattern: Methods return null or [] for "nothing found" (not errors).
// extractText() returns null when no text found
const text = await reader.extractText(imagePDF);
if (text === null) {
console.log('No text extracted - not an error');
}
// extractImages() returns empty array when no images
const images = await reader.extractImages(textPDF);
if (images.length === 0) {
console.log('No images found - not an error');
}

Issue: Large PDFs are loaded entirely into memory (no streaming yet).
// Risk: 200MB PDF will use 200MB+ RAM
const text = await reader.extractText('huge-document.pdf');
// Better: Check file size first
const info = await reader.getInfo('huge-document.pdf');
if (info.fileSize && info.fileSize > 100 * 1024 * 1024) {
console.warn('Large PDF - may cause memory issues');
}
// Consider: Process page-by-page for very large documents
const firstPages = await reader.extractText('huge-document.pdf', { pages: [1, 2, 3] });

Issue: unpdf has loose type definitions (any in many places).
// Type safety is limited with unpdf internals
private unpdf: any = null; // Can't strongly type unpdf objects
// Workaround: Trust but verify with runtime checks
const pdf = await unpdf.getDocumentProxy(buffer);
if (!pdf || typeof pdf.numPages !== 'number') {
throw new Error('Invalid PDF object from unpdf');
}

Issue: OCR language availability depends on @happyvertical/ocr provider selection.
// Check languages before processing
const capabilities = await reader.checkCapabilities();
console.log('Available:', capabilities.ocrLanguages); // May be ['eng'] or ['eng', 'chi_sim', ...]
// Don't assume language is available
const hasGerman = capabilities.ocrLanguages?.includes('deu');
if (!hasGerman) {
console.warn('German OCR not available');
}

Current Limitation: Password-protected PDFs are not supported.
// Will fail with invalid PDF data
const text = await reader.extractText('encrypted.pdf');
// Better: Check metadata first
const info = await reader.getInfo('document.pdf');
if (info.encrypted) {
console.error('PDF is encrypted - cannot process');
}

Behavior: Page-level errors are logged but don't stop processing.
// If page 2 fails, pages 1 and 3 still processed
const text = await reader.extractText(pdf, { pages: [1, 2, 3] });
// Check console for warnings like: "Failed to extract text from page 2"
// The extracted text will have empty string for page 2
// This maintains page order but may be unexpected

Current Status: Browser providers are stubbed but not implemented.
// This will throw in browser environments
const reader = await getPDFReader(); // Error: Unable to detect environment
// Workaround: Use only in Node.js environments
if (typeof process !== 'undefined' && process.versions?.node) {
const reader = await getPDFReader();
}

Deprecated but Not Removed: Legacy functions still work but should be avoided.
// OLD - Deprecated (still works but don't use in new code)
import { extractTextFromPDF } from '@happyvertical/pdf';
const text = await extractTextFromPDF('/path/to/pdf');
// NEW - Preferred pattern
import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
const text = await reader.extractText('/path/to/pdf');

Behavior: getInfo() only samples first 3 pages for performance.
// getInfo() samples pages 1-3, estimates for full document
const info = await reader.getInfo('1000-page-document.pdf');
console.log(info.estimatedTextLength); // Estimated, not exact
// For 100% accurate text length:
const text = await reader.extractText('document.pdf');
console.log(text?.length); // Exact (text is null when nothing was extracted)

The @happyvertical/pdf package generates comprehensive API documentation in both HTML and markdown formats using TypeDoc:
HTML Documentation (recommended for browsing):
- Generated in docs/ directory for public website
- Full API reference with interactive navigation
- Cross-linked type definitions and examples
- Accessible via development server at http://localhost:3030/
Markdown Documentation (great for development):
- Generated in packages/pdf/docs/ directory
- Markdown format perfect for IDE integration
- Accessible via development server at http://localhost:3030/packages/pdf/
# Generate documentation for this package
npm run docs
# Generate and watch for changes during development
npm run docs:watch
# Start development server to browse documentation
npm run dev # Serves docs at http://localhost:3030

Documentation is automatically generated during the build process and can be viewed alongside development:
- During Development: Use npm run docs:watch to regenerate docs as you code
- Local Browsing: Access HTML docs at http://localhost:3030/ or markdown at http://localhost:3030/packages/pdf/
- IDE Integration: Point your editor to packages/pdf/docs/ for offline markdown reference
The documentation includes complete API coverage, usage examples, and cross-references to related HAVE SDK packages.
Always reference the latest documentation when implementing PDF processing solutions, as these libraries frequently add new features, performance improvements, and language support:
- unpdf: npm Package | GitHub Repository
  - Modern, serverless-optimized PDF processing
  - Check for new extraction features and performance improvements
  - Monitor for additional format support and edge case handling
- @happyvertical/ocr: Internal workspace package providing OCR capabilities
  - Managed through workspace dependencies
  - Supports multiple OCR providers (tesseract.js, EasyOCR)
  - See @happyvertical/ocr package documentation for detailed OCR capabilities
- PDF.js: Official Documentation | GitHub Repository
  - Mozilla's PDF rendering engine (planned for future browser support)
  - Will provide browser-native PDF processing capabilities
When working with @happyvertical/pdf:
- Always check latest documentation before implementing solutions using WebFetch tool
- Verify OCR language support - libraries frequently add new languages
- Review performance optimizations - OCR and PDF processing improvements are ongoing
- Check dependency requirements - system requirements may change with updates
- Monitor breaking changes in major version updates
- Test cross-platform compatibility - ensure solutions work in target environments
Example workflow:
import { getPDFReader } from '@happyvertical/pdf';
// Before implementing OCR solutions, check latest capabilities
const reader = await getPDFReader();
const capabilities = await reader.checkCapabilities();
const supportedLanguages = capabilities.ocrLanguages;
// Extract the images that OCR will run on
const images = await reader.extractImages('/path/to/document.pdf');
// Then implement with current best practices
const result = await reader.performOCR(images, {
  language: 'eng', // Use verified supported language
  confidenceThreshold: 70,
  improveResolution: true
});
This package provides enterprise-grade PDF processing capabilities designed for scalable AI agent workflows across multiple environments.
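The language check in the example workflow can be made explicit before calling performOCR(). The following is a minimal sketch, not part of the package: findUnsupportedLanguages is a hypothetical helper, and it assumes capabilities.ocrLanguages is an array of language codes and that combined languages use the 'eng+chi_sim' form shown in OCROptions.

```typescript
// Hypothetical helper, not exported by @happyvertical/pdf.
// Assumes supported languages arrive as an array of codes such as 'eng'.
function findUnsupportedLanguages(
  requested: string,
  supported: readonly string[],
): string[] {
  const available = new Set(supported);
  return requested
    .split('+') // 'eng+chi_sim' -> ['eng', 'chi_sim']
    .map((code) => code.trim())
    .filter((code) => code.length > 0 && !available.has(code));
}

// Example with made-up capability data
const missing = findUnsupportedLanguages('eng+chi_sim', ['eng', 'deu', 'fra']);
// missing is ['chi_sim']
```

If the returned array is non-empty, it may be safer to fall back to a language that is reported as supported, or to skip OCR, than to pass the unverified code through.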
// Primary factory function
import { getPDFReader } from '@happyvertical/pdf';
// Type definitions
import type {
PDFReader,
PDFMetadata,
PDFImage,
PDFInfo,
PDFCapabilities,
ExtractTextOptions,
OCROptions,
OCRResult
} from '@happyvertical/pdf';
// Error classes
import {
PDFError,
PDFUnsupportedError,
PDFDependencyError
} from '@happyvertical/pdf';
// Utility functions
import {
getAvailableProviders,
isProviderAvailable,
getProviderInfo
} from '@happyvertical/pdf';
const reader = await getPDFReader();
// Document analysis (lightweight, fast)
const info = await reader.getInfo(source);
// Returns: PDFInfo with recommendedStrategy, pageCount, hasEmbeddedText, etc.
// Text extraction (with automatic OCR fallback)
const text = await reader.extractText(source, options); // options is optional
// Returns: string | null
// Metadata extraction
const metadata = await reader.extractMetadata(source);
// Returns: PDFMetadata (always returns object, never null)
// Image extraction
const images = await reader.extractImages(source);
// Returns: PDFImage[] (empty array if no images)
// OCR processing
const ocrResult = await reader.performOCR(images, options); // options is optional
// Returns: OCRResult with text, confidence, detections
// Capability checking
const capabilities = await reader.checkCapabilities();
// Returns: PDFCapabilities
// Dependency validation
const deps = await reader.checkDependencies();
// Returns: DependencyCheckResult
// ExtractTextOptions
{
pages?: number[]; // [1, 2, 3] - 1-based indexing
mergePages?: boolean; // Page separator: true joins with ' ', false (default) joins with '\n\n'
preserveFormatting?: boolean; // Preserve original formatting
includeMetadata?: boolean; // Include metadata in extraction
skipOCRFallback?: boolean; // Disable automatic OCR fallback
}
// OCROptions
{
language?: string; // 'eng', 'eng+chi_sim', etc.
confidenceThreshold?: number; // 0-100, filter low confidence
outputFormat?: string; // 'text', 'json', 'hocr'
improveResolution?: boolean; // Enable image preprocessing
}
// PDFReaderOptions
{
provider?: 'unpdf' | 'pdfjs' | 'auto'; // Default: 'auto'
enableOCR?: boolean; // Default: true
defaultOCROptions?: OCROptions;
maxFileSize?: number; // In bytes
timeout?: number; // In milliseconds
}
1. Start → Call getInfo()
↓
2. Check info.recommendedStrategy
↓
├─ 'text' → Use extractText({ skipOCRFallback: true }) for fast path
│
├─ 'ocr' → Warn about processing time, use extractText() (will use OCR)
│
└─ 'hybrid' → Use extractText() (tries text first, OCR if needed)
3. Handle null/empty results gracefully
↓
4. Check confidence for OCR results (< 60 = low confidence)
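The decision flow above can be sketched as two small pure functions. This is an illustrative sketch under stated assumptions: planExtraction, isLowConfidence, and the local types are hypothetical stand-ins rather than package exports, and confidence is assumed to be on the 0-100 scale used by confidenceThreshold.

```typescript
// Local stand-ins for illustration; the package exports its own PDFInfo and
// ExtractTextOptions types.
type Strategy = 'text' | 'ocr' | 'hybrid';

interface ExtractionPlan {
  options: { skipOCRFallback?: boolean };
  warnSlow: boolean; // true when the caller should expect long OCR processing
}

// Step 2 of the flow: map the recommended strategy to extractText() options.
function planExtraction(recommendedStrategy: Strategy): ExtractionPlan {
  switch (recommendedStrategy) {
    case 'text':
      return { options: { skipOCRFallback: true }, warnSlow: false };
    case 'ocr':
      return { options: {}, warnSlow: true };
    case 'hybrid':
      return { options: {}, warnSlow: false };
  }
}

// Step 4 of the flow: flag OCR results below the low-confidence cutoff.
function isLowConfidence(confidence: number, threshold = 60): boolean {
  return confidence < threshold;
}
```

In use, the plan would be derived from info.recommendedStrategy and its options passed as the second argument to reader.extractText(); a null or empty result (step 3) still needs its own check afterwards.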
packages/pdf/
├── src/
│ ├── index.ts # Main entry + legacy exports
│ ├── shared/
│ │ ├── types.ts # All interfaces & error classes
│ │ ├── base.ts # BasePDFReader abstract class
│ │ └── factory.ts # getPDFReader() + utilities
│ └── node/
│ ├── unpdf.ts # Direct PDF processing
│ └── combined.ts # unpdf + OCR (primary)
├── package.json # Dependencies & scripts
├── vite.config.ts # Build configuration
├── tsconfig.json # TypeScript configuration
├── CLAUDE.md # This file
└── README.md # Public documentation
| Operation | Speed | Memory | Notes |
|---|---|---|---|
| getInfo() | Fast (< 1s) | Low | Samples first 3 pages |
| extractText() (text-based) | Fast (< 5s) | Medium | Direct unpdf extraction |
| extractText() (image-based) | Slow (10-60s+) | High | OCR fallback triggered |
| extractMetadata() | Fast (< 1s) | Low | Lightweight metadata only |
| extractImages() | Medium (1-10s) | Medium | Depends on image count |
| performOCR() | Slow (5-30s+) | High | Depends on image count/size |
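To compare real runs against the figures in this table, an operation can be wrapped in a small timing helper. This is a minimal sketch: timed is a hypothetical helper, not part of the package, and it measures wall-clock time only.

```typescript
// Hypothetical helper, not exported by @happyvertical/pdf.
// Runs an async operation and reports how long it took in milliseconds.
async function timed<T>(
  operation: () => Promise<T>,
): Promise<{ value: T; elapsedMs: number }> {
  const start = Date.now();
  const value = await operation();
  return { value, elapsedMs: Date.now() - start };
}
```

For example, `const { value: info, elapsedMs } = await timed(() => reader.getInfo(source));` would give the analysis result together with its duration, which can then be logged next to the recommended strategy.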
import { getPDFReader, getAvailableProviders } from '@happyvertical/pdf';
// 1. Check provider availability
console.log('Providers:', getAvailableProviders()); // Should include 'unpdf' in Node.js
// 2. Verify dependencies
const reader = await getPDFReader();
const deps = await reader.checkDependencies();
console.log('Available:', deps.available, deps.details);
// 3. Check capabilities
const caps = await reader.checkCapabilities();
console.log('OCR:', caps.canPerformOCR, caps.ocrLanguages);
// 4. Validate PDF file
const info = await reader.getInfo('/path/to/pdf');
console.log('Valid:', info.pageCount > 0);
console.log('Encrypted:', info.encrypted);
// 5. Monitor memory usage
console.log('Memory:', process.memoryUsage());
const text = await reader.extractText(largePDF); // largePDF: placeholder for the PDF source under test
console.log('After extraction:', process.memoryUsage());
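The raw process.memoryUsage() output in step 5 is easier to read as a before/after difference. The sketch below assumes only the rss and heapUsed fields of Node's process.memoryUsage(); memoryDeltaMB and MemorySnapshot are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper. The snapshot shape mirrors the two fields of
// process.memoryUsage() that this sketch uses (both in bytes).
interface MemorySnapshot {
  rss: number;
  heapUsed: number;
}

// Returns the change between two snapshots in megabytes, to one decimal place.
function memoryDeltaMB(before: MemorySnapshot, after: MemorySnapshot) {
  const toMB = (bytes: number) => Math.round((bytes / (1024 * 1024)) * 10) / 10;
  return {
    rssMB: toMB(after.rss - before.rss),
    heapUsedMB: toMB(after.heapUsed - before.heapUsed),
  };
}

// Example with made-up numbers: 50 MB more resident memory, 20 MB more heap
const delta = memoryDeltaMB(
  { rss: 100 * 1024 * 1024, heapUsed: 40 * 1024 * 1024 },
  { rss: 150 * 1024 * 1024, heapUsed: 60 * 1024 * 1024 },
);
// delta is { rssMB: 50, heapUsedMB: 20 }
```

In a debugging session, the two snapshots would come from process.memoryUsage() calls placed immediately before and after the extractText() call.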