A powerful Python library for advanced PDF processing with focus on image extraction and conversion capabilities.
- PDF to PNG Conversion: Convert PDF pages to high-quality PNG images with customizable DPI
- Image Extraction: Extract individual images from PDF pages using advanced computer vision algorithms
- Flood Fill Algorithm: Intelligent image detection and separation using morphological operations
- Size and Aspect Ratio Filtering: Filter extracted images based on customizable criteria
- Base64 Encoding: Convert PDF pages to base64-encoded PNG strings for web applications
- Command Line Interface: Easy-to-use CLI for batch processing and automation
- Type Safety: Full type hints and Pydantic validation for robust data handling
- Comprehensive Testing: Extensive test suite with 28+ test cases
pip install pdfalchemygit clone https://github.com/jainparul9814/pdfalchemy.git
cd pdfalchemy
# Install base dependencies
pip install -r requirements.txt
# Install in development mode
pip install -e .
# Install development dependencies (optional)
pip install -r requirements-dev.txtfrom pdfalchemy import PDFProcessor, PNGConversionInput, ImageExtractionInput
# Initialize processor
processor = PDFProcessor()
# Convert PDF to PNG
with open("document.pdf", "rb") as f:
pdf_bytes = f.read()
png_input = PNGConversionInput(
pdf_bytes=pdf_bytes,
dpi=300, # High resolution
first_page=1,
last_page=5
)
png_result = processor.to_png(png_input)
print(f"Converted {png_result.total_pages} pages")
# Extract images from PNG
for i, png_bytes in enumerate(png_result.png_images):
extraction_input = ImageExtractionInput(
png_bytes=png_bytes,
min_width=50,
min_height=50,
flood_fill_threshold=0.2,
noise_reduction=True
)
extraction_result = processor.extract_images_from_png(extraction_input)
print(f"Page {i+1}: Extracted {extraction_result.total_images} images")# Convert PDF to PNG images
pdfalchemy to-png document.pdf --output ./images/ --dpi 300
# Convert specific pages (range, list, or single page)
pdfalchemy to-png document.pdf --pages 1-5 --dpi 200
pdfalchemy to-png document.pdf --pages 1,3,5 --dpi 200
pdfalchemy to-png document.pdf --pages 3 --dpi 200
# Convert to base64 for web applications
pdfalchemy to-base64 document.pdf --dpi 200 --output images.json
# Extract individual images from PDF pages
pdfalchemy extract-images document.pdf --output ./extracted/ --min-size 100x100
# Extract images with custom filters
pdfalchemy extract-images document.pdf --min-width 50 --max-width 800 --aspect-ratio 0.5-2.0
# Extract images with advanced options
pdfalchemy extract-images document.pdf \
--output ./extracted/ \
--dpi 300 \
--pages 1-5 \
--min-size 100x100 \
--max-size 800x600 \
--aspect-ratio 0.5-2.0 \
--threshold 0.15 \
--format json \
--summary
# Get help for any command
pdfalchemy extract-images --helpfrom pdfalchemy import PDFProcessor, ImageExtractionInput
processor = PDFProcessor()
# Configure image extraction with specific criteria
extraction_input = ImageExtractionInput(
png_bytes=png_bytes,
min_width=100, # Minimum width in pixels
min_height=100, # Minimum height in pixels
max_width=800, # Maximum width in pixels
max_height=600, # Maximum height in pixels
min_aspect_ratio=0.5, # Minimum aspect ratio (width/height)
max_aspect_ratio=2.0, # Maximum aspect ratio
flood_fill_threshold=0.15, # Threshold for flood fill algorithm
noise_reduction=True, # Enable noise reduction
separate_connected_regions=True # Separate connected regions
)
result = processor.extract_images_from_png(extraction_input)
print(f"Extracted {result.total_images} images")
print(f"Filtered out {result.filtered_count} images")
print(f"Processing time: {result.processing_time_ms:.2f} ms")from pathlib import Path
from pdfalchemy import PDFProcessor, PNGConversionInput
processor = PDFProcessor()
pdf_files = Path("./pdfs/").glob("*.pdf")
for pdf_file in pdf_files:
print(f"Processing {pdf_file}")
with open(pdf_file, "rb") as f:
pdf_bytes = f.read()
png_input = PNGConversionInput(
pdf_bytes=pdf_bytes,
dpi=200
)
result = processor.to_png(png_input)
print(f" Converted {result.total_pages} pages")from pdfalchemy import PDFProcessor, PNGConversionInput
processor = PDFProcessor()
with open("document.pdf", "rb") as f:
pdf_bytes = f.read()
png_input = PNGConversionInput(
pdf_bytes=pdf_bytes,
dpi=200
)
# Get base64 encoded PNG images
base64_images = processor.to_png_base64(png_input)
# Use in web applications
for i, base64_str in enumerate(base64_images):
html_img_tag = f'<img src="data:image/png;base64,{base64_str}" alt="Page {i+1}">'
print(html_img_tag)PDFAlchemy provides a powerful command-line interface for batch processing and automation.
pdfalchemy to-png <pdf_file> [options]Options:
--output, -o: Output directory for PNG files--dpi: DPI resolution (default: 200, range: 72-1200)--pages: Page range (e.g., '1-5', '1,3,5', or '3')
Examples:
# Convert all pages
pdfalchemy to-png document.pdf --output ./images/
# Convert specific pages with high resolution
pdfalchemy to-png document.pdf --dpi 300 --pages 1-5 --output ./high_res/
# Convert single page
pdfalchemy to-png document.pdf --pages 3 --output ./single_page/pdfalchemy to-base64 <pdf_file> [options]Options:
--output, -o: Output file for base64 data (JSON format)--dpi: DPI resolution (default: 200)--pages: Page range (e.g., '1-5', '1,3,5', or '3')
Examples:
# Convert to base64 for web applications
pdfalchemy to-base64 document.pdf --dpi 200 --output images.json
# Convert specific pages
pdfalchemy to-base64 document.pdf --pages 1-3 --output selected_pages.jsonpdfalchemy extract-images <pdf_file> [options]Basic Options:
--output, -o: Output directory for extracted images--dpi: DPI resolution for conversion (default: 200)--pages: Page range (e.g., '1-5', '1,3,5', or '3')
Size Filtering:
--min-size: Minimum size in pixels (e.g., '100x100')--max-size: Maximum size in pixels (e.g., '800x600')--min-width: Minimum width in pixels--min-height: Minimum height in pixels--max-width: Maximum width in pixels--max-height: Maximum height in pixels
Advanced Filtering:
--aspect-ratio: Aspect ratio range (e.g., '0.5-2.0')--threshold: Flood fill threshold (0.0-1.0, default: 0.1)--no-noise-reduction: Disable noise reduction--no-separate-regions: Disable connected region separation--sort-order: Sort order for extracted images ('top-bottom', 'left-right', 'reading-order', default: 'top-bottom')
Output Options:
--format: Output format ('png' or 'json', default: 'png')--summary: Show detailed extraction summary
Examples:
# Basic image extraction
pdfalchemy extract-images document.pdf --output ./extracted/ --min-size 100x100
# Advanced filtering with custom sort order
pdfalchemy extract-images document.pdf \
--output ./filtered/ \
--min-width 50 \
--max-width 800 \
--aspect-ratio 0.5-2.0 \
--threshold 0.15 \
--sort-order reading-order
# JSON output with summary
pdfalchemy extract-images document.pdf \
--output ./json_output/ \
--format json \
--summary \
--pages 1-5
# High-resolution extraction with custom filters
pdfalchemy extract-images document.pdf \
--dpi 300 \
--output ./high_res_extracted/ \
--min-size 200x200 \
--max-size 1200x800 \
--aspect-ratio 0.8-1.5 \
--threshold 0.2 \
--no-noise-reductionThe --pages option supports multiple formats:
- Range:
1-5(pages 1 through 5) - List:
1,3,5(pages 1, 3, and 5) - Single:
3(page 3 only)
- Saves individual PNG files for each extracted image
- File naming:
page_001_image_001.png,page_001_image_002.png, etc. - Suitable for visual inspection and further processing
- Saves all extracted images as base64-encoded data in a JSON file
- Includes metadata: page number, image index, size in bytes
- Suitable for web applications and programmatic access
The --sort-order parameter controls how extracted images are ordered:
top-bottom(default): Sort by y-coordinate first (top to bottom), then by x-coordinate (left to right)left-right: Sort by x-coordinate first (left to right), then by y-coordinate (top to bottom)reading-order: Group images by approximate rows and sort each row left-to-right, then sort rows top-to-bottom
- Use appropriate DPI: Higher DPI provides better quality but increases processing time
- Filter early: Use size and aspect ratio filters to reduce processing overhead
- Batch processing: Process multiple files in scripts for automation
- Memory management: For large PDFs, consider processing page ranges
PDFAlchemy uses Pydantic models for configuration and validation. All input and output models include comprehensive validation and type checking.
pdf_bytes: PDF data as byte arraydpi: Resolution in DPI (72-1200, default: 200)first_page: First page to convert (1-indexed, optional)last_page: Last page to convert (1-indexed, optional)
png_images: List of PNG images as byte arraystotal_pages: Total number of pages converteddpi_used: DPI used for conversionpage_range: Page range converted (e.g., '1-5')total_size_bytes: Total size of all PNG images
png_bytes: PNG image data as byte arraymin_width/min_height: Minimum dimensions for extracted imagesmax_width/max_height: Maximum dimensions for extracted imagesmin_aspect_ratio/max_aspect_ratio: Aspect ratio constraintsflood_fill_threshold: Threshold for flood fill algorithm (0.0-1.0)noise_reduction: Enable noise reductionseparate_connected_regions: Attempt to separate connected regionssort_order: Sort order for extracted images ('top-bottom', 'left-right', 'reading-order')
extracted_images: List of base64 encoded extracted imagestotal_images: Total number of extracted imagesfiltered_count: Number of images filtered outprocessing_time_ms: Processing time in millisecondstotal_size_bytes: Total size of all extracted images
git clone https://github.com/jainparul9814/pdfalchemy.git
cd pdfalchemy
# Install all dependencies
pip install -r requirements-dev.txt
# Install in development mode
pip install -e .
# Setup pre-commit hooks
pre-commit install# Run all tests
pytest
# Run core tests with verbose output
pytest tests/test_core.py -v
# Run with coverage
pytest --cov=src.pdfalchemy.core --cov-report=term-missing# Format code
black src/ tests/
# Sort imports
isort src/ tests/
# Type checking
mypy src/
# Linting
flake8 src/ tests/Check the sample_test_scripts/ directory for working examples:
python sample_test_scripts/test_image_extraction.py# Clean previous builds
rm -rf dist/ build/ *.egg-info
# Build the package
python -m build
# Upload to PyPI
python -m twine upload dist/*Note: Make sure you have the required build tools installed:
pip install build twinepydantic>=2.0.0: Data validation and settings managementpdf2image>=1.16.0: PDF to image conversionopencv-python>=4.8.0: Computer vision for image processingPillow>=9.0.0: Image processingnumpy>=1.21.0: Numerical computing
pytest>=7.0.0: Testing frameworkblack>=23.0.0: Code formattingisort>=5.12.0: Import sortingflake8>=6.0.0: Lintingmypy>=1.0.0: Type checking
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Author: Parul Jain (jainparul9814@gmail.com)
See CHANGELOG.md for a list of changes and version history.