PdfPy is a Python utility that splits PDF documents into chapters using bookmark hierarchy, style-based detection, OCR fallback (for scanned PDFs), or manual page selection.
# Install as editable package
pip install -e .
# Split a PDF automatically
pdfpy path/to/document.pdf
# Or using the scripts
./scripts/run_auto.sh path/to/document.pdf- Smart Detection: Automatically identifies chapters using PDF bookmarks (TOC).
- Style Fallback: If bookmarks are missing, it uses font size and regex patterns to find chapter titles.
- OCR Fallback: Optional OCR path for scanned/image-based PDFs (
--ocr). - Dynamic OCR Extraction: Configurable OCR regex list plus first-page fallback mode for scans without clear chapter headings.
- Configurable: Fine-tune detection rules in
src/pdfpy/chapters_config.mdwithout touching the code. - Manual Mode: Explicitly define split points for precise control.
- Merge Option: Consolidate detected sections into a single clean PDF.
- Drag & Drop: Windows-ready batch files in
scripts/for zero-command usage.
-
Clone the repository:
git clone https://github.com/FelixCAxO/Pdfpy.git cd Pdfpy -
Install as editable package:
pip install -e . -
Optional OCR setup (for scanned PDFs):
pip install -e ".[ocr]"Also install the local Tesseract OCR binary and ensure it is available on your system PATH.
If installed as a package, use the pdfpy command:
# Automatic mode (bookmarks -> style fallback)
pdfpy path/to/your/document.pdf
# Automatic mode + OCR fallback for scanned/image PDFs
pdfpy path/to/your/document.pdf --ocr
# Manual mode (comma-separated start pages)
pdfpy path/to/your/document.pdf --manual "5,12,45"scripts/run_auto.bat: Drag a PDF here to split it automatically.scripts/run_manual.bat: Drag a PDF here to be prompted for manual split pages.
scripts/run_auto.sh:./scripts/run_auto.sh path/to/document.pdfscripts/run_manual.sh:./scripts/run_manual.sh path/to/document.pdf
Heuristic detection settings are managed in src/pdfpy/chapters_config.md:
CHAPTER_REGEX: Regex pattern for style-based title detection (e.g.,^Chapter \d+).MIN_FONT_SIZE: Minimum font size to consider as a style title.MUST_BE_BOLD: Require bold font for style-based title detection (true/false).OCR_REGEXES: OCR regex list separated by||(used in scanned PDF mode).OCR_FALLBACK_TO_FIRST_PAGE: If OCR finds no chapter regex match, still split from page 1 (true/false).OCR_RENDER_DPI: OCR rendering DPI (e.g.,300,400,600).
This project is licensed under the MIT License - see the LICENSE file for details.