@kimlee87 kimlee87 commented Oct 29, 2025

Split strategy

For each input image:

  1. use PaddleOCR to detect text regions in the image
  2. compute a signal (column-wise projection) to find vertical split points
    2.1) mask the text areas based on the text detection result: masked areas are white, the background is black
    2.2) compute the column-wise projection of the masked image, i.e. the mean pixel value per column (the signal)
  3. find vertical split points based on the signal
    3.1) use Dynamic Programming to find vertical breakpoints at significant gaps
    3.2) between those breakpoints, find refined points where the signal is near zero (black), i.e. no text there
    3.3) only consider points near the center, so that we end up with num_segments - 1 splits
  4. split into vertical segments at those refined points; the segments always cover the full width
  5. save segments as <img_name>_pX.jpg
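
Steps 2 and 3.2-3.3 can be sketched roughly as follows. This is a minimal, pure-Python illustration and not the code in split_pages.py: the function names (text_mask, column_signal, gap_centers, pick_splits) and the box format (x0, y0, x1, y1) are assumptions made for the example, and the Dynamic Programming breakpoint step (3.1) is left out, so the sketch simply takes the center of every zero-signal run as a candidate.

```python
def text_mask(height, width, boxes):
    """Build a mask that is 255 inside text boxes and 0 elsewhere.

    boxes: iterable of (x0, y0, x1, y1), end-exclusive (assumed format).
    """
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 255
    return mask


def column_signal(mask):
    """Column-wise projection: mean pixel value of each column."""
    height = len(mask)
    width = len(mask[0])
    return [sum(row[x] for row in mask) / height for x in range(width)]


def gap_centers(signal, eps=1e-6):
    """Return the center column of every run of near-zero signal."""
    centers = []
    start = None
    # The trailing 1.0 is a sentinel that closes a run ending at the last column.
    for x, value in enumerate(signal + [1.0]):
        if value <= eps:
            if start is None:
                start = x
        elif start is not None:
            centers.append((start + x - 1) // 2)
            start = None
    return centers


def pick_splits(signal, num_segments):
    """Keep the num_segments - 1 candidates closest to the image center."""
    mid = len(signal) / 2
    nearest = sorted(gap_centers(signal), key=lambda c: abs(c - mid))
    return sorted(nearest[: num_segments - 1])


# Two text blocks with an empty gutter in columns 5-6 of a 12-pixel-wide image.
mask = text_mask(4, 12, [(1, 0, 5, 4), (7, 0, 11, 4)])
signal = column_signal(mask)
splits = pick_splits(signal, num_segments=2)
```

Restricting the candidates to those nearest the center is what keeps empty page margins from being picked as split points in this sketch.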

If an image is heavily skewed, I would suggest adding a preprocessing step before all other steps.

Tasks

  • create a splitting script to run with a JSON config file via CLI (split_pages.py)
  • add tests for the splitting script (still very simple at the moment)
  • add helper functions (+test) to handle config file (config_handler.py)
  • add helper function (+test) for opening/saving images (utils.py)
  • a demo in jupyter notebook (notebooks/page_splitting.ipynb)

Usage

The script can be run from the command line:

python split_pages.py [-c config_file.json -t "unique_tag"]

If the config file and tag are not specified, default values are used:

  • default config: ecpo_eynollah/config/default_config.json
  • default tag: ts{YYYYMMDD-HHMMSS}_h{hostname}
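
For illustration, the defaults above could be wired up with argparse roughly like this. It is a sketch under assumptions, not the actual split_pages.py: the long option names (--config, --tag) and the helper names default_tag and parse_args are hypothetical; only the short flags, the default config path and the tag pattern come from the description above.

```python
import argparse
import socket
from datetime import datetime

DEFAULT_CONFIG = "ecpo_eynollah/config/default_config.json"


def default_tag(now=None, hostname=None):
    """Build a tag of the form ts{YYYYMMDD-HHMMSS}_h{hostname}."""
    now = now or datetime.now()
    hostname = hostname or socket.gethostname()
    return f"ts{now:%Y%m%d-%H%M%S}_h{hostname}"


def parse_args(argv=None):
    """Parse -c/-t and fall back to the defaults when they are omitted."""
    parser = argparse.ArgumentParser(description="Split page images (sketch).")
    # The long option names are assumptions made for this example.
    parser.add_argument("-c", "--config", default=DEFAULT_CONFIG)
    parser.add_argument("-t", "--tag", default=None)
    args = parser.parse_args(argv)
    if args.tag is None:
        args.tag = default_tag()
    return args


example_tag = default_tag(datetime(2025, 10, 29, 12, 0, 5), "host1")
example_args = parse_args(["-c", "my_config.json", "-t", "run1"])
```

Computing the tag lazily (only when -t is missing) keeps the timestamp tied to the moment the script actually starts.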

Note on results

  • Segment's name: {img_name}_p{i}_{unique_tag}.jpg
  • Each segment has the same size as the original image; only the segment area is visible
  • After each run, you will find in the output dir:
    • images of all segments
    • a copy of the used config file
    • a copy of the used splitting script
    • a log CSV file containing, per image: file name, number of segments, position of the split(s), and whether the image was split at the middle line (fallback case)
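
A rough sketch of how these outputs fit together, again with hypothetical names: segment_name follows the naming pattern above, keep_columns produces a same-size copy in which only one column range stays visible, and write_log emits one CSV row per image. The fill value for the hidden area, the CSV column names and the ";" separator for multiple split positions are assumptions, not taken from the script.

```python
import csv
import io


def segment_name(img_name, index, unique_tag):
    """Segment file name following {img_name}_p{i}_{unique_tag}.jpg."""
    return f"{img_name}_p{index}_{unique_tag}.jpg"


def keep_columns(image, x_start, x_end, fill=0):
    """Same-size copy of image; only columns [x_start, x_end) keep their pixels.

    image is a list of rows; fill is an assumed value for the hidden area.
    """
    return [
        [px if x_start <= x < x_end else fill for x, px in enumerate(row)]
        for row in image
    ]


def log_row(file_name, splits, used_fallback):
    """One log entry: file name, segment count, split positions, fallback flag."""
    return [file_name, len(splits) + 1, ";".join(map(str, splits)), used_fallback]


def write_log(rows):
    """Serialize log rows to CSV text (column names are assumptions)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["file_name", "num_segments", "split_positions", "fallback"])
    writer.writerows(rows)
    return buffer.getvalue()


left = keep_columns([[1, 2, 3, 4]], 0, 2)
log_text = write_log([log_row("a.png", [5], False)])
```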

Note on choosing config values

  • If an image is split at incorrect positions, please adjust the values of "num_segments" and "close_threshold"
  • To get 3 segments from an image with gutter text and small gaps flanking the gutter, "num_segments" in the config should be 5, i.e. left page - small gap before gutter - gutter - small gap after gutter - right page. These small gaps will be omitted while splitting.
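
To make the 5-to-3 mapping concrete: with num_segments=5 there are 4 split points, and dropping the two small gaps leaves the left page, the gutter and the right page. The sketch below assumes the gaps are exactly the 2nd and 4th ranges; the function name main_segments is hypothetical and the script may decide which ranges to omit differently.

```python
def main_segments(split_points, width):
    """Return the x-ranges kept when every second range (the small gaps) is dropped.

    Assumes the ranges alternate: page, gap, gutter, gap, page.
    """
    bounds = [0] + sorted(split_points) + [width]
    ranges = list(zip(bounds[:-1], bounds[1:]))
    return ranges[::2]


kept = main_segments([40, 45, 55, 60], width=100)
```

Here the ranges (40, 45) and (55, 60) are the small gaps flanking the gutter, so only three ranges remain.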

Jingbao images: (corresponding config files are uploaded)

Data overview

  • Images from 1920 are 2-page images, no text in gutters
  • Images from 1930 and 1939 contain text in the gutters, so they are expected to be split into 3 segments. However:
    • Some images contain curved gutters.
    • PaddleOCR text detection results are incorrect on some images

Current results

  • For 1920, all correct, no fallback cases (no split by middle vertical line)
  • For 1930 and 1939:
    • With num_segments=5, we can successfully separate the gutters in some images. However, the remaining cases are incorrect: the gutter is grouped with one side, while the opposite side is split into 2 parts.
    • Therefore, I settled on num_segments=3, which produces 2 main segments, with the gutter included on one side.
    • In some images, the cut goes slightly through text that was not detected by PaddleOCR
    • In summary
      • 1930: 4/20 incorrect (minor text cuts due to the Text Detection step), no fallback cases
      • 1939: 1/40 incorrect (gutter is cut in the middle), no fallback cases
        • The incorrect case is jb_3796_1939-04-22_0006to0007.png. To correct this, change the close_threshold to 3 (see last section in the demo notebook)
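
A per-image override like this can be prepared by copying the base config and changing one value, then passing the new file with -c. The sketch assumes "num_segments" and "close_threshold" are top-level keys, which may not match the real config layout; the base values shown are placeholders, not the actual defaults, and with_overrides is a hypothetical helper.

```python
import json


def with_overrides(config, **overrides):
    """Return a copy of config with the given top-level keys replaced."""
    updated = dict(config)
    updated.update(overrides)
    return updated


# Placeholder values for illustration only; see default_config.json for the real ones.
base = {"num_segments": 3, "close_threshold": 5}
per_image = with_overrides(base, close_threshold=3)
config_text = json.dumps(per_image, indent=2)
```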


Copilot AI left a comment

Pull Request Overview

This PR implements a page splitting algorithm that uses PaddleOCR and Dynamic Programming to detect gutters and split multi-page scanned images into individual page segments.

Key changes:

  • Implements a CLI-based page splitting script using PaddleOCR for text detection and ruptures library for breakpoint detection via Dynamic Programming
  • Adds configuration management system with JSON schema validation
  • Provides comprehensive test coverage for utility functions and configuration handling

Reviewed Changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 18 comments.

Summary per file (File: Description)

  • ecpo_eynollah/split_pages.py: Main splitting algorithm with OCR-based gutter detection and Dynamic Programming for finding split points
  • ecpo_eynollah/config_handler.py: Configuration loading, validation, and saving utilities with schema-based validation
  • ecpo_eynollah/utils.py: Image I/O utilities and unique tag generation for output files
  • ecpo_eynollah/config/default_config.json: Default configuration with gutter detection parameters
  • ecpo_eynollah/config/config_schema.json: JSON schema for validating configuration structure and values
  • tests/test_utils.py: Test coverage for utility functions including image operations and tag generation
  • tests/test_config_handler.py: Comprehensive tests for configuration validation, loading, and updating
  • tests/conftest.py: Test helper function for file retrieval
  • tests/test_ecpo_eynollah.py: Removed placeholder test file
  • pyproject.toml: Added dependencies for PaddleOCR, OpenCV, ruptures, and JSON schema validation

@kimlee87 kimlee87 marked this pull request as ready for review November 13, 2025 10:52
@kimlee87 kimlee87 requested a review from dokempf November 13, 2025 10:57