Split image into page segments #6

kimlee87 · 2025-10-29T16:00:44Z

Split strategy

For each input image:

1. Use PaddleOCR to run text detection on the image.
2. Compute a signal (column-wise projection) for locating vertical split points:
2.1) Mask the text areas based on the text detection result: masked areas are white, the background is black.
2.2) Compute the column-wise projection of the masked image, i.e. the mean pixel value per column (the signal).
3. Find vertical split points based on the signal:
3.1) Use Dynamic Programming to find the vertical breakpoints of significant gaps.
3.2) Between those breakpoints, find refined points where the signal is near zero (black), i.e. where there is no text.
3.3) Only consider points near the center, so that we end up with num_segments - 1 splits.
4. Split the image into vertical segments at those refined points; the result always covers the full width.
5. Save the segments as <img_name>_pX.jpg.
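
The masking, projection and splitting steps above can be sketched roughly as follows. This is only an illustration, not the code in split_pages.py: it assumes the detected text regions are already available as axis-aligned boxes (x0, y0, x1, y1), and it swaps the Dynamic Programming breakpoint search for a plain scan over empty column runs. All function names here are hypothetical.

```python
import numpy as np


def text_mask(shape, boxes):
    """Binary mask: 255 (white) inside text boxes, 0 (black) elsewhere.

    Assumption: boxes are axis-aligned (x0, y0, x1, y1) rectangles.
    """
    mask = np.zeros(shape, dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 255
    return mask


def column_signal(mask):
    """Column-wise projection: mean pixel value per column."""
    return mask.mean(axis=0)


def zero_runs(signal, eps=1e-6):
    """(start, end) column ranges, end exclusive, where the signal is ~0."""
    runs, start = [], None
    for x, value in enumerate(signal):
        if value <= eps and start is None:
            start = x
        elif value > eps and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(signal)))
    return runs


def split_points(signal, num_segments):
    """Pick the num_segments - 1 interior gaps closest to the image center."""
    width = len(signal)
    center = width / 2
    # Runs touching the left or right border are page margins, not gutters.
    candidates = [(s + e) // 2 for s, e in zero_runs(signal) if s > 0 and e < width]
    candidates.sort(key=lambda x: abs(x - center))
    return sorted(candidates[: num_segments - 1])


def split_ranges(width, points):
    """Column ranges of the segments; together they cover the full width."""
    edges = [0, *points, width]
    return list(zip(edges[:-1], edges[1:]))


# Toy example: a 10 x 100 image with one text block per page.
boxes = [(5, 0, 45, 10), (55, 0, 95, 10)]
signal = column_signal(text_mask((10, 100), boxes))
points = split_points(signal, num_segments=2)
ranges = split_ranges(100, points)
```

In the toy example the only interior gap spans columns 45 to 54, so the split lands in the middle of it and the two ranges meet there.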

In case the image is heavily skewed, I would suggest adding an extra preprocessing step before all other steps.

Tasks

create a splitting script that runs from the CLI with a JSON config file (split_pages.py)
add tests for the splitting script (still very simple at the moment)
add helper functions (+ tests) to handle the config file (config_handler.py)
add helper functions (+ tests) for opening/saving images (utils.py)
add a demo Jupyter notebook (notebooks/page_splitting.ipynb)

Usage

The script can be run from the command line:

python split_pages.py [-c config_file.json -t "unique_tag"]

If the config file and tag are not specified, default values are used:

default config: ecpo_eynollah/config/default_config.json
default tag: ts{YYYYMMDD-HHMMSS}_h{hostname}
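
For readers who want to see how these defaults could be wired up, here is a minimal sketch using argparse. It is not taken from split_pages.py: the long option names (--config, --tag) and the function names are assumptions; only the -c/-t flags, the default config path and the tag pattern come from the description above.

```python
import argparse
import socket
from datetime import datetime

DEFAULT_CONFIG = "ecpo_eynollah/config/default_config.json"


def default_tag(now=None, hostname=None):
    """Build a tag of the form ts{YYYYMMDD-HHMMSS}_h{hostname}."""
    now = now or datetime.now()
    hostname = hostname or socket.gethostname()
    return f"ts{now:%Y%m%d-%H%M%S}_h{hostname}"


def parse_args(argv=None):
    """Parse -c/-t; fall back to the defaults when they are omitted."""
    parser = argparse.ArgumentParser(prog="split_pages.py")
    # The long option names are hypothetical; the PR only shows -c and -t.
    parser.add_argument("-c", "--config", default=DEFAULT_CONFIG,
                        help="path to the JSON config file")
    parser.add_argument("-t", "--tag", default=None,
                        help="unique tag appended to output file names")
    args = parser.parse_args(argv)
    if args.tag is None:
        # Computed at parse time so the timestamp reflects the actual run.
        args.tag = default_tag()
    return args


explicit = parse_args(["-c", "config_file.json", "-t", "unique_tag"])
implicit = parse_args([])
```

The tag default is filled in after parsing rather than passed as default=default_tag(), so the timestamp is only computed when it is actually needed.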

Note on results

Segment name: {img_name}_p{i}_{unique_tag}.jpg
Each segment has the same size as the original image; only the segment area is visible.
After each run, the output dir contains:
- images of all segments
- a copy of the config file used
- a copy of the splitting script used
- a log CSV file listing, per image: file name, number of segments, position of the split(s), and whether the image was split at the middle line (fallback case)
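
As a rough illustration of the naming scheme and the log file, the sketch below builds segment names and writes one CSV row per image. The column names, the ";" separator for multiple split positions, and the choice to start the segment index at 0 are assumptions made for this example; the PR does not specify them.

```python
import csv
import io

# Hypothetical column names; the real log may label them differently.
LOG_FIELDS = ["file_name", "num_segments", "split_positions", "fallback"]


def segment_name(img_name, index, unique_tag):
    """Follow the pattern {img_name}_p{i}_{unique_tag}.jpg."""
    return f"{img_name}_p{index}_{unique_tag}.jpg"


def log_row(file_name, split_positions, fallback):
    """One log entry: n split positions produce n + 1 segments."""
    return {
        "file_name": file_name,
        "num_segments": len(split_positions) + 1,
        "split_positions": ";".join(str(x) for x in split_positions),
        "fallback": fallback,
    }


def write_log(rows, handle):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Example with made-up file names and positions.
names = [segment_name("sample", i, "run1") for i in range(2)]
buffer = io.StringIO()
write_log([log_row("sample.png", [1024], fallback=False)], buffer)
log_lines = buffer.getvalue().splitlines()
```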

Note on choosing config values

If an image is split at incorrect positions, adjust the values of "num_segments" and "close_threshold".
To get 3 segments from an image with gutter text and small gaps flanking the gutter, set "num_segments" in the config to 5, i.e. left page, small gap before gutter, gutter, small gap after gutter, right page. These small gaps are omitted while splitting.
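
To make the 5-to-3 case concrete, here is a small worked example. It assumes the narrow gaps are dropped by a minimum-width test; the min_width parameter and the function name are hypothetical, and the criterion the script actually uses is not described in this PR.

```python
def keep_main_segments(split_points, width, min_width):
    """Turn split points into column ranges and drop the narrow ones.

    Assumption: a segment narrower than min_width is one of the small
    gaps flanking the gutter and is omitted from the output.
    """
    edges = [0, *split_points, width]
    ranges = list(zip(edges[:-1], edges[1:]))
    return [(start, end) for start, end in ranges if end - start >= min_width]


# num_segments = 5 means 4 split points on a 2000 px wide image:
# left page | small gap | gutter | small gap | right page
points = [900, 930, 1070, 1100]
segments = keep_main_segments(points, width=2000, min_width=50)
# segments == [(0, 900), (930, 1070), (1100, 2000)]
```

The two 30 px gaps fall below the threshold, so three segments remain: left page, gutter, right page.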

Jingbao images: (corresponding config files are uploaded)

Data overview

Images from 1920 are 2-page images with no text in the gutters.
Images from 1930 and 1939 contain text in the gutters, so they are expected to be split into 3 segments. However:
- Some images contain curved gutters.
- The PaddleOCR text detection results are incorrect on some images.

Current results

For 1920: all correct, no fallback cases (no split at the middle vertical line).
For 1930 and 1939:
- With num_segments=5, we can successfully separate the gutters in some images. However, the remaining cases are incorrect: the gutter is grouped with one side, while the opposite side is split into 2 parts.
- Therefore, I settled on num_segments=3, which produces 2 main segments, with the gutter included on one side.
- In some images, the cut passes slightly through text that PaddleOCR did not detect.
- In summary:
  - 1930: 4/20 incorrect (minor text cuts due to the Text Detection step), no fallback cases
  - 1939: 1/40 incorrect (gutter is cut in the middle), no fallback cases
    - The incorrect case is jb_3796_1939-04-22_0006to0007.png. To correct this, change the close_threshold to 3 (see last section in the demo notebook)

Copilot

Pull Request Overview

This PR implements a page splitting algorithm that uses PaddleOCR and Dynamic Programming to detect gutters and split multi-page scanned images into individual page segments.

Key changes:

Implements a CLI-based page splitting script using PaddleOCR for text detection and ruptures library for breakpoint detection via Dynamic Programming
Adds configuration management system with JSON schema validation
Provides comprehensive test coverage for utility functions and configuration handling
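
To illustrate what breakpoint detection via Dynamic Programming means here, the following sketch finds the segmentation of a 1-D signal into a fixed number of pieces that minimizes the within-segment squared error. It is a self-contained stand-in written for this note, not the ruptures implementation and not the code in split_pages.py; the function name and parameters are hypothetical.

```python
import numpy as np


def dp_breakpoints(signal, n_bkps, min_size=1):
    """Return n_bkps breakpoint indices minimizing the total L2 cost."""
    y = np.asarray(signal, dtype=float)
    n = len(y)
    n_seg = n_bkps + 1
    if n < n_seg * min_size:
        raise ValueError("signal too short for the requested breakpoints")

    # Prefix sums give the squared-error cost of any segment in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y * y)))

    def cost(i, j):
        """Sum of squared deviations from the mean over y[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting y[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(n_seg + 1)]
    prev = [[-1] * (n + 1) for _ in range(n_seg + 1)]
    best[0][0] = 0.0
    for k in range(1, n_seg + 1):
        for j in range(k * min_size, n + 1):
            for i in range((k - 1) * min_size, j - min_size + 1):
                if best[k - 1][i] == inf:
                    continue
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    prev[k][j] = i

    # Walk back from the end to recover the segment start indices.
    starts, j = [], n
    for k in range(n_seg, 0, -1):
        j = prev[k][j]
        starts.append(j)
    return sorted(starts[:-1])  # drop the start of the first segment (0)


# Toy projection signal: empty margin, text, empty margin.
toy = [0.0] * 5 + [1.0] * 5 + [0.0] * 5
bkps = dp_breakpoints(toy, n_bkps=2)
```

This brute-force version runs in O(K n^2) time for K segments, which is fine for a demonstration on a short signal.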

Reviewed Changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 18 comments.

File	Description
ecpo_eynollah/split_pages.py	Main splitting algorithm with OCR-based gutter detection and Dynamic Programming for finding split points
ecpo_eynollah/config_handler.py	Configuration loading, validation, and saving utilities with schema-based validation
ecpo_eynollah/utils.py	Image I/O utilities and unique tag generation for output files
ecpo_eynollah/config/default_config.json	Default configuration with gutter detection parameters
ecpo_eynollah/config/config_schema.json	JSON schema for validating configuration structure and values
tests/test_utils.py	Test coverage for utility functions including image operations and tag generation
tests/test_config_handler.py	Comprehensive tests for configuration validation, loading, and updating
tests/conftest.py	Test helper function for file retrieval
tests/test_ecpo_eynollah.py	Removed placeholder test file
pyproject.toml	Added dependencies for PaddleOCR, OpenCV, ruptures, and JSON schema validation

kimlee87 added 18 commits October 28, 2025 17:53

split page using opencv, bad results

7fcf33d

diable first step, convert img to morp before finding gutter, better results

939f0f4

add config handler

18c14e4

add utils

409212f

detect gutter with config file

1fb27ac

set b&w converting as an option

f5208b6

edit docs

f755147

do not save intermediate images by default

95222f7

trace fallback cases

152ea4a

add unique tag to segment file name

4306605

feat: split page with TextDetection from PaddleOCR

93e93e6

feat: add gutter size to config

800b46a

move sample from 3p to 2p

b8e4927

update param name and default value value, update demo notebook

a09bb33

add text overlay on images in demo notebook

5602baa

update comment

d3a6ef5

update demo notebook

ef0bc31

delete comment in notebook

0032e08

kimlee87 requested a review from Copilot November 12, 2025 14:09

Copilot started reviewing on behalf of kimlee87 November 12, 2025 14:10

Copilot finished reviewing on behalf of kimlee87 November 12, 2025 14:12

Copilot AI reviewed Nov 12, 2025

kimlee87 added 6 commits November 12, 2025 15:17

add comment to notebook

664e79d

add tests and config for jingbao

124a154

update tests and update cells for installing matplotlib in demo notebook

196efc1

add Jingbao special case to demo notebook

b3a6f69

fix typos and potential bugs pointed out by Copilot, add test

0957a53

switch to higher versions of paddle* to avoid langchain problem

21bd8fe

kimlee87 marked this pull request as ready for review November 13, 2025 10:52

kimlee87 requested a review from dokempf November 13, 2025 10:57

change gutter_size to segment_size in demo notebook

b0a67c4
