PDF Hierarchical Outline Extractor (H1/H2/H3)

Overview

This project is designed as part of the Round 1A: Understand Your Document challenge. The goal is to transform a PDF into a clean, structured outline by extracting its title and hierarchical headings (H1, H2, H3).

Rather than relying on font size alone, the extractor combines boldness, position, indentation, and script type to classify headings more accurately. It works offline, supports multilingual documents (including Japanese, Arabic, and German), and handles complex formatting.

The output is structured JSON that downstream applications can process further. PDFs of up to 50 pages are processed in under 10 seconds, and the only dependency is PyMuPDF (no models, OCR, or cloud APIs).

🔹 Key Features

Title Extraction
Headings: H1, H2, H3 (with levels and page numbers)
Numbered Headings: recognizes multilevel numbering such as 1. and 1.2.3
Multilingual Support: Japanese, Arabic, German, etc.
Accurate Classification: Font size + boldness + indentation + position
Positional Heuristics: x/y layout-aware detection
Noise Removal: Removes bullets, footers, URLs, page numbers
Curriculum Filtering: Filters academic list items
Multi-line Title Normalization
Runs Offline: ≤10s for 50-page PDFs
Docker Ready
CPU Only: No GPU, no models

Approach

This solution uses a hybrid rule-based method combining typography, layout features, and language cues to extract heading hierarchies reliably from PDFs. It is fast, offline, and language-aware.

Text Block Extraction

Uses PyMuPDF to parse each page and extract:

text (full line content)
font_size (max from all spans)
bbox (bounding box: x0, y0, x1, y1)
is_bold (derived from font flags)

Each line is treated as a candidate heading.
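
The extraction step can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the function names `lines_from_page_dict` and `extract_lines` are hypothetical, and treating a line as bold when any of its spans is bold is an assumption. It relies on the layout returned by PyMuPDF's `page.get_text("dict")`, where bit 4 (value 16) of a span's `flags` marks bold text.

```python
from typing import Any, Dict, List

BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's "flags" field marks bold text


def lines_from_page_dict(page_dict: Dict[str, Any], page_number: int) -> List[Dict[str, Any]]:
    """Flatten a PyMuPDF page dict into one record per non-empty text line."""
    records = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 = text block; skip image blocks
            continue
        for line in block.get("lines", []):
            # Ignore whitespace-only spans when measuring size and boldness.
            spans = [s for s in line.get("spans", []) if s.get("text", "").strip()]
            if not spans:
                continue
            records.append({
                "text": "".join(s["text"] for s in line["spans"]).strip(),
                "font_size": max(s["size"] for s in spans),
                "bbox": tuple(line["bbox"]),  # (x0, y0, x1, y1)
                "is_bold": any(s["flags"] & BOLD_FLAG for s in spans),
                "page": page_number,
            })
    return records


def extract_lines(pdf_path: str) -> List[Dict[str, Any]]:
    """Open a PDF with PyMuPDF and collect candidate lines from every page."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    records = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            records.extend(lines_from_page_dict(page.get_text("dict"), page_number))
    return records
```

Keeping the flattening logic separate from the file handling makes it possible to exercise the record-building code on a hand-written page dict without a PDF at hand.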

Intelligent Heading Classification (H1–H2–H3)

Instead of just relying on font size, the classification combines:

| Feature | Role |
| --- | --- |
| Font size | Relative to body and title fonts |
| Boldness | Emphasis for section headers |
| Line length | Short lines more likely headings |
| Indentation | Left-aligned = higher-level header |
| Page position | Top of page = H1/H2 likely |
| Unicode type | Handles Japanese, Arabic, etc. |

Heuristics:

H1: Large font or bold + short + left-aligned
H2: Moderate font and/or bold
H3: Smaller or paragraph-like headings
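
One way to turn these heuristics into code is sketched below. The function name `classify_heading` and every threshold (the size ratios, the 80-character length limit, the 5 pt margin tolerance) are illustrative assumptions rather than the project's tuned values.

```python
from typing import Optional


def classify_heading(text: str, font_size: float, is_bold: bool, x0: float,
                     body_size: float, left_margin: float) -> Optional[str]:
    """Return "H1", "H2", "H3", or None for body text.

    All thresholds here are assumptions chosen for illustration.
    """
    ratio = font_size / body_size                    # size relative to body text
    is_short = len(text) <= 80                       # character count also works for CJK
    is_left_aligned = abs(x0 - left_margin) <= 5.0   # within 5 pt of the margin

    if not is_short:
        return None  # long lines are treated as paragraphs
    if ratio >= 1.5 or (is_bold and ratio >= 1.25 and is_left_aligned):
        return "H1"
    if ratio >= 1.25 or (is_bold and ratio >= 1.1):
        return "H2"
    if ratio >= 1.1 or is_bold:
        return "H3"
    return None
```

Here `body_size` would typically be the most common font size in the document and `left_margin` the smallest `x0` seen among body lines; both are left to the caller in this sketch.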

Lines are filtered out if they contain:

URLs, emails, phone numbers
Only symbols/bullets
Short meaningless tokens

Multilingual Script Awareness

Detects non-Latin characters via Unicode blocks
Uses NFKC normalization to fold compatibility forms (e.g., full-width digits, letters, and punctuation) into their standard equivalents
Special support for:
- Japanese: e.g., lines ending with 。, ．
- Arabic, Cyrillic, German
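
A small sketch of both steps using only the standard library is shown below. The helper names `normalize_line` and `dominant_script` are hypothetical. Python's standard library has no Unicode block lookup, so this sketch substitutes a check on Unicode character names for the block-based detection described above.

```python
import unicodedata
from collections import Counter

# Prefixes of Unicode character names mapped to a coarse script label.
SCRIPT_PREFIXES = {
    "CJK UNIFIED": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "ARABIC": "Arabic",
    "CYRILLIC": "Cyrillic",
    "LATIN": "Latin",
}


def normalize_line(text: str) -> str:
    """Apply NFKC so full-width digits, letters, and spaces become ASCII forms."""
    return unicodedata.normalize("NFKC", text).strip()


def dominant_script(text: str) -> str:
    """Return the most common script among the letters of `text`, or "Unknown"."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, and spaces carry no script signal
        name = unicodedata.name(ch, "")
        for prefix, script in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else "Unknown"
```

For example, `normalize_line("１．２　概要")` yields `"1.2 概要"`, so a full-width section number can be matched by the same numbering rules as an ASCII one.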

Offline and Fast

No OCR
No cloud/model downloads
Entirely offline
Processes 50-page PDFs in <10s on CPU

Smart Heading Classification

| Feature | Purpose |
| --- | --- |
| Font Size | Estimates importance |
| Boldness | Identifies headers |
| Line Length | Short = more likely heading |
| Indentation (x0) | Left-aligned = higher-level heading |
| Position (y0) | Top of page = likely header |
| Unicode Script | Multilingual support |
| List Pattern | Skips bullets, curriculum lines |
| Multilevel Format | Recognizes 1.2.3-style headings |
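
The multilevel-format row can be illustrated with a short regular-expression sketch. The pattern and the function name `level_from_numbering` are assumptions for illustration; the project's actual rule may differ. A bare number without a dot is deliberately not matched, so that list lines such as "3 credits of Biology" are not mistaken for headings.

```python
import re
from typing import Optional

# Matches "1.", "1.2", "1.2.3", or "2.1." followed by whitespace and a title.
# A single number must carry a trailing dot ("1."), so "3 credits" is rejected.
NUMBERED_HEADING = re.compile(r"^(\d+(?:\.\d+)+\.?|\d+\.)\s+\S")


def level_from_numbering(text: str, max_depth: int = 3) -> Optional[str]:
    """Map the numbering depth of a line to "H1".."H3", or None if unnumbered."""
    match = NUMBERED_HEADING.match(text.strip())
    if match is None:
        return None
    depth = len(re.findall(r"\d+", match.group(1)))  # count numeric components
    return f"H{min(depth, max_depth)}"
```

Numbering deeper than three levels is clamped to H3 here, since the outline format only defines H1 through H3.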

Multilingual Support

Handles PDFs in:

Japanese (漢字, ひらがな)
Arabic, German, Cyrillic
No OCR or external model required

Noise & Metadata Filtering

Removes:

URLs, emails, phone numbers
Page footers (e.g., “Page 3 of 10”)
Symbols (●, •, –, etc.)
Short irrelevant tokens
Academic list lines (e.g., “3 credits of Biology”)
Repeated or boilerplate lines
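
Most of the filters listed above can be approximated with a handful of regular expressions. The sketch below is illustrative only: the patterns, the `min_chars` default, and the name `is_noise` are assumptions, and patterns this simple will produce some false positives (for example, long digit runs that are not phone numbers). Detection of repeated lines across pages is not covered here.

```python
import re

URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{8,}\d")
PAGE_FOOTER_RE = re.compile(r"^(page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)
CURRICULUM_RE = re.compile(r"^\d+\s+credits?\b", re.IGNORECASE)


def is_noise(text: str, min_chars: int = 2) -> bool:
    """Return True if a line should be dropped before heading classification."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True  # short, meaningless tokens
    if not any(ch.isalnum() for ch in stripped):
        return True  # only symbols or bullets
    if any(p.search(stripped) for p in (URL_RE, EMAIL_RE, PHONE_RE)):
        return True  # contact details and links
    # Page footers ("Page 3 of 10", "7") and academic list lines.
    return bool(PAGE_FOOTER_RE.match(stripped) or CURRICULUM_RE.match(stripped))
```

Running this filter before classification keeps the heading heuristics from having to reason about lines that can never be headings.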

Dependencies

Python 3.7+
PyMuPDF

Install via:

pip install pymupdf

Usage

python extract_outline.py --input ./samples/ --output ./outlines/ --debug

Arguments

| Flag | Description |
| --- | --- |
| `-i`, `--input` | Directory with input PDF files |
| `-o`, `--output` | Directory to write JSON outputs |
| `--debug` | (Optional) Show debug info |
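
For reference, these flags could be declared with `argparse` as in the sketch below. The flag names and help texts come from the table; marking both directories as required and naming each output file after its PDF (`report.pdf` to `report.json`) are assumptions, as are the helper names `build_parser` and `output_path_for`.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line flags listed in the Arguments table."""
    parser = argparse.ArgumentParser(
        description="Extract a title and H1-H3 outline from PDFs."
    )
    # Assumption: both directories are required; the real script may use defaults.
    parser.add_argument("-i", "--input", type=Path, required=True,
                        help="Directory with input PDF files")
    parser.add_argument("-o", "--output", type=Path, required=True,
                        help="Directory to write JSON outputs")
    parser.add_argument("--debug", action="store_true",
                        help="Show debug info")
    return parser


def output_path_for(pdf_path: Path, output_dir: Path) -> Path:
    """Assumed naming scheme: samples/report.pdf maps to <output_dir>/report.json."""
    return output_dir / (pdf_path.stem + ".json")
```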

Example

python extract_outline.py --input ./samples/ --output ./outlines/ --debug

Output Format

{
  "title": "Document Title",
  "outline": [
    { "level": "H1", "text": "Main Section", "page": 1 },
    { "level": "H2", "text": "Subsection", "page": 1 },
    { "level": "H3", "text": "Supporting text or paragraph", "page": 1 }
  ]
}
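
A document in this shape can be produced with the standard `json` module, as in the sketch below. The function name `build_outline_json` is hypothetical, and dropping entries whose level is not H1, H2, or H3 is an assumption about how out-of-range levels should be handled.

```python
import json
from typing import Dict, List, Union

OutlineEntry = Dict[str, Union[str, int]]


def build_outline_json(title: str, headings: List[OutlineEntry]) -> str:
    """Serialize a title and heading list into the JSON shape shown above."""
    allowed_levels = {"H1", "H2", "H3"}
    outline = [
        {"level": h["level"], "text": h["text"], "page": h["page"]}
        for h in headings
        if h["level"] in allowed_levels  # assumption: silently drop other levels
    ]
    document = {"title": title, "outline": outline}
    # ensure_ascii=False keeps non-Latin headings (e.g. Japanese) readable in the file.
    return json.dumps(document, ensure_ascii=False, indent=2)
```

When writing the result to disk, opening the file with `encoding="utf-8"` keeps the non-ASCII text intact across platforms.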

Sample Results

Japanese LaTeX Manual

{
  "title": "第1章",
  "outline": [
    { "level": "H1", "text": "babel", "page": 1 },
    { "level": "H2", "text": "japaneseパッケージ", "page": 1 },
    { "level": "H3", "text": "日本語による見出し語と日付を出力するための...", "page": 1 }
  ]
}

English Resume

{
  "title": "MYSELF",
  "outline": [
    { "level": "H2", "text": "Professional Experience", "page": 1 },
    { "level": "H3", "text": "Solved 700+ problems in Codeforces...", "page": 1 }
  ]
}

Strengths and Pro Tips

Works on multilingual PDFs

Filters noisy/garbage headings

Auto-detects clean document title

Fast, lightweight, and portable

No dependencies beyond PyMuPDF (no models, OCR, or cloud APIs)

Generalizable for resumes, papers, books

Docker Support

Constraints Met

Works on AMD64 (x86_64)

Offline-only (no internet/cloud)

No ML/OCR dependencies

≤200MB footprint

CPU-only

≤10s for 50-page PDFs

Build Docker Image

docker build --platform linux/amd64 -t pdfoutline:submission .

Run (Expected Execution)

docker run --rm \
  -v $(pwd)/input:/app/input \
  -v $(pwd)/output:/app/output \
  --network none \
  pdfoutline:submission

Project Structure

├── extract_outline.py   # Main script (entry point)
├── requirements.txt     # Python dependencies
├── README.md            # This file
├── input/               # Input PDFs
├── output/              # Output JSONs
└── Dockerfile           # Docker definition
