YugVarshney/Adobe_Hackathon_Round_1A

PDF Hierarchical Outline Extractor (H1/H2/H3)

Overview

This project is designed as part of the Round 1A: Understand Your Document challenge. The goal is to transform a PDF into a clean, structured outline by extracting its title and hierarchical headings (H1, H2, H3).

Rather than relying on font size alone, this solution combines boldness, position, indentation, and script type to classify headings more accurately. It works offline, supports multilingual documents (including Japanese and German), and handles complex formatting.

The output is structured JSON that other applications can process further. PDFs of up to 50 pages are processed in under 10 seconds without ML models, OCR, or cloud APIs; PyMuPDF is the only library dependency.


🔹 Key Features

  • Title Extraction
  • Headings: H1, H2, H3 (with levels and page numbers)
  • Nested Headings: e.g., 1., 1.2.3 style
  • Multilingual Support: Japanese, Arabic, German, etc.
  • Accurate Classification: Font size + boldness + indentation + position
  • Positional Heuristics: x/y layout-aware detection
  • Noise Removal: Removes bullets, footers, URLs, page numbers
  • Curriculum Filtering: Filters academic list items
  • Multi-line Title Normalization
  • Runs Offline: ≤10s for 50-page PDFs
  • Docker Ready
  • CPU Only: No GPU, no models

Approach

This solution uses a hybrid rule-based method combining typography, layout features, and language cues to extract heading hierarchies reliably from PDFs. It is fast, offline, and language-aware.


Text Block Extraction

Uses PyMuPDF to parse each page and extract:

  • text (full line content)
  • font_size (max from all spans)
  • bbox (bounding box: x0, y0, x1, y1)
  • is_bold (derived from font flags)

Each line is treated as a candidate heading.
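
As a rough illustration of this step, the sketch below flattens the dictionary returned by PyMuPDF's `page.get_text("dict")` into one record per line. The helper names `lines_from_page_dict` and `extract_lines` are hypothetical, and treating a line as bold when any of its spans is bold is an assumption, not necessarily what the project does.

```python
BOLD_FLAG = 1 << 4  # bit 4 of a PyMuPDF span's "flags" marks bold text


def lines_from_page_dict(page_dict, page_number):
    """Flatten a PyMuPDF page dict into one record per text line."""
    records = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            # Ignore spans that hold only whitespace when measuring style.
            spans = [s for s in line.get("spans", []) if s.get("text", "").strip()]
            if not spans:
                continue
            records.append({
                "text": "".join(s["text"] for s in line["spans"]).strip(),
                "font_size": max(s["size"] for s in spans),
                "bbox": tuple(line["bbox"]),  # (x0, y0, x1, y1)
                # Assumption: one bold span is enough to call the line bold.
                "is_bold": any(s["flags"] & BOLD_FLAG for s in spans),
                "page": page_number,
            })
    return records


def extract_lines(pdf_path):
    """Collect line records for every page of a PDF (requires PyMuPDF)."""
    import fitz  # imported lazily so the helper above works without PyMuPDF

    records = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            records.extend(lines_from_page_dict(page.get_text("dict"), page_number))
    return records
```

Calling `extract_lines("input/sample.pdf")` (a placeholder path) would return a flat list of candidate lines for the later steps.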


Intelligent Heading Classification (H1–H2–H3)

Instead of just relying on font size, the classification combines:

| Feature       | Role                               |
|---------------|------------------------------------|
| Font size     | Relative to body and title fonts   |
| Boldness      | Emphasis for section headers       |
| Line length   | Short lines more likely headings   |
| Indentation   | Left-aligned = higher-level header |
| Page position | Top of page = H1/H2 likely         |
| Unicode type  | Handles Japanese, Arabic, etc.     |

Heuristics:

  • H1: Large font or bold + short + left-aligned
  • H2: Moderate font and/or bold
  • H3: Smaller or paragraph-like headings
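
A minimal sketch of how heuristics like these could be expressed in code is shown below. The function name `classify_heading`, every threshold (font-size ratios, the 80-character limit, the 15% left margin), and the default US Letter page width are illustrative assumptions, not values taken from the project. Page position is left out to keep the example short.

```python
def classify_heading(line, body_size, page_width=612.0, max_chars=80):
    """Return "H1", "H2", "H3", or None for a line record.

    `line` is a dict with "text", "font_size", "bbox", and "is_bold".
    All thresholds are illustrative assumptions.
    """
    text = line["text"].strip()
    if not text or len(text) > max_chars:
        return None  # long lines are treated as body text

    ratio = line["font_size"] / body_size      # size relative to body text
    is_left = line["bbox"][0] <= 0.15 * page_width
    is_bold = line["is_bold"]

    # H1: large font, or bold + short + left-aligned with a larger font.
    if ratio >= 1.5 or (is_bold and is_left and ratio >= 1.25):
        return "H1"
    # H2: moderately larger font, or bold and slightly larger.
    if ratio >= 1.25 or (is_bold and ratio >= 1.1):
        return "H2"
    # H3: bold at body size, or only slightly larger than body text.
    if is_bold or ratio >= 1.1:
        return "H3"
    return None
```

The body font size would typically be estimated first, for example as the most common font size across all extracted lines.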

Lines are filtered out if they contain:

  • URLs, emails, phone numbers
  • Only symbols/bullets
  • Short meaningless tokens

Multilingual Script Awareness

  • Detects non-Latin characters via Unicode blocks
  • Uses NFKC normalization to unify scripts
  • Special support for:
    • Japanese: e.g., lines ending with ,
    • Arabic, Cyrillic, German
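
The sketch below shows one way to apply NFKC normalization and detect scripts using only the standard `unicodedata` module. The helper names and the approach of matching Unicode character-name prefixes are assumptions made for illustration; the project may inspect code-point ranges directly.

```python
import unicodedata

# Character-name prefixes used as a stand-in for Unicode script detection.
SCRIPT_PREFIXES = ("LATIN", "CJK", "HIRAGANA", "KATAKANA", "ARABIC", "CYRILLIC")


def normalize(text):
    """Apply NFKC so full-width digits, letters, and spaces become ASCII."""
    return unicodedata.normalize("NFKC", text).strip()


def scripts_in(text):
    """Return the set of script prefixes found among the letters of `text`."""
    found = set()
    for ch in normalize(text):
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for prefix in SCRIPT_PREFIXES:
            if name.startswith(prefix):
                found.add(prefix)
                break
    return found


def has_non_latin(text):
    """True when the text contains letters from a non-Latin script."""
    return bool(scripts_in(text) - {"LATIN"})
```

German text stays in the `LATIN` set because umlauts and `ß` are Latin letters, so it needs no special script handling at this stage.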

Offline and Fast

  • No OCR
  • No cloud/model downloads
  • Entirely offline
  • Processes 50-page PDFs in <10s on CPU

Smart Heading Classification

| Feature           | Purpose                             |
|-------------------|-------------------------------------|
| Font Size         | Estimates importance                |
| Boldness          | Identifies headers                  |
| Line Length       | Short = more likely heading         |
| Indentation (x0)  | Left-aligned = higher-level heading |
| Position (y0)     | Top of page = likely header         |
| Unicode Script    | Multilingual support                |
| List Pattern      | Skips bullets, curriculum lines     |
| Multilevel Format | Recognizes 1.2.3-style headings     |
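
To illustrate the multilevel format row, here is a hedged sketch that maps numbering depth to a heading level. The regular expression, the name `level_from_numbering`, and the rule that a bare number without a trailing dot is not a heading (so that lines such as "3 credits of Biology" are skipped) are assumptions for this example.

```python
import re

# Matches "1. Title", "1.2 Title", "1.2.3 Title" and captures the number,
# an optional trailing dot, and the heading text.
NUMBERED_RE = re.compile(r"^(\d+(?:\.\d+)*)(\.?)\s+(\S.*)$")


def level_from_numbering(text):
    """Map a numbered heading to "H1".."H3" by depth, or return None."""
    match = NUMBERED_RE.match(text.strip())
    if not match:
        return None
    number, trailing_dot, _title = match.groups()
    depth = number.count(".") + 1
    # Assumption: "3 credits ..." (single number, no dot) is a list item.
    if depth == 1 and not trailing_dot:
        return None
    return "H%d" % min(depth, 3)  # deeper levels are capped at H3
```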

Multilingual Support

Handles PDFs in:

  • Japanese (漢字, ひらがな)
  • Arabic, German, Cyrillic
  • No OCR or external model required

Noise & Metadata Filtering

Removes:

  • URLs, emails, phone numbers
  • Page footers (e.g., “Page 3 of 10”)
  • Symbols (●, •, –, etc.)
  • Short irrelevant tokens
  • Academic list lines (e.g., “3 credits of Biology”)
  • Repeated or boilerplate lines
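
The filters listed above could be sketched with a handful of regular expressions, as below. The patterns, the two-character minimum, and the repeat threshold of three are illustrative assumptions and are deliberately loose; `is_noise` and `filter_candidates` are hypothetical names.

```python
import re
from collections import Counter

URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
PAGE_FOOTER_RE = re.compile(r"^(page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)
CURRICULUM_RE = re.compile(r"^\d+\s+credits?\b", re.IGNORECASE)

NOISE_PATTERNS = (URL_RE, EMAIL_RE, PHONE_RE, PAGE_FOOTER_RE, CURRICULUM_RE)


def is_noise(text, min_chars=2):
    """True when a line should not be considered as a heading."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True  # short irrelevant tokens
    if not any(ch.isalnum() for ch in stripped):
        return True  # only symbols or bullets
    return any(pattern.search(stripped) for pattern in NOISE_PATTERNS)


def filter_candidates(texts, max_repeats=2):
    """Drop noisy lines and lines repeated more than `max_repeats` times."""
    counts = Counter(t.strip() for t in texts)
    return [
        t for t in texts
        if not is_noise(t) and counts[t.strip()] <= max_repeats
    ]
```

Keeping the minimum length at two characters is intended to avoid discarding short CJK headings, which are often only two characters long.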

Dependencies

  • Python 3.7+
  • PyMuPDF

Install via:

pip install pymupdf

Usage

python extract_outline.py --input ./samples/ --output ./outlines/ --debug

Arguments

| Flag         | Description                    |
|--------------|--------------------------------|
| -i, --input  | Directory with input PDF files |
| -o, --output | Directory to write JSON outputs |
| --debug      | (Optional) Show debug info     |
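
For reference, a command-line interface with these three flags could be declared with `argparse` roughly as follows. Marking `--input` and `--output` as required is an assumption; the README does not say whether they have defaults.

```python
import argparse


def build_parser():
    """Build a parser for the flags documented above."""
    parser = argparse.ArgumentParser(
        description="Extract the title and H1-H3 outline from PDF files."
    )
    # Assumption: both directories are mandatory.
    parser.add_argument("-i", "--input", required=True,
                        help="Directory with input PDF files")
    parser.add_argument("-o", "--output", required=True,
                        help="Directory to write JSON outputs")
    parser.add_argument("--debug", action="store_true",
                        help="Show debug info")
    return parser
```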

Output Format

    {
      "title": "Document Title",
      "outline": [
        { "level": "H1", "text": "Main Section", "page": 1 },
        { "level": "H2", "text": "Subsection", "page": 1 },
        { "level": "H3", "text": "Supporting text or paragraph", "page": 1 }
      ]
    }
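
A short sketch of how a result in this shape might be assembled and serialized is given below. The helper names `build_outline` and `to_json` are hypothetical. Passing `ensure_ascii=False` is a suggestion so that Japanese or Arabic heading text stays readable in the output file instead of being escaped.

```python
import json


def build_outline(title, headings):
    """Assemble the output dict from a title and classified heading records."""
    return {
        "title": title,
        "outline": [
            {"level": h["level"], "text": h["text"], "page": h["page"]}
            for h in headings
        ],
    }


def to_json(outline):
    """Serialize the outline, keeping non-ASCII text unescaped."""
    return json.dumps(outline, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    example = build_outline(
        "Document Title",
        [{"level": "H1", "text": "Main Section", "page": 1}],
    )
    print(to_json(example))
```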

Sample Results

Japanese LaTeX Manual

    {
      "title": "第1章",
      "outline": [
        { "level": "H1", "text": "babel", "page": 1 },
        { "level": "H2", "text": "japaneseパッケージ", "page": 1 },
        { "level": "H3", "text": "日本語による見出し語と日付を出力するための...", "page": 1 }
      ]
    }

English Resume

    {
      "title": "MYSELF",
      "outline": [
        { "level": "H2", "text": "Professional Experience", "page": 1 },
        { "level": "H3", "text": "Solved 700+ problems in Codeforces...", "page": 1 }
      ]
    }

Strengths and Pro Tips

  • Works on multilingual PDFs
  • Filters noisy/garbage headings
  • Auto-detects a clean document title
  • Fast, lightweight, and portable
  • No ML, OCR, or cloud dependencies
  • Generalizes to resumes, papers, and books

Docker Support

Constraints Met

  • Works on AMD64 (x86_64)
  • Offline-only (no internet/cloud)
  • No ML/OCR dependencies
  • ≤200MB footprint
  • CPU-only
  • ≤10s for 50-page PDFs

Build Docker Image

docker build --platform linux/amd64 -t pdfoutline:submission .

Run (Expected Execution)

docker run --rm \
  -v "$(pwd)/input:/app/input" \
  -v "$(pwd)/output:/app/output" \
  --network none \
  pdfoutline:submission

Project Structure

├── main.py      # Main script (entry point)
├── README.md    # This file
├── input/       # Input PDFs
├── output/      # Output JSONs
└── Dockerfile   # Docker definition
