This project was built for the Round 1A: Understand Your Document challenge. The goal is to transform a PDF into a clean, structured outline by extracting its title and hierarchical headings (H1, H2, H3).
Unlike naive approaches, this solution is not solely font-size based; it combines boldness, position, indentation, and script type to make more accurate heading classifications. It works offline, supports multilingual documents (including Japanese, German, and more), and handles complex formatting — making it reliable, fast, and generalizable.
The output is a structured JSON format that can be further processed in various applications. It processes PDFs of up to 50 pages in under 10 seconds with no external dependencies (no models, OCR, or cloud APIs).
- Title Extraction
- Headings: H1, H2, H3 (with levels and page numbers)
- Nested Headings: e.g., `1.`, `1.2.3`-style numbering
- Multilingual Support: Japanese, Arabic, German, etc.
- Accurate Classification: Font size + boldness + indentation + position
- Positional Heuristics: x/y layout-aware detection
- Noise Removal: Removes bullets, footers, URLs, page numbers
- Curriculum Filtering: Filters academic list items
- Multi-line Title Normalization
- Runs Offline: no network access; ≤10s for 50-page PDFs
- Docker Ready
- CPU Only: No GPU, no models
This solution uses a hybrid rule-based method combining typography, layout features, and language cues to extract heading hierarchies reliably from PDFs. It is fast, offline, and language-aware.
Uses PyMuPDF to parse each page and extract:
- `text` (full line content)
- `font_size` (max across all spans)
- `bbox` (bounding box: x0, y0, x1, y1)
- `is_bold` (derived from font flags)
Each line is treated as a candidate heading.
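The per-line records above can be assembled from PyMuPDF's `page.get_text("dict")` structure. A minimal sketch (the function name `extract_lines` is illustrative, not from the source; bold detection uses bit 4 of the span flags, per PyMuPDF's documentation):

```python
# Flatten one page of PyMuPDF's get_text("dict") output into per-line records.

def extract_lines(page_dict):
    records = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            if not spans:
                continue
            text = "".join(s["text"] for s in spans).strip()
            if not text:
                continue
            records.append({
                "text": text,
                "font_size": max(s["size"] for s in spans),    # max across spans
                "bbox": line["bbox"],                          # (x0, y0, x1, y1)
                "is_bold": any(s["flags"] & 16 for s in spans) # bit 4 = bold
            })
    return records
```

With PyMuPDF installed, `fitz.open(path)` and `page.get_text("dict")` supply the input directly; each returned record then becomes a heading candidate.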
Instead of just relying on font size, the classification combines:
| Feature | Role |
|---|---|
| Font size | Relative to body and title fonts |
| Boldness | Emphasis for section headers |
| Line length | Short lines more likely headings |
| Indentation | Left-aligned = higher-level header |
| Page position | Top of page = H1/H2 likely |
| Unicode type | Handles Japanese, Arabic, etc. |
Heuristics:
- H1: Large font or bold + short + left-aligned
- H2: Moderate font and/or bold
- H3: Smaller or paragraph-like headings
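A minimal version of these heuristics might look as follows (the helper name `classify_heading` and the exact thresholds are illustrative; the real rule set combines more signals):

```python
def classify_heading(text, font_size, is_bold, x0, body_size, page_width=595):
    """Rough heading-level heuristic from font size relative to body text,
    boldness, line length, and left indentation."""
    short = len(text) <= 60
    left_aligned = x0 < page_width * 0.2
    if (font_size >= body_size * 1.5 or is_bold) and short and left_aligned:
        return "H1"
    if font_size >= body_size * 1.2 or is_bold:
        return "H2"
    if font_size > body_size and short:
        return "H3"
    return None  # treat as body text
```

Anything returning `None` stays out of the outline; the thresholds would be tuned against the detected body font size per document.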
Lines are filtered out if they contain:
- URLs, emails, phone numbers
- Only symbols/bullets
- Short meaningless tokens
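A sketch of such a filter (the regex patterns and the name `is_noise` are illustrative, not the exact rules used):

```python
import re

URL_RE   = re.compile(r"https?://|www\.|\S+@\S+\.\S+")   # URLs and emails
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")          # phone-like digit runs
SYMBOLS  = set("●•◦–—-*·|")

def is_noise(line):
    text = line.strip()
    if len(text) < 3:                      # short meaningless tokens
        return True
    if URL_RE.search(text) or PHONE_RE.search(text):
        return True
    if all(ch in SYMBOLS or ch.isspace() for ch in text):
        return True                        # only symbols/bullets
    return False
```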
- Detects non-Latin characters via Unicode blocks
- Uses NFKC normalization to unify scripts
- Special support for:
  - Japanese: e.g., lines ending with `。` or `、`
  - Arabic, Cyrillic, German
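Both steps are expressible with the standard library alone; a sketch (helper names are illustrative):

```python
import unicodedata

def normalize(text):
    # NFKC folds compatibility forms, e.g. full-width "Ａ" -> "A",
    # so the same heuristics apply across scripts
    return unicodedata.normalize("NFKC", text)

def has_cjk(text):
    # Japanese kana and kanji fall in these Unicode ranges
    return any(
        0x3040 <= ord(ch) <= 0x30FF      # hiragana + katakana
        or 0x4E00 <= ord(ch) <= 0x9FFF   # CJK unified ideographs
        for ch in text
    )
```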
- No OCR
- No cloud/model downloads
- Entirely offline
- Processes 50-page PDFs in <10s on CPU
| Feature | Purpose |
|---|---|
| Font Size | Estimates importance |
| Boldness | Identifies headers |
| Line Length | Short = more likely heading |
| Indentation (x0) | Left-aligned = higher-level heading |
| Position (y0) | Top of page = likely header |
| Unicode Script | Multilingual support |
| List Pattern | Skips bullets, curriculum lines |
| Multilevel Format | Recognizes 1.2.3-style headings |
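Recognizing `1.2.3`-style numbering also lets the heading level be read off the dot depth directly; a sketch (regex and helper name are illustrative):

```python
import re

NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+")

def level_from_numbering(text):
    """Map '1 Intro' -> H1, '1.2 Details' -> H2, '1.2.3 Notes' -> H3."""
    m = NUMBERED.match(text)
    if not m:
        return None
    depth = m.group(1).count(".") + 1
    return f"H{min(depth, 3)}"
```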
Handles PDFs in:
- Japanese (漢字, ひらがな)
- Arabic, German, Cyrillic
- No OCR or external model required
Removes:
- URLs, emails, phone numbers
- Page footers (e.g., “Page 3 of 10”)
- Symbols (●, •, –, etc.)
- Short irrelevant tokens
- Academic list lines (e.g., “3 credits of Biology”)
- Repeated or boilerplate lines
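Footer and boilerplate removal can be sketched like so (patterns and the repeat threshold are illustrative assumptions):

```python
import re
from collections import Counter

FOOTER_RE  = re.compile(r"^page\s+\d+(\s+of\s+\d+)?$", re.IGNORECASE)
CREDITS_RE = re.compile(r"^\d+\s+credits?\s+of\s+", re.IGNORECASE)

def drop_boilerplate(lines):
    counts = Counter(lines)
    kept = []
    for line in lines:
        text = line.strip()
        if FOOTER_RE.match(text) or CREDITS_RE.match(text):
            continue
        if counts[line] > 2:   # repeats across pages -> running header/footer
            continue
        kept.append(line)
    return kept
```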
- Python 3.7+
- PyMuPDF
```bash
pip install pymupdf
```
```bash
python extract_outline.py --input ./samples/ --output ./outlines/ --debug
```
| Flag | Description |
|---|---|
| `-i, --input` | Directory with input PDF files |
| `-o, --output` | Directory to write JSON outputs |
| `--debug` | (Optional) Show debug info |
```json
{
  "title": "Document Title",
  "outline": [
    { "level": "H1", "text": "Main Section", "page": 1 },
    { "level": "H2", "text": "Subsection", "page": 1 },
    { "level": "H3", "text": "Supporting text or paragraph", "page": 1 }
  ]
}
```
```json
{
  "title": "第1章",
  "outline": [
    { "level": "H1", "text": "babel", "page": 1 },
    { "level": "H2", "text": "japaneseパッケージ", "page": 1 },
    { "level": "H3", "text": "日本語による見出し語と日付を出力するための...", "page": 1 }
  ]
}
```
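Emitting the Japanese example above intact requires serializing with `ensure_ascii=False`; otherwise the standard library escapes every non-ASCII character. A minimal sketch:

```python
import json

outline = {
    "title": "第1章",
    "outline": [{"level": "H1", "text": "babel", "page": 1}],
}
# ensure_ascii=False keeps Japanese text readable instead of \uXXXX escapes
text = json.dumps(outline, ensure_ascii=False, indent=2)
```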
```json
{
  "title": "MYSELF",
  "outline": [
    { "level": "H2", "text": "Professional Experience", "page": 1 },
    { "level": "H3", "text": "Solved 700+ problems in Codeforces...", "page": 1 }
  ]
}
```
- Works on multilingual PDFs
- Filters noisy/garbage headings
- Auto-detects clean document title
- Fast, lightweight, and portable
- No external dependencies
- Generalizable for resumes, papers, books
Constraints Met:
- Works on AMD64 (x86_64)
- Offline-only (no internet/cloud)
- No ML/OCR dependencies
- ≤200MB footprint
- CPU-only
- ≤10s for 50-page PDFs
```bash
docker build --platform linux/amd64 -t pdfoutline:submission .

docker run --rm \
  -v $(pwd)/input:/app/input \
  -v $(pwd)/output:/app/output \
  --network none \
  pdfoutline:submission
```
```
├── main.py       # Main script (entry point)
├── README.md     # This file
├── input/        # Input PDFs
├── output/       # Output JSONs
├── Dockerfile    # Docker definition
```