feat: PDF annotation extraction for reviewer feedback #205
Description
Summary
Add PDF annotation extraction capability to scitex-io, enabling structured extraction of highlights, comments, sticky notes, strikethrough, and underline annotations from PDF files. Primary use case: processing reviewer feedback on scientific manuscripts.
Reference: ~/proj/todo/scitex/16_PDF_ANNOTATION_EXTRACTION.md
Research Findings
Library Comparison
Four Python libraries were evaluated for PDF annotation extraction:
| Library | Version | Annotation Support | Underlying Text | Author/Date | Performance (100 iter) |
|---|---|---|---|---|---|
| PyMuPDF (fitz) | 1.26.7 | Full (all types via `page.annots()`) | Yes (`get_text('words', clip=rect)`) | Yes | 7.0 ms/call |
| pypdf | 6.5.0 | Full (raw `/Annots` dict access) | No (manual rect lookup needed) | Partial | 4.8 ms/call |
| pdfminer.six | 20251107 | Basic (raw dict, needs `resolve1()`) | No | Partial | Not benchmarked |
| pdfplumber | 0.11.8 | Inherits from pdfminer | No | Partial | Not benchmarked |
Recommendation: PyMuPDF (fitz)
PyMuPDF is the recommended backend for the following reasons:
- Best API for annotations: `page.annots()` returns typed `Annot` objects with `.type`, `.info`, and `.rect`, so no manual PDF dict traversal is needed.
- Underlying text extraction: `page.get_text('words', clip=annot.rect)` retrieves the text under markup annotations (highlights, strikeouts, underlines). None of the other evaluated libraries provides this in a single call.
- Already a dependency: scitex-io already uses fitz as its primary PDF backend (`_select_backend` defaults to it).
- Rich annotation type coverage: Supports all PDF annotation types (Highlight=8, Text/StickyNote=0, StrikeOut=11, Underline=9, Squiggly=10, FreeText=2, Ink=15, etc.).
- Structured metadata: Each annotation exposes `content`, `title` (author), `creationDate`, `modDate`, `subject`, color, and rect coordinates.
pypdf is slightly faster per-call but lacks the critical get_text(clip=rect) feature, requiring a separate text extraction pass and manual coordinate matching.
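The PyMuPDF calls named above can be combined into a short extraction loop. The following is a minimal sketch, not the scitex-io implementation: `iter_annotations` and `words_to_text` are hypothetical names, and it assumes PyMuPDF is installed and importable as `fitz`.

```python
from typing import Any, Dict, Iterator, Sequence, Tuple

# PyMuPDF "words" tuples: (x0, y0, x1, y1, word, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def words_to_text(words: Sequence[Word]) -> str:
    """Join word tuples into one string in block/line/word order."""
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in ordered)


def iter_annotations(pdf_path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per annotation in the PDF (hypothetical helper)."""
    import fitz  # PyMuPDF; imported lazily so words_to_text has no dependency

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for annot in page.annots():
                type_id, type_name = annot.type[0], annot.type[1]
                info = annot.info
                marked_text = None
                # Only markup types (Highlight, Underline, Squiggly, StrikeOut)
                # cover document text.
                if type_id in (8, 9, 10, 11):
                    words = page.get_text("words", clip=annot.rect)
                    marked_text = words_to_text(words)
                yield {
                    "type": type_name,
                    "type_id": type_id,
                    "page": page.number,
                    "rect": list(annot.rect),
                    "content": info.get("content", ""),
                    "author": info.get("title", ""),
                    "marked_text": marked_text,
                }
```

Calling `list(iter_annotations("reviewed.pdf"))` would then give one dict per annotation; the file name is only an example.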
Tested Extraction Output
From a synthetic annotated PDF, PyMuPDF extracted:
Annotation 1: Type=(8, 'Highlight'), Author='Reviewer1', Content='This needs revision'
Marked text: "This is sample text for annotation testing."
Annotation 2: Type=(0, 'Text'), Author='Reviewer2', Content='Please clarify this point'
Annotation 3: Type=(11, 'StrikeOut'), Author='Reviewer1', Content='Remove this sentence'
Marked text: "Second line with more content to highlight."
Annotation 4: Type=(9, 'Underline'), Author='Reviewer2'
Marked text: "This is sample text for an"
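In the output above, the marked text for annotation 4 ends at "an", which may be a word cut at the edge of the annotation rectangle. If whole words are preferred, one option is to fetch all words on the page without a clip and keep those that overlap the annotation rectangle by some fraction. The sketch below is pure Python over PyMuPDF-style word tuples; `overlap_fraction`, `select_words`, and the 0.5 threshold are illustrative assumptions, not part of the tested code.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlap_fraction(word_rect: Rect, target: Rect) -> float:
    """Fraction of the word's area that lies inside the target rectangle."""
    x0 = max(word_rect[0], target[0])
    y0 = max(word_rect[1], target[1])
    x1 = min(word_rect[2], target[2])
    y1 = min(word_rect[3], target[3])
    if x1 <= x0 or y1 <= y0:
        return 0.0  # no intersection
    word_area = (word_rect[2] - word_rect[0]) * (word_rect[3] - word_rect[1])
    if word_area <= 0:
        return 0.0  # degenerate word box
    return ((x1 - x0) * (y1 - y0)) / word_area


def select_words(
    words: Sequence[tuple], target: Rect, min_overlap: float = 0.5
) -> List[str]:
    """Keep words (tuples starting with x0, y0, x1, y1, text) that overlap target."""
    return [
        w[4]
        for w in words
        if overlap_fraction((w[0], w[1], w[2], w[3]), target) >= min_overlap
    ]
```

Here `words` would come from `page.get_text("words")` and `target` from `tuple(annot.rect)`. The threshold trades off dropping partially covered words against picking up words from neighbouring lines.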
Key annotation type mapping (PyMuPDF type IDs)
| Type ID | Name | Use in Review |
|---|---|---|
| 0 | Text (Sticky Note) | Reviewer comments |
| 2 | FreeText | Inline comments |
| 8 | Highlight | Marking important text |
| 9 | Underline | Emphasis |
| 10 | Squiggly | Suggested edits |
| 11 | StrikeOut | Text to remove |
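The table above can be carried into code as plain lookup tables. This is a sketch only; the constant and function names are hypothetical, and the IDs and roles are copied from the table.

```python
from typing import Dict, Optional

# Type IDs and names as listed in the table above.
ANNOT_TYPE_NAMES: Dict[int, str] = {
    0: "Text",
    2: "FreeText",
    8: "Highlight",
    9: "Underline",
    10: "Squiggly",
    11: "StrikeOut",
}

# How each type is typically used in a review, per the same table.
REVIEW_ROLE: Dict[int, str] = {
    0: "Reviewer comments",
    2: "Inline comments",
    8: "Marking important text",
    9: "Emphasis",
    10: "Suggested edits",
    11: "Text to remove",
}

# Markup types sit on top of document text, so they can carry marked text.
MARKUP_TYPE_IDS = frozenset({8, 9, 10, 11})


def review_role(type_id: int) -> Optional[str]:
    """Return the review role for a type ID, or None if it is not in the table."""
    return REVIEW_ROLE.get(type_id)


def has_marked_text(type_id: int) -> bool:
    """True for Highlight, Underline, Squiggly, and StrikeOut."""
    return type_id in MARKUP_TYPE_IDS
```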
Proposed API Design
Option A: New mode in stx.io.load (recommended)
# Extract annotations only
annotations = stx.io.load("reviewed.pdf", mode="annotations")
# Returns: List[Dict] with keys: type, page, rect, content, author, date, marked_text
# Include annotations in full extraction
data = stx.io.load("reviewed.pdf", mode="full", annotations=True)
# data.annotations -> List[Dict]
Option B: Standalone function
from scitex_io import extract_annotations
annotations = extract_annotations("reviewed.pdf")
Proposed output schema
[
{
"type": "Highlight", # str: annotation type name
"type_id": 8, # int: PDF annotation type constant
"page": 0, # int: zero-indexed page number
"rect": [72.0, 87.0, 404.0, 106.0], # list: bounding box [x0, y0, x1, y1]
"content": "This needs revision", # str: annotation text/comment
"marked_text": "Sample text...", # str: underlying document text (markup types only)
"author": "Reviewer1", # str: annotation author
"created": "D:20260327...", # str: creation date (if available)
"modified": "D:20260327...", # str: modification date (if available)
"color": [1.0, 1.0, 0.0], # list: RGB color (if available)
},
...
]
Integration Points
1. scitex-io (`_load_modules/_pdf.py`): Add `mode="annotations"` and an `annotations=True` kwarg to `_extract_full`/`_extract_scientific`. Implementation goes in a new `_pdf_annotation_extractors.py` module following the existing pattern.
2. scitex-writer: A future `writer_import_annotations` MCP tool could map extracted annotations to manuscript sections for revision tracking. This depends on (1) being implemented first.
Implementation Notes
- The existing `_pdf.py` loader already imports from `_pdf_utils`, `_pdf_text_extractors`, and `_pdf_content_extractors`. A new `_pdf_annotation_extractors.py` follows this pattern cleanly.
- No new dependencies required: PyMuPDF is already used.
- For markup annotations (Highlight, Underline, StrikeOut, Squiggly), `page.get_text('words', clip=annot.rect)` extracts the underlying text.
- Popup annotations (type 16) should be filtered out; they are UI artifacts, not user-created annotations.
- The `annot.info` dict provides `title` (= author in PDF spec), `content`, `creationDate`, `modDate`.
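The notes above, together with the proposed output schema, suggest a small record builder that maps `annot.info` onto the schema keys and drops Popup annotations. The following is a sketch under assumptions: `build_record` is a hypothetical name, and the caller is assumed to pass in values already read from the annotation (for example, the color might come from `annot.colors["stroke"]`).

```python
from typing import Any, Dict, Mapping, Optional, Sequence

POPUP_TYPE_ID = 16  # Popup annotations are UI artifacts and are skipped


def build_record(
    type_id: int,
    type_name: str,
    page_index: int,
    rect: Sequence[float],
    info: Mapping[str, str],
    color: Optional[Sequence[float]] = None,
    marked_text: Optional[str] = None,
) -> Optional[Dict[str, Any]]:
    """Map raw annotation fields onto the proposed schema; None for Popups."""
    if type_id == POPUP_TYPE_ID:
        return None
    return {
        "type": type_name,
        "type_id": type_id,
        "page": page_index,  # zero-indexed
        "rect": [float(v) for v in rect],
        "content": info.get("content", ""),
        "marked_text": marked_text,  # markup types only, else None
        "author": info.get("title", ""),  # "title" holds the author
        "created": info.get("creationDate") or None,  # raw PDF date string
        "modified": info.get("modDate") or None,
        "color": list(color) if color is not None else None,
    }
```

Dates are passed through as raw strings here; whether to parse them into `datetime` objects is left open, as in the proposed schema.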
Priority
Medium -- useful for journal revision workflow but not blocking current submissions.