feat: PDF annotation extraction for reviewer feedback #205

@ywatanabe1989

Summary

Add PDF annotation extraction capability to scitex-io, enabling structured extraction of highlights, comments, sticky notes, strikethrough, and underline annotations from PDF files. Primary use case: processing reviewer feedback on scientific manuscripts.

Reference: ~/proj/todo/scitex/16_PDF_ANNOTATION_EXTRACTION.md

Research Findings

Library Comparison

Four Python libraries were evaluated for PDF annotation extraction:

| Library | Version | Annotation support | Underlying text | Author/date | Performance (100 iterations) |
|---|---|---|---|---|---|
| PyMuPDF (fitz) | 1.26.7 | Full (all types via page.annots()) | Yes (get_text('words', clip=rect)) | Yes | 7.0 ms/call |
| pypdf | 6.5.0 | Full (raw /Annots dict access) | No (manual rect lookup needed) | Partial | 4.8 ms/call |
| pdfminer.six | 20251107 | Basic (raw dict, needs resolve1()) | No | Partial | Not benchmarked |
| pdfplumber | 0.11.8 | Inherits from pdfminer | No | Partial | Not benchmarked |

Recommendation: PyMuPDF (fitz)

PyMuPDF is the recommended backend for the following reasons:

  1. Best API for annotations: page.annots() returns typed Annot objects with .type, .info, .rect -- no manual PDF dict traversal needed.
  2. Underlying text extraction: page.get_text('words', clip=annot.rect) retrieves the words under markup annotations (highlights, strikeouts, underlines). None of the other libraries evaluated offers this in a single call.
  3. Already a dependency: scitex-io already uses fitz as its primary PDF backend (_select_backend defaults to it).
  4. Rich annotation type coverage: Supports all PDF annotation types, identified by PyMuPDF's integer type IDs (Highlight=8, Text/StickyNote=0, StrikeOut=11, Underline=9, Squiggly=10, FreeText=2, Ink=15, etc.).
  5. Structured metadata: Each annotation exposes content, title (author), creationDate, modDate, subject, color, and rect coordinates.

pypdf is slightly faster per call (4.8 ms vs. 7.0 ms) but lacks the critical get_text(clip=rect) feature, so it would require a separate text extraction pass and manual coordinate matching.
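Reason 2 relies on page.get_text('words', clip=...), which returns a list of word tuples rather than a single string. The sketch below shows one way to turn those tuples into the "marked text" shown in this issue. The helper name words_to_text is hypothetical and not part of scitex-io; the tuple layout (x0, y0, x1, y1, word, block_no, line_no, word_no) is the one PyMuPDF documents for the "words" mode.

```python
def words_to_text(words):
    """Join PyMuPDF word tuples into reading-order text.

    Each tuple is (x0, y0, x1, y1, word, block_no, line_no, word_no),
    as returned by page.get_text("words"). Hypothetical helper.
    """
    # Sort by block, then line, then position within the line.
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in ordered)


# Hand-built tuples standing in for page.get_text("words", clip=annot.rect)
sample_words = [
    (120.0, 90.0, 160.0, 100.0, "sample", 0, 0, 2),
    (72.0, 90.0, 95.0, 100.0, "This", 0, 0, 0),
    (98.0, 90.0, 110.0, 100.0, "is", 0, 0, 1),
]
marked_text = words_to_text(sample_words)
```

With a real page this would be called as words_to_text(page.get_text("words", clip=annot.rect)). One caveat worth testing: annot.rect is the bounding box of the whole annotation, so a highlight that spans several lines may pick up neighbouring words that were not actually marked.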

Tested Extraction Output

From a synthetic annotated PDF, PyMuPDF extracted:

Annotation 1: Type=(8, 'Highlight'), Author='Reviewer1', Content='This needs revision'
  Marked text: "This is sample text for annotation testing."

Annotation 2: Type=(0, 'Text'), Author='Reviewer2', Content='Please clarify this point'

Annotation 3: Type=(11, 'StrikeOut'), Author='Reviewer1', Content='Remove this sentence'
  Marked text: "Second line with more content to highlight."

Annotation 4: Type=(9, 'Underline'), Author='Reviewer2'
  Marked text: "This is sample text for an"

Key annotation type mapping (PyMuPDF type IDs)

| Type ID | Name | Use in review |
|---|---|---|
| 0 | Text (Sticky Note) | Reviewer comments |
| 2 | FreeText | Inline comments |
| 8 | Highlight | Marking important text |
| 9 | Underline | Emphasis |
| 10 | Squiggly | Suggested edits |
| 11 | StrikeOut | Text to remove |
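The mapping above could be captured in code so that the extractor does not scatter magic numbers. This is a minimal sketch: the names AnnotType, MARKUP_TYPES and is_text_markup are hypothetical, and the integer values are the ones listed in this issue (PyMuPDF also exposes them as constants such as fitz.PDF_ANNOT_HIGHLIGHT).

```python
from enum import IntEnum


class AnnotType(IntEnum):
    """Annotation type IDs used in this issue (hypothetical enum)."""
    TEXT = 0          # sticky note: reviewer comments
    FREE_TEXT = 2     # inline comments
    HIGHLIGHT = 8     # marking important text
    UNDERLINE = 9     # emphasis
    SQUIGGLY = 10     # suggested edits
    STRIKE_OUT = 11   # text to remove
    INK = 15
    POPUP = 16        # UI artifact, filtered out during extraction


# Types that mark up existing document text, so "marked_text" applies.
MARKUP_TYPES = frozenset({
    AnnotType.HIGHLIGHT,
    AnnotType.UNDERLINE,
    AnnotType.SQUIGGLY,
    AnnotType.STRIKE_OUT,
})


def is_text_markup(type_id: int) -> bool:
    """Return True if the type ID refers to a text markup annotation."""
    return type_id in MARKUP_TYPES
```

Because IntEnum members compare equal to plain integers, is_text_markup accepts the raw ID taken from annot.type[0] directly.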

Proposed API Design

Option A: New mode in stx.io.load (recommended)

import scitex as stx

# Extract annotations only
annotations = stx.io.load("reviewed.pdf", mode="annotations")
# Returns: List[Dict] with keys: type, page, rect, content, author, date, marked_text

# Include annotations in full extraction
data = stx.io.load("reviewed.pdf", mode="full", annotations=True)
# data.annotations -> List[Dict]

Option B: Standalone function

from scitex_io import extract_annotations
annotations = extract_annotations("reviewed.pdf")

Proposed output schema

[
    {
        "type": "Highlight",           # str: annotation type name
        "type_id": 8,                  # int: PyMuPDF annotation type ID
        "page": 0,                     # int: zero-indexed page number
        "rect": [72.0, 87.0, 404.0, 106.0],  # list: bounding box [x0, y0, x1, y1]
        "content": "This needs revision",     # str: annotation text/comment
        "marked_text": "Sample text...",      # str: underlying document text (markup types only)
        "author": "Reviewer1",                # str: annotation author
        "created": "D:20260327...",           # str: creation date (if available)
        "modified": "D:20260327...",          # str: modification date (if available)
        "color": [1.0, 1.0, 0.0],            # list: RGB color (if available)
    },
    ...
]
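To make the schema concrete, here is a sketch of how a single annotation might be mapped onto it. The function annotation_to_record is hypothetical. It only assumes the attributes this issue already names on PyMuPDF annotations (.type, .info, .rect) plus .colors, which PyMuPDF provides as a dict with a "stroke" entry. Using None for missing dates and colors, and omitting marked_text for non-markup types, are assumptions about how "if available" and "markup types only" would be handled.

```python
from types import SimpleNamespace  # only for the stand-in annotation below

MARKUP_TYPE_IDS = {8, 9, 10, 11}  # Highlight, Underline, Squiggly, StrikeOut


def annotation_to_record(annot, page_number, marked_text=None):
    """Map one annotation-like object onto the proposed schema (sketch)."""
    type_id, type_name = annot.type[0], annot.type[1]
    info = annot.info
    stroke = (annot.colors or {}).get("stroke")
    record = {
        "type": type_name,
        "type_id": type_id,
        "page": page_number,
        "rect": [float(v) for v in annot.rect],
        "content": info.get("content", ""),
        "author": info.get("title", ""),  # "title" holds the author
        "created": info.get("creationDate") or None,  # assumption: None if absent
        "modified": info.get("modDate") or None,
        "color": list(stroke) if stroke else None,
    }
    if type_id in MARKUP_TYPE_IDS:
        record["marked_text"] = marked_text or ""
    return record


# Stand-in object with the same attributes as a PyMuPDF annotation
fake = SimpleNamespace(
    type=(8, "Highlight"),
    info={"content": "This needs revision", "title": "Reviewer1"},
    rect=(72.0, 87.0, 404.0, 106.0),
    colors={"stroke": (1.0, 1.0, 0.0), "fill": None},
)
record = annotation_to_record(fake, page_number=0, marked_text="Sample text...")
```

Keeping the mapping in one small function would let both Option A and Option B return identical records.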

Integration Points

  1. scitex-io (_load_modules/_pdf.py): Add a mode="annotations" option and an annotations=True keyword argument to _extract_full/_extract_scientific. The implementation goes in a new _pdf_annotation_extractors.py module, following the existing pattern.

  2. scitex-writer: A future writer_import_annotations MCP tool could map extracted annotations to manuscript sections for revision tracking. This depends on (1) being implemented first.

Implementation Notes

  • The existing _pdf.py loader already imports from _pdf_utils, _pdf_text_extractors, and _pdf_content_extractors. A new _pdf_annotation_extractors.py follows this pattern cleanly.
  • No new dependencies required -- PyMuPDF is already used.
  • For markup annotations (Highlight, Underline, StrikeOut, Squiggly), page.get_text('words', clip=annot.rect) extracts the underlying text.
  • Popup annotations (type 16) should be filtered out -- they are UI artifacts, not user-created annotations.
  • The annot.info dict provides title (= author in PDF spec), content, creationDate, modDate.
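Taken together, the notes above suggest a loop of roughly the following shape. This is a sketch under stated assumptions, not the planned implementation: collect_annotations is a hypothetical name, and the function takes an already opened document so that it depends only on the calls named in this issue (page.annots(), annot.type, annot.info, annot.rect, page.get_text('words', clip=...)).

```python
POPUP_TYPE_ID = 16
MARKUP_TYPE_IDS = {8, 9, 10, 11}  # Highlight, Underline, Squiggly, StrikeOut


def collect_annotations(doc):
    """Collect reviewer annotations from an opened PyMuPDF document.

    `doc` is anything that iterates over pages offering annots() and
    get_text(), which is how an opened PyMuPDF document behaves.
    """
    records = []
    for page_number, page in enumerate(doc):
        for annot in page.annots():
            type_id, type_name = annot.type[0], annot.type[1]
            if type_id == POPUP_TYPE_ID:
                continue  # popups are UI artifacts, not reviewer input
            record = {
                "type": type_name,
                "type_id": type_id,
                "page": page_number,
                "content": annot.info.get("content", ""),
                "author": annot.info.get("title", ""),
            }
            if type_id in MARKUP_TYPE_IDS:
                # Each word tuple carries the word itself at index 4.
                words = page.get_text("words", clip=annot.rect)
                record["marked_text"] = " ".join(w[4] for w in words)
            records.append(record)
    return records


# Usage with PyMuPDF (assumes the package is installed):
#     import fitz
#     with fitz.open("reviewed.pdf") as doc:
#         annotations = collect_annotations(doc)
```

Passing the document in, rather than a path, keeps the function easy to unit test with stand-in page objects and leaves file handling to the existing loader.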

Priority

Medium -- useful for journal revision workflow but not blocking current submissions.

Labels

enhancement