feat: PDF annotation extraction for reviewer feedback

## Summary

Add PDF annotation extraction capability to scitex-io, enabling structured extraction of highlights, comments, sticky notes, strikethrough, and underline annotations from PDF files. Primary use case: processing reviewer feedback on scientific manuscripts.

Reference: `~/proj/todo/scitex/16_PDF_ANNOTATION_EXTRACTION.md`

## Research Findings

### Library Comparison

Four Python libraries were evaluated for PDF annotation extraction:

| Library | Version | Annotation Support | Underlying Text | Author/Date | Performance (100 iter) |
|---------|---------|-------------------|-----------------|-------------|----------------------|
| **PyMuPDF (fitz)** | 1.26.7 | Full (all types via `page.annots()`) | Yes (`get_text('words', clip=rect)`) | Yes | 7.0ms/call |
| **pypdf** | 6.5.0 | Full (raw `/Annots` dict access) | No (manual rect lookup needed) | Partial | 4.8ms/call |
| **pdfminer.six** | 20251107 | Basic (raw dict, needs `resolve1()`) | No | Partial | Not benchmarked |
| **pdfplumber** | 0.11.8 | Inherits from pdfminer | No | Partial | Not benchmarked |

### Recommendation: PyMuPDF (fitz)

PyMuPDF is the recommended backend for the following reasons:

1. **Best API for annotations**: `page.annots()` returns typed `Annot` objects with `.type`, `.info`, `.rect` -- no manual PDF dict traversal needed.
2. **Underlying text extraction**: `page.get_text('words', clip=annot.rect)` returns the words inside an annotation's rectangle, which recovers the text under markup annotations (highlights, strikeouts, underlines). None of the other libraries evaluated offers this in a single call.
3. **Already a dependency**: scitex-io already uses fitz as its primary PDF backend (`_select_backend` defaults to it).
4. **Rich annotation type coverage**: Supports all PDF annotation types (Highlight=8, Text/StickyNote=0, StrikeOut=11, Underline=9, Squiggly=10, FreeText=2, Ink=15, etc.).
5. **Structured metadata**: Each annotation exposes `content`, `title` (author), `creationDate`, `modDate`, `subject`, color, and rect coordinates.

pypdf is slightly faster per call (4.8 ms vs. 7.0 ms) but has no equivalent of `get_text(clip=rect)`, so it would need a separate text extraction pass and manual coordinate matching.

### Tested Extraction Output

From a synthetic annotated PDF, PyMuPDF extracted:

```
Annotation 1: Type=(8, 'Highlight'), Author='Reviewer1', Content='This needs revision'
  Marked text: "This is sample text for annotation testing."

Annotation 2: Type=(0, 'Text'), Author='Reviewer2', Content='Please clarify this point'

Annotation 3: Type=(11, 'StrikeOut'), Author='Reviewer1', Content='Remove this sentence'
  Marked text: "Second line with more content to highlight."

Annotation 4: Type=(9, 'Underline'), Author='Reviewer2'
  Marked text: "This is sample text for an"
```

### Key annotation type mapping (PyMuPDF type IDs)

| Type ID | Name | Use in Review |
|---------|------|---------------|
| 0 | Text (Sticky Note) | Reviewer comments |
| 2 | FreeText | Inline comments |
| 8 | Highlight | Marking important text |
| 9 | Underline | Emphasis |
| 10 | Squiggly | Suggested edits |
| 11 | StrikeOut | Text to remove |

## Proposed API Design

### Option A: New mode in `stx.io.load` (recommended)

```python
# Extract annotations only
annotations = stx.io.load("reviewed.pdf", mode="annotations")
# Returns: List[Dict] with keys: type, page, rect, content, author, date, marked_text

# Include annotations in full extraction
data = stx.io.load("reviewed.pdf", mode="full", annotations=True)
# data.annotations -> List[Dict]
```

### Option B: Standalone function

```python
from scitex_io import extract_annotations
annotations = extract_annotations("reviewed.pdf")
```

### Proposed output schema

```python
[
    {
        "type": "Highlight",           # str: annotation type name
        "type_id": 8,                  # int: PyMuPDF annotation type constant
        "page": 0,                     # int: zero-indexed page number
        "rect": [72.0, 87.0, 404.0, 106.0],  # list: bounding box [x0, y0, x1, y1]
        "content": "This needs revision",     # str: annotation text/comment
        "marked_text": "Sample text...",      # str: underlying document text (markup types only)
        "author": "Reviewer1",                # str: annotation author
        "created": "D:20260327...",           # str: creation date (if available)
        "modified": "D:20260327...",          # str: modification date (if available)
        "color": [1.0, 1.0, 0.0],            # list: RGB color (if available)
    },
    ...
]
```
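The `created` and `modified` fields in the schema hold raw PDF date strings (`D:YYYYMMDDHHmmSS` with an optional UTC offset). If consumers want `datetime` objects instead, a conversion could look roughly like the sketch below. `parse_pdf_date` is a hypothetical helper, not part of scitex-io or PyMuPDF, and it assumes the date follows the standard PDF layout; anything it cannot parse yields `None`.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional offset such as +09'00' or Z.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Z+\-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Convert a PDF date string to a datetime, or None if it cannot be parsed."""
    match = _PDF_DATE.match(value or "")
    if not match:
        return None
    year = int(match.group(1))
    # Fields after the year are optional; month and day default to 1, time to 0.
    month = int(match.group(2) or 1)
    day = int(match.group(3) or 1)
    hour, minute, second = (int(match.group(i) or 0) for i in (4, 5, 6))
    tz = None
    sign = match.group(7)
    if sign:
        # "Z" carries no hour/minute digits, so the offset stays at zero (UTC).
        offset = timedelta(
            hours=int(match.group(8) or 0), minutes=int(match.group(9) or 0)
        )
        tz = timezone(-offset if sign == "-" else offset)
    try:
        return datetime(year, month, day, hour, minute, second, tzinfo=tz)
    except ValueError:
        # Out-of-range fields, e.g. month 13.
        return None
```

Keeping the raw string in the schema and converting on demand avoids losing information when a PDF writer emits a non-standard date.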

## Integration Points

1. **scitex-io** (`_load_modules/_pdf.py`): Add a `mode="annotations"` option, plus an `annotations=True` kwarg on `_extract_full`/`_extract_scientific`. The implementation goes in a new `_pdf_annotation_extractors.py` module, following the existing pattern.

2. **scitex-writer**: A future `writer_import_annotations` MCP tool could map extracted annotations to manuscript sections for revision tracking. This depends on (1) being implemented first.

## Implementation Notes

- The existing `_pdf.py` loader already imports from `_pdf_utils`, `_pdf_text_extractors`, and `_pdf_content_extractors`. A new `_pdf_annotation_extractors.py` follows this pattern cleanly.
- No new dependencies required -- PyMuPDF is already used.
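- For markup annotations (Highlight, Underline, StrikeOut, Squiggly), `page.get_text('words', clip=annot.rect)` returns the underlying words as tuples (`x0, y0, x1, y1, word, block_no, line_no, word_no`), which must be joined to form the marked text. For a mark spanning several lines, `annot.rect` is the bounding box of the whole mark, so the clip can pick up neighboring words; `annot.vertices` exposes the individual quads if tighter matching is needed.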
- Popup annotations (type 16) should be filtered out -- they are UI artifacts, not user-created annotations.
- The `annot.info` dict provides `title` (= author in PDF spec), `content`, `creationDate`, `modDate`.
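The notes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the scitex-io implementation: the function names `extract_annotations` and `words_to_text` and the two constants are hypothetical, the output keys follow the proposed schema, and the text lookup uses the simple `annot.rect` clip rather than per-quad matching.

```python
from typing import Any, Dict, List, Sequence, Tuple

# PyMuPDF annotation type IDs, as listed in this issue.
MARKUP_TYPE_IDS = {8, 9, 10, 11}  # Highlight, Underline, Squiggly, StrikeOut
POPUP_TYPE_ID = 16


def words_to_text(words: Sequence[Tuple]) -> str:
    """Join PyMuPDF word tuples into one string.

    Each tuple is (x0, y0, x1, y1, word, block_no, line_no, word_no);
    sorting by the last three fields keeps block/line/word order.
    """
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in ordered)


def extract_annotations(path: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: collect annotations from every page of a PDF."""
    import fitz  # PyMuPDF; imported lazily so words_to_text has no dependency

    results: List[Dict[str, Any]] = []
    with fitz.open(path) as doc:
        for page in doc:
            for annot in page.annots():
                type_id, type_name = annot.type[0], annot.type[1]
                if type_id == POPUP_TYPE_ID:
                    continue  # UI artifact, not a reviewer annotation
                info = annot.info
                marked_text = ""
                if type_id in MARKUP_TYPE_IDS:
                    words = page.get_text("words", clip=annot.rect)
                    marked_text = words_to_text(words)
                stroke = (annot.colors or {}).get("stroke")
                results.append(
                    {
                        "type": type_name,
                        "type_id": type_id,
                        "page": page.number,  # zero-indexed
                        "rect": list(annot.rect),
                        "content": info.get("content", ""),
                        "marked_text": marked_text,
                        "author": info.get("title", ""),
                        "created": info.get("creationDate", ""),
                        "modified": info.get("modDate", ""),
                        "color": list(stroke) if stroke else [],
                    }
                )
    return results
```

A call such as `extract_annotations("reviewed.pdf")` would then return one dict per non-popup annotation, with `marked_text` left empty for non-markup types such as sticky notes.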

## Priority

Medium -- useful for journal revision workflow but not blocking current submissions.
