Refactor groundmark as thin wrapper on anchorite #1

Open
lgruen-vcgs wants to merge 2 commits into main from initial-implementation

lgruen-vcgs commented Mar 13, 2026

Summary

Rewrites groundmark (forked from gemini-ocr) as a batteries-included wrapper around anchorite. Give it a PDF + model string, get annotated Markdown with embedded bounding box coordinates.

What groundmark provides (concrete anchorite provider implementations):

  • PydanticAIMarkdownProvider — PDF-to-Markdown via any Pydantic AI model, with a line-number workaround for Claude's content filter
  • PdfplumberAnchorProvider — line-level bbox extraction via pdfplumber (MIT), with table detection and ligature decomposition
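
To illustrate what "ligature decomposition" means for text extracted from a PDF, here is a minimal sketch using only the standard library. It is not taken from parse.py; the function name `decompose_ligatures` is hypothetical, and the assumption is that only the Latin presentation-form ligatures (U+FB00 to U+FB06) need expanding so that extracted text matches the plain letters an LLM would emit.

```python
import unicodedata


def decompose_ligatures(text: str) -> str:
    """Expand Latin ligature code points (U+FB00..U+FB06) into plain letters.

    Only the ligature characters are normalized; all other characters are
    passed through untouched so that symbols such as fractions keep their form.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


# "e" + U+FB03 (ffi ligature) + "cient", "work" + U+FB02 (fl ligature) + "ow"
print(decompose_ligatures("e\ufb03cient work\ufb02ow"))  # efficient workflow
```

Restricting normalization to the ligature range, rather than applying NFKC to the whole string, avoids rewriting unrelated characters in the extracted text.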

What anchorite provides (delegated entirely):

  • Smith-Waterman alignment, annotation (<span> injection), stripping, fuzzy quote resolution
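
For readers unfamiliar with Smith-Waterman, the following is a textbook sketch of local alignment with a linear gap penalty. It is not anchorite's implementation: the function name, scoring parameters, and return shape are assumptions chosen for illustration. It shows how a short extracted line can be located inside a longer text even when it is surrounded by unrelated characters.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (best_score, (a_start, a_end), (b_start, b_end)) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells are floored at 0 so an alignment can start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0 or an edge is reached.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end_i), (j, end_j)


text = "xxHello worldyy"
_, (start, end), _ = smith_waterman(text, "Hello world")
print(text[start:end])  # Hello world
```

The quadratic matrix is fine for line-sized inputs; a production aligner would typically work on tokens or use banding to keep memory bounded on full documents.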

Key changes from gemini-ocr

  • Replace Gemini SDK → Pydantic AI (any LLM via model string)
  • Replace Document AI → pdfplumber (local, no API dependency, MIT licensed)
  • Replace inline alignment/annotation/resolution code → anchorite
  • Delete Airflow scripts, GCS support, settings.py, run_ocr.py
  • Delete Sphinx docs (4 source files don't need a docs site)
  • Single async process() entry point with chunking support
  • Debug visualizer using pypdf Highlight annotations
  • Reset version to 0.1.0
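
The "single async process() entry point with chunking support" item can be pictured with the sketch below. It is an assumption-laden illustration, not the code in process.py: `page_chunks`, `convert_chunk`, the `chunk_size` parameter, and the idea that chunks are converted concurrently are all hypothetical, and the per-chunk LLM call is replaced with a stub.

```python
import asyncio


def page_chunks(total_pages: int, chunk_size: int) -> list[range]:
    """Split zero-based page indices into consecutive chunks of at most chunk_size pages."""
    return [
        range(start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


async def convert_chunk(pages: range) -> str:
    """Hypothetical stand-in for a per-chunk PDF-to-Markdown model call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"<!-- pages {pages.start + 1}-{pages.stop} -->"


async def process(total_pages: int, chunk_size: int = 10) -> str:
    """Convert all chunks concurrently and join the results in page order."""
    parts = await asyncio.gather(
        *(convert_chunk(chunk) for chunk in page_chunks(total_pages, chunk_size))
    )
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(asyncio.run(process(total_pages=25, chunk_size=10)))
```

`asyncio.gather` returns results in the order the awaitables were passed, so the joined Markdown stays in page order even if chunks finish out of order.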

What's left (4 source files)

| File | Lines | Role |
| --- | --- | --- |
| process.py | 79 | Entry point: wires providers into anchorite.process_document() |
| markdown.py | 94 | MarkdownProvider: Pydantic AI + prompt + line-number workaround |
| parse.py | 85 | AnchorProvider: pdfplumber extraction with table detection |
| visualize.py | 165 | Debug CLI: overlays raw vs aligned boxes on PDF |

Test plan

  • uv run pytest tests/ -x -q — 6 tests pass
  • uv run ruff check src/ tests/ — clean
  • Verified visualize.py runs against a real PDF with the anchorite fix branch
  • End-to-end test with flowa (next step after this lands)

lgruen-vcgs requested a review from folded on March 13, 2026 06:47
lgruen-vcgs marked this pull request as draft on March 13, 2026 14:31
lgruen-vcgs marked this pull request as ready for review on March 14, 2026 05:40
Update LICENSE copyright to Centre for Population Genomics, add new
logo, and rewrite README to reflect the provider-agnostic scope
(Pydantic AI + docling-parse replacing Gemini + Document AI).
lgruen-vcgs force-pushed the initial-implementation branch from 922ea3c to 5dd2260 on March 14, 2026 05:41