2 changes: 1 addition & 1 deletion .github/workflows/build.yaml
@@ -19,7 +19,7 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
python-version: '3.11'

- name: Update uv lock
run: uv lock
16 changes: 6 additions & 10 deletions .github/workflows/lint.yaml
@@ -4,22 +4,18 @@ on: [push]
jobs:
lint:
runs-on: ubuntu-latest
defaults:
run:
shell: bash -l {0}

steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'

- name: Install packages
run: pip install -r requirements-dev.txt
- name: Install uv
uses: astral-sh/setup-uv@v5

- name: Install pre-commit hooks
run: pre-commit install --install-hooks
- name: Install dependencies
run: uv sync --group dev

- name: Run pre-commit
run: pre-commit run --all-files
run: uv run pre-commit run --all-files
4 changes: 2 additions & 2 deletions .github/workflows/release.yaml
@@ -17,7 +17,7 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
python-version: '3.11'

- name: Update uv lock
run: uv lock
@@ -55,7 +55,7 @@ jobs:
needs: [build]
environment:
name: pypi
url: https://pypi.org/p/gemini-ocr
url: https://pypi.org/p/groundmark
permissions:
id-token: write # Required for trusted publishing
steps:
9 changes: 9 additions & 0 deletions .gitignore
@@ -9,6 +9,15 @@ wheels/
# Virtual environments
.venv

# Tool caches
.mypy_cache/
.pytest_cache/
.ruff_cache/

# Specs (local working docs)
specs/

# Output and Data
output.md
*.pdf
!tests/data/*.pdf
7 changes: 1 addition & 6 deletions .pre-commit-config.yaml
@@ -22,11 +22,6 @@ repos:
- id: markdownlint
exclude: 'tests/fixtures'

- repo: https://github.com/populationgenomics/pre-commits
rev: "e37928f761f17d54aca5cedf93848b40ec7cff26"
hooks:
- id: cpg-id-checker

- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.14.1
@@ -50,4 +45,4 @@ repos:
--non-interactive,
--config-file=./pyproject.toml
]
additional_dependencies: [types-PyYAML==6.0.4, types-toml]
additional_dependencies: []
17 changes: 0 additions & 17 deletions .readthedocs.yaml

This file was deleted.

2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2025 Tobias Sargeant
Copyright (c) 2025 Centre for Population Genomics

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
123 changes: 58 additions & 65 deletions README.md
@@ -1,93 +1,86 @@
# Gemini OCR
# groundmark

<img src="https://raw.githubusercontent.com/folded/gemini-ocr/main/docs/source/_static/gemini-ocr.svg" alt="gemini-ocr" width="200">
<img src="groundmark.webp" alt="groundmark" width="200">

## Traceable Generative Markdown for PDFs
## Grounded Markdown for PDFs

Gemini OCR is a library designed to convert PDF documents into clean, semantic Markdown while maintaining precise traceability back to the source coordinates. It bridges the gap between the readability of Generative AI (Gemini, Document AI Chunking) and the grounded accuracy of traditional OCR (Google Document AI).
**groundmark is a thin, batteries-included wrapper around [anchorite](https://github.com/populationgenomics/anchorite).** It provides concrete implementations of anchorite's provider protocols — [Pydantic AI](https://ai.pydantic.dev/) for LLM-based Markdown generation and [pdfplumber](https://github.com/jsvine/pdfplumber) for bounding box extraction — so you can go from PDF bytes to annotated Markdown in a single call. All the heavy lifting (Smith-Waterman alignment, annotation, stripping, quote resolution) lives in anchorite.

## Key Features

- **Generative Markdown**: Uses Google's Gemini Pro or Document AI Layout models to generate human-readable Markdown with proper structure (headers, tables, lists).
- **Precision Traceability**: Aligns the generated Markdown text back to the original PDF coordinates using detailed OCR data from Google Document AI.
- **Reverse-Alignment Algorithm**: Implements a robust "reverse-alignment" strategy that starts with the readable text and finds the corresponding bounding boxes, ensuring the Markdown is the ground truth for content.
- **Confidence Metrics**: (New) Includes coverage metrics to quantify how much of the Markdown content is successfully backed by OCR data.
- **Pagination Support**: Automatically handles PDF page splitting and merging logic.
Give it a PDF and a model string, get back Markdown with embedded bounding box coordinates that trace every text span back to its location in the source PDF.

## Architecture

The library processes documents in two parallel streams:

1. **Semantic Stream**: The PDF is sent to a Generative AI model (e.g., Gemini 2.5 Flash) to produce a clean Markdown representation.
2. **Positional Stream**: The PDF is sent to Google Document AI to extract raw bounding boxes and text segments.

These two streams are then merged using a custom alignment engine (`seq_smith` + `bbox_alignment.py`) which:
The library processes documents in two streams that are then merged:

1. Normalizes both text sources.
2. Identifies "anchor" comparisons for reliable alignment.
3. Computes a global alignment using the anchors to constrain the search space.
4. Identifies significant gaps or mismatches.
5. Recursively re-aligns mismatched regions until a high-quality alignment is achieved.
1. **Semantic Stream**: The PDF is sent to an LLM (via Pydantic AI) to produce clean Markdown with `<!--page-->` markers between pages.
2. **Positional Stream**: The PDF is parsed locally by pdfplumber to extract line-level text segments and their bounding boxes.
3. **Alignment**: Smith-Waterman alignment (via anchorite) maps each parsed line to its position in the Markdown, constrained by page boundaries.
4. **Annotation**: Bounding box coordinates are injected as HTML span attributes:

**Key Features:**

- **Robust to Cleanliness Issues:** Handles extra headers/footers, watermarks, and noisy OCR artifacts.
- **Scale-Invariant:** Recursion ensures even small missed sections in large documents are recovered.
```html
<span data-bbox="120,45,180,890" data-page="3">The patient presented with</span>
```
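Because the injected coordinates are plain HTML attributes, downstream tools can recover them without importing anchorite at all. A minimal sketch (the attribute names match the example above; the regex-based parsing approach is illustrative, not part of groundmark's API):

```python
import re

# Matches spans of the form produced above:
#   <span data-bbox="x0,y0,x1,y1" data-page="N">text</span>
SPAN_RE = re.compile(
    r'<span data-bbox="(?P<bbox>[\d.,]+)" data-page="(?P<page>\d+)">(?P<text>.*?)</span>'
)

def extract_spans(annotated_md: str) -> list[tuple[int, tuple[float, ...], str]]:
    """Return (page, bbox, text) triples for every annotated span."""
    out = []
    for m in SPAN_RE.finditer(annotated_md):
        bbox = tuple(float(v) for v in m.group("bbox").split(","))
        out.append((int(m.group("page")), bbox, m.group("text")))
    return out

example = '<span data-bbox="120,45,180,890" data-page="3">The patient presented with</span>'
spans = extract_spans(example)
```
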

## Quick Start

```python
import asyncio
from pathlib import Path
from gemini_ocr import gemini_ocr, settings
import groundmark as gm

async def main():
# Configure settings
ocr_settings = settings.Settings(
project="my-gcp-project",
location="us",
gcp_project_id="my-gcp-project",
layout_processor_id="projects/.../processors/...",
ocr_processor_id="projects/.../processors/...",
mode=settings.OcrMode.GEMINI,
)

file_path = Path("path/to/document.pdf")

# Process the document
result = await gemini_ocr.process_document(ocr_settings, file_path)

# Access results
pdf_bytes = open("document.pdf", "rb").read()

config = gm.Config(model="bedrock:au.anthropic.claude-sonnet-4-6")

# PDF -> annotated Markdown (one call)
result = await gm.process(pdf_bytes, config)
print(f"Coverage: {result.coverage_percent:.2%}")
print(result.annotated_markdown[:500])

# Strip for LLM consumption
stripped = gm.strip(result.annotated_markdown)
# stripped.plain_text: clean Markdown with spans removed
# stripped.validation_map: list of (start, end, Anchor) ranges

# Get annotated HTML-compatible Markdown
annotated_md = result.annotate()
print(annotated_md[:500]) # View first 500 chars
# Resolve verbatim quotes to PDF coordinates
resolved = gm.resolve(result.annotated_markdown, ["the patient presented with"])
# -> {"the patient presented with": [(page, BBox), ...]}

if __name__ == "__main__":
asyncio.run(main())
```
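The `validation_map` ranges from `strip` make it cheap to map a character offset in the stripped text back to its source anchor, e.g. when an LLM cites by offset. A minimal sketch (the stand-in anchor tuples are hypothetical; real entries are anchorite `Anchor` objects, and the exact range semantics are assumed from the description above):

```python
import bisect

# Stand-in for stripped.validation_map: (start, end, anchor) ranges
# over stripped.plain_text, sorted and non-overlapping.
validation_map = [
    (0, 12, ("page", 1)),
    (12, 30, ("page", 1)),
    (30, 55, ("page", 2)),
]

def anchor_at(offset: int):
    """Return the anchor whose [start, end) range contains offset, else None."""
    starts = [start for start, _, _ in validation_map]
    i = bisect.bisect_right(starts, offset) - 1
    if i >= 0:
        start, end, anchor = validation_map[i]
        if start <= offset < end:
            return anchor
    return None
```
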

## Debug Visualizer

The included visualizer overlays extracted bounding boxes onto the source PDF, which is useful for diagnosing alignment issues. Blue highlights show raw extracted boxes from pdfplumber; red highlights show aligned boxes from the annotated Markdown.

```bash
python -m groundmark.visualize input.pdf output.pdf --model "bedrock:au.anthropic.claude-sonnet-4-6"

# Or with cached Markdown:
python -m groundmark.visualize input.pdf output.pdf --markdown cached.md
```

![Visualizer output showing blue (raw) and red (aligned) bounding box overlays](visualize_example.jpg)

*Screenshot from Santoro et al., "Health outcomes and drug utilisation in children with Noonan syndrome: a European cohort study," Orphanet J Rare Dis 20:76 (2025). [doi:10.1186/s13023-025-03594-7](https://doi.org/10.1186/s13023-025-03594-7). CC-BY 4.0.*

## Configuration

The `gemini_ocr.settings.Settings` class controls the behavior:

| Parameter | Type | Description |
| :------------------------------- | :-------- | :--------------------------------------------------------------- |
| `project` | `str` | GCP Project Name |
| `location` | `str` | GCP Location (e.g., `us`, `eu`) |
| `gcp_project_id` | `str` | GCP Project ID (might be same as `project`) |
| `layout_processor_id` | `str` | Document AI Processor ID for Layout (if using `DOCUMENTAI` mode) |
| `ocr_processor_id` | `str` | Document AI Processor ID for OCR (required for bounding boxes) |
| `mode` | `OcrMode` | `GEMINI` (default), `DOCUMENTAI`, or `DOCLING` |
| `gemini_model_name` | `str` | Gemini model to use (default: `gemini-2.5-flash`) |
| `alignment_uniqueness_threshold` | `float` | Min score ratio for unique match (default: `0.5`) |
| `alignment_min_overlap` | `float` | Min overlap fraction for valid match (default: `0.9`) |
| `include_bboxes` | `bool` | Whether to perform alignment (default: `True`) |
| `markdown_page_batch_size` | `int` | Pages per batch for Markdown generation (default: `10`) |
| `ocr_page_batch_size` | `int` | Pages per batch for OCR (default: `10`) |
| `num_jobs` | `int` | Max concurrent jobs (default: `10`) |
| `cache_dir` | `str` | Directory to store API response cache (default: `.docai_cache`) |
### Timeouts

The LLM call for PDF-to-Markdown conversion can take several minutes for large documents, especially with Opus on Bedrock. Timeout defaults by provider:

| Provider | Default | Environment Variable |
|----------|---------|---------------------|
| Bedrock (boto3) | 300s | `AWS_READ_TIMEOUT` |
| Anthropic (httpx) | 600s | — (use `ModelSettings(timeout=...)`) |

For Bedrock with Opus, 300s may not be enough. Set a higher timeout:

```bash
export AWS_READ_TIMEOUT=600
```

## License

20 changes: 0 additions & 20 deletions docs/Makefile

This file was deleted.

35 changes: 0 additions & 35 deletions docs/make.bat

This file was deleted.

4 changes: 0 additions & 4 deletions docs/requirements.txt

This file was deleted.
