feat: PDF document importer by ion-kitty · Pull Request #186 · seed-hypermedia/seed

ion-kitty · 2026-02-10T23:11:58Z

Summary

Add PDF import support to the Seed desktop app, following the same architecture as the existing Markdown and LaTeX importers.

What it does

Import PDF files and directories via the existing import dialog
Extract text from PDFs using pdfjs-dist
Detect headings via font size clustering (most common size = body text, larger sizes → h1/h2/h3)
Preserve styling — bold, italic, and monospace from font metadata
Detect lists — bullet (•, -, –, etc.) and numbered (1., 2., a., etc.)
Code blocks — consecutive monospace lines grouped into code blocks
Paragraph merging — multi-line text joined with hyphen-aware word joining
Heading hierarchy — blocks organized under their parent headings
Title extraction — largest font on first page used as document title
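
To illustrate the font size clustering idea, here is a minimal sketch. It is not the code in PdfToBlocks.ts: the `TextLine` shape and the `findBodySize` and `classifyLines` helpers are hypothetical names, and it assumes font sizes have already been extracted per line.

```typescript
// Hypothetical shape for one extracted line; not the PR's actual type.
interface TextLine {
  text: string
  fontSize: number
}

type LineKind = 'body' | 'h1' | 'h2' | 'h3'

// Round to one decimal so near-identical sizes fall into the same cluster.
function roundSize(size: number): number {
  return Math.round(size * 10) / 10
}

// The most frequent font size is treated as body text.
function findBodySize(lines: TextLine[]): number {
  const counts = new Map<number, number>()
  lines.forEach((line) => {
    const size = roundSize(line.fontSize)
    counts.set(size, (counts.get(size) ?? 0) + 1)
  })
  let best = 0
  let bestCount = -1
  counts.forEach((count, size) => {
    if (count > bestCount) {
      best = size
      bestCount = count
    }
  })
  return best
}

// Distinct sizes larger than body are ranked: largest = h1, next = h2,
// everything else larger than body = h3.
function classifyLines(lines: TextLine[]): LineKind[] {
  const body = findBodySize(lines)
  const sizes = lines.map((line) => roundSize(line.fontSize))
  const larger = Array.from(new Set(sizes.filter((s) => s > body))).sort(
    (a, b) => b - a,
  )
  return sizes.map((size): LineKind => {
    const rank = larger.indexOf(size)
    if (rank === -1) return 'body'
    return rank === 0 ? 'h1' : rank === 1 ? 'h2' : 'h3'
  })
}
```

Ranking distinct sizes rather than using fixed point thresholds keeps the sketch independent of any particular document's typography; the real converter may well use a different rule.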

Architecture

Follows the same pattern as MarkdownToBlocks.ts and LatexToBlocks.ts:

PdfToBlocks.ts — core converter (PdfToBlocks() and extractPdfTitle())
IPC handlers in main.ts for native file/directory dialog
Preload bridge in preload.ts
AppContext extended with openPdfFiles/openPdfDirectories
Import dialog buttons for PDF file and directory import

Test coverage (12 tests)

Empty PDFs, simple paragraphs, heading detection
Bold/italic/monospace styling preservation
Bullet and numbered list detection
Multi-page PDFs, heading hierarchy organization
Title extraction, scanned PDF (no text) graceful handling
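
The bullet and numbered list cases above suggest a marker classifier along the following lines. This is a sketch under assumptions: `detectListItem` and the two regular expressions are hypothetical, cover only the markers named in the description (•, -, –, `1.`, `a.`), and are not taken from the PR's source.

```typescript
type ListKind = 'bullet' | 'numbered' | null

interface ListMatch {
  kind: ListKind
  text: string
}

// A bullet marker followed by whitespace, then the item text.
const BULLET = /^\s*[•\-–]\s+(.*)$/
// Digits or a single lowercase letter, a dot, whitespace, then the item text.
const NUMBERED = /^\s*(?:\d+|[a-z])\.\s+(.*)$/

// Classifies one line and strips its marker; non-list lines are returned unchanged.
function detectListItem(line: string): ListMatch {
  const bullet = BULLET.exec(line)
  if (bullet) return {kind: 'bullet', text: bullet[1] ?? ''}
  const numbered = NUMBERED.exec(line)
  if (numbered) return {kind: 'numbered', text: numbered[1] ?? ''}
  return {kind: null, text: line}
}
```

Requiring whitespace after the marker keeps strings such as `-5` out of the bullet case. A pattern this simple will still misread some prose (for example a line starting with `e. coli`), so a real implementation would likely need surrounding context as well.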

Dependencies

pdfjs-dist ^3.11.174 — PDF text extraction
pdf-lib ^1.17.1 — test PDF generation (devDependency)

Add a markdown API that returns SHM document content as plain text/markdown when any URL is requested with a .md extension. This makes Seed Hypermedia content accessible to AI agents, CLI tools, and bots without needing to parse HTML or install the desktop app.

Features:
- GET any-path.md returns text/markdown
- Supports document and comment resources
- Handles all block types (Paragraph, Heading, Code, Image, Embed, etc.)
- Text annotations (bold, italic, links, code) converted to markdown syntax
- Optional YAML frontmatter via ?frontmatter query parameter
- Version pinning via ?v=bafy2bz... query parameter
- X-Hypermedia-Id and X-Hypermedia-Version response headers
- Handles nested block structures (lists, blockquotes)
- IPFS media URLs converted to gateway URLs

Example:
curl https://hyper.media/cli-guide.md
curl https://hyper.media/guides/publishing.md?frontmatter
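
A client script could build these URLs with a small helper. The sketch below is illustrative only: `markdownUrl` and `MarkdownOptions` are hypothetical names, and combining `frontmatter` with `v` in one request is an assumption, since the commit message shows each parameter separately.

```typescript
// Hypothetical options mirroring the two query parameters named above.
interface MarkdownOptions {
  frontmatter?: boolean
  version?: string
}

// Appends .md to a page URL and adds the optional query parameters.
function markdownUrl(pageUrl: string, options: MarkdownOptions = {}): string {
  const base = pageUrl.replace(/\/+$/, '') // drop trailing slashes
  const path = /\.md$/.test(base) ? base : base + '.md'
  const params: string[] = []
  // The example in the commit message uses a bare ?frontmatter flag.
  if (options.frontmatter) params.push('frontmatter')
  if (options.version) params.push('v=' + encodeURIComponent(options.version))
  return params.length > 0 ? path + '?' + params.join('&') : path
}
```

For instance, `markdownUrl('https://hyper.media/guides/publishing', {frontmatter: true})` yields the second curl URL shown above.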

- Fix mention name resolution: @mentions now resolve to display names instead of [@
- Add embed block inlining: Load destination document content and inject into markdown
- Add query block resolution placeholder: Convert query blocks to comments with query text
- Make markdown generation async to support name resolution and content loading

Issues addressed:
1. Mention name resolution in getAnnotationMarker()
2. Embed blocks now inline content from destination documents
3. Query blocks show query text instead of generic comment
4. All functions updated to async for proper resolution

- Add caching for embed content to avoid repeated fetches
- Add caching for account names to reduce duplicate lookups
- Improve mention annotation handling (don't show link syntax for mentions)
- Enhance query block resolution with better placeholder text
- Fix embed block indentation and caching

These changes address Eric's concern about slow .md page resolution by:
1. Caching repeated content lookups
2. Avoiding redundant grpc calls
3. Optimizing the annotation processing flow

… handling
- Link annotations pointing to hm:// accounts (mentions) now resolve the account display name instead of showing [@](hm://...)
- Embed annotations use standard link syntax instead of broken @mention
- Fixes Eric's review feedback on PR seed-hypermedia#181

- Query blocks now execute actual queries via serverUniversalClient and render results as markdown lists of links
- Added prewarmEmbedCache() to resolve all embeds and account mentions in parallel before sequential markdown generation
- Eliminates sequential N+1 fetch pattern for documents with many embeds
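
The pre-warming pattern described in this commit can be sketched as follows. This is not the actual prewarmEmbedCache() implementation: `prewarmCache` and the `Fetcher` type are hypothetical, and the sketch assumes each resource can be fetched independently by a string ID.

```typescript
// Hypothetical fetcher: resolves one resource (embed or account name) by ID.
type Fetcher = (id: string) => Promise<string>

// Fetches every uncached ID in parallel and stores the results, so that a
// later sequential pass can read from the cache instead of awaiting one
// fetch per block.
async function prewarmCache(
  ids: string[],
  fetchOne: Fetcher,
  cache: Map<string, string> = new Map(),
): Promise<Map<string, string>> {
  // Deduplicate and skip anything already cached.
  const pending = Array.from(new Set(ids)).filter((id) => !cache.has(id))
  await Promise.all(
    pending.map(async (id) => {
      cache.set(id, await fetchOne(id))
    }),
  )
  return cache
}
```

With `Promise.all`, a single failed fetch rejects the whole pre-warm; a production version might prefer `Promise.allSettled` so that one broken embed does not block the rest of the document.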

Add PDF import support to the desktop app, following the same architecture as the existing Markdown and LaTeX importers.

Core converter (PdfToBlocks.ts):
- Extracts text from PDFs using pdfjs-dist
- Detects headings via font size clustering (most common size = body, larger sizes mapped to h1/h2/h3)
- Preserves bold, italic, and monospace styling from font metadata
- Detects bullet and numbered lists from text patterns
- Groups monospace lines into code blocks
- Merges multi-line text into paragraphs with hyphen-aware joining
- Organizes blocks into heading hierarchy
- Exports extractPdfTitle() for document title extraction

Desktop app integration:
- IPC handlers for file and directory selection (main.ts)
- Preload bridge for PDF file/directory open (preload.ts)
- AppContext type and provider extended with openPdfFiles/openPdfDirectories
- Import dialog updated with PDF file and directory buttons
- Import flow handles PDF content through the same confirm dialog

Test coverage (12 tests):
- Empty PDFs, simple paragraphs, heading detection
- Bold/italic/monospace styling preservation
- Bullet and numbered list detection
- Multi-page PDFs, heading hierarchy
- Title extraction, scanned PDF (no text) handling

Dependencies:
- pdfjs-dist ^3.11.174 (text extraction)
- pdf-lib ^1.17.1 (test PDF generation, devDependency)
- vitest config and jsdom polyfills for DOMMatrix/Path2D
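
As an illustration of hyphen-aware joining, the sketch below merges the lines of one paragraph. `joinLines` is a hypothetical helper, not the function in PdfToBlocks.ts, and its rule (a trailing hyphen after a letter, followed by a line starting with a lowercase letter, marks a word split across lines) is an assumption.

```typescript
// Joins the physical lines of one paragraph into a single string.
function joinLines(lines: string[]): string {
  let out = ''
  for (const raw of lines) {
    const line = raw.trim()
    if (line === '') continue // skip blank lines
    if (out === '') {
      out = line
      continue
    }
    if (/[A-Za-z]-$/.test(out) && /^[a-z]/.test(line)) {
      // Word broken across lines: drop the hyphen and join directly.
      out = out.slice(0, -1) + line
    } else {
      out += ' ' + line
    }
  }
  return out
}
```

This heuristic cannot tell a line-break hyphen from a genuine compound such as "hyphen-aware" that happens to wrap at the hyphen, so it would merge that into one word; handling that case would need a dictionary or similar check.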

ion-kitty added 6 commits February 8, 2026 02:06

ion-kitty wants to merge 6 commits into seed-hypermedia:main from ion-kitty:feat/pdf-importer
