
feat: PDF document importer #186

Open
ion-kitty wants to merge 6 commits into seed-hypermedia:main from ion-kitty:feat/pdf-importer

Summary

Add PDF import support to the Seed desktop app, following the same architecture as the existing Markdown and LaTeX importers.

What it does

  • Import PDF files and directories via the existing import dialog
  • Extract text from PDFs using pdfjs-dist
  • Detect headings via font size clustering (most common size = body text, larger sizes → h1/h2/h3)
  • Preserve styling — bold, italic, and monospace from font metadata
  • Detect lists — bullet (•, -, –, etc.) and numbered (1., 2., a., etc.)
  • Code blocks — consecutive monospace lines grouped into code blocks
  • Paragraph merging — multi-line text joined with hyphen-aware word joining
  • Heading hierarchy — blocks organized under their parent headings
  • Title extraction — largest font on first page used as document title
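
The font-size clustering rule for headings can be sketched roughly as follows. This is a minimal illustration, not the code in PdfToBlocks.ts: the `TextLine` shape, the `classifyFontSizes` name, the rounding to one decimal, and the clamping of extra sizes to h3 are all assumptions made for the example.

```typescript
// Hypothetical shape of an extracted text line; not the PR's actual types.
interface TextLine {
  text: string
  fontSize: number
}

type LineKind = 'body' | 'h1' | 'h2' | 'h3'

// Map each distinct font size to a kind: the most frequent size is treated as
// body text, and sizes larger than it become h1, h2, h3 in descending order.
function classifyFontSizes(lines: TextLine[]): Map<number, LineKind> {
  const counts = new Map<number, number>()
  for (const line of lines) {
    // Assumption: round to one decimal so near-identical sizes share a cluster.
    const size = Math.round(line.fontSize * 10) / 10
    counts.set(size, (counts.get(size) ?? 0) + 1)
  }

  // Find the most common size (first seen wins on ties).
  let bodySize = 0
  let bodyCount = -1
  counts.forEach((count, size) => {
    if (count > bodyCount) {
      bodySize = size
      bodyCount = count
    }
  })

  const kinds = new Map<number, LineKind>()
  const larger = Array.from(counts.keys())
    .filter((size) => size > bodySize)
    .sort((a, b) => b - a)
  larger.forEach((size, i) => {
    // Assumption: anything beyond the third-largest size is clamped to h3.
    kinds.set(size, i === 0 ? 'h1' : i === 1 ? 'h2' : 'h3')
  })
  // Everything at or below the body size is treated as body text.
  counts.forEach((_count, size) => {
    if (!kinds.has(size)) kinds.set(size, 'body')
  })
  return kinds
}
```

One consequence of this rule is that a document set entirely in one size yields no headings at all, which is probably the desired behaviour for plain-text PDFs.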

Architecture

Follows the same pattern as MarkdownToBlocks.ts and LatexToBlocks.ts:

  • PdfToBlocks.ts — core converter (PdfToBlocks() and extractPdfTitle())
  • IPC handlers in main.ts for native file/directory dialog
  • Preload bridge in preload.ts
  • AppContext extended with openPdfFiles/openPdfDirectories
  • Import dialog buttons for PDF file and directory import

Test coverage (12 tests)

  • Empty PDFs, simple paragraphs, heading detection
  • Bold/italic/monospace styling preservation
  • Bullet and numbered list detection
  • Multi-page PDFs, heading hierarchy organization
  • Title extraction, scanned PDF (no text) graceful handling

Dependencies

  • pdfjs-dist ^3.11.174 — PDF text extraction
  • pdf-lib ^1.17.1 — test PDF generation (devDependency)

Add a markdown API that returns SHM document content as plain text/markdown
when any URL is requested with a .md extension.

This makes Seed Hypermedia content accessible to AI agents, CLI tools, and
bots without needing to parse HTML or install the desktop app.

Features:
- GET any-path.md returns text/markdown
- Supports document and comment resources
- Handles all block types (Paragraph, Heading, Code, Image, Embed, etc.)
- Text annotations (bold, italic, links, code) converted to markdown syntax
- Optional YAML frontmatter via ?frontmatter query parameter
- Version pinning via ?v=bafy2bz... query parameter
- X-Hypermedia-Id and X-Hypermedia-Version response headers
- Handles nested block structures (lists, blockquotes)
- IPFS media URLs converted to gateway URLs

Example:
  curl https://hyper.media/cli-guide.md
  curl https://hyper.media/guides/publishing.md?frontmatter
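
For scripted clients, the URL conventions above (the .md extension, ?frontmatter, and ?v=) could be wrapped in a small helper. This is only a sketch: `buildMarkdownUrl` and `MarkdownUrlOptions` are hypothetical names, and combining both query parameters in one request is an assumption, not something this commit message states.

```typescript
// Hypothetical options mirroring the two query parameters described above.
interface MarkdownUrlOptions {
  frontmatter?: boolean // adds the bare ?frontmatter flag
  version?: string // adds ?v=<version> for version pinning
}

// Build a URL for the markdown representation of a document path.
function buildMarkdownUrl(
  origin: string,
  path: string,
  options: MarkdownUrlOptions = {},
): string {
  // Normalize: drop leading slashes and any .md the caller already appended.
  const cleanPath = path.replace(/^\/+/, '').replace(/\.md$/, '')
  const query: string[] = []
  if (options.frontmatter) query.push('frontmatter')
  if (options.version) query.push('v=' + encodeURIComponent(options.version))
  const suffix = query.length > 0 ? '?' + query.join('&') : ''
  return origin.replace(/\/+$/, '') + '/' + cleanPath + '.md' + suffix
}

// Reproduces the second curl example above.
const exampleUrl = buildMarkdownUrl('https://hyper.media', 'guides/publishing', {
  frontmatter: true,
})
```

The resulting string can be passed to `fetch` or `curl`; per the feature list, the response should carry the X-Hypermedia-Id and X-Hypermedia-Version headers.
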
- Fix mention name resolution: @mentions now resolve to display names instead of [@
- Add embed block inlining: Load destination document content and inject into markdown
- Add query block resolution placeholder: Convert query blocks to comments with query text
- Make markdown generation async to support name resolution and content loading

Issues addressed:
1. Mention name resolution in getAnnotationMarker()
2. Embed blocks now inline content from destination documents
3. Query blocks show query text instead of generic comment
4. All functions updated to async for proper resolution
- Add caching for embed content to avoid repeated fetches
- Add caching for account names to reduce duplicate lookups
- Improve mention annotation handling (don't show link syntax for mentions)
- Enhance query block resolution with better placeholder text
- Fix embed block indentation and caching

These changes address Eric's concern about slow .md page resolution by:
1. Caching repeated content lookups
2. Avoiding redundant gRPC calls
3. Optimizing the annotation processing flow
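
One common way to get this kind of caching is to memoize the in-flight promise per key, so concurrent lookups for the same account share a single request. The sketch below illustrates that pattern only; `createCachedResolver` and `NameResolver` are hypothetical names, and the actual cache in this commit may be structured differently.

```typescript
// Hypothetical signature for whatever performs the real lookup (e.g. an RPC).
type NameResolver = (accountId: string) => Promise<string>

// Wrap a resolver so each account ID is looked up at most once.
function createCachedResolver(resolve: NameResolver): NameResolver {
  // Cache promises rather than values so concurrent callers share one request.
  const cache = new Map<string, Promise<string>>()
  return (accountId: string) => {
    const cached = cache.get(accountId)
    if (cached) return cached
    const pending = resolve(accountId).catch((err) => {
      // Do not keep failures cached; a later call may succeed.
      cache.delete(accountId)
      throw err
    })
    cache.set(accountId, pending)
    return pending
  }
}
```

The cache here lives as long as the wrapper does, so creating one wrapper per request keeps names from going stale across requests.
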
… handling

- Link annotations pointing to hm:// accounts (mentions) now resolve
  the account display name instead of showing [@](hm://...)
- Embed annotations use standard link syntax instead of broken @mention
- Fixes Eric's review feedback on PR seed-hypermedia#181
- Query blocks now execute actual queries via serverUniversalClient
  and render results as markdown lists of links
- Added prewarmEmbedCache() to resolve all embeds and account mentions
  in parallel before sequential markdown generation
- Eliminates sequential N+1 fetch pattern for documents with many embeds
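
The prewarm-then-render idea can be sketched as below. This is an illustration under assumptions, not the PR's prewarmEmbedCache(): the `Block` shape, `collectEmbedIds`, `prewarm`, and the `fetchEmbed` callback are hypothetical names introduced for the example.

```typescript
// Hypothetical, simplified block tree; real blocks carry much more data.
interface Block {
  embedId?: string
  children?: Block[]
}

// Walk the tree once and collect the distinct embed IDs it references.
function collectEmbedIds(blocks: Block[], out: Set<string> = new Set()): Set<string> {
  for (const block of blocks) {
    if (block.embedId) out.add(block.embedId)
    if (block.children) collectEmbedIds(block.children, out)
  }
  return out
}

// Fetch every embed in parallel and return a lookup table, so the later
// sequential markdown pass can read from memory instead of awaiting per block.
async function prewarm(
  blocks: Block[],
  fetchEmbed: (id: string) => Promise<string>,
): Promise<Map<string, string>> {
  const ids = Array.from(collectEmbedIds(blocks))
  // All requests start before any is awaited, so they run concurrently.
  const contents = await Promise.all(ids.map((id) => fetchEmbed(id)))
  const cache = new Map<string, string>()
  ids.forEach((id, i) => cache.set(id, contents[i]))
  return cache
}
```

Note that `Promise.all` rejects as soon as any single fetch fails; if one broken embed should not block the whole page, `Promise.allSettled` would be the more forgiving choice.
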
Add PDF import support to the desktop app, following the same architecture
as the existing Markdown and LaTeX importers.

Core converter (PdfToBlocks.ts):
- Extracts text from PDFs using pdfjs-dist
- Detects headings via font size clustering (most common size = body,
  larger sizes mapped to h1/h2/h3)
- Preserves bold, italic, and monospace styling from font metadata
- Detects bullet and numbered lists from text patterns
- Groups monospace lines into code blocks
- Merges multi-line text into paragraphs with hyphen-aware joining
- Organizes blocks into heading hierarchy
- Exports extractPdfTitle() for document title extraction
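
To make the list detection and hyphen-aware joining steps concrete, here is a small sketch of each. The regular expressions, `detectListKind`, and `joinLines` are assumptions for illustration and are not taken from PdfToBlocks.ts; in particular, the real converter recognizes more bullet characters than the three shown.

```typescript
type ListKind = 'bullet' | 'numbered' | null

// Assumption: a marker must be followed by whitespace to count as a list item.
const BULLET_RE = /^\s*[•\-–]\s+/
const NUMBERED_RE = /^\s*(?:\d+|[a-z])\.\s+/

// Classify a line by its leading marker, or return null for ordinary text.
function detectListKind(line: string): ListKind {
  if (BULLET_RE.test(line)) return 'bullet'
  if (NUMBERED_RE.test(line)) return 'numbered'
  return null
}

// Merge the physical lines of one paragraph into a single string.
function joinLines(lines: string[]): string {
  let result = ''
  for (const raw of lines) {
    const line = raw.trim()
    if (line === '') continue
    if (result === '') {
      result = line
    } else if (/[A-Za-z]-$/.test(result) && /^[a-z]/.test(line)) {
      // A trailing hyphen followed by a lowercase letter is treated as a
      // word split across lines: drop the hyphen and join without a space.
      result = result.slice(0, -1) + line
    } else {
      result += ' ' + line
    }
  }
  return result
}
```

A heuristic like this will also merge genuine compounds that happen to break at the hyphen ("well-" / "known" becomes "wellknown"), so a production converter would likely need a dictionary check or a more conservative rule.
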

Desktop app integration:
- IPC handlers for file and directory selection (main.ts)
- Preload bridge for PDF file/directory open (preload.ts)
- AppContext type and provider extended with openPdfFiles/openPdfDirectories
- Import dialog updated with PDF file and directory buttons
- Import flow handles PDF content through the same confirm dialog

Test coverage (12 tests):
- Empty PDFs, simple paragraphs, heading detection
- Bold/italic/monospace styling preservation
- Bullet and numbered list detection
- Multi-page PDFs, heading hierarchy
- Title extraction, scanned PDF (no text) handling

Dependencies:
- pdfjs-dist ^3.11.174 (text extraction)
- pdf-lib ^1.17.1 (test PDF generation, devDependency)
- vitest config and jsdom polyfills for DOMMatrix/Path2D