fix: extract plain-text URLs from Google Docs HTML export #20

Merged
gvonness-apolitical merged 1 commit into main from fix/extract-plain-text-urls-from-google-docs on Feb 3, 2026

@gvonness-apolitical
Collaborator

Summary

  • Google Docs HTML exports can contain bare plain-text URLs alongside hyperlinked ones, and may split URLs across <span> elements — both cases were previously missed
  • Replaced the two-pass extraction (hrefs first, then plain-text) with a single-pass strategy that resolves anchor hrefs inline, strips HTML tags, and scans once
  • Preserves document order for all extracted URLs regardless of whether they were linked or bare
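
The single-pass strategy described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code merged in this PR: the function name `extract_urls`, the regular expressions, and the choice to treat block-level tags as whitespace are all assumptions made for the example.

```python
import html
import re
from typing import List

# Hypothetical patterns for illustration; the PR's actual regexes are not shown here.
# Matches a whole <a ...>...</a> element and captures its href value in group 2.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>.*?</a\s*>',
    re.IGNORECASE | re.DOTALL,
)
# Block-level tags separate content, so they become whitespace.
BLOCK_TAG_RE = re.compile(
    r'</?(?:p|div|br|li|ul|ol|table|tr|td|th|h[1-6])\b[^>]*>',
    re.IGNORECASE,
)
# Any remaining (inline) tag, e.g. <span>, is removed without adding whitespace.
TAG_RE = re.compile(r'<[^>]+>')
URL_RE = re.compile(r'https?://[^\s<>"\']+')


def extract_urls(doc_html: str) -> List[str]:
    """Return every URL in document order, whether linked or bare."""
    # 1. Resolve anchors inline: replace each <a> element with its href,
    #    padded with spaces so it cannot fuse with neighbouring text.
    text = ANCHOR_RE.sub(lambda m: " " + m.group(2) + " ", doc_html)
    # 2. Turn block-level tags into spaces so adjacent paragraphs stay apart.
    text = BLOCK_TAG_RE.sub(" ", text)
    # 3. Drop inline tags entirely so a URL split across <span>s is rejoined.
    text = TAG_RE.sub("", text)
    # 4. Decode entities such as &amp; that appear in hrefs and text.
    text = html.unescape(text)
    # 5. Scan once; match order is document order.
    return [m.group(0) for m in URL_RE.finditer(text)]


sample = (
    '<p><a href="https://one.example/post/1">first</a></p>'
    '<p><span>https://two.exam</span><span>ple/post/2</span></p>'
    '<p><a href="https://three.example/post/3">third</a></p>'
)
print(extract_urls(sample))
```

Because the anchors are rewritten in place before the scan, linked and bare URLs end up in one text stream, which is what keeps their relative order intact. The sketch does not trim trailing punctuation or deduplicate, and a real HTML parser would be more robust than regexes for malformed markup.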

Test plan

  • Added test: mixed content with both linked and bare URLs — all extracted
  • Added test: plain-text URLs in HTML with no Tumblr hrefs — still extracted
  • Added test: URL split across <span> tags — reassembled and extracted
  • Added test: interleaved linked and bare URLs preserve document order
  • All 236 existing + new tests pass
  • Clean production build

Google Docs can contain a mix of hyperlinked and bare plain-text URLs,
and may split plain-text URLs across <span> elements. The previous
two-pass approach skipped plain-text extraction when hrefs were found,
and didn't strip HTML tags before regex matching.

Replace the two-pass approach with a single-pass strategy: resolve
anchor hrefs inline as plain text, strip remaining tags, then scan
once — preserving document order for both linked and bare URLs.

@gvonness-apolitical merged commit cf98816 into main on Feb 3, 2026
2 checks passed
@gvonness-apolitical deleted the fix/extract-plain-text-urls-from-google-docs branch on February 3, 2026 at 13:18