CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Python CLI tool that fetches web content, extracts it as markdown, and creates formatted Google Docs. Uses a hybrid extraction pipeline: parallel aiohttp + Playwright fetching, BeautifulSoup HTML cleaning, content scoring/pruning, and multi-strategy extraction (trafilatura, multi-div, CSS-targeted) to maximize content quality.

Commands

# Run the tool
python fetch_markdown.py "https://example.com/article"

# Multiple URLs
python fetch_markdown.py url1 url2 url3

# With options
python fetch_markdown.py --no-clean --pruning-threshold 0.6 --min-words 30 url1

# Recursive mode: crawl all sub-pages into a single tabbed Google Doc
python fetch_markdown.py --recursive "https://docs.example.com/guide"

# Recursive mode with page limit (skips confirmation prompt)
python fetch_markdown.py --recursive --max-pages 20 "https://docs.example.com/guide"

# Install dependencies (pip)
pip install -r requirements.txt

# Install dependencies (conda)
conda install -c conda-forge aiohttp tqdm google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client markdown-it-py
conda install anaconda::beautifulsoup4
pip install trafilatura playwright lxml html2text

# Install Playwright browser
playwright install chromium

Architecture

The pipeline flows: URLs -> Parallel Fetch -> HTML Clean -> Content Prune -> Multi-Strategy Extract -> Markdown Post-Process -> Google Doc Creation
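The fan-out/fan-in shape of this flow can be sketched as below. The fetch and extract bodies are illustrative placeholders, not the repository's actual functions; only the concurrent-fetch-then-extract shape comes from the description above.

```python
import asyncio

async def fetch(url: str) -> str:
    # Stands in for the aiohttp/Playwright I/O stage.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

def extract(html: str) -> str:
    # Stands in for the clean/prune/extract stages.
    return html.replace("<html>", "").replace("</html>", "")

async def run_pipeline(urls: list[str]) -> list[str]:
    # All URLs are fetched concurrently, then each page is extracted.
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    return [extract(p) for p in pages]

results = asyncio.run(run_pipeline(["https://a.example", "https://b.example"]))
print(results)
```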

Module Responsibilities

  • fetch_markdown.py — Main CLI entry point and orchestrator. Contains main() async loop, extract_url_content() (extraction without doc creation), process_url() (extraction + doc creation), main_recursive() (recursive mode orchestrator), batch retry logic, CLI argument parsing, interactive mode with mode selection. ExtractionConfig is created from CLI args and passed through the call chain (no global state).
  • extraction.py — ExtractionConfig dataclass, fetch_html() for aiohttp fetching, extract_with_multi_div(), extract_with_css_selectors(), and apply_extraction_pipeline() for the clean/prune pipeline.
  • playwright_fetch.py — fetch_with_playwright() using a shared BrowserContext (created once in main()), and smart_wait_for_content() with combined CSS selector waiting.
  • google_drive.py — create_google_doc() (tags docs with appProperties for cross-mode dedup), _build_doc_title_cache_sync(), find_standalone_docs_for_urls_sync(), find_all_tabbed_base_urls_sync(), and sanitize_doc_title(). Google API services are created once in main() and passed as parameters.
  • title_extractor.py — extract_title_from_metadata() (trafilatura), extract_h1_title(), fallback_name_from_url().
  • auth.py — OAuth2 flow for Google APIs. Loads/refreshes credentials from token.json, falls back to browser-based OAuth. Exports get_docs_service(), get_drive_service(), find_folder_id().
  • html_cleaner.py — BeautifulSoup-based noise removal. clean_html_for_extraction() strips 60+ noise selectors (nav, ads, popups, etc.). extract_main_content() for targeted extraction. filter_short_blocks() for post-extraction markdown filtering.
  • content_filter.py — PruningContentFilter with ContentScorer that scores HTML elements on text density (40%), link density (30%), tag importance (20%), and class/ID patterns (10%). Configurable via FilterConfig dataclass.
  • docs_converter.py — MarkdownToDocsConverter class that parses markdown with markdown-it-py and builds Google Docs API batchUpdate requests. Handles headings, bold, italic, links, lists, code blocks. Uses reverse insertion strategy for correct index tracking.
  • recursive_crawler.py — URL discovery for recursive mode. discover_urls() tries sitemap.xml first (parsed with xml.etree.ElementTree), falls back to BFS link crawling with BeautifulSoup. is_within_prefix() enforces strict URL prefix boundaries. CrawlConfig dataclass for rate limiting and timeouts.
  • tabbed_doc.py — create_tabbed_google_doc() creates a single Google Doc with one tab per page using the Docs API addDocumentTab request. _inject_tab_id() post-processes batchUpdate requests to target specific tabs.
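The weighted scoring in content_filter.py can be illustrated with a minimal sketch. The field names and helper shape here are assumptions; only the 40/30/20/10 weight split comes from the description above.

```python
from dataclasses import dataclass

@dataclass
class ElementSignals:
    text_density: float    # 0..1, share of text characters in the element
    link_density: float    # 0..1, share of text inside links (lower is better)
    tag_weight: float      # 0..1, e.g. <article> high, <aside> low
    pattern_weight: float  # 0..1, class/ID hints like "content" vs "ad"

def score(sig: ElementSignals) -> float:
    # Weighted blend: text density 40%, link density 30%,
    # tag importance 20%, class/ID patterns 10%.
    return (0.4 * sig.text_density
            + 0.3 * (1.0 - sig.link_density)
            + 0.2 * sig.tag_weight
            + 0.1 * sig.pattern_weight)

# A dense-text, low-link element scores close to 1.0.
print(score(ElementSignals(0.9, 0.1, 0.8, 0.5)))
```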

Key Design Decisions

  • Best-of-12 extraction: Runs 6 strategies on each of 2 HTML sources (aiohttp + Playwright), selects the longest result. This maximizes content capture across diverse site structures.
  • Doc title cache: Built once per run via _build_doc_title_cache_sync() with a single Drive folder query, avoiding repeated API calls for duplicate detection.
  • Shared resources: Google API services and Playwright browser context are created once in main() and passed through the call chain. No global mutable state.
  • Single retry layer: Only batch-level retries in main() (MAX_RETRY_ROUNDS). Per-fetch retry loops were removed to simplify retry reasoning.
  • Concurrency: 15 parallel tasks for both aiohttp and Playwright (MAX_CONCURRENCY, PLAYWRIGHT_CONCURRENCY). Google API rate limits mean actual speedup is ~2x, not 15x.
  • Credentials files (credentials.json, token.json) are gitignored and live in project root. Required for Google API access.
  • Recursive mode: Discovers sub-pages via sitemap or BFS crawl, extracts content from each, and creates a single tabbed Google Doc. Uses extract_url_content() (shared with normal mode) for content extraction. User confirms page count after discovery (or --max-pages to skip prompt). Strict URL prefix enforcement prevents crawling outside the target path.
  • Cross-mode duplicate detection: Both modes tag Google Drive files with appProperties metadata (doc_mode, source_url/base_url). Normal mode skips URLs already covered by a recursive doc's base_url prefix. Recursive mode deletes standalone docs whose source_url matches a sub-page. Zero extra API calls for tagging (merged into existing files().update).
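The best-of-12 selection rule above reduces to "run every strategy on every HTML source, keep the longest markdown." A hedged sketch, with placeholder source and strategy names:

```python
from typing import Callable

def pick_best(sources: dict[str, str],
              strategies: dict[str, Callable[[str], str]]) -> tuple[str, str, str]:
    """Return (source_name, strategy_name, markdown) with the most content."""
    best = ("", "", "")
    for src_name, html in sources.items():
        for strat_name, strat in strategies.items():
            md = strat(html)
            # Longest extracted markdown wins across all source/strategy pairs.
            if len(md) > len(best[2]):
                best = (src_name, strat_name, md)
    return best

sources = {"aiohttp": "<p>short</p>", "playwright": "<p>much longer body text</p>"}
strategies = {"strip_tags": lambda h: h.replace("<p>", "").replace("</p>", "")}
print(pick_best(sources, strategies))
```

With 2 sources and 6 strategies this evaluates 12 candidates, matching the "best-of-12" name.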

Constants

  • DRIVE_FOLDER_NAME = "Resources" — target Google Drive folder (in fetch_markdown.py)
  • TIMEOUT_SECS = 30 — aiohttp fetch timeout (in fetch_markdown.py)
  • PLAYWRIGHT_TIMEOUT = 45000 — Playwright navigation timeout (in playwright_fetch.py)
  • MAX_RETRY_ROUNDS = 3 — batch-level retry rounds for failed URLs (in fetch_markdown.py)
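The single batch-level retry layer built on MAX_RETRY_ROUNDS can be sketched as follows. The process_batch callback is a placeholder for the real fetch/extract batch; only the re-queue-failures-as-a-batch shape comes from the description above.

```python
MAX_RETRY_ROUNDS = 3

def run_with_retries(urls, process_batch):
    """process_batch(urls) returns the subset of URLs that failed."""
    pending = list(urls)
    for _ in range(MAX_RETRY_ROUNDS):
        if not pending:
            break
        # Each round retries only the URLs that failed the previous round.
        pending = process_batch(pending)
    return pending  # URLs still failing after all rounds

# Example: 'bad' fails its first attempt, then succeeds on the retry round.
attempts = {}
def flaky(batch):
    failed = []
    for u in batch:
        attempts[u] = attempts.get(u, 0) + 1
        if u == "bad" and attempts[u] < 2:
            failed.append(u)
    return failed

print(run_with_retries(["good", "bad"], flaky))  # []
```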

Prerequisites

  • Python 3.10+
  • Google Cloud project with Docs + Drive APIs enabled
  • credentials.json from Google Cloud OAuth (Desktop app type)
  • Playwright Chromium browser (playwright install chromium)