CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Python CLI tool that fetches web content, extracts it as markdown, and creates formatted Google Docs. Uses a hybrid extraction pipeline: parallel aiohttp + Playwright fetching, BeautifulSoup HTML cleaning, content scoring/pruning, and multi-strategy extraction (trafilatura, multi-div, CSS-targeted) to maximize content quality.

Commands

# Run the tool
python fetch_markdown.py "https://example.com/article"

# Multiple URLs
python fetch_markdown.py url1 url2 url3

# With options
python fetch_markdown.py --no-clean --pruning-threshold 0.6 --min-words 30 url1

# Recursive mode: crawl all sub-pages into a single tabbed Google Doc
python fetch_markdown.py --recursive "https://docs.example.com/guide"

# Recursive mode with page limit (skips confirmation prompt)
python fetch_markdown.py --recursive --max-pages 20 "https://docs.example.com/guide"

# Install dependencies (pip)
pip install -r requirements.txt

# Install dependencies (conda)
conda install -c conda-forge aiohttp tqdm google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client markdown-it-py
conda install anaconda::beautifulsoup4
pip install trafilatura playwright lxml html2text

# Install Playwright browser
playwright install chromium

Architecture

The pipeline flows: URLs -> Parallel Fetch -> HTML Clean -> Content Prune -> Multi-Strategy Extract -> Markdown Post-Process -> Google Doc Creation
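The fan-out/fan-in shape of this flow can be sketched as below. The fetch and extract bodies are illustrative placeholders, not the repository's actual functions; only the concurrent-fetch-then-extract shape comes from the description above.

```python
import asyncio

async def fetch(url: str) -> str:
    # Stands in for the aiohttp/Playwright I/O stage.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

def extract(html: str) -> str:
    # Stands in for the clean/prune/extract stages.
    return html.replace("<html>", "").replace("</html>", "")

async def run_pipeline(urls: list[str]) -> list[str]:
    # All URLs are fetched concurrently, then each page is extracted.
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    return [extract(p) for p in pages]

results = asyncio.run(run_pipeline(["https://a.example", "https://b.example"]))
print(results)
```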

Module Responsibilities

  • fetch_markdown.py — Main CLI entry point and orchestrator. Contains main() async loop, extract_url_content() (extraction without doc creation), process_url() (extraction + doc creation), main_recursive() (recursive mode orchestrator), batch retry logic, CLI argument parsing, interactive mode with mode selection. ExtractionConfig is created from CLI args and passed through the call chain (no global state).
  • extraction.py — ExtractionConfig dataclass, fetch_html() for aiohttp fetching, extract_with_multi_div(), extract_with_css_selectors(), and apply_extraction_pipeline() for the clean/prune pipeline.
  • playwright_fetch.py — fetch_with_playwright() using a shared BrowserContext (created once in main()), and smart_wait_for_content() with combined CSS selector waiting.
  • google_drive.py — create_google_doc() (tags docs with appProperties for cross-mode dedup), _build_doc_title_cache_sync(), find_standalone_docs_for_urls_sync(), find_all_tabbed_base_urls_sync(), and sanitize_doc_title(). Google API services are created once in main() and passed as parameters.
  • title_extractor.py — extract_title_from_metadata() (trafilatura), extract_h1_title(), fallback_name_from_url().
  • auth.py — OAuth2 flow for Google APIs. Loads/refreshes credentials from token.json, falls back to browser-based OAuth. Exports get_docs_service(), get_drive_service(), find_folder_id().
  • html_cleaner.py — BeautifulSoup-based noise removal. clean_html_for_extraction() strips 60+ noise selectors (nav, ads, popups, etc.). extract_main_content() for targeted extraction. filter_short_blocks() for post-extraction markdown filtering.
  • content_filter.py — PruningContentFilter with ContentScorer that scores HTML elements on text density (40%), link density (30%), tag importance (20%), and class/ID patterns (10%). Configurable via FilterConfig dataclass.
  • docs_converter.py — MarkdownToDocsConverter class that parses markdown with markdown-it-py and builds Google Docs API batchUpdate requests. Handles headings, bold, italic, links, lists, code blocks. Uses reverse insertion strategy for correct index tracking.
  • recursive_crawler.py — URL discovery for recursive mode. discover_urls() tries sitemap.xml first (parsed with xml.etree.ElementTree), falls back to BFS link crawling with BeautifulSoup. is_within_prefix() enforces strict URL prefix boundaries. CrawlConfig dataclass for rate limiting and timeouts.
  • tabbed_doc.py — create_tabbed_google_doc() creates a single Google Doc with one tab per page using the Docs API addDocumentTab request. _inject_tab_id() post-processes batchUpdate requests to target specific tabs.
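The weighted scoring in content_filter.py can be illustrated with a minimal sketch. The field names and helper shape here are assumptions; only the 40/30/20/10 weight split comes from the description above.

```python
from dataclasses import dataclass

@dataclass
class ElementSignals:
    text_density: float    # 0..1, share of text characters in the element
    link_density: float    # 0..1, share of text inside links (lower is better)
    tag_weight: float      # 0..1, e.g. <article> high, <aside> low
    pattern_weight: float  # 0..1, class/ID hints like "content" vs "ad"

def score(sig: ElementSignals) -> float:
    # Weighted blend: text density 40%, link density 30%,
    # tag importance 20%, class/ID patterns 10%.
    return (0.4 * sig.text_density
            + 0.3 * (1.0 - sig.link_density)
            + 0.2 * sig.tag_weight
            + 0.1 * sig.pattern_weight)

# A dense-text, low-link element scores close to 1.0.
print(score(ElementSignals(0.9, 0.1, 0.8, 0.5)))
```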

Key Design Decisions

  • Best-of-12 extraction: Runs 6 strategies on each of 2 HTML sources (aiohttp + Playwright), selects the longest result. This maximizes content capture across diverse site structures.
  • Doc title cache: Built once per run via _build_doc_title_cache_sync() with a single Drive folder query, avoiding repeated API calls for duplicate detection.
  • Shared resources: Google API services and Playwright browser context are created once in main() and passed through the call chain. No global mutable state.
  • Single retry layer: Only batch-level retries in main() (MAX_RETRY_ROUNDS). Per-fetch retry loops were removed to simplify retry reasoning.
  • Concurrency: 15 parallel tasks for both aiohttp and Playwright (MAX_CONCURRENCY, PLAYWRIGHT_CONCURRENCY). Google API rate limits mean actual speedup is ~2x, not 15x.
  • Credentials files (credentials.json, token.json) are gitignored and live in project root. Required for Google API access.
  • Recursive mode: Discovers sub-pages via sitemap or BFS crawl, extracts content from each, and creates a single tabbed Google Doc. Uses extract_url_content() (shared with normal mode) for content extraction. User confirms page count after discovery (or --max-pages to skip prompt). Strict URL prefix enforcement prevents crawling outside the target path.
  • Cross-mode duplicate detection: Both modes tag Google Drive files with appProperties metadata (doc_mode, source_url/base_url). Normal mode skips URLs already covered by a recursive doc's base_url prefix. Recursive mode deletes standalone docs whose source_url matches a sub-page. Zero extra API calls for tagging (merged into existing files().update).
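The best-of-12 selection rule above reduces to "run every strategy on every HTML source, keep the longest markdown." A hedged sketch, with placeholder source and strategy names:

```python
from typing import Callable

def pick_best(sources: dict[str, str],
              strategies: dict[str, Callable[[str], str]]) -> tuple[str, str, str]:
    """Return (source_name, strategy_name, markdown) with the most content."""
    best = ("", "", "")
    for src_name, html in sources.items():
        for strat_name, strat in strategies.items():
            md = strat(html)
            # Longest extracted markdown wins across all source/strategy pairs.
            if len(md) > len(best[2]):
                best = (src_name, strat_name, md)
    return best

sources = {"aiohttp": "<p>short</p>", "playwright": "<p>much longer body text</p>"}
strategies = {"strip_tags": lambda h: h.replace("<p>", "").replace("</p>", "")}
print(pick_best(sources, strategies))
```

With 2 sources and 6 strategies this evaluates 12 candidates, matching the "best-of-12" name.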

Constants

  • DRIVE_FOLDER_NAME = "Resources" — target Google Drive folder (in fetch_markdown.py)
  • TIMEOUT_SECS = 30 — aiohttp fetch timeout (in fetch_markdown.py)
  • PLAYWRIGHT_TIMEOUT = 45000 — Playwright navigation timeout (in playwright_fetch.py)
  • MAX_RETRY_ROUNDS = 3 — batch-level retry rounds for failed URLs (in fetch_markdown.py)
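The single batch-level retry layer built on MAX_RETRY_ROUNDS can be sketched as follows. The process_batch callback is a placeholder for the real fetch/extract batch; only the re-queue-failures-as-a-batch shape comes from the description above.

```python
MAX_RETRY_ROUNDS = 3

def run_with_retries(urls, process_batch):
    """process_batch(urls) returns the subset of URLs that failed."""
    pending = list(urls)
    for _ in range(MAX_RETRY_ROUNDS):
        if not pending:
            break
        # Each round retries only the URLs that failed the previous round.
        pending = process_batch(pending)
    return pending  # URLs still failing after all rounds

# Example: 'bad' fails its first attempt, then succeeds on the retry round.
attempts = {}
def flaky(batch):
    failed = []
    for u in batch:
        attempts[u] = attempts.get(u, 0) + 1
        if u == "bad" and attempts[u] < 2:
            failed.append(u)
    return failed

print(run_with_retries(["good", "bad"], flaky))  # []
```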

Prerequisites

  • Python 3.10+
  • Google Cloud project with Docs + Drive APIs enabled
  • credentials.json from Google Cloud OAuth (Desktop app type)
  • Playwright Chromium browser (playwright install chromium)