A powerful Python CLI tool that fetches web content and automatically creates formatted Google Docs. Uses intelligent hybrid extraction combining fast HTTP requests with headless browser automation, enhanced with BeautifulSoup cleaning and content-aware pruning for maximum extraction quality.
- Intelligent dual extraction - runs both aiohttp and Playwright in parallel, uses longest result
- Multi-strategy content extraction - 6 extraction strategies per HTML source (up to 12 attempts per URL)
- BeautifulSoup HTML cleaning - removes 60+ noise patterns (ads, navigation, footers, popups)
- Content-aware pruning - scores elements by text density, link ratio, and semantic importance
- CSS-targeted extraction - priority-ordered selector matching for common content patterns
- Concurrent processing - fetches up to 15 URLs in parallel, ~2x faster than sequential
- Google Docs integration - automatically creates formatted documents in your Drive
- Smart formatting - converts markdown to native Google Docs formatting (headings, bold, italic, links, lists)
- Source + title header - inserts the document title (Docs TITLE style) and clickable source URL at the top
- Drive reuse with cache - reuses existing Docs by title across nested folders with a one-time cache build
- HTML Cleaning - BeautifulSoup removes scripts, styles, navigation, ads, popups, and hidden elements
- Content Pruning - PruningContentFilter scores nodes and removes low-quality boilerplate
- trafilatura - smart content detection with noise filtering
- multi-div extraction - handles sites that split articles across multiple containers (e.g., Ars Technica)
- CSS-targeted extraction - uses semantic selectors (`article`, `main`, `.content`, etc.)
- Playwright browser automation - handles JavaScript-rendered content (Medium, LinkedIn, etc.)
- Stealth mode - anti-detection headers and settings to bypass basic bot protection
- Automatic method selection - always uses the extraction that gets the most content
- Crawl entire documentation sites - provide a base URL, discover all sub-pages automatically
- Sitemap-first discovery - tries sitemap.xml for fast, complete URL discovery
- BFS crawl fallback - follows links from pages when no sitemap is available
- Strict URL prefix enforcement - never crawls outside the target path (safe for sites like GitHub)
- Single tabbed Google Doc - all pages combined into one doc with named tabs
- User confirmation - shows discovered page count, lets you proceed, limit, or cancel
- Cross-mode duplicate detection - deletes standalone docs that are now covered by the recursive doc
- Interactive CLI mode - full-screen centered UI with mode selection (Normal/Recursive), loops after processing
- Automatic retry - failed URLs are retried in up to 3 batch-level rounds
- Automatic title extraction - from metadata, H1 headings, or URL fallback
- Cross-mode duplicate detection - standalone docs are skipped if covered by a recursive doc, and vice versa
- Real-time progress bar - shows completion status, speed, and extraction method used
- Detailed logging - shows extraction method, content length, and document title for each URL
- Configurable CLI - tune cleaning, pruning, and filtering via command-line flags
- OAuth authentication - one-time browser login, then automatic with cached credentials
- Resilient error handling - continues processing even if some URLs fail
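The sitemap-first discovery with BFS fallback and strict prefix enforcement described above can be sketched roughly as follows. This is illustrative only: `discover_urls` and its `fetch` callback are invented names, not the tool's actual API.

```python
# Sketch of sitemap-first URL discovery with a BFS fallback and strict
# prefix enforcement. Names are illustrative, not the tool's actual API.
import re
from urllib.parse import urljoin

def discover_urls(base_url, fetch, max_pages=None):
    """fetch(url) -> page text; raises on failure."""
    # 1) Sitemap first: fast, complete, no crawling needed.
    try:
        xml = fetch(urljoin(base_url, "/sitemap.xml"))
        urls = re.findall(r"<loc>(.*?)</loc>", xml)
        found = [u for u in urls if u.startswith(base_url)]
        if found:
            return found[:max_pages]
    except Exception:
        pass  # no sitemap -> fall back to crawling

    # 2) BFS fallback: follow links, never leaving the base prefix.
    seen, queue = {base_url}, [base_url]
    while queue and (max_pages is None or len(seen) < max_pages):
        page = queue.pop(0)
        try:
            html = fetch(page)
        except Exception:
            continue  # resilient: skip pages that fail to fetch
        for href in re.findall(r'href="([^"#]+)"', html):
            url = urljoin(page, href)
            if url.startswith(base_url) and url not in seen:
                seen.add(url)
                queue.append(url)
    return sorted(seen)[:max_pages]
```

The prefix check is what keeps the crawl from wandering outside the target path on large hosts like GitHub.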
- Python 3.10+
- Google Cloud account (free)
- Google Drive folder named "Resources" (or customize in code)
- Playwright browsers (automatically installed)
Using conda (recommended):
conda install -c conda-forge aiohttp tqdm google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client markdown-it-py
conda install anaconda::beautifulsoup4
pip install trafilatura playwright lxml html2text

Or using pip:

pip install -r requirements.txt
playwright install chromium

- Go to Google Cloud Console
- Create a new project (or select existing)
- Enable these APIs:
- Google Docs API
- Google Drive API
- Configure OAuth consent screen:
- User type: External
- Add your email as a test user
- Keep publishing status as Testing (no verification needed)
- Add scopes:
  - https://www.googleapis.com/auth/documents
  - https://www.googleapis.com/auth/drive
- Create credentials:
- APIs & Services → Credentials
- Create Credentials → OAuth client ID
- Application type: Desktop app
- Download the JSON and save it as `credentials.json` in the project root
By default, docs are created in a folder named "Resources" in your Google Drive. To use a different folder:
- Create/choose a folder in Google Drive
- Edit `DRIVE_FOLDER_NAME` in `fetch_markdown.py`:

DRIVE_FOLDER_NAME = "Your Folder Name"
Run the command with no arguments to enter interactive mode:
web-content-parser

This opens a full-screen UI starting with mode selection:
~
~
Web Content Parser
Type "exit" to quit.
Select mode:
[1] Normal - process URLs individually
[2] Recursive - crawl & combine into one doc
> _
~
~
After selecting a mode, you enter the URL entry screen:
- Normal mode: Type URLs one at a time, then "ready" to start processing
- Recursive mode: Type a single base URL and it starts automatically
- Type "exit" at any point to quit
- After processing completes, press "q" to return to mode selection for more URLs
- Ctrl+C and Ctrl+D are disabled — use "exit" to quit
You can also pass URLs directly as command-line arguments:
# Normal mode (default)
web-content-parser "https://example.com/article1" "https://example.com/article2"
# Recursive mode
web-content-parser --recursive "https://docs.example.com/guide"
# Recursive mode with page limit (skips confirmation prompt)
web-content-parser --recursive --max-pages 20 "https://docs.example.com/guide"

CLI flags work with both modes:
# Interactive mode with cleaning disabled
web-content-parser --no-clean
# Direct mode with options
web-content-parser --pruning-threshold 0.6 --min-words 30 url1 url2

First run: Browser will open for Google authorization (one-time only)
Subsequent runs: Automatic authentication using cached credentials
- Authenticates with Google (uses cached token after first run)
- Finds your "Resources" folder in Google Drive
- Launches headless Chromium browser (Playwright)
- Fetches content using BOTH methods in parallel:
- aiohttp - fast HTTP request (~1-2s)
- Playwright - headless browser with JavaScript execution (~5-8s)
- Cleans HTML with BeautifulSoup:
- Removes scripts, styles, iframes, SVG, canvas
- Removes 60+ noise selectors (nav, footer, ads, popups, etc.)
- Removes hidden elements and empty containers
- Prunes content with PruningContentFilter:
- Scores each element by text density, link ratio, tag importance
- Removes elements below quality threshold
- Runs 6 extraction strategies on each HTML source:
- raw HTML + trafilatura
- cleaned HTML + trafilatura
- pruned HTML + trafilatura
- raw HTML + multi-div
- cleaned HTML + multi-div
- CSS-targeted extraction
- Automatically uses the longest result from all 12 combinations
- Filters short paragraphs (< 50 words by default)
- Extracts article title (metadata → H1 → URL fallback)
- Creates formatted Google Doc in your Resources folder
- Returns shareable document URL with detailed extraction info
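The fetch-both-then-pick-longest flow above can be sketched as below. The fetcher callables stand in for the real aiohttp and Playwright code paths, and the helper names are illustrative, not the tool's actual API:

```python
# Sketch of the dual-fetch pattern: run a fast static fetch and a headless
# browser fetch concurrently, then run every extraction strategy on every
# HTML source and keep the longest result. The fetchers are stand-ins for
# the real aiohttp and Playwright paths.
import asyncio

async def fetch_both(url, static_fetch, browser_fetch):
    """Each fetcher is an async callable: url -> HTML string ('' on failure)."""
    static_html, browser_html = await asyncio.gather(
        static_fetch(url), browser_fetch(url)
    )
    # Total time ~= the slower method, not the sum of both.
    return {"aiohttp": static_html, "playwright": browser_html}

def best_extraction(results, strategies):
    """Run each strategy on each HTML source; keep the longest output."""
    candidates = []
    for source, html in results.items():
        for name, strategy in strategies.items():
            try:
                text = strategy(html) or ""
            except Exception:
                text = ""  # a failed strategy just scores zero length
            candidates.append((f"{source} with {name}", text))
    return max(candidates, key=lambda c: len(c[1]))
```

With 6 strategies and 2 HTML sources, `best_extraction` compares up to 12 candidates per URL.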
python fetch_markdown.py \
"https://arstechnica.com/tech-policy/2024/01/..." \
"https://www.theverge.com/2024/1/15/..." \
"https://techcrunch.com/2024/01/15/..."python fetch_markdown.py \
"https://www.a16z.news/p/state-of-consumer-ai-2025" \
"https://www.growthunhinged.com/p/the-best-growth-tactics-of-2025" \
... # 58 more URLs

Performance: 60 URLs in ~4.5 minutes (4.6s average per doc)
Progress Output:
Locating Google Drive folder...
✓ Found 'Resources' folder
Fetching & Creating: 100%|████████████| 60/60 [04:36<00:00, 4.61s/doc]
[OK] || [aiohttp with multi-div] || 53,008 chars || "SpaceX's historic rocket landing" || https://arstechnica.com/... -> https://docs.google.com/document/d/...
[OK] || [playwright with trafilatura] || 9,527 chars || "Introducing GPT-5.2-Codex" || https://openai.com/... -> https://docs.google.com/document/d/...
[OK] || [aiohttp with trafilatura] || 56,708 chars || "Let a thousand societies bloom" || https://vitalik.eth.limo/... -> https://docs.google.com/document/d/...
✓ Done: 60/60 succeeded, 0 failed.
Log Format:
[OK] || [extraction_method] || content_length || "Document Title" || source_url -> google_doc_url
Extraction methods shown:
- `aiohttp with raw with trafilatura` - Fast HTTP + raw HTML + trafilatura
- `aiohttp with cleaned with trafilatura` - Fast HTTP + cleaned HTML + trafilatura
- `aiohttp with pruned with trafilatura` - Fast HTTP + pruned HTML + trafilatura
- `playwright with css-targeted` - Browser + CSS selector extraction
- `playwright with cleaned with multi-div` - Browser + cleaned HTML + multi-container
- ...and more combinations
All Google Docs are created in your Resources folder with:
- Native Google Docs formatting (not plain text)
- Proper headings, bold, italic, links, lists, code blocks
- Top of doc includes:
- Document title styled as Google Docs TITLE
- Clickable Source link on the next line
- Content from whichever extraction method got the most text
web-content-parser/
├── fetch_markdown.py # Main CLI entry point & orchestrator (normal + recursive modes)
├── extraction.py # ExtractionConfig, HTML fetching, extraction strategies
├── playwright_fetch.py # Playwright browser fetching & smart content waiting
├── google_drive.py # Google Doc creation, caching, cross-mode dedup helpers
├── title_extractor.py # Title extraction from metadata, H1, URL fallback
├── auth.py # OAuth authentication & Google API clients
├── docs_converter.py # Markdown → Google Docs formatting converter
├── html_cleaner.py # BeautifulSoup HTML cleaning & noise removal
├── content_filter.py # PruningContentFilter for content scoring
├── recursive_crawler.py # URL discovery via sitemap + BFS crawl for recursive mode
├── tabbed_doc.py # Tabbed Google Doc creation (one tab per page)
├── requirements.txt # Python dependencies
├── credentials.json # OAuth client secrets (git-ignored)
├── token.json # User access tokens (git-ignored)
├── bin/
│ └── web-content-parser # Global command wrapper
└── README.md
web-content-parser [OPTIONS] [urls...]

If no URLs are provided, the tool enters interactive mode.
| Option | Description | Default |
|---|---|---|
| `--no-clean` | Disable BeautifulSoup HTML cleaning | Enabled |
| `--no-prune` | Disable content scoring/pruning | Enabled |
| `--pruning-threshold FLOAT` | Score threshold for keeping content (0.0-1.0) | 0.48 |
| `--min-words INT` | Minimum words per markdown block | 0 (disabled) |
| `--min-word-threshold INT` | Minimum words for pruning filter | 10 |
| `--no-dynamic-threshold` | Disable dynamic threshold adjustment | Enabled |
| `--recursive` | Crawl all sub-pages and combine into one tabbed Google Doc | Off |
| `--max-pages INT` | Max pages in recursive mode (skips confirmation prompt) | None (ask user) |
| `--crawl-delay FLOAT` | Delay between page fetches during crawling (seconds) | 0.5 |
# Interactive mode (select Normal or Recursive, then enter URLs)
web-content-parser
# Interactive mode with options
web-content-parser --no-clean
# Direct mode with single URL
web-content-parser "https://example.com/article"
# Direct mode with multiple URLs
web-content-parser url1 url2 url3
# Recursive mode: crawl all sub-pages into a single tabbed doc
web-content-parser --recursive "https://docs.example.com/guide"
# Recursive mode with page limit
web-content-parser --recursive --max-pages 20 "https://docs.example.com/guide"
# More aggressive pruning (keeps less content)
web-content-parser --pruning-threshold 0.6 "https://example.com/article"
# Enable short paragraph filtering (remove blocks < 30 words)
web-content-parser --min-words 30 "https://example.com/article"

- Concurrency: 15 parallel tasks (both aiohttp and Playwright)
- Note: While fetching runs fully in parallel, Google API operations have inherent rate limiting
- Actual speedup: ~2x faster than sequential processing (not 15x)
- Why: Google Docs/Drive APIs have connection pooling and rate limits that serialize some operations
- Benefit: Still significantly faster, and content fetching (the slowest part) is fully parallelized
- Timeout: 30s for aiohttp, 45s for Playwright
- Retries: 3 batch-level retry rounds for failed URLs
- Target folder: "Resources" in Google Drive (customizable in `fetch_markdown.py`)
- Startup cost: one-time Drive cache build (faster reuse on larger batches)
- Always runs both fetch methods (aiohttp + Playwright) in parallel
- Cleans HTML with BeautifulSoup (60+ noise selectors removed)
- Prunes content with scoring algorithm (text density, link ratio, tag importance)
- Runs 6 extraction strategies on each HTML source (raw, cleaned, pruned × trafilatura, multi-div, CSS)
- Compares all 12 combinations and uses the longest result
- Filters short blocks (< 50 words by default)
- No manual configuration needed - fully automatic
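The short-block filtering step can be sketched as a simple pass over markdown blocks. This is a simplified illustration, not the tool's actual filter (which likely handles more block types):

```python
# Sketch of the short-block filter: drop plain paragraphs under the word
# minimum while always keeping headers, code fences, lists, and quotes.
def filter_short_blocks(markdown: str, min_words: int = 50) -> str:
    kept = []
    for block in markdown.split("\n\n"):
        stripped = block.lstrip()
        # Structural blocks (headers, code, lists, quotes) are always kept.
        if stripped.startswith(("#", "```", "-", "*", ">")):
            kept.append(block)
        elif len(block.split()) >= min_words:
            kept.append(block)
    return "\n\n".join(kept)
```

Passing `--min-words 0` (the default) effectively disables the paragraph check while keeping the structure pass intact.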
- `documents` - create and edit Google Docs
- `drive` - access Drive folders and create files
Sensitive files are excluded from git via .gitignore:
- `credentials.json` - OAuth client secrets
- `token.json` - User access/refresh tokens
Never commit these files to version control.
┌─────────────────────────────────────────────────────────────────────────────────┐
│ INPUT: URLs │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ PHASE 1: PARALLEL FETCHING │
│ ┌─────────────────────────────┐ ┌─────────────────────────────────────────┐ │
│ │ aiohttp (Static) │ │ Playwright (Dynamic) │ │
│ │ • Fast HTTP requests │ │ • Headless Chrome browser │ │
│ │ • No JS execution │ │ • Executes JavaScript │ │
│ │ • Good for static sites │ │ • Smart wait for content │ │
│ └──────────────┬──────────────┘ │ • Scroll triggers for lazy-load │ │
│ │ └──────────────────┬──────────────────────┘ │
│ │ │ │
│ └──────────────┬───────────────────────┘ │
│ ▼ │
│ Both run in PARALLEL │
│ (up to 15 concurrent) │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ PHASE 2: HTML CLEANING (BeautifulSoup) │
│ │
│ Raw HTML ──► Remove: │
│ • <script>, <style>, <noscript>, <iframe> │
│ • HTML comments │
│ • 60+ noise selectors: │
│ - nav, header, footer, aside │
│ - .sidebar, .menu, .navbar │
│ - .advertisement, .ad, .sponsored │
│ - .social-share, .comments, .related-posts │
│ - .popup, .modal, .cookie-notice │
│ • Hidden elements (display:none, aria-hidden) │
│ • Empty elements │
│ │
│ Output: cleaned_html │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ PHASE 3: CONTENT PRUNING (PruningContentFilter) │
│ │
│ For each HTML element, compute score (0.0 - 1.0): │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Score = (text_density × 0.4) + (1 - link_density) × 0.3 + │ │
│ │ (tag_importance × 0.2) + (class_id_score × 0.1) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ • text_density: ratio of text to total element size │
│ • link_density: ratio of link text to total text (high = navigation) │
│ • tag_importance: article/main = 1.0, div = 0.5, nav = 0.1 │
│ • class_id_score: "content/article" = boost, "sidebar/ad" = penalty │
│ │
│ Remove elements with score < threshold (default: 0.48) │
│ │
│ Output: pruned_html │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ PHASE 4: MULTI-STRATEGY EXTRACTION │
│ │
│ Run 6 strategies on EACH HTML source (aiohttp + playwright): │
│ │
│ ┌─────────────────┬────────────────────────────────────────────────────────┐ │
│ │ Strategy │ Description │ │
│ ├─────────────────┼────────────────────────────────────────────────────────┤ │
│ │ raw+trafilatura │ trafilatura on original HTML │ │
│ │ cleaned+trafil. │ trafilatura on cleaned HTML │ │
│ │ pruned+trafil. │ trafilatura on pruned HTML │ │
│ │ raw+multi-div │ Combine multiple content divs (for split articles) │ │
│ │ cleaned+multi │ multi-div on cleaned HTML │ │
│ │ css-targeted │ CSS selectors: article, main, .content, etc. │ │
│ └─────────────────┴────────────────────────────────────────────────────────┘ │
│ │
│ Total: up to 12 extraction attempts per URL │
│ (6 strategies × 2 HTML sources) │
│ │
│ Winner: Strategy with LONGEST content │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ PHASE 5: MARKDOWN POST-PROCESSING │
│ │
│ • Filter short blocks: Remove paragraphs with < 50 words │
│ • Keep: Headers (#), code blocks (```), lists (- *), blockquotes (>) │
│ • Extract title: metadata → first H1 → URL fallback │
│ • Add source URL and timestamp header │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ PHASE 6: GOOGLE DOCS CREATION │
│ │
│ • Check doc cache (avoid duplicates) │
│ • Create Google Doc with formatted content │
│ • Move to "Resources" folder │
│ • Return shareable URL │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ OUTPUT │
│ │
│ [OK] || [playwright with css-targeted] || 39,307 chars || "Title" || url │
│ ↑ ↑ │
│ │ │ │
│ Best strategy Content length │
└─────────────────────────────────────────────────────────────────────────────────┘
- First run: opens browser for OAuth consent
- Saves tokens to `token.json` for automatic refresh
- Creates authenticated Google Docs & Drive API clients
- Launches headless Chromium browser (Playwright)
- For each URL, runs both methods in parallel:
- aiohttp: Fast HTTP GET, static HTML (~1-2s)
- Playwright: Headless browser, executes JavaScript, waits for content (~5-8s)
- Both complete simultaneously; tool waits for both results
- Removes non-content tags: `<script>`, `<style>`, `<noscript>`, `<iframe>`, `<svg>`
- Removes HTML comments
- Removes 60+ noise selectors:
  - Semantic: `nav`, `header`, `footer`, `aside`
  - Classes: `.sidebar`, `.menu`, `.navbar`, `.advertisement`, `.social-share`
  - IDs: `#comments`, `#sidebar`, `#cookie-notice`
  - ARIA roles: `[role="navigation"]`, `[role="banner"]`
- Removes hidden elements (`display:none`, `aria-hidden="true"`)
- Removes empty containers
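The cleaning pass above can be sketched with BeautifulSoup's `select` and `decompose`. Only a handful of the 60+ selectors are shown, so this is an abridged illustration, not the tool's actual cleaner:

```python
# Abridged sketch of the BeautifulSoup cleaning pass; the real tool removes
# 60+ selectors, only a few representatives are listed here.
from bs4 import BeautifulSoup

NOISE_SELECTORS = [
    "script", "style", "noscript", "iframe", "svg",
    "nav", "header", "footer", "aside",
    ".sidebar", ".advertisement", "#comments",
    '[role="navigation"]', '[aria-hidden="true"]',
]

def clean_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for selector in NOISE_SELECTORS:
        for el in soup.select(selector):
            el.decompose()  # remove the element and its whole subtree
    # Drop elements hidden via inline style.
    for el in soup.find_all(style=lambda s: s and "display:none" in s.replace(" ", "")):
        el.decompose()
    # Drop now-empty containers (void tags like <br>/<img> are exempt).
    for el in soup.find_all(lambda t: not t.get_text(strip=True)
                            and t.name not in ("br", "img", "hr")):
        el.decompose()
    return str(soup)
```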
- PruningContentFilter scores each element:
  - Text density (40%): Ratio of text to element size
  - Link density (30%): High link ratio = likely navigation
  - Tag importance (20%): `article` = 1.0, `div` = 0.5, `nav` = 0.1
  - Class/ID patterns (10%): "content" = boost, "sidebar" = penalty
- Removes elements below threshold (default: 0.48)
- Dynamic threshold adjustment based on content characteristics
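The weighted score from the pruning phase can be written out directly. The element features here are simplified numeric stand-ins for what the real filter measures on DOM nodes:

```python
# Sketch of the pruning score: weights follow the description above
# (text density 40%, link density 30%, tag 20%, class/id 10%).
def prune_score(text_len, element_len, link_text_len, tag, class_id=""):
    text_density = text_len / max(element_len, 1)
    link_density = link_text_len / max(text_len, 1)
    tag_importance = {"article": 1.0, "main": 1.0, "div": 0.5, "nav": 0.1}.get(tag, 0.5)
    class_id_score = 0.5
    if any(k in class_id for k in ("content", "article")):
        class_id_score = 1.0  # boost content-ish class/id names
    elif any(k in class_id for k in ("sidebar", "ad")):
        class_id_score = 0.0  # penalize boilerplate-ish names
    return (text_density * 0.4 + (1 - link_density) * 0.3
            + tag_importance * 0.2 + class_id_score * 0.1)

def keep(element_features, threshold=0.48):
    """An element survives pruning when its score meets the threshold."""
    return prune_score(**element_features) >= threshold
```

A dense `article` scores near 1.0 and is kept; a link-heavy `nav` scores well under the 0.48 default and is pruned.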
- Runs 6 strategies on each HTML source:
  - raw + trafilatura: Smart content detection on original HTML
  - cleaned + trafilatura: trafilatura on cleaned HTML
  - pruned + trafilatura: trafilatura on heavily pruned HTML
  - raw + multi-div: XPath pattern matching, combines split articles
  - cleaned + multi-div: multi-div on cleaned HTML
  - css-targeted: Priority CSS selectors (`article`, `main`, `.content`, etc.)
- Compares all 12 results (6 strategies × 2 sources)
- Uses whichever got the longest content
- Filters short paragraphs (< 50 words by default)
- Preserves headers, code blocks, lists, blockquotes
- Extracts title from metadata, H1, or URL
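The metadata → H1 → URL fallback chain can be sketched as below. Regexes are used for brevity; the real tool parses the DOM, and the exact metadata fields it checks are an assumption here:

```python
# Sketch of the title fallback chain: metadata -> first H1 -> URL slug.
import re
from urllib.parse import urlparse

def extract_title(html: str, url: str) -> str:
    # 1) Metadata (og:title shown as one representative field).
    m = re.search(r'<meta[^>]+property="og:title"[^>]+content="([^"]+)"', html)
    if m:
        return m.group(1)
    # 2) First H1 heading, with inner tags stripped.
    m = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.DOTALL)
    if m:
        text = re.sub(r"<[^>]+>", "", m.group(1)).strip()
        if text:
            return text
    # 3) URL fallback: last path segment, de-slugged and title-cased.
    slug = urlparse(url).path.rstrip("/").split("/")[-1]
    return slug.replace("-", " ").replace("_", " ").title() or url
```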
- Normal mode: Creates one Google Doc per URL, tagged with `appProperties` for cross-mode dedup
- Recursive mode: Creates a single tabbed Google Doc (one tab per page), tagged with the base URL
- Builds a one-time Drive title cache (single folder query)
- Reuses existing Docs by title if found; otherwise creates new
- Parses markdown with `markdown-it-py`
- Converts to Google Docs API `batchUpdate` requests
- Applies formatting in a single batch operation
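The conversion step can be sketched as building a list of `batchUpdate` request dicts. The request shapes (`insertText`, `updateParagraphStyle`) follow the Google Docs API, but only two block types are shown and the function is illustrative, not the tool's converter:

```python
# Sketch: turn parsed markdown blocks into Google Docs batchUpdate requests.
# Only headings and paragraphs are shown; the real converter also handles
# bold, italic, links, lists, and code blocks.
def blocks_to_requests(blocks):
    """blocks: list of (kind, text) tuples, e.g. ('h1', 'Title'), ('p', 'Body')."""
    requests, index = [], 1  # Docs body content starts at index 1
    for kind, text in blocks:
        text += "\n"
        requests.append({"insertText": {"location": {"index": index}, "text": text}})
        if kind.startswith("h"):
            requests.append({"updateParagraphStyle": {
                "range": {"startIndex": index, "endIndex": index + len(text)},
                "paragraphStyle": {"namedStyleType": f"HEADING_{kind[1]}"},
                "fields": "namedStyleType",
            }})
        index += len(text)
    return requests
```

The resulting list would then be sent in a single API call, along the lines of `docs.documents().batchUpdate(documentId=doc_id, body={"requests": requests})`.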
- Both modes tag docs with `appProperties` metadata (invisible to users, queryable via the Drive API)
- Normal mode skips URLs already covered by a recursive doc's `base_url` prefix
- Recursive mode finds and deletes standalone docs whose source URL matches a sub-page
- Zero extra API calls for tagging (merged into the existing `files().update` call)
- aiohttp wins ~80% of the time - faster for static sites
- Playwright needed for ~20% - JavaScript-heavy sites (Medium, LinkedIn, etc.)
- Parallel execution - total time ≈ slower method, not both added together
- Some sites block Playwright - aiohttp bypasses simple bot detection
- BeautifulSoup cleaning - removes obvious noise (scripts, ads, navigation)
- PruningContentFilter - algorithmically scores and removes low-quality content
- No manual rules needed - works automatically on any site
- Configurable thresholds - tune aggressiveness via CLI flags
- trafilatura - best for standard articles, smart noise filtering
- multi-div - catches split-content articles trafilatura misses
- CSS-targeted - reliable fallback using semantic HTML patterns
- All complement each other - different strengths for different site structures
- 12 total attempts - maximum chance of successful extraction
- 60 URLs tested: 100% success rate
- Average: 4.6s per document
- Longest article: 69,966 chars (Backlinko guide)
- Mix: ~80% used aiohttp, ~20% needed Playwright
- Ars Technica: 53,008 chars (20-page Google Doc) - multi-div won by 18.9x over trafilatura alone
- Ensure the folder exists in your Google Drive
- Check the folder name matches exactly (case-sensitive)
- Verify you completed OAuth authorization successfully
- Go to OAuth consent screen in Google Cloud Console
- Add your email as a test user
- Ensure publishing status is Testing (not Production)
- Delete `token.json` and re-run to re-authenticate
- Verify `credentials.json` is a valid OAuth Desktop client
The tool uses maximum extraction with 12 attempts per URL:
- Runs both aiohttp and Playwright in parallel
- Cleans HTML with BeautifulSoup (60+ noise selectors)
- Prunes with content scoring algorithm
- Tries 6 extraction strategies on each HTML source
- Uses the longest result automatically
If content is still incomplete, the site may:
- Use client-side rendering that Playwright can't access
- Block automated access with sophisticated bot detection
- Require authentication/subscription (paywalled content)
- Use infinite scroll or pagination (not yet supported)
# Reinstall Playwright browsers
playwright install chromium
# Or install all browsers
playwright install

- Default: 15 parallel tasks (both aiohttp + Playwright running)
- Adjust `MAX_CONCURRENCY` and `PLAYWRIGHT_CONCURRENCY` in `fetch_markdown.py`
- Lower for slower machines, higher for faster ones (max recommended: 20)