
wmjg-alt/web2text


web2text

A Python toolkit for extracting clean, formatted text from web pages. It handles bot detection and JavaScript rendering, and converts HTML to readable text.

What It Does

Takes any URL and returns clean text content:

  • Extracts all text content (visible and hidden)
  • Preserves structure (headings, lists, links with URLs)
  • Handles images (alt text), forms, tables
  • Bypasses bot detection with browser profiles
  • Optional JavaScript rendering for dynamic sites
  • Cookie support for authenticated access

Quick Start

# Clone and install
git clone https://github.com/...
cd web2text

# Create environment
conda create -n web2text python=3.12
conda activate web2text

# Install
pip install -r requirements.txt

# Test it works
pytest

Basic Usage

from web_to_text import parse

# Simple
text = parse("https://example.com")

# With bot detection bypass
text = parse("https://site.com", browser_profile="firefox_windows")

# With JavaScript rendering
text = parse("https://spa-site.com", enable_js=True)

# With cookies for authenticated access
text = parse("https://members.com", 
             browser_profile="firefox_windows",
             cookie_file="cookies.txt")

Installation as a Toolkit

To use in other projects without duplicating code:

# In any other project's environment
conda activate your_project_env
pip install -e /path/to/web2text

# Now import from anywhere
from web_to_text import parse

This is an editable install that links to the source tree rather than copying it: no code duplication, and changes in web2text apply everywhere the package is installed.

Features

Browser Profiles

Mimics real browser headers to bypass bot detection (403 errors):

  • firefox_windows (recommended)
  • chrome_windows
  • safari_mac
  • See config/browser_profiles.json for all profiles
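To illustrate what a profile supplies, here is a hedged sketch of a Firefox-on-Windows header set attached to a request with Python's standard urllib. The header names and version strings are a plausible approximation; the actual entries in config/browser_profiles.json may differ.

```python
import urllib.request

# Hypothetical approximation of a "firefox_windows" profile: the headers
# a real Firefox on Windows would send. The actual profile may differ.
FIREFOX_WINDOWS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) "
        "Gecko/20100101 Firefox/124.0"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}

# Build the request object only; no network call is made here.
request = urllib.request.Request("https://example.com", headers=FIREFOX_WINDOWS)
print(request.get_header("User-agent"))
```

Sites that block the default Python user agent with 403s generally accept a request carrying a header set like this one.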

JavaScript Rendering

For sites that require JS (SPAs, dynamic content):

pip install playwright
playwright install chromium
text = parse(url, enable_js=True, wait_for_js=2.0)

Cookie Support

Export cookies from your browser for authenticated access:

  1. Install "cookies.txt" browser extension
  2. Navigate to site and log in
  3. Export cookies.txt
  4. Use in code:
text = parse(url, cookie_file="cookies.txt")
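The exported file is in the Netscape cookie format, which Python's standard library can read directly. A minimal sketch of what a valid cookies.txt looks like and how it loads (the domain and cookie values below are made up):

```python
import http.cookiejar
import os
import tempfile

# A minimal Netscape-format cookie file: a header line, then one cookie
# per line with tab-separated fields: domain, include-subdomains, path,
# secure, expiry (unix time), name, value.
sample = (
    "# Netscape HTTP Cookie File\n"
    ".example.com\tTRUE\t/\tFALSE\t2147483647\tsession_id\tabc123\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

jar = http.cookiejar.MozillaCookieJar()
jar.load(path)  # raises LoadError if the file is not Netscape format
cookies = {c.name: c.value for c in jar}
os.unlink(path)

print(cookies)
```

If parse() rejects your exported file, loading it through MozillaCookieJar like this is a quick way to check whether the export is well-formed.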

Configurable Formatting

Edit config/formatting_rules.json to customize how HTML elements are formatted:

{
  "a": "{text} ({href})",
  "img": "[Image: {alt}]",
  "h1": "\n\n=== {text} ===\n"
}
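A plausible way templates like these could be applied (a sketch of the idea, not the actual formatter.py implementation): each rule is a str.format template filled from the element's text and attributes, with unknown tags falling back to plain text.

```python
# Sketch: applying template rules like the JSON above. The rule set and
# render helper are illustrative, not the actual formatter.py code.
rules = {
    "a": "{text} ({href})",
    "img": "[Image: {alt}]",
    "h1": "\n\n=== {text} ===\n",
}

def render(tag, **fields):
    """Format one element; tags without a rule fall back to plain text."""
    template = rules.get(tag, "{text}")
    return template.format(**fields)

link = render("a", text="Docs", href="https://example.com")
heading = render("h1", text="Features")
print(link)
print(heading)
```

Under this scheme, editing a template in the JSON changes the output for every occurrence of that tag without touching any Python code.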

Project Structure

web2text/
├── config/                    # Configuration files
│   ├── browser_profiles.json  # Browser fingerprints
│   └── formatting_rules.json  # HTML formatting rules
├── web_to_text/               # Main package
│   ├── parser.py             # Main entry point
│   ├── fetcher.py            # HTTP handling
│   ├── fetcher_js.py         # JavaScript rendering
│   ├── formatter.py          # HTML to text
│   ├── browser_profile.py    # Bot detection bypass
│   ├── config_loader.py      # Configuration
│   └── exceptions.py         # Custom exceptions
├── tests/                    # Test suite
├── docs/                     # Additional documentation
│   ├── QUICKSTART.md
│   ├── ARCHITECTURE.md
│   ├── BOT_DETECTION.md
│   └── JS_RENDERING.md
└── setup.py                  # Package configuration

Documentation

Additional guides live in docs/: QUICKSTART.md, ARCHITECTURE.md, BOT_DETECTION.md, and JS_RENDERING.md.

Rebuilding from Scratch

If setting up on a new machine:

# 1. Clone repository
git clone https://github.com/...
cd web2text

# 2. Create conda environment
conda create -n web2text python=3.12
conda activate web2text

# 3. Install dependencies
pip install -r requirements.txt

# 4. Optional: Install playwright for JS rendering
pip install playwright
playwright install chromium

# 5. Verify installation
pytest                                    # Run tests
python -c "from web_to_text import parse; print('✓ Ready')"

Common Use Cases

Extracting Article Text

text = parse("https://news-site.com/article", 
             browser_profile="firefox_windows")

Scraping with Authentication

# Export cookies from logged-in browser session
text = parse("https://members-only.com/content",
             browser_profile="firefox_windows",
             cookie_file="cookies.txt")

Handling Dynamic Sites

text = parse("https://react-app.com",
             enable_js=True,
             wait_for_js=3.0)

Batch Processing

from web_to_text import WebParser

urls = ["site1.com", "site2.com", "site3.com"]

with WebParser(browser_profile="firefox_windows") as parser:
    for url in urls:
        text = parser.parse(url)
        process(text)
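In real batches some URLs will fail, so it helps to isolate failures per URL instead of letting one exception abort the run. A minimal sketch of that pattern, with fetch_text as a hypothetical stand-in for parser.parse (which can raise on network errors, 403s, or parse failures):

```python
# fetch_text stands in for parser.parse; a real fetch may raise on
# network failures, 403s, or parse errors.
def fetch_text(url):
    if "bad" in url:
        raise ValueError(f"failed to fetch {url}")
    return f"text from {url}"

results, errors = {}, {}
for url in ["site1.com", "bad-site.com", "site3.com"]:
    try:
        results[url] = fetch_text(url)
    except Exception as exc:  # record the failure, keep the batch running
        errors[url] = str(exc)

print(sorted(results))
print(sorted(errors))
```

Collecting failures into a separate dict lets you retry just the failed URLs afterwards, for example with a different browser profile or with enable_js=True.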

Testing

# Run all tests
pytest

# With coverage
pytest --cov=web_to_text

# Specific test file
pytest tests/test_parser.py -v

Requirements

  • Python 3.8+
  • See requirements.txt for dependencies
  • Optional: Playwright for JavaScript rendering

License

MIT License - do whatever you want with it.

Notes

  • Cookie files are in Netscape format (export from browser extensions)
  • Browser profiles help avoid 403 errors on protected sites
  • JavaScript rendering is slower but handles SPAs and dynamic content
  • All configurations are in config/ directory
  • Changes to installed package apply immediately (editable install)
