
wmjg-alt/web2text


web2text

A Python toolkit for extracting clean, formatted text from web pages. It handles bot detection and JavaScript rendering, and converts HTML to readable text.

What It Does

Takes any URL and returns clean text content:

  • Extracts all text content (visible and hidden)
  • Preserves structure (headings, lists, links with URLs)
  • Handles images (alt text), forms, tables
  • Bypasses bot detection with browser profiles
  • Optional JavaScript rendering for dynamic sites
  • Cookie support for authenticated access

Quick Start

# Clone and install
git clone https://github.com/...
cd web2text

# Create environment
conda create -n web2text python=3.12
conda activate web2text

# Install
pip install -r requirements.txt

# Test it works
pytest

Basic Usage

from web_to_text import parse

# Simple
text = parse("https://example.com")

# With bot detection bypass
text = parse("https://site.com", browser_profile="firefox_windows")

# With JavaScript rendering
text = parse("https://spa-site.com", enable_js=True)

# With cookies for authenticated access
text = parse("https://members.com", 
             browser_profile="firefox_windows",
             cookie_file="cookies.txt")

Installation as a Toolkit

To use in other projects without duplicating code:

# In any other project's environment
conda activate your_project_env
pip install -e /path/to/web2text

# Now import from anywhere
from web_to_text import parse

This is an editable install that links to the source tree rather than copying it: no code duplication, and changes in web2text apply everywhere the package is installed.

Features

Browser Profiles

Mimics real browser headers to bypass bot detection (403 errors):

  • firefox_windows (recommended)
  • chrome_windows
  • safari_mac
  • See config/browser_profiles.json for all profiles
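To illustrate what a profile supplies, here is a hedged sketch of a Firefox-on-Windows header set attached to a request with Python's standard urllib. The header names and version strings are a plausible approximation; the actual entries in config/browser_profiles.json may differ.

```python
import urllib.request

# Hypothetical approximation of a "firefox_windows" profile: the headers
# a real Firefox on Windows would send. The actual profile may differ.
FIREFOX_WINDOWS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) "
        "Gecko/20100101 Firefox/124.0"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}

# Build the request object only; no network call is made here.
request = urllib.request.Request("https://example.com", headers=FIREFOX_WINDOWS)
print(request.get_header("User-agent"))
```

Sites that block the default Python user agent with 403s generally accept a request carrying a header set like this one.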

JavaScript Rendering

For sites that require JS (SPAs, dynamic content):

pip install playwright
playwright install chromium
text = parse(url, enable_js=True, wait_for_js=2.0)

Cookie Support

Export cookies from your browser for authenticated access:

  1. Install "cookies.txt" browser extension
  2. Navigate to site and log in
  3. Export cookies.txt
  4. Use in code:
text = parse(url, cookie_file="cookies.txt")
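The exported file is in the Netscape cookie format, which Python's standard library can read directly. A minimal sketch of what a valid cookies.txt looks like and how it loads (the domain and cookie values below are made up):

```python
import http.cookiejar
import os
import tempfile

# A minimal Netscape-format cookie file: a header line, then one cookie
# per line with tab-separated fields: domain, include-subdomains, path,
# secure, expiry (unix time), name, value.
sample = (
    "# Netscape HTTP Cookie File\n"
    ".example.com\tTRUE\t/\tFALSE\t2147483647\tsession_id\tabc123\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

jar = http.cookiejar.MozillaCookieJar()
jar.load(path)  # raises LoadError if the file is not Netscape format
cookies = {c.name: c.value for c in jar}
os.unlink(path)

print(cookies)
```

If parse() rejects your exported file, loading it through MozillaCookieJar like this is a quick way to check whether the export is well-formed.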

Configurable Formatting

Edit config/formatting_rules.json to customize how HTML elements are formatted:

{
  "a": "{text} ({href})",
  "img": "[Image: {alt}]",
  "h1": "\n\n=== {text} ===\n"
}
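A plausible way templates like these could be applied (a sketch of the idea, not the actual formatter.py implementation): each rule is a str.format template filled from the element's text and attributes, with unknown tags falling back to plain text.

```python
# Sketch: applying template rules like the JSON above. The rule set and
# render helper are illustrative, not the actual formatter.py code.
rules = {
    "a": "{text} ({href})",
    "img": "[Image: {alt}]",
    "h1": "\n\n=== {text} ===\n",
}

def render(tag, **fields):
    """Format one element; tags without a rule fall back to plain text."""
    template = rules.get(tag, "{text}")
    return template.format(**fields)

link = render("a", text="Docs", href="https://example.com")
heading = render("h1", text="Features")
print(link)
print(heading)
```

Under this scheme, editing a template in the JSON changes the output for every occurrence of that tag without touching any Python code.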

Project Structure

web2text/
├── config/                    # Configuration files
│   ├── browser_profiles.json  # Browser fingerprints
│   └── formatting_rules.json  # HTML formatting rules
├── web_to_text/               # Main package
│   ├── parser.py             # Main entry point
│   ├── fetcher.py            # HTTP handling
│   ├── fetcher_js.py         # JavaScript rendering
│   ├── formatter.py          # HTML to text
│   ├── browser_profile.py    # Bot detection bypass
│   ├── config_loader.py      # Configuration
│   └── exceptions.py         # Custom exceptions
├── tests/                    # Test suite
├── docs/                     # Additional documentation
│   ├── QUICKSTART.md
│   ├── ARCHITECTURE.md
│   ├── BOT_DETECTION.md
│   └── JS_RENDERING.md
└── setup.py                  # Package configuration

Documentation

Additional guides live in docs/: QUICKSTART.md, ARCHITECTURE.md, BOT_DETECTION.md, and JS_RENDERING.md.

Rebuilding from Scratch

If setting up on a new machine:

# 1. Clone repository
git clone https://github.com/...
cd web2text

# 2. Create conda environment
conda create -n web2text python=3.12
conda activate web2text

# 3. Install dependencies
pip install -r requirements.txt

# 4. Optional: Install playwright for JS rendering
pip install playwright
playwright install chromium

# 5. Verify installation
pytest                                    # Run tests
python -c "from web_to_text import parse; print('✓ Ready')"

Common Use Cases

Extracting Article Text

text = parse("https://news-site.com/article", 
             browser_profile="firefox_windows")

Scraping with Authentication

# Export cookies from logged-in browser session
text = parse("https://members-only.com/content",
             browser_profile="firefox_windows",
             cookie_file="cookies.txt")

Handling Dynamic Sites

text = parse("https://react-app.com",
             enable_js=True,
             wait_for_js=3.0)

Batch Processing

from web_to_text import WebParser

urls = ["site1.com", "site2.com", "site3.com"]

with WebParser(browser_profile="firefox_windows") as parser:
    for url in urls:
        text = parser.parse(url)
        process(text)
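In real batches some URLs will fail, so it helps to isolate failures per URL instead of letting one exception abort the run. A minimal sketch of that pattern, with fetch_text as a hypothetical stand-in for parser.parse (which can raise on network errors, 403s, or parse failures):

```python
# fetch_text stands in for parser.parse; a real fetch may raise on
# network failures, 403s, or parse errors.
def fetch_text(url):
    if "bad" in url:
        raise ValueError(f"failed to fetch {url}")
    return f"text from {url}"

results, errors = {}, {}
for url in ["site1.com", "bad-site.com", "site3.com"]:
    try:
        results[url] = fetch_text(url)
    except Exception as exc:  # record the failure, keep the batch running
        errors[url] = str(exc)

print(sorted(results))
print(sorted(errors))
```

Collecting failures into a separate dict lets you retry just the failed URLs afterwards, for example with a different browser profile or with enable_js=True.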

Testing

# Run all tests
pytest

# With coverage
pytest --cov=web_to_text

# Specific test file
pytest tests/test_parser.py -v

Requirements

  • Python 3.8+
  • See requirements.txt for dependencies
  • Optional: Playwright for JavaScript rendering

License

MIT License - do whatever you want with it.

Notes

  • Cookie files are in Netscape format (export from browser extensions)
  • Browser profiles help avoid 403 errors on protected sites
  • JavaScript rendering is slower but handles SPAs and dynamic content
  • All configurations are in config/ directory
  • Changes to installed package apply immediately (editable install)
