A Python toolkit for extracting clean, formatted text from web pages. Handles bot detection, JavaScript rendering, and converts HTML to readable text.
Takes any URL and returns clean text content:
- Extracts all text content (visible and hidden)
- Preserves structure (headings, lists, links with URLs)
- Handles images (alt text), forms, tables
- Bypasses bot detection with browser profiles
- Optional JavaScript rendering for dynamic sites
- Cookie support for authenticated access
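The kind of conversion the features above describe can be sketched with the standard library's `html.parser` (a hypothetical mini-converter for illustration only, not the toolkit's actual `formatter.py` logic):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Toy extractor: keeps text, appends link URLs, notes image alt text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "img":
            # Images become their alt text, as the feature list describes.
            self.parts.append(f"[Image: {dict(attrs).get('alt', '')}]")

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._href and self.parts:
            # Preserve the link URL alongside its text.
            self.parts[-1] += f" ({self._href})"
            self._href = None

extractor = TextExtractor()
extractor.feed('<h1>Title</h1><p>See <a href="https://example.com">docs</a>.</p>')
print(" ".join(extractor.parts))
```

The real package layers bot-detection bypass and JS rendering on top of this basic idea.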
```bash
# Clone and install
git clone https://github.com/...
cd web2text

# Create environment
conda create -n web2text python=3.12
conda activate web2text

# Install
pip install -r requirements.txt

# Test it works
pytest
```

```python
from web_to_text import parse

# Simple
text = parse("https://example.com")

# With bot detection bypass
text = parse("https://site.com", browser_profile="firefox_windows")

# With JavaScript rendering
text = parse("https://spa-site.com", enable_js=True)

# With cookies for authenticated access
text = parse("https://members.com",
             browser_profile="firefox_windows",
             cookie_file="cookies.txt")
```

To use in other projects without duplicating code:

```bash
# In any other project's environment
conda activate your_project_env
pip install -e /path/to/web2text
```

```python
# Now import from anywhere
from web_to_text import parse
```

This creates an editable install (a link back to the source checkout), so there is no code duplication and changes in web2text apply everywhere.
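One way to confirm which copy of a package Python will actually import (a generic check, not specific to this toolkit):

```python
import importlib.util

def install_location(package):
    """Return the file path a package resolves to, or None if not installed."""
    spec = importlib.util.find_spec(package)
    return spec.origin if spec else None

# After `pip install -e`, this should point into your web2text checkout
# rather than into site-packages.
print(install_location("web_to_text"))
```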
Mimics real browser headers to bypass bot detection (403 errors):

- `firefox_windows` (recommended)
- `chrome_windows`
- `safari_mac`
- See `config/browser_profiles.json` for all profiles
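A browser profile is essentially a set of HTTP request headers. A minimal sketch of the idea with `urllib` (the header values below are illustrative, not the actual contents of `config/browser_profiles.json`):

```python
import urllib.request

# Illustrative header set in the spirit of a "firefox_windows" profile.
FIREFOX_WINDOWS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) "
                   "Gecko/20100101 Firefox/115.0"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

def build_request(url, headers=FIREFOX_WINDOWS):
    # Servers that 403 urllib's default "Python-urllib" User-Agent
    # often accept a request that looks like a real browser.
    return urllib.request.Request(url, headers=headers)

req = build_request("https://example.com")
print(req.get_header("User-agent"))
```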
For sites that require JS (SPAs, dynamic content):

```bash
pip install playwright
playwright install chromium
```

```python
text = parse(url, enable_js=True, wait_for_js=2.0)
```

Export cookies from your browser for authenticated access:

- Install a "cookies.txt" browser extension
- Navigate to the site and log in
- Export cookies.txt
- Use it in code:
```python
text = parse(url, cookie_file="cookies.txt")
```

Edit `config/formatting_rules.json` to customize how HTML elements are formatted:

```json
{
  "a": "{text} ({href})",
  "img": "[Image: {alt}]",
  "h1": "\n\n=== {text} ===\n"
}
```

```
web2text/
├── config/                     # Configuration files
│   ├── browser_profiles.json   # Browser fingerprints
│   └── formatting_rules.json   # HTML formatting rules
├── web_to_text/                # Main package
│   ├── parser.py               # Main entry point
│   ├── fetcher.py              # HTTP handling
│   ├── fetcher_js.py           # JavaScript rendering
│   ├── formatter.py            # HTML to text
│   ├── browser_profile.py      # Bot detection bypass
│   ├── config_loader.py        # Configuration
│   └── exceptions.py           # Custom exceptions
├── tests/                      # Test suite
├── docs/                       # Additional documentation
│   ├── QUICKSTART.md
│   ├── ARCHITECTURE.md
│   ├── BOT_DETECTION.md
│   └── JS_RENDERING.md
└── setup.py                    # Package configuration
```
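The rules in `config/formatting_rules.json` look like plain Python `str.format` templates. A hypothetical sketch of how a formatter might apply them (the real logic lives in `web_to_text/formatter.py`; the bare-text fallback is an assumption):

```python
# Same rules as in config/formatting_rules.json.
RULES = {
    "a": "{text} ({href})",
    "img": "[Image: {alt}]",
    "h1": "\n\n=== {text} ===\n",
}

def render(tag, **fields):
    # Assumed default: tags without a rule fall back to their bare text.
    template = RULES.get(tag, "{text}")
    return template.format(**fields)

print(render("a", text="Docs", href="https://example.com"))
print(render("h1", text="Intro"))
```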
- QUICKSTART.md - 5-minute setup guide
- BOT_DETECTION.md - Bypassing 403 errors with browser profiles
- JS_RENDERING.md - When and how to use JavaScript rendering
- ARCHITECTURE.md - Technical details and design
If setting up on a new machine:
```bash
# 1. Clone repository
git clone https://github.com/...
cd web2text

# 2. Create conda environment
conda create -n web2text python=3.12
conda activate web2text

# 3. Install dependencies
pip install -r requirements.txt

# 4. Optional: Install Playwright for JS rendering
pip install playwright
playwright install chromium

# 5. Verify installation
pytest   # Run tests
python -c "from web_to_text import parse; print('✓ Ready')"
```

Fetch an article from a site that blocks default clients:

```python
text = parse("https://news-site.com/article",
             browser_profile="firefox_windows")
```

Access members-only content with exported cookies:

```python
# Export cookies from logged-in browser session
text = parse("https://members-only.com/content",
             browser_profile="firefox_windows",
             cookie_file="cookies.txt")
```

Render a JavaScript-heavy single-page app:

```python
text = parse("https://react-app.com",
             enable_js=True,
             wait_for_js=3.0)
```

Reuse one parser across many URLs:

```python
from web_to_text import WebParser

urls = ["site1.com", "site2.com", "site3.com"]
with WebParser(browser_profile="firefox_windows") as parser:
    for url in urls:
        text = parser.parse(url)
        process(text)
```

```bash
# Run all tests
pytest

# With coverage
pytest --cov=web_to_text

# Specific test file
pytest tests/test_parser.py -v
```

- Python 3.8+
- See `requirements.txt` for dependencies
- Optional: Playwright for JavaScript rendering
MIT License - do whatever you want with it.
- Cookie files are in Netscape format (export from browser extensions)
- Browser profiles help avoid 403 errors on protected sites
- JavaScript rendering is slower but handles SPAs and dynamic content
- All configuration lives in the `config/` directory
- Changes to the installed package apply immediately (editable install)
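Netscape-format cookie files can be read with the standard library's `http.cookiejar`. A sketch of the format and how loading it might work (sample values are made up, and using `MozillaCookieJar` here is an assumption about what `cookie_file=` does, not confirmed by this README):

```python
import http.cookiejar
import os
import tempfile

# Minimal Netscape-format file, as "cookies.txt" browser extensions export:
# domain, include-subdomains, path, secure, expiry, name, value (tab-separated).
sample = (
    "# Netscape HTTP Cookie File\n"
    ".members-only.com\tTRUE\t/\tFALSE\t2147483647\tsession\tabc123\n"
)
path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
with open(path, "w") as f:
    f.write(sample)

# MozillaCookieJar understands exactly this Netscape format.
jar = http.cookiejar.MozillaCookieJar(path)
jar.load()
print([(c.name, c.value) for c in jar])  # -> [('session', 'abc123')]
```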