A fork of crawl4ai designed to experiment with and validate anti-bot techniques for protected web properties, specifically G2 (DataDome), Upwork (Cloudflare), and Indeed (custom anti-bot).
This fork proves that real system Chrome binaries combined with persistent profiles, headful mode, and Patchright's stealth patches can defeat modern bot detection without proxies, residential networks, or challenge-solving services. The goal is to establish a reusable pattern for crawling protected sites reliably.
- ✅ G2 (DataDome): Cold-session access validated
- ✅ Upwork (Cloudflare): Cold-session access validated
- ✅ Indeed (custom anti-bot): Cold-session access validated (homepage, search, job detail)
- ✅ Extraction proof: Upwork job listings extracted (title, URL, rate, experience level, description)
- ✅ Extraction proof: Indeed job details extracted (title, company, location, salary, job type, description)
- ✅ Ready for use: The presets and validation patterns are production-ready for custom implementations
- Python 3.10+
- Linux or macOS with real Chrome installed (`/usr/bin/google-chrome-stable` or equivalent)
- `uv` (for package management)
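A quick way to confirm a real Chrome binary is present before installing (the candidate paths below are common defaults, not an exhaustive list; adjust for your distro):

```python
import shutil

# Common names/paths for a real system Chrome binary. These are typical
# defaults, not guaranteed on every machine.
CHROME_CANDIDATES = [
    "google-chrome-stable",
    "google-chrome",
    "/usr/bin/google-chrome-stable",
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
]

def find_system_chrome(candidates=CHROME_CANDIDATES):
    """Return the first resolvable Chrome binary path, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

if __name__ == "__main__":
    chrome = find_system_chrome()
    print(chrome or "No system Chrome found; install google-chrome-stable first")
```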
```bash
# Clone this fork
git clone https://github.com/YB9/crawl4ai-patchright.git
cd crawl4ai-patchright

# Install as editable package
uv pip install -e .

# Download Patchright (required for stealth patches)
python -m patchright install --with-deps chromium
```

```bash
# Cold-session validation (no warm-up needed)
uv run python validate_g2.py

# Review-page validation
uv run python validate_g2_reviews.py

# Assisted flow (if cold access is blocked — human-in-loop challenge solver)
uv run python validate_g2_assisted.py
```

```bash
# Cold-session validation (homepage + job search)
uv run python validate_upwork.py

# Assisted flow (if login-gated pages need access)
uv run python validate_upwork_assisted.py

# Extraction proof (parse 5 job listings from search results)
uv run python extract_upwork.py --jobs 5
```

```bash
# Cold-session validation (homepage + search + job detail)
uv run python validate_indeed.py

# Assisted flow (if bot challenge blocks cold access)
uv run python validate_indeed_assisted.py

# Extraction proof (search → detail pages → structured fields)
uv run python extract_indeed.py --jobs 5
```

Each site has a reusable preset that encodes the minimum config needed to pass bot detection:
- `crawl4ai_patchright/g2_preset.py`: real Chrome binary, `no_sandbox=False`, persistent profile, headful mode. Used by `validate_g2.py` and available for import: `from crawl4ai_patchright.g2_preset import g2_browser_config, g2_run_config`
- `crawl4ai_patchright/upwork_preset.py`: identical strategy for Cloudflare-protected Upwork. Used by `validate_upwork.py`: `from crawl4ai_patchright.upwork_preset import upwork_browser_config, upwork_run_config`
- `crawl4ai_patchright/indeed_preset.py`: same strategy for Indeed's custom anti-bot protection. Validates 3 URLs (homepage, search, job detail). Used by `validate_indeed.py`: `from crawl4ai_patchright.indeed_preset import indeed_browser_config, indeed_run_config`
| Setting | Value | Why |
|---|---|---|
| `browser_engine` | `patchright` | Stealth patches mask automation signals |
| `chrome_channel` | `"chrome"` | Real system Chrome, not bundled Chromium |
| `no_sandbox` | `False` | Doesn't add the `--no-sandbox` flag (which triggers bot-detector warnings) |
| `headless` | `False` | Headful mode scores higher with bot detectors |
| `use_persistent_context` | `True` | Persistent profile accumulates trust across requests |
| `user_agent` | `""` (empty) | Lets Chrome advertise its natural UA, not a spoofed one |
- Cold validation (`validate_*.py`): tests whether public pages load without warm-up
  - Scans for block signals (e.g., "just a moment", "cf-browser-verification", "verify you are human")
  - Confirms success signals (e.g., "upwork", "g2.com", "indeed")
  - Returns a verdict: ✅ accessible, ⚠ weak, or ❌ blocked
- Assisted validation (`validate_*_assisted.py`): for sites requiring human interaction
  - RUN 1: opens a headful browser and pauses for the user to solve the challenge (if present)
  - RUN 2: re-crawls the same URL, checking whether cookies from Run 1 are reused successfully
  - Proves persistent profiles carry trust across crawls
- Extraction proof (`extract_*.py`): parses structured data from markdown
  - Confirms the markdown is clean enough for regex extraction
  - No LLM parsing needed — pure pattern matching on markdown blocks
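The cold-validation logic above can be sketched as a simple signal scan. The signal lists here are illustrative, not the exact ones used in the `validate_*.py` scripts:

```python
# Strings that indicate a challenge/block page (illustrative subset).
BLOCK_SIGNALS = ["just a moment", "cf-browser-verification", "verify you are human"]

def verdict(markdown: str, success_signals: list[str]) -> str:
    """Classify a crawled page as blocked, weak, or accessible."""
    text = markdown.lower()
    if any(sig in text for sig in BLOCK_SIGNALS):
        return "❌ blocked"
    if not any(sig in text for sig in success_signals):
        return "⚠ weak"  # no block page, but expected content is missing
    return "✅ accessible"
```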
```python
import asyncio
from crawl4ai_patchright import AsyncWebCrawler
from crawl4ai_patchright.upwork_preset import upwork_browser_config, upwork_run_config

async def crawl_upwork():
    browser_cfg = upwork_browser_config()
    run_cfg = upwork_run_config()
    url = "https://www.upwork.com/nx/search/jobs/?q=python+developer&sort=recency"
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url=url, config=run_cfg)
        print(result.markdown)  # Clean, structured markdown
        print(result.html)      # Full HTML if needed

asyncio.run(crawl_upwork())
```

```python
import asyncio
from crawl4ai_patchright import AsyncWebCrawler
from crawl4ai_patchright.g2_preset import g2_browser_config, g2_run_config

async def crawl_g2():
    browser_cfg = g2_browser_config()
    run_cfg = g2_run_config()
    url = "https://www.g2.com/products/notion/reviews"
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url=url, config=run_cfg)
        return result.markdown

markdown = asyncio.run(crawl_g2())
```

```python
import asyncio
from crawl4ai_patchright import AsyncWebCrawler
from crawl4ai_patchright.indeed_preset import indeed_browser_config, indeed_run_config

async def crawl_indeed():
    browser_cfg = indeed_browser_config()
    run_cfg = indeed_run_config()
    url = "https://www.indeed.com/jobs?q=python+developer&l=remote"
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url=url, config=run_cfg)
        print(result.markdown)  # Clean, structured markdown
        print(result.html)      # Full HTML if needed

asyncio.run(crawl_indeed())
```

```
crawl4ai-patchright/
├── crawl4ai_patchright/
│   ├── g2_preset.py              # G2 BrowserConfig + CrawlerRunConfig
│   ├── upwork_preset.py          # Upwork BrowserConfig + CrawlerRunConfig
│   ├── indeed_preset.py          # Indeed BrowserConfig + CrawlerRunConfig
│   └── ...                       # Rest of crawl4ai fork internals
├── validate_g2.py                # Cold validation (public G2 pages)
├── validate_g2_reviews.py        # G2 review page validation
├── validate_g2_assisted.py       # Assisted flow (human challenge solver)
├── validate_upwork.py            # Cold validation (Upwork homepage + search)
├── validate_upwork_assisted.py   # Assisted flow
├── extract_upwork.py             # Extraction proof (5 job listings)├── validate_indeed.py            # Cold validation (homepage + search + detail)
├── validate_indeed_assisted.py   # Assisted flow (human challenge solver)
├── extract_indeed.py             # Extraction proof (search → detail → fields)
└── README.md                     # This file
```
```
[✅ RESULT — Cold validation]
crawl.success: True
Block signals: None detected
Success signals: ['g2.com', 'reviews', 'software', 'categories', ...]
VERDICT: ✅ G2 accessible cold
```

```
[✅ RESULT — Cold validation]
crawl.success: True
Block signals: None detected
Success signals: ['upwork', 'job', 'hourly', 'fixed', 'budget', ...]
VERDICT: ✅ Upwork accessible cold
```

```
[✅ RESULT — Cold validation]
Homepage:   crawl.success: True, signals: ['indeed', 'find jobs', 'job search', 'salaries', 'company reviews']
Job search: crawl.success: True, signals: ['indeed', 'jobs', 'salary', 'posted', 'apply', 'full-time', 'remote']
Job detail: crawl.success: True, signals: ['indeed', 'apply', 'job description', 'qualifications', 'full job description']
VERDICT: ✅ Indeed accessible cold (all 3 URLs)
```
```
[1] Full-Stack Python Developer
    Company     : Nitka Technologies
    Location    : Remote
    Job type    : Full-time
    Schedule    : 8 hour shift
    Description : Nitka Technologies develops software for customers in the US and Europe...

[2] Python Developer
    Company     : Resource Innovations
    Location    : Boston, MA
    Salary      : $90,000 - $100,000 a year
    Job type    : Full-time
    Description : Resource Innovations seeks a Django/Python Developer to join...

[3] Remote Python Developer
    Company     : LookFar Labs
    Location    : Washington, DC
    Salary      : $95,000 - $125,000 a year
    Job type    : Full-time
    Schedule    : Monday to Friday
    Description : Our Python Developer will have the opportunity to build scalable...

EXTRACTION QUALITY: ✅ 5/5 jobs — title, company, location, job_type, description always present; salary, schedule optional
```
```
[1] Backend Engineer
    Type/Rate   : Hourly — $5.00 - $20.00
    Level       : Intermediate
    Description : Multi-sensor intelligence platform, backend services...

[2] Full-stack AI/ML With Front End
    Type/Rate   : Fixed price — $361.00
    Level       : Intermediate
    Description : Connect HTML Form, FastAPI Backend, Deploy Static Site...

[3] Experienced Full Stack Developer to Make me A Web App
    Type/Rate   : Fixed price — $500.00
    Level       : Expert
    Description : Web app developer needed, portfolio required...

[4] Django Migration Expert Needed for Azure Deployment
    Type/Rate   : Fixed price — $400.00
    Level       : Entry Level
    Description : Migrate Django from PythonAnywhere to Azure...

[5] Architect & Developer for Real-time Voice AI SaaS Project
    Type/Rate   : Hourly — $25.00 - $45.00
    Level       : Expert
    Description : AI-SkyTalk flight radio simulator, voice AI training...

EXTRACTION QUALITY: ✅ 5/5 jobs, all fields present
```
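The pattern-matching approach behind these extractions can be sketched as follows. The markdown shape and regex here are hypothetical; the real `extract_*.py` scripts match the actual site markdown, which differs:

```python
import re

# Hypothetical markdown shape for one search-result entry:
# [Title](url) Hourly — $x - $y \n Level: ...
# The real Upwork/Indeed markdown differs, but the pure-regex idea is the same.
JOB_RE = re.compile(
    r"\[(?P<title>[^\]]+)\]\((?P<url>https?://[^)]+)\)\s*"
    r"(?P<rate>Hourly[^\n]*|Fixed price[^\n]*)\n"
    r"Level:\s*(?P<level>[^\n]+)"
)

def extract_jobs(markdown: str) -> list[dict]:
    """Return one dict of named fields per matched job block."""
    return [m.groupdict() for m in JOB_RE.finditer(markdown)]
```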
- No Selenium grid support: Uses local Patchright + real Chrome only
- Headful mode required: Headless mode scores lower with bot detectors (use assisted flow if needed)
- Login-gated content: Public pages pass cold; full job details may require login (Upwork, Indeed)
- URL slug artifacts: Upwork injects `span-class-highlight` markers in search-result URL slugs (cosmetic; routing still works)
- Indeed detail pages: job detail URLs are dynamically extracted from search results via `vjk=` parameters; if search is blocked, detail validation is skipped
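The `vjk=` lookup can be sketched with the standard library. The parameter name comes from the limitation above; the detail-URL shape built from the key is an assumption, not taken from the fork's code:

```python
from urllib.parse import urlparse, parse_qs

def job_keys_from_links(links: list[str]) -> list[str]:
    """Pull Indeed job keys from search-result hrefs carrying a vjk= parameter."""
    keys: list[str] = []
    for href in links:
        qs = parse_qs(urlparse(href).query)
        keys.extend(qs.get("vjk", []))
    return keys

def detail_url(vjk: str) -> str:
    # Assumed detail-URL shape; verify against the URLs the search page emits.
    return f"https://www.indeed.com/viewjob?jk={vjk}"
```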
This fork is experimental. To contribute:
- Test new anti-bot techniques against G2, Upwork, or Indeed
- Document changes in preset docstrings
- Update block/success signals in validation scripts if detection patterns change
- Keep presets minimal — only add config that actually defeats bot detection
- Parent repo: crawl4ai
- Patchright: https://github.com/Aleph-Alpha/patchright
- DataDome docs: https://datadome.co/
- Cloudflare challenge detection: https://support.cloudflare.com/hc/en-us/articles/200170156
Same as parent repo (crawl4ai). See LICENSE in the parent repository.
Status: Fork validated for G2, Upwork, and Indeed. Production-ready presets and extraction patterns.