crawl4ai-patchright

A fork of crawl4ai designed to experiment with and validate anti-bot techniques for protected web properties, specifically G2 (DataDome), Upwork (Cloudflare), and Indeed (custom anti-bot).

Purpose

This fork demonstrates that a real system Chrome binary, combined with a persistent profile, headful mode, and Patchright's stealth patches, can pass modern bot detection without proxies, residential networks, or challenge-solving services. The goal is to establish a reusable pattern for crawling protected sites reliably.

Current Status

  • G2 (DataDome): Cold-session access validated
  • Upwork (Cloudflare): Cold-session access validated
  • Indeed (custom anti-bot): Cold-session access validated (homepage, search, job detail)
  • Extraction proof: Upwork job listings extracted (title, URL, rate, experience level, description)
  • Extraction proof: Indeed job details extracted (title, company, location, salary, job type, description)
  • Ready for use: The presets and validation patterns are production-ready for custom implementations

Installation

Prerequisites

  • Python 3.10+
  • Linux or macOS with real Chrome installed (/usr/bin/google-chrome-stable or equivalent)
  • uv (for package management)
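
Before installing, it can help to confirm that the prerequisites are actually met. The following is a minimal sketch, not part of this repository: the helper names (`find_chrome`, `python_ok`) and the candidate list are assumptions for illustration, and the macOS path is only the usual default location for Chrome.

```python
import shutil
import sys

# Candidate names and paths are assumptions; adjust them for your system.
CANDIDATES = [
    "google-chrome-stable",
    "google-chrome",
    "/usr/bin/google-chrome-stable",
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
]


def find_chrome(candidates=CANDIDATES):
    """Return the first Chrome executable found, or None if there is none."""
    for name in candidates:
        # shutil.which resolves bare names via PATH and also accepts full paths.
        path = shutil.which(name)
        if path:
            return path
    return None


def python_ok(min_version=(3, 10)):
    """Check that the running interpreter meets the minimum version."""
    return sys.version_info >= min_version


if __name__ == "__main__":
    print("Python 3.10+ :", python_ok())
    print("Chrome binary:", find_chrome() or "not found")
```

If the Chrome line prints "not found", install Chrome or add its location to the candidate list before running the validation scripts.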

Steps

# Clone this fork
git clone https://github.com/YB9/crawl4ai-patchright.git
cd crawl4ai-patchright

# Install as editable package
uv pip install -e .

# Install Patchright's browser build and system dependencies (required for stealth patches)
python -m patchright install --with-deps chromium

Quick Start

Validate G2 Access

# Cold-session validation (no warm-up needed)
uv run python validate_g2.py

# Review-page validation
uv run python validate_g2_reviews.py

# Assisted flow (if cold access is blocked; a human in the loop solves the challenge)
uv run python validate_g2_assisted.py

Validate Upwork Access

# Cold-session validation (homepage + job search)
uv run python validate_upwork.py

# Assisted flow (if login-gated pages need access)
uv run python validate_upwork_assisted.py

# Extraction proof (parse 5 job listings from search results)
uv run python extract_upwork.py --jobs 5

Validate Indeed Access

# Cold-session validation (homepage + search + job detail)
uv run python validate_indeed.py

# Assisted flow (if bot challenge blocks cold access)
uv run python validate_indeed_assisted.py

# Extraction proof (search → detail pages → structured fields)
uv run python extract_indeed.py --jobs 5

Architecture

Presets

Each site has a reusable preset that encodes the minimum config needed to pass bot detection:

  • crawl4ai_patchright/g2_preset.py Real Chrome binary, no_sandbox=False, persistent profile, headful mode. Used by validate_g2.py and available for import:

    from crawl4ai_patchright.g2_preset import g2_browser_config, g2_run_config
  • crawl4ai_patchright/upwork_preset.py Identical strategy for Cloudflare-protected Upwork. Used by validate_upwork.py:

    from crawl4ai_patchright.upwork_preset import upwork_browser_config, upwork_run_config
  • crawl4ai_patchright/indeed_preset.py Same strategy for Indeed's custom anti-bot protection. Validates 3 URLs (homepage, search, job detail). Used by validate_indeed.py:

    from crawl4ai_patchright.indeed_preset import indeed_browser_config, indeed_run_config

Key Anti-Bot Settings

| Setting | Value | Why |
|---|---|---|
| browser_engine | patchright | Stealth patches mask automation signals |
| chrome_channel | "chrome" | Real system Chrome, not bundled Chromium |
| no_sandbox | False | Don't add the --no-sandbox flag (triggers a bot-detector warning) |
| headless | False | Headful mode scores higher with bot detectors |
| use_persistent_context | True | Persistent profile accumulates trust over requests |
| user_agent | "" (empty) | Let Chrome advertise its natural UA, not a spoofed one |
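
To make the settings above easier to audit, here is a minimal sketch that mirrors them in a plain dataclass and reports which fields differ from the recommended values. `AntiBotSettings` and `deviations` are hypothetical names used only for illustration; this is not the fork's `BrowserConfig`, whose exact signature is defined in the preset modules.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AntiBotSettings:
    """Hypothetical stand-in mirroring the settings listed above."""
    browser_engine: str = "patchright"
    chrome_channel: str = "chrome"
    no_sandbox: bool = False
    headless: bool = False
    use_persistent_context: bool = True
    user_agent: str = ""  # empty: let Chrome report its own UA


def deviations(settings: AntiBotSettings) -> list[str]:
    """Return the names of fields that differ from the recommended defaults."""
    recommended = AntiBotSettings()
    return [
        f.name
        for f in fields(recommended)
        if getattr(settings, f.name) != getattr(recommended, f.name)
    ]


print(deviations(AntiBotSettings()))
print(deviations(AntiBotSettings(headless=True, no_sandbox=True)))
```

A check like this could run before a crawl to warn when a custom configuration drifts from the values that passed validation.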

Validation Flow

  1. Cold validation (validate_*.py): Tests if public pages load without warm-up

    • Scans for block signals (e.g., "just a moment", "cf-browser-verification", "verify you are human")
    • Confirms success signals (e.g., "upwork", "g2.com", "indeed")
    • Returns verdict: ✅ accessible, ⚠ weak, or ❌ blocked
  2. Assisted validation (validate_*_assisted.py): For sites requiring human interaction

    • Run 1: Opens a headful browser and pauses so the user can solve the challenge (if present)
    • Run 2: Re-crawls the same URL and checks whether the cookies from Run 1 are reused successfully
    • Proves persistent profiles carry trust across crawls
  3. Extraction proof (extract_*.py): Parses structured data from markdown

    • Confirms markdown is clean enough for regex extraction
    • No LLM parsing needed — pure pattern matching on markdown blocks
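
The cold-validation verdict described in step 1 can be sketched as a simple substring scan. This is an illustrative approximation, not the code in `validate_*.py`. The function name `verdict` is hypothetical, and the rule for "weak" (no block signals and no success signals) is an assumption, since the scripts' exact thresholds are not documented here.

```python
# Block signals quoted from the list above; real scripts may use more.
BLOCK_SIGNALS = ["just a moment", "cf-browser-verification", "verify you are human"]


def verdict(page_text, success_signals, block_signals=BLOCK_SIGNALS):
    """Classify a crawled page as 'blocked', 'accessible', or 'weak'.

    Returns (label, matched_block_signals, matched_success_signals).
    """
    text = page_text.lower()
    blocks = [s for s in block_signals if s in text]
    hits = [s for s in success_signals if s in text]
    if blocks:
        return "blocked", blocks, hits
    if hits:
        return "accessible", blocks, hits
    # Assumption: nothing blocked, but nothing recognisable either.
    return "weak", blocks, hits


print(verdict("Just a moment... Checking your browser", ["upwork"]))
print(verdict("Upwork | Find freelance jobs", ["upwork", "job"]))
print(verdict("", ["g2.com"]))
```

Block signals take priority here, so a challenge page that happens to mention the site name is still reported as blocked.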

Usage

Extract from Upwork Job Search

import asyncio
from crawl4ai_patchright import AsyncWebCrawler
from crawl4ai_patchright.upwork_preset import upwork_browser_config, upwork_run_config

async def crawl_upwork():
    browser_cfg = upwork_browser_config()
    run_cfg = upwork_run_config()

    url = "https://www.upwork.com/nx/search/jobs/?q=python+developer&sort=recency"

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url=url, config=run_cfg)

    print(result.markdown)  # Clean, structured markdown
    print(result.html)      # Full HTML if needed

asyncio.run(crawl_upwork())

Extract from G2 Reviews

import asyncio
from crawl4ai_patchright import AsyncWebCrawler
from crawl4ai_patchright.g2_preset import g2_browser_config, g2_run_config

async def crawl_g2():
    browser_cfg = g2_browser_config()
    run_cfg = g2_run_config()

    url = "https://www.g2.com/products/notion/reviews"

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url=url, config=run_cfg)

    return result.markdown

print(asyncio.run(crawl_g2()))  # Print the returned markdown

Extract from Indeed Job Search

import asyncio
from crawl4ai_patchright import AsyncWebCrawler
from crawl4ai_patchright.indeed_preset import indeed_browser_config, indeed_run_config

async def crawl_indeed():
    browser_cfg = indeed_browser_config()
    run_cfg = indeed_run_config()

    url = "https://www.indeed.com/jobs?q=python+developer&l=remote"

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url=url, config=run_cfg)

    print(result.markdown)  # Clean, structured markdown
    print(result.html)      # Full HTML if needed

asyncio.run(crawl_indeed())

File Structure

crawl4ai-patchright/
├── crawl4ai_patchright/
│   ├── g2_preset.py              # G2 BrowserConfig + CrawlerRunConfig
│   ├── upwork_preset.py          # Upwork BrowserConfig + CrawlerRunConfig
│   ├── indeed_preset.py          # Indeed BrowserConfig + CrawlerRunConfig
│   └── ...                       # Rest of crawl4ai fork internals
├── validate_g2.py                # Cold validation (public G2 pages)
├── validate_g2_reviews.py        # G2 review page validation
├── validate_g2_assisted.py       # Assisted flow (human challenge solver)
├── validate_upwork.py            # Cold validation (Upwork homepage + search)
├── validate_upwork_assisted.py   # Assisted flow
├── extract_upwork.py             # Extraction proof (5 job listings)
├── validate_indeed.py            # Cold validation (homepage + search + detail)
├── validate_indeed_assisted.py   # Assisted flow (human challenge solver)
├── extract_indeed.py             # Extraction proof (search → detail → fields)
└── README.md                     # This file

Results

G2 Validation

[✅ RESULT — Cold validation]
  crawl.success: True
  Block signals: None detected
  Success signals: ['g2.com', 'reviews', 'software', 'categories', ...]
  VERDICT: ✅ G2 accessible cold

Upwork Validation

[✅ RESULT — Cold validation]
  crawl.success: True
  Block signals: None detected
  Success signals: ['upwork', 'job', 'hourly', 'fixed', 'budget', ...]
  VERDICT: ✅ Upwork accessible cold

Indeed Validation

[✅ RESULT — Cold validation]
  Homepage:   crawl.success: True, signals: ['indeed', 'find jobs', 'job search', 'salaries', 'company reviews']
  Job search: crawl.success: True, signals: ['indeed', 'jobs', 'salary', 'posted', 'apply', 'full-time', 'remote']
  Job detail: crawl.success: True, signals: ['indeed', 'apply', 'job description', 'qualifications', 'full job description']
  VERDICT: ✅ Indeed accessible cold (all 3 URLs)

Indeed Extraction Proof

[1] Full-Stack Python Developer
     Company     : Nitka Technologies
     Location    : Remote
     Job type    : Full-time
     Schedule    : 8 hour shift
     Description : Nitka Technologies develops software for customers in the US and Europe...

[2] Python Developer
     Company     : Resource Innovations
     Location    : Boston, MA
     Salary      : $90,000 - $100,000 a year
     Job type    : Full-time
     Description : Resource Innovations seeks a Django/Python Developer to join...

[3] Remote Python Developer
     Company     : LookFar Labs
     Location    : Washington, DC
     Salary      : $95,000 - $125,000 a year
     Job type    : Full-time
     Schedule    : Monday to Friday
     Description : Our Python Developer will have the opportunity to build scalable...

EXTRACTION QUALITY: ✅ 5/5 jobs — title, company, location, job_type, description always present; salary, schedule optional
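
The quality line above distinguishes fields that are always present from optional ones. A minimal sketch of such a completeness check follows. The dict layout and the helper names (`missing_required`, `quality`) are assumptions for illustration and may not match what `extract_indeed.py` does internally. The sample records reuse two of the jobs shown above.

```python
REQUIRED = ("title", "company", "location", "job_type", "description")
OPTIONAL = ("salary", "schedule")


def missing_required(job):
    """Return the required fields that are absent or empty in one job record."""
    return [name for name in REQUIRED if not job.get(name)]


def quality(jobs):
    """Summarise how many records have every required field, e.g. '2/2'."""
    complete = sum(1 for job in jobs if not missing_required(job))
    return f"{complete}/{len(jobs)}"


jobs = [
    {
        "title": "Full-Stack Python Developer",
        "company": "Nitka Technologies",
        "location": "Remote",
        "job_type": "Full-time",
        "schedule": "8 hour shift",
        "description": "Nitka Technologies develops software for customers in the US and Europe...",
    },
    {
        "title": "Python Developer",
        "company": "Resource Innovations",
        "location": "Boston, MA",
        "salary": "$90,000 - $100,000 a year",
        "job_type": "Full-time",
        "description": "Resource Innovations seeks a Django/Python Developer to join...",
    },
]

print(quality(jobs))
```

Optional fields are deliberately ignored by the check, so a record without a salary still counts as complete.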

Upwork Extraction Proof

[1] Backend Engineer
    Type/Rate: Hourly — $5.00 - $20.00
    Level: Intermediate
    Description: Multi-sensor intelligence platform, backend services...

[2] Full-stack AI/ML With Front End
    Type/Rate: Fixed price — $361.00
    Level: Intermediate
    Description: Connect HTML Form, FastAPI Backend, Deploy Static Site...

[3] Experienced Full Stack Developer to Make me A Web App
    Type/Rate: Fixed price — $500.00
    Level: Expert
    Description: Web app developer needed, portfolio required...

[4] Django Migration Expert Needed for Azure Deployment
    Type/Rate: Fixed price — $400.00
    Level: Entry Level
    Description: Migrate Django from PythonAnywhere to Azure...

[5] Architect & Developer for Real-time Voice AI SaaS Project
    Type/Rate: Hourly — $25.00 - $45.00
    Level: Expert
    Description: AI-SkyTalk flight radio simulator, voice AI training...

EXTRACTION QUALITY: ✅ 5/5 jobs, all fields present
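
As an illustration of the pattern-matching approach, the sketch below parses the Type/Rate strings shown above into a structured form. It assumes the exact text layout of that output (a long dash between the type and the amount, and " - " between the low and high rate). The function name `parse_rate` is hypothetical, and the real regexes in `extract_upwork.py` run against crawled markdown and may differ.

```python
import re

# \u2014 is the long dash that separates type and amount in the output above.
RATE_RE = re.compile(
    r"^(?P<kind>.+?)\s+\u2014\s+"
    r"\$(?P<low>[\d,]+(?:\.\d+)?)"
    r"(?:\s+-\s+\$(?P<high>[\d,]+(?:\.\d+)?))?$"
)


def _to_float(amount):
    """Convert '1,250.00' style amounts to a float."""
    return float(amount.replace(",", ""))


def parse_rate(text):
    """Parse 'Hourly <dash> $5.00 - $20.00' or 'Fixed price <dash> $361.00'.

    Returns a dict with kind, low, and high, or None if the text does not match.
    """
    match = RATE_RE.match(text.strip())
    if not match:
        return None
    low = _to_float(match.group("low"))
    # A single amount (fixed price) is reported as low == high.
    high = _to_float(match.group("high")) if match.group("high") else low
    return {"kind": match.group("kind"), "low": low, "high": high}


print(parse_rate("Hourly \u2014 $5.00 - $20.00"))
print(parse_rate("Fixed price \u2014 $361.00"))
```

Returning `None` for unmatched text keeps a layout change on the site visible as a drop in extraction quality rather than a silent wrong value.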

Known Issues & Limitations

  1. No Selenium grid support: Uses local Patchright + real Chrome only
  2. Headful mode required: Headless mode scores lower with bot detectors (use assisted flow if needed)
  3. Login-gated content: Public pages pass cold; full job details may require login (Upwork, Indeed)
  4. URL slug artifacts: Upwork injects span-class-highlight markers in search-result URL slugs (cosmetic; routing still works)
  5. Indeed detail pages: Job detail URLs are dynamically extracted from search results via vjk= parameters; if search is blocked, detail validation is skipped
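
Item 5 can be illustrated with a short sketch that pulls `vjk=` values out of URLs found in search-result markdown and turns them into detail-page URLs. This is not the code in `validate_indeed.py`. The function name `detail_urls`, the `viewjob?jk=` URL form, and the job keys in the sample are assumptions made for illustration.

```python
import re
from urllib.parse import parse_qs, urlparse

# Match URLs up to whitespace, quotes, angle brackets, or a closing parenthesis
# (the latter ends a markdown link target).
URL_RE = re.compile(r"https?://[^\s)\"'<>]+")


def detail_urls(search_markdown, base="https://www.indeed.com/viewjob"):
    """Build de-duplicated detail URLs from vjk= parameters, in order of appearance."""
    seen = set()
    urls = []
    for url in URL_RE.findall(search_markdown):
        for key in parse_qs(urlparse(url).query).get("vjk", []):
            if key not in seen:
                seen.add(key)
                urls.append(f"{base}?jk={key}")
    return urls


# Made-up sample markdown with made-up job keys.
sample = (
    "[Python Developer](https://www.indeed.com/jobs?q=python+developer&l=remote&vjk=abc123def4567890)\n"
    "[Remote Python Developer](https://www.indeed.com/jobs?q=python+developer&vjk=0011223344556677)\n"
    "[Python Developer](https://www.indeed.com/jobs?q=python+developer&l=remote&vjk=abc123def4567890)\n"
)

print(detail_urls(sample))
```

An empty result is a useful signal in its own right: it means the search page carried no `vjk=` links, which is the case where detail validation is skipped.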

Contributing

This fork is experimental. To contribute:

  1. Test new anti-bot techniques against G2, Upwork, or Indeed
  2. Document changes in preset docstrings
  3. Update block/success signals in validation scripts if detection patterns change
  4. Keep presets minimal — only add config that actually defeats bot detection

License

Same as parent repo (crawl4ai). See LICENSE in the parent repository.


Status: Fork validated for G2, Upwork, and Indeed. Production-ready presets and extraction patterns.

About

Crawl4AI + patchright + assisted session handling. An experiment on how to bypass advanced bot protection. Tested and working on G2, Upwork & Indeed. (Educational only)
