SuperScrape

Python 3.10+ | License: MIT | CI

Web scraping + AI visual intelligence that just works -- anti-bot era edition.

SuperScrape uses Camoufox (C++ anti-detection Firefox) to scrape sites that block Playwright, Selenium, and curl. Then it analyzes product images with GPT Vision to generate competitive intelligence reports.

# Scrape Amazon product images + run AI analysis
superscrape amazon visual "portable blender" --top 10

Features

  • Anti-bot scraping -- Camoufox bypasses Cloudflare, DataDome, and other bot detection
  • Amazon -- Product pages, search results, image extraction with hi-res upgrade
  • Instagram -- Public profiles, recent posts, follower counts
  • Reddit -- Subreddit posts with sorting and filtering
  • eBay, Walmart, Etsy, Shopee -- Additional e-commerce platforms
  • Visual Intelligence -- GPT Vision analyzes product images (type, angle, background, text, people)
  • Reports -- Markdown + JSON reports with category-level insights and recommendations
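To illustrate what category-level insights over the image attributes listed above might look like, here is a minimal sketch that tallies attribute values across a set of per-image analyses. The record keys (`type`, `angle`, `background`, `has_text`, `has_people`), the sample values, and the `summarize_attributes` helper are all hypothetical; SuperScrape's actual analysis schema may differ.

```python
from collections import Counter
from typing import Any

# Hypothetical per-image analysis records; the real schema may differ.
analyses = [
    {"type": "lifestyle", "angle": "front", "background": "kitchen",
     "has_text": False, "has_people": True},
    {"type": "studio", "angle": "front", "background": "white",
     "has_text": True, "has_people": False},
    {"type": "studio", "angle": "side", "background": "white",
     "has_text": False, "has_people": False},
]


def summarize_attributes(records: list[dict[str, Any]]) -> dict[str, dict[str, float]]:
    """Return, for each attribute, the share of images that have each value."""
    if not records:
        return {}
    counts: dict[str, Counter] = {}
    for record in records:
        for key, value in record.items():
            counts.setdefault(key, Counter())[str(value)] += 1
    total = len(records)
    return {
        key: {value: n / total for value, n in counter.items()}
        for key, counter in counts.items()
    }


summary = summarize_attributes(analyses)
print(summary["background"])
```

A summary like this could feed a statement such as "most top listings use a white background", which is the kind of observation a category-level report is meant to surface.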

Prerequisites

  • Python 3.10+
  • An OpenAI API key (for Visual Intelligence features)

Installation

pip install superscrape

# Download the Camoufox browser binary
python -m camoufox fetch

# Verify that the Camoufox package imports
python -c "from camoufox.sync_api import Camoufox; print('ready')"

Or install from source:

git clone https://github.com/PHY041/superscrape.git
cd superscrape
pip install -e ".[dev]"

Quick Start

# 1. Set your OpenAI API key (needed for visual analysis)
export OPENAI_API_KEY="sk-..."

# 2. Scrape a single Amazon product
superscrape amazon product B0CX23V2ZK

# 3. Search Amazon
superscrape amazon search "wireless earbuds" --pages 2

# 4. Run full visual intelligence pipeline
superscrape amazon visual "boys dress shirt" --top 10 --output-dir ./reports

# 5. Scrape Instagram
superscrape instagram natgeo

# 6. Scrape Reddit
superscrape reddit SideProject --sort hot --limit 50

CLI Reference

superscrape
  amazon
    product <ASIN>              Scrape a single product
    search <KEYWORD>            Search results with pagination
    visual <KEYWORD>            Full visual intelligence pipeline
  instagram <USERNAME>          Public profile + recent posts
  reddit <SUBREDDIT>            Posts with sorting (hot/new/top)

Options

Command         Flag                 Description
amazon product  --images-only        Only output image URLs
amazon search   --pages N            Number of search pages
amazon visual   --top N              Number of products to analyze
amazon visual   --no-cache           Bypass cached results
amazon visual   --output-dir DIR     Output directory
reddit          --sort hot|new|top   Sort order
reddit          --limit N            Max posts to fetch
All commands    --output json|table  Output format
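When scripting batch runs, the flags above can be assembled programmatically. The following is a rough sketch, not part of SuperScrape itself: `build_visual_command` and `run_visual` are hypothetical helper names, and only flags documented in the table are used.

```python
import shlex
import subprocess
from typing import Optional


def build_visual_command(keyword: str, top: int = 10,
                         output_dir: Optional[str] = None,
                         no_cache: bool = False) -> list[str]:
    """Build the argument list for `superscrape amazon visual`."""
    if top < 1:
        raise ValueError("top must be a positive integer")
    cmd = ["superscrape", "amazon", "visual", keyword, "--top", str(top)]
    if no_cache:
        cmd.append("--no-cache")
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    return cmd


def run_visual(keyword: str, **options) -> int:
    """Run the pipeline for one keyword and return the process exit code."""
    cmd = build_visual_command(keyword, **options)
    return subprocess.run(cmd, check=False).returncode


# Preview the commands without executing them.
for kw in ["portable blender", "wireless earbuds"]:
    print(shlex.join(build_visual_command(kw, top=10, output_dir="./reports")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with multi-word keywords.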

Python API

from superscrape.sites.amazon import Amazon
from superscrape.analyzers.vision import batch_analyze_first_images
from superscrape.reporters.visual_report import aggregate_report, render_markdown

# Scrape
products = Amazon.search_images("portable blender", top_n=10)

# Analyze with GPT Vision
analyses = batch_analyze_first_images(products)

# Generate report
report = aggregate_report("portable blender", products, analyses)
markdown = render_markdown(report)

Environment Variables

Variable            Required             Description
OPENAI_API_KEY      For visual analysis  OpenAI API key for GPT Vision
BYTEPLUSES_API_KEY  Optional             BytePlus API key for lifestyle image generation
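If you call SuperScrape from your own scripts, it can help to check for the required key before starting a long scrape. This is a minimal sketch; the `missing_variables` helper is hypothetical and not part of the SuperScrape API.

```python
import os
from typing import Mapping


def missing_variables(required: list[str],
                      environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names in `required` that are unset or blank in `environ`."""
    return [name for name in required if not environ.get(name, "").strip()]


missing = missing_variables(["OPENAI_API_KEY"])
if missing:
    print("Visual analysis unavailable; missing:", ", ".join(missing))
else:
    print("OPENAI_API_KEY is set; visual analysis can run.")
```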

API Server (Optional)

SuperScrape includes an optional FastAPI server with real-time job tracking:

# Install API dependencies
pip install "superscrape[api]"

# Start the server
uvicorn api.main:app --host 0.0.0.0 --port 8001

# Or use Docker
docker compose up --build

API endpoints:

  • POST /jobs -- Submit a scraping + analysis job
  • GET /jobs/{id} -- Job status
  • GET /jobs/{id}/stream -- SSE real-time progress
  • GET /reports -- List generated reports
  • GET /health -- Health check
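For clients consuming the progress stream, the sketch below shows one way to build the job URLs and parse Server-Sent Events using only the standard library. The base URL assumes the uvicorn port shown earlier; the helper names and the JSON fields in the sample events (`status`, `progress`) are assumptions for illustration, since the actual payload format is not documented here.

```python
import json
from typing import Iterable, Iterator

BASE_URL = "http://localhost:8001"  # assumes the uvicorn port shown above


def job_urls(job_id: str, base_url: str = BASE_URL) -> dict[str, str]:
    """Build the status and stream URLs for a job."""
    base = base_url.rstrip("/")
    return {
        "status": f"{base}/jobs/{job_id}",
        "stream": f"{base}/jobs/{job_id}/stream",
    }


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event in a stream of lines.

    Basic SSE framing: `data:` lines accumulate, a blank line ends the event,
    and lines starting with `:` are comments. Other fields are ignored.
    """
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)


# Hypothetical event payloads; the real field names may differ.
sample = [
    ": keep-alive\n",
    'data: {"status": "scraping", "progress": 0.4}\n',
    "\n",
    'data: {"status": "done", "progress": 1.0}\n',
    "\n",
]
for payload in iter_sse_data(sample):
    print(json.loads(payload))
```

Against a running server, the lines could come from `urllib.request.urlopen(job_urls(job_id)["stream"])`, decoding each line as UTF-8 before passing it to `iter_sse_data`.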

Architecture

CLI / API Request
    |
    v
+---------------------------+
|  Scraping Layer            |
|  sites/amazon.py           |
|  sites/instagram.py        |
|  sites/reddit.py           |
+------------+--------------+
             |
             v
      Camoufox Browser
      (C++ anti-detection)
             |
             v
+---------------------------+
|  AI Analysis               |
|  analyzers/vision.py       |
|  (OpenAI GPT Vision)       |
+------------+--------------+
             |
             v
+---------------------------+
|  Reports                   |
|  reporters/visual_report   |
|  Markdown + JSON + HTML    |
+---------------------------+

Anti-Bot Test Results

Tested with Camoufox against major platforms:

Platform              Status  Notes
Amazon                Pass    Search, product pages, images
Instagram             Pass    Public profiles, no login required
Reddit                Pass    Playwright+stealth gets blocked, Camoufox passes
eBay                  Pass    Product listings, prices
Walmart               Pass    Product pages
Etsy                  Pass    Listings, prices
Cloudflare Challenge  Pass    Generic CF challenge page
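To rerun a check like this yourself, something along the following lines may work. This is not the harness used to produce the table: the status codes and title markers are illustrative heuristics, and `looks_blocked` and `check_url` are hypothetical helper names. Only the `Camoufox` context manager and standard Playwright page calls are used.

```python
from typing import Optional

# Illustrative heuristics only; real block pages vary by site.
BLOCK_STATUSES = {403, 429, 503}
BLOCK_TITLE_MARKERS = ("just a moment", "access denied", "attention required")


def looks_blocked(status: Optional[int], title: str) -> bool:
    """Treat common block statuses or challenge-page titles as a block."""
    if status in BLOCK_STATUSES:
        return True
    lowered = title.lower()
    return any(marker in lowered for marker in BLOCK_TITLE_MARKERS)


def check_url(url: str) -> bool:
    """Load `url` in Camoufox and return True if the page does not look blocked."""
    # Imported lazily so the heuristic above works without Camoufox installed.
    from camoufox.sync_api import Camoufox

    with Camoufox(headless=True) as browser:
        page = browser.new_page()
        response = page.goto(url, wait_until="domcontentloaded")
        status = response.status if response is not None else None
        return not looks_blocked(status, page.title())
```

A challenge page that resolves itself after a few seconds would be reported as blocked by this sketch, so a more careful check might wait and re-read the title before deciding.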

Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

MIT License -- see LICENSE for details.

Powered by CanMarket.
