ScrapeMaster

A distributed web scraping platform that intelligently handles both static and dynamic content with built-in anti-detection capabilities.

License: MIT Python 3.11+ Docker

What is this?

I built ScrapeMaster to solve a common problem: most scrapers work well with static sites or with dynamic sites, but rarely both. This platform automatically detects what kind of site you're dealing with and routes the job to the appropriate engine: Scrapy for fast static scraping, or Playwright when you need full browser capabilities.

It's been running in production for several months now, handling everything from simple data collection to complex multi-step workflows that require JavaScript execution.

Main Features

Smart Engine Selection
The system automatically picks between Scrapy and Playwright based on the target site. No need to manually configure which engine to use.
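The README doesn't spell out the detection heuristic. As a minimal sketch (the function name, threshold, and signal are assumptions), one cheap signal is how much visible text a page's raw HTML carries relative to its script weight: client-rendered apps tend to ship little static text.

```python
from html.parser import HTMLParser

class _TextCounter(HTMLParser):
    """Counts visible text vs. <script> content in raw HTML."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text_chars = 0
        self.script_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_chars += len(data)
        else:
            self.text_chars += len(data.strip())

def pick_engine(html: str, threshold: float = 0.5) -> str:
    """Return 'scrapy' for text-heavy pages, 'playwright' for script-heavy ones."""
    counter = _TextCounter()
    counter.feed(html)
    total = counter.text_chars + counter.script_chars
    if total == 0 or counter.script_chars / total > threshold:
        return "playwright"  # little static text: likely client-rendered
    return "scrapy"
```

A production version would also look at signals like empty root containers or known SPA framework markers, but the ratio test alone catches a lot of cases.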

Anti-Detection Tools
Proxy rotation, randomized browser fingerprints, and CAPTCHA solving integration help avoid blocks. The proxy manager cycles through your proxy list and handles failures gracefully.
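The actual proxy manager lives under scrapers/evasion/; as an illustrative sketch of the "cycles through your list and handles failures" behavior (class and method names are my own, not the project's), round-robin rotation with failure-aware eviction could look like:

```python
import itertools
from collections import Counter

class ProxyRotator:
    """Round-robin proxy cycling; a proxy is skipped after max_failures."""
    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)
        self.max_failures = max_failures
        self.failures = Counter()
        self._cycle = itertools.cycle(self.proxies)

    def next_proxy(self):
        # Try at most one full pass over the list before giving up.
        for _ in range(len(self.proxies)):
            proxy = next(self._cycle)
            if self.failures[proxy] < self.max_failures:
                return proxy
        raise RuntimeError("all proxies exhausted")

    def report_failure(self, proxy):
        self.failures[proxy] += 1
```

Callers would wire `report_failure` into their request error handling so banned or dead proxies rotate out automatically.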

Machine Learning Extraction
Instead of writing complex selectors for every site, the ML module can often figure out what data you want by analyzing the DOM structure. It's not perfect, but it works surprisingly well on common layouts.
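The extraction models themselves aren't documented here, but the general idea of classifying DOM nodes from structural features can be sketched with scikit-learn. Everything below is invented for illustration: the features (text length, link density, depth) and the toy training data are assumptions, not the project's actual model.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy features per DOM node: [text_length, link_density, depth]
# Labels: 1 = content node, 0 = boilerplate (nav, footer, ads)
X = [
    [1200, 0.05, 3],  # long text, few links: article body
    [900, 0.10, 4],
    [40, 0.90, 2],    # short, link-heavy: navigation
    [15, 0.85, 5],
    [60, 0.95, 2],
    [1500, 0.02, 3],
]
y = [1, 1, 0, 0, 0, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# A long, low-link-density node should classify as content.
print(clf.predict([[1000, 0.08, 3]])[0])
```

The real module presumably uses richer features and real training data, but this is the shape of the approach: score each node, then extract from the ones classified as content.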

REST API & Dashboard
Full API for programmatic access, plus a React frontend for monitoring jobs and viewing results in real-time.

Docker-Based Deployment
Everything runs in containers. Scaling is just a matter of spinning up more worker instances.

Plays Nice with Robots
Respects robots.txt, includes configurable rate limiting, and properly identifies itself. Web scraping can be done responsibly.
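The robots.txt check itself needs no third-party code; Python's standard library covers it. A minimal sketch (the wrapper function is my own, assuming the robots.txt text has already been fetched):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Caching the parsed rules per host avoids re-fetching robots.txt on every request.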

Architecture

graph TB
    Client[Client/Dashboard]
    API[REST API]
    Scheduler[Job Scheduler]
    
    subgraph Workers
        ScrapyWorker[Scrapy Workers]
        PlaywrightWorker[Playwright Workers]
    end
    
    subgraph Storage
        PostgreSQL[(PostgreSQL)]
        MongoDB[(MongoDB)]
        Redis[(Redis)]
    end
    
    Client --> API
    API --> Scheduler
    Scheduler --> Redis
    Scheduler --> ScrapyWorker
    Scheduler --> PlaywrightWorker
    ScrapyWorker --> MongoDB
    PlaywrightWorker --> MongoDB
    API --> PostgreSQL

Getting Started

You'll need Docker installed (Desktop on Windows/Mac, Engine on Linux). The platform needs about 4GB of RAM minimum, though 8GB is better if you're running multiple concurrent jobs.

Setup

# Clone and enter directory
git clone https://github.com/syeedalireza/ScrapeMaster.git
cd ScrapeMaster

# Copy environment template and add your passwords
cp .env.example .env
# Edit .env - don't skip this, change the default passwords!

# Fire it up
docker-compose up -d

That's it. The dashboard will be available at http://localhost once all containers are healthy (takes about 30 seconds).

Running Your First Job

import requests

response = requests.post('http://localhost/api/v1/jobs', json={
    "url": "https://example.com",
    "engine": "auto",  # Let it decide
    "selectors": {
        "title": "h1",
        "content": "article"
    }
})

print(f"Job ID: {response.json()['job_id']}")

Check the dashboard or poll the API to see when it's done. Results get stored in MongoDB.
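The README doesn't show the status endpoint or its response fields, so the status values and the `requests`-based usage below are assumptions. A reusable polling helper that takes the fetch step as a callable might look like:

```python
import time

def wait_for_job(fetch_status, job_id, timeout=300, interval=2.0):
    """Poll fetch_status(job_id) until the job finishes or the timeout expires.

    fetch_status should return a dict like {"status": "...", ...};
    the terminal values ('completed', 'failed') are assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status(job_id)
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {timeout}s")
```

With requests, `fetch_status` could be `lambda jid: requests.get(f"http://localhost/api/v1/jobs/{jid}").json()` (that endpoint path is assumed, not confirmed by the README).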

Tech Stack

Backend
Python 3.11 with FastAPI for the API layer. Scrapy handles lightweight scraping; Playwright takes care of heavy JavaScript sites. Celery manages the job queue, and SQLAlchemy provides the database ORM.

Storage
PostgreSQL for structured data (job configs, user accounts), MongoDB for the actual scraped content (since schema varies per site), and Redis for caching and as the Celery broker.

ML Components
Scikit-learn for classification tasks, NLTK for text processing. The extraction models are custom-trained on common website patterns.

Frontend
React 18 with Tailwind CSS. WebSocket connection for live job updates. Recharts for the graphs.

Infrastructure
Docker Compose for local development and small deployments. Nginx as reverse proxy. Added GitHub Actions for CI but honestly still tweaking that part.

Project Structure

ScrapeMaster/
├── api/                    # FastAPI application
│   ├── routes/            # API endpoints
│   ├── models/            # Database models
│   ├── schemas/           # Pydantic schemas
│   └── auth/              # Authentication
├── scrapers/              # Scraping engines
│   ├── engines/           # Scrapy & Playwright
│   ├── evasion/           # Anti-bot systems
│   ├── ethics/            # robots.txt, rate limiting
│   └── adapters/          # Engine abstraction
├── ml/                    # Machine learning
│   ├── extractors/        # Smart selectors
│   ├── models/            # Trained models
│   └── training/          # Training scripts
├── pipeline/              # Data processing
│   ├── validators.py      # Schema validation
│   ├── cleaners.py        # Data cleaning
│   └── deduplicator.py    # Duplicate detection
├── frontend/              # React dashboard
│   └── src/
│       ├── components/    # UI components
│       └── pages/         # Page layouts
├── tests/                 # Test suites
│   ├── unit/
│   ├── integration/
│   └── e2e/
├── docs/                  # Documentation
├── config/                # Configuration files
├── dockerfiles/           # Service Dockerfiles
├── nginx/                 # Nginx configs
└── scripts/               # Utility scripts

Documentation

Check the docs/ folder for detailed guides.

Performance Notes

In testing, the platform handles around 50,000 pages per day on a modest server setup. I've had over 1,000 concurrent jobs running without issues (though your mileage will vary based on hardware).

Proxy rotation cuts down ban rates significantly - roughly 80% fewer blocks compared to direct requests. The ML extraction hits about 95% accuracy on sites with standard HTML structure, drops lower on heavily customized layouts.

API response time stays under 100ms for most operations.

What I've Used It For

Mostly price monitoring and product data collection for e-commerce sites. Also built a real estate aggregator that pulls listings from multiple sources, and a job board scraper for market research.

It works well for news article extraction too, though you need to respect rate limits pretty strictly with news sites.

Legal & Ethical Stuff

Look, web scraping exists in a gray area legally. This tool respects robots.txt by default and includes rate limiting to avoid hammering servers. It identifies itself properly in the user agent.

You're responsible for how you use this. Make sure you're complying with:

  • Website terms of service
  • Local laws about data collection
  • Privacy regulations (GDPR, CCPA, etc.)
  • Copyright law

When in doubt, ask for permission or consult a lawyer. Don't be the person who ruins it for everyone by scraping irresponsibly.

Contributing

Found a bug? Have an idea for improvement? Pull requests are welcome. Check CONTRIBUTING.md for guidelines.

For local development without Docker:

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
pytest tests/

I try to keep code formatted with Black and imports sorted with isort.

License

MIT License - do whatever you want with it. See LICENSE for the legal text.

What's Next

Things I'm considering adding:

  • GraphQL API (FastAPI makes this pretty easy)
  • Kubernetes configs for larger deployments
  • Better ML models for extraction
  • Maybe a browser extension for visual selector building
  • Cloud deployment templates

No promises on timeline though.

Credits

Built on the shoulders of giants:

  • Scrapy - the workhorse of web scraping
  • Playwright - browser automation done right
  • Various open source libraries that make Python awesome

If you find this useful, star the repo. If you find bugs (you will), open an issue.
