ScrapeMaster

A distributed web scraping platform that intelligently handles both static and dynamic content with built-in anti-detection capabilities.

License: MIT Python 3.11+ Docker

What is this?

I built ScrapeMaster to solve a common problem: most scrapers work well with static sites or with dynamic sites, but rarely both. This platform automatically detects what kind of site you're dealing with and routes the job to the appropriate engine: Scrapy for fast static scraping, or Playwright when you need full browser capabilities.

It's been running in production for several months now, handling everything from simple data collection to complex multi-step workflows that require JavaScript execution.

Main Features

Smart Engine Selection
The system automatically picks between Scrapy and Playwright based on the target site. No need to manually configure which engine to use.
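The README doesn't spell out the detection heuristic. As a minimal sketch (the function name, threshold, and signal are assumptions), one cheap signal is how much visible text a page's raw HTML carries relative to its script weight: client-rendered apps tend to ship little static text.

```python
from html.parser import HTMLParser

class _TextCounter(HTMLParser):
    """Counts visible text vs. <script> content in raw HTML."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text_chars = 0
        self.script_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_chars += len(data)
        else:
            self.text_chars += len(data.strip())

def pick_engine(html: str, threshold: float = 0.5) -> str:
    """Return 'scrapy' for text-heavy pages, 'playwright' for script-heavy ones."""
    counter = _TextCounter()
    counter.feed(html)
    total = counter.text_chars + counter.script_chars
    if total == 0 or counter.script_chars / total > threshold:
        return "playwright"  # little static text: likely client-rendered
    return "scrapy"
```

A production version would also look at signals like empty root containers or known SPA framework markers, but the ratio test alone catches a lot of cases.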

Anti-Detection Tools
Proxy rotation, randomized browser fingerprints, and CAPTCHA solving integration help avoid blocks. The proxy manager cycles through your proxy list and handles failures gracefully.
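The actual proxy manager lives under scrapers/evasion/; as an illustrative sketch of the "cycles through your list and handles failures" behavior (class and method names are my own, not the project's), round-robin rotation with failure-aware eviction could look like:

```python
import itertools
from collections import Counter

class ProxyRotator:
    """Round-robin proxy cycling; a proxy is skipped after max_failures."""
    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)
        self.max_failures = max_failures
        self.failures = Counter()
        self._cycle = itertools.cycle(self.proxies)

    def next_proxy(self):
        # Try at most one full pass over the list before giving up.
        for _ in range(len(self.proxies)):
            proxy = next(self._cycle)
            if self.failures[proxy] < self.max_failures:
                return proxy
        raise RuntimeError("all proxies exhausted")

    def report_failure(self, proxy):
        self.failures[proxy] += 1
```

Callers would wire `report_failure` into their request error handling so banned or dead proxies rotate out automatically.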

Machine Learning Extraction
Instead of writing complex selectors for every site, the ML module can often figure out what data you want by analyzing the DOM structure. It's not perfect, but it works surprisingly well on common layouts.
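The extraction models themselves aren't documented here, but the general idea of classifying DOM nodes from structural features can be sketched with scikit-learn. Everything below is invented for illustration: the features (text length, link density, depth) and the toy training data are assumptions, not the project's actual model.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy features per DOM node: [text_length, link_density, depth]
# Labels: 1 = content node, 0 = boilerplate (nav, footer, ads)
X = [
    [1200, 0.05, 3],  # long text, few links: article body
    [900, 0.10, 4],
    [40, 0.90, 2],    # short, link-heavy: navigation
    [15, 0.85, 5],
    [60, 0.95, 2],
    [1500, 0.02, 3],
]
y = [1, 1, 0, 0, 0, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# A long, low-link-density node should classify as content.
print(clf.predict([[1000, 0.08, 3]])[0])
```

The real module presumably uses richer features and real training data, but this is the shape of the approach: score each node, then extract from the ones classified as content.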

REST API & Dashboard
Full API for programmatic access, plus a React frontend for monitoring jobs and viewing results in real-time.

Docker-Based Deployment
Everything runs in containers. Scaling is just a matter of spinning up more worker instances.

Plays Nice with Robots
Respects robots.txt, includes configurable rate limiting, and properly identifies itself. Web scraping can be done responsibly.
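The robots.txt check itself needs no third-party code; Python's standard library covers it. A minimal sketch (the wrapper function is my own, assuming the robots.txt text has already been fetched):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Caching the parsed rules per host avoids re-fetching robots.txt on every request.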

Architecture

graph TB
    Client[Client/Dashboard]
    API[REST API]
    Scheduler[Job Scheduler]
    
    subgraph Workers
        ScrapyWorker[Scrapy Workers]
        PlaywrightWorker[Playwright Workers]
    end
    
    subgraph Storage
        PostgreSQL[(PostgreSQL)]
        MongoDB[(MongoDB)]
        Redis[(Redis)]
    end
    
    Client --> API
    API --> Scheduler
    Scheduler --> Redis
    Scheduler --> ScrapyWorker
    Scheduler --> PlaywrightWorker
    ScrapyWorker --> MongoDB
    PlaywrightWorker --> MongoDB
    API --> PostgreSQL

Getting Started

You'll need Docker installed (Desktop on Windows/Mac, Engine on Linux). The platform needs about 4GB of RAM minimum, though 8GB is better if you're running multiple concurrent jobs.

Setup

# Clone and enter directory
git clone https://github.com/syeedalireza/ScrapeMaster.git
cd ScrapeMaster

# Copy environment template and add your passwords
cp .env.example .env
# Edit .env - don't skip this, change the default passwords!

# Fire it up
docker-compose up -d

That's it. The dashboard will be available at http://localhost once all containers are healthy (takes about 30 seconds).

Running Your First Job

import requests

response = requests.post('http://localhost/api/v1/jobs', json={
    "url": "https://example.com",
    "engine": "auto",  # Let it decide
    "selectors": {
        "title": "h1",
        "content": "article"
    }
})

print(f"Job ID: {response.json()['job_id']}")

Check the dashboard or poll the API to see when it's done. Results get stored in MongoDB.
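The README doesn't show the status endpoint or its response fields, so the status values and the `requests`-based usage below are assumptions. A reusable polling helper that takes the fetch step as a callable might look like:

```python
import time

def wait_for_job(fetch_status, job_id, timeout=300, interval=2.0):
    """Poll fetch_status(job_id) until the job finishes or the timeout expires.

    fetch_status should return a dict like {"status": "...", ...};
    the terminal values ('completed', 'failed') are assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status(job_id)
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {timeout}s")
```

With requests, `fetch_status` could be `lambda jid: requests.get(f"http://localhost/api/v1/jobs/{jid}").json()` (that endpoint path is assumed, not confirmed by the README).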

Tech Stack

Backend
Python 3.11 with FastAPI for the API layer. Scrapy handles lightweight scraping; Playwright takes care of heavy JavaScript sites. Celery manages the job queue, and SQLAlchemy provides the database ORM.

Storage
PostgreSQL for structured data (job configs, user accounts), MongoDB for the actual scraped content (since schema varies per site), and Redis for caching and as the Celery broker.

ML Components
Scikit-learn for classification tasks, NLTK for text processing. The extraction models are custom-trained on common website patterns.

Frontend
React 18 with Tailwind CSS. WebSocket connection for live job updates. Recharts for the graphs.

Infrastructure
Docker Compose for local development and small deployments. Nginx as reverse proxy. Added GitHub Actions for CI but honestly still tweaking that part.

Project Structure

ScrapeMaster/
├── api/                    # FastAPI application
│   ├── routes/            # API endpoints
│   ├── models/            # Database models
│   ├── schemas/           # Pydantic schemas
│   └── auth/              # Authentication
├── scrapers/              # Scraping engines
│   ├── engines/           # Scrapy & Playwright
│   ├── evasion/           # Anti-bot systems
│   ├── ethics/            # robots.txt, rate limiting
│   └── adapters/          # Engine abstraction
├── ml/                    # Machine learning
│   ├── extractors/        # Smart selectors
│   ├── models/            # Trained models
│   └── training/          # Training scripts
├── pipeline/              # Data processing
│   ├── validators.py      # Schema validation
│   ├── cleaners.py        # Data cleaning
│   └── deduplicator.py    # Duplicate detection
├── frontend/              # React dashboard
│   └── src/
│       ├── components/    # UI components
│       └── pages/         # Page layouts
├── tests/                 # Test suites
│   ├── unit/
│   ├── integration/
│   └── e2e/
├── docs/                  # Documentation
├── config/                # Configuration files
├── dockerfiles/           # Service Dockerfiles
├── nginx/                 # Nginx configs
└── scripts/               # Utility scripts

Documentation

Check the docs/ folder for detailed guides.

Performance Notes

In testing, the platform handles around 50,000 pages per day on a modest server setup. I've had over 1,000 concurrent jobs running without issues (though your mileage will vary based on hardware).

Proxy rotation cuts down ban rates significantly - roughly 80% fewer blocks compared to direct requests. The ML extraction hits about 95% accuracy on sites with standard HTML structure, drops lower on heavily customized layouts.

API response time stays under 100ms for most operations.

What I've Used It For

Mostly price monitoring and product data collection for e-commerce sites. Also built a real estate aggregator that pulls listings from multiple sources, and a job board scraper for market research.

It works well for news article extraction too, though you need to respect rate limits pretty strictly with news sites.

Legal & Ethical Stuff

Look, web scraping exists in a gray area legally. This tool respects robots.txt by default and includes rate limiting to avoid hammering servers. It identifies itself properly in the user agent.

You're responsible for how you use this. Make sure you're complying with:

  • Website terms of service
  • Local laws about data collection
  • Privacy regulations (GDPR, CCPA, etc.)
  • Copyright law

When in doubt, ask for permission or consult a lawyer. Don't be the person who ruins it for everyone by scraping irresponsibly.

Contributing

Found a bug? Have an idea for improvement? Pull requests are welcome. Check CONTRIBUTING.md for guidelines.

For local development without Docker:

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
pytest tests/

I try to keep code formatted with Black and imports sorted with isort.

License

MIT License - do whatever you want with it. See LICENSE for the legal text.

What's Next

Things I'm considering adding:

  • GraphQL API (FastAPI makes this pretty easy)
  • Kubernetes configs for larger deployments
  • Better ML models for extraction
  • Maybe a browser extension for visual selector building
  • Cloud deployment templates

No promises on timeline though.

Credits

Built on the shoulders of giants:

  • Scrapy - the workhorse of web scraping
  • Playwright - browser automation done right
  • Various open source libraries that make Python awesome

If you find this useful, star the repo. If you find bugs (you will), open an issue.
