A distributed web scraping platform that intelligently handles both static and dynamic content with built-in anti-detection capabilities.
I built ScrapeMaster to solve a common problem: most scrapers work great with either static sites or dynamic sites, but rarely both. This platform automatically detects which kind of site you're dealing with and routes the job to the appropriate engine: Scrapy for fast static scraping, or Playwright when you need full browser capabilities.
It's been running in production for several months now, handling everything from simple data collection to complex multi-step workflows that require JavaScript execution.
Smart Engine Selection
The system automatically picks between Scrapy and Playwright based on the target site. No need to manually configure which engine to use.
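The selection heuristic could look roughly like this. A minimal sketch only: the SPA markers, script-count threshold, and function name are illustrative assumptions, not the actual ScrapeMaster implementation.

```python
import re

# Hypothetical markers for single-page apps; the real detector's rules differ.
SPA_MARKERS = re.compile(
    r'id="(root|app|__next)"|ng-app|data-reactroot', re.IGNORECASE
)

def pick_engine(html: str) -> str:
    """Return 'playwright' for script-heavy/SPA pages, else 'scrapy'."""
    script_count = html.lower().count("<script")
    # Strip scripts and tags to estimate how much text is served statically.
    text = re.sub(r"<script.*?</script>", "", html, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", text)
    visible_words = len(text.split())
    if SPA_MARKERS.search(html) or (script_count > 10 and visible_words < 50):
        return "playwright"
    return "scrapy"
```

A page that ships an empty `<div id="root">` plus a JS bundle would be routed to Playwright; a server-rendered article with plenty of visible text stays on Scrapy.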
Anti-Detection Tools
Proxy rotation, randomized browser fingerprints, and CAPTCHA solving integration help avoid blocks. The proxy manager cycles through your proxy list and handles failures gracefully.
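The rotation-with-failure-handling idea can be sketched as a small round-robin class. This is an illustrative stand-in, not the real proxy manager's interface.

```python
import itertools
import threading

class ProxyRotator:
    """Round-robin proxy cycling with failure tracking (illustrative sketch)."""

    def __init__(self, proxies, max_failures=3):
        self._proxies = list(proxies)
        self._failures = {p: 0 for p in self._proxies}
        self._max_failures = max_failures
        self._cycle = itertools.cycle(self._proxies)
        self._lock = threading.Lock()

    def get(self):
        """Return the next proxy that hasn't exceeded its failure budget."""
        with self._lock:
            for _ in range(len(self._proxies)):
                proxy = next(self._cycle)
                if self._failures[proxy] < self._max_failures:
                    return proxy
            raise RuntimeError("all proxies exhausted")

    def mark_failed(self, proxy):
        """Record a failed request so a dead proxy eventually gets skipped."""
        with self._lock:
            self._failures[proxy] += 1
```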
Machine Learning Extraction
Instead of writing complex selectors for every site, the ML module can often figure out what data you want by analyzing the DOM structure. It's not perfect, but it works surprisingly well on common layouts.
REST API & Dashboard
Full API for programmatic access, plus a React frontend for monitoring jobs and viewing results in real-time.
Docker-Based Deployment
Everything runs in containers. Scaling is just a matter of spinning up more worker instances.
Plays Nice with Robots
Respects robots.txt, includes configurable rate limiting, and properly identifies itself. Web scraping can be done responsibly.
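The two politeness checks can be sketched with the standard library. The user-agent string, delay default, and class names here are assumptions for illustration, not the project's actual `scrapers/ethics/` code.

```python
import time
import urllib.robotparser

USER_AGENT = "ScrapeMasterBot"  # hypothetical; the real UA string may differ

def allowed(robots_txt: str, url: str) -> bool:
    """Check a fetched robots.txt body against a target URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(USER_AGENT, url)

class RateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds: float = 1.0):
        self.delay = delay_seconds
        self._last: dict[str, float] = {}

    def wait(self, host: str) -> None:
        now = time.monotonic()
        elapsed = now - self._last.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last[host] = time.monotonic()
```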
```mermaid
graph TB
    Client[Client/Dashboard]
    API[REST API]
    Scheduler[Job Scheduler]

    subgraph Workers
        ScrapyWorker[Scrapy Workers]
        PlaywrightWorker[Playwright Workers]
    end

    subgraph Storage
        PostgreSQL[(PostgreSQL)]
        MongoDB[(MongoDB)]
        Redis[(Redis)]
    end

    Client --> API
    API --> Scheduler
    Scheduler --> Redis
    Scheduler --> ScrapyWorker
    Scheduler --> PlaywrightWorker
    ScrapyWorker --> MongoDB
    PlaywrightWorker --> MongoDB
    API --> PostgreSQL
```
You'll need Docker installed (Desktop on Windows/Mac, Engine on Linux). The platform needs about 4GB of RAM minimum, though 8GB is better if you're running multiple concurrent jobs.
Setup
```bash
# Clone and enter directory
git clone https://github.com/syeedalireza/ScrapeMaster.git
cd ScrapeMaster

# Copy environment template and add your passwords
cp .env.example .env
# Edit .env - don't skip this, change the default passwords!

# Fire it up
docker-compose up -d
```

That's it. The dashboard will be available at http://localhost once all containers are healthy (takes about 30 seconds).
Running Your First Job
```python
import requests

response = requests.post('http://localhost/api/v1/jobs', json={
    "url": "https://example.com",
    "engine": "auto",  # Let it decide
    "selectors": {
        "title": "h1",
        "content": "article"
    }
})
print(f"Job ID: {response.json()['job_id']}")
```

Check the dashboard or poll the API to see when it's done. Results get stored in MongoDB.
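Polling can be a simple loop. A sketch only: the status-fetching callable is injected so the polling logic stands alone, and the endpoint path and status strings in the docstring are assumptions, not the documented API.

```python
import time

def poll_job(job_id, fetch_status, interval=2.0, timeout=60.0):
    """Poll until a job reaches a terminal state.

    `fetch_status` is a callable returning the job's status string, e.g.
    lambda jid: requests.get(f"http://localhost/api/v1/jobs/{jid}").json()["status"]
    (endpoint and status values assumed for illustration).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {timeout}s")
```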
Backend
Python 3.11 with FastAPI for the API layer. Scrapy handles lightweight scraping, while Playwright takes care of the heavy JavaScript sites. Celery manages the job queue, and SQLAlchemy provides the database ORM.
Storage
PostgreSQL for structured data (job configs, user accounts), MongoDB for the actual scraped content (since schema varies per site), and Redis for caching and as the Celery broker.
ML Components
Scikit-learn for classification tasks, NLTK for text processing. The extraction models are custom-trained on common website patterns.
Frontend
React 18 with Tailwind CSS. WebSocket connection for live job updates. Recharts for the graphs.
Infrastructure
Docker Compose for local development and small deployments. Nginx as reverse proxy. Added GitHub Actions for CI but honestly still tweaking that part.
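A minimal reverse-proxy block might look like this. The upstream service names and ports are assumptions for illustration, not the project's actual nginx/ config.

```nginx
# Illustrative config; service names and ports are assumed.
server {
    listen 80;

    location /api/ {
        proxy_pass http://api:8000;          # FastAPI container
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /ws {
        proxy_pass http://api:8000;          # WebSocket upgrades for live job updates
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    location / {
        proxy_pass http://frontend:3000;     # React dashboard
    }
}
```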
```
ScrapeMaster/
├── api/ # FastAPI application
│ ├── routes/ # API endpoints
│ ├── models/ # Database models
│ ├── schemas/ # Pydantic schemas
│ └── auth/ # Authentication
├── scrapers/ # Scraping engines
│ ├── engines/ # Scrapy & Playwright
│ ├── evasion/ # Anti-bot systems
│ ├── ethics/ # robots.txt, rate limiting
│ └── adapters/ # Engine abstraction
├── ml/ # Machine learning
│ ├── extractors/ # Smart selectors
│ ├── models/ # Trained models
│ └── training/ # Training scripts
├── pipeline/ # Data processing
│ ├── validators.py # Schema validation
│ ├── cleaners.py # Data cleaning
│ └── deduplicator.py # Duplicate detection
├── frontend/ # React dashboard
│ └── src/
│ ├── components/ # UI components
│ └── pages/ # Page layouts
├── tests/ # Test suites
│ ├── unit/
│ ├── integration/
│ └── e2e/
├── docs/ # Documentation
├── config/ # Configuration files
├── dockerfiles/ # Service Dockerfiles
├── nginx/ # Nginx configs
└── scripts/ # Utility scripts
```
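The `pipeline/deduplicator.py` module in the layout above suggests content-based duplicate detection; a minimal content-hash sketch of that idea follows (the real implementation, and any fuzzy matching it does, is not shown here).

```python
import hashlib
import json

class Deduplicator:
    """Content-hash duplicate detection (illustrative sketch)."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, record: dict) -> bool:
        # Canonical JSON so key order doesn't affect the hash.
        digest = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```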
Check the docs/ folder for detailed guides:
- Setup Guide - More detailed installation steps and troubleshooting
- API Reference - All available endpoints and examples
- Contributing - If you want to help improve this
- Ethics Guide - How to scrape responsibly and stay legal
In testing, the platform handles around 50,000 pages per day on a modest server setup. I've had over 1,000 concurrent jobs running without issues (though your mileage will vary based on hardware).
Proxy rotation cuts down ban rates significantly - roughly 80% fewer blocks compared to direct requests. The ML extraction hits about 95% accuracy on sites with standard HTML structure, drops lower on heavily customized layouts.
API response time stays under 100ms for most operations.
Mostly price monitoring and product data collection for e-commerce sites. Also built a real estate aggregator that pulls listings from multiple sources, and a job board scraper for market research.
It works well for news article extraction too, though you need to respect rate limits pretty strictly with news sites.
Look, web scraping exists in a legal gray area. This tool respects robots.txt by default and includes rate limiting to avoid hammering servers. It identifies itself properly in the user agent.
You're responsible for how you use this. Make sure you're complying with:
- Website terms of service
- Local laws about data collection
- Privacy regulations (GDPR, CCPA, etc.)
- Copyright law
When in doubt, ask for permission or consult a lawyer. Don't be the person who ruins it for everyone by scraping irresponsibly.
Found a bug? Have an idea for improvement? Pull requests are welcome. Check CONTRIBUTING.md for guidelines.
For local development without Docker:
```bash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
pytest tests/
```

I try to keep code formatted with Black and imports sorted with isort.
MIT License - do whatever you want with it. See LICENSE for the legal text.
Things I'm considering adding:
- GraphQL API (FastAPI makes this pretty easy)
- Kubernetes configs for larger deployments
- Better ML models for extraction
- Maybe a browser extension for visual selector building
- Cloud deployment templates
No promises on timeline though.
Built on the shoulders of giants:
- Scrapy - the workhorse of web scraping
- Playwright - browser automation done right
- Various open source libraries that make Python awesome
If you find this useful, star the repo. If you find bugs (you will), open an issue.