A high-performance parallel web scraper that converts websites into organized Markdown files. Built with Python on top of crawl4ai, requests, psutil, rich, and asyncio, it processes websites efficiently by leveraging their sitemaps and supports concurrent scraping with built-in memory monitoring.
- 🚀 Parallel scraping with configurable concurrency
- 📑 Automatic sitemap detection and processing
- 📁 Organized output with clean directory structure
- 💾 Memory-efficient with built-in monitoring
- 🌐 Browser-based scraping using crawl4ai
- 📊 Progress tracking and detailed logging
- 🔍 Preview mode with dry-run option
- Python 3.7+
- crawl4ai
- rich
- psutil
- requests
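
If you need to recreate the dependency file referenced by the installation instructions below, a minimal requirements.txt along these lines should work (package names taken from the list above; version pins are intentionally omitted and left to the repository's own file):

```
# Example requirements.txt — names only, versions unpinned (assumption)
crawl4ai
rich
psutil
requests
```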
- Clone the repository:

```bash
git clone https://github.com/rkabrick/scrape.git
cd scrape
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

Basic usage:

```bash
python scrape https://example.com
```

```
scrape [-h] [--max-concurrent MAX_CONCURRENT] [-v] [--dry-run] url
```

Arguments:
- url: The target URL to scrape (must include http:// or https://)
- --max-concurrent: Maximum number of concurrent scrapers (default: 3)
- -v: Increase verbosity level
  - -v: Show file names
  - -vv: Show browser output
  - -vvv: Show memory monitoring
- --dry-run: Preview the file structure without performing the scrape
- Basic scraping:

```bash
scrape https://example.com
```

- Scraping with increased concurrency:

```bash
scrape --max-concurrent 5 https://example.com
```

- Preview mode with file structure:

```bash
scrape --dry-run https://example.com
```

- Verbose output with memory monitoring:

```bash
scrape -vvv https://example.com
```

The scraper creates an organized directory structure based on the website's URL paths. For example:
```
example.com/
├── index.md
├── about/
│   └── index.md
├── blog/
│   ├── post1.md
│   └── post2.md
└── products/
    ├── category1/
    │   └── item1.md
    └── category2/
        └── item2.md
```
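
The exact mapping logic lives inside the scraper, but as a rough illustration of how a page URL could translate into the tree above, a sketch might look like this (the url_to_output_path name and the index.md convention for directory-style URLs are assumptions, not the tool's actual code):

```python
from pathlib import Path
from urllib.parse import urlparse

def url_to_output_path(url: str, out_dir: str = ".") -> Path:
    """Map a page URL to a Markdown output path (illustrative sketch only)."""
    parsed = urlparse(url)
    base = Path(out_dir) / parsed.netloc          # root directory, e.g. example.com/
    segments = [s for s in parsed.path.split("/") if s]
    if not segments or parsed.path.endswith("/"):
        # The site root and directory-style URLs become .../index.md
        return base.joinpath(*segments) / "index.md"
    # Leaf pages become <name>.md, preserving intermediate directories
    *dirs, leaf = segments
    return base.joinpath(*dirs) / f"{leaf}.md"

# Assumed behaviour:
#   https://example.com/            -> example.com/index.md
#   https://example.com/about/      -> example.com/about/index.md
#   https://example.com/blog/post1  -> example.com/blog/post1.md
```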
- Automatically detects and processes XML sitemaps (a minimal discovery sketch follows this list)
- Falls back to single URL processing if no sitemap is found
- Supports both simple and nested sitemap structures
- Built-in memory monitoring for resource-intensive operations
- Configurable concurrent scraping to balance performance and resource usage
- Automatic cleanup of browser instances
- Intelligent path handling and file naming
- Duplicate file name resolution
- Clean, SEO-friendly file structure
- Markdown output for compatibility
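
To make the sitemap handling above concrete, here is a minimal sketch of how discovery and fallback could work with requests and xml.etree; the /sitemap.xml convention, function names, and timeout are assumptions for illustration, not the scraper's exact logic:

```python
import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(sitemap_url: str) -> list[str]:
    """Return page URLs from a sitemap, recursing into nested sitemap indexes."""
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    if root.tag.endswith("sitemapindex"):
        # Nested sitemaps: each <sitemap><loc> points at another sitemap file
        urls: list[str] = []
        for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS):
            urls.extend(parse_sitemap(loc.text))
        return urls
    return [loc.text for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]

def collect_urls(site_url: str) -> list[str]:
    """Try the conventional /sitemap.xml; fall back to the single URL."""
    try:
        return parse_sitemap(site_url.rstrip("/") + "/sitemap.xml") or [site_url]
    except (requests.RequestException, ET.ParseError):
        return [site_url]
```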
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with crawl4ai for reliable web scraping
- Uses rich for beautiful terminal output
- Memory monitoring powered by psutil
For issues, questions, or contributions, please open an issue in the GitHub repository.