A modular web scraping system that downloads PDFs and articles from multiple Australian energy sector websites and integrates with RAGFlow for RAG ingestion.
- PDF Scrapers: AEMO, AEMC, AER, ENA, ECA (Australian energy sector documents)
- Article Scrapers: RenewEconomy, TheEnergy, Guardian Australia, The Conversation
- Modular scraper architecture - easily add new scrapers
- HTMX-based web interface for configuration and monitoring
- RAGFlow integration with metadata support for document ingestion
- CLI for n8n integration
- FlareSolverr support for Cloudflare bypass
- Docker-ready for deployment on Unraid
For detailed deployment instructions, see DEPLOYMENT_GUIDE.md.
Quick setup:
# 1. Clone and configure
git clone <repository-url>
cd scraper
cp .env.example .env
nano .env # Configure SECRET_KEY, RAGFlow, etc.
# 2. Build and start
docker compose build
docker compose up -d
# 3. Access web UI
open http://localhost:5000
Docker Compose starts the scraper, FlareSolverr, and Gotenberg:
docker compose up -d
See DEPLOYMENT_GUIDE.md for:
- Environment configuration
- Service connectivity tests
- Troubleshooting guide
- Production best practices
For day-to-day operations, see RUNBOOK_COMMON_OPERATIONS.md.
Common commands:
# Start/stop services
docker compose up -d
docker compose down
# View logs
docker compose logs -f scraper
# Run scraper
docker compose exec scraper python scripts/run_scraper.py --scraper aemo
# Backup data
tar -czf backup.tar.gz data/state/ data/metadata/ config/
The Make targets below default to docker-compose.dev.yml and run everything inside the dev container.
# Build and run the dev stack
make dev-build
make dev-up
# Logs and shell
make logs
make shell
# Tests
make test # all tests
make test-unit # unit tests only
make test-int # integration tests only
make test-file FILE=tests/unit/test_metadata_validation.py::TestClass::test_case
# Optional: override compose file (defaults to docker-compose.dev.yml)
make dev-up COMPOSE=docker-compose.yml
Notes:
- Dev web UI: http://localhost:5001 (mapped from container 5000).
- VS Code tasks mirror these targets (Terminal → Run Task).
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Copy and configure environment
cp .env.example .env
# Edit .env with your settings
# Option A: Use Docker Compose for FlareSolverr + Gotenberg
make dev-up # starts all services, web UI at http://localhost:5001
# Option B: Run just the web interface (no rendered-page scraping)
python app/main.py
# Or run a scraper directly
python scripts/run_scraper.py --scraper aemo
docker-compose up --build
Access the web UI at http://localhost:5000
# List available scrapers
python scripts/run_scraper.py --list-scrapers
# Run a scraper
python scripts/run_scraper.py --scraper aemo
# Run with options
python scripts/run_scraper.py --scraper aemo --max-pages 5 --output-format json
# Upload to RAGFlow after scraping
python scripts/run_scraper.py --scraper aemo --upload-to-ragflow --dataset-id abc123
# Validate state files (read-only)
python scripts/run_scraper.py state validate
# Repair state files and write sanitized copies
python scripts/run_scraper.py state repair --write
# Validate settings.json and scraper configs
python scripts/run_scraper.py config validate
# Migrate settings/scraper configs to defaults/schema and write back
python scripts/run_scraper.py config migrate --write
Tip: when running locally outside Docker, override dirs to avoid /app defaults, e.g. DOWNLOAD_DIR=./data/scraped STATE_DIR=./data/state.
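The tip above can be illustrated with a small sketch of how directory settings might be resolved from the environment. Only the `DOWNLOAD_DIR` and `STATE_DIR` variable names come from this README; the `resolve_data_dirs` helper and the `/app/data/...` fallback paths are assumptions made for illustration, and the project's real defaults may differ.

```python
import os
from pathlib import Path

# Hypothetical container defaults; the project's real defaults may differ.
ASSUMED_DEFAULTS = {
    "DOWNLOAD_DIR": "/app/data/scraped",
    "STATE_DIR": "/app/data/state",
}


def resolve_data_dirs(env=None):
    """Return directory paths, preferring environment overrides over defaults."""
    env = os.environ if env is None else env
    return {
        name: Path(env.get(name) or default)
        for name, default in ASSUMED_DEFAULTS.items()
    }


if __name__ == "__main__":
    # Simulate the local overrides suggested in the tip.
    local = resolve_data_dirs(
        {"DOWNLOAD_DIR": "./data/scraped", "STATE_DIR": "./data/state"}
    )
    for name, path in local.items():
        print(f"{name} -> {path}")
```

Passing the environment as a plain mapping keeps the helper easy to exercise without touching the real process environment.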
scraper/
├── app/
│   ├── backends/       # Swappable parser, archive, RAG, vectorstore backends
│   ├── scrapers/       # Scraper modules
│   ├── services/       # External integrations (RAGFlow, FlareSolverr, Paperless)
│   ├── orchestrator/   # Scheduling and pipelines
│   ├── web/            # Flask web interface (blueprints-based)
│   └── utils/          # Shared utilities
├── config/             # Configuration files
│   ├── settings.json   # Runtime settings
│   └── scrapers/       # Per-scraper configurations
├── data/               # Runtime data
│   ├── scraped/        # Downloaded documents
│   ├── metadata/       # Document metadata
│   ├── state/          # Scraper state files
│   └── logs/           # Application logs
├── docs/               # Documentation
│   ├── DEPLOYMENT_GUIDE.md             # Production deployment
│   ├── RUNBOOK_COMMON_OPERATIONS.md    # Day-to-day operations
│   ├── MIGRATION_AND_STATE_REPAIR.md   # State management
│   ├── DEVELOPER_GUIDE.md              # Development guide (see below)
│   └── ...
├── scripts/            # CLI tools and utilities
└── docker-compose.yml  # Production compose file
For detailed instructions, see DEVELOPER_GUIDE.md.
Quick start:
- Create a new file in app/scrapers/ (e.g., my_scraper.py)
- Inherit from BaseScraper and implement required methods
- The scraper will be auto-discovered and available via CLI and web UI
from app.scrapers.base_scraper import BaseScraper


class MyScraper(BaseScraper):
    NAME = "my-scraper"
    DESCRIPTION = "Scrapes documents from example.com"

    def scrape(self):
        # Implementation here
        pass

    def get_metadata(self, filepath):
        # Extract document metadata
        return {
            "title": "Document title",
            "source": "my-scraper",
            "url": "https://example.com/doc.pdf",
        }

See DEVELOPER_GUIDE.md for:
- Development setup
- Scraper best practices
- Testing and debugging
- Architecture overview
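One common way to get the auto-discovery mentioned in the quick start is to walk the subclasses of the base class and index them by `NAME`. The sketch below is an assumption about how such a registry could work, not the project's actual mechanism; the stand-in `BaseScraper` and the `discover_scrapers` helper are hypothetical and exist only to keep the example self-contained.

```python
class BaseScraper:
    """Stand-in for app.scrapers.base_scraper.BaseScraper (hypothetical)."""

    NAME = None
    DESCRIPTION = ""

    def scrape(self):
        raise NotImplementedError


class MyScraper(BaseScraper):
    NAME = "my-scraper"
    DESCRIPTION = "Scrapes documents from example.com"

    def scrape(self):
        return []


def discover_scrapers(base=BaseScraper):
    """Map NAME -> class for every direct or indirect subclass that sets NAME."""
    registry = {}
    stack = list(base.__subclasses__())
    while stack:
        cls = stack.pop()
        stack.extend(cls.__subclasses__())
        if cls.NAME:
            registry[cls.NAME] = cls
    return registry


if __name__ == "__main__":
    for name, cls in sorted(discover_scrapers().items()):
        print(f"{name}: {cls.DESCRIPTION}")
```

In a real package, the modules under app/scrapers/ would need to be imported first (for example with `pkgutil.iter_modules` and `importlib.import_module`) so that the subclasses exist before the registry is built.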
Complete documentation index: docs/README.md
- Contributing Guide - How to contribute to the project
- Security Policy - Security reporting and best practices
- Deployment Guide - Production deployment, Docker setup, service configuration
- Runbook - Common Operations - Daily operations, troubleshooting, maintenance tasks
- Backend Migration Guide - Switching between parser/archive/RAG backends
- Migration & State Repair - State file management, recovery procedures
- Secrets Rotation - Credential management and rotation procedures
- Troubleshooting: RAGFlow - RAGFlow integration issues and fixes
- Developer Guide - Development setup, scraper architecture, best practices
- Example Scraper Walkthrough - Step-by-step guide to creating a new scraper
- Backend Developer Guide - Creating new parser/archive/RAG backends
- Configuration & Services - Configuration system, service integration patterns
- Error Handling - Exception hierarchy, retry patterns
- Logging & Error Standards - Logging best practices
- Metadata Schema - Document metadata structure and validation
- Changelog - Version history and release notes
- TODO/Roadmap - Planned features and improvements
See .env.example for all configuration options.
- Enable basic auth on the web UI by setting BASIC_AUTH_ENABLED=true and providing BASIC_AUTH_USERNAME / BASIC_AUTH_PASSWORD.
- Leave it disabled for local development (default).
- File logs default to JSON lines with size-based rotation (10 MB, 5 backups). Configure via:
  - LOG_JSON_FORMAT (true/false)
  - LOG_FILE_MAX_BYTES (bytes)
  - LOG_FILE_BACKUP_COUNT (files to keep)
  - LOG_TO_FILE (toggle file output)
  - LOG_LEVEL (INFO, DEBUG, etc.)
- Secrets and endpoints come from .env (environment variables).
- Runtime-tunable behavior (timeouts, defaults, per-scraper overrides) lives in config/settings.json and is validated against an internal JSON schema at load/save time.
- When in doubt: .env wins for secrets/URLs; settings.json wins for UI-tuned behavior.
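The precedence rule above can be sketched as a merge in which secret and endpoint keys are read only from the environment and everything else comes from settings.json. `SECRET_KEY` is mentioned in this README; `RAGFLOW_API_URL`, `request_timeout`, and the `load_config` helper are illustrative assumptions, not the project's actual schema or loader.

```python
import json
import os

# Keys treated as secrets/endpoints. SECRET_KEY appears in this README;
# RAGFLOW_API_URL is a hypothetical name used only for illustration.
ENV_ONLY_KEYS = ("SECRET_KEY", "RAGFLOW_API_URL")


def load_config(settings_text, env=None):
    """Merge settings.json content with environment variables.

    Environment values are the only source for ENV_ONLY_KEYS;
    settings.json supplies everything else.
    """
    env = os.environ if env is None else env
    settings = json.loads(settings_text)
    # Drop any secret that leaked into settings.json so it cannot take effect.
    config = {k: v for k, v in settings.items() if k not in ENV_ONLY_KEYS}
    for key in ENV_ONLY_KEYS:
        if key in env:
            config[key] = env[key]
    return config


if __name__ == "__main__":
    example_settings = '{"request_timeout": 30, "SECRET_KEY": "ignored"}'
    example_env = {"SECRET_KEY": "from-env"}
    print(load_config(example_settings, example_env))
```

Filtering secrets out of the settings file, rather than merely overriding them, means a stray credential in settings.json is never used silently.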
- Always terminate TLS in front of the app (e.g., nginx/Traefik with valid certs) when exposed off-LAN.
- Enable BASIC_AUTH_ENABLED + credentials for the UI whenever it is reachable outside trusted networks.
- Keep secrets in .env (not in settings.json); rotate keys regularly and scope API keys per-environment.
- If running behind a proxy, set forwarded headers correctly (X-Forwarded-Proto/Host) and prefer HSTS at the proxy layer.
- When behind a reverse proxy, set TRUST_PROXY_COUNT (e.g., 1 for a single proxy hop) so Flask respects forwarded host/proto via ProxyFix.
- Quick checklist: TLS terminated at proxy with HSTS; BASIC_AUTH_ENABLED=true with strong creds if exposed; TRUST_PROXY_COUNT set when proxied; secrets only in .env; restrict writable volumes (config/, data/, logs/) to trusted hosts.
- Restrict write volumes (config/, data/, logs/) to least privilege; avoid sharing these into untrusted containers.
- Example Traefik snippet (secure headers + forwarded proto):
labels:
  - traefik.enable=true
  - traefik.http.routers.scraper.rule=Host(`scraper.example.com`)
  - traefik.http.routers.scraper.entrypoints=websecure
  - traefik.http.routers.scraper.tls.certresolver=letsencrypt
  - traefik.http.middlewares.scraper-headers.headers.stsSeconds=31536000
  - traefik.http.middlewares.scraper-headers.headers.forceSTSHeader=true
  - traefik.http.middlewares.scraper-headers.headers.stsIncludeSubdomains=true
  - traefik.http.middlewares.scraper-headers.headers.stsPreload=true
  - traefik.http.middlewares.scraper-headers.headers.referrerPolicy=same-origin
  - traefik.http.routers.scraper.middlewares=scraper-headers
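To complement the basic-auth recommendation above, here is a minimal sketch of how an HTTP Basic `Authorization` header could be validated against `BASIC_AUTH_USERNAME` / `BASIC_AUTH_PASSWORD` using a constant-time comparison. It relies only on the standard library; the `check_basic_auth` helper is hypothetical and is not taken from the project's code.

```python
import base64
import binascii
import hmac


def check_basic_auth(header, username, password):
    """Return True if an 'Authorization: Basic ...' value matches the credentials."""
    scheme, _, encoded = (header or "").partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return False
    supplied_user, sep, supplied_pass = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    user_ok = hmac.compare_digest(supplied_user.encode(), username.encode())
    pass_ok = hmac.compare_digest(supplied_pass.encode(), password.encode())
    return user_ok and pass_ok


if __name__ == "__main__":
    token = base64.b64encode(b"admin:s3cret").decode("ascii")
    print(check_basic_auth(f"Basic {token}", "admin", "s3cret"))
```

Both comparisons are evaluated before combining the results, so a wrong username does not short-circuit the password check.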
pip install -r requirements.txt
pip install -r requirements-dev.txt
pytest tests/unit -v --cov=app
Integration tests are skipped by default; set RUN_INTEGRATION_TESTS=1 to enable them. Integration runs may require network access and FlareSolverr.
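The opt-in behavior described above could be implemented with a skip condition keyed on the environment variable. The sketch below uses the standard library's `unittest.skipUnless` to stay dependency-free; since the project runs pytest, an equivalent `pytest.mark.skipif` marker would be the more likely choice there. The `integration_enabled` helper and the test class are hypothetical.

```python
import os
import unittest


def integration_enabled(env=None):
    """True when RUN_INTEGRATION_TESTS is set to "1"."""
    env = os.environ if env is None else env
    return env.get("RUN_INTEGRATION_TESTS") == "1"


@unittest.skipUnless(integration_enabled(), "set RUN_INTEGRATION_TESTS=1 to run")
class ExampleIntegrationTest(unittest.TestCase):
    def test_placeholder(self):
        # A real test would talk to the network or FlareSolverr here.
        self.assertTrue(True)


if __name__ == "__main__":
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleIntegrationTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
```

Evaluating the flag once at import time keeps the skip decision visible in the test report instead of hiding it inside individual tests.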
Security scan:
pip install -r requirements-dev.txt
pip-audit
License: MIT