Multi-Site PDF Scraper with RAGFlow Integration

A modular web scraping system that downloads PDFs and articles from multiple Australian energy sector websites and integrates with RAGFlow for RAG ingestion.

Features

  • PDF Scrapers: AEMO, AEMC, AER, ENA, ECA (Australian energy sector documents)
  • Article Scrapers: RenewEconomy, TheEnergy, Guardian Australia, The Conversation
  • Modular scraper architecture - easily add new scrapers
  • HTMX-based web interface for configuration and monitoring
  • RAGFlow integration with metadata support for document ingestion
  • CLI interface for n8n integration
  • FlareSolverr support for Cloudflare bypass
  • Docker-ready for deployment on Unraid

Quick Start

Production Deployment

For detailed deployment instructions, see DEPLOYMENT_GUIDE.md.

Quick setup:

# 1. Clone and configure
git clone <repository-url>
cd scraper
cp .env.example .env
nano .env  # Configure SECRET_KEY, RAGFlow, etc.

# 2. Build and start
docker compose build
docker compose up -d

# 3. Access web UI
open http://localhost:5000

Docker Compose starts the scraper, FlareSolverr, and Gotenberg:

docker compose up -d

See DEPLOYMENT_GUIDE.md for:

  • Environment configuration
  • Service connectivity tests (a quick check is sketched below)
  • Troubleshooting guide
  • Production best practices
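
A quick connectivity sketch, assuming the default FlareSolverr and Gotenberg ports (8191 and 3000) are exposed on the host; adjust the URLs if your compose file remaps them:

# FlareSolverr should answer with a JSON status payload
curl -s -X POST http://localhost:8191/v1 -H 'Content-Type: application/json' -d '{"cmd": "sessions.list"}'
# Gotenberg exposes a health endpoint
curl -s http://localhost:3000/health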

Day-to-Day Operations

For day-to-day operations, see RUNBOOK_COMMON_OPERATIONS.md.

Common commands:

# Start/stop services
docker compose up -d
docker compose down

# View logs
docker compose logs -f scraper

# Run scraper
docker compose exec scraper python scripts/run_scraper.py --scraper aemo

# Backup data
tar -czf backup.tar.gz data/state/ data/metadata/ config/
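
To restore a backup, stop the stack and extract the archive back into the project root (a sketch that simply reverses the tar command above):

docker compose down
tar -xzf backup.tar.gz
docker compose up -d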

Dev Workflow (Make + dev compose)

These targets default to docker-compose.dev.yml and run everything inside the dev container.

# Build and run the dev stack
make dev-build
make dev-up

# Logs and shell
make logs
make shell

# Tests
make test          # all tests
make test-unit     # unit tests only
make test-int      # integration tests only
make test-file FILE=tests/unit/test_metadata_validation.py::TestClass::test_case

# Optional: override compose file (defaults to docker-compose.dev.yml)
make dev-up COMPOSE=docker-compose.yml

Notes:

  • Dev web UI: http://localhost:5001 (mapped from container 5000).
  • VS Code tasks mirror these targets (Terminal → Run Task).

Local Development

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Copy and configure environment
cp .env.example .env
# Edit .env with your settings

# Option A: Use Docker Compose for FlareSolverr + Gotenberg
make dev-up   # starts all services, web UI at http://localhost:5001

# Option B: Run just the web interface (no rendered-page scraping)
python app/main.py

# Or run a scraper directly
python scripts/run_scraper.py --scraper aemo

Docker Deployment

docker compose up --build

Access the web UI at http://localhost:5000

CLI Usage

# List available scrapers
python scripts/run_scraper.py --list-scrapers

# Run a scraper
python scripts/run_scraper.py --scraper aemo

# Run with options
python scripts/run_scraper.py --scraper aemo --max-pages 5 --output-format json

# Upload to RAGFlow after scraping
python scripts/run_scraper.py --scraper aemo --upload-to-ragflow --dataset-id abc123

Validation and maintenance

# Validate state files (read-only)
python scripts/run_scraper.py state validate

# Repair state files and write sanitized copies
python scripts/run_scraper.py state repair --write

# Validate settings.json and scraper configs
python scripts/run_scraper.py config validate

# Migrate settings/scraper configs to defaults/schema and write back
python scripts/run_scraper.py config migrate --write

Tip: when running locally outside Docker, override dirs to avoid /app defaults, e.g. DOWNLOAD_DIR=./data/scraped STATE_DIR=./data/state.
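
For example, a one-off local run with the directory overrides applied (using the aemo scraper as in the examples above):

DOWNLOAD_DIR=./data/scraped STATE_DIR=./data/state python scripts/run_scraper.py --scraper aemo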

Project Structure

scraper/
├── app/
│   ├── backends/       # Swappable parser, archive, RAG, vectorstore backends
│   ├── scrapers/       # Scraper modules
│   ├── services/       # External integrations (RAGFlow, FlareSolverr, Paperless)
│   ├── orchestrator/   # Scheduling and pipelines
│   ├── web/            # Flask web interface (blueprints-based)
│   └── utils/          # Shared utilities
├── config/             # Configuration files
│   ├── settings.json   # Runtime settings
│   └── scrapers/       # Per-scraper configurations
├── data/               # Runtime data
│   ├── scraped/        # Downloaded documents
│   ├── metadata/       # Document metadata
│   ├── state/          # Scraper state files
│   └── logs/           # Application logs
├── docs/               # Documentation
│   ├── DEPLOYMENT_GUIDE.md              # Production deployment
│   ├── RUNBOOK_COMMON_OPERATIONS.md     # Day-to-day operations
│   ├── MIGRATION_AND_STATE_REPAIR.md    # State management
│   ├── DEVELOPER_GUIDE.md               # Development guide (see below)
│   └── ...
├── scripts/            # CLI tools and utilities
└── docker-compose.yml  # Production compose file

Adding a New Scraper

For detailed instructions, see DEVELOPER_GUIDE.md.

Quick start:

  1. Create a new file in app/scrapers/ (e.g., my_scraper.py)
  2. Inherit from BaseScraper and implement required methods
  3. The scraper will be auto-discovered and available via CLI and web UI

from app.scrapers.base_scraper import BaseScraper

class MyScraper(BaseScraper):
    NAME = "my-scraper"
    DESCRIPTION = "Scrapes documents from example.com"

    def scrape(self):
        # Implementation here
        pass
    
    def get_metadata(self, filepath):
        # Extract document metadata
        return {
            "title": "Document title",
            "source": "my-scraper",
            "url": "https://example.com/doc.pdf"
        }
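
Once the module is in place, the scraper can be run by its NAME via the CLI (assuming the skeleton above):

python scripts/run_scraper.py --scraper my-scraper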

See DEVELOPER_GUIDE.md for:

  • Development setup
  • Scraper best practices
  • Testing and debugging
  • Architecture overview

Documentation

Complete documentation index: docs/README.md

The index groups the guides into Getting Started, Operations, Development, and Reference.

Environment Variables

See .env.example for all configuration options.

Authentication (optional)

  • Enable basic auth on the web UI by setting BASIC_AUTH_ENABLED=true and providing BASIC_AUTH_USERNAME / BASIC_AUTH_PASSWORD.
  • Leave disabled for local development (default).
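
A minimal .env sketch (placeholder values; use a strong password in practice):

BASIC_AUTH_ENABLED=true
BASIC_AUTH_USERNAME=admin
BASIC_AUTH_PASSWORD=change-me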

Logging

  • File logs default to JSON lines with size-based rotation (10 MB, 5 backups). Configure via:
    • LOG_JSON_FORMAT (true/false)
    • LOG_FILE_MAX_BYTES (bytes)
    • LOG_FILE_BACKUP_COUNT (files to keep)
    • LOG_TO_FILE (toggle file output)
    • LOG_LEVEL (INFO, DEBUG, etc.)
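
For example, an .env fragment that spells out the documented defaults (JSON lines, 10 MB rotation, 5 backups):

LOG_TO_FILE=true
LOG_JSON_FORMAT=true
LOG_FILE_MAX_BYTES=10485760   # 10 MB
LOG_FILE_BACKUP_COUNT=5
LOG_LEVEL=INFO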

Config precedence

  • Secrets and endpoints come from .env (environment variables).
  • Runtime-tunable behavior (timeouts, defaults, per-scraper overrides) lives in config/settings.json and is validated against an internal JSON schema at load/save time.
  • When in doubt: .env wins for secrets/URLs; settings.json wins for UI-tuned behavior.

Security/HTTPS

  • Always terminate TLS in front of the app (e.g., nginx/Traefik with valid certs) when exposed off-LAN.

  • Enable BASIC_AUTH_ENABLED + credentials for the UI whenever it is reachable outside trusted networks.

  • Keep secrets in .env (not in settings.json); rotate keys regularly and scope API keys per-environment.

  • If running behind a proxy, set forwarded headers correctly (X-Forwarded-Proto/Host) and prefer HSTS at the proxy layer.

  • When behind a reverse proxy, set TRUST_PROXY_COUNT (e.g., 1 for a single proxy hop) so Flask respects forwarded host/proto via ProxyFix.

  • Quick checklist: TLS terminated at proxy with HSTS; BASIC_AUTH_ENABLED=true with strong creds if exposed; TRUST_PROXY_COUNT set when proxied; secrets only in .env; restrict writeable volumes (config/, data/, logs/) to trusted hosts.

  • Restrict write volumes (config/, data/, logs/) to least privilege; avoid sharing these into untrusted containers.

  • Example Traefik snippet (secure headers + forwarded proto):

    labels:
        - traefik.enable=true
        - traefik.http.routers.scraper.rule=Host(`scraper.example.com`)
        - traefik.http.routers.scraper.entrypoints=websecure
        - traefik.http.routers.scraper.tls.certresolver=letsencrypt
        - traefik.http.middlewares.scraper-headers.headers.stsSeconds=31536000
        - traefik.http.middlewares.scraper-headers.headers.forceSTSHeader=true
        - traefik.http.middlewares.scraper-headers.headers.stsIncludeSubdomains=true
        - traefik.http.middlewares.scraper-headers.headers.stsPreload=true
        - traefik.http.middlewares.scraper-headers.headers.referrerPolicy=same-origin
        - traefik.http.routers.scraper.middlewares=scraper-headers

Running Tests

pip install -r requirements.txt
pip install -r requirements-dev.txt
pytest tests/unit -v --cov=app

Integration tests are skipped by default; set RUN_INTEGRATION_TESTS=1 to enable them. Integration runs may require network access and FlareSolverr.
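
For example, to include them in a local run (the tests/ layout is assumed from the unit-test command above):

RUN_INTEGRATION_TESTS=1 pytest tests -v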

Security scan:

pip install -r requirements-dev.txt
pip-audit

License

MIT
