Simple web-scraping utility that extracts table data from a target page and downloads linked documents.
Files
- `app.py` — main scraper script (requests + BeautifulSoup + pandas).
- `requirements.txt` — Python dependencies.
- `scraped_data.csv` — sample output CSV.
- `downloads/` — directory where linked documents are saved.
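A minimal `requirements.txt` for this stack might look like the following (package names match the libraries named above; the extra parser backend is an assumption, and no versions are pinned by this repo):

```text
requests
beautifulsoup4
pandas
lxml  # common parser backend for pandas.read_html / BeautifulSoup
```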
Quickstart (Windows)
- Create and activate a virtual env:
  - `python -m venv .venv`
  - `.venv\Scripts\activate`
- Install deps:
  - `pip install -r requirements.txt`
- Run the scraper:
  - `python app.py`
Behavior
- Scrapes the URL specified in `app.py`, extracts the first HTML table (or the first `wikitable`) into `scraped_data.csv`, and saves documents (PDF/DOC/TXT) linked from the page into `downloads/`.
- Sends a descriptive User-Agent header (modify it in `app.py`); respect the site's robots.txt and rate limits.
- Handles HTTP errors via `response.raise_for_status()` and prints simple progress/error messages.
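The flow above can be sketched roughly as follows. The URL, User-Agent string, and function names here are illustrative placeholders, not what `app.py` actually contains:

```python
import os
from io import StringIO
from urllib.parse import urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Illustrative values -- the real URL and headers live in app.py.
URL = "https://example.org/data"
HEADERS = {"User-Agent": "my-scraper/1.0 (contact: you@example.org)"}
DOC_EXTS = (".pdf", ".doc", ".txt")


def extract_doc_links(html, base_url):
    """Return absolute URLs of PDF/DOC/TXT documents linked from the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base_url, a["href"])
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith(DOC_EXTS)
    ]


def scrape(url=URL):
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()  # fail fast on HTTP errors

    # First "wikitable" if the page has one, else the first table of any kind.
    try:
        table = pd.read_html(StringIO(resp.text), attrs={"class": "wikitable"})[0]
    except ValueError:
        table = pd.read_html(StringIO(resp.text))[0]
    table.to_csv("scraped_data.csv", index=False)

    os.makedirs("downloads", exist_ok=True)
    for link in extract_doc_links(resp.text, url):
        doc = requests.get(link, headers=HEADERS, timeout=30)
        doc.raise_for_status()
        name = link.rsplit("/", 1)[-1]
        with open(os.path.join("downloads", name), "wb") as f:
            f.write(doc.content)
        print("saved", name)
```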
Configuration
- Edit the target URL and headers directly in `app.py`.
- Increase request timeout or add retry logic for robustness.
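One way to add that robustness is a `requests.Session` backed by urllib3's `Retry`; the function name and default values below are a sketch, not something `app.py` currently provides:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(retries=3, backoff=1.0):
    """Build a Session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,                      # sleeps ~1s, 2s, 4s ...
        status_forcelist=(429, 500, 502, 503, 504),  # retry these responses
        allowed_methods=frozenset(["GET", "HEAD"]),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Pass the timeout per call, e.g. `make_session().get(url, timeout=30)`; timeouts are not a Session-level setting in requests.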
Notes
- Wikimedia and many sites block default UAs; keep a descriptive UA with contact info.
- Respect robots.txt and site terms. Add exponential backoff / 429 retry handling for production use.
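The 429 handling mentioned above could look roughly like this, honoring the server's `Retry-After` header when present (the function name and defaults are hypothetical):

```python
import time

import requests


def get_with_backoff(url, headers=None, max_tries=5, base_delay=1.0):
    """GET a URL, sleeping with exponential backoff when rate-limited (429)."""
    delay = base_delay
    for attempt in range(max_tries):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()  # non-429 HTTP errors are fatal
            return resp
        # Prefer the server's Retry-After hint; otherwise back off exponentially.
        wait = float(resp.headers.get("Retry-After", delay))
        print(f"429 from {url}; retrying in {wait:.0f}s")
        time.sleep(wait)
        delay *= 2
    raise RuntimeError(f"gave up on {url} after {max_tries} attempts")
```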
License
- MIT