PyScrape

Simple web-scraping utility that extracts table data from a target page and downloads linked documents.

Files

  • app.py — main scraper script (requests + BeautifulSoup + pandas).
  • requirements.txt — Python dependencies.
  • scraped_data.csv — sample output CSV.
  • downloads/ — directory where linked documents are saved.

Quickstart (Windows)

  1. Create and activate a virtual environment:
    • python -m venv .venv
    • .venv\Scripts\activate
  2. Install deps:
    • pip install -r requirements.txt
  3. Run the scraper:
    • python app.py

Behavior

  • Scrapes the URL specified in app.py, extracts the first HTML table (preferring the first table with class wikitable, if one exists) into scraped_data.csv, and saves documents (PDF/DOC/TXT) linked from the page into downloads/.
  • Sends a descriptive User-Agent header (modify in app.py) — respect site robots and rate limits.
  • Handles HTTP errors via response.raise_for_status() and prints simple progress/errors.
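The table-extraction and link-collection steps above can be sketched roughly as follows. This is illustrative, not the exact code in app.py: the inline HTML stands in for the fetched page, and the column names are made up.

```python
import os

import pandas as pd
from bs4 import BeautifulSoup

# Tiny inline page standing in for the fetched target (illustrative).
html = """
<table class="wikitable">
  <tr><th>Name</th><th>Document</th></tr>
  <tr><td>Alpha</td><td><a href="/files/a.pdf">a.pdf</a></td></tr>
</table>
<a href="/files/notes.txt">notes</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Prefer the first wikitable; fall back to the first table on the page.
table = soup.find("table", class_="wikitable") or soup.find("table")
rows = [[c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]
df = pd.DataFrame(rows[1:], columns=rows[0])
df.to_csv("scraped_data.csv", index=False)

# Collect document links by extension; app.py would then fetch each one
# and stream the response body into downloads/.
DOC_EXTS = (".pdf", ".doc", ".txt")
doc_links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].lower().endswith(DOC_EXTS)]
os.makedirs("downloads", exist_ok=True)
```

In the real script the HTML comes from requests.get(...).text and the collected hrefs are resolved against the page URL (e.g. with urllib.parse.urljoin) before downloading.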

Configuration

  • Edit the target URL and headers directly in app.py.
  • Increase request timeout or add retry logic for robustness.
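One way to add the timeout and retry logic mentioned above is a requests Session with urllib3's Retry. A minimal sketch — the retry counts, backoff factor, and User-Agent string are illustrative values, not what app.py currently ships with:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures (including 429 rate limits) with
# exponential backoff between attempts.
retry = Retry(
    total=3,
    backoff_factor=1.0,
    status_forcelist=[429, 500, 502, 503, 504],
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
session.headers.update(
    {"User-Agent": "PyScrape/1.0 (contact: you@example.com)"}  # placeholder UA
)

# Then fetch with an explicit timeout on every call:
# resp = session.get(url, timeout=30)
```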

Notes

  • Wikimedia and many other sites block requests with default User-Agent strings; keep a descriptive UA that includes contact info.
  • Respect robots.txt and site terms. Add exponential backoff / 429 retry handling for production use.
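Checking robots.txt before fetching can be done with the standard library. A sketch using urllib.robotparser — the rules below are fed in directly for illustration; in app.py you would instead point set_url() at the site's real /robots.txt and call read():

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Illustrative rules; in practice: rp.set_url(base_url + "/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch() tells you whether your UA may request a given URL.
allowed = rp.can_fetch(
    "PyScrape/1.0 (contact: you@example.com)",  # placeholder UA
    "https://example.com/wiki/Some_page",
)
```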

License

  • MIT
