This project provides tools to scrape, process, and analyze weather station metadata. It extracts station names, coordinates, and other metadata from HTML pages.
DISCLAIMER: This tool is for educational purposes only. Users are responsible for ensuring compliance with the terms of service of any website they interact with. Always check robots.txt and respect rate limits when scraping websites.
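A quick way to follow that advice is Python's standard-library `urllib.robotparser`. The user-agent string and the robots.txt rules below are illustrative only:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Normally you would fetch the live file with rp.set_url("https://example.com/robots.txt")
# followed by rp.read(); here an illustrative robots.txt body is parsed directly.
rp.parse("""\
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines())

print(rp.can_fetch("weather-data-scraper", "https://example.com/station/123"))  # True
print(rp.can_fetch("weather-data-scraper", "https://example.com/private/x"))    # False
print(rp.crawl_delay("weather-data-scraper"))  # 5
```

`crawl_delay` gives you the minimum pause (in seconds) the site requests between fetches, which you can pass to `time.sleep` in your scraping loop.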
- Web scraping of weather station metadata
- Data cleaning and processing
- Coordinate extraction and standardization
- CSV export functionality
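The repository's actual coordinate parser lives in `src/utils.py` and is not shown here; as a minimal sketch, standardization typically means converting degrees-minutes-seconds strings to decimal degrees. The helper name and format below are assumptions:

```python
import re

def dms_to_decimal(dms: str) -> float:
    """Convert a coordinate like 51°30'26"N to decimal degrees.

    Hypothetical helper for illustration; not necessarily the parser
    the project uses in src/utils.py.
    """
    m = re.match(r"""(\d+)°(\d+)'([\d.]+)"?([NSEW])""", dms.strip())
    if not m:
        raise ValueError(f"Unrecognized coordinate: {dms!r}")
    deg, mins, secs, hemi = m.groups()
    value = int(deg) + int(mins) / 60 + float(secs) / 3600
    # South and West hemispheres are negative in decimal degrees.
    return -value if hemi in "SW" else value

print(round(dms_to_decimal("51°30'26\"N"), 4))  # 51.5072
```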
```
weather-data-scraper/
├── README.md
├── requirements.txt
├── data/                 # Directory for input/output data files
├── src/                  # Source code
│   ├── scraper.py        # Web scraping functionality
│   ├── data_processor.py # Data processing and cleaning
│   └── utils.py          # Utility functions
└── tests/                # Test directory
```
- Clone this repository:

```bash
git clone https://github.com/yourusername/weather-data-scraper.git
cd weather-data-scraper
```

- Create a virtual environment (optional but recommended):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

```python
from src.scraper import extract_data
from src.data_processor import enrich_data, clean_data

# Extract data from a single URL
data = extract_data("example-url.com/station/123")

# Process a CSV file containing station URLs
enriched_df = enrich_data("data/station_list.csv")

# Clean the data
cleaned_df = clean_data(enriched_df)

# Save to CSV
cleaned_df.to_csv("data/cleaned_station_list.csv", index=False)
```

- Prepare a CSV file with station URLs
- Run the enrichment process to extract metadata
- Clean the data to remove empty columns and standardize formats
- Analyze the resulting dataset
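The first step above assumes a CSV listing station URLs. A minimal way to prepare one with pandas — the column name `url` and the example URLs are assumptions; match whatever `enrich_data` actually expects:

```python
import os
import pandas as pd

os.makedirs("data", exist_ok=True)  # ensure the data/ directory exists

# Illustrative station URLs; replace with real ones for the site you target.
urls = [
    "example-url.com/station/123",
    "example-url.com/station/456",
]
pd.DataFrame({"url": urls}).to_csv("data/station_list.csv", index=False)
```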