- Team: Tilmann Diepenbruck, Fabian Klug
- University project for "Fundamentals of Earth system data processing"
- Student project for University of Cologne
This project aims to transform daily climate summaries into a fully operational database system using PostgreSQL. It includes functionalities for downloading data, ingesting it into the database, managing station information, performing data analysis, and optionally, creating a web service for data visualization.
The Global Historical Climatology Network - Daily (GHCN-Daily) dataset integrates daily climate observations from approximately 30 different data sources. Version 3 was released in September 2012 with the addition of data from two additional station networks. Changes to the processing system associated with the version 3 release also allowed for updates to occur 7 days a week rather than only on most weekdays. Version 3 contains station-based measurements from well over 90,000 land-based stations worldwide, about two thirds of which are for precipitation measurement only. Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. Over 25,000 stations are regularly updated with observations from within roughly the last month. The dataset is also routinely reconstructed (usually every week) from its roughly 30 data sources to ensure that GHCN-Daily is generally in sync with its growing list of constituent sources. During this process, quality assurance checks are applied to the full dataset. Where possible, GHCN-Daily station data are also updated daily from a variety of data streams. Station values for each daily update also undergo a suite of quality checks.
Source: NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ [17.07.2024]
- The Presentation.ipynb notebook contains a walkthrough demo of the project tasks.
- Data Acquisition and Ingestion: [✔]
  - Download daily climate summaries for at least 10 years and load them into a PostgreSQL database.
  - Automate data download and ingestion processes.
- Database Structure: [✔]
  - Define the database schema to store climate data and station metadata efficiently.
- Functionality: [✔]
  - Develop Python functions to query the database based on user-specified station and parameters.
  - Implement basic data analysis capabilities such as plotting graphs based on user inputs.
- Optional Web Service: [✖]
  - Create a web-based frontend to interact with the database, allowing users to select stations and parameters and visualize data.
- Utilize Docker for setting up the PostgreSQL database.
- Advantages of Docker include easy deployment, isolation of dependencies, and scalability.
- You can download Docker from here: https://www.docker.com/
- Description (nceiDatabaseManager): Contains functions to manage interactions with the PostgreSQL database.
- Features:
  - Modular execution of database operations.
  - Database connection handling and query execution.
- Description (nceiDataManager): Handles data management tasks such as downloading, filtering, and exporting climate data.
- Features:
  - Automated data download using the requests library.
  - Data filtering based on predefined criteria.
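The filtering feature can be pictured with a small sketch. It assumes the GHCN-Daily by-year CSV layout (station ID, date as YYYYMMDD, element, value, M-flag, Q-flag, S-flag, observation time) and keeps only rows for selected elements that carry no quality flag. The function name `filter_rows`, the chosen elements, and the sample rows are illustrative assumptions; the criteria actually used by the project are not shown here.

```python
import csv
import io

# Column names follow the weather_cols setting (without the database "id" column).
FIELDS = ["stationcode", "datelabel", "param", "value",
          "mflag", "qflag", "sflag", "time"]


def filter_rows(lines, wanted_params=("TMAX", "TMIN", "PRCP")):
    """Yield rows for the wanted elements that have an empty quality flag.

    Hypothetical helper: the project's own filter criteria may differ.
    """
    reader = csv.DictReader(lines, fieldnames=FIELDS)
    for row in reader:
        if row["param"] in wanted_params and not row["qflag"]:
            row["value"] = int(row["value"])  # values are stored as integers
            yield row


# Made-up sample rows in the assumed layout.
sample = io.StringIO(
    "XX000000001,19520101,TMAX,25,,,E,\n"
    "XX000000001,19520101,SNWD,0,,,E,\n"
    "XX000000001,19520102,TMIN,-31,,I,E,\n"
)
kept = list(filter_rows(sample))
print(len(kept), "row(s) kept")
```

Only the first sample row passes: the second is an element that was not requested, and the third carries a quality flag.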
- Python
  - Download Python here
  - Python packages
pip install -r requirements.txt
- Docker
  - Download Docker here
- We configured a docker-compose.yml file that controls the Docker container and its dependencies.
- To start the Docker container for the first time, run:
docker-compose up -d --build # on later runs, omit the --build flag
- Create a new .ipynb file and insert the following code blocks.
- Set up the user settings
years_in_db = [1950, 1951, 1952, 1953, 1954, 2010, 2011, 2012, 2013, 2014]
stations_file_path = "./data/stations/"
modified_stations_file_path = "./data/stations/modifiedStations/"
download_csv_file_path = "./data/NCEI/ghcn/daily/"
modified_csv_file_path = "./data/NCEI/modified/daily/"
db_name = "NCEIDatabase"
db_user = "ESDP"
db_password = "esdp1"
debug_messages = False
station_cols = ["id", "latitude", "longitude", "elevation", "state", "name", "gsn_flag", "hcn_crn_flag", "wmo_id"]
weather_cols = ["id", "stationcode", "datelabel", "param", "value", "mflag", "qflag", "sflag", "time"]
- Configure the .env file
# Write db_name, db_user and db_password to a .env file, which the Docker container uses during the PostgreSQL setup
envVariables = {
    "DATABASE_NAME": db_name,
    "DATABASE_USER": db_user,
    "DATABASE_PASSWORD": db_password,
}
with open(".env", "w") as env_file:
    for key, value in envVariables.items():
        env_file.write(f"{key}={value}\n")
print(".env file created successfully.")
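Before starting the container it can help to confirm that the .env file holds all three keys. The following is a minimal sketch of such a check; `read_env_file` and `missing_keys` are hypothetical helpers that are not part of the project's packages, and the demo writes to a temporary directory instead of the real .env file.

```python
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("DATABASE_NAME", "DATABASE_USER", "DATABASE_PASSWORD")


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an "=" sign
            values[key.strip()] = value.strip()
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Demo on a temporary file that mirrors the format written above.
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / ".env"
    env_path.write_text(
        "DATABASE_NAME=NCEIDatabase\nDATABASE_USER=ESDP\nDATABASE_PASSWORD=esdp1\n"
    )
    env_values = read_env_file(env_path)

print("missing keys:", missing_keys(env_values))
```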
- Import the packages (adjust to the path of the modules; here /nceiDatabaseConnector/nceiDatabasePackage/nceiDataManager translates to the following)
from nceiDatabaseConnector.nceiDatabasePackage.nceiDataManager import NCEIDataManager
from nceiDatabaseConnector.nceiDatabasePackage.nceiDatabaseManager import NCEIDatabaseManager
- Create the data and database instances
ncei_db = NCEIDatabaseManager(
    db_name=db_name,
    db_user=db_user,
    db_password=db_password,
    db_host="localhost",
    db_port="5432",
    debug_messages=False,
    years_in_db=years_in_db,
)
ncei_data = NCEIDataManager()
- Use the implemented functions
ncei_db.create_stations_table()
ncei_data.download_stations(file_path_dest=stations_file_path)
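The station metadata in the GHCN-Daily distribution is a fixed-width text file whose columns correspond to the station_cols setting. As an illustration of what loading it involves, here is a sketch that parses one line using the column positions given in the GHCN-Daily readme. The function `parse_station_line` and the sample station are made up for this example; how the project's own download_stations and create_stations_table handle the file is not shown here and may differ.

```python
# Column slices (0-based, end-exclusive) as described in the GHCN-Daily readme.
STATION_FIELDS = {
    "id": (0, 11),
    "latitude": (12, 20),
    "longitude": (21, 30),
    "elevation": (31, 37),
    "state": (38, 40),
    "name": (41, 71),
    "gsn_flag": (72, 75),
    "hcn_crn_flag": (76, 79),
    "wmo_id": (80, 85),
}
NUMERIC_FIELDS = {"latitude", "longitude", "elevation"}


def parse_station_line(line):
    """Split one fixed-width station line into a dict keyed like station_cols."""
    record = {}
    for field, (start, end) in STATION_FIELDS.items():
        text = line[start:end].strip()
        if field in NUMERIC_FIELDS:
            record[field] = float(text)
        else:
            record[field] = text or None  # empty text fields become None
    return record


# Made-up station, padded to the fixed-width layout.
sample_line = (
    f"{'XX000000001':<11} {52.4639:>8.4f} {13.3017:>9.4f} {51.0:>6.1f} "
    f"{'':<2} {'EXAMPLE STATION':<30} {'GSN':<3} {'':<3} {'10381':<5}"
)
station = parse_station_line(sample_line)
print(station)
```

Missing elevations are encoded as -999.9 in the source file, which is why the elevation queries further down filter on elevation > -999.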
To interact with the PostgreSQL database using Docker:
- Access the Docker container:
docker exec -it esdp1-database-project-db-1 bash
- Connect to the database
psql -U ESDP -d NCEIDatabase -h localhost
- List all tables
\dt
- Check whether the Climate table has values
SELECT COUNT(*) AS count FROM "Climate1952";
- Check which stations have the most values
SELECT COUNT(*) AS count, stationcode FROM "Climate1952" GROUP BY stationcode ORDER BY count DESC LIMIT 10;
- Check which stations have the lowest elevation
SELECT * FROM "Station" WHERE elevation > -999 ORDER BY elevation ASC LIMIT 10;
- Check which stations have the highest elevation
SELECT * FROM "Station" WHERE elevation > -999 ORDER BY elevation DESC LIMIT 10;
- Check which Southern Hemisphere stations have the most values
SELECT COUNT(*) AS count, stationcode FROM "Climate1952" c LEFT JOIN "Station" s ON s.id = c.stationcode WHERE s.latitude < 0 GROUP BY stationcode ORDER BY count DESC LIMIT 10;
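The same checks could also be issued from Python. The sketch below only builds the SQL text for the "stations with the most values" query for a chosen year; `top_stations_query` is a hypothetical helper, not a function of the project's packages. Running the statement through a database driver, for example a psycopg2 cursor with its `%s` placeholder style, is an assumption about how the connection is made.

```python
def top_stations_query(year, limit=10):
    """Build (sql, params) for the stations with the most rows in one year table.

    The per-year table name ("Climate<year>") follows the psql examples above.
    """
    if isinstance(year, bool) or not isinstance(year, int):
        raise TypeError("year must be an integer")
    # A table name cannot be passed as a query parameter, so it is built
    # from the validated integer; the row limit stays a bound parameter.
    table = f'"Climate{year}"'
    sql = (
        f"SELECT COUNT(*) AS count, stationcode FROM {table} "
        "GROUP BY stationcode ORDER BY count DESC LIMIT %s;"
    )
    return sql, (limit,)


sql, params = top_stations_query(1952)
print(sql)
print(params)
# With an open DB-API cursor (assumed), this would run as:
#     cursor.execute(sql, params)
```

Validating the year as an integer before formatting it into the table name keeps user input out of the SQL text.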