Welcome to the exercises and projects repository for the Data Engineering course at Coderhouse! This repository contains the hands-on assignments, code samples, and projects I wrote as part of the course curriculum.
This project revolves around the ETL (Extract, Transform, Load) process applied to financial data: extracting stock ticker data, transforming it for analytical purposes, and loading it into a data warehouse. The final aim is a robust pipeline that can handle large volumes of financial data and keep it readily available for subsequent analysis; a minimal extract-and-load sketch follows the tool list below.
The main technologies and tools we're employing include:
- Pandas: For scripting and data manipulation.
- Amazon Redshift / PostgreSQL: As our primary data storage and querying solution.
- psycopg2: To connect and interact with Amazon Redshift.
- YFinance: A data source utilized for fetching stock ticker data.
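A minimal sketch of this first extract-and-load step, assuming a hypothetical `stock_prices` table and placeholder tickers and connection settings (the actual notebooks may structure this differently):

```python
import psycopg2
import yfinance as yf

# Placeholder tickers, for illustration only.
TICKERS = ["AAPL", "MSFT"]

# Extract: daily OHLCV data for the last five sessions via yfinance.
data = yf.download(TICKERS, period="5d", interval="1d", group_by="ticker")

# Load: insert each row into the (hypothetical) stock_prices table.
conn = psycopg2.connect(
    host="your-cluster.redshift.amazonaws.com",  # or a local PostgreSQL host
    port=5439,
    dbname="dev",
    user="etl_user",
    password="********",
)
with conn, conn.cursor() as cur:
    for ticker in TICKERS:
        df = data[ticker].dropna().reset_index()
        for row in df.itertuples(index=False):
            cur.execute(
                "INSERT INTO stock_prices (ticker, trade_date, open, high, low, close, volume) "
                "VALUES (%s, %s, %s, %s, %s, %s, %s)",
                (ticker, row.Date.date(), row.Open, row.High, row.Low, row.Close, int(row.Volume)),
            )
conn.close()
```

For larger volumes the row-by-row inserts would normally be batched, for example with `cur.executemany` or, on Redshift, a COPY from S3.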
This deliverable advances the ETL pipeline by adding a validation step on the downloaded data and by loading it into a staging table before swapping it into the production table; a sketch follows the tool list below.
The main technologies and tools we're employing include (in addition to those mentioned in deliverable 1):
- pandas_market_calendars: for validating downloaded dates against exchange trading calendars.
- Custom Python functions: making the data-upload process repeatable.
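A hedged sketch of those two ideas: validating downloaded dates against an exchange calendar with pandas_market_calendars, and loading through a staging table before touching production. The table and column names (`stock_prices_staging`, `trade_date`) are assumptions, not necessarily those used in the notebooks:

```python
import pandas_market_calendars as mcal

def validate_trading_days(df, start_date, end_date, calendar_name="NYSE"):
    """Raise if any expected trading day is missing from the downloaded DataFrame."""
    schedule = mcal.get_calendar(calendar_name).schedule(start_date=start_date, end_date=end_date)
    expected = set(schedule.index.date)         # days the exchange was actually open
    missing = expected - set(df["trade_date"])  # df assumed to hold date objects
    if missing:
        raise ValueError(f"Missing trading days: {sorted(missing)}")

def load_via_staging(conn, df):
    """Insert into a staging table, then swap the affected rows into production."""
    with conn, conn.cursor() as cur:
        cur.execute("TRUNCATE stock_prices_staging;")
        for row in df.itertuples(index=False):
            cur.execute(
                "INSERT INTO stock_prices_staging (ticker, trade_date, close) VALUES (%s, %s, %s)",
                (row.ticker, row.trade_date, row.close),
            )
        # Remove only the dates just staged from production, then copy the staged rows in.
        cur.execute(
            "DELETE FROM stock_prices USING stock_prices_staging "
            "WHERE stock_prices.trade_date = stock_prices_staging.trade_date;"
        )
        cur.execute("INSERT INTO stock_prices SELECT * FROM stock_prices_staging;")
```

Because the staging load and the swap run in one transaction, a failed validation or insert leaves the production table untouched.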
In this deliverable a custom Docker Compose setup is built for scheduling the ETL with Airflow. From this step onward the ETL can run on any computer that has Docker installed, without any further configuration; a sample DAG is sketched after the tool list below.
The main technologies and tools we're employing include (in addition to those mentioned in deliverables 1 and 2):
- Docker: custom Docker Compose file for further modularization and reproducibility.
- Airflow: daily execution of the ETL.
- Papermill: PapermillOperator for executing Jupyter notebooks.
- PostgreSQL: a containerized database service holding the ticker data (it can be replaced with Redshift connection information).
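To picture the scheduling side, here is a short Airflow DAG using the PapermillOperator; the dag_id, notebook paths, and parameters are placeholders rather than the exact ones in this repository's Compose setup:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="stock_etl_daily",          # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_etl = PapermillOperator(
        task_id="run_etl_notebook",
        input_nb="/opt/airflow/notebooks/etl.ipynb",
        # Papermill saves an executed copy of the notebook for each run date.
        output_nb="/opt/airflow/notebooks/output/etl_{{ ds }}.ipynb",
        parameters={"run_date": "{{ ds }}"},
    )
```

The operator comes from the apache-airflow-providers-papermill package, which is typically installed into the Airflow image built by the Compose file.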
Here we set up XCom between Airflow tasks to forward a basic data-download report to a downstream task, which then emails the report to specific recipients; a sketch follows the tool list below.
The main technologies and tools we're employing include (in addition to those mentioned in deliverables 1, 2 and 3):
- XCom: passing the download report between Airflow tasks.
- smtplib: Python standard-library module for handling SMTP connections.
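A minimal sketch of the XCom hand-off and the SMTP send; addresses, host, and credentials are placeholders, and in a real deployment they would come from Airflow Connections or environment variables:

```python
import smtplib
from datetime import datetime
from email.message import EmailMessage

from airflow import DAG
from airflow.operators.python import PythonOperator

def build_report(ti, **_):
    # Push a short download summary to XCom so the next task can read it.
    ti.xcom_push(key="download_report", value="Downloaded 2 tickers, 10 rows, no missing days.")

def send_report(ti, **_):
    report = ti.xcom_pull(task_ids="build_report", key="download_report")
    msg = EmailMessage()
    msg["Subject"] = "Daily ticker download report"
    msg["From"] = "etl@example.com"        # placeholder sender
    msg["To"] = "analyst@example.com"      # placeholder recipient
    msg.set_content(report)
    with smtplib.SMTP_SSL("smtp.example.com", 465) as server:
        server.login("etl@example.com", "app-password")
        server.send_message(msg)

with DAG(dag_id="etl_report_email", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    report = PythonOperator(task_id="build_report", python_callable=build_report)
    email = PythonOperator(task_id="send_report", python_callable=send_report)
    report >> email
```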
Weekly content has been selectively curated, omitting certain weeks with introductory material to maintain the specialized focus of this repository.
The provided SQL scripts serve as the backbone for initializing and processing data within our DESASTRES database. The create_base.sql script establishes the foundational structure of our database. It focuses on the creation of the DESASTRES database and subsequently sets up the clima table, designed to store climate-related data such as yearly temperature and oxygen levels.
On the other hand, the create_procedure.sql script introduces a stored procedure, petl_desastres(), within the public schema. This procedure is pivotal for data processing and transformation. It ensures data consistency by first clearing the desastres_final table and then populating it with aggregated data, emphasizing key metrics like average temperature, average oxygen levels, and total tsunamis over specified intervals.
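Once both scripts have been run, the procedure can also be invoked from Python; below is a minimal sketch with placeholder connection settings, assuming petl_desastres() was created with CREATE PROCEDURE and is therefore invoked with CALL:

```python
import psycopg2

# Placeholder connection settings for the local DESASTRES database.
conn = psycopg2.connect(host="localhost", port=5432, dbname="desastres",
                        user="postgres", password="********")

with conn, conn.cursor() as cur:
    # Clears desastres_final and repopulates it with the aggregated metrics.
    cur.execute("CALL public.petl_desastres();")
conn.close()
```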
This collection of projects encompasses various aspects of data handling and analysis. We begin with exploratory data analysis (EDA) in pandas_ex.ipynb, focusing on assessing duplicates, null values, and anomalies in the dataset. This is followed by pytrend_ex.ipynb, where we delve into trend analysis using Google Trends data for specific keywords and visualizing the trend insights. Lastly, in sql-json_ex.ipynb, we extract data from both JSON and SQL sources, making it ready for further processing and analysis.
The main technologies and tools employed across these projects include:
- Pandas: Employed extensively for data manipulation and EDA.
- Pytrends: Utilized to fetch Google Trends data.
- JSON: To read and normalize raw and nested JSON files.
- SQLite3: For connecting to SQLite databases and extracting data.
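The extraction patterns used across these notebooks can be pictured in a few lines; the keyword, file name, and table name below are placeholders rather than the ones actually used:

```python
import json
import sqlite3

import pandas as pd
from pytrends.request import TrendReq

# Google Trends: interest over the last 12 months for a sample keyword.
pytrends = TrendReq(hl="en-US", tz=360)
pytrends.build_payload(["data engineering"], timeframe="today 12-m")
trends_df = pytrends.interest_over_time()

# Nested JSON: flatten into a tabular DataFrame (placeholder file name).
with open("raw_data.json") as fh:
    records = json.load(fh)
json_df = pd.json_normalize(records)

# SQLite: pull rows straight into pandas (placeholder database and table).
with sqlite3.connect("example.db") as conn:
    sql_df = pd.read_sql_query("SELECT * FROM some_table LIMIT 10", conn)
```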