This project demonstrates an end-to-end ETL process: extracting data from the open Rick and Morty API, transforming it, and loading the transformed data into an Amazon RDS MySQL database. The project leverages AWS services such as Lambda for automation and S3 for storage. The data is then connected to HeidiSQL for easy querying and analysis.
- Prerequisites
- Environment Setup
- Data Extraction
- Data Transformation
- Data Loading
- Connecting to RDS using HeidiSQL
- Error Handling and Troubleshooting
- Conclusion
## Prerequisites

Before starting, ensure you have the following:
- AWS Account: Access to an AWS account with permissions to create and manage S3, RDS, and Lambda.
- Python: Python 3.8 or higher installed.
- AWS CLI: Installed and configured with your credentials.
- HeidiSQL: Installed on your local machine.
## Environment Setup

- Clone the Repository:

  ```bash
  git clone https://github.com/yourusername/data-engineering.git
  cd data-engineering
  ```

- Create a Virtual Environment:

  ```bash
  python3 -m venv Mustard
  source Mustard/bin/activate
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- AWS Configuration: Run `aws configure` to set up your AWS credentials. Provide your AWS Access Key, Secret Key, Region, and Output format.
## Data Extraction

The data is extracted from the Rick and Morty API's three main endpoints: Characters, Locations, and Episodes. The data is saved as CSV files in an S3 bucket.
Steps:

- Extract Data:
  - The script iterates over the API pages to collect data from each endpoint.
  - Data from each endpoint is collected into three separate DataFrames and written out as `characters.csv`, `locations.csv`, and `episodes.csv`.
- Save Data to S3:
  - The files are saved to the `Rick&Morty/Untransformed/` folder in your S3 bucket.
## Data Transformation

The transformation involves cleaning and restructuring the data to prepare it for loading into the RDS database.
Steps:

- Load Data from S3:
  - The script reads the raw CSV files from the S3 bucket.
- Transform Data:
  - Extracts necessary fields from complex data types (e.g., nested JSON structures).
  - Creates additional DataFrames such as `appearances.csv` to normalize the data.
- Save Transformed Data:
  - The transformed data is saved back to the S3 bucket in the `Rick&Morty/Transformed/` folder.
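To illustrate the normalization step, here is a hedged sketch of how an appearances table can be derived with pandas. Each character record from the API carries a list of episode URLs; exploding that list yields one row per character/episode pair. The column names (`character_id`, `episode_id`) are assumptions for illustration, not the repo's exact schema.

```python
import pandas as pd

def build_appearances(characters):
    """One row per (character, episode) pair, derived from the API's episode URL lists."""
    appearances = characters[["id", "episode"]].explode("episode")
    # Episode references look like .../api/episode/28; keep only the trailing numeric id.
    appearances["episode_id"] = appearances["episode"].str.rsplit("/", n=1).str[1].astype(int)
    return appearances.rename(columns={"id": "character_id"})[["character_id", "episode_id"]]

# Tiny in-memory example standing in for characters.csv:
characters = pd.DataFrame({
    "id": [1, 2],
    "episode": [
        ["https://rickandmortyapi.com/api/episode/1",
         "https://rickandmortyapi.com/api/episode/2"],
        ["https://rickandmortyapi.com/api/episode/1"],
    ],
})
appearances = build_appearances(characters)
# appearances now has three rows: (1, 1), (1, 2), (2, 1)
```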
## Data Loading

This script loads the transformed data into the RDS MySQL database.
Steps:

- Connect to RDS:
  - Uses the `mysql-connector-python` library to connect to the RDS instance.
- Create Tables:
  - Executes SQL scripts to create tables (`Character_Table`, `Episode_Table`, `Appearance_Table`, `Location_Table`) in the database.
- Insert Data:
  - Iterates over the DataFrames and inserts the data into the corresponding tables.
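The load steps can be sketched as below. The DDL columns, connection details, and helper names are illustrative placeholders, not the repo's exact code; `load_all` is shown but deliberately not executed here because it needs a live RDS endpoint.

```python
def insert_sql(table, columns):
    """Build a parameterized INSERT statement for cursor.executemany()."""
    return "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), ", ".join(["%s"] * len(columns)))

# Illustrative DDL; the real Character_Table likely has more columns.
CREATE_CHARACTER_TABLE = """
CREATE TABLE IF NOT EXISTS Character_Table (
    id INT PRIMARY KEY,
    name VARCHAR(255),
    status VARCHAR(50),
    species VARCHAR(100)
)
"""

def load_dataframe(conn, table, df):
    """Insert every row of a DataFrame into the given table."""
    cursor = conn.cursor()
    cursor.executemany(insert_sql(table, list(df.columns)),
                       [tuple(row) for row in df.itertuples(index=False)])
    conn.commit()

def load_all(host, user, password, frames):
    """Connect to RDS and load each {table_name: DataFrame} entry; not run in this sketch."""
    import mysql.connector  # pip install mysql-connector-python
    conn = mysql.connector.connect(host=host, user=user, password=password,
                                   database="rick-and-morty-db")
    conn.cursor().execute(CREATE_CHARACTER_TABLE)
    for table, df in frames.items():
        load_dataframe(conn, table, df)
    conn.close()
```

Using `executemany` with `%s` placeholders lets the MySQL connector escape values itself, which avoids building SQL by string concatenation.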
## Connecting to RDS using HeidiSQL

- Open HeidiSQL.
- Create a New Session:
  - Select `MySQL` as the connection type.
  - Fill in the connection details:
    - Hostname/IP: Your RDS endpoint.
    - User: `admin` (or your RDS username).
    - Password: Your RDS password.
    - Database: `rick-and-morty-db`.
- Connect:
  - Once connected, you can view the tables and execute queries against your Rick and Morty data.
## Error Handling and Troubleshooting

- NoCredentialsError: Ensure your AWS credentials are correctly configured using `aws configure`.
- AccessDeniedError: Attach the necessary IAM policies to your user or role to allow S3 and RDS operations.
- KeyError: Ensure that the data fields being referenced exist in your DataFrames after transformation.
## Conclusion

This project showcases a complete ETL pipeline using AWS services and Python. The data extracted from the Rick and Morty API is transformed and stored in a MySQL database hosted on AWS RDS. Finally, the data is accessible and queryable through HeidiSQL, making it a robust solution for data analysis.