Data Engineering Project: End-to-End ETL Data Pipeline for Processing and Analyzing YouTube Trends on AWS
This project is a complete end-to-end data engineering pipeline for analyzing YouTube data using AWS cloud services. The core goal is to extract insights from raw CSV and JSON data using AWS Glue, Lambda, Athena, and QuickSight.
Through this project, I aimed to strengthen my understanding of:
- Cloud data lakes and cataloging using AWS
- Real-world challenges with JSON normalization
- Automating ETL workflows with Lambda and PySpark
- Building analytical dashboards with Athena and QuickSight
AWS services and tools used:

- IAM – Create roles with the required policies attached so that one AWS service can access another
- Amazon S3 – Storage for raw and processed data
- AWS CLI – Upload data to S3 with Hive-style partitioning
- AWS Lambda – Transform/normalize JSON to Parquet
- AWS Glue – Data cataloging, crawling, and ETL jobs (both visual and PySpark-based)
- Amazon Athena – Serverless querying of datasets
- Amazon QuickSight – Dashboards and reporting
- Amazon CloudWatch – Logging errors and outputs for debugging and data verification
Prerequisites:

- An active AWS account
- AWS CLI installed and configured with credentials
- Familiarity with:
  - Python
  - PySpark
  - Basic AWS services (especially Glue, S3, Lambda, Athena, QuickSight)
- IAM roles/policies granting S3, Glue, Lambda, and Athena access
A daily-record dataset of top trending YouTube videos across multiple countries, originally compiled by datasnaek and hosted on Kaggle. Link - https://www.kaggle.com/datasets/datasnaek/youtube-new/data
- Covers trending videos from 14 November 2017 to 14 June 2018 (a 7-month period).
- Includes up to 200 daily trending videos per country.
- Countries included: USA, Canada, Great Britain, Germany, France, Russia, India, Japan, South Korea, Mexico.
- Intended for trend analysis, visualization, and predictive modeling based on public engagement metrics.
| Item | Value |
|---|---|
| Date Range | 14 Nov 2017 – 14 Jun 2018 |
| Number of Days | 213 |
| Videos per Day | 200 |
| Approximate Records per File | 42,600 |
| Number of Countries | 10 |
| Total Records (All Files) | ~426,000 |
⚠️ Note: Actual counts may vary slightly due to video removals, regional restrictions, or missing data on certain days.
Each CSV file (e.g., `USvideos.csv`, `INvideos.csv`) includes:

| Column | Description |
|---|---|
| `video_id` | Unique video identifier |
| `trending_date` | Date the video appeared on the trending list (format: YY.DD.MM) |
| `publish_time` | ISO 8601 UTC timestamp of when the video was uploaded |
| `title`, `channel_title` | Video title and channel name |
| `category_id` | Numeric category code (country-specific) |
| `tags` | List of video tags separated by `\|` |
| `views`, `likes`, `dislikes`, `comment_count` | Engagement metrics |
| `description` | Video description text |
| `comments_disabled`, `ratings_disabled`, `video_error_or_removed` | Boolean flags for video status |
- Up to 16 columns per file.
- Contains daily snapshots.
⚠️ Note: Some videos may appear multiple times across days if they are still trending.
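Since `trending_date` uses the unusual YY.DD.MM layout, a minimal pandas sketch can sanity-check the files locally. The file path and the `latin-1` encoding here are assumptions, not part of the project scripts:

```python
import pandas as pd

# Path and encoding are assumptions for a quick local check;
# some of the non-US files are not plain UTF-8.
df = pd.read_csv("data/USvideos.csv", encoding="latin-1")

# trending_date uses the YY.DD.MM layout; publish_time is ISO 8601.
df["trending_date"] = pd.to_datetime(df["trending_date"], format="%y.%d.%m")
df["publish_time"] = pd.to_datetime(df["publish_time"])

print(df[["video_id", "trending_date", "publish_time"]].head())
```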
Each region also ships with a category reference JSON file, e.g., `US_category_id.json`, `IN_category_id.json`:
```json
{
"kind":"youtube#videoCategoryListResponse",
"etag":"\"ld9biNPKjAjgjV7EZ4EKeEGrhao/1v2mrzYSYG6onNLt2qTj13hkQZk\"",
"items":[
{
"kind":"youtube#videoCategory",
"etag":"\"ld9biNPKjAjgjV7EZ4EKeEGrhao/Xy1mB4_yLrHy_BmKmPBggty2mZQ\"",
"id":"1",
"snippet":{
"channelId":"UCBR8-60-B28hp2BmDPdntcQ",
"title":"Film & Animation",
"assignable":true
}
},
{
"kind":"youtube#videoCategory",
"etag":"\"ld9biNPKjAjgjV7EZ4EKeEGrhao/UZ1oLIIz2dxIhO45ZTFR3a3NyTA\"",
"id":"2",
"snippet":{
"channelId":"UCBR8-60-B28hp2BmDPdntcQ",
"title":"Autos & Vehicles",
"assignable":true
}
},
...
]
}
```

Each `*_category_id.json` file provides metadata about YouTube video categories for a specific region (e.g., `US_category_id.json`).

- It maps each numeric `category_id` (used in the CSV files) to a human-readable category name like `"Music"` or `"Sports"`.
- The data is nested: you'll find a list of `items`, where each `item` contains an `id` and a `snippet.title`.
To interpret:

- Match the `category_id` in the CSV to the `id` in the JSON
- Extract `snippet.title` as the actual category name

⚠️ Note: The other fields (like `channelId`, `etag`, etc.) are metadata from the YouTube API and aren't needed for mapping or analysis.
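As a quick illustration of the mapping rule above, here is a minimal Python sketch (the local file path is an assumption) that builds a `category_id` → title lookup:

```python
import json

# Local path is an assumption; adjust to wherever the reference files live.
with open("data/US_category_id.json") as f:
    payload = json.load(f)

# "id" values are strings in the JSON, matching the mapping rule above.
id_to_title = {item["id"]: item["snippet"]["title"] for item in payload["items"]}

print(id_to_title["1"])  # -> Film & Animation
```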
- Download the raw CSV and JSON data locally into the `~/../<project_folder>/data/` folder.
- Create an S3 bucket and upload the data:
  - CSV files with Hive-style partitioning (e.g., `/region=us/`, `/region=ca/`)
  - JSON files stored in a separate folder without partitioning
- Use `aws s3 cp` or `aws s3 sync` via the AWS CLI for uploads (a boto3 sketch of the same key layout follows below).

Full commands - S3-CLI-Command.sh
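The linked shell script holds the actual upload commands; purely to illustrate what Hive-style keys look like in practice, here is a boto3 sketch where the bucket name and key prefixes are hypothetical:

```python
import boto3
from pathlib import Path

s3 = boto3.client("s3")
bucket = "my-youtube-raw-bucket"  # hypothetical bucket name
data_dir = Path("data")

# CSVs: the region becomes a Hive-style partition in the object key,
# e.g. youtube/raw_statistics/region=us/USvideos.csv
for csv_file in data_dir.glob("*videos.csv"):
    region = csv_file.name[:2].lower()
    key = f"youtube/raw_statistics/region={region}/{csv_file.name}"
    s3.upload_file(str(csv_file), bucket, key)

# JSON reference files: a flat, unpartitioned prefix
for json_file in data_dir.glob("*_category_id.json"):
    key = f"youtube/raw_statistics_reference_data/{json_file.name}"
    s3.upload_file(str(json_file), bucket, key)
```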
- Tried creating a Glue Crawler over JSON data.
- Encountered issues due to Glue's limited support for nested JSON structures.
- Decided to normalize the JSON before further processing.
✅ Note - Here I also created the `youtubeanalysis-glue-s3-role` role in IAM with the following policies: `AmazonS3FullAccess` and `AWSGlueServiceRole`, to give Glue access to the S3 buckets.
- Wrote a Lambda function (a sketch follows after the notes) to:
  - Read JSON from the raw S3 bucket
  - Normalize/flatten the JSON
  - Write it as Parquet to a new "cleaned" S3 bucket
- Successfully tested on one file using an `S3 Put` event.
- Verified results using Athena queries on the generated Glue Catalog.

```sql
SELECT * FROM "AwsDataCatalog"."db_youtube_cleaned"."cleaned_statistics_ref_data" LIMIT 10;
```

✅ Note -
1. Created another role, `youtubeanalysis-lambda-s3-role`, with full access to S3 (`AmazonS3FullAccess`) and Glue (`AWSGlueServiceRole`).
2. Used the Lambda layer `AWSSDKPandas-Python39` (version 28) for the dependency packages.
3. Modified the timeout (3 min), memory (256 MB), and storage (512 MB) for a seamless run.
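As a rough sketch of the approach described above, assuming the `AWSSDKPandas-Python39` layer is attached and using hypothetical environment variable names, such a handler could look like this:

```python
import os
import urllib.parse

import awswrangler as wr  # provided by the AWSSDKPandas-Python39 layer
import pandas as pd

def lambda_handler(event, context):
    # Identify the .json object that fired the S3 Put event
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Read the raw JSON and flatten the nested "items" array
    raw = wr.s3.read_json(f"s3://{bucket}/{key}")
    df = pd.json_normalize(raw["items"])

    # Write Parquet to the cleaned bucket and append to the Glue catalog;
    # the environment variable names here are hypothetical.
    return wr.s3.to_parquet(
        df=df,
        path=os.environ["CLEANED_S3_PATH"],
        dataset=True,
        database=os.environ["GLUE_DB_NAME"],
        table=os.environ["GLUE_TABLE_NAME"],
        mode="append",
    )
```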
- Created a Glue Crawler to crawl the partitioned CSV files.
- The table was created with `region` as the partition key.
- Used Athena to join the JSON (reference) and CSV (raw) tables.
- Adjusted data types in the Glue catalog to avoid repeated casting.
- Added an S3 trigger on the raw JSON folder to invoke the Lambda function (a sketch of the trigger configuration follows the note).
- Lambda automatically creates a Parquet version of each `.json` file in a different bucket under the `cleaned_statistics_ref_data` folder and appends to the Glue catalog.

✅ Note - The trigger fires on `All object create events` (i.e., `PUT`, `POST`, `COPY`), restricted to objects of `.json` type in the JSON data source folder.
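The trigger was set up in the console; for reference, the same configuration could be expressed with boto3 (bucket name, Lambda ARN, and prefix are hypothetical placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Bucket name, function ARN, and prefix are hypothetical placeholders;
# this mirrors the console trigger described above.
s3.put_bucket_notification_configuration(
    Bucket="my-youtube-raw-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:json-to-parquet",
                "Events": ["s3:ObjectCreated:*"],  # covers PUT, POST, COPY
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "youtube/raw_statistics_reference_data/"},
                            {"Name": "suffix", "Value": ".json"},
                        ]
                    }
                },
            }
        ]
    },
)
```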
- Initially attempted a Glue Visual ETL job but faced encoding and special-character issues: Visual_to_Glue_Spark_Script_Failed.py
- Switched to a custom PySpark job (sketched after this list): Glue_PySpark_ETL_CSV_to_Parquet.py
  - Read CSV from raw S3
  - Added a "region" column manually
  - Applied transformations, i.e., type conversion of `string` columns to `long` and `boolean` wherever necessary
  - Wrote cleaned Parquet to a new S3 location under the `raw_data_transformed` folder, partitioned by `region`
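The real job is in Glue_PySpark_ETL_CSV_to_Parquet.py; this plain-PySpark sketch (bucket paths are placeholders, and only the engagement/flag columns from the schema above are cast) outlines the same transformations:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

# Bucket paths are placeholders; the real job runs inside Glue.
region = "us"  # one pass per region folder
df = spark.read.csv(
    f"s3://<raw-bucket>/youtube/raw_statistics/region={region}/",
    header=True,
)

# Cast the numeric and boolean columns from string, and tag the region
df_clean = (
    df.withColumn("views", col("views").cast("long"))
      .withColumn("likes", col("likes").cast("long"))
      .withColumn("dislikes", col("dislikes").cast("long"))
      .withColumn("comment_count", col("comment_count").cast("long"))
      .withColumn("comments_disabled", col("comments_disabled").cast("boolean"))
      .withColumn("ratings_disabled", col("ratings_disabled").cast("boolean"))
      .withColumn("video_error_or_removed", col("video_error_or_removed").cast("boolean"))
      .withColumn("region", lit(region))
)

df_clean.write.mode("append").partitionBy("region").parquet(
    "s3://<cleaned-bucket>/youtube/raw_data_transformed/"
)
```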
- Created a new Glue Crawler over the cleaned `raw_data_transformed` folder to create a new catalog table with partition column `region`.
✅ Note - Verified the data by joining the reference (JSON, now Parquet) and raw (CSV, now Parquet) tables in Athena.

```sql
-- Join the raw transformed catalog table with the cleaned ref catalog table.
-- Produces 3,734,265 records.
-- This join is then implemented in a Glue Visual ETL job to create one final dataset for analysis/reporting.
SELECT *
FROM "db_youtube_cleaned"."cleaned_statistics_ref_data" ref
INNER JOIN "db_youtube_cleaned"."raw_data_transformed" raw
    ON ref.id = raw.category_id;
```

- Used a Glue Visual ETL job to join the cleaned raw (`raw_data_transformed`) and ref (`cleaned_statistics_ref_data`) catalog tables on `raw.category_id = ref.id`.
- The output (a `parquet` file) is written to a new final analytics S3 bucket under the `combined_ref_raw/` folder.
- The ETL job also creates a Glue catalog table, `combined_ref_raw_catalog`, from the final joined data with two partition keys: `region` and `category_id`.
Auto-generated PySpark code - Glue_Visual_ETL_Final_Combined_Raw_n_Ref_data.py (a rough plain-PySpark equivalent is sketched below)
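The auto-generated script above is the source of truth; as a sketch of what the visual job effectively does (bucket paths are placeholders), the join and partitioned write look roughly like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("combine_ref_raw").getOrCreate()

# Paths are placeholders; both inputs are the Parquet outputs of earlier steps.
ref = spark.read.parquet("s3://<cleaned-bucket>/youtube/cleaned_statistics_ref_data/")
raw = spark.read.parquet("s3://<cleaned-bucket>/youtube/raw_data_transformed/")

# Inner join on the category key, as in the Athena query above
combined = raw.join(ref, raw["category_id"] == ref["id"], "inner")

# Partitioned write mirroring the job's two partition keys
(combined.write
    .mode("overwrite")
    .partitionBy("region", "category_id")
    .parquet("s3://<analytics-bucket>/combined_ref_raw/"))
```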
- Connected QuickSight to Athena.
- Imported the `combined_ref_raw_catalog` table for reporting.
- Built insightful dashboards using dimensions like region, category, and more.
- View the Analysis Report
- The project demonstrates an end-to-end Data Lake architecture.
- Highlights the strengths and limitations of AWS-native tools.
- Offers hands-on experience with data wrangling, schema handling, and reporting.
