AWS ETL Pipeline for YouTube Data Analytics


📋 Project Overview

This project implements a complete serverless ETL (Extract, Transform, Load) pipeline on AWS to process and analyze YouTube trending-video statistics from multiple regions. The pipeline leverages AWS managed services to build a scalable, cost-effective data analytics solution.

Business Context

Cloud-based data lake solutions enable rich analytics while organizing data into storage layers (raw, cleansed, and analytical). This project securely manages, streamlines, and analyzes structured and semi-structured YouTube video data by video category and trending metrics.

Dataset

This Kaggle dataset contains daily statistics (CSV files) on trending YouTube videos collected over several months. Up to 200 trending videos are listed per day for each region, and each region has its own data file containing:

  • Video title, channel title, publication time
  • Tags, views, likes, dislikes
  • Description and comment count
  • Category ID (linked via JSON reference files)

πŸ—οΈ Architecture

The pipeline follows a medallion architecture with three data layers:

(Architecture diagram: see Architecture.jpg)

Raw Data (S3) → Lambda → Cleansed Layer (S3) → Glue ETL → Analytics Layer (S3) → Athena/QuickSight

Key Components:

  • Amazon S3: Data lake storage (raw, cleansed, and analytics layers)
  • AWS Lambda: Serverless data processing for JSON transformation
  • AWS Glue: Serverless ETL jobs using Apache Spark
  • AWS Glue Data Catalog: Centralized metadata repository
  • Amazon Athena: SQL-based data querying
  • Amazon QuickSight: Data visualization and dashboards

📊 Data Flow

1. Data Ingestion Layer

  • Raw CSV and JSON data files are uploaded to the S3 raw bucket
  • Data is organized using Hive-style partitioning (region=ca/, region=us/, etc.)
  • Supports multiple regions: CA, DE, FR, GB, IN, JP, KR, MX, RU, US

2. Data Processing Layer

  • Lambda Function: Triggered by S3 events when JSON files are uploaded (see the sketch after this list)
    • Reads JSON data using AWS Data Wrangler
    • Normalizes nested JSON structures using pandas
    • Converts to Parquet format for optimized storage
    • Registers data in Glue Data Catalog
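
A minimal sketch of that handler, assuming the AWS Data Wrangler (awswrangler) Lambda layer and the environment variables listed under Lambda Configuration below. It mirrors the flow of lambda function.py, but names and details are illustrative:

```python
import os
import urllib.parse

import awswrangler as wr
import pandas as pd

# Target location and catalog settings come from the Lambda environment variables
os_input_s3_cleansed_layer = os.environ["s3_cleansed_layer"]
os_input_glue_catalog_db_name = os.environ["glue_catalog_db_name"]
os_input_glue_catalog_table_name = os.environ["glue_catalog_table_name"]
os_input_write_data_operation = os.environ["write_data_operation"]  # "append" or "overwrite"


def lambda_handler(event, context):
    # The S3 event carries the bucket and URL-encoded key of the uploaded JSON file
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])

    try:
        # Read the raw JSON reference file and flatten its nested "items" array
        df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
        df_items = pd.json_normalize(df_raw["items"])

        # Write Parquet to the cleansed layer and register the table in the Glue Catalog
        return wr.s3.to_parquet(
            df=df_items,
            path=os_input_s3_cleansed_layer,
            dataset=True,
            database=os_input_glue_catalog_db_name,
            table=os_input_glue_catalog_table_name,
            mode=os_input_write_data_operation,
        )
    except Exception:
        print(f"Error processing s3://{bucket}/{key}")
        raise
```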

3. Data Transformation Layer

  • Glue ETL Jobs: PySpark-based transformations (see the sketch after this list)
    • Reads data from Glue Data Catalog
    • Applies schema mapping and data type conversions
    • Resolves data quality issues (null fields, type conflicts)
    • Partitions data by region
    • Implements predicate pushdown for query optimization
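
A condensed sketch of such a job. The database, table, bucket, and the abbreviated column mapping are placeholders; the real script (Spark code Glue job.py) carries the full mapping:

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, DropNullFields, ResolveChoice
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw statistics table from the Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_raw",
    table_name="raw_statistics",
    transformation_ctx="datasource",
)

# Cast columns to the target schema (abbreviated; source types are whatever
# the crawler inferred from the CSVs)
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("video_id", "string", "video_id", "string"),
        ("category_id", "string", "category_id", "long"),
        ("views", "string", "views", "long"),
        ("region", "string", "region", "string"),
    ],
)

# Resolve ambiguous column types, then drop null fields
resolved = ResolveChoice.apply(frame=mapped, choice="make_struct")
cleaned = DropNullFields.apply(frame=resolved)

# Coalesce to avoid many small files, then write Parquet partitioned by region
coalesced = DynamicFrame.fromDF(cleaned.toDF().coalesce(1), glueContext, "coalesced")
glueContext.write_dynamic_frame.from_options(
    frame=coalesced,
    connection_type="s3",
    connection_options={
        "path": "s3://your-cleansed-bucket/youtube/raw_statistics/",
        "partitionKeys": ["region"],
    },
    format="parquet",
)
job.commit()
```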

4. Data Analytics Layer

  • Transformed data stored in Parquet format
  • Partitioned by region for efficient querying
  • Queryable via Amazon Athena (example query after this list)
  • Visualized using Amazon QuickSight
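
For example, once the cleansed and analytics tables are registered in the Glue Data Catalog, they can be queried from Python through Athena with awswrangler. A sketch; the database and table names are placeholders:

```python
import awswrangler as wr

# Athena resolves the table through the Glue Data Catalog; filtering on the
# region partition column lets it prune to only the matching partitions
df = wr.athena.read_sql_query(
    sql="""
        SELECT region, title, views, likes
        FROM raw_statistics
        WHERE region IN ('ca', 'us')
        ORDER BY views DESC
        LIMIT 10
    """,
    database="db_youtube_cleaned",
)
print(df.head())
```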

πŸ“ Project Structure

.
├── README.md
├── Architecture.jpg                     # Pipeline architecture diagram
├── lambda function.py                   # Lambda function for JSON processing
├── Spark code Glue job.py               # Basic Glue ETL job
├── Spark code Glue job with pushdown predicate.py  # Optimized Glue job
└── Amazon S3 CLI copy commands.sh       # S3 data upload commands

🚀 Getting Started

Prerequisites

  • AWS Account with appropriate IAM permissions
  • AWS CLI configured with credentials
  • Python 3.8+
  • Basic understanding of AWS services (S3, Lambda, Glue, Athena)

Required AWS Services Setup

  1. S3 Buckets (boto3 creation sketch after this list)

    - Raw bucket: s3://bigdata-on-youtube-raw-{region}-{account-id}-{env}/
    - Cleansed bucket: s3://bigdata-on-youtube-cleansed-{region}-{account-id}-{env}/
    - Analytics bucket: s3://bigdata-on-youtube-analytics-{region}-{account-id}-{env}/
    
  2. IAM Roles

    • Lambda execution role with S3 and Glue permissions
    • Glue service role with S3 read/write permissions
  3. Lambda Configuration

    • Runtime: Python 3.9+
    • Layer: AWS Data Wrangler
    • Environment Variables:
      • s3_cleansed_layer: Target S3 path
      • glue_catalog_db_name: Glue database name
      • glue_catalog_table_name: Glue table name
      • write_data_operation: Write mode (append/overwrite)
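
As a starting point, the three buckets can be created with default encryption enabled via boto3. A sketch; the name below is a placeholder following the pattern above, to adapt per the Note at the end of this README:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

for layer in ("raw", "cleansed", "analytics"):
    # Placeholder name following the {layer}-{region}-{account-id}-{env} pattern
    bucket = f"bigdata-on-youtube-{layer}-useast1-123456789012-dev"
    s3.create_bucket(Bucket=bucket)  # outside us-east-1, add CreateBucketConfiguration
    # Default encryption at rest (see Security Best Practices below)
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
        },
    )
```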

Installation Steps

  1. Upload Raw Data to S3

    # Copy JSON reference data
    aws s3 cp . s3://your-raw-bucket/youtube/raw_statistics_reference_data/ \
      --recursive --exclude "*" --include "*.json"
    
    # Copy CSV files with regional partitioning
    aws s3 cp CAvideos.csv s3://your-raw-bucket/youtube/raw_statistics/region=ca/
    aws s3 cp USvideos.csv s3://your-raw-bucket/youtube/raw_statistics/region=us/
    # ... (repeat for other regions)
  2. Deploy Lambda Function

    • Create Lambda function in AWS Console
    • Copy code from lambda function.py
    • Add AWS Data Wrangler layer
    • Configure environment variables
    • Set S3 trigger for JSON file uploads
  3. Create Glue Crawler (boto3 sketch after these steps)

    • Point to S3 raw data location
    • Configure to create database and tables
    • Schedule or run on-demand
  4. Deploy Glue ETL Job

    • Create Glue job in AWS Console
    • Copy code from Spark code Glue job.py or optimized version
    • Configure job parameters (DPU, timeout, etc.)
    • Set S3 output path
  5. Configure Athena

    • Create workgroup and query results location
    • Run queries against Glue Data Catalog tables
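
Step 3's crawler can also be created and started programmatically. A boto3 sketch with placeholder role, database, and path values:

```python
import boto3

glue = boto3.client("glue")

# Point the crawler at the raw statistics prefix; it infers the CSV schema and
# picks up the region=... folders as a partition column
glue.create_crawler(
    Name="youtube-raw-statistics-crawler",
    Role="arn:aws:iam::123456789012:role/glue-service-role",  # placeholder
    DatabaseName="db_youtube_raw",
    Targets={"S3Targets": [{"Path": "s3://your-raw-bucket/youtube/raw_statistics/"}]},
)
glue.start_crawler(Name="youtube-raw-statistics-crawler")
```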

💾 Data Schema

YouTube Statistics Table

| Column | Type | Description |
| --- | --- | --- |
| video_id | string | Unique video identifier |
| trending_date | string | Date the video was trending |
| title | string | Video title |
| channel_title | string | Channel name |
| category_id | long | Video category ID |
| publish_time | string | Video publish timestamp |
| tags | string | Video tags |
| views | long | View count |
| likes | long | Like count |
| dislikes | long | Dislike count |
| comment_count | long | Comment count |
| thumbnail_link | string | Thumbnail URL |
| comments_disabled | boolean | Whether comments are disabled |
| ratings_disabled | boolean | Whether ratings are disabled |
| video_error_or_removed | boolean | Whether the video is unavailable |
| description | string | Video description |
| region | string | Country/region code (partition column) |

βš™οΈ Key Features

Lambda Function

  • Event-driven processing: Automatically triggered on S3 uploads
  • Format conversion: JSON to Parquet transformation
  • Schema registration: Automatic Glue Catalog updates
  • Error handling: Robust exception management

Glue ETL Jobs

  • Serverless processing: No infrastructure management
  • Schema mapping: Automatic type conversion and validation
  • Data quality: Null field removal and conflict resolution
  • Partitioning: Regional data organization for optimized queries
  • Predicate pushdown: Filter data at source for improved performance
  • File optimization: Coalesce output to reduce small files

🔧 Optimization Techniques

  1. Pushdown Predicates (Spark code Glue job with pushdown predicate.py)

    • Filters data at the source level
    • Reduces data transfer and processing time
    • Example: predicate_pushdown = "region in ('ca','gb','us')" (applied in the sketch after this list)
  2. Data Partitioning

    • Partitioned by region for query efficiency
    • Enables partition pruning in Athena
  3. File Format

    • Parquet columnar format for compression and fast queries
    • Can cut storage costs by roughly 75% compared to CSV
  4. File Consolidation

    • Uses .coalesce(1) to reduce small file issues
    • Improves query performance
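
Putting technique 1 in code: the optimized job hands the predicate to the catalog read, so non-matching partitions are never loaded. A sketch reusing the glueContext from the job sketch above:

```python
# The predicate is evaluated against the region partition column before any
# data is read; files for other regions are skipped entirely
predicate_pushdown = "region in ('ca','gb','us')"
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_raw",
    table_name="raw_statistics",
    push_down_predicate=predicate_pushdown,
)
```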

📊 Use Cases

  • Analyze trending YouTube videos across multiple regions
  • Track video engagement metrics (views, likes, comments)
  • Compare trending patterns between countries
  • Identify popular categories and channels
  • Time-series analysis of video trends

πŸ” Security Best Practices

  • Use IAM roles with least privilege principle
  • Enable S3 bucket encryption at rest
  • Enable S3 bucket versioning for data protection
  • Use VPC endpoints for private communication
  • Enable AWS CloudTrail for audit logging
  • Implement S3 bucket policies and access controls

💰 Cost Optimization

  • Use S3 Intelligent-Tiering for automatic cost savings
  • Leverage Glue job bookmarks to avoid reprocessing data (sketch after this list)
  • Use Athena query result caching
  • Implement S3 lifecycle policies to archive old data
  • Monitor with AWS Cost Explorer and set budgets
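
Glue job bookmarks are enabled per job (the --job-bookmark-option job-bookmark-enable job argument) and track progress through each read's transformation_ctx. A sketch of the hook points, reusing names from the Glue job sketch above:

```python
# Bookmarks only track reads/writes that carry a transformation_ctx;
# job.init() restores and job.commit() persists the bookmark state
job.init(args["JOB_NAME"], args)

datasource = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_raw",
    table_name="raw_statistics",
    transformation_ctx="datasource",  # bookmark key for this read
)

# ... transforms and writes ...

job.commit()  # record progress so the next run skips already-processed data
```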

📈 Monitoring & Logging

  • CloudWatch Logs: Lambda and Glue job logs
  • CloudWatch Metrics: Execution duration, memory usage
  • Glue Job Metrics: DPU utilization, data processed
  • S3 Metrics: Request rates, data transfer
  • AWS X-Ray: Distributed tracing for Lambda

🛠️ Troubleshooting

Common Issues

  1. Lambda timeout: Increase timeout or optimize data processing
  2. Glue job failures: Check CloudWatch logs for errors
  3. S3 permissions: Verify IAM roles have correct policies
  4. Schema conflicts: Ensure consistent data types across sources
  5. Partition issues: Verify Hive-style partitioning format

🎯 Key Takeaways

  • Understanding ETL on Big Data
  • Building Data Lakes with staging layers (raw, cleansed, analytical)
  • Creating IAM Roles and Policies for secure access
  • Developing Lambda Functions for event-driven processing
  • Setting up Glue Jobs for serverless ETL
  • Using Glue Crawler and Glue Studio
  • Creating and managing Glue Data Catalog
  • Converting JSON to Parquet format for optimization
  • Performing Data Transformations and Joins with PySpark
  • Visualizing insights in QuickSight

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

👤 Author

Built as part of an AWS ETL pipeline learning project.

πŸ™ Acknowledgments

  • AWS Documentation and tutorials
  • YouTube Trending Dataset
  • Open-source community

Note: Remember to replace placeholder values (bucket names, account IDs, regions) with your actual AWS resources before deployment.
