This project implements a complete serverless ETL (Extract, Transform, Load) pipeline on AWS to process and analyze YouTube trending video statistics data from multiple regions. The pipeline leverages AWS managed services to build a scalable, cost-effective data analytics solution.
Cloud-based data lake solutions enable rich analytics while organizing data into storage phases (raw, cleansed, and analytical). This project aims to securely manage, streamline, and analyze structured and semi-structured YouTube video data based on video categories and trending metrics.
This Kaggle dataset contains statistics (CSV files) on daily popular YouTube videos over several months. There are up to 200 trending videos published every day across multiple regions. Each region has its own data file containing:
- Video title, channel title, publication time
- Tags, views, likes, dislikes
- Description and comment count
- Category ID (linked via JSON reference files)
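Each regional CSV links to its JSON reference file through the category ID. A minimal pandas sketch of that join, assuming the Kaggle file layout (one `<REGION>videos.csv` per region plus a `<REGION>_category_id.json` reference file); `load_region` is a hypothetical helper, not project code:

```python
import json

import pandas as pd


def load_region(csv_path: str, category_json_path: str) -> pd.DataFrame:
    """Read one region's CSV and attach category titles from the JSON reference file."""
    videos = pd.read_csv(csv_path)
    with open(category_json_path) as f:
        reference = json.load(f)
    # category_id in the CSV links to items[].id in the JSON reference file
    categories = pd.DataFrame(
        [
            {"category_id": int(item["id"]),
             "category_title": item["snippet"]["title"]}
            for item in reference["items"]
        ]
    )
    return videos.merge(categories, on="category_id", how="left")
```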
The pipeline follows a medallion architecture with three data layers:
Raw Data (S3) → Lambda → Cleansed Layer (S3) → Glue ETL → Analytics Layer (S3) → Athena/QuickSight
- Amazon S3: Data lake storage (raw, cleansed, and analytics layers)
- AWS Lambda: Serverless data processing for JSON transformation
- AWS Glue: Serverless ETL jobs using Apache Spark
- AWS Glue Data Catalog: Centralized metadata repository
- Amazon Athena: SQL-based data querying
- Amazon QuickSight: Data visualization and dashboards
- Raw CSV and JSON data files are uploaded to S3 raw bucket
- Data is organized using Hive-style partitioning (`region=ca/`, `region=us/`, etc.)
- Supports multiple regions: CA, DE, FR, GB, IN, JP, KR, MX, RU, US
- Lambda Function: Triggered by S3 events when JSON files are uploaded
- Reads JSON data using AWS Data Wrangler
- Normalizes nested JSON structures using pandas
- Converts to Parquet format for optimized storage
- Registers data in Glue Data Catalog
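The Lambda steps above can be sketched as a short handler. This is a hedged sketch, not the shipped `lambda function.py`: it assumes the environment variables from the setup section are set and the AWS Data Wrangler layer is attached; `awswrangler`/`boto3` are imported lazily so the pure normalization step can be inspected without AWS access.

```python
import json
import os
import urllib.parse

import pandas as pd


def normalize_items(payload: dict) -> pd.DataFrame:
    """Flatten the nested items[].snippet.* fields into flat columns."""
    return pd.json_normalize(payload["items"])


def lambda_handler(event, context):
    import awswrangler as wr  # provided by the AWS Data Wrangler Lambda layer
    import boto3

    # S3 event carries the bucket and (URL-encoded) key of the uploaded JSON
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])

    obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    df = normalize_items(json.load(obj["Body"]))

    # Write Parquet to the cleansed layer and register it in the Glue Catalog
    return wr.s3.to_parquet(
        df=df,
        path=os.environ["s3_cleansed_layer"],
        dataset=True,
        database=os.environ["glue_catalog_db_name"],
        table=os.environ["glue_catalog_table_name"],
        mode=os.environ["write_data_operation"],
    )
```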
- Glue ETL Jobs: PySpark-based transformations
- Reads data from Glue Data Catalog
- Applies schema mapping and data type conversions
- Resolves data quality issues (null fields, type conflicts)
- Partitions data by region
- Implements predicate pushdown for query optimization
- Transformed data stored in Parquet format
- Partitioned by region for efficient querying
- Queryable via Amazon Athena
- Visualized using Amazon QuickSight
    .
    ├── README.md
    ├── Architecture.jpg                                 # Pipeline architecture diagram
    ├── lambda function.py                               # Lambda function for JSON processing
    ├── Spark code Glue job.py                           # Basic Glue ETL job
    ├── Spark code Glue job with pushdown predicate.py   # Optimized Glue job
    └── Amazon S3 CLI copy commands.sh                   # S3 data upload commands
- AWS Account with appropriate IAM permissions
- AWS CLI configured with credentials
- Python 3.8+
- Basic understanding of AWS services (S3, Lambda, Glue, Athena)
- **S3 Buckets**
  - Raw bucket: `s3://bigdata-on-youtube-raw-{region}-{account-id}-{env}/`
  - Cleansed bucket: `s3://bigdata-on-youtube-cleansed-{region}-{account-id}-{env}/`
  - Analytics bucket: `s3://bigdata-on-youtube-analytics-{region}-{account-id}-{env}/`
- **IAM Roles**
  - Lambda execution role with S3 and Glue permissions
  - Glue service role with S3 read/write permissions
- **Lambda Configuration**
  - Runtime: Python 3.9+
  - Layer: AWS Data Wrangler
  - Environment variables:
    - `s3_cleansed_layer`: Target S3 path
    - `glue_catalog_db_name`: Glue database name
    - `glue_catalog_table_name`: Glue table name
    - `write_data_operation`: Write mode (append/overwrite)
- **Upload Raw Data to S3**

      # Copy JSON reference data
      aws s3 cp . s3://your-raw-bucket/youtube/raw_statistics_reference_data/ \
        --recursive --exclude "*" --include "*.json"

      # Copy CSV files with regional partitioning
      aws s3 cp CAvideos.csv s3://your-raw-bucket/youtube/raw_statistics/region=ca/
      aws s3 cp USvideos.csv s3://your-raw-bucket/youtube/raw_statistics/region=us/
      # ... (repeat for other regions)
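The per-region copy commands above can also be scripted. A minimal boto3 sketch, assuming the Kaggle file naming convention; the bucket name is a placeholder and `raw_statistics_key` is a hypothetical helper:

```python
def raw_statistics_key(region: str) -> str:
    """Hive-style object key for one region's CSV, e.g. region=ca/CAvideos.csv."""
    return f"youtube/raw_statistics/region={region}/{region.upper()}videos.csv"


REGIONS = ["ca", "de", "fr", "gb", "in", "jp", "kr", "mx", "ru", "us"]


def upload_all(bucket: str) -> None:
    import boto3  # lazy import: requires AWS credentials to actually run

    s3 = boto3.client("s3")
    for region in REGIONS:
        local_file = f"{region.upper()}videos.csv"  # Kaggle's per-region file name
        s3.upload_file(local_file, bucket, raw_statistics_key(region))
```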
- **Deploy Lambda Function**
  - Create Lambda function in AWS Console
  - Copy code from `lambda function.py`
  - Add AWS Data Wrangler layer
  - Configure environment variables
  - Set S3 trigger for JSON file uploads
- **Create Glue Crawler**
  - Point to S3 raw data location
  - Configure to create database and tables
  - Schedule or run on-demand
- **Deploy Glue ETL Job**
  - Create Glue job in AWS Console
  - Copy code from `Spark code Glue job.py` or the optimized version
  - Configure job parameters (DPU, timeout, etc.)
  - Set S3 output path
- **Configure Athena**
  - Create workgroup and query results location
  - Run queries against Glue Data Catalog tables
| Column | Type | Description |
|---|---|---|
| video_id | string | Unique video identifier |
| trending_date | string | Date video was trending |
| title | string | Video title |
| channel_title | string | Channel name |
| category_id | long | Video category ID |
| publish_time | string | Video publish timestamp |
| tags | string | Video tags |
| views | long | View count |
| likes | long | Like count |
| dislikes | long | Dislike count |
| comment_count | long | Comment count |
| thumbnail_link | string | Thumbnail URL |
| comments_disabled | boolean | Comments status |
| ratings_disabled | boolean | Ratings status |
| video_error_or_removed | boolean | Video availability |
| description | string | Video description |
| region | string | Country/region code |
- Event-driven processing: Automatically triggered on S3 uploads
- Format conversion: JSON to Parquet transformation
- Schema registration: Automatic Glue Catalog updates
- Error handling: Robust exception management
- Serverless processing: No infrastructure management
- Schema mapping: Automatic type conversion and validation
- Data quality: Null field removal and conflict resolution
- Partitioning: Regional data organization for optimized queries
- Predicate pushdown: Filter data at source for improved performance
- File optimization: Coalesce output to reduce small files
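The transformation steps above can be sketched as a single Glue job. This is a hedged sketch rather than the shipped `Spark code Glue job.py`: catalog, table, and bucket names are placeholders, and the mapping list is abridged from the schema table. In a real Glue script the imports sit at top level; they are kept inside `run()` here only so the mapping list can be inspected standalone.

```python
# (source column, source type, target column, target type), per the schema table
MAPPINGS = [
    ("video_id", "string", "video_id", "string"),
    ("trending_date", "string", "trending_date", "string"),
    ("title", "string", "title", "string"),
    ("channel_title", "string", "channel_title", "string"),
    ("category_id", "long", "category_id", "long"),
    ("publish_time", "string", "publish_time", "string"),
    ("views", "long", "views", "long"),
    ("likes", "long", "likes", "long"),
    ("dislikes", "long", "dislikes", "long"),
    ("comment_count", "long", "comment_count", "long"),
    ("region", "string", "region", "string"),
]


def run():
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.transforms import ApplyMapping, DropNullFields, ResolveChoice
    from pyspark.context import SparkContext

    glue_ctx = GlueContext(SparkContext.getOrCreate())

    # Read from the catalog, listing only the requested region partitions
    src = glue_ctx.create_dynamic_frame.from_catalog(
        database="db_youtube_raw",        # placeholder catalog names
        table_name="raw_statistics",
        push_down_predicate="region in ('ca','gb','us')",
    )

    mapped = ApplyMapping.apply(frame=src, mappings=MAPPINGS)
    resolved = ResolveChoice.apply(frame=mapped, choice="make_struct")
    cleaned = DropNullFields.apply(frame=resolved)

    # Coalesce to one file per partition to limit the small-files problem
    out = DynamicFrame.fromDF(cleaned.toDF().coalesce(1), glue_ctx, "out")
    glue_ctx.write_dynamic_frame.from_options(
        frame=out,
        connection_type="s3",
        connection_options={
            "path": "s3://your-cleansed-bucket/youtube/raw_statistics/",
            "partitionKeys": ["region"],
        },
        format="parquet",
    )
```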
- **Pushdown Predicates** (`Spark code Glue job with pushdown predicate.py`)
  - Filters data at the source level
  - Reduces data transfer and processing time
  - Example: `predicate_pushdown = "region in ('ca','gb','us')"`
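A predicate string like the example above can be built from a region list. `build_region_predicate` is a hypothetical helper, not project code:

```python
def build_region_predicate(regions):
    """Build a Glue pushdown predicate over the region partition column."""
    quoted = ",".join(f"'{r}'" for r in regions)
    return f"region in ({quoted})"


# Passed as push_down_predicate to create_dynamic_frame.from_catalog, so Glue
# lists only the matching region=... partitions instead of scanning the table.
```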
- **Data Partitioning**
  - Partitioned by region for query efficiency
  - Enables partition pruning in Athena
- **File Format**
  - Parquet columnar format for compression and fast queries
  - Reduced storage costs by ~75% compared to CSV
- **File Consolidation**
  - Uses `.coalesce(1)` to reduce small-file issues
  - Improves query performance
- Analyze trending YouTube videos across multiple regions
- Track video engagement metrics (views, likes, comments)
- Compare trending patterns between countries
- Identify popular categories and channels
- Time-series analysis of video trends
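The use cases above run as Athena SQL against the analytics layer. A hypothetical query for the "popular categories" case, executed with awswrangler (imported lazily; requires AWS credentials and the catalog tables created by the pipeline); the database and table names are placeholders:

```python
TOP_CATEGORIES_SQL = """
SELECT region,
       category_id,
       SUM(views) AS total_views,
       SUM(likes) AS total_likes
FROM db_youtube_analytics.final_statistics
GROUP BY region, category_id
ORDER BY total_views DESC
LIMIT 10
"""


def run_top_categories():
    import awswrangler as wr  # lazy import: requires AWS access to actually run

    # Returns a pandas DataFrame of the Athena query results
    return wr.athena.read_sql_query(TOP_CATEGORIES_SQL, database="db_youtube_analytics")
```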
- Use IAM roles with least privilege principle
- Enable S3 bucket encryption at rest
- Enable S3 bucket versioning for data protection
- Use VPC endpoints for private communication
- Enable AWS CloudTrail for audit logging
- Implement S3 bucket policies and access controls
- Use S3 Intelligent-Tiering for automatic cost savings
- Leverage Glue job bookmarks to avoid reprocessing data
- Use Athena query result caching
- Implement S3 lifecycle policies to archive old data
- Monitor with AWS Cost Explorer and set budgets
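The lifecycle-policy recommendation above can be expressed as a bucket rule. A hedged sketch using boto3 (imported lazily); the 90-day threshold, prefix, and bucket name are illustrative placeholders, not project settings:

```python
# Transition old raw-layer objects to Glacier after 90 days (placeholder values)
LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "archive-raw-statistics",
            "Status": "Enabled",
            "Filter": {"Prefix": "youtube/raw_statistics/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }
    ]
}


def apply_lifecycle(bucket: str) -> None:
    import boto3  # lazy import: requires AWS credentials to actually run

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=LIFECYCLE_CONFIG
    )
```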
- CloudWatch Logs: Lambda and Glue job logs
- CloudWatch Metrics: Execution duration, memory usage
- Glue Job Metrics: DPU utilization, data processed
- S3 Metrics: Request rates, data transfer
- AWS X-Ray: Distributed tracing for Lambda
- Lambda timeout: Increase timeout or optimize data processing
- Glue job failures: Check CloudWatch logs for errors
- S3 permissions: Verify IAM roles have correct policies
- Schema conflicts: Ensure consistent data types across sources
- Partition issues: Verify Hive-style partitioning format
- Understanding ETL on Big Data
- Building Data Lakes with staging layers (raw, cleansed, analytical)
- Creating IAM Roles and Policies for secure access
- Developing Lambda Functions for event-driven processing
- Setting up Glue Jobs for serverless ETL
- Using Glue Crawler and Glue Studio
- Creating and managing Glue Data Catalog
- Converting JSON to Parquet format for optimization
- Performing Data Transformations and Joins with PySpark
- Visualizing insights in QuickSight
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Built as part of an AWS ETL pipeline learning project.
- AWS Documentation and tutorials
- YouTube Trending Dataset
- Open-source community
Note: Remember to replace placeholder values (bucket names, account IDs, regions) with your actual AWS resources before deployment.
