AWS ETL Pipeline for YouTube Data Analytics


📋 Project Overview

This project implements a complete serverless ETL (Extract, Transform, Load) pipeline on AWS to process and analyze YouTube trending-video statistics from multiple regions. The pipeline leverages AWS managed services to build a scalable, cost-effective data analytics solution.

Business Context

Cloud-based data lake solutions enable rich analytics while organizing data into storage layers (raw, cleansed, and analytical). This project securely manages, streamlines, and analyzes structured and semi-structured YouTube video data by video category and trending metrics.

Dataset

This Kaggle dataset contains daily statistics (CSV files) on trending YouTube videos collected over several months. Up to 200 trending videos are listed per day for each region, and each region has its own data file containing:

  • Video title, channel title, publication time
  • Tags, views, likes, dislikes
  • Description and comment count
  • Category ID (linked via JSON reference files)

πŸ—οΈ Architecture

The pipeline follows a medallion architecture with three data layers:

(Architecture diagram: see Architecture.jpg)

Raw Data (S3) → Lambda → Cleansed Layer (S3) → Glue ETL → Analytics Layer (S3) → Athena/QuickSight

Key Components:

  • Amazon S3: Data lake storage (raw, cleansed, and analytics layers)
  • AWS Lambda: Serverless data processing for JSON transformation
  • AWS Glue: Serverless ETL jobs using Apache Spark
  • AWS Glue Data Catalog: Centralized metadata repository
  • Amazon Athena: SQL-based data querying
  • Amazon QuickSight: Data visualization and dashboards

📊 Data Flow

1. Data Ingestion Layer

  • Raw CSV and JSON data files are uploaded to the S3 raw bucket
  • Data is organized using Hive-style partitioning (region=ca/, region=us/, etc.)
  • Supports multiple regions: CA, DE, FR, GB, IN, JP, KR, MX, RU, US

2. Data Processing Layer

  • Lambda Function: Triggered by S3 events when JSON files are uploaded (see the sketch after this list)
    • Reads JSON data using AWS Data Wrangler
    • Normalizes nested JSON structures using pandas
    • Converts to Parquet format for optimized storage
    • Registers data in Glue Data Catalog
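
A minimal sketch of that handler, assuming the AWS Data Wrangler (awswrangler) Lambda layer and the environment variables listed under Lambda Configuration below. It mirrors the flow of lambda function.py, but names and details are illustrative:

```python
import os
import urllib.parse

import awswrangler as wr
import pandas as pd

# Target location and catalog settings come from the Lambda environment variables
os_input_s3_cleansed_layer = os.environ["s3_cleansed_layer"]
os_input_glue_catalog_db_name = os.environ["glue_catalog_db_name"]
os_input_glue_catalog_table_name = os.environ["glue_catalog_table_name"]
os_input_write_data_operation = os.environ["write_data_operation"]  # "append" or "overwrite"


def lambda_handler(event, context):
    # The S3 event carries the bucket and URL-encoded key of the uploaded JSON file
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])

    try:
        # Read the raw JSON reference file and flatten its nested "items" array
        df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
        df_items = pd.json_normalize(df_raw["items"])

        # Write Parquet to the cleansed layer and register the table in the Glue Catalog
        return wr.s3.to_parquet(
            df=df_items,
            path=os_input_s3_cleansed_layer,
            dataset=True,
            database=os_input_glue_catalog_db_name,
            table=os_input_glue_catalog_table_name,
            mode=os_input_write_data_operation,
        )
    except Exception:
        print(f"Error processing s3://{bucket}/{key}")
        raise
```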

3. Data Transformation Layer

  • Glue ETL Jobs: PySpark-based transformations (see the sketch after this list)
    • Reads data from Glue Data Catalog
    • Applies schema mapping and data type conversions
    • Resolves data quality issues (null fields, type conflicts)
    • Partitions data by region
    • Implements predicate pushdown for query optimization
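
A condensed sketch of such a job. The database, table, bucket, and the abbreviated column mapping are placeholders; the real script (Spark code Glue job.py) carries the full mapping:

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, DropNullFields, ResolveChoice
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw statistics table from the Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_raw",
    table_name="raw_statistics",
    transformation_ctx="datasource",
)

# Cast columns to the target schema (abbreviated; source types are whatever
# the crawler inferred from the CSVs)
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("video_id", "string", "video_id", "string"),
        ("category_id", "string", "category_id", "long"),
        ("views", "string", "views", "long"),
        ("region", "string", "region", "string"),
    ],
)

# Resolve ambiguous column types, then drop null fields
resolved = ResolveChoice.apply(frame=mapped, choice="make_struct")
cleaned = DropNullFields.apply(frame=resolved)

# Coalesce to avoid many small files, then write Parquet partitioned by region
coalesced = DynamicFrame.fromDF(cleaned.toDF().coalesce(1), glueContext, "coalesced")
glueContext.write_dynamic_frame.from_options(
    frame=coalesced,
    connection_type="s3",
    connection_options={
        "path": "s3://your-cleansed-bucket/youtube/raw_statistics/",
        "partitionKeys": ["region"],
    },
    format="parquet",
)
job.commit()
```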

4. Data Analytics Layer

  • Transformed data stored in Parquet format
  • Partitioned by region for efficient querying
  • Queryable via Amazon Athena (example query after this list)
  • Visualized using Amazon QuickSight
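
For example, once the cleansed and analytics tables are registered in the Glue Data Catalog, they can be queried from Python through Athena with awswrangler. A sketch; the database and table names are placeholders:

```python
import awswrangler as wr

# Athena resolves the table through the Glue Data Catalog; filtering on the
# region partition column lets it prune to only the matching partitions
df = wr.athena.read_sql_query(
    sql="""
        SELECT region, title, views, likes
        FROM raw_statistics
        WHERE region IN ('ca', 'us')
        ORDER BY views DESC
        LIMIT 10
    """,
    database="db_youtube_cleaned",
)
print(df.head())
```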

πŸ“ Project Structure

.
├── README.md
├── Architecture.jpg                     # Pipeline architecture diagram
├── lambda function.py                   # Lambda function for JSON processing
├── Spark code Glue job.py               # Basic Glue ETL job
├── Spark code Glue job with pushdown predicate.py  # Optimized Glue job
└── Amazon S3 CLI copy commands.sh       # S3 data upload commands

🚀 Getting Started

Prerequisites

  • AWS Account with appropriate IAM permissions
  • AWS CLI configured with credentials
  • Python 3.8+
  • Basic understanding of AWS services (S3, Lambda, Glue, Athena)

Required AWS Services Setup

  1. S3 Buckets (boto3 creation sketch after this list)

    - Raw bucket: s3://bigdata-on-youtube-raw-{region}-{account-id}-{env}/
    - Cleansed bucket: s3://bigdata-on-youtube-cleansed-{region}-{account-id}-{env}/
    - Analytics bucket: s3://bigdata-on-youtube-analytics-{region}-{account-id}-{env}/
    
  2. IAM Roles

    • Lambda execution role with S3 and Glue permissions
    • Glue service role with S3 read/write permissions
  3. Lambda Configuration

    • Runtime: Python 3.9+
    • Layer: AWS Data Wrangler
    • Environment Variables:
      • s3_cleansed_layer: Target S3 path
      • glue_catalog_db_name: Glue database name
      • glue_catalog_table_name: Glue table name
      • write_data_operation: Write mode (append/overwrite)
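
As a starting point, the three buckets can be created with default encryption enabled via boto3. A sketch; the name below is a placeholder following the pattern above, to adapt per the Note at the end of this README:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

for layer in ("raw", "cleansed", "analytics"):
    # Placeholder name following the {layer}-{region}-{account-id}-{env} pattern
    bucket = f"bigdata-on-youtube-{layer}-useast1-123456789012-dev"
    s3.create_bucket(Bucket=bucket)  # outside us-east-1, add CreateBucketConfiguration
    # Default encryption at rest (see Security Best Practices below)
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
        },
    )
```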

Installation Steps

  1. Upload Raw Data to S3

    # Copy JSON reference data
    aws s3 cp . s3://your-raw-bucket/youtube/raw_statistics_reference_data/ \
      --recursive --exclude "*" --include "*.json"
    
    # Copy CSV files with regional partitioning
    aws s3 cp CAvideos.csv s3://your-raw-bucket/youtube/raw_statistics/region=ca/
    aws s3 cp USvideos.csv s3://your-raw-bucket/youtube/raw_statistics/region=us/
    # ... (repeat for other regions)
  2. Deploy Lambda Function

    • Create Lambda function in AWS Console
    • Copy code from lambda function.py
    • Add AWS Data Wrangler layer
    • Configure environment variables
    • Set S3 trigger for JSON file uploads
  3. Create Glue Crawler (boto3 sketch after these steps)

    • Point to S3 raw data location
    • Configure to create database and tables
    • Schedule or run on-demand
  4. Deploy Glue ETL Job

    • Create Glue job in AWS Console
    • Copy code from Spark code Glue job.py or optimized version
    • Configure job parameters (DPU, timeout, etc.)
    • Set S3 output path
  5. Configure Athena

    • Create workgroup and query results location
    • Run queries against Glue Data Catalog tables
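
Step 3's crawler can also be created and started programmatically. A boto3 sketch with placeholder role, database, and path values:

```python
import boto3

glue = boto3.client("glue")

# Point the crawler at the raw statistics prefix; it infers the CSV schema and
# picks up the region=... folders as a partition column
glue.create_crawler(
    Name="youtube-raw-statistics-crawler",
    Role="arn:aws:iam::123456789012:role/glue-service-role",  # placeholder
    DatabaseName="db_youtube_raw",
    Targets={"S3Targets": [{"Path": "s3://your-raw-bucket/youtube/raw_statistics/"}]},
)
glue.start_crawler(Name="youtube-raw-statistics-crawler")
```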

💾 Data Schema

YouTube Statistics Table

| Column | Type | Description |
| --- | --- | --- |
| video_id | string | Unique video identifier |
| trending_date | string | Date the video was trending |
| title | string | Video title |
| channel_title | string | Channel name |
| category_id | long | Video category ID |
| publish_time | string | Video publish timestamp |
| tags | string | Video tags |
| views | long | View count |
| likes | long | Like count |
| dislikes | long | Dislike count |
| comment_count | long | Comment count |
| thumbnail_link | string | Thumbnail URL |
| comments_disabled | boolean | Whether comments are disabled |
| ratings_disabled | boolean | Whether ratings are disabled |
| video_error_or_removed | boolean | Whether the video is unavailable |
| description | string | Video description |
| region | string | Country/region code (partition column) |

βš™οΈ Key Features

Lambda Function

  • Event-driven processing: Automatically triggered on S3 uploads
  • Format conversion: JSON to Parquet transformation
  • Schema registration: Automatic Glue Catalog updates
  • Error handling: Robust exception management

Glue ETL Jobs

  • Serverless processing: No infrastructure management
  • Schema mapping: Automatic type conversion and validation
  • Data quality: Null field removal and conflict resolution
  • Partitioning: Regional data organization for optimized queries
  • Predicate pushdown: Filter data at source for improved performance
  • File optimization: Coalesce output to reduce small files

🔧 Optimization Techniques

  1. Pushdown Predicates (Spark code Glue job with pushdown predicate.py)

    • Filters data at the source level
    • Reduces data transfer and processing time
    • Example: predicate_pushdown = "region in ('ca','gb','us')" (applied in the sketch after this list)
  2. Data Partitioning

    • Partitioned by region for query efficiency
    • Enables partition pruning in Athena
  3. File Format

    • Parquet columnar format for compression and fast queries
    • Can cut storage costs by roughly 75% compared to CSV
  4. File Consolidation

    • Uses .coalesce(1) to reduce small file issues
    • Improves query performance
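
Putting technique 1 in code: the optimized job hands the predicate to the catalog read, so non-matching partitions are never loaded. A sketch reusing the glueContext from the job sketch above:

```python
# The predicate is evaluated against the region partition column before any
# data is read; files for other regions are skipped entirely
predicate_pushdown = "region in ('ca','gb','us')"
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_raw",
    table_name="raw_statistics",
    push_down_predicate=predicate_pushdown,
)
```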

📊 Use Cases

  • Analyze trending YouTube videos across multiple regions
  • Track video engagement metrics (views, likes, comments)
  • Compare trending patterns between countries
  • Identify popular categories and channels
  • Time-series analysis of video trends

πŸ” Security Best Practices

  • Use IAM roles with least privilege principle
  • Enable S3 bucket encryption at rest
  • Enable S3 bucket versioning for data protection
  • Use VPC endpoints for private communication
  • Enable AWS CloudTrail for audit logging
  • Implement S3 bucket policies and access controls

💰 Cost Optimization

  • Use S3 Intelligent-Tiering for automatic cost savings
  • Leverage Glue job bookmarks to avoid reprocessing data (sketch after this list)
  • Use Athena query result caching
  • Implement S3 lifecycle policies to archive old data
  • Monitor with AWS Cost Explorer and set budgets
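
Glue job bookmarks are enabled per job (the --job-bookmark-option job-bookmark-enable job argument) and track progress through each read's transformation_ctx. A sketch of the hook points, reusing names from the Glue job sketch above:

```python
# Bookmarks only track reads/writes that carry a transformation_ctx;
# job.init() restores and job.commit() persists the bookmark state
job.init(args["JOB_NAME"], args)

datasource = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_raw",
    table_name="raw_statistics",
    transformation_ctx="datasource",  # bookmark key for this read
)

# ... transforms and writes ...

job.commit()  # record progress so the next run skips already-processed data
```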

📈 Monitoring & Logging

  • CloudWatch Logs: Lambda and Glue job logs
  • CloudWatch Metrics: Execution duration, memory usage
  • Glue Job Metrics: DPU utilization, data processed
  • S3 Metrics: Request rates, data transfer
  • AWS X-Ray: Distributed tracing for Lambda

🛠️ Troubleshooting

Common Issues

  1. Lambda timeout: Increase timeout or optimize data processing
  2. Glue job failures: Check CloudWatch logs for errors
  3. S3 permissions: Verify IAM roles have correct policies
  4. Schema conflicts: Ensure consistent data types across sources
  5. Partition issues: Verify Hive-style partitioning format

🎯 Key Takeaways

  • Understanding ETL on Big Data
  • Building Data Lakes with staging layers (raw, cleansed, analytical)
  • Creating IAM Roles and Policies for secure access
  • Developing Lambda Functions for event-driven processing
  • Setting up Glue Jobs for serverless ETL
  • Using Glue Crawler and Glue Studio
  • Creating and managing Glue Data Catalog
  • Converting JSON to Parquet format for optimization
  • Performing Data Transformations and Joins with PySpark
  • Visualizing insights in QuickSight

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

👤 Author

Built as part of an AWS ETL pipeline learning project.

πŸ™ Acknowledgments

  • AWS Documentation and tutorials
  • YouTube Trending Dataset
  • Open-source community

Note: Remember to replace placeholder values (bucket names, account IDs, regions) with your actual AWS resources before deployment.
