
Pinterest Data Pipeline Project

Table of Contents

  1. Overview
  2. Architecture
  3. Features
  4. Getting Started
  5. Usage
  6. Future Improvements And Features
  7. Queries
  8. License

Overview

An end-to-end data engineering pipeline built on AWS, inspired by Pinterest's infrastructure. The pipeline implements a Lambda architecture to handle both batch and real-time stream processing of simulated user post data across three data entities — posts, geolocation, and user data.

Related Projects

Architecture

*Architecture diagram: the end-to-end Pinterest data pipeline on AWS*


Video Walkthrough

A short walkthrough of the architecture and data flow is coming soon.


Directory structure:

pinterest-data-pipeline684/
│
├── batch_notebooks/
│   ├── data_exploration_notebooks/
│   │   ├── data_exploration_geo.ipynb
│   │   ├── data_exploration_pin.ipynb
│   │   └── data_exploration_user.ipynb
│   ├── create_dataframes_from_s3.ipynb
│   ├── data_cleaning.ipynb
│   ├── mount_s3_to_databricks.ipynb
│   └── 0e95b18877fd_dag.py
│
├── stream_notebooks/
│   ├── data_cleaning_stream.ipynb
│   └── read_stream.ipynb
│
├── user_post_emulators/
│   ├── user_posting_emulation.py
│   └── user_posting_emulation_streaming.py
│
├── .github/
│   └── workflows/
│       └── lint.yml
│
├── .gitignore
├── Queries.ipynb
├── README.md
└── requirements.txt

Batch processing path

  1. Data Generation with User Posting Emulator: The user posting emulator simulates user activities — posts, comments, likes — generating synthetic data across three entities: Pinterest posts, geolocation, and user data.

  2. Data Ingestion via API Gateway: Data is sent to AWS API Gateway, which routes incoming requests into the AWS ecosystem.

  3. Data Routing to EC2 Instance: Data passes to an EC2 instance running the Confluent Kafka REST proxy, which publishes messages to Kafka topics.

  4. Streaming Data with Kafka and MSK: Data is published to Apache Kafka topics within Amazon MSK (Managed Streaming for Apache Kafka).

  5. MSK Connect and S3 Sink: A Kafka Connect S3 sink connector, deployed through MSK Connect, moves data from the Kafka topics into an S3 bucket for durable storage.

  6. Loading into Databricks: Databricks mounts the S3 bucket (mount_s3_to_databricks.ipynb) and reads the raw batch data into Spark DataFrames for deeper analysis.

  7. Data Processing in Databricks: Using Apache Spark, data undergoes cleaning, transformation, and aggregation through a bronze/silver/gold medallion architecture.

  8. Workflow Orchestration with Apache Airflow: Apache Airflow (MWAA) orchestrates the batch pipeline, scheduling daily runs of the data cleaning and transformation notebooks.
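As a sketch of the ingestion step, the emulator wraps each record in the JSON envelope the Confluent REST proxy expects before POSTing it to the API Gateway invoke URL. The record fields and URL below are placeholders; `user_posting_emulation.py` is the authoritative implementation.

```python
import json


def build_kafka_rest_payload(record: dict) -> tuple[str, dict]:
    """Wrap a record in the JSON envelope the Confluent REST proxy expects."""
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}
    body = json.dumps({"records": [{"value": record}]})
    return body, headers


if __name__ == "__main__":
    # Placeholder record; the real emulator pulls rows from the RDS source.
    body, headers = build_kafka_rest_payload({"index": 1, "category": "diy"})
    print(body)
    # Sending it requires a live API Gateway endpoint, so it is shown commented out:
    # import requests
    # requests.post(
    #     "https://YourAPIInvokeURL/YourDeploymentStage/topics/YourTopicName",
    #     data=body, headers=headers,
    # )
```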

Real-Time Data Streaming Path

  1. Real-Time Data Generation: The streaming emulator generates real-time events and sends them to AWS Kinesis Data Streams via API Gateway.

  2. Data Ingestion via AWS Kinesis: Three Kinesis streams (pin, geo, user) capture data in real time.

  3. Processing with Databricks and Spark Structured Streaming: Data from Kinesis is processed using Spark Structured Streaming, with results written to Delta Lake tables.

  4. Workflow Orchestration with Apache Airflow: A nightly Airflow DAG merges the previous day's streaming Delta tables with the accurate batch silver/gold tables, combining real-time currency with historical accuracy.

  5. Real-Time Analytics: Processed data is available for dashboards and analytics almost immediately via Databricks notebooks.
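The streaming path's request body can be sketched the same way: the emulator sends a JSON document naming the target stream, the record, and a partition key to the `/streams/<stream>/record` endpoint. The field names mirror Kinesis `PutRecord`; whether `Data` must be base64-encoded depends on the API Gateway mapping template, so treat this as an assumption and defer to `user_posting_emulation_streaming.py`.

```python
import json


def build_kinesis_put_record(stream_name: str, record: dict, partition_key: str) -> str:
    """Build the JSON body for a PutRecord call routed through API Gateway."""
    return json.dumps({
        "StreamName": stream_name,
        "Data": record,
        "PartitionKey": partition_key,
    })


if __name__ == "__main__":
    # Placeholder stream name and record for illustration.
    body = build_kinesis_put_record("pin", {"index": 1, "category": "diy"}, "pin-1")
    print(body)
    # Sending it requires a live endpoint, so it is shown commented out:
    # import requests
    # requests.put(
    #     "https://YourAPIInvokeURL/<YourDeploymentStage>/streams/pin/record",
    #     data=body, headers={"Content-Type": "application/json"},
    # )
```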

Features

  • Data Ingestion and Storage

    • Utilizes AWS S3 for secure, scalable storage of batch data.
    • Integrates AWS API Gateway and Kafka (via AWS MSK) for efficient real-time data streaming.
  • Data Processing

    • Leverages Databricks for advanced data analytics, employing Apache Spark for both batch and stream processing.
    • Implements a Lambda architecture, combining the accuracy of batch processing with the low latency of stream processing.
  • Workflow Orchestration

    • Uses Apache Airflow (AWS MWAA) for orchestrating and automating the data pipeline workflows, ensuring timely execution of data processing tasks.
  • Real-Time Data Streaming

    • Incorporates AWS Kinesis for real-time data collection and analysis, enabling immediate insights and responses.
  • Data Cleaning and Transformation

    • Applies comprehensive data cleaning techniques in Databricks notebooks to ensure data quality and reliability for analysis.
    • Employs custom Python scripts and Spark SQL for data transformation and preparation.
  • Analytics and Reporting

    • Facilitates advanced data analytics within Databricks, using both PySpark and Spark SQL for deep dives into the data.
  • CI/CD

    • GitHub Actions workflow running flake8 linting on every push, scoped to user_post_emulators/.
  • Security and Compliance

    • Adheres to best practices in cloud security, leveraging AWS's robust security features to protect data integrity and privacy.
  • Scalability and Performance

    • Designed for scalability: the managed AWS services (MSK, Kinesis, MWAA) and Databricks clusters can scale as data volume grows.
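As an illustration of the cleaning step, one plausible transformation is normalising follower counts written in shorthand ("25k", "1M") into integers. The shorthand format is an assumption about the raw data made for this sketch; the real cleaning logic lives in `data_cleaning.ipynb`.

```python
def parse_follower_count(value: str) -> int:
    """Convert shorthand counts such as '25k' or '1M' to plain integers."""
    suffixes = {"k": 1_000, "M": 1_000_000}
    suffix = value[-1] if value else ""
    if suffix in suffixes:
        return int(float(value[:-1]) * suffixes[suffix])
    return int(value)


if __name__ == "__main__":
    print(parse_follower_count("25k"))   # 25000
    print(parse_follower_count("1.5M"))  # 1500000
    print(parse_follower_count("420"))   # 420
```

In the Spark notebooks the same rule would typically be applied as a UDF or with `regexp_replace` over the follower-count column.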

Getting Started

Prerequisites

Additional Setup:

Installation

  1. Clone the Repository

```bash
git clone git@github.com:ASEIcode/pinterest-data-pipeline684.git
cd pinterest-data-pipeline684
```

  2. Set Up Your AWS Environment

- **Create Kafka topics on the MSK cluster** (one per data entity: pin, geo, user):

```bash
./kafka-topics.sh --bootstrap-server BootstrapServerString --command-config client.properties --create --topic <topic_name>
```
- **Create a custom Plugin with MSK Connect**: [Instructions](https://colab.research.google.com/drive/1zDDX7S1X2FxQF6Fnw6mmRmriEnTkYw9_?usp=sharing)

- **Set up API Gateway**:
  - [Set up REST API](https://colab.research.google.com/drive/1epCnS6ltyPtciuLG4vgWjte7pxXp-_vV?usp=sharing)
  - [Integrate with Kafka](https://colab.research.google.com/drive/1Zb_BaI8Nv-pL2mvr8yE1Y3d-sf4AP5lg?usp=sharing)

    Modify `user_posting_emulation.py`:

    1. Create `db_creds.yaml`:
HOST : <RDS HOST NAME>
USER : <YOUR USER NAME>
PASSWORD : <RDS PASSWORD>
DATABASE : <DATABASE NAME>
PORT : 3306
    2. Add `db_creds.yaml` to `.gitignore`
    3. Replace the invoke URL for each topic:

```
https://YourAPIInvokeURL/YourDeploymentStage/topics/YourTopicName
```
    4. Verify data is flowing by running a Kafka consumer per topic
    5. Confirm data appears in S3 under `topics/<TOPIC_NAME>/partition=0/`

- **Configure Kinesis Data Streams**:

    1. Create three streams (pin, geo, user) in the Kinesis console using On-demand capacity mode
    2. [Configure REST API to invoke Kinesis](https://colab.research.google.com/drive/1GnwFW22hNpslDmq6N73fXc7zjqhebXp-?usp=sharing)
    3. Modify `user_posting_emulation_streaming.py` — replace StreamName, Data fields, PartitionKey, and invoke URLs:

```
https://YourAPIInvokeURL/<YourDeploymentStage>/streams/<your_stream_name>/record
```
  3. Configure Databricks Workspace

  4. Initialise Apache Airflow (MWAA)

Usage

Batch Processing

  1. SSH into your EC2 instance:

```bash
ssh -i /path/to/your-key.pem ec2-user@your-ec2-public-dns.amazonaws.com
```

  2. Start the Confluent Kafka REST proxy:

```bash
./kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties
```

  3. Run the User Posting Emulator (it runs continuously until interrupted):

```bash
python user_posting_emulation.py
```

  4. Data flows through API Gateway → EC2 → Kafka/MSK → S3.

  5. Use `data_cleaning.ipynb` in Databricks to clean and transform the data.

  6. Use the provided Airflow DAG to schedule daily runs of the cleaning notebook.

Stream Processing

  1. Run the streaming emulator:

```bash
python user_posting_emulation_streaming.py
```

  2. Data flows into the three Kinesis streams (pin, geo, user).

  3. Run `data_cleaning_stream.ipynb` in Databricks to read, clean, and write each stream to a Delta table in real time.

  4. A nightly Airflow DAG merges the streaming Delta tables with the batch silver/gold tables.

Data Exploration

The data_exploration_notebooks/ folder inside batch_notebooks/ contains notebooks documenting the profiling and transformation decisions made for each data entity. Each notebook ends with the final cleaning code that feeds into data_cleaning.ipynb.

Queries

Queries.ipynb contains Spark SQL queries demonstrating insights available from the data:

  • Category popularity by country and by year
  • Geolocation insights — user distribution across regions
  • User engagement — follower counts by age group and join year
  • Temporal trends — user growth and activity patterns 2015–2020
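To make the first query concrete, here is the same aggregation sketched with sqlite3 so it runs without a Spark cluster. The table and column names (`pin`, `geo`, `ind`, `category`, `country`) are assumptions for illustration; the authoritative Spark SQL versions are in `Queries.ipynb`.

```python
import sqlite3

# Toy pin/geo tables mirroring (assumed) column names from the project schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pin (ind INTEGER, category TEXT);
    CREATE TABLE geo (ind INTEGER, country TEXT);
    INSERT INTO pin VALUES (1,'diy'),(2,'diy'),(3,'art'),(4,'art'),(5,'art');
    INSERT INTO geo VALUES (1,'US'),(2,'US'),(3,'US'),(4,'UK'),(5,'UK');
""")

# Category popularity by country: count posts per (country, category),
# with each country's most popular category sorted first.
rows = conn.execute("""
    SELECT country, category, COUNT(*) AS category_count
    FROM pin JOIN geo USING (ind)
    GROUP BY country, category
    ORDER BY country, category_count DESC
""").fetchall()
print(rows)  # [('UK', 'art', 2), ('US', 'diy', 2), ('US', 'art', 1)]
```

In Spark SQL, a `RANK() OVER (PARTITION BY country ORDER BY category_count DESC)` window would keep just the top category per country.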

Future Improvements And Features

  • Real-time dashboards — Grafana or Amazon QuickSight integrated with Kinesis streams
  • Advanced analytics — ML models for trend prediction and sentiment analysis
  • Data quality monitoring — Soda.io or Great Expectations with automated drift alerts
  • Interactive querying interface — ad-hoc query tool for non-technical users
  • Full containerised rebuild — see pinterest-pipeline-docker

License

This project is open-sourced under the MIT License. It is free to use, modify, and share for personal and commercial purposes with appropriate credit to the original author.