# pinterest-data-pipeline692

## Table of Contents
- Description
- Skills Covered
- Installation Instructions
- Usage Instructions
- Project File Structure
- License Information
## Description
This project emulates a data stream of Pinterest pins, their geolocation data, and the users posting them. The data is then processed in Databricks notebooks, both as a batch and as a stream.
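The emulation scripts follow a simple loop: fetch a record, send it to the API, then wait a random interval before the next post. Here is a minimal sketch of that loop, assuming a hypothetical API Gateway endpoint that fronts the Kafka REST proxy; the URL and the example record are placeholders, not the project's real values:

```python
import json
import random
from time import sleep

import requests

# Hypothetical invoke URL -- substitute your own API Gateway endpoint.
API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/topics/pin"


def post_pin(record: dict) -> None:
    """POST one emulated pin record to the API in Kafka REST proxy format."""
    payload = json.dumps({"records": [{"value": record}]})
    response = requests.post(
        API_URL,
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        data=payload,
    )
    response.raise_for_status()


while True:
    # The real scripts pull a random row from a source database;
    # a static example record stands in here.
    record = {"unique_id": "example-0001", "title": "Example pin", "category": "diy"}
    post_pin(record)
    sleep(random.uniform(0, 2))  # mimic irregular user posting activity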
## Skills Covered
Learning achieved through this project includes:
- Experience with AWS:
  - EC2 instances
  - MSK
  - S3 buckets
  - API setup
  - Kinesis
  - Plugins
  - Connectors
- Experience with Python:
  - Function definitions
  - Class definitions
  - Class attributes
  - The SQLAlchemy module (see the sketch after this list)
- Experience with SQL
- Experience with git & GitHub:
  - Making local changes
  - Committing and pushing changes back to the cloud repository
  - README creation and editing
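As an illustration of the SQLAlchemy pattern referenced above, here is a minimal sketch of a connector class that holds its connection details as class attributes and pulls one random row from the source database. All credentials, the table name, and the row-count bound are placeholders, not the project's real values:

```python
import random

import sqlalchemy
from sqlalchemy import text


class AWSDBConnector:
    """Holds the source-database connection details as class attributes."""

    # All values below are placeholders.
    HOST = "your-endpoint.rds.amazonaws.com"
    USER = "admin"
    PASSWORD = "password"
    DATABASE = "pinterest_data"
    PORT = 3306

    def create_db_connector(self):
        # Requires the pymysql driver: pip install pymysql
        return sqlalchemy.create_engine(
            f"mysql+pymysql://{self.USER}:{self.PASSWORD}"
            f"@{self.HOST}:{self.PORT}/{self.DATABASE}?charset=utf8mb4"
        )


engine = AWSDBConnector().create_db_connector()

with engine.connect() as connection:
    # Select one random row to emulate a single user posting event.
    offset = random.randint(0, 10000)
    result = connection.execute(text(f"SELECT * FROM pinterest_data LIMIT {offset}, 1"))
    for row in result:
        print(dict(row._mapping))
```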
## Installation Instructions
- Download the project directory
- Create the necessary infrastructure on AWS and update the scripts accordingly (a verification sketch follows these installation steps):
  - EC2 instance
  - S3 bucket
  - MSK connector
  - Kinesis streams
  - APIs
- Upload both notebooks to Databricks
- Create and store the required .pem file in the local project directory
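The infrastructure itself is created through the AWS console, but a short boto3 sketch such as the following can confirm the pieces are in place before running anything. The bucket and stream names are placeholders:

```python
import boto3

S3_BUCKET = "your-pinterest-data-bucket"  # placeholder name
KINESIS_STREAMS = [
    "streaming-pin",   # placeholder names
    "streaming-geo",
    "streaming-user",
]

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

# head_bucket raises a ClientError if the bucket is missing or inaccessible.
s3.head_bucket(Bucket=S3_BUCKET)
print(f"S3 bucket '{S3_BUCKET}' is reachable")

for stream in KINESIS_STREAMS:
    summary = kinesis.describe_stream_summary(StreamName=stream)
    status = summary["StreamDescriptionSummary"]["StreamStatus"]
    print(f"Kinesis stream '{stream}': {status}")
```

If any call raises, create or fix the missing resource before moving on to the usage steps.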
## Usage Instructions
- Start the API via the EC2 instance
- Run the emulation scripts
- Run the processing notebooks (sketched below)
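Inside Databricks, the batch notebook reads the JSON objects the MSK connector delivered to S3, while the streaming notebook reads from Kinesis with Structured Streaming. A rough sketch of both reads follows; the bucket, stream, region, and table names are placeholders, and `spark` is predefined in Databricks notebooks:

```python
from pyspark.sql.functions import col

# Batch: read the JSON objects the MSK sink connector landed in S3
# (placeholder bucket and prefix).
df_pin = spark.read.json("s3a://your-bucket-name/topics/pin/partition=0/")

# ...cleaning and transformation steps would go here...

# Stream: read a Kinesis stream with Structured Streaming
# (placeholder stream name and region).
df_stream = (
    spark.readStream
    .format("kinesis")
    .option("streamName", "streaming-pin")
    .option("region", "us-east-1")
    .option("initialPosition", "earliest")
    .load()
)

# Kinesis delivers each record as bytes in the `data` column;
# cast it to a string before parsing the JSON payload.
decoded = df_stream.select(col("data").cast("string").alias("json_value"))

# Write the stream to a Delta table (placeholder checkpoint path and table name).
query = (
    decoded.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/kinesis/_checkpoints/")
    .table("streaming_pin_table")
)
```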
## Project File Structure
- Project directory
  - README.md
  - user_posting_emulation.py
  - user_posting_emulation_streaming.py
  - 0a25072a5e0f_dag.py
  - batch_processing_notebook.ipynb
  - stream_processing_notebook.ipynb
## License Information
pinterest-data-pipeline692 Copyright (C) 2023 Daniel Killen
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.