niki2805/Realtime-Pipeline

Streaming Data Pipeline

A data pipeline that ingests streaming analytics data, such as user interaction events, using Kafka, Spark, Cassandra, and Presto.

Requirements

  1. Docker
  2. Python 3.10

Init

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Run Kafka

Start Kafka and ZooKeeper:

docker compose up -d zookeeper broker
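
The containers can take a few seconds to accept connections after `up -d` returns. A minimal sketch of a readiness check, assuming the broker is published on `localhost:9092` (the port is an assumption; check the compose file). `wait_for_port` is a hypothetical helper, not part of this repository:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only proves the port is open, not that
            # the broker has finished starting up.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False


# Usage (assumed host and port):
# wait_for_port("localhost", 9092)
```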

Run Cassandra

Start a 3-node Cassandra cluster:

docker compose up -d cassandra-1 cassandra-2 cassandra-3

Check Cassandra cluster status:

docker compose exec -it cassandra-1 bash -c 'nodetool status'
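
In the `nodetool status` output, each node line starts with a two-letter code; `UN` means Up/Normal. A small sketch that checks whether all three nodes have joined, given the captured output as a string. The function names are hypothetical:

```python
import re

# A node line starts with a status letter (U/D) and a state letter (N/L/J/M),
# followed by the node address.
NODE_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def node_states(nodetool_output: str) -> dict[str, str]:
    """Map each node address to its two-letter status code, e.g. 'UN'."""
    states = {}
    for line in nodetool_output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            states[match.group(2)] = match.group(1)
    return states


def cluster_ready(nodetool_output: str, expected_nodes: int = 3) -> bool:
    """True when the expected number of nodes is listed and all report UN."""
    states = node_states(nodetool_output)
    return len(states) == expected_nodes and all(
        code == "UN" for code in states.values()
    )
```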

Run the 'cassandraCQLScript.cql' script to create one keyspace per event category, each with replication factor 3, along with the tables that store the user interaction events.

CASSANDRA_CTR=$(docker container ls | grep 'cassandra-1' | awk '{print $1}')
docker cp cassandraCQLScript.cql "$CASSANDRA_CTR":/
docker exec -it "$CASSANDRA_CTR" cqlsh -f /cassandraCQLScript.cql
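
The script's contents are not shown here. As an illustration only, a keyspace statement with replication factor 3 could be built like this. The `<category>_keyspace` naming is inferred from the keyspace names used in the queries in this README, and `SimpleStrategy` is an assumption; `keyspace_ddl` is a hypothetical helper:

```python
def keyspace_ddl(category: str, replication_factor: int = 3) -> str:
    """Build a CREATE KEYSPACE statement for one event category."""
    # Reject anything that is not a plain identifier so the name can be
    # interpolated into the statement safely.
    if not category.isidentifier():
        raise ValueError(f"invalid category name: {category!r}")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {category}_keyspace "
        "WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}};"
    )
```

With three nodes and replication factor 3, every node holds a full copy of each keyspace.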

Query Cassandra with the cqlsh client:

docker compose exec -it cassandra-1 cqlsh

Or execute a query directly:

docker exec -it $CASSANDRA_CTR cqlsh -e 'SELECT * FROM browse_keyspace.events_count'

Run Presto

Start Presto:

docker compose up -d presto-coordinator presto-worker-1 presto-worker-2

Copy 'cassandra.properties' to the Presto container:

PRESTO_CTR=$(docker container ls | grep 'presto-coordinator' | awk '{print $1}')
docker cp ./presto-config/presto-cassandra-config/cassandra.properties $PRESTO_CTR:/opt/presto-server/etc/catalog/cassandra.properties

Confirm that cassandra.properties was copied to the Presto container:

docker exec -it $PRESTO_CTR sh -c "ls /opt/presto-server/etc/catalog"
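
For reference, a Presto catalog file for the Cassandra connector needs at least `connector.name` and `cassandra.contact-points`. The sketch below renders such a file; the repository's actual cassandra.properties may contain more settings, and using the compose service names as contact points is an assumption. `render_cassandra_catalog` is a hypothetical helper:

```python
def render_cassandra_catalog(contact_points: list[str]) -> str:
    """Render a minimal Presto catalog file for the Cassandra connector."""
    if not contact_points:
        raise ValueError("at least one contact point is required")
    lines = [
        "connector.name=cassandra",
        # Comma-separated list of Cassandra hosts Presto connects to first.
        "cassandra.contact-points=" + ",".join(contact_points),
    ]
    return "\n".join(lines) + "\n"


# Usage (assumed host names, taken from the compose service names):
# print(render_cassandra_catalog(["cassandra-1", "cassandra-2", "cassandra-3"]))
```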

Confirm that the Presto CLI can see the Cassandra catalog:

  1. Start the Presto CLI:
docker exec -it $PRESTO_CTR presto-cli
  2. Run the show command:
show catalogs;

If cassandra is not listed, restart the Presto container:

docker restart $PRESTO_CTR

Then repeat steps 1 and 2 and confirm that the cassandra catalog is now listed.
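
The same check can be scripted without an interactive session by passing the statement through the CLI's `--execute` option. A rough sketch, assuming the CLI binary inside the container is named `presto-cli` as in the commands above; the function names are hypothetical:

```python
import subprocess


def presto_cli_command(container: str, sql: str) -> list[str]:
    """Build a docker exec invocation that runs one statement and exits."""
    return ["docker", "exec", container, "presto-cli", "--execute", sql]


def catalog_visible(cli_output: str, catalog: str = "cassandra") -> bool:
    """True if the catalog name appears as its own line in the CLI output."""
    # Non-interactive output is CSV-style, so values may be double-quoted.
    names = {line.strip().strip('"') for line in cli_output.splitlines()}
    return catalog in names


def check_catalog(container: str) -> bool:
    """Run SHOW CATALOGS in the container and look for the cassandra catalog."""
    result = subprocess.run(
        presto_cli_command(container, "SHOW CATALOGS"),
        capture_output=True,
        text=True,
        check=True,
    )
    return catalog_visible(result.stdout)
```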

Use the Presto CLI:

docker exec -it $PRESTO_CTR presto-cli

Within the Presto CLI, run any query:

SELECT * FROM cassandra.watch_keyspace.events_count;

Run Spark Streaming

TOPIC=browse
./venv/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,com.datastax.spark:spark-cassandra-connector_2.12:3.2.0 src/spark.py $TOPIC
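
The two packages above supply the Kafka source and the Cassandra sink for Structured Streaming. src/spark.py is not reproduced here; the sketch below only shows how such a job is typically wired. The broker address, the `<topic>_keyspace` naming, and the table name are assumptions, and the parsing and aggregation step is left out because the source does not show it:

```python
def kafka_source_options(topic: str, bootstrap_servers: str = "localhost:9092") -> dict[str, str]:
    """Options for Spark's Kafka source (broker address is an assumption)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def cassandra_sink_options(topic: str, table: str = "events_count") -> dict[str, str]:
    """Options for the Cassandra connector (keyspace naming is an assumption)."""
    return {"keyspace": f"{topic}_keyspace", "table": table}


def start_stream(spark, topic: str):
    """Read the topic from Kafka and append each micro-batch to Cassandra.

    `spark` is an existing SparkSession. The columns written must match the
    target table, so a real job would transform `events` before the write.
    """
    events = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(topic))
        .load()
        .selectExpr("CAST(value AS STRING) AS value")
    )

    def write_batch(batch_df, batch_id):
        # The connector writes batch DataFrames, so each micro-batch is
        # handed to it through foreachBatch.
        (
            batch_df.write.format("org.apache.spark.sql.cassandra")
            .options(**cassandra_sink_options(topic))
            .mode("append")
            .save()
        )

    return events.writeStream.foreachBatch(write_batch).start()
```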

Run Data Generator

python src/data_generator.py
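
The generator's event format is not documented here. As an illustration, a synthetic user interaction event could be built and serialized like this; every field name is hypothetical, and the `watch` category is inferred from `watch_keyspace`. Sending the bytes to Kafka is left out:

```python
import json
import random
import time
import uuid

# Assumed categories: "browse" is the topic used above, "watch" is inferred.
CATEGORIES = ("browse", "watch")


def make_event(category: str, rng: random.Random) -> dict:
    """Build one synthetic user interaction event (field names are hypothetical)."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return {
        "event_id": str(uuid.UUID(int=rng.getrandbits(128))),
        "user_id": rng.randint(1, 1000),
        "category": category,
        "timestamp_ms": int(time.time() * 1000),
    }


def encode_event(event: dict) -> bytes:
    """Serialize an event as compact UTF-8 JSON, ready to use as a message value."""
    return json.dumps(event, separators=(",", ":")).encode("utf-8")
```

Passing a seeded `random.Random` keeps the IDs reproducible between runs, which helps when comparing counts in Cassandra against what was sent.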

Cleanup

docker compose down
