A data pipeline that ingests streaming analytics data, such as user interactions, using Kafka, Spark, Cassandra, and Presto.
- Docker
- Python 3.10
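A quick way to confirm that both prerequisites are on the PATH is sketched below. `have_cmd` is a hypothetical helper, not part of this repository, and the script only reports what it finds; comparing the printed Python version against 3.10 is left to the reader.

```shell
#!/usr/bin/env bash
# Sketch: report whether the prerequisites listed above are installed.
# have_cmd is a hypothetical helper introduced for this example.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

for tool in docker python3; do
  if have_cmd "$tool"; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done

# Print the interpreter version so it can be checked against 3.10 by eye.
if have_cmd python3; then
  python3 --version
fi
```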
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Start Kafka and Zookeeper:
docker compose up -d zookeeper broker

Start a 3-node Cassandra cluster:
docker-compose up -d cassandra-1 cassandra-2 cassandra-3

Check Cassandra cluster status:
docker-compose exec -it cassandra-1 bash -c 'nodetool status'

Execute the 'cassandraCQLScript.cql' script to create keyspaces for the different categories of events (replication factor 3) and the tables that store the user interaction events:
CASSANDRA_CTR=$(docker container ls | grep 'cassandra-1' | awk '{print $1}')
docker cp cassandraCQLScript.cql $CASSANDRA_CTR:/
docker exec -it $CASSANDRA_CTR cqlsh -f cassandraCQLScript.cql

Query Cassandra via the cqlsh client:
docker-compose exec -it cassandra-1 cqlsh

Or execute a query directly:
docker exec -it $CASSANDRA_CTR cqlsh -e 'SELECT * FROM browse_keyspace.events_count'

Start Presto:
docker-compose up -d presto-coordinator presto-worker-1 presto-worker-2

Copy 'cassandra.properties' to the Presto container:
PRESTO_CTR=$(docker container ls | grep 'presto-coordinator' | awk '{print $1}')
docker cp ./presto-config/presto-cassandra-config/cassandra.properties $PRESTO_CTR:/opt/presto-server/etc/catalog/cassandra.properties

Confirm that cassandra.properties was copied to the Presto container:
docker exec -it $PRESTO_CTR sh -c "ls /opt/presto-server/etc/catalog"

Confirm that the Presto CLI can see the Cassandra catalog:
- Start Presto CLI
docker exec -it $PRESTO_CTR presto-cli
- Run the show command
show catalogs;

If cassandra is not listed, restart the Presto container:
docker restart $PRESTO_CTR

Repeat the two steps above and confirm that the cassandra catalog is now listed.
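The catalog check and restart described above can also be scripted. The sketch below assumes that `presto-cli` in the container accepts the Presto CLI's `--execute` option and prints one catalog per line (possibly quoted, as in CSV output). `has_catalog` is a hypothetical helper, and the Docker call only runs when `PRESTO_CTR` is set.

```shell
#!/usr/bin/env bash
# has_catalog NAME: succeeds if NAME appears as a whole line on stdin.
# Double quotes and carriage returns are stripped first, since batch
# output from the CLI may be CSV-quoted (an assumption, not verified here).
has_catalog() {
  tr -d '"\r' | grep -qxF -- "$1"
}

# Only talk to Docker when a container ID is available.
if [ -n "${PRESTO_CTR:-}" ]; then
  if docker exec "$PRESTO_CTR" presto-cli --execute 'show catalogs' | has_catalog cassandra; then
    echo "cassandra catalog is visible"
  else
    echo "cassandra catalog not found; restarting Presto"
    docker restart "$PRESTO_CTR"
  fi
fi
```

After a restart the coordinator may need some time before it answers queries, so rerun the check rather than treating the first failure as final.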
Using Presto-CLI:
docker exec -it $PRESTO_CTR presto-cli

Within the Presto CLI, run any query:
SELECT * FROM cassandra.watch_keyspace.events_count;

Submit the Spark job for a topic:
TOPIC=browse
./venv/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,com.datastax.spark:spark-cassandra-connector_2.12:3.2.0 src/spark.py $TOPIC

Run the data generator:
python src/data_generator.py

Stop all containers:
docker compose down
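Before tearing the stack down, it can be useful to confirm that the Spark job wrote counts for the chosen topic. The sketch below builds the query from the topic name. The `<topic>_keyspace` naming is an assumption inferred from `browse_keyspace` and `watch_keyspace` above, and `count_query` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# count_query TOPIC: print a CQL query for that topic's events_count table.
# Assumes keyspaces are named "<topic>_keyspace", as with browse_keyspace
# and watch_keyspace; adjust if cassandraCQLScript.cql names them differently.
count_query() {
  printf 'SELECT * FROM %s_keyspace.events_count;' "$1"
}

TOPIC="${TOPIC:-browse}"

# Only talk to Docker when a container ID is available.
if [ -n "${CASSANDRA_CTR:-}" ]; then
  docker exec "$CASSANDRA_CTR" cqlsh -e "$(count_query "$TOPIC")"
else
  echo "CASSANDRA_CTR is not set; query would be: $(count_query "$TOPIC")"
fi
```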