
Meta repository for Streaming + Batch Cloud Independent Sales Analytics


gggordon/schad-meta

S.C.H.A.D.: Streaming | Clickstream | Hadoop | Analytics | Datacenter

S.C.H.A.D. is a modular showcase project that tackles business and sales analytics challenges using open-source big data and streaming tools that the major cloud providers (AWS, Azure, GCP) also offer as managed services.

💡 This repository acts as an overview and central index. Each technical component has its own GitHub repository.


💡 Yes! I do incorporate cloud technologies from different providers in my daily workflow. I still find it valuable to compare offerings and understand the underlying technology in order to make better decisions on cost and effort.


🚀 Project Summary

The project demonstrates a modern analytics pipeline built on open-source distributed and streaming systems that are also available from cloud providers, such as:

  • Apache Kafka for ingesting clickstream data
  • Apache Spark for real-time and batch processing
  • Hive for SQL analytics on large datasets
  • Zeppelin for visualization and exploration

The goal: to simulate and solve real-world sales analytics problems without relying on any specific managed cloud service, keeping the pipeline portable across AWS, Azure, and GCP.
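To make the real-time processing step more concrete, here is a minimal pure-Python sketch of a tumbling-window count, the kind of aggregation a streaming job might run over clickstream events. It uses neither Spark nor Kafka and is not taken from the component repositories; the event fields `ts` and `page` and the 60-second window are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window start, page) pair.

    Each event is a dict with a numeric `ts` (seconds) and a `page`
    string. Both field names are hypothetical.
    """
    counts = defaultdict(int)
    for event in events:
        # Align the timestamp to the start of its fixed-size window.
        window_start = event["ts"] - (event["ts"] % window_seconds)
        counts[(window_start, event["page"])] += 1
    return dict(counts)


# Hypothetical sample events standing in for a Kafka topic.
events = [
    {"ts": 5, "page": "/home"},
    {"ts": 42, "page": "/home"},
    {"ts": 61, "page": "/cart"},
    {"ts": 119, "page": "/cart"},
    {"ts": 120, "page": "/home"},
]

result = tumbling_window_counts(events)
for (window_start, page), count in sorted(result.items()):
    print(window_start, page, count)
```

In a real deployment the same grouping would typically be expressed with the streaming engine's own windowing operators rather than hand-rolled arithmetic.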


🧩 System Architecture

flowchart TD
    subgraph DataGeneration
        A[Clickstream Generator]
    end

    subgraph Producers
        B1[Kafka Producer]
        B2[Akka Producer]
    end

    subgraph Messaging
        C[Kafka]
    end

    subgraph Processing
        D1[Spark Streaming]
        D2[Spark Batch\n + Hive]
    end

    subgraph Storage
        E1[HDFS / Parquet]
        E2[Hive Tables]
    end

    subgraph Visualization
        F[Zeppelin Dashboard]
    end

    subgraph Orchestration
        G1[Docker Compose]
        G2[Ansible Scripts]
    end

    A --> B1 --> C
    A --> B2 --> C
    C --> D1 --> E1 --> F
    C --> D2 --> E2 --> F
    G1 --> B1
    G1 --> B2
    G1 --> C
    G2 --> D1
    G2 --> D2
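
As a rough illustration of the data flow in the flowchart, the sketch below encodes its data-path edges as an adjacency list and enumerates every route from the generator to the dashboard. The orchestration edges (Docker Compose, Ansible) are left out because they describe deployment rather than data movement; the helper names `build_graph` and `all_paths` are hypothetical.

```python
from collections import defaultdict

# Data-flow edges copied from the flowchart (orchestration edges omitted).
EDGES = [
    ("Clickstream Generator", "Kafka Producer"),
    ("Clickstream Generator", "Akka Producer"),
    ("Kafka Producer", "Kafka"),
    ("Akka Producer", "Kafka"),
    ("Kafka", "Spark Streaming"),
    ("Kafka", "Spark Batch + Hive"),
    ("Spark Streaming", "HDFS / Parquet"),
    ("Spark Batch + Hive", "Hive Tables"),
    ("HDFS / Parquet", "Zeppelin Dashboard"),
    ("Hive Tables", "Zeppelin Dashboard"),
]


def build_graph(edges):
    """Turn (source, target) pairs into an adjacency list."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    return graph


def all_paths(graph, src, dst):
    """Return every path from src to dst in an acyclic graph."""
    if src == dst:
        return [[dst]]
    paths = []
    for nxt in graph.get(src, []):
        for tail in all_paths(graph, nxt, dst):
            paths.append([src] + tail)
    return paths


graph = build_graph(EDGES)
paths = all_paths(graph, "Clickstream Generator", "Zeppelin Dashboard")
for path in paths:
    print(" -> ".join(path))
```

Every route passes through Kafka, which reflects its role as the single messaging layer between the producers and the processing jobs.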


☁️ Cloud Mapping

This project is built with open-source technologies that map directly to cloud-native services across AWS, Azure, and GCP:

| S.C.H.A.D. Tool | AWS Equivalent | Azure Equivalent | GCP Equivalent |
| --- | --- | --- | --- |
| Kafka | Amazon MSK | Azure Event Hubs | Google Pub/Sub |
| Spark | AWS Glue / EMR | Azure Synapse / HDInsight | Dataproc / Dataflow |
| Hive | Athena / Glue Catalog | Synapse SQL Pools | BigQuery |
| Akka | Lambda / ECS | Azure Functions | Cloud Functions |
| Docker | ECS / EKS | AKS / ACI | GKE / Cloud Run |
| Zeppelin | SageMaker Studio | Synapse Notebooks | Colab / Notebooks AI |

graph TB
  subgraph Ingestion
    KAFKA[Kafka Open Source] --> AWSMSK[Amazon MSK]
    KAFKA --> AZUREEVENT[Azure Event Hubs]
    KAFKA --> GCPPUBSUB[Google Pub/Sub]
  end
graph TB
  subgraph Compute
    SPARK[Spark Streaming + Batch] --> AWSGLUE[AWS Glue / EMR]
    SPARK --> AZURESYNAPSE[Azure Synapse]
    SPARK --> GCPDATAPROC[GCP Dataproc]

    AKKA[Akka Producer] --> AWSLAMBDA[AWS Lambda]
    AKKA --> AZUREFUNC[Azure Functions]
    AKKA --> GCPCLOUDRUN[GCP Cloud Run]
  end
graph TB
  subgraph Storage & Query
    HIVE[Hive Open Source] --> AWSATHENA[Athena / Glue Catalog]
    HIVE --> AZURESQL[Synapse SQL]
    HIVE --> BIGQUERY[BigQuery]
  end
graph TB
  subgraph Visualization
    ZEPPELIN[Zeppelin Notebook] --> SAGEMAKER[Amazon SageMaker Studio]
    ZEPPELIN --> AZURENOTE[Azure Synapse Notebook]
    ZEPPELIN --> COLAB[GCP Colab]
  end
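
The mapping table in this section can also be kept as data, which makes it easy to look up a counterpart when porting a component. The following is only a sketch: the service names are copied from the table, while the `CLOUD_MAP` structure and the `equivalent` helper are hypothetical.

```python
# Service names copied from the Cloud Mapping table.
CLOUD_MAP = {
    "Kafka": {"AWS": "Amazon MSK", "Azure": "Azure Event Hubs", "GCP": "Google Pub/Sub"},
    "Spark": {"AWS": "AWS Glue / EMR", "Azure": "Azure Synapse / HDInsight", "GCP": "Dataproc / Dataflow"},
    "Hive": {"AWS": "Athena / Glue Catalog", "Azure": "Synapse SQL Pools", "GCP": "BigQuery"},
    "Akka": {"AWS": "Lambda / ECS", "Azure": "Azure Functions", "GCP": "Cloud Functions"},
    "Docker": {"AWS": "ECS / EKS", "Azure": "AKS / ACI", "GCP": "GKE / Cloud Run"},
    "Zeppelin": {"AWS": "SageMaker Studio", "Azure": "Synapse Notebooks", "GCP": "Colab / Notebooks AI"},
}


def equivalent(tool, provider):
    """Look up the cloud counterpart of an open-source tool.

    Matching is case-insensitive; an unknown tool or provider
    raises KeyError.
    """
    tools = {name.lower(): row for name, row in CLOUD_MAP.items()}
    row = tools[tool.lower()]
    providers = {name.lower(): service for name, service in row.items()}
    return providers[provider.lower()]


print(equivalent("kafka", "aws"))
print(equivalent("Hive", "GCP"))
```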


🔗 Component/Repository Breakdown

| Component | Description | Repo |
| --- | --- | --- |
| Clickstream Generator | Simulates user activity on a site | Repo |
| Kafka Producer | Pushes data into Kafka from simulated input | Repo |
| Akka Producer | Actor-based producer using Akka Streams | Repo |
| Spark Applications | Real-time + batch ETL and transformation logic | Repo |
| Hive SQL Layer | DDL and analytical SQL queries | Repo |
| Zeppelin Notebooks | Notebooks for analytics and visualization | Private |
| Data | Additional data for batch and lookup queries | Private |
| Orchestration | Docker, Ansible, and Ambari management | Private |
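
For readers who want a feel for what the Clickstream Generator component does, here is a minimal sketch of a simulated event source. It is not the code from the component repository: the page list, the field names (`user_id`, `page`, `ts`), and the JSON encoding are all assumptions, and the encoded bytes merely stand in for what a producer might send as a message value.

```python
import json
import random

# Hypothetical pages of the simulated site.
PAGES = ["/home", "/product", "/cart", "/checkout"]


def generate_clicks(n, seed=0, start_ts=0):
    """Yield n simulated click events with increasing timestamps.

    A seeded random generator keeps the output reproducible.
    """
    rng = random.Random(seed)
    ts = start_ts
    for _ in range(n):
        ts += rng.randint(1, 5)  # 1 to 5 seconds between clicks
        yield {
            "user_id": rng.randint(1, 100),
            "page": rng.choice(PAGES),
            "ts": ts,
        }


def to_message(event):
    """Serialize an event to UTF-8 JSON bytes."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


events = list(generate_clicks(3, seed=42))
messages = [to_message(event) for event in events]
for message in messages:
    print(message.decode("utf-8"))
```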

🧠 Why S.C.H.A.D.?

This project is deliberately modular and cloud-agnostic to demonstrate:

📦 Tooling knowledge across open-source big data and streaming frameworks

☁️ Cloud fluency via portable architecture mappings

🛠️ Configuration + Debugging skills with real-world integration issues


✉️ Queries

Connect with me!
