S.C.H.A.D. is a modular showcase project that tackles business and sales analytics challenges using open-source big data and streaming tools that cloud providers (AWS/Azure/GCP) also offer as managed services.
💡 This repository acts as an overview and central index. Each technical component has its own GitHub repository.
💡 Yes! I do use cloud technologies from different providers in my daily workflow. I still find it valuable to compare offerings and understand the underlying technology in order to make better decisions on cost and effort.
The project demonstrates an implementation of a modern analytics pipeline using open-source distributed and streaming systems (all of which have managed equivalents in the cloud) such as:
- Apache Kafka for ingesting clickstream data
- Apache Spark for real-time and batch processing
- Hive for SQL analytics on large datasets
- Zeppelin for visualization and exploration
The goal: simulate and solve real-world sales analytics problems without relying on any specific managed cloud service, keeping the stack portable across AWS, Azure, and GCP.
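As a taste of what the ingestion side looks like, here is a minimal Scala sketch of a simulated clickstream event being pushed to Kafka with the plain `kafka-clients` producer. The broker address, topic name (`clickstream`), and event shape are assumptions for illustration, not the actual repository code:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ClickstreamProducerSketch {
  def main(args: Array[String]): Unit = {
    // Plain kafka-clients producer; broker address and topic are assumptions.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // A simulated clickstream event as a JSON string (shape is illustrative only).
    val event = """{"userId":"u-42","page":"/products/123","action":"click","ts":1700000000}"""
    producer.send(new ProducerRecord[String, String]("clickstream", "u-42", event))

    producer.flush()
    producer.close()
  }
}
```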
```mermaid
flowchart TD
    subgraph DataGeneration
        A[Clickstream Generator]
    end
    subgraph Producers
        B1[Kafka Producer]
        B2[Akka Producer]
    end
    subgraph Messaging
        C[Kafka]
    end
    subgraph Processing
        D1[Spark Streaming]
        D2["Spark Batch + Hive"]
    end
    subgraph Storage
        E1[HDFS / Parquet]
        E2[Hive Tables]
    end
    subgraph Visualization
        F[Zeppelin Dashboard]
    end
    subgraph Orchestration
        G1[Docker Compose]
        G2[Ansible Scripts]
    end
    A --> B1 --> C
    A --> B2 --> C
    C --> D1 --> E1 --> F
    C --> D2 --> E2 --> F
    G1 --> B1
    G1 --> B2
    G1 --> C
    G2 --> D1
    G2 --> D2
```
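To make the Kafka → Spark Streaming → HDFS/Parquet path in the diagram concrete, here is a rough Spark Structured Streaming sketch that consumes the clickstream topic and appends Parquet files. The broker, topic, and paths are illustrative assumptions (and the `spark-sql-kafka` connector is assumed as a dependency); the real jobs live in the Spark Applications repository:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ClickstreamStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("clickstream-streaming-sketch")
      .getOrCreate()

    // Read raw clickstream events from Kafka (broker and topic are assumptions).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "clickstream")
      .load()
      .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))

    // Append the raw events to HDFS as Parquet; paths are illustrative.
    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/clickstream/raw")
      .option("checkpointLocation", "hdfs:///checkpoints/clickstream-raw")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```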
This project is built with open-source technologies that map directly to cloud-native services across AWS, Azure, and GCP:
| S.C.H.A.D. Tool | AWS Equivalent | Azure Equivalent | GCP Equivalent |
|---|---|---|---|
| Kafka | Amazon MSK | Azure Event Hubs | Google Pub/Sub |
| Spark | AWS Glue / EMR | Azure Synapse / HDInsight | Dataproc / Dataflow |
| Hive | Athena / Glue Catalog | Synapse SQL Pools | BigQuery |
| Akka | Lambda / ECS | Azure Functions | Cloud Functions |
| Docker | ECS / EKS | AKS / ACI | GKE / Cloud Run |
| Zeppelin | SageMaker Studio | Synapse Notebooks | Colab / Vertex AI Workbench |
```mermaid
graph TB
    subgraph Ingestion
        KAFKA[Kafka Open Source] --> AWSMSK[Amazon MSK]
        KAFKA --> AZUREEVENT[Azure Event Hubs]
        KAFKA --> GCPPUBSUB[Google Pub/Sub]
    end
```
```mermaid
graph TB
    subgraph Compute
        SPARK["Spark Streaming + Batch"] --> AWSGLUE[AWS Glue / EMR]
        SPARK --> AZURESYNAPSE[Azure Synapse]
        SPARK --> GCPDATAPROC[GCP Dataproc]
        AKKA[Akka Producer] --> AWSLAMBDA[AWS Lambda]
        AKKA --> AZUREFUNC[Azure Functions]
        AKKA --> GCPCLOUDRUN[GCP Cloud Run]
    end
```
```mermaid
graph TB
    subgraph StorageQuery["Storage & Query"]
        HIVE[Hive Open Source] --> AWSATHENA[Athena / Glue Catalog]
        HIVE --> AZURESQL[Synapse SQL]
        HIVE --> BIGQUERY[BigQuery]
    end
```
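In this stack, Hive plays the role that Athena, Synapse SQL, or BigQuery play in a managed setup. As a rough sketch of that layer, the snippet below runs an analytical query through Spark's built-in Hive support; the database, table, and columns are hypothetical and only meant to illustrate the idea:

```scala
import org.apache.spark.sql.SparkSession

object HiveBatchSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark use the Hive metastore and query Hive tables.
    val spark = SparkSession.builder()
      .appName("hive-batch-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Database, table, and columns are hypothetical, for illustration only.
    val topPages = spark.sql(
      """
        |SELECT page, COUNT(*) AS clicks
        |FROM sales_analytics.clickstream_events
        |GROUP BY page
        |ORDER BY clicks DESC
        |LIMIT 10
      """.stripMargin)

    topPages.show()
    spark.stop()
  }
}
```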
```mermaid
graph TB
    subgraph Visualization
        ZEPPELIN[Zeppelin Notebook] --> SAGEMAKER[Amazon SageMaker Studio]
        ZEPPELIN --> AZURENOTE[Azure Synapse Notebook]
        ZEPPELIN --> COLAB[GCP Colab]
    end
```
| Component | Description | Repo |
|---|---|---|
| Clickstream Generator | Simulates user activity on a site | Repo |
| Kafka Producer | Pushes data into Kafka from simulated input | Repo |
| Akka Producer | Actor-based producer using Akka Streams (see the sketch after this table) | Repo |
| Spark Applications | Real-time + batch ETL and transformation logic | Repo |
| Hive SQL Layer | DDL and analytical SQL queries | Repo |
| Zeppelin Notebooks | Notebooks for analytics and visualization | Private |
| Data | Additional data for batch and lookup queries | Private |
| Orchestration | Docker, Ansible, and Ambari management | Private |
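For the Akka Producer component listed above, a minimal sketch using Akka Streams with Alpakka Kafka could look like the following. The actor system name, topic, and generated events are assumptions for illustration and not the actual repository code (the `akka-stream-kafka` dependency is assumed):

```scala
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

object AkkaProducerSketch extends App {
  implicit val system: ActorSystem = ActorSystem("clickstream-producer")

  // Alpakka Kafka producer settings; broker address is an assumption.
  val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092")

  // Emit a simulated clickstream event per element and sink the stream into Kafka.
  Source(1 to 100)
    .map(i => s"""{"userId":"u-$i","page":"/products/$i","action":"view"}""")
    .map(value => new ProducerRecord[String, String]("clickstream", value))
    .runWith(Producer.plainSink(producerSettings))
}
```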
This project is deliberately modular and cloud-agnostic to demonstrate:
📦 Tooling knowledge across open-source big data and streaming frameworks
☁️ Cloud fluency via portable architecture mappings
🛠️ Configuration + debugging skills with real-world integration issues