S.C.H.A.D. is a modular showcase project that tackles business and sales analytics challenges using open-source big data and streaming tools that cloud providers (AWS/Azure/GCP) also offer as managed services.
💡 This repository acts as an overview and central index. Each technical component has its own GitHub repository.
💡 Yes! I do use cloud technologies from different providers in my daily workflow. I still find it valuable to compare offerings and understand the underlying technology in order to make better decisions on cost and effort.
The project demonstrates an implementation of a modern analytics pipeline using open-source distributed and streaming systems (all of which have managed equivalents in the cloud) such as:
- Apache Kafka for ingesting clickstream data
- Apache Spark for real-time and batch processing
- Hive for SQL analytics on large datasets
- Zeppelin for visualization and exploration
The goal: simulate and solve real-world sales analytics problems without relying on any specific managed cloud service, keeping the stack portable across AWS, Azure, and GCP.
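As a taste of what the ingestion side looks like, here is a minimal Scala sketch of a simulated clickstream event being pushed to Kafka with the plain `kafka-clients` producer. The broker address, topic name (`clickstream`), and event shape are assumptions for illustration, not the actual repository code:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ClickstreamProducerSketch {
  def main(args: Array[String]): Unit = {
    // Plain kafka-clients producer; broker address and topic are assumptions.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // A simulated clickstream event as a JSON string (shape is illustrative only).
    val event = """{"userId":"u-42","page":"/products/123","action":"click","ts":1700000000}"""
    producer.send(new ProducerRecord[String, String]("clickstream", "u-42", event))

    producer.flush()
    producer.close()
  }
}
```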
```mermaid
flowchart TD
    subgraph DataGeneration
        A[Clickstream Generator]
    end
    subgraph Producers
        B1[Kafka Producer]
        B2[Akka Producer]
    end
    subgraph Messaging
        C[Kafka]
    end
    subgraph Processing
        D1[Spark Streaming]
        D2["Spark Batch + Hive"]
    end
    subgraph Storage
        E1[HDFS / Parquet]
        E2[Hive Tables]
    end
    subgraph Visualization
        F[Zeppelin Dashboard]
    end
    subgraph Orchestration
        G1[Docker Compose]
        G2[Ansible Scripts]
    end
    A --> B1 --> C
    A --> B2 --> C
    C --> D1 --> E1 --> F
    C --> D2 --> E2 --> F
    G1 --> B1
    G1 --> B2
    G1 --> C
    G2 --> D1
    G2 --> D2
```
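To make the Kafka → Spark Streaming → HDFS/Parquet path in the diagram concrete, here is a rough Spark Structured Streaming sketch that consumes the clickstream topic and appends Parquet files. The broker, topic, and paths are illustrative assumptions (and the `spark-sql-kafka` connector is assumed as a dependency); the real jobs live in the Spark Applications repository:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ClickstreamStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("clickstream-streaming-sketch")
      .getOrCreate()

    // Read raw clickstream events from Kafka (broker and topic are assumptions).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "clickstream")
      .load()
      .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))

    // Append the raw events to HDFS as Parquet; paths are illustrative.
    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/clickstream/raw")
      .option("checkpointLocation", "hdfs:///checkpoints/clickstream-raw")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```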
This project is built with open-source technologies that map directly to cloud-native services across AWS, Azure, and GCP:
| S.C.H.A.D. Tool | AWS Equivalent | Azure Equivalent | GCP Equivalent |
|---|---|---|---|
| Kafka | Amazon MSK | Azure Event Hubs | Google Pub/Sub |
| Spark | AWS Glue / EMR | Azure Synapse / HDInsight | Dataproc / Dataflow |
| Hive | Athena / Glue Catalog | Synapse SQL Pools | BigQuery |
| Akka | Lambda / ECS | Azure Functions | Cloud Functions |
| Docker | ECS / EKS | AKS / ACI | GKE / Cloud Run |
| Zeppelin | SageMaker Studio | Synapse Notebooks | Colab / Vertex AI Workbench |
```mermaid
graph TB
    subgraph Ingestion
        KAFKA[Kafka Open Source] --> AWSMSK[Amazon MSK]
        KAFKA --> AZUREEVENT[Azure Event Hubs]
        KAFKA --> GCPPUBSUB[Google Pub/Sub]
    end
```
```mermaid
graph TB
    subgraph Compute
        SPARK["Spark Streaming + Batch"] --> AWSGLUE[AWS Glue / EMR]
        SPARK --> AZURESYNAPSE[Azure Synapse]
        SPARK --> GCPDATAPROC[GCP Dataproc]
        AKKA[Akka Producer] --> AWSLAMBDA[AWS Lambda]
        AKKA --> AZUREFUNC[Azure Functions]
        AKKA --> GCPCLOUDRUN[GCP Cloud Run]
    end
```
```mermaid
graph TB
    subgraph StorageQuery["Storage & Query"]
        HIVE[Hive Open Source] --> AWSATHENA[Athena / Glue Catalog]
        HIVE --> AZURESQL[Synapse SQL]
        HIVE --> BIGQUERY[BigQuery]
    end
```
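In this stack, Hive plays the role that Athena, Synapse SQL, or BigQuery play in a managed setup. As a rough sketch of that layer, the snippet below runs an analytical query through Spark's built-in Hive support; the database, table, and columns are hypothetical and only meant to illustrate the idea:

```scala
import org.apache.spark.sql.SparkSession

object HiveBatchSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark use the Hive metastore and query Hive tables.
    val spark = SparkSession.builder()
      .appName("hive-batch-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Database, table, and columns are hypothetical, for illustration only.
    val topPages = spark.sql(
      """
        |SELECT page, COUNT(*) AS clicks
        |FROM sales_analytics.clickstream_events
        |GROUP BY page
        |ORDER BY clicks DESC
        |LIMIT 10
      """.stripMargin)

    topPages.show()
    spark.stop()
  }
}
```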
```mermaid
graph TB
    subgraph Visualization
        ZEPPELIN[Zeppelin Notebook] --> SAGEMAKER[Amazon SageMaker Studio]
        ZEPPELIN --> AZURENOTE[Azure Synapse Notebook]
        ZEPPELIN --> COLAB[GCP Colab]
    end
```
| Component | Description | Repo |
|---|---|---|
| Clickstream Generator | Simulates user activity on a site | Repo |
| Kafka Producer | Pushes data into Kafka from simulated input | Repo |
| Akka Producer | Actor-based producer using Akka Streams (see the sketch after this table) | Repo |
| Spark Applications | Real-time + batch ETL and transformation logic | Repo |
| Hive SQL Layer | DDL and analytical SQL queries | Repo |
| Zeppelin Notebooks | Notebooks for analytics and visualization | Private |
| Data | Additional data for batch and lookup queries | Private |
| Orchestration | Docker, Ansible, and Ambari management | Private |
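For the Akka Producer component listed above, a minimal sketch using Akka Streams with Alpakka Kafka could look like the following. The actor system name, topic, and generated events are assumptions for illustration and not the actual repository code (the `akka-stream-kafka` dependency is assumed):

```scala
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

object AkkaProducerSketch extends App {
  implicit val system: ActorSystem = ActorSystem("clickstream-producer")

  // Alpakka Kafka producer settings; broker address is an assumption.
  val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092")

  // Emit a simulated clickstream event per element and sink the stream into Kafka.
  Source(1 to 100)
    .map(i => s"""{"userId":"u-$i","page":"/products/$i","action":"view"}""")
    .map(value => new ProducerRecord[String, String]("clickstream", value))
    .runWith(Producer.plainSink(producerSettings))
}
```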
This project is deliberately modular and cloud-agnostic to demonstrate:
📦 Tooling knowledge across open-source big data and streaming frameworks
☁️ Cloud fluency via portable architecture mappings
🛠️ Configuration + debugging skills with real-world integration issues