A fully automated, observable, real-time data streaming platform built on a local Kubernetes cluster.
- Overview
- Features
- Architecture
- Project Structure
- Getting Started
- Workflow Commands
- Contributing
- License
This project is a comprehensive Proof-of-Concept (POC) for building an end-to-end, real-time data streaming and observability platform. It demonstrates modern DevOps and Data Engineering practices using Infrastructure as Code (IaC) and automation.
The pipeline ingests a live stream of page-move events from Wikimedia, publishes them into a Kafka cluster running on Kubernetes, and provides a Grafana dashboard to visualize the incoming data rate and total message count in real-time. The entire stack, from the local Kubernetes cluster to the application and dashboards, is managed by a simple, automated command-line interface using just.
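For reference, the ingestion path implemented in `app/consumer.py` can be sketched roughly as follows. This is an illustrative outline, not the actual application code: the stream URL, topic name, and bootstrap address (Strimzi's default `my-cluster-kafka-bootstrap` service for a cluster named `my-cluster`) are assumptions, and it relies on the third-party `requests`, `sseclient-py`, and `kafka-python` libraries.

```python
import json

# Illustrative names; the real stream/topic names live in app/consumer.py.
STREAM_URL = "https://stream.wikimedia.org/v2/stream/page-move"
TOPIC = "wikimedia.page-move"


def parse_event(raw: str) -> dict:
    """Parse one SSE 'data:' payload into the fields we care about."""
    event = json.loads(raw)
    return {
        "wiki": event.get("wiki"),
        "title": event.get("page_title"),
        "timestamp": event.get("meta", {}).get("dt"),
    }


def main():
    # Third-party dependencies (requests, sseclient-py, kafka-python);
    # imported here so the pure helpers above stay dependency-free.
    import requests
    from sseclient import SSEClient
    from kafka import KafkaProducer

    producer = KafkaProducer(
        # Assumed in-cluster address of the Strimzi bootstrap service.
        bootstrap_servers="my-cluster-kafka-bootstrap.kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    resp = requests.get(STREAM_URL, stream=True,
                        headers={"Accept": "text/event-stream"})
    for msg in SSEClient(resp).events():
        if msg.data:
            producer.send(TOPIC, parse_event(msg.data))


if __name__ == "__main__":
    main()
```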
- Automated Infrastructure: The entire Kubernetes cluster (K3d) is provisioned and managed as code using Terraform and Terragrunt.
- Real-Time Ingestion: A resilient Python application connects to the live Wikimedia EventStream and publishes data to Kafka.
- Managed Data Bus: A robust, ZooKeeper-based Kafka cluster and Schema Registry are deployed on Kubernetes, managed by the Strimzi operator and Helm.
- Full Observability: The stack includes a complete monitoring suite with Prometheus for metrics collection, a Kafka Exporter, and Grafana for visualization.
- Declarative Dashboards: The Grafana dashboard is also defined as code in a Kubernetes `ConfigMap` and deployed automatically.
- One-Command Workflow: A `Justfile` provides a simple, powerful interface (`just all`, `just clean`, `just connect`, etc.) to manage the entire lifecycle of the platform.
```mermaid
graph TD
    subgraph Internet
        Wikimedia[Wikimedia EventStream]
    end
    subgraph "Kubernetes Cluster (k3d)"
        subgraph "kafka Namespace"
            Consumer(Python Consumer App)
            Kafka_Cluster(Kafka Cluster)
            Schema_Registry(Schema Registry)
            Kafka_Exporter(Kafka Exporter)
        end
        subgraph "monitoring Namespace"
            Prometheus(Prometheus)
            Grafana(Grafana)
        end
    end
    Wikimedia -- HTTP SSE --> Consumer
    Consumer -- Produces Events --> Kafka_Cluster
    Kafka_Exporter -- Scrapes Metrics --> Kafka_Cluster
    Prometheus -- Scrapes Metrics --> Kafka_Exporter
    Grafana -- Queries Metrics --> Prometheus
```
```
.
├── app/
│   ├── consumer.py
│   ├── Dockerfile
│   ├── deployment.yaml
│   └── requirements.txt
├── environment/
│   └── dev/
│       └── terragrunt.hcl
├── helm/
│   ├── grafana-kafka-dashboard.yaml
│   ├── kafka-cluster.yaml
│   ├── kafka-exporter-deployment.yaml
│   └── prometheus-values.yaml
├── justfile
├── modules/
│   └── k3d-cluster/
│       ├── main.tf
│       ├── variables.tf
│       └── versions.tf
└── root.hcl
```

This stack requires a machine with at least 16GB of RAM. Before you begin, ensure you have the following tools installed in your development environment (e.g., Ubuntu, CentOS).
- Docker: To run the containers. Follow the official installation guide for your distribution.
- k3d: To run a local Kubernetes cluster.

  ```bash
  curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
  ```

- kubectl: To interact with the Kubernetes cluster.

  ```bash
  # Note: Use a version that matches your k3d server version for best results.
  curl -LO https://dl.k8s.io/release/v1.31.5/bin/linux/amd64/kubectl
  chmod +x ./kubectl
  sudo mv ./kubectl /usr/local/bin/kubectl
  ```

- Helm: The package manager for Kubernetes.

  ```bash
  curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
  ```

- Terraform & Terragrunt: To manage infrastructure as code.

  ```bash
  # Install tfenv (for Terraform) and tgenv (for Terragrunt)
  git clone https://github.com/tfutils/tfenv.git ~/.tfenv
  echo 'export PATH="$HOME/.tfenv/bin:$PATH"' >> ~/.bashrc
  git clone https://github.com/cunymatthieu/tgenv.git ~/.tgenv
  echo 'export PATH="$HOME/.tgenv/bin:$PATH"' >> ~/.bashrc
  source ~/.bashrc

  # Install the latest versions
  tfenv install latest && tfenv use latest
  tgenv install latest && tgenv use latest
  ```

- just: A handy command runner to automate our workflows.

  ```bash
  sudo apt update && sudo apt install just
  ```
The entire workflow is managed by the `justfile`.
- Clone the repository:

  ```bash
  git clone https://github.com/rdrishabh38/wikimedia-analytics.git
  cd wikimedia-analytics
  ```

- Start the Entire Stack: This single command builds everything: the Kubernetes cluster, all infrastructure services (Kafka, Monitoring), and the Python application.

  ```bash
  just all
  ```

  Disclaimer: This command starts many services at once. On machines with 16GB of RAM or less, this may cause resource contention. If you experience issues, use the more stable step-by-step method below.

- Step-by-Step Startup (Recommended): See the Step-by-Step Manual Startup Guide section for a guided, one-component-at-a-time startup.
Use the `just` command from the project root to manage your environment. Run `just` or `just help` to see all available commands.
- `just all`: Build and deploy the entire stack (infra + app).
- `just clean`: Destroy the entire stack completely.
- `just infra-up`: Deploy only the infrastructure (K8s, Kafka, Monitoring).
- `just app-deploy`: Deploy only the Python application.
- `just status`: Check the status of all pods.
- `just connect`: Start port-forwarding to Grafana & Prometheus and print login details.
- `just disconnect`: Stop all background port-forwarding.
- `just pause`: Quickly stop the K3d cluster to save resources.
- `just resume`: Quickly resume the paused cluster.
- `just app-logs`: View the live logs from the consumer application.
This guide provides the commands to bring up each component of the data platform one by one. Each step includes a verification command to ensure the component is healthy before proceeding to the next.
This method is recommended for debugging or for users on resource-constrained systems.
This command uses Terragrunt to provision the 3-node K3d cluster.
Command:
```bash
just cluster
```

Verification:
Wait for the command to complete, then run the following. You must see all three nodes with a STATUS of Ready.

```bash
kubectl get nodes -A
```

This command teaches our Kubernetes cluster about Strimzi resource types like Kafka.
Command:
```bash
just crds
```

Verification:
Run the following command. You should see a list of resources ending in strimzi.io.

```bash
kubectl get crds | grep strimzi.io
```

This command adds the bitnami and prometheus-community chart repositories to Helm.
Command:
```bash
just helm-repos
```

Verification: Run the following command. You should see the newly added repositories listed.

```bash
helm repo list
```

This command uses Helm to install the Strimzi operator, which will manage our Kafka cluster.
Command:
```bash
just operator
```

Verification:
Watch the pods in the kafka namespace. Wait until the strimzi-cluster-operator-... pod is 1/1 Running.

```bash
kubectl get pods -n kafka -w
```

(Press Ctrl+C to stop watching once it's ready.)
This command applies our Kafka custom resource. The operator will see this and build the cluster.
Command:
```bash
just kafka
```

Verification:
Watch the pods again. Wait until my-cluster-zookeeper-0, my-cluster-kafka-0, and my-cluster-entity-operator-... are all Running. This can take a few minutes.

```bash
kubectl get pods -n kafka -w
```

This command installs the Schema Registry, which connects to our Kafka cluster.
Command:
```bash
just schema-registry
```

Verification:
Watch the pods. Wait until schema-registry-0 and schema-registry-kafka-controller-0 are Running.

```bash
kubectl get pods -n kafka -w
```

This command installs the core monitoring services into their own monitoring namespace.
Command:
```bash
just monitoring-up
```

Verification:
Watch the pods in the monitoring namespace. Wait until all pods (Prometheus, Grafana, Alertmanager, etc.) are Running. This can take several minutes.

```bash
kubectl get pods -n monitoring -w
```

This command deploys the pre-created "Kafka Incoming Data Metrics" dashboard to the Grafana instance.
Command:
```bash
just monitoring-dashboards
```

Verification:
Watch the kafka namespace. Wait until the kafka-exporter-... pod is 1/1 Running.

```bash
kubectl get pods -n kafka -w
```

At this point, you can also use `just connect` to access the Grafana UI and verify that the "Wikimedia Incoming Metrics" dashboard is present.
Finally, this command builds and deploys our Python application.
Command:
```bash
just app-deploy
```

Verification:
Watch the kafka namespace. Wait until the wikimedia-consumer-... pod is 1/1 Running.

```bash
kubectl get pods -n kafka -w
```

Once it's running, you can view the live logs with `just app-logs`.
This command starts port-forwarding so that you can access the Grafana and Prometheus UIs in the browser.
Command:
```bash
just connect
```

Verification: Note the URLs and the login details printed in the command output:

```
✅ Grafana is now available at: http://localhost:8080
   Username: admin
   Password: grafana_password
✅ Prometheus is now available at: http://localhost:9090
Run 'just disconnect' to stop port forwarding.
```

Once the forwarding is running, you can view the Grafana dashboard in the browser.
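While the port-forward is active, you can also sanity-check Grafana from the command line through its HTTP API. This is a hedged sketch: the port and the admin credentials are taken from the `just connect` output above, and the dashboard title to look for is an assumption.

```python
import base64
import json
import urllib.request

GRAFANA_URL = "http://localhost:8080"          # from the `just connect` output
USER, PASSWORD = "admin", "grafana_password"   # from the `just connect` output


def list_dashboards(base_url: str, user: str, password: str) -> list:
    """Return the titles of all dashboards via Grafana's /api/search endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{base_url}/api/search?type=dash-db",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return [item["title"] for item in json.load(resp)]


def has_dashboard(titles: list, keyword: str) -> bool:
    """Check whether any dashboard title contains the given keyword."""
    return any(keyword.lower() in title.lower() for title in titles)


if __name__ == "__main__":
    titles = list_dashboards(GRAFANA_URL, USER, PASSWORD)
    print(titles)
    print("Kafka dashboard present:", has_dashboard(titles, "Kafka"))
```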
To validate that all the containers are running properly, simply run:

Command:

```bash
kubectl get pods -A
```

This should display output similar to the following:
```
NAMESPACE     NAME                                                     READY   STATUS      RESTARTS   AGE
kafka         kafka-exporter-8f7c9b44-xp2kr                            1/1     Running     0          13m
kafka         my-cluster-entity-operator-5796968999-ph7np              1/1     Running     0          16m
kafka         my-cluster-kafka-0                                       1/1     Running     0          17m
kafka         my-cluster-zookeeper-0                                   1/1     Running     0          18m
kafka         schema-registry-0                                        1/1     Running     0          15m
kafka         schema-registry-kafka-controller-0                       1/1     Running     0          13m
kafka         strimzi-cluster-operator-7d5c858dd8-db7d7                1/1     Running     0          19m
kafka         wikimedia-consumer-7798899589-hbqwq                      1/1     Running     0          8m26s
kube-system   coredns-ccb96694c-twgs4                                  1/1     Running     0          20m
kube-system   helm-install-traefik-4v4qs                               0/1     Completed   2          20m
kube-system   helm-install-traefik-crd-n9rd2                           0/1     Completed   0          20m
kube-system   local-path-provisioner-5cf85fd84d-pmbgr                  1/1     Running     0          20m
kube-system   metrics-server-5985cbc9d7-4ffzc                          1/1     Running     0          20m
kube-system   svclb-traefik-80f49bfb-9g599                             2/2     Running     0          20m
kube-system   svclb-traefik-80f49bfb-m4zf5                             2/2     Running     0          20m
kube-system   svclb-traefik-80f49bfb-rcf4h                             2/2     Running     0          20m
kube-system   traefik-5d45fc8cc9-gcrm9                                 1/1     Running     0          20m
monitoring    alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running     0          13m
monitoring    prometheus-grafana-57f4f984bd-knkzk                      3/3     Running     0          13m
monitoring    prometheus-kube-prometheus-operator-744b886d69-mslms     1/1     Running     0          13m
monitoring    prometheus-kube-state-metrics-5f676f8f8b-xtmsp           1/1     Running     0          13m
monitoring    prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running     0          13m
monitoring    prometheus-prometheus-node-exporter-dpb7v                1/1     Running     0          13m
monitoring    prometheus-prometheus-node-exporter-v2h8v                1/1     Running     0          13m
monitoring    prometheus-prometheus-node-exporter-zmrpm                1/1     Running     0          13m
```
If you see similar output, the project is working correctly and all the containers are up and running.
If any container is stopped, or its RESTARTS count keeps increasing, check that container's logs to diagnose the problem.
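Beyond pod status, you can confirm that events are actually flowing by querying Prometheus directly while `just connect` is active. This is a hedged sketch: the port comes from the `just connect` output, and the metric name assumes the stock Kafka Exporter, which exposes `kafka_topic_partition_current_offset` per topic partition.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # from the `just connect` output
# Summing the stock Kafka Exporter offset metric per topic gives the total
# message count that the Grafana dashboard visualizes.
QUERY = "sum by (topic) (kafka_topic_partition_current_offset)"


def extract_offsets(body: dict) -> dict:
    """Pure helper: pull {topic: value} pairs out of a Prometheus API response."""
    return {
        r["metric"].get("topic", ""): float(r["value"][1])
        for r in body["data"]["result"]
    }


def instant_query(base_url: str, promql: str) -> dict:
    """Run a PromQL instant query against Prometheus' HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return extract_offsets(body)


if __name__ == "__main__":
    # A growing offset for the consumer's topic means data is being ingested.
    print(instant_query(PROMETHEUS_URL, QUERY))
```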
To clean up all the resources, simply run:

Command:

```bash
just clean
```

This command will destroy all the resources created so far.
Contributions are welcome! Please feel free to submit a pull request for any improvements or bug fixes. For major changes, please open an issue first to discuss what you would like to change.
- Fork the Project.
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`).
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`).
- Push to the Branch (`git push origin feature/AmazingFeature`).
- Open a Pull Request.
This project is distributed under the GNU General Public License v3.0. See LICENSE for more information.

