29 changes: 29 additions & 0 deletions README.md
@@ -18,6 +18,7 @@ A comprehensive demo and learning observability stack that provides metrics coll
- **🔄 Data Pipeline**: OpenTelemetry Collector for data processing
- **🖥️ System Monitoring**: Node Exporter for host metrics
- **🛠️ Easy Management**: Convenient shell script for operations
- **☸️ Kubernetes Ready**: Kustomize manifests for deploying the full stack + sample app (Kind or any cluster)

## 📋 Stack Components

@@ -112,6 +113,32 @@ Endpoints: `/`, `/work`, `/error` (http://localhost:8000)

These generate traces (Jaeger), metrics (Prometheus/Grafana), and logs (Kibana) independently of the core compose file.

### Kubernetes Deployment (Alternative Environment)

You can also deploy the same observability toolkit to a Kubernetes cluster (tested with Kind) with namespace separation and auto‑provisioned Grafana dashboards.

Quick Kind demo:
```bash
cd kubernetes/kind
./setup.sh # creates kind cluster + applies kustomize
```

Generic cluster:
```bash
cd kubernetes
./deploy.sh --wait
```

Then port-forward (example):
```bash
kubectl -n observability port-forward svc/grafana 3000:3000 &
kubectl -n observability port-forward svc/prometheus 9090:9090 &
kubectl -n observability port-forward svc/jaeger-query 16686:16686 &
kubectl -n observability port-forward svc/kibana 5601:5601 &
```

Kubernetes docs, build modes (external vs in‑cluster Kaniko), and dashboard provisioning details live in `kubernetes/README.md`.

### Kafka-Based Log Pipeline (Default)

Kafka is enabled by default to demonstrate a decoupled log ingestion flow:
@@ -436,9 +463,11 @@ This project is actively maintained. We aim to:
- Add new observability tools as they become stable
- Improve documentation and examples
- Enhance security and production readiness
- Evolve Kubernetes deployment (Ingress, persistence, security hardening, optional operator-based stack)

When adding new components or configurations:
1. Update this README
2. Test with the management script
3. Ensure proper service discovery configuration
4. Add appropriate alerting rules
5. If applicable, mirror changes in `kubernetes/` manifests & docs
141 changes: 141 additions & 0 deletions kubernetes/README.md
@@ -0,0 +1,141 @@
# Kubernetes Deployment

This directory contains Kubernetes manifests and Kustomize overlays to deploy the observability toolkit and the example `o11y-python` application onto a Kubernetes cluster (tested with Kind). It mirrors the docker-compose experience while adding namespace separation, optional in‑cluster image build, and auto‑provisioned Grafana dashboards.

## Layout

- `kustomization.yaml` – Root Kustomize entrypoint (deploys observability + app namespaces and aggregates resources).
- `observability/` – Prometheus, Grafana, Jaeger (all-in-one), OpenTelemetry Collector, Alertmanager, Elasticsearch, Kibana, Node Exporter, Kafka (optional), Kafka JMX exporter, internal registry (experimental).
- `observability/grafana/` – Deployment plus ConfigMaps for datasources and dashboards provisioning.
- `applications/o11y-python/` – Example Python service exporting OTLP traces/metrics/logs to the collector.
- `applications/o11y-python/build-job.yaml` – Kaniko Job template (generateName) for optional in‑cluster build.
- `applications/o11y-python/build.sh` – Helper to launch the build Job.
- `kind/` – Local Kind cluster definition & helper script (cluster config with extra NodePorts and an attempted containerd patch for the internal registry).
- `deploy.sh` – Convenience script to apply everything with optional waits.

## Current Status

All core components (Prometheus, Grafana, Alertmanager, OpenTelemetry Collector, Jaeger, Elasticsearch, Kibana, Node Exporter) deploy and run successfully under the `observability` namespace. The sample app runs in its own `o11y-python` namespace.

Grafana dashboards are auto‑provisioned (see below). Kibana memory was tuned to avoid JavaScript heap OOM (increased limits + `NODE_OPTIONS`).

An internal registry + Kaniko build path exists but containerd inside Kind may not resolve the cluster-internal registry DNS without extra configuration; a reliable fallback (pre‑loading the image into Kind) is documented. See Build section & troubleshooting.

## Quick Start (Kind)

```bash
cd kubernetes/kind
./setup.sh
```

After pods are ready, either access via NodePort (if enabled) or port-forward:
```bash
kubectl -n observability port-forward svc/grafana 3000:3000 &
kubectl -n observability port-forward svc/prometheus 9090:9090 &
kubectl -n observability port-forward svc/jaeger-query 16686:16686 &
kubectl -n observability port-forward svc/kibana 5601:5601 &
```

NodePorts (example – adjust if you changed manifests):
- Grafana: 30000
- Prometheus: 30900
- Jaeger UI: 31686

You can instead expose via an Ingress (not yet provided – see Roadmap).
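
Until then, a minimal Ingress for Grafana would look roughly like this sketch. The ingress class, hostname, and the presence of a controller are all assumptions — no Ingress manifest ships in this repo yet:

```yaml
# Sketch only: assumes an NGINX ingress controller is installed and
# that grafana.local resolves to a cluster node (e.g. via /etc/hosts).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: observability
spec:
  ingressClassName: nginx
  rules:
    - host: grafana.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 3000
```

The same pattern (one rule per host) extends to Prometheus, Jaeger, and Kibana.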

## Generic Cluster

```bash
cd kubernetes
./deploy.sh --wait
```
Then port-forward or expose via Ingress (not provided by default).

## Building the Example App Image

Two options:

1. External build (recommended & current default):
```bash
docker build -t o11y-python:latest ../../o11y-playground/o11y-python
kind load docker-image o11y-python:latest --name observability
```
Deployment `applications/o11y-python/deployment.yaml` should have `image: o11y-python:latest` for this path (default in repo).

2. In-cluster build (Kaniko + internal registry, experimental):
- Registry Service: `registry.observability.svc:5000` (ClusterIP) & DNS FQDN `registry.observability.svc.cluster.local`.
- Launch build using `build.sh` (creates a `kaniko` Job with a unique name via `generateName`).
- The Job contains an inline Dockerfile with pinned dependencies (no context mount needed).
- Limitation: Kind's containerd may not pull from the in‑cluster Service DNS without additional node-level registry mirror configuration. If image pulls fail (`ImagePullBackOff`), use the external build fallback above.
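
The "additional node-level registry mirror configuration" mentioned above would be a `containerdConfigPatches` entry in the Kind cluster config, roughly like the untested sketch below. The mirror endpoint (and whether it is reachable from the node) is an assumption; the repo's `kind/` config may differ:

```yaml
# kind-config.yaml fragment (sketch, untested): map the in-cluster registry
# hostname to an endpoint containerd on the node can actually reach.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.observability.svc.cluster.local:5000"]
      endpoint = ["http://localhost:30500"]  # assumes the registry is also exposed on a NodePort
```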

Run build:
```bash
cd kubernetes/applications/o11y-python
./build.sh
kubectl -n o11y-python logs -f job/$(kubectl -n o11y-python get jobs --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].metadata.name}')
```
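
`build.sh` renders the manifest by substituting the `{{BUILD_TIMESTAMP}}` placeholder before `kubectl create`. The templating step can be exercised standalone with no cluster — this is a hypothetical minimal fragment, not the real manifest:

```shell
# Stand-in for build-job.yaml: just the templated annotation line
tmpl=$(mktemp)
printf '%s\n' 'o11y-python/build-timestamp: "{{BUILD_TIMESTAMP}}"' > "$tmpl"

ts=$(date +%s)
# Same substitution build.sh performs before kubectl create
rendered=$(sed "s/{{BUILD_TIMESTAMP}}/${ts}/g" "$tmpl")
echo "$rendered"
rm -f "$tmpl"
```

Because the annotation value changes on every run, a re-applied Job spec is never identical to the previous one, which is what forces a fresh build.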

After success (and once registry pull issues are resolved, if using internal registry), restart app:
```bash
kubectl -n o11y-python rollout restart deployment/o11y-python
```

If the rollout fails to pull, switch the image back to `o11y-python:latest` and pre‑load it with `kind load docker-image`.

### Switching Between Build Modes

Edit `applications/o11y-python/deployment.yaml`:
- Internal registry mode: `image: registry.observability.svc.cluster.local:5000/o11y-python:latest`
- Preloaded local image mode: `image: o11y-python:latest`

Remember to `kubectl apply -k kubernetes` (or re-run `deploy.sh`) after modifying the deployment.
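
Instead of hand-editing the image line, Kustomize's `images` transformer can retarget it from `applications/o11y-python/kustomization.yaml`. A sketch — the `name` must match the image reference currently in `deployment.yaml`:

```yaml
# kustomization.yaml fragment (sketch): switch to internal-registry mode
# without touching deployment.yaml
images:
  - name: o11y-python
    newName: registry.observability.svc.cluster.local:5000/o11y-python
    newTag: latest
```

Removing the block (or commenting it out) reverts to the preloaded local image mode.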

## Grafana Dashboards Provisioning

Grafana is configured with:
- Datasources ConfigMap: automatically sets Prometheus, Elasticsearch, Jaeger, and Loki (if later added) endpoints.
- Dashboards ConfigMaps: JSON files mounted at `/var/lib/grafana/dashboards`.
- Provisioning provider ConfigMap: points Grafana to load all dashboards in that directory and auto‑scan (supports periodic reloads).

Add a new dashboard:
1. Export from the Grafana UI (or author JSON) with a unique `uid`.
2. Append it to the dashboards ConfigMap (or create a new one) under `observability/grafana/`.
3. Re-apply: `kubectl apply -k kubernetes/observability` (or root kustomization).
4. (Optional) Restart Grafana: `kubectl -n observability rollout restart deploy/grafana`.
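
For orientation, a file-based dashboard provider in Grafana's provisioning format looks roughly like this; field names follow Grafana's documented schema, but the repo's actual ConfigMap values may differ:

```yaml
apiVersion: 1
providers:
  - name: default               # arbitrary provider name
    folder: ""                  # import into the General folder
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30   # periodic re-scan of the directory
    options:
      path: /var/lib/grafana/dashboards
```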

Dashboards included: observability overview, node exporter overview, Kafka overview.

## Customization

You can adjust component resources or disable optional components by editing `observability/kustomization.yaml` (comment out resources).
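
For example, disabling the optional Kafka pieces is just a matter of commenting out their entries. The resource names below are illustrative — check `observability/kustomization.yaml` for the actual list:

```yaml
# observability/kustomization.yaml (illustrative fragment)
resources:
  - prometheus
  - grafana
  - jaeger
  # - kafka              # optional: disabled
  # - kafka-jmx-exporter # optional: disabled
```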

## Troubleshooting

| Symptom | Likely Cause | Resolution |
|---------|--------------|-----------|
| `ImagePullBackOff` for app image referencing internal registry | Kind containerd can't resolve/pull from cluster Service DNS | Use external build + `kind load docker-image`, or configure Kind with a registry mirror mapping hostPort to the Service (future improvement) |
| Kibana `JavaScript heap out of memory` crash loops | Default memory limit too low for Kibana / Elasticsearch index patterns migration | Increased memory limit (1Gi) + set `NODE_OPTIONS=--max-old-space-size=1024` (already applied) |
| Grafana dashboards missing | ConfigMap not mounted / provisioning mismatch | Ensure deployment has volumes for dashboards + provider, verify ConfigMap names and re-apply |
| Kaniko Job fails to find context files | Inline Dockerfile expects no external context | Ensure you haven't edited build job to reference local files unless mounting them |

Collect diagnostics:
```bash
kubectl get pods -n observability
kubectl describe pod <pod> -n observability
kubectl logs -n observability <pod> --tail=200
```

## Roadmap / Next Steps

- Ingress / Gateway for external access.
- PersistentVolumeClaims for Prometheus & Grafana (and Elasticsearch tuning / PVC storage class parameterization).
- Optional Prometheus Operator (ServiceMonitor / PodMonitor resources) overlay.
- Refined internal registry solution (node-level mirror or local registry container bound to host network).
- Security hardening: RBAC least privilege, NetworkPolicies, PodSecurity standards, non-root containers, TLS.
- Parameterization via Kustomize vars / Helm chart for easier environment-specific overrides.
- Loki + Tempo optional add-ons for logs/traces alternative backends.

## Contributing

Improvements to this Kubernetes deployment are welcome. Keep the docker-compose and Kubernetes feature sets aligned where practical, document new components, and add minimal sane defaults (avoid over-complicating the demo footprint).

9 changes: 9 additions & 0 deletions kubernetes/applications/namespace.yaml
@@ -0,0 +1,9 @@
apiVersion: v1
kind: Namespace
metadata:
  name: o11y-python
  labels:
    name: o11y-python
    app-example: "true"
---
# (Optional) Additional namespace-scoped resources (network policies, resource quotas) can be added here later.
102 changes: 102 additions & 0 deletions kubernetes/applications/o11y-python/build-job.yaml
@@ -0,0 +1,102 @@
apiVersion: batch/v1
kind: Job
metadata:
  generateName: build-o11y-python-
  namespace: o11y-python
  labels:
    job: build-o11y-python
  annotations:
    o11y-python/build-timestamp: "{{BUILD_TIMESTAMP}}" # replaced via sed to force new image builds
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kaniko
          image: gcr.io/kaniko-project/executor:v1.23.2-debug
          args:
            - '--dockerfile=/workspace/Dockerfile'
            - '--context=/workspace'
            - '--destination=registry.observability.svc.cluster.local:5000/o11y-python:latest'
            - '--insecure'
            - '--skip-tls-verify'
          volumeMounts:
            - name: src
              mountPath: /workspace
      volumes:
        - name: src
          configMap:
            name: o11y-python-source
            items:
              - key: Dockerfile
                path: Dockerfile
              - key: app.py
                path: app.py
              - key: run.sh
                path: run.sh
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: o11y-python-source
  namespace: o11y-python
  labels:
    app: o11y-python
    component: source
data:
  Dockerfile: |
    FROM python:3.11-slim
    WORKDIR /app
    COPY app.py run.sh ./
    RUN pip install --no-cache-dir \
        fastapi==0.110.0 \
        uvicorn[standard]==0.29.0 \
        opentelemetry-sdk==1.21.0 \
        opentelemetry-exporter-otlp==1.21.0 \
        opentelemetry-instrumentation-fastapi==0.42b0 \
        opentelemetry-instrumentation-logging==0.42b0
    EXPOSE 8000
    CMD ["python", "app.py"]
  app.py: |
    from fastapi import FastAPI
    import os
    import time
    import logging
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    logging.basicConfig(level=logging.INFO)

    OTEL_ENDPOINT = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector.observability.svc.cluster.local:4317")

    resource = Resource(attributes={
        "service.name": os.getenv("OTEL_SERVICE_NAME", "o11y-python"),
        "service.version": "0.1.0"
    })

    provider = TracerProvider(resource=resource)
    processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=OTEL_ENDPOINT, insecure=True))
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)
    app = FastAPI()

    @app.get("/")
    async def root():
        with tracer.start_as_current_span("root-span"):
            logging.info("Handling root request")
            time.sleep(0.05)
            return {"message": "Hello from o11y-python in Kubernetes!"}

    if __name__ == "__main__":
        import uvicorn
        uvicorn.run(app, host="0.0.0.0", port=8000)
  run.sh: |
    #!/usr/bin/env bash
    set -euo pipefail
    exec python app.py
17 changes: 17 additions & 0 deletions kubernetes/applications/o11y-python/build.sh
@@ -0,0 +1,17 @@
#!/usr/bin/env bash
set -euo pipefail
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TS=$(date +%s)
TMP=$(mktemp)
# Substitute the build timestamp annotation so each rendered manifest is unique (the Job name itself is uniquified by generateName)
sed "s/{{BUILD_TIMESTAMP}}/${TS}/g" "${DIR}/build-job.yaml" > "${TMP}"
# Ensure namespace exists
kubectl apply -f "${DIR}/../namespace.yaml" >/dev/null 2>&1 || true
# The source ConfigMap is embedded in build-job.yaml; uncomment below if it is ever extracted into its own file
# kubectl apply -f "${DIR}/o11y-python-source-configmap.yaml" 2>/dev/null || true
# Create a new build job (generateName ensures uniqueness)
kubectl create -f "${TMP}"
JOB_NAME=$(kubectl -n o11y-python get jobs --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].metadata.name}')
rm -f "${TMP}"
echo "Created job: ${JOB_NAME}" >&2
echo "Follow logs: kubectl -n o11y-python logs -f job/${JOB_NAME}" >&2
32 changes: 32 additions & 0 deletions kubernetes/applications/o11y-python/deployment.yaml
@@ -0,0 +1,32 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: o11y-python
  namespace: o11y-python
  labels:
    app: o11y-python
spec:
  replicas: 1
  selector:
    matchLabels:
      app: o11y-python
  template:
    metadata:
      labels:
        app: o11y-python
    spec:
      containers:
        - name: o11y-python
          # Using a local image loaded into the kind node (avoids pulling via the
          # cluster-internal Service name, which containerd on the node cannot resolve)
          image: o11y-python:latest
          imagePullPolicy: IfNotPresent
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-collector.observability.svc.cluster.local:4317
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: grpc
            - name: OTEL_SERVICE_NAME
              value: o11y-python
          ports:
            - containerPort: 8000
              name: http
9 changes: 9 additions & 0 deletions kubernetes/applications/o11y-python/kustomization.yaml
@@ -0,0 +1,9 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Namespace will be created by root kustomization (applications/namespace.yaml)
namespace: o11y-python
resources:
- deployment.yaml
- service.yaml

# Build job intentionally excluded; apply it manually when you need to (re)build the image.
14 changes: 14 additions & 0 deletions kubernetes/applications/o11y-python/service.yaml
@@ -0,0 +1,14 @@
apiVersion: v1
kind: Service
metadata:
  name: o11y-python
  namespace: o11y-python
  labels:
    app: o11y-python
spec:
  selector:
    app: o11y-python
  ports:
    - name: http
      port: 8000
      targetPort: 8000