Merged
24 changes: 23 additions & 1 deletion README.md
@@ -4,7 +4,9 @@
[![Docker](https://img.shields.io/badge/Docker-20.10%2B-blue)](https://www.docker.com/)
[![Docker Compose](https://img.shields.io/badge/Docker%20Compose-2.0%2B-blue)](https://docs.docker.com/compose/)

A comprehensive, production-ready observability stack that provides metrics collection, log aggregation, distributed tracing, and alerting capabilities using industry-standard open-source tools.
A comprehensive demo and learning observability stack that provides metrics collection, log aggregation, distributed tracing, and alerting capabilities using industry-standard open-source tools.

> **Disclaimer**: This project is intended for demonstration, experimentation, and educational purposes only. It is **NOT production ready**. It runs all components in single containers with minimal configuration and without hardening (no auth, no TLS, single-node Elasticsearch, in-container Prometheus storage, no HA, no backup/restore strategy). Before any production use you must implement security, scaling, persistence, resilience, and operational safeguards.

## ✨ Features

@@ -84,6 +86,26 @@ After starting the stack, you can access the following services:

> **Note**: These URLs are only accessible when the stack is running locally.

### Demo Application (Independent)

A sample FastAPI + OpenTelemetry app lives under `o11y-playground/o11y-python`.
It runs from its own directory and only needs to join the shared Docker network named `observability`
so it can reach the toolkit's OpenTelemetry Collector at `otel-collector:4317`.

Run it separately (after starting the stack):
```bash
cd o11y-playground/o11y-python
docker compose up -d --build
```
Stop it:
```bash
docker compose down
```

Endpoints: `/`, `/work`, `/error` (http://localhost:8000)

These generate traces (Jaeger), metrics (Prometheus/Grafana), and logs (Kibana) independently of the core compose file.
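The three routes can be exercised with any HTTP client; as a minimal sketch (assuming the app is published on `localhost:8000` as in the compose file), a stdlib-only Python traffic generator:

```python
import urllib.error
import urllib.request

# Routes exposed by the demo app (see the endpoint list above).
ENDPOINTS = ["/", "/work", "/error"]

def hit(base_url: str, path: str) -> int:
    """Request one endpoint and return its HTTP status code.

    /error fails server-side by design, so its error response is
    returned as a status code rather than raised as an exception.
    """
    try:
        with urllib.request.urlopen(base_url + path, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code

if __name__ == "__main__":
    for path in ENDPOINTS:
        print(path, hit("http://localhost:8000", path))
```

Each request produces a trace in Jaeger and increments the demo metrics scraped by Prometheus.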

## 📁 Configuration Structure

```
106 changes: 106 additions & 0 deletions config/grafana/dashboards/node-exporter-overview.json
@@ -0,0 +1,106 @@
{
"id": null,
"uid": "node-exporter-overview",
"title": "Node Exporter Overview",
"tags": ["node", "infrastructure", "overview"],
"style": "dark",
"timezone": "browser",
"editable": false,
"schemaVersion": 39,
"version": 1,
"refresh": "5s",
"time": {"from": "now-1h", "to": "now"},
"panels": [
{
"id": 1,
"title": "CPU Usage % (avg per instance)",
"type": "stat",
"targets": [
{"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)", "refId": "A"}
],
"fieldConfig": {"defaults": {"unit": "percent", "min": 0, "max": 100, "thresholds": {"steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 60}, {"color": "red", "value": 85}]}}},
"gridPos": {"h": 6, "w": 8, "x": 0, "y": 0}
},
{
"id": 2,
"title": "Memory Usage %",
"type": "stat",
"targets": [
{"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100", "refId": "A"}
],
"fieldConfig": {"defaults": {"unit": "percent", "min": 0, "max": 100, "thresholds": {"steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 70}, {"color": "red", "value": 90}]}}},
"gridPos": {"h": 6, "w": 8, "x": 8, "y": 0}
},
{
"id": 3,
"title": "Load Average (1m)",
"type": "stat",
"targets": [
{"expr": "avg(node_load1)", "refId": "A"}
],
"fieldConfig": {"defaults": {"unit": "none"}},
"gridPos": {"h": 6, "w": 8, "x": 16, "y": 0}
},
{
"id": 4,
"title": "CPU Usage per Instance",
"type": "graph",
"targets": [
{"expr": "100 - (irate(node_cpu_seconds_total{mode=\"idle\"}[5m]) * 100)", "legendFormat": "{{instance}} core={{cpu}}", "refId": "A"}
],
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 6}
},
{
"id": 5,
"title": "Memory Used (bytes)",
"type": "graph",
"targets": [
{"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)", "legendFormat": "{{instance}}", "refId": "A"}
],
"fieldConfig": {"defaults": {"unit": "bytes"}},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 6}
},
{
"id": 6,
"title": "Filesystem Usage %",
"type": "graph",
"targets": [
{"expr": "(node_filesystem_size_bytes{fstype!~\"tmpfs|overlay\"} - node_filesystem_free_bytes{fstype!~\"tmpfs|overlay\"}) / node_filesystem_size_bytes{fstype!~\"tmpfs|overlay\"} * 100", "legendFormat": "{{instance}} {{mountpoint}}", "refId": "A"}
],
"fieldConfig": {"defaults": {"unit": "percent"}},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 14}
},
{
"id": 7,
"title": "Disk IO Read/Write (bytes/s)",
"type": "graph",
"targets": [
{"expr": "sum by(instance) (irate(node_disk_read_bytes_total[5m]))", "legendFormat": "{{instance}} read", "refId": "A"},
{"expr": "sum by(instance) (irate(node_disk_written_bytes_total[5m]))", "legendFormat": "{{instance}} write", "refId": "B"}
],
"fieldConfig": {"defaults": {"unit": "Bps"}},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 14}
},
{
"id": 8,
"title": "Network RX/TX (bytes/s)",
"type": "graph",
"targets": [
{"expr": "sum by(instance) (irate(node_network_receive_bytes_total{device!~\"lo\"}[5m]))", "legendFormat": "{{instance}} rx", "refId": "A"},
{"expr": "sum by(instance) (irate(node_network_transmit_bytes_total{device!~\"lo\"}[5m]))", "legendFormat": "{{instance}} tx", "refId": "B"}
],
"fieldConfig": {"defaults": {"unit": "Bps"}},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 22}
},
{
"id": 9,
"title": "Uptime (seconds)",
"type": "table",
"targets": [
{"expr": "node_time_seconds - node_boot_time_seconds", "format": "table", "refId": "A"}
],
"fieldConfig": {"defaults": {"unit": "s"}},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 22}
}
]
}
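The CPU and memory panels above lean on two standard node-exporter formulas: `node_cpu_seconds_total` is a cumulative per-core counter, so `irate()` of its idle mode gives idle seconds accrued per second, and subtracting the averaged idle percentage from 100 yields busy percentage; memory usage is `(1 - MemAvailable / MemTotal) * 100`. A small Python sketch of the arithmetic (illustrative values, not real samples):

```python
def cpu_busy_percent(idle_prev: float, idle_curr: float,
                     interval_s: float, cores: int) -> float:
    """Mirrors the panel query
    100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100).

    The counter only grows; (idle_curr - idle_prev) / interval_s is the
    idle rate (0..1 per core). Average over cores, then invert to busy %.
    """
    idle_rate = (idle_curr - idle_prev) / interval_s / cores
    return 100.0 - idle_rate * 100.0

def mem_used_percent(available_bytes: float, total_bytes: float) -> float:
    """Mirrors (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100."""
    return (1.0 - available_bytes / total_bytes) * 100.0

# 4 cores accrue 48 idle-seconds over a 15 s window -> 80% idle, 20% busy
print(cpu_busy_percent(1000.0, 1048.0, 15.0, 4))   # → 20.0
# 2 GiB available of 8 GiB total -> 75% used
print(mem_used_percent(2 * 1024**3, 8 * 1024**3))  # → 75.0
```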
25 changes: 12 additions & 13 deletions config/grafana/dashboards/observability-overview.json
@@ -1,11 +1,14 @@
{
"dashboard": {
"id": null,
"title": "Observability Stack Overview",
"tags": ["observability", "overview"],
"style": "dark",
"timezone": "browser",
"panels": [
"id": null,
"uid": "obs-overview",
"title": "Observability Stack Overview",
"tags": ["observability", "overview"],
"style": "dark",
"timezone": "browser",
"editable": false,
"schemaVersion": 39,
"version": 1,
"panels": [
{
"id": 1,
"title": "System CPU Usage",
@@ -100,10 +103,6 @@
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 16}
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "5s"
}
"time": {"from": "now-1h", "to": "now"},
"refresh": "5s"
}
5 changes: 5 additions & 0 deletions config/prometheus/prometheus.yml
@@ -27,6 +27,11 @@ scrape_configs:
static_configs:
- targets: ['otel-collector:8888']

# OpenTelemetry Collector Prometheus exporter (application & transformed metrics)
- job_name: 'otel-collector-exporter'
static_configs:
- targets: ['otel-collector:8889']

# Example: Application services (uncomment and customize for your services)
# - job_name: 'your-app-service'
# static_configs:
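The new job complements the existing `otel-collector:8888` target: port 8888 serves the collector's own internal metrics, while 8889 is the collector's Prometheus exporter re-publishing application metrics received over OTLP. A sketch of the URLs Prometheus will scrape, given its default `/metrics` path:

```python
# Two Prometheus endpoints on the same collector container:
#   8888 – the collector's own (self) metrics
#   8889 – the prometheus exporter with app metrics received via OTLP
SCRAPE_JOBS = {
    "otel-collector": "otel-collector:8888",
    "otel-collector-exporter": "otel-collector:8889",
}

def scrape_url(target: str, metrics_path: str = "/metrics") -> str:
    """Prometheus fetches http://<target><metrics_path> with default scheme/path."""
    return f"http://{target}{metrics_path}"

print(scrape_url(SCRAPE_JOBS["otel-collector-exporter"]))
# → http://otel-collector:8889/metrics
```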
2 changes: 2 additions & 0 deletions docker-compose.yml
@@ -143,6 +143,7 @@ services:
networks:
- observability


volumes:
prometheus_data:
elasticsearch_data:
@@ -151,4 +152,5 @@ volumes:

networks:
observability:
name: observability
driver: bridge
12 changes: 12 additions & 0 deletions o11y-playground/o11y-python/.dockerignore
@@ -0,0 +1,12 @@
__pycache__
*.pyc
*.pyo
*.pyd
*.pytest_cache
.idea
.vscode
.env
.cache
.venv
build/
dist/
24 changes: 24 additions & 0 deletions o11y-playground/o11y-python/.gitignore
@@ -0,0 +1,24 @@
# Local Python artifacts
__pycache__/
*.py[cod]
*.pyo
*.pyd
*.so

# Virtual environments
.venv/
venv/

# Tooling
.mypy_cache/
.pytest_cache/
.coverage
htmlcov/

# Editors
.vscode/
.idea/

# Env
.env
.env.*
32 changes: 32 additions & 0 deletions o11y-playground/o11y-python/Dockerfile
@@ -0,0 +1,32 @@
FROM python:3.11-slim

LABEL org.opencontainers.image.title="o11y-python demo" \
org.opencontainers.image.description="FastAPI demo app emitting OTLP traces/metrics/logs to an OpenTelemetry Collector" \
org.opencontainers.image.source="https://github.com/vigneshragupathy/observability-toolkit" \
org.opencontainers.image.licenses="Apache-2.0"

WORKDIR /app

# Create non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser

# Install dependencies (copy first for better layer cache)
COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt

# Copy application
COPY app.py .
RUN chown -R appuser:appuser /app

ENV OTEL_EXPORTER_OTLP_ENDPOINT=otel-collector:4317 \
OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
OTEL_SERVICE_NAME=o11y-python \
PYTHONUNBUFFERED=1 \
APP_LOG_LEVEL=info

EXPOSE 8000

USER appuser

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
99 changes: 99 additions & 0 deletions o11y-playground/o11y-python/README.md
@@ -0,0 +1,99 @@
> NOTE: This demo app is intentionally independent from the core observability stack.
> Start the toolkit (root directory) and this app separately. They communicate only
> via the shared Docker network `observability` and the OTLP endpoint `otel-collector:4317`.
# Demo Python Observability App

Independent FastAPI application instrumented with OpenTelemetry to emit metrics, logs, and traces to an existing Observability Toolkit stack (Prometheus, Grafana, Jaeger, Elasticsearch/Kibana) via the OpenTelemetry Collector.

It connects by joining the shared external Docker network named `observability` and sending OTLP gRPC data to `otel-collector:4317`.

## Endpoints
* `/` – Simple hello; creates a span with random work.
* `/healthz` – Liveness probe.
* `/readyz` – Readiness probe.
* `/work` – CPU loop; span events + metrics.
* `/error` – Intentionally triggers a division error to showcase error span + logs.

## Data Flow
App -> OTLP gRPC -> OTel Collector ->
* Traces: Jaeger UI + Elasticsearch index `otel-traces`
* Metrics: Prometheus (collector exporter) -> Grafana
* Logs: Elasticsearch index `otel-logs` (Kibana Discover)

## Prerequisites
Start the Observability Toolkit stack (in its own project directory):
```bash
./manage-stack.sh start
```
Ensure the network exists (toolkit compose names it automatically):
```bash
docker network create observability 2>/dev/null || true
```

## Run (Standalone - demo only)
From this directory (creates/uses external network `observability` if present):
```bash
chmod +x run.sh # first time
./run.sh up
```
Stop / remove:
```bash
./run.sh down
```

## Run with Full Observability Stack (Independent)
1. In repository root: `./manage-stack.sh start`
2. In this directory: `docker compose up -d --build`
(or `./run.sh up`)
3. If you start the app before the stack, telemetry export logs warnings until the collector is up; it recovers automatically once the collector is reachable.
4. Generate traffic (see below). No edits to the root compose file are required.

## Generate Test Traffic
```bash
curl http://localhost:8000/
curl http://localhost:8000/work
curl http://localhost:8000/error
for i in {1..20}; do curl -s http://localhost:8000/work > /dev/null; done
# or use script helper:
./run.sh traffic
```

## Where to Observe
* Prometheus: http://localhost:9090 (metrics: `demo_requests_total`, `demo_request_latency_ms`, `demo_random_value`)
* Grafana: http://localhost:3000 (build a dashboard with those metrics)
* Jaeger: http://localhost:16686 (service: `o11y-python`)
* Kibana: http://localhost:5601 (Discover index: `otel-logs*`, filter by `resource.service.name`)

## Multiple Demo Apps
To add more sample services independently:
1. Copy this folder (e.g., `cp -r o11y-python o11y-python-2`).
2. Change `OTEL_SERVICE_NAME` in its compose file or Dockerfile env.
3. Bring it up with `docker compose up -d --build` inside the new folder.
4. Ensure it uses the `observability` network (external) so it can reach `otel-collector`.

Each will appear as a distinct service in Jaeger / metrics / logs without modifying the core stack compose.

## Notes
* Runs as non-root user inside container.
* Metrics exported via OTLP to the collector then exposed via Prometheus exporter (port 8889).
* Logs use BatchLogRecordProcessor; gauge emits random values for demonstration.
* Health endpoints (`/healthz`, `/readyz`) provided for orchestration probes.
* Adjustable log level via `APP_LOG_LEVEL` environment variable.

### Environment Overrides
| Variable | Purpose | Default |
|----------|---------|---------|
| OTEL_EXPORTER_OTLP_ENDPOINT | Collector gRPC endpoint | otel-collector:4317 |
| OTEL_EXPORTER_OTLP_PROTOCOL | OTLP protocol | grpc |
| OTEL_SERVICE_NAME | Service name resource attr | o11y-python |
| APP_LOG_LEVEL | Logging level | info |
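The overrides resolve with simple environment-first precedence. A minimal sketch of that lookup (a hypothetical helper for illustration, not code from the app):

```python
import os

# Defaults mirror the table above; the environment takes precedence.
DEFAULTS = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "otel-collector:4317",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_SERVICE_NAME": "o11y-python",
    "APP_LOG_LEVEL": "info",
}

def setting(name: str) -> str:
    """Environment value if set, otherwise the documented default."""
    return os.environ.get(name, DEFAULTS[name])

print(setting("OTEL_SERVICE_NAME"))
```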

### Security (Demo Caveats)
No auth/TLS; not suitable for production without hardening (dependency pinning review, resource limits, structured log routing, input validation, etc.).

## Cleanup
```bash
docker compose down
```

This does not stop the core observability stack (managed separately).