AIOps observability service — instruments distributed workflow executions, traces each step, detects bottlenecks against historical baselines, exposes Prometheus metrics, visualizes in Grafana, and fires Slack alerts.
In enterprise automation platforms, long-running workflows can degrade silently: a single slow step can hold up the whole execution. This service gives you full execution visibility:
- Every step is traced with millisecond precision
- Bottlenecks are auto-detected by comparing each step against its own historical average
- Slack alerts fire as soon as a completed step exceeds its bottleneck threshold, so slowdowns surface before users report them
- Metrics are scraped by Prometheus and visualized in a pre-built Grafana dashboard
Client (automation platform)
│
▼
FastAPI (main.py)
├── POST /workflows → start workflow execution
├── POST /workflows/{id}/steps → start a step
├── PUT /workflows/{id}/steps/{sid} → complete step → bottleneck check → Slack alert
├── PUT /workflows/{id} → complete workflow
├── GET /workflows/{id}/bottlenecks → on-demand bottleneck report
├── GET /stats/{workflow_name} → historical step averages
└── GET /metrics → Prometheus text format
│
▼
Prometheus (port 9090) ← scrapes /metrics every 15s; Grafana (port 3000) reads from Prometheus
│
PostgreSQL
├── workflow_executions
└── workflow_steps
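Read top to bottom, the endpoint list suggests the order in which a client would call the API for a single workflow. The sketch below only builds that call sequence. The `lifecycle_requests` helper is hypothetical, request and response bodies are left out because this README does not document them, and the two IDs are assumed to come back from the two POST calls.

```python
from typing import List, Tuple

BASE_URL = "http://localhost:8000"  # local API address used in this README


def lifecycle_requests(workflow_id: str, step_id: str) -> List[Tuple[str, str]]:
    """Return the (method, URL) sequence for tracing one step of one workflow.

    Assumption: workflow_id and step_id are returned by the POST calls;
    they are passed in here so the sequence can be shown without a server.
    """
    wf = f"{BASE_URL}/workflows/{workflow_id}"
    return [
        ("POST", f"{BASE_URL}/workflows"),    # start workflow execution
        ("POST", f"{wf}/steps"),              # start a step
        ("PUT", f"{wf}/steps/{step_id}"),     # complete step (triggers bottleneck check)
        ("PUT", wf),                          # complete workflow
        ("GET", f"{wf}/bottlenecks"),         # on-demand bottleneck report
    ]


if __name__ == "__main__":
    for method, url in lifecycle_requests("wf_9c3e1a", "step_1"):
        print(method, url)
```

A workflow with several steps would repeat the second and third calls once per step before completing the workflow.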
| Feature | Detail |
|---|---|
| Step-level tracing | Every step traced with start/end timestamps and duration in ms |
| Bottleneck detection | Steps > 2× historical average → WARNING; > 3× → CRITICAL |
| Prometheus metrics | /metrics endpoint — workflow count, step durations, error rates |
| Grafana dashboard | Pre-built dashboard JSON auto-provisioned via Docker Compose |
| Slack alerting | Structured alerts with workflow ID, step name, severity, and duration |
| One-command start | Full stack (API + PostgreSQL + Prometheus + Grafana) via docker-compose up |
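The bottleneck rule in the table can be sketched in a few lines. This is an illustration of the stated thresholds, not the service's actual code: the `classify_step` name is hypothetical, and two behaviours are assumptions, namely that the comparisons are strict (exactly 2× is not flagged) and that a step with no usable baseline is never flagged.

```python
from typing import Optional, Tuple

WARNING_FACTOR = 2.0   # more than 2x the historical average
CRITICAL_FACTOR = 3.0  # more than 3x the historical average


def classify_step(
    duration_ms: float, historical_avg_ms: Optional[float]
) -> Tuple[Optional[str], Optional[float]]:
    """Return (severity, factor) for a completed step.

    severity is None when the step is within the thresholds.
    Assumption: with no positive baseline, nothing is flagged.
    """
    if historical_avg_ms is None or historical_avg_ms <= 0:
        return None, None
    factor = duration_ms / historical_avg_ms
    if factor > CRITICAL_FACTOR:
        return "CRITICAL", factor
    if factor > WARNING_FACTOR:
        return "WARNING", factor
    return None, factor


if __name__ == "__main__":
    print(classify_step(2100, 420))  # a step at 5x its average is CRITICAL
```

Checking the CRITICAL threshold first matters, since every step above 3× is also above 2×.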
# Start the full stack
docker-compose up --build
# Run a simulated workflow with randomized slow steps
curl -X POST "http://localhost:8000/simulate?workflow_name=demo"
# Sample response
{
"workflow_id": "wf_9c3e1a",
"workflow_name": "demo",
"total_duration_ms": 4823,
"steps_completed": 5,
"bottlenecks_detected": [
{
"step_name": "data_transform",
"duration_ms": 2100,
"historical_avg_ms": 420,
"severity": "CRITICAL",
"factor": 5.0
}
]
}

Open Grafana → http://localhost:3000 (admin / admin) to see live metrics.
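The `bottlenecks_detected` entries in the sample response carry the fields the Slack alerts are described as containing (workflow ID, step name, severity, duration). As a rough sketch, they could be turned into Slack Incoming Webhook payloads like this. The `slack_payloads` helper and the message wording are hypothetical; the only Slack detail relied on is that an incoming webhook accepts a JSON body with a `text` field.

```python
import json
from typing import Any, Dict, List


def slack_payloads(response: Dict[str, Any]) -> List[Dict[str, str]]:
    """Build one Slack Incoming Webhook payload per detected bottleneck.

    Assumption: the response has the shape of the sample shown in this README.
    """
    payloads = []
    for b in response.get("bottlenecks_detected", []):
        text = (
            f"[{b['severity']}] workflow {response['workflow_id']} "
            f"step {b['step_name']} took {b['duration_ms']} ms "
            f"({b['factor']}x the historical average of {b['historical_avg_ms']} ms)"
        )
        payloads.append({"text": text})
    return payloads


if __name__ == "__main__":
    sample = {
        "workflow_id": "wf_9c3e1a",
        "bottlenecks_detected": [
            {
                "step_name": "data_transform",
                "duration_ms": 2100,
                "historical_avg_ms": 420,
                "severity": "CRITICAL",
                "factor": 5.0,
            }
        ],
    }
    # Each payload would be POSTed as JSON to the webhook URL from .env.
    print(json.dumps(slack_payloads(sample), indent=2))
```

Sending is left out on purpose, since the webhook URL is a secret that belongs in `.env`.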
docker-compose up --build

| Service | URL | Notes |
|---|---|---|
| API | http://localhost:8000 | FastAPI app |
| API docs | http://localhost:8000/docs | Interactive Swagger UI |
| Prometheus | http://localhost:9090 | Scrapes /metrics every 15s |
| Grafana | http://localhost:3000 | admin / admin — auto dashboard |
# Stop and remove containers
docker-compose down
# Also remove volumes (clears DB data)
docker-compose down -v

git clone https://github.com/shubhamwagdarkar/workflow-execution-tracer.git
cd workflow-execution-tracer
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # fill in PostgreSQL + Slack details
uvicorn main:app --reload

| Layer | Technology |
|---|---|
| API | FastAPI + Uvicorn |
| Database | PostgreSQL + psycopg2 |
| Alerting | Slack Incoming Webhooks (slack-sdk) |
| Observability | Prometheus metrics + Grafana dashboard |
| Containerization | Docker + Docker Compose |
| Testing | pytest |