CI/CD Reliability Analytics Engine for Recurring Pipeline Failures
AutoOps Insight analyzes CI/CD execution logs, detects recurring failure patterns, exports reliability metrics, and generates structured summaries to reduce mean time to diagnosis (MTTD).
Designed as a reliability signal extraction layer for DevOps and platform teams.
CI/CD systems fail repeatedly for the same root causes:
- Dependency resolution issues
- Test regressions
- Build tool misconfigurations
- Environment drift
- Resource exhaustion
But failure patterns are often buried inside verbose logs.
AutoOps Insight transforms unstructured logs into:
- Structured failure classification
- Prometheus-exported reliability metrics
- Human-readable failure summaries
- Recurrence detection signals
Focus: make CI failures observable, measurable, and trendable.
Architecture:
- React frontend (log upload + dashboard)
- FastAPI backend (log parsing + classification)
- ML classifier (TF-IDF + Logistic Regression)
- Prometheus metrics endpoint
- Optional LLM-based summarizer
Flow:
Logs → Feature Extraction → Failure Classification → Metrics Export → Dashboard Visualization
Failure classifier:
- TF-IDF vectorization
- Logistic Regression classifier
- Predicts common CI/CD failure categories
- Extensible label taxonomy
Example failure classes:
- Dependency Error
- Test Failure
- Compilation Error
- Timeout
- Configuration Error
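The classifier described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual training code: the `TRAINING_LOGS` samples, the `train_classifier` helper, and the vectorizer settings are all assumptions made for the example, and a usable model would need far more labelled logs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training snippets, two per failure class listed above.
TRAINING_LOGS = [
    ("ERROR: Could not find a version that satisfies the requirement foo", "Dependency Error"),
    ("npm ERR! 404 Not Found: package missing from registry", "Dependency Error"),
    ("FAILED tests/test_api.py::test_login - AssertionError", "Test Failure"),
    ("2 failed, 40 passed in 12.3s", "Test Failure"),
    ("error: expected ';' before '}' token", "Compilation Error"),
    ("javac: cannot find symbol class Foo", "Compilation Error"),
    ("Job exceeded maximum execution time of 60 minutes", "Timeout"),
    ("step timed out after 600 seconds with no output", "Timeout"),
    ("Invalid workflow file: unexpected key 'stpes'", "Configuration Error"),
    ("environment variable DATABASE_URL is not set", "Configuration Error"),
]


def train_classifier(samples):
    """Fit a TF-IDF + Logistic Regression pipeline on (text, label) pairs."""
    texts = [text for text, _ in samples]
    labels = [label for _, label in samples]
    model = make_pipeline(
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),  # assumed settings
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model


model = train_classifier(TRAINING_LOGS)

log_text = "pytest: 3 failed, 10 passed"
predicted = model.predict([log_text])[0]

# Map each class to its predicted probability for this log.
classes = model.named_steps["logisticregression"].classes_
probabilities = dict(zip(classes, model.predict_proba([log_text])[0]))
print(predicted, round(probabilities[predicted], 3))
```

Extending the label taxonomy in this sketch only requires adding samples with a new label and retraining; no code changes are needed.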
Recurrence detection:
- Aggregates classified failures
- Identifies repeating error signatures
- Enables trend tracking across runs
- Designed for MTTR reduction workflows
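One way to identify repeating error signatures is to mask the parts of an error line that vary between runs (timestamps, addresses, numbers) and count what remains. The sketch below assumes that approach; the masking rules and the `signature` and `recurring_signatures` helpers are hypothetical and not taken from the project.

```python
import re
from collections import Counter

# Hypothetical normalisation rules: mask details that vary between runs.
# Order matters: timestamps and hex values are masked before bare digits.
_MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<ts>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]


def signature(error_line: str) -> str:
    """Reduce an error line to a run-independent signature."""
    sig = error_line.strip()
    for pattern, token in _MASKS:
        sig = pattern.sub(token, sig)
    return re.sub(r"\s+", " ", sig.lower())


def recurring_signatures(error_lines, min_count=2):
    """Return signatures seen at least `min_count` times, with their counts."""
    counts = Counter(signature(line) for line in error_lines if line.strip())
    return {sig: n for sig, n in counts.items() if n >= min_count}


error_lines = [
    "2024-05-01T10:00:00Z ERROR connection to 10.0.0.5 timed out after 30s",
    "2024-05-02T11:30:12Z ERROR connection to 10.0.0.9 timed out after 45s",
    "2024-05-02T11:31:00Z ERROR ModuleNotFoundError: No module named 'requests'",
]
recurring = recurring_signatures(error_lines)
print(recurring)
```

The two timeout lines collapse to the same signature even though their timestamps, addresses, and durations differ, so they count as one recurring failure.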
/metrics endpoint exposes:
ci_failure_total{type="dependency_error"}
ci_pipeline_runs_total
ci_failure_rate
ci_failure_recurring_total
Integrates directly with Grafana dashboards.
Operational value: Convert CI reliability into measurable SLO-aligned signals.
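The metric names above could be registered with the `prometheus_client` library roughly as shown here. This is a sketch under assumptions: the `record_run` helper, the help strings, and the definition of `ci_failure_rate` as failed runs divided by analysed runs are illustrative choices, not confirmed details of the project.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, generate_latest

# A dedicated registry keeps this example isolated from the global default.
registry = CollectorRegistry()

FAILURES = Counter(
    "ci_failure_total", "Classified CI failures by type", ["type"], registry=registry
)
RUNS = Counter("ci_pipeline_runs_total", "Pipeline runs analysed", registry=registry)
RECURRING = Counter(
    "ci_failure_recurring_total",
    "Failures whose signature was seen before",
    registry=registry,
)
# Assumed definition: failed runs / analysed runs.
FAILURE_RATE = Gauge(
    "ci_failure_rate", "Failed runs divided by analysed runs", registry=registry
)

_state = {"runs": 0, "failures": 0}


def record_run(failure_type=None, recurring=False):
    """Hypothetical helper: update all metrics for one analysed pipeline run."""
    _state["runs"] += 1
    RUNS.inc()
    if failure_type is not None:
        _state["failures"] += 1
        FAILURES.labels(type=failure_type).inc()
        if recurring:
            RECURRING.inc()
    FAILURE_RATE.set(_state["failures"] / _state["runs"])


record_run()                                    # successful run
record_run("dependency_error")                  # first failure of this type
record_run("dependency_error", recurring=True)  # same failure again

# Text exposition format, as a /metrics endpoint would serve it.
metrics_text = generate_latest(registry).decode("utf-8")
print(metrics_text)
```

In a FastAPI service, `metrics_text` would typically be returned from the `/metrics` route with a plain-text content type so that Prometheus can scrape it.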
Two modes:
- Deterministic keyword-based summarizer
- Optional LLM-based summarizer (API-key gated)
Goal: compress large CI logs into actionable summaries.
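The deterministic mode could work along these lines: keep only lines that contain failure keywords, drop exact duplicates, and cap the output. The keyword list, the `summarize` function, and its `max_lines` parameter are assumptions for illustration; the project's actual rules are not specified here.

```python
import re

# Hypothetical keyword list; matched case-insensitively on whole words.
KEYWORDS = re.compile(
    r"\b(error|failed|failure|exception|timeout|timed out|fatal)\b",
    re.IGNORECASE,
)


def summarize(log_text: str, max_lines: int = 5) -> str:
    """Return a short bullet summary of the lines that mention a failure keyword."""
    seen = set()
    highlights = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line or not KEYWORDS.search(line):
            continue
        if line in seen:  # skip exact repeats of the same message
            continue
        seen.add(line)
        highlights.append(line)

    if not highlights:
        return "No failure indicators found."

    shown = highlights[:max_lines]
    omitted = len(highlights) - len(shown)
    summary = "\n".join(f"- {line}" for line in shown)
    if omitted:
        summary += f"\n(+{omitted} more matching lines)"
    return summary


sample_log = """Step 1/4: install
ERROR: No matching distribution found for foo==9.9
Step 2/4: test
ERROR: No matching distribution found for foo==9.9
Build failed with exit code 1
"""
summary = summarize(sample_log)
print(summary)
```

Because the output depends only on the input text and fixed rules, the same log always yields the same summary, which is what distinguishes this mode from the LLM-based one.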
Frontend:
- React (Vite)
- Tailwind CSS
- Axios
Backend:
- FastAPI
- Python
- scikit-learn
- python-dotenv
Machine Learning:
- TF-IDF
- Logistic Regression classifier
Observability:
- Prometheus metrics endpoint
- Docker-ready deployment
Typical workflow:
- Upload CI/CD log file
- System predicts failure category
- Failure count increments in Prometheus
- Summary generated for fast diagnosis
- Reliability metrics visible via /metrics
AutoOps Insight was built around:
- Deterministic classification pipeline
- Observable reliability signals
- Low-latency inference
- Extendable failure taxonomy
- Production-friendly integration
It is intentionally structured as an analytics layer, not just a dashboard.
Performance characteristics:
- Inference latency: ~5–20ms per log file (local benchmark, small dataset)
- Stateless classification service
- Constant-time metrics export via Prometheus client
- Handles multi-thousand line logs without blocking frontend
Designed for lightweight integration into existing CI environments.
Failure handling:
- Graceful handling of malformed or empty logs
- Classifier fallback to "Unknown" category
- Metrics export isolated from inference logic
- Optional LLM summarizer fully decoupled from classification pipeline
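The fallback behaviour for empty or malformed logs might look like the following wrapper. It is a sketch only: `classify_with_fallback`, the `predict_proba` callable it expects, and the `min_confidence` threshold are hypothetical and stand in for whatever the real service uses.

```python
from typing import Callable, Dict, Optional

UNKNOWN = "Unknown"


def classify_with_fallback(
    log_text: Optional[str],
    predict_proba: Callable[[str], Dict[str, float]],
    min_confidence: float = 0.5,  # assumed threshold
) -> str:
    """Return the most likely label, or "Unknown" when the input or the
    prediction cannot be trusted."""
    # Empty or whitespace-only logs carry no signal.
    if log_text is None or not log_text.strip():
        return UNKNOWN
    try:
        scores = predict_proba(log_text)
    except Exception:
        # A classifier error should not fail the whole request.
        return UNKNOWN
    if not scores:
        return UNKNOWN
    label, confidence = max(scores.items(), key=lambda item: item[1])
    return label if confidence >= min_confidence else UNKNOWN


def stub_predictor(text: str) -> Dict[str, float]:
    """Stand-in for a trained model, used only to exercise the wrapper."""
    if "timed out" in text:
        return {"Timeout": 0.9, "Test Failure": 0.1}
    return {"Timeout": 0.3, "Test Failure": 0.3, "Dependency Error": 0.4}


print(classify_with_fallback("step timed out after 600s", stub_predictor))
print(classify_with_fallback("   ", stub_predictor))
print(classify_with_fallback("something odd happened", stub_predictor))
```

Keeping the fallback in a wrapper rather than inside the model means the metrics and summarizer paths always receive a valid label, even when inference fails.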
git clone https://github.com/kritibehl/AutoOps-Insight.git
cd AutoOps-Insight
Backend
cd backend
pip install -r requirements.txt
uvicorn main:app --reload
Frontend
cd frontend
npm install
npm run dev
Prometheus Test
curl http://localhost:8000/metrics
Future Extensions
- Historical run storage (PostgreSQL)
- Failure fingerprinting via hashing
- Time-series trend analysis
- CI plugin integration (GitHub Actions / Jenkins)
- Alerting hooks (Slack / Webhooks)
- SLO-based CI reliability scoring