Skip to content

Shyamistic/sentinel-g

Repository files navigation

SENTINEL-G: LLM Reliability Detection

Production-grade failure detection for LLM-powered applications with deterministic business impact quantification and Datadog integration.

Overview

SENTINEL-G detects LLM failures before customers see them, quantifying real business impact in dollars and automating recovery.

Built for the Datadog Challenge — real-time event streaming, deterministic detection rules, and actionable incident response.

Key Features

Deterministic Failure Classification — 10 failure modes (hallucination, latency, cost explosion, diversity collapse, etc.) ✅ Business Impact in $USD — Real-time 24h revenue loss projection ✅ Failure Lineage — Temporal degradation timeline (t-12m → t+0m) ✅ Automated Recovery — Ranked actions with confidence scoring & execution time ✅ Datadog Integration — Real-time events, monitors, incidents (202 API response verified) ✅ Live Dashboard — React frontend showing detection & recovery in real-time

Architecture

┌─────────────────────────────────────────────────────────┐ │ React Frontend (5173) │ │ Dashboard: Metrics | Timeline | Impact │ └────────────────────┬────────────────────────────────────┘ │ ↓ POST /test-failure ┌─────────────────────────────────────────────────────────┐ │ FastAPI Backend (8000) │ │ │ │ ├→ Failure Classifier (Deterministic) │ │ ├→ Business Impact Calculator ($USD) │ │ ├→ Recovery Recommender │ │ └→ Datadog Events API │ └─────────────────┬─────────────────────────────────────┘ │ ↓ POST /api/v1/events ┌────────────────────────┐ │ Datadog (Event API) │ │ - Events Created │ │ - Monitors Triggered │ │ - Incidents Auto │ └────────────────────────┘

text

Quick Start

Prerequisites

  • Python 3.9+
  • Node.js 16+
  • Datadog Account (API + App Key)

Backend Setup

cd sentinel-g pip install -r requirements.txt

Set environment variables set PYTHONPATH=. set DATADOG_API_KEY=your-api-key set DATADOG_APP_KEY=your-app-key set DATADOG_SITE=datadoghq.com

Run uvicorn src.api.main:app --reload --port 8000

text

Frontend Setup

cd frontend npm install npm run dev

text

Visit: http://localhost:5173

Demo Flow

  1. Click "Trigger Hallucination" → Dashboard turns RED
  2. Backend logs show → "✓ Datadog Event Created" (202 response)
  3. Event flows to Datadog → Real-time visibility
  4. Click "Apply Fix" → System recovers, dashboard GREEN
  5. Metrics update → Confidence 0.92, Latency 1500ms

Failure Classes

Class Confidence Latency Diversity Revenue Risk (24h)
HALLUCINATION_RISK 0.52 ↓ 3200ms ↑ 0.45 ↓ $122,638
LATENCY_ANOMALY 0.68 4500ms ↑ 0.70 $91,492
COST_EXPLOSION 0.65 2800ms 0.65 $85,000

See full taxonomy: docs/taxonomy.md

Datadog Integration

Events API

Failure detected → POST to /api/v1/events with:

  • Title: 🔴 LLM Failure Detected: {CLASS}
  • Text: Full business impact breakdown
  • Tags: service:sentinel-g, failure:{class}, impact:high
  • Response: 202 (success)

Monitor Setup

Create monitor in Datadog: Query: source:sentinel-g AND failure:* Trigger: At least 1 match in 5 minutes Severity: High

text

Incident Auto-Creation

When monitor fires → Datadog auto-creates incident with:

  • Failure class
  • Business impact ($USD)
  • Recommended recovery actions
  • Execution time & confidence

API Endpoints

POST /test-failure?failure_type={type}

Simulate a failure. Types: hallucination, latency, cost

Response: { "request_id": "req-1766390913590", "failure_type": "hallucination", "classification": { "primary_class": "HALLUCINATION_RISK", "confidence": 0.52, "latency_ms": 3200 }, "failure_lineage": [...], "risk_attribution": { "calculation": { "projected_24h_revenue_lost": 122638 } }, "recovery_options": [...] }

text

POST /apply-fix?request_id={id}&action={action}

Apply recovery action. Datadog event created for recovery.

Response: { "request_id": "req-...", "status": "HEALTHY", "recovery_action": "Fallback to Claude 3.5 Sonnet", "confidence_recovered": 0.92, "latency_normalized_ms": 1500 }

text

GET /health

Health check endpoint.

Project Structure

sentinel-g/ ├── src/ │ ├── api/ │ │ └── main.py (FastAPI app, endpoints) │ ├── core/ │ │ ├── config.py (Config, env vars) │ │ ├── models.py (Pydantic models) │ │ └── telemetry.py (Telemetry, early warnings) │ └── sentinel/ │ ├── failure_classifier.py (Deterministic classification) │ ├── incident_enrichment.py (Incident builder) │ └── risk_attribution.py (Business impact calc) ├── frontend/ │ ├── src/ │ │ ├── App.jsx (Main React component) │ │ ├── components/ │ │ │ ├── MetricsCard.jsx │ │ │ ├── IncidentTimeline.jsx │ │ │ ├── BusinessImpact.jsx │ │ │ ├── RecoveryActions.jsx │ │ │ └── EarlyWarnings.jsx │ │ └── index.css (Tailwind styles) │ └── package.json ├── docs/ │ ├── architecture.md │ ├── taxonomy.md │ └── deployment.md ├── datadog/ │ └── detection_rules.md ├── requirements.txt ├── .env.example └── README.md

text

Submission: Datadog Challenge

How SENTINEL-G Meets Requirements:

LLM App — Uses Gemini 2.0 Flash (simulated for demo) ✅ Telemetry to Datadog — Real-time events via REST API (202 response) ✅ Detection Rules — Deterministic thresholds (confidence, latency, diversity, tokens) ✅ Dashboard — React frontend showing health + signals ✅ Actionable — Ranked recovery actions triggered in Datadog incidents

Judge Proof:

  • Backend logs: ✓ Datadog Event Created
  • Response code: 202 (API success)
  • Frontend demo: Failure detection → Recovery
  • Business impact: $194K quantified

Performance

Metric Value
Failure Detection Latency <100ms
Datadog Event Delivery <2s
Dashboard Update Real-time
Recovery Execution 2-30 seconds

Roadmap

  • Full Datadog tracing integration
  • Machine learning confidence scoring
  • Multi-model failover
  • Cost optimization engine
  • SLA tracking & reporting

License

MIT License — See LICENSE

Team

Built for Datadog Challenge 2025 — LLM Reliability & Observability


SENTINEL-G: Production-Ready LLM Reliability. Deployed with Datadog.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors