SENTINEL-G: LLM Reliability Detection

Production-grade failure detection for LLM-powered applications with deterministic business impact quantification and Datadog integration.

Overview

SENTINEL-G detects LLM failures before customers see them, quantifying real business impact in dollars and automating recovery.

Built for the Datadog Challenge — real-time event streaming, deterministic detection rules, and actionable incident response.

Key Features

✅ Deterministic Failure Classification — 10 failure modes (hallucination, latency, cost explosion, diversity collapse, etc.) ✅ Business Impact in $USD — Real-time 24h revenue loss projection ✅ Failure Lineage — Temporal degradation timeline (t-12m → t+0m) ✅ Automated Recovery — Ranked actions with confidence scoring & execution time ✅ Datadog Integration — Real-time events, monitors, incidents (202 API response verified) ✅ Live Dashboard — React frontend showing detection & recovery in real-time

Architecture

┌─────────────────────────────────────────────────────────┐ │ React Frontend (5173) │ │ Dashboard: Metrics | Timeline | Impact │ └────────────────────┬────────────────────────────────────┘ │ ↓ POST /test-failure ┌─────────────────────────────────────────────────────────┐ │ FastAPI Backend (8000) │ │ │ │ ├→ Failure Classifier (Deterministic) │ │ ├→ Business Impact Calculator ($USD) │ │ ├→ Recovery Recommender │ │ └→ Datadog Events API │ └─────────────────┬─────────────────────────────────────┘ │ ↓ POST /api/v1/events ┌────────────────────────┐ │ Datadog (Event API) │ │ - Events Created │ │ - Monitors Triggered │ │ - Incidents Auto │ └────────────────────────┘

text

Quick Start

Prerequisites

Python 3.9+
Node.js 16+
Datadog Account (API + App Key)

Backend Setup

cd sentinel-g pip install -r requirements.txt

Set environment variables set PYTHONPATH=. set DATADOG_API_KEY=your-api-key set DATADOG_APP_KEY=your-app-key set DATADOG_SITE=datadoghq.com

Run uvicorn src.api.main:app --reload --port 8000

text

Frontend Setup

cd frontend npm install npm run dev

text

Visit: http://localhost:5173

Demo Flow

Click "Trigger Hallucination" → Dashboard turns RED
Backend logs show → "✓ Datadog Event Created" (202 response)
Event flows to Datadog → Real-time visibility
Click "Apply Fix" → System recovers, dashboard GREEN
Metrics update → Confidence 0.92, Latency 1500ms

Failure Classes

Class	Confidence	Latency	Diversity	Revenue Risk (24h)
HALLUCINATION_RISK	0.52 ↓	3200ms ↑	0.45 ↓	$122,638
LATENCY_ANOMALY	0.68	4500ms ↑	0.70	$91,492
COST_EXPLOSION	0.65	2800ms	0.65	$85,000

See full taxonomy: docs/taxonomy.md

Datadog Integration

Events API

Failure detected → POST to /api/v1/events with:

Title: 🔴 LLM Failure Detected: {CLASS}
Text: Full business impact breakdown
Tags: service:sentinel-g, failure:{class}, impact:high
Response: 202 (success)

Monitor Setup

Create monitor in Datadog: Query: source:sentinel-g AND failure:* Trigger: At least 1 match in 5 minutes Severity: High

text

Incident Auto-Creation

When monitor fires → Datadog auto-creates incident with:

Failure class
Business impact ($USD)
Recommended recovery actions
Execution time & confidence

API Endpoints

`POST /test-failure?failure_type={type}`

Simulate a failure. Types: hallucination, latency, cost

Response: { "request_id": "req-1766390913590", "failure_type": "hallucination", "classification": { "primary_class": "HALLUCINATION_RISK", "confidence": 0.52, "latency_ms": 3200 }, "failure_lineage": [...], "risk_attribution": { "calculation": { "projected_24h_revenue_lost": 122638 } }, "recovery_options": [...] }

text

`POST /apply-fix?request_id={id}&action={action}`

Apply recovery action. Datadog event created for recovery.

Response: { "request_id": "req-...", "status": "HEALTHY", "recovery_action": "Fallback to Claude 3.5 Sonnet", "confidence_recovered": 0.92, "latency_normalized_ms": 1500 }

text

`GET /health`

Health check endpoint.

Project Structure

sentinel-g/ ├── src/ │ ├── api/ │ │ └── main.py (FastAPI app, endpoints) │ ├── core/ │ │ ├── config.py (Config, env vars) │ │ ├── models.py (Pydantic models) │ │ └── telemetry.py (Telemetry, early warnings) │ └── sentinel/ │ ├── failure_classifier.py (Deterministic classification) │ ├── incident_enrichment.py (Incident builder) │ └── risk_attribution.py (Business impact calc) ├── frontend/ │ ├── src/ │ │ ├── App.jsx (Main React component) │ │ ├── components/ │ │ │ ├── MetricsCard.jsx │ │ │ ├── IncidentTimeline.jsx │ │ │ ├── BusinessImpact.jsx │ │ │ ├── RecoveryActions.jsx │ │ │ └── EarlyWarnings.jsx │ │ └── index.css (Tailwind styles) │ └── package.json ├── docs/ │ ├── architecture.md │ ├── taxonomy.md │ └── deployment.md ├── datadog/ │ └── detection_rules.md ├── requirements.txt ├── .env.example └── README.md

text

Submission: Datadog Challenge

How SENTINEL-G Meets Requirements:

✅ LLM App — Uses Gemini 2.0 Flash (simulated for demo) ✅ Telemetry to Datadog — Real-time events via REST API (202 response) ✅ Detection Rules — Deterministic thresholds (confidence, latency, diversity, tokens) ✅ Dashboard — React frontend showing health + signals ✅ Actionable — Ranked recovery actions triggered in Datadog incidents

Judge Proof:

Backend logs: ✓ Datadog Event Created
Response code: 202 (API success)
Frontend demo: Failure detection → Recovery
Business impact: $194K quantified

Performance

Metric	Value
Failure Detection Latency	<100ms
Datadog Event Delivery	<2s
Dashboard Update	Real-time
Recovery Execution	2-30 seconds

Roadmap

License

MIT License — See LICENSE

Team

Built for Datadog Challenge 2025 — LLM Reliability & Observability

SENTINEL-G: Production-Ready LLM Reliability. Deployed with Datadog.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
datadog		datadog
docs		docs
frontend		frontend
infra		infra
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
ARCHITECTURE.png		ARCHITECTURE.png
Datadog_Integration_Proof.md		Datadog_Integration_Proof.md
LICENSE		LICENSE
README.md		README.md
datadog-mockup.html		datadog-mockup.html
netlify.toml		netlify.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SENTINEL-G: LLM Reliability Detection

Overview

Key Features

Architecture

Quick Start

Prerequisites

Backend Setup

Frontend Setup

Demo Flow

Failure Classes

Datadog Integration

Events API

Monitor Setup

Incident Auto-Creation

API Endpoints

`POST /test-failure?failure_type={type}`

`POST /apply-fix?request_id={id}&action={action}`

`GET /health`

Project Structure

Submission: Datadog Challenge

Performance

Roadmap

License

Team

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SENTINEL-G: LLM Reliability Detection

Overview

Key Features

Architecture

Quick Start

Prerequisites

Backend Setup

Frontend Setup

Demo Flow

Failure Classes

Datadog Integration

Events API

Monitor Setup

Incident Auto-Creation

API Endpoints

POST /test-failure?failure_type={type}

POST /apply-fix?request_id={id}&action={action}

GET /health

Project Structure

Submission: Datadog Challenge

Performance

Roadmap

License

Team

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`POST /test-failure?failure_type={type}`

`POST /apply-fix?request_id={id}&action={action}`

`GET /health`

Packages