Dillettant/incidentfox

 
 


IncidentFox 🦊

Your AI Copilot for Incident Response

Investigate incidents, find root causes, and suggest fixes — automatically

Try Free in Slack · 5-Min Docker Setup · Deploy for Your Team


IncidentFox is an open-source AI SRE that integrates with your observability stack, infrastructure, and collaboration tools. It automatically forms hypotheses, collects data from your systems, and reasons through to find root causes — all while you focus on the fix.

Built for production on-call — handles log sampling, alert correlation, anomaly detection, and dependency mapping so you don't have to.

[Screenshot: Slack Investigation. Investigate incidents directly from Slack]



What is IncidentFox?

IncidentFox is an AI SRE that helps root-cause production on-call issues and propose mitigations. It automatically forms hypotheses, collects information from your infrastructure, observability tools, and code, and reasons through to an answer.

It is Slack-first (see screenshot above), but it also works through the web UI, GitHub, PagerDuty, and the API.

Highly customizable — set up in minutes, and it self-improves by automatically learning and persisting your team's context.


Get Started

IncidentFox is open source (Apache 2.0). You can try it instantly in Slack, or deploy it yourself for full control. Pick the option that fits your needs:

| Option | Best For | Setup Time | Cost | Privacy | |
|---|---|---|---|---|---|
| Try Free | See it in action | Instant | Free | Our playground environment | Join Slack |
| Local Docker | Evaluate with your infra | 5 minutes | Free | Everything local | Setup Guide → |
| Managed (premium features) | Production, we handle ops | 30 minutes | Contact us (7-day free trial) | SaaS or on-prem, SOC 2 | Add to Slack |
| Self-Host (Open Core) | Production, full control | 30 minutes | Free | Everything local | Deployment Guide → |

New to IncidentFox? We recommend trying it in our Slack first — no setup required, see how it works instantly. Join Slack


Why IncidentFox

For Engineering Leaders: What this means for your team.

| Outcome | Impact |
|---|---|
| Faster Incident Resolution | Hours → minutes. Auto-correlates alerts, analyzes logs, traces dependencies. |
| 85-95% Less Alert Noise | Smart correlation finds root cause. Engineers focus on real problems. |
| Knowledge Retention | Learns your systems and runbooks. Knowledge stays when people leave. |
| Works on Day One | 300+ integrations. No months of setup: connect and go. |
| No Vendor Lock-In | Open source, bring your own LLM keys, deploy anywhere. |
| Gets Smarter Over Time | Learns from every investigation. Your expertise compounds. |

The bottom line: Less time firefighting, more time building.


Architecture Overview

┌───────────────────────────────────┐     ┌──────────────────────┐
│ Slack / GitHub / PagerDuty / API  │     │       Web UI         │
└─────────────────┬─────────────────┘     │   (dashboard, team   │
                  │ webhooks              │    management)       │
┌─────────────────▼─────────────────┐     └──────────┬───────────┘
│           Orchestrator            │                │
│  (routes webhooks, team lookup,   │                │
│    token auth, audit logging)     │                │
└────────┬─────────────────┬────────┘                │
         │                 │                         │
┌────────▼────────┐   ┌────▼─────────────────────────▼───┐
│      Agent      │<->│          Config Service          │
│ (Claude/OpenAI, │   │    (multi-tenant cfg, RBAC,      │
│   300+ tools,   │   │     routing, team hierarchy)     │
│   multi-agent)  │   └─────────────────┬────────────────┘
└────┬───────┬────┘                     │
     │       │                          ▼
     │       │              ┌───────────────────────┐
     │       │              │      PostgreSQL       │
     │       │              │    (config, audit,    │
     │       │              │    investigations)    │
     │       │              └───────────────────────┘
     │       │
     ▼       ▼
┌──────────┐ ┌─────────────────────────┐
│ Knowledge│ │      External APIs      │
│   Base   │ │   (K8s, AWS, Datadog,   │
│ (RAPTOR) │ │     Grafana, etc.)      │
└──────────┘ └─────────────────────────┘

[Screenshot: Web Console, the easiest way to view and customize agents]


Under the Hood

The engineering that makes IncidentFox actually work in production:

| Capability | What It Does | Why It Matters |
|---|---|---|
| RAPTOR Knowledge Base | Hierarchical tree structure (ICLR 2024): clusters → summarizes → abstracts | Standard RAG fails on 100-page runbooks. RAPTOR maintains context across long documents. |
| Smart Log Sampling | Statistics first → sample errors → drill down on anomalies | Other tools load 100K lines and hit context limits. We sample intelligently to stay useful. |
| Alert Correlation Engine | 3-layer analysis: temporal + topology + semantic | Groups alerts AND finds root cause. Reduces noise by 85-95%. |
| Prophet Anomaly Detection | Meta's Prophet algorithm with seasonality-aware forecasting | Detection accounts for daily/weekly patterns instead of relying only on static thresholds. |
| Dependency Discovery | Automatic service topology mapping with blast radius analysis | Know what's affected before you start investigating. No manual service maps needed. |
| 300+ Built-in Tools | Kubernetes, AWS, Azure, GCP, Grafana, Datadog, Prometheus, GitHub, and more | No "bring your own tools" setup. Works out of the box with your stack. |
| MCP Protocol Support | Connect to any MCP server for unlimited integrations | Add new tools in minutes via config, not code. |
| Multi-Agent Orchestration | Planner routes to specialist agents (K8s, AWS, Metrics, Code, etc.) | Complex investigations get handled by the right expert, not a generic agent. |
| Model Flexibility | Supports OpenAI and Claude SDKs, so you can use the model that fits your needs | No vendor lock-in. Switch models or use different models for different tasks. |
| Continuous Self-Improvement | Learns from investigations, persists patterns, builds team context | Gets smarter over time. Your past incidents inform future investigations. |
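
To make the Smart Log Sampling row concrete, here is a minimal sketch of the first two steps (statistics first, then a bounded sample of errors). It is an illustration under stated assumptions, not IncidentFox's actual implementation: the function name `summarize_and_sample`, the log-level regex, and the `max_errors` parameter are all hypothetical, and the drill-down step is left out.

```python
import random
import re
from collections import Counter

# Assumption: log lines carry a conventional upper-case level token.
LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARN|ERROR|FATAL)\b")


def summarize_and_sample(lines, max_errors=5, seed=0):
    """Return per-level counts plus a bounded sample of error lines.

    Hypothetical helper: the summary is small enough to hand to an LLM
    even when the raw log has hundreds of thousands of lines.
    """
    counts = Counter()
    errors = []
    for line in lines:
        match = LEVEL_RE.search(line)
        level = match.group(1) if match else "UNKNOWN"
        counts[level] += 1
        if level in ("ERROR", "FATAL"):
            errors.append(line)

    # Keep every error if there are few; otherwise take a random sample.
    rng = random.Random(seed)
    if len(errors) <= max_errors:
        sample = errors
    else:
        sample = rng.sample(errors, max_errors)

    return {"total": len(lines), "counts": dict(counts), "error_sample": sample}


if __name__ == "__main__":
    logs = ["2024-01-01 INFO request ok"] * 1000
    logs += [f"2024-01-01 ERROR db timeout {i}" for i in range(50)]
    print(summarize_and_sample(logs))
```

The point of the design is that the counts describe the whole log while the sample stays a fixed size, so the payload sent to the model does not grow with log volume.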

[Screenshot: RAPTOR knowledge base storing 50K+ docs as your proprietary knowledge]

Full technical details →
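
The blast radius analysis mentioned under Dependency Discovery can be pictured as a graph traversal. The sketch below assumes the topology is already known and stored as a plain dictionary; the `blast_radius` function and the service names are hypothetical and are not taken from the IncidentFox codebase.

```python
from collections import deque


def blast_radius(dependents, failed_service):
    """Return every service that transitively depends on failed_service.

    `dependents` maps a service to the services that call it directly
    (a hypothetical representation of the discovered topology).
    """
    affected = set()
    queue = deque([failed_service])
    while queue:
        service = queue.popleft()
        for downstream in dependents.get(service, ()):
            # Skip services already seen so cycles cannot loop forever.
            if downstream not in affected and downstream != failed_service:
                affected.add(downstream)
                queue.append(downstream)
    return affected


if __name__ == "__main__":
    # Illustrative topology only.
    topology = {
        "postgres": ["orders", "users"],
        "orders": ["checkout"],
        "users": ["checkout", "admin"],
        "checkout": ["web"],
    }
    print(sorted(blast_radius(topology, "postgres")))
```

A breadth-first walk visits direct dependents before indirect ones, which is one reasonable way to rank what to check first.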


Enterprise Ready

Security and compliance for production deployments:

| Feature | Description |
|---|---|
| SOC 2 Compliant | Audited security controls, data handling, and access management |
| Claude Sandbox | Isolated Kubernetes sandboxes for agent execution, with no shared state between runs |
| Secrets Proxy | Credentials never touch the agent. Envoy proxy injects secrets at request time. |
| Approval Workflows | Critical changes (prompts, tools, configs) require review before deployment |
| SSO/OIDC | Google, Azure AD, Okta, with per-organization configuration |
| Hierarchical Config | Org → Business Unit → Team inheritance with override capabilities |
| Audit Logging | Full trail of all agent actions, config changes, and investigations |
| On-Premise | Deploy entirely in your environment, with air-gapped support available |
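
As an illustration of the Hierarchical Config row, inheritance with overrides can be modeled as a layered deep merge in which the more specific level wins. This is a sketch under assumptions: `deep_merge`, `resolve_config`, and the config keys shown are hypothetical and do not describe the actual Config Service schema.

```python
def deep_merge(base, override):
    """Return a new dict where values in `override` win over `base`.

    Nested dicts are merged key by key; any other value is replaced.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(org, business_unit, team):
    """Apply Org -> Business Unit -> Team inheritance, most specific last."""
    return deep_merge(deep_merge(org, business_unit), team)


if __name__ == "__main__":
    # Hypothetical settings, for illustration only.
    org = {"model": "default-model", "tools": {"k8s": True, "aws": True}}
    business_unit = {"tools": {"aws": False}}
    team = {"model": "team-model"}
    print(resolve_config(org, business_unit, team))
```

Because each merge returns a new dictionary, the org-level and business-unit-level settings are left untouched and can be reused for other teams.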

Enterprise deployment guide →


Documentation

| Getting Started | Reference | Development |
|---|---|---|
| Quick Start | Features | Dev Guide |
| Deployment Guide | Integrations | Agent Architecture |
| Slack Setup (detailed) | Architecture | Tools Catalog |

Contributing

We welcome contributions! See issues labeled good first issue to get started.

For bugs or feature requests, open an issue on GitHub.


License

Apache License 2.0


See Also

Claude Code Plugin — Standalone SRE tools for individual developers using Claude Code CLI. Not connected to the IncidentFox platform above.


Connect with Us

Slack   LinkedIn   X - Jimmy   X - LongYi

Built with ❤️ by the IncidentFox team
