A sophisticated, multi-agent system for automated alert processing and incident response using Large Language Models (LLMs) and the Model Context Protocol (MCP).
The NOC Agent implements a supervisor + specialized agents pattern with MCP-based communication:
```
Alerts → Manager Agent → Metrics Evaluation → RCA Agent → Triage Agent → Notifications
              ↓                  ↓                ↓             ↓
         MCP Servers         Enrichment       Analysis      Ownership
```
- Manager Agent: Orchestrates workflow and makes routing decisions
- Metrics Evaluation Agent: Enriches alerts with metrics, logs, and traces
- RCA Agent: Performs root cause analysis using enriched data
- Triage Agent: Determines ownership, escalation, and notifications
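The supervisor pattern above can be sketched as a simple chain. The classes and functions below are illustrative stand-ins, not the project's actual agent API:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Alert:
    title: str
    severity: str
    context: dict = field(default_factory=dict)

# Hypothetical stand-ins for the specialized agents.
async def enrich(alert: Alert) -> Alert:    # Metrics Evaluation Agent
    alert.context["metrics"] = "cpu=97%"    # would query MCP servers
    return alert

async def analyze(alert: Alert) -> Alert:   # RCA Agent
    alert.context["root_cause"] = "runaway process"
    return alert

async def triage(alert: Alert) -> Alert:    # Triage Agent
    alert.context["owner"] = "platform-team"
    return alert

async def manager(alert: Alert) -> Alert:
    # The Manager Agent orchestrates the workflow; a real supervisor
    # could branch or short-circuit based on each stage's output.
    for stage in (enrich, analyze, triage):
        alert = await stage(alert)
    return alert

result = asyncio.run(manager(Alert("High CPU Usage", "high")))
print(result.context)
```

In the real system the stages are LangGraph nodes and the enrichment step talks to MCP servers rather than returning canned strings.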
- Framework: LangGraph for workflow orchestration
- LLMs: OpenAI GPT-4, Google Gemini
- Communication: Model Context Protocol (MCP)
- API: FastAPI with async support
- Storage: PostgreSQL + Redis
- Monitoring: Grafana integration via MCP
- Core AI (Yellow) - The "brain" of the system
- Integration (Green) - How data flows in/out
- Data (Purple) - Where information is stored/processed
- Infrastructure (Red) - Where everything runs
- Monitoring (Pink) - How system health is tracked
- Python 3.11+
- Docker & Docker Compose
- OpenAI API key
- Google Gemini API key
- Grafana instance (optional)
- Clone and set up:

  ```bash
  git clone <repository>
  cd noc
  cp .env.example .env
  ```

- Configure environment variables:

  ```bash
  # Edit .env with your API keys
  OPENAI_API_KEY=your_openai_key
  GEMINI_API_KEY=your_gemini_key
  GRAFANA_URL=your_grafana_url
  GRAFANA_API_KEY=your_grafana_key
  ```

- Install dependencies:

  ```bash
  pip install -e .
  ```

Run the full stack with Docker:

```bash
# Start all services
docker-compose up -d

# Check health
curl http://localhost:8000/health

# View logs
docker-compose logs -f noc-agent
```

Or run locally for development:

```bash
# Start dependencies
docker-compose up -d redis postgres

# Install dependencies
pip install -r requirements.txt

# Start the API server
noc-agent serve

# Or run directly
python -m noc_agent.api.main
```
Common CLI commands:

```bash
# Start the API server
noc-agent serve --host 0.0.0.0 --port 8000

# Process a single alert
noc-agent process-alert "High CPU Usage" "CPU usage exceeded threshold" --severity high

# Test individual agents
noc-agent test-agents

# Show configuration
noc-agent config
```

Submit an alert via the API:

```bash
curl -X POST http://localhost:8000/alerts \
  -H "Content-Type: application/json" \
  -d '{
    "title": "High Memory Usage",
    "description": "Memory usage exceeded 85%",
    "severity": "high",
    "source_system": "monitoring",
    "source_component": "web-server-01",
    "labels": {"service": "web", "team": "platform"}
  }'
```

Check alert status:

```bash
curl http://localhost:8000/alerts/{alert_id}/status
```

Get alert details:

```bash
curl http://localhost:8000/alerts/{alert_id}
```

Run the complete demo with sample alerts:

```bash
python examples/sample_alerts.py
```

This will:
- Send 5 different types of alerts
- Monitor their processing in real-time
- Display a summary of results
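The same alert could be submitted from Python. The `build_alert` helper below and its severity set are hypothetical; only the field names come from the curl example above:

```python
import json

# Assumed severity levels; the README only shows "high" and "critical".
VALID_SEVERITIES = {"low", "medium", "high", "critical"}

def build_alert(title, description, severity, source_system,
                source_component, labels=None):
    """Hypothetical helper: validate and assemble the POST /alerts body."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    return {
        "title": title,
        "description": description,
        "severity": severity,
        "source_system": source_system,
        "source_component": source_component,
        "labels": labels or {},
    }

payload = build_alert("High Memory Usage", "Memory usage exceeded 85%",
                      "high", "monitoring", "web-server-01",
                      {"service": "web", "team": "platform"})
body = json.dumps(payload).encode()

# To actually submit (requires a running instance):
#   import urllib.request
#   req = urllib.request.Request("http://localhost:8000/alerts", data=body,
#                                headers={"Content-Type": "application/json"})
#   urllib.request.urlopen(req)
```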
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | Required |
| `GEMINI_API_KEY` | Google Gemini API key | Required |
| `GRAFANA_URL` | Grafana instance URL | Required for metrics |
| `GRAFANA_API_KEY` | Grafana API key | Required for metrics |
| `DATABASE_URL` | PostgreSQL connection string | Local default |
| `REDIS_URL` | Redis connection string | Local default |
| `LOG_LEVEL` | Logging level | `INFO` |
| `DEBUG` | Enable debug mode | `false` |
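A minimal sketch of how these variables might be read at startup, using only the standard library. The `Settings` class and the local-default URLs are illustrative, not the project's actual config module:

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    openai_api_key: str
    gemini_api_key: str
    database_url: str
    redis_url: str
    log_level: str
    debug: bool

def load_settings() -> Settings:
    # Required keys fail fast; the rest fall back to local defaults.
    def require(name: str) -> str:
        value = os.getenv(name)
        if not value:
            raise RuntimeError(f"{name} must be set")
        return value

    return Settings(
        openai_api_key=require("OPENAI_API_KEY"),
        gemini_api_key=require("GEMINI_API_KEY"),
        # Assumed local defaults, for illustration only.
        database_url=os.getenv("DATABASE_URL", "postgresql://localhost/noc"),
        redis_url=os.getenv("REDIS_URL", "redis://localhost:6379"),
        log_level=os.getenv("LOG_LEVEL", "INFO"),
        debug=os.getenv("DEBUG", "false").lower() == "true",
    )

# Example (requires the two API keys to be set):
#   settings = load_settings()
```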
Configure which LLM models each agent uses:

```bash
MANAGER_AGENT_MODEL=gpt-4
METRICS_AGENT_MODEL=gemini-1.5-pro
RCA_AGENT_MODEL=gpt-4
TRIAGE_AGENT_MODEL=gpt-3.5-turbo
```

The system uses MCP servers for external data sources:
```json
{
  "mcpServers": {
    "staging-grafana": {
      "type": "stdio",
      "command": "docker",
      "args": ["run", "-i", "--rm", "-e", "GRAFANA_URL", "-e", "GRAFANA_API_KEY", "mcp/grafana:latest", "-t", "stdio"],
      "env": {
        "GRAFANA_URL": "https://grafana.internal.example.com",
        "GRAFANA_API_KEY": "your_api_key"
      }
    }
  }
}
```
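As a sketch of what this configuration drives, a client could turn an `mcpServers` entry into a stdio child process. The `build_launch` helper is illustrative and not the project's actual MCP client:

```python
import json

# Same shape as the mcpServers config above.
MCP_CONFIG = json.loads("""
{
  "mcpServers": {
    "staging-grafana": {
      "type": "stdio",
      "command": "docker",
      "args": ["run", "-i", "--rm", "-e", "GRAFANA_URL", "-e", "GRAFANA_API_KEY",
               "mcp/grafana:latest", "-t", "stdio"],
      "env": {"GRAFANA_URL": "https://grafana.internal.example.com",
              "GRAFANA_API_KEY": "your_api_key"}
    }
  }
}
""")

def build_launch(config: dict, name: str):
    """Turn an mcpServers entry into argv + env for a stdio child (illustrative)."""
    server = config["mcpServers"][name]
    if server["type"] != "stdio":
        raise ValueError("only stdio servers are handled in this sketch")
    return [server["command"], *server["args"]], dict(server.get("env", {}))

argv, env = build_launch(MCP_CONFIG, "staging-grafana")
# A real client would now run e.g.
#   subprocess.Popen(argv, stdin=PIPE, stdout=PIPE, env={**os.environ, **env})
# and speak MCP (JSON-RPC) over the child's stdin/stdout.
```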
```bash
# API health
curl http://localhost:8000/health

# System metrics
curl http://localhost:8000/metrics
```

Structured JSON logging is used throughout:
```bash
# View real-time logs
docker-compose logs -f noc-agent

# Filter by agent
docker-compose logs noc-agent | grep "manager"
```

Access Grafana at http://localhost:3000 (if using the monitoring profile).
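A structured JSON log line like the ones grepped above can be produced with only the standard library. The formatter and its field names are an illustrative sketch, not the project's actual logging setup:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Illustrative structured-log formatter; field names are assumptions."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "agent": record.name,      # logger name doubles as the agent name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("manager")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("routing alert to RCA agent")
```

One-key-per-field output keeps `grep "manager"`-style filtering and machine parsing both cheap.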
```bash
# Run unit tests
pytest tests/

# Run integration tests
pytest tests/integration/

# Test individual agents
noc-agent test-agents

# Test with specific alert
python -c "
from noc_agent.agents.manager import ManagerAgent
import asyncio

async def test():
    agent = ManagerAgent()
    # Test implementation here

asyncio.run(test())
"
```

- API keys are never logged or exposed
- All external communications use HTTPS
- Input validation on all endpoints
- Rate limiting implemented
- Secrets management via environment variables
Typical processing times per alert:
- Simple alerts: 5-15 seconds
- Complex alerts: 15-45 seconds
- Critical alerts: < 30 seconds (prioritized)
- Horizontal scaling via multiple API instances
- Redis for shared state and queuing
- PostgreSQL for persistent storage
- Async processing throughout
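The async, queue-based processing described above can be sketched with stdlib `asyncio`; in the real system Redis backs the queue and `process_alert` runs the multi-agent pipeline, so all names here are illustrative:

```python
import asyncio

async def process_alert(alert: str) -> str:
    # Placeholder for the agent pipeline; the real work is I/O-bound
    # (LLM calls, MCP queries), which is why async helps throughput.
    await asyncio.sleep(0.01)
    return f"processed:{alert}"

async def worker(queue: asyncio.Queue, results: list):
    while True:
        alert = await queue.get()
        results.append(await process_alert(alert))
        queue.task_done()

async def main(alerts):
    queue, results = asyncio.Queue(), []
    # A small worker pool drains the shared queue concurrently.
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(4)]
    for a in alerts:
        queue.put_nowait(a)
    await queue.join()          # block until every alert is processed
    for w in workers:
        w.cancel()
    return results

print(asyncio.run(main([f"alert-{i}" for i in range(10)])))
```

Scaling out horizontally then amounts to running more API instances against the same shared queue.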
