A production-ready multi-agent framework for autonomous DevOps incident management, built with CrewAI. It integrates with Kubernetes through kubectl and with AWS through boto3, and includes error handling with retries, metrics tracking, and self-healing capabilities.
cd AgenticAI
pip install -r requirements.txt
cp .env.example .env
# Add your OPENAI_API_KEY to .env
# Optional: Configure for real integrations
kubectl version # Verify kubectl is installed
aws configure # Configure AWS credentials
# Run examples
python devops_examples/real_world_examples.py
- ✅ Kubernetes: Actual kubectl commands
- ✅ AWS: Real boto3 SDK integration
- ✅ Error Handling: Retry logic with exponential backoff
- ✅ Metrics: MTTR, MTTD, SLA tracking
- ✅ Monitoring: Health checks and performance tracking
- ✅ Comprehensive error handling
- ✅ Structured logging
- ✅ Metrics collection and export
- ✅ SLA breach detection
- ✅ Graceful degradation
- ✅ Input validation
- ✅ Timeout handling
Detection & Triage (2 agents)
- Incident Detector - Monitors all systems
- Triage Specialist - Prioritizes incidents
Platform Specialists (4 agents)
- Kubernetes Expert - Troubleshoots K8s
- AWS Fargate Expert - Handles ECS/Fargate
- AWS Lambda Expert - Debugs serverless
- Container Security Expert - Scans images
Self-Healing (3 agents)
- Self-Healing Agent - Auto-fixes issues
- Emergency Responder - Fast incident response
- Preventive Maintenance - Proactive fixes
Support & Analysis (6 agents)
- Observability Expert - Analyzes Dynatrace
- IaC Expert - Manages Terraform/Pulumi
- Incident Commander - Coordinates response
- Communication Specialist - Updates stakeholders
- Postmortem Analyst - Creates reports
- Diagnostic Agent - Recommends fixes
Real Kubernetes Tools (devops_tools/real_k8s_tools.py)
- check_k8s_pod_status - Real kubectl get pods
- get_k8s_logs - Real kubectl logs
- get_k8s_events - Real kubectl get events
- scale_k8s_deployment - Real kubectl scale
- restart_pod - Real kubectl delete pod
- restart_deployment - Real kubectl rollout restart
- adjust_resource_limits - Real kubectl set resources
- update_container_image - Real kubectl set image
- rollback_deployment - Real kubectl rollout undo
- create_horizontal_pod_autoscaler - Real kubectl autoscale
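The tools above each wrap a single kubectl invocation. As a rough illustration of that pattern (not the code in real_k8s_tools.py), the sketch below shells out to kubectl with a timeout and returns a result dict instead of raising. The helper names `build_kubectl_command` and `run_kubectl`, the result keys, and the 30-second default are assumptions made for this example.

```python
import shutil
import subprocess


def build_kubectl_command(args, namespace=None):
    """Return the argv list for a kubectl call, optionally scoped to a namespace."""
    command = ["kubectl", *args]
    if namespace:
        command += ["-n", namespace]
    return command


def run_kubectl(args, namespace=None, timeout=30):
    """Run kubectl and report the outcome as a dict rather than raising."""
    if shutil.which("kubectl") is None:
        # Degrade gracefully when kubectl is not installed.
        return {"ok": False, "output": "", "error": "kubectl not found on PATH"}
    command = build_kubectl_command(args, namespace)
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout, check=False
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "output": "", "error": f"timed out after {timeout}s"}
    return {
        "ok": completed.returncode == 0,
        "output": completed.stdout,
        "error": completed.stderr,
    }


def restart_deployment(name, namespace):
    """Illustrative stand-in for restart_deployment: kubectl rollout restart."""
    return run_kubectl(["rollout", "restart", f"deployment/{name}"], namespace)
```

Passing the arguments as a list rather than a shell string avoids shell interpolation of resource names supplied by an agent.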
Real AWS Tools (devops_tools/real_aws_tools.py)
- check_lambda_status - Real boto3 Lambda.get_function
- analyze_lambda_errors - Real CloudWatch Logs Insights
- invoke_lambda - Real boto3 Lambda.invoke
- check_fargate_task_status - Real boto3 ECS.describe_tasks
- scan_ecr_image - Real boto3 ECR.describe_image_scan_findings
- list_ecr_images - Real boto3 ECR.describe_images
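For orientation, a boto3-backed status check along the lines of check_lambda_status might look like the sketch below. It is a minimal example under stated assumptions, not the implementation in real_aws_tools.py: the returned dict shape, the default region, and the lazy `get_lambda_client` helper are all hypothetical. The only AWS call used is the Lambda client's `get_function`, whose response carries a `Configuration` mapping.

```python
def get_lambda_client(region_name="us-east-1"):
    """Create a Lambda client; boto3 is imported lazily so this module loads without it."""
    import boto3

    return boto3.client("lambda", region_name=region_name)


def check_lambda_status(function_name, client=None):
    """Summarize a function's configuration, returning an error dict on failure."""
    try:
        if client is None:
            client = get_lambda_client()
        response = client.get_function(FunctionName=function_name)
    except Exception as exc:  # broad on purpose in this sketch; narrow it in real code
        return {"ok": False, "function": function_name, "error": str(exc)}
    config = response.get("Configuration", {})
    return {
        "ok": True,
        "function": function_name,
        "state": config.get("State"),
        "last_update_status": config.get("LastUpdateStatus"),
        "runtime": config.get("Runtime"),
        "memory_mb": config.get("MemorySize"),
        "timeout_s": config.get("Timeout"),
    }
```

Accepting the client as a parameter keeps the function testable with a stub and lets callers reuse one client across tools.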
Placeholder Tools (devops_tools/devops_tools.py)
- Dynatrace integration (placeholder)
- Terraform/Pulumi tools (placeholder)
- Notification tools (placeholder)
Utilities (utils/)
- Error handling with retry logic
- Metrics collection (MTTR, MTTD, SLA)
- Health monitoring
- Structured logging
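Retry logic with exponential backoff, as listed under the utilities, can be sketched as a decorator. This is an illustrative version only; the decorator name `retry_with_backoff`, its parameters, and its defaults are assumptions and may not match utils/error_handling.py.

```python
import functools
import time


def retry_with_backoff(max_attempts=3, base_delay=1.0, factor=2.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call, multiplying the wait by `factor` after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(delay)
                    delay *= factor
        return wrapper
    return decorator


@retry_with_backoff(max_attempts=4, base_delay=0.5, exceptions=(ConnectionError,))
def fetch_pod_status():
    """Placeholder for a call that may fail transiently."""
    return "Running"
```

The `sleep` parameter exists so tests can substitute a recorder for `time.sleep` and run without waiting.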
- ✅ Kubernetes (Real)
- ✅ AWS Lambda (Real)
- ✅ AWS Fargate/ECS (Real)
- ✅ AWS ECR (Real)
- ⏳ Dynatrace (Placeholder)
- ⏳ Terraform (Placeholder)
- ⏳ Pulumi (Placeholder)
- Quick Start - Get started in 5 minutes
- DevOps Guide - Complete DevOps documentation
- Self-Healing Guide - Auto-fix capabilities
- Visual Guide - Architecture diagrams
- Detailed Guide - AgenticAI concepts
- Examples - Usage examples
- FAQ - Common questions
AgenticAI/
├── agents/ # Base agent framework
├── devops_agents/ # 15 specialized DevOps agents
│ ├── incident_agents.py
│ └── self_healing_agent.py
├── tools/ # Base tools
├── devops_tools/ # DevOps-specific tools
│ ├── devops_tools.py # Placeholder tools
│ ├── real_k8s_tools.py # Real kubectl integration ✅
│ └── real_aws_tools.py # Real boto3 integration ✅
├── utils/ # Utilities
│ ├── error_handling.py # Retry logic, validation ✅
│ ├── metrics.py # MTTR, MTTD, SLA tracking ✅
│ └── logger.py
├── memory/ # Memory systems
├── prompts/ # Prompt templates
├── config/ # Configuration
├── examples/ # General examples
├── devops_examples/ # DevOps workflows
│ ├── incident_response_examples.py
│ └── real_world_examples.py # Real tool examples ✅
├── docs/ # Complete documentation
├── tests/ # Tests
├── data/ # Data storage
└── logs/ # Application logs
from devops_agents.self_healing_agent import SelfHealingAgentFactory
from agents.crew_task import CrewTask
from agents.crew_manager import CrewManager
from utils.metrics import track_incident, resolve_incident
# Track incident
track_incident("INC-001", "P1")
# Create self-healing agent
factory = SelfHealingAgentFactory()
healer = factory.create_self_healing_k8s_agent()
# Define problem (uses real kubectl commands)
task = CrewTask(
    description="Pod 'api-pod-12345' in 'production' is crashing. Fix it.",
    agent=healer.get_agent(),
    expected_output="Issue fixed and pod running"
)
# Execute - agent will:
# 1. Run: kubectl get pods -n production api-pod-12345
# 2. Run: kubectl logs -n production api-pod-12345
# 3. Run: kubectl get events -n production
# 4. Diagnose: OOMKilled
# 5. Run: kubectl set resources deployment/api-deployment -n production --limits=memory=1Gi
# 6. Verify: kubectl get pods -n production
crew = CrewManager(
    agents=[healer.get_agent()],
    tasks=[task.get_task()]
)
result = crew.kickoff()
resolve_incident("INC-001")
print(result) # "Fixed OOMKilled by increasing memory to 1Gi"
from utils.metrics import get_metrics_summary, export_metrics
# Get metrics
summary = get_metrics_summary()
print(f"MTTR: {summary['mttr_seconds']}s")
print(f"MTTD: {summary['mttd_seconds']}s")
print(f"Success Rate: {summary['agent_success_rate']}%")
# Export to file
export_metrics("logs/metrics_report.json")
- Kubernetes Pod Crashes - Auto-detect and fix with real kubectl
- Lambda High Error Rate - Analyze with real CloudWatch Logs
- Production Outages - Full incident response workflow
- Infrastructure Drift - Detect and remediate
- Container Security - Scan ECR images with real boto3
- Performance Issues - Analyze metrics
- Post-Incident Analysis - Generate reports with metrics
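The MTTR and MTTD values reported in the metrics summary are averages over incident timestamps. The sketch below shows one way to compute them, along with a simple SLA breach check. It assumes MTTD is measured from incident start to detection and MTTR from detection to resolution; the `Incident` dataclass, the `summarize` function, and the per-priority targets are hypothetical and are not taken from utils/metrics.py.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

# Hypothetical resolution targets per priority, in seconds.
SLA_TARGETS = {"P1": 3600, "P2": 4 * 3600}


@dataclass
class Incident:
    incident_id: str
    priority: str
    started_at: float             # when the fault began (epoch seconds)
    detected_at: float            # when monitoring flagged it
    resolved_at: Optional[float] = None


def summarize(incidents):
    """Compute mean time to detect/resolve and list incidents that missed their SLA."""
    resolved = [i for i in incidents if i.resolved_at is not None]
    detect_times = [i.detected_at - i.started_at for i in incidents]
    resolve_times = [i.resolved_at - i.detected_at for i in resolved]
    breaches = [
        i.incident_id
        for i in resolved
        if i.resolved_at - i.detected_at > SLA_TARGETS.get(i.priority, float("inf"))
    ]
    return {
        "mttd_seconds": mean(detect_times) if detect_times else 0.0,
        "mttr_seconds": mean(resolve_times) if resolve_times else 0.0,
        "sla_breaches": breaches,
    }
```

Unresolved incidents count toward MTTD but not MTTR in this sketch, so an open incident does not skew the resolution average.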
# Required
OPENAI_API_KEY=sk-your-key-here
# Optional - for real AWS integrations
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
# Optional - for Dynatrace
DYNATRACE_TENANT=your_tenant_id
DYNATRACE_API_TOKEN=your_api_token
# Default dummy values in real_k8s_tools.py
DUMMY_NAMESPACE = "production"
DUMMY_POD = "api-pod-12345"
DUMMY_DEPLOYMENT = "api-deployment"
# Default dummy values in real_aws_tools.py
DUMMY_FUNCTION = "api-function"
DUMMY_CLUSTER = "prod-cluster"
DUMMY_REPOSITORY = "my-app"
# Replace with your actual resource names
# Navigate to directory
cd AgenticAI
# Install dependencies
pip install -r requirements.txt
# Install kubectl (for K8s tools)
# macOS: brew install kubectl
# Linux: snap install kubectl --classic
# Windows: choco install kubernetes-cli
# Configure AWS (for AWS tools)
aws configure
# Configure environment
cp .env.example .env
# Edit .env and add OPENAI_API_KEY
# Run examples
python devops_examples/real_world_examples.py
- ✅ Real Integrations - Actual kubectl and boto3 commands
- ✅ Error Handling - Retry logic with exponential backoff
- ✅ Metrics Tracking - MTTR, MTTD, SLA compliance
- ✅ Self-Healing - Automatic issue remediation
- ✅ Multi-Platform - K8s, AWS, Dynatrace, IaC
- ✅ Production-Ready - Logging, validation, timeouts
- ✅ Graceful Degradation - Works without kubectl/AWS
- ✅ Extensible - Easy to add agents and tools
Real (Production-Ready)
- ✅ Kubernetes tools - Uses actual kubectl
- ✅ AWS Lambda tools - Uses actual boto3
- ✅ AWS Fargate tools - Uses actual boto3
- ✅ AWS ECR tools - Uses actual boto3
Placeholder (Need Implementation)
- ⏳ Dynatrace tools - API integration needed
- ⏳ Terraform tools - CLI execution needed
- ⏳ Pulumi tools - CLI execution needed
- ⏳ Notification tools - Webhook integration needed
- Uses dummy data by default
- Validates inputs before execution
- Logs all actions for audit trail
- Handles errors gracefully
- Supports dry-run mode
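To make the safety behaviors above concrete, here is a minimal sketch that combines input validation, audit logging, and a dry-run switch around a scale operation. It is an assumption-laden illustration, not the project's code: the function names, the logger name, and the choice to validate names against the DNS label format (lowercase alphanumerics and hyphens, at most 63 characters) are made up for this example.

```python
import logging
import re
import subprocess

logger = logging.getLogger("devops_tools.audit")  # hypothetical logger name

# DNS label format: lowercase alphanumerics and '-', alphanumeric at both ends.
_DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def validate_name(value, kind):
    """Reject names that could not be valid (or safe) Kubernetes identifiers."""
    if not isinstance(value, str) or len(value) > 63 or not _DNS_LABEL.match(value):
        raise ValueError(f"invalid {kind}: {value!r}")
    return value


def scale_deployment(deployment, namespace, replicas, dry_run=True):
    """Validate inputs, log the action, and only execute when dry_run is False."""
    validate_name(deployment, "deployment")
    validate_name(namespace, "namespace")
    if isinstance(replicas, bool) or not isinstance(replicas, int) or replicas < 0:
        raise ValueError(f"invalid replica count: {replicas!r}")
    command = ["kubectl", "scale", f"deployment/{deployment}",
               f"--replicas={replicas}", "-n", namespace]
    logger.info("%s: %s", "DRY-RUN" if dry_run else "EXECUTE", " ".join(command))
    if dry_run:
        return {"executed": False, "command": command}
    completed = subprocess.run(command, capture_output=True, text=True, timeout=30)
    return {"executed": True, "command": command, "returncode": completed.returncode}
```

Defaulting `dry_run` to True means a caller has to opt in explicitly before anything touches the cluster.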
Contributions welcome! Priority areas:
- Implement Dynatrace API integration
- Add Terraform/Pulumi CLI execution
- Create notification integrations (Slack, PagerDuty)
- Add more test coverage
- Improve error handling
MIT License
Check documentation in docs/ or review examples in devops_examples/.
Built with CrewAI - Production-ready multi-agent DevOps automation